* [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs
@ 2017-08-24 15:54 Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 1/5] net/mlx4: add simple Tx " Moti Haimovsky
                   ` (6 more replies)
  0 siblings, 7 replies; 61+ messages in thread
From: Moti Haimovsky @ 2017-08-24 15:54 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

This series of patches implements a Tx data path for the mlx4 PMD that
accesses the device queues directly when transmitting packets, bypassing
the ibverbs Tx data path altogether.
Using this scheme allows the PMD to work with the upstream rdma-core
package instead of the Mellanox OFED one without sacrificing Tx
functionality.
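
For reference, the direct queue access relies on the mlx4 direct verbs
interface shipped with rdma-core (<infiniband/mlx4dv.h>). A minimal
sketch of the idea, using the same mlx4dv calls as the Tx queue setup
later in this series (txq_get_hw_info is only an illustrative name and
error handling is omitted):

static int
txq_get_hw_info(struct ibv_qp *qp, struct ibv_cq *cq,
		struct mlx4dv_qp *dv_qp, struct mlx4dv_cq *dv_cq)
{
	struct mlx4dv_obj mlxdv = {
		.qp = { .in = qp, .out = dv_qp },
		.cq = { .in = cq, .out = dv_cq },
	};

	/*
	 * Expose the raw SQ/CQ buffers and doorbell pointers behind the
	 * verbs QP/CQ objects; the PMD can then post WQEs and update the
	 * CQ consumer index on them directly. Returns 0 on success.
	 */
	return mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
}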

These patches should be applied in the order listed below as each depends on
its predecessor to work.

This implementation allows rapid deployment of new features without the need to
update the underlying OFED.

This work depends on
        http://dpdk.org/ml/archives/dev/2017-August/072281.html
        [dpdk-dev] [PATCH v1 00/48] net/mlx4: trim and refactor entire PMD
by Adrien Mazarguil

It has been built and tested using rdma-core-15-1 from
 https://github.com/linux-rdma/rdma-core
and kernel-ml-4.12.0-1.el7.elrepo.x86_64

Moti Haimovsky (5):
  net/mlx4: add simple Tx bypassing ibverbs
  net/mlx4: support multi-segments Tx
  net/mlx4: refine setting Tx completion flag
  net/mlx4: add Tx checksum offloads
  net/mlx4: add loopback Tx from VF

 drivers/net/mlx4/mlx4.c        |   7 +
 drivers/net/mlx4/mlx4.h        |   2 +
 drivers/net/mlx4/mlx4_ethdev.c |   6 +
 drivers/net/mlx4/mlx4_prm.h    | 249 ++++++++++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c   | 456 +++++++++++++++++++++++++++++++++--------
 drivers/net/mlx4/mlx4_rxtx.h   |  39 +++-
 drivers/net/mlx4/mlx4_txq.c    |  66 +++++-
 mk/rte.app.mk                  |   2 +-
 8 files changed, 734 insertions(+), 93 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 1/5] net/mlx4: add simple Tx bypassing ibverbs
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
@ 2017-08-24 15:54 ` Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 2/5] net/mlx4: support multi-segments Tx Moti Haimovsky
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Moti Haimovsky @ 2017-08-24 15:54 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

The PMD now sends single-buffer packets directly to the device,
bypassing the ibv Tx post and poll routines.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 253 +++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 260 +++++++++++++++++++++++++++++++++++--------
 drivers/net/mlx4/mlx4_rxtx.h |  30 ++++-
 drivers/net/mlx4/mlx4_txq.c  |  52 ++++++++-
 mk/rte.app.mk                |   2 +-
 5 files changed, 546 insertions(+), 51 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..c5ce33b
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,253 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 Mellanox.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef RTE_PMD_MLX4_MLX4_PRM_H_
+#define RTE_PMD_MLX4_MLX4_PRM_H_
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx4dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* Basic TxQ building block */
+#define TXBB_SHIFT 6
+#define TXBB_SIZE (1 << TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes... */
+#define MAX_WQE_SIZE		512
+#define MAX_WQE_TXBBS		(MAX_WQE_SIZE / TXBB_SIZE)
+
+/* Send Queue Stamping/Invalidating info */
+#define SQ_STAMP_STRIDE		64
+#define SQ_STAMP_DWORDS		(SQ_STAMP_STRIDE / 4)
+#define SQ_STAMP_SHIFT		31
+#define SQ_STAMP_VAL		0x7fffffff
+
+/* WQE flags */
+#define MLX4_OPCODE_SEND	0x0a
+#define MLX4_EN_BIT_WQE_OWN	0x80000000
+
+#define SIZE_TO_TXBBS(size)     (RTE_ALIGN((size), (TXBB_SIZE)) / (TXBB_SIZE))
+
+/**
+ * Update the HW with the new CQ consumer value.
+ *
+ * @param cq
+ *   Pointer to the cq structure.
+ */
+static inline void
+mlx4_cq_set_ci(struct mlx4_cq *cq)
+{
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+}
+
+/**
+ * Returns a pointer to the cqe in position n.
+ *
+ * @param cq
+ *   Pointer to the cq structure.
+ * @param n
+ *   Index of the entry whose address we seek.
+ *
+ * @return
+ *   Pointer to the CQE.
+ */
+static inline struct mlx4_cqe
+*mlx4_get_cqe(struct mlx4_cq *cq, int n)
+{
+	return (struct mlx4_cqe *)(cq->buf + n * cq->cqe_size);
+}
+
+/**
+ * Returns a pointer to the cqe in position n if it is owned by SW.
+ *
+ * @param cq
+ *   Pointer to the cq structure.
+ * @param n
+ *   Index of the entry whose address we seek.
+ *
+ * @return
+ *   Pointer to the CQE if owned by SW, NULL otherwise.
+ */
+static inline void
+*mlx4_get_sw_cqe(struct mlx4_cq *cq, int n)
+{
+	struct mlx4_cqe *cqe = mlx4_get_cqe(cq, n & (cq->cqe_cnt - 1));
+	struct mlx4_cqe *tcqe = cq->cqe_size == 64 ? cqe + 1 : cqe;
+
+	return (!!(tcqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		!!(n & cq->cqe_cnt)) ? NULL : cqe;
+}
+
+/**
+ * Returns a pointer to the WQE at position n.
+ *
+ * @param sq
+ *   Pointer to the sq.
+ * @param n
+ *   The entry number in the queue.
+ *
+ * @return
+ *   A pointer to the required entry.
+ */
+static inline void
+*mlx4_get_send_wqe(struct mlx4_sq *sq, unsigned int n)
+{
+	return sq->buf + n * TXBB_SIZE;
+}
+
+/**
+ * Returns the size in bytes of this WQE.
+ *
+ * @param wqe
+ *   Pointer to the WQE we want to interrogate.
+ *
+ * @return
+ *   WQE size in bytes.
+ */
+static inline int
+mlx4_wqe_get_real_size(void *wqe)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl = (struct mlx4_wqe_ctrl_seg *)wqe;
+	return ((ctrl->fence_size & 0x3f) << 4);
+}
+
+/**
+ * Fills the ctrl segment of a WQE with info needed for transmitting the packet.
+ *
+ * @param seg
+ *   Pointer to the control structure in the WQE.
+ * @param owner
+ *   The value for the owner field.
+ * @param fence_size
+ *   Fence bit and WQE size in octowords.
+ * @param srcrb_flags
+ *   High 24 bits are SRC remote buffer; low 8 bits are flags.
+ * @param imm
+ *   Immediate data/Invalidation key.
+ */
+static inline void
+mlx4_set_ctrl_seg(struct mlx4_wqe_ctrl_seg *seg, uint32_t owner,
+	     uint8_t fence_size, uint32_t srcrb_flags, uint32_t imm)
+{
+	seg->fence_size = fence_size;
+	seg->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
+	/*
+	 * The caller should prepare "imm" in advance based on WR opcode.
+	 * For IBV_WR_SEND_WITH_IMM and IBV_WR_RDMA_WRITE_WITH_IMM,
+	 * the "imm" should be assigned as is.
+	 * For the IBV_WR_SEND_WITH_INV, it should be htobe32(imm).
+	 */
+	seg->imm = imm;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
+	 */
+	rte_wmb();
+	seg->owner_opcode = rte_cpu_to_be_32(owner);
+}
+
+/**
+ * Fills a data segment of a WQE with info needed for transmitting
+ * a data fragment.
+ *
+ * @param dseg
+ *   Pointer to a data segment structure in the WQE.
+ *   (WQE may contain several data segments).
+ * @param sg
+ *   Fragment info (addr, length, lkey).
+ */
+static inline void
+mlx4_set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ibv_sge *sg)
+{
+	dseg->lkey       = rte_cpu_to_be_32(sg->lkey);
+	dseg->addr       = rte_cpu_to_be_64(sg->addr);
+
+	/*
+	 * Need a barrier here before writing the byte_count field to
+	 * make sure that all the data is visible before the
+	 * byte_count field is set.  Otherwise, if the segment begins
+	 * a new cacheline, the HCA prefetcher could grab the 64-byte
+ *   chunk and get a valid (!= 0xffffffff) byte count but
+	 * stale data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+
+	if (likely(sg->length))
+		dseg->byte_count = rte_cpu_to_be_32(sg->length);
+	else
+		/* Zero len seg is treated as inline segment with zero data */
+		dseg->byte_count = rte_cpu_to_be_32(0x80000000);
+}
+
+/**
+ * Checks whether a WQE of the given size (in TXBBs) can be inserted into
+ * the SQ without overflowing it, and that the WQE is not over-sized.
+ *
+ * @param sq
+ *   Pointer to the sq we want to put the wqe in.
+ * @param ntxbb
+ *   Number of TXBBs the WQE occupies.
+ */
+static inline int
+mlx4_wq_overflow(struct mlx4_sq *sq, int ntxbb)
+{
+	unsigned int cur;
+
+	cur = sq->head - sq->tail;
+	return ((cur + ntxbb + sq->headroom_txbbs >= sq->txbb_cnt) ||
+		(ntxbb > MAX_WQE_TXBBS));
+}
+
+/**
+ * Calculate the WQE size (in bytes) needed for posting this packet.
+ *
+ * @param count
+ *   The number of data-segments the WQE contains.
+ *
+ * @return
+ *   WQE size in bytes.
+ */
+static inline int
+mlx4_wqe_calc_real_size(unsigned int count)
+{
+	return sizeof(struct mlx4_wqe_ctrl_seg) +
+		(count * sizeof(struct mlx4_wqe_data_seg));
+}
+
+#endif /* RTE_PMD_MLX4_MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b5e7777..0720e34 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -55,10 +55,125 @@
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
+#include <rte_io.h>
 
 #include "mlx4.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
+#include "mlx4_prm.h"
+
+
+typedef int bool;
+#define TRUE	1
+#define FALSE	0
+
+/**
+ * Stamp a freed WQE so it won't be reused by the HW.
+ *
+ * @param sq
+ *   Pointer to the sq structure.
+ * @param index
+ *   Index of the freed WQE.
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of TXBBs the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp =
+		rte_cpu_to_be_32(SQ_STAMP_VAL | (!!owner << SQ_STAMP_SHIFT));
+	void *end = sq->buf + sq->size;
+	void *wqe = mlx4_get_send_wqe(sq, index & sq->txbb_cnt_mask);
+	uint32_t *ptr = wqe;
+	int num_txbbs = SIZE_TO_TXBBS(mlx4_wqe_get_real_size(wqe));
+	int i;
+
+	/* Optimize the common case when there are no wraparounds */
+	if (likely((char *)wqe + num_txbbs * TXBB_SIZE <= (char *)end)) {
+		/* Stamp the freed descriptor */
+		for (i = 0; i < num_txbbs * TXBB_SIZE; i += SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor */
+		for (i = 0; i < num_txbbs * TXBB_SIZE; i += SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += SQ_STAMP_DWORDS;
+			if ((void *)ptr >= end) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= rte_cpu_to_be_32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
+ * Poll a CQ for work completions.
+ *
+ * @param txq
+ *   The Tx queue whose CQ we wish to poll.
+ * @param max_cqes
+ *   Max num of CQEs to handle in this call.
+ *
+ * @return
+ *   The number of pkts that were handled.
+ */
+static int
+mlx4_tx_poll_cq(struct txq *txq, int max_cqes)
+{
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
+
+	while ((max_cqes-- > 0) &&
+		((cqe = mlx4_get_sw_cqe(cq, cons_index)) != NULL)) {
+		cqe += cq->factor;  /* handle CQES with size > 32 */
+		/*
+		 * make sure we read the CQE after we read the
+		 * ownership bit
+		 */
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      txq, cqe_err->vendor_err, cqe_err->syndrome);
+		}
+		/* Get wqe num reported in the cqe */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* free next descriptor */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		++cons_index;
+	}
+	/*
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	mlx4_cq_set_ci(cq);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	return pkts;
+}
 
 /**
  * Manage Tx completions.
@@ -80,16 +195,15 @@
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
 	int wcs_n;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
+	wcs_n = mlx4_tx_poll_cq(txq, 1);
 	if (unlikely(wcs_n == 0))
 		return 0;
 	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
+		DEBUG("%p: mlx4_poll_cq() failed (wcs_n=%d)",
 		      (void *)txq, wcs_n);
 		return -1;
 	}
@@ -99,7 +213,7 @@
 	 * Assume WC status is successful as nothing can be done about it
 	 * anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += wcs_n;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -183,6 +297,80 @@
 }
 
 /**
+ * Notify mlx4 that work is pending on TXq.
+ *
+ * @param txq
+ *   Pointer to mlx4 Tx queue structure.
+ */
+static inline void
+mlx4_send_flush(struct txq *txq)
+{
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
+}
+
+/**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   The txq to post to.
+ * @param wr
+ *   The work request to handle.
+ * @param bad_wr
+ *   The WR at fault in case posting fails.
+ *
+ * @return
+ *   0 on success, -1 on error.
+ */
+static int
+mlx4_post_send(struct txq *txq,
+	       struct ibv_send_wr *wr,
+	       struct ibv_send_wr **bad_wr)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	uint32_t srcrb_flags;
+	uint8_t fence_size;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t owner_opcode;
+	int wqe_real_size, nr_txbbs;
+
+	/* for now we support pkts with one buf only */
+	if (wr->num_sge != 1)
+		goto err;
+	/* Calc the needed wqe size for this packet */
+	wqe_real_size = mlx4_wqe_calc_real_size(wr->num_sge);
+	if (unlikely(!wqe_real_size))
+		goto err;
+	nr_txbbs = SIZE_TO_TXBBS(wqe_real_size);
+	/* Are we too big to handle ? */
+	if (unlikely(mlx4_wq_overflow(sq, nr_txbbs)))
+		goto err;
+	/* Get ctrl and single-data wqe entries */
+	ctrl = mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)(((char *)ctrl) +
+		sizeof(struct mlx4_wqe_ctrl_seg));
+	mlx4_set_data_seg(dseg, wr->sg_list);
+	/* For raw eth, the SOLICIT flag is used
+	 * to indicate that no icrc should be calculated
+	 */
+	srcrb_flags = MLX4_WQE_CTRL_SOLICIT |
+		      ((wr->send_flags & IBV_SEND_SIGNALED) ?
+				      MLX4_WQE_CTRL_CQ_UPDATE : 0);
+	fence_size = (wr->send_flags & IBV_SEND_FENCE ?
+		MLX4_WQE_CTRL_FENCE : 0) | ((wqe_real_size / 16) & 0x3f);
+	owner_opcode = MLX4_OPCODE_SEND |
+		       ((sq->head & sq->txbb_cnt) ? MLX4_EN_BIT_WQE_OWN : 0);
+	mlx4_set_ctrl_seg(ctrl, owner_opcode, fence_size, srcrb_flags, 0);
+	sq->head += nr_txbbs;
+	rte_wmb();
+	return 0;
+err:
+	*bad_wr = wr;
+	return -1;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,8 +387,6 @@
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
 	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
@@ -275,6 +461,8 @@
 				elt->buf = NULL;
 				goto stop;
 			}
+			if (buf->pkt_len <= txq->max_inline)
+				send_flags |= IBV_SEND_INLINE;
 			/* Update element. */
 			elt->buf = buf;
 			if (txq->priv->vf)
@@ -285,22 +473,31 @@
 			sge->length = length;
 			sge->lkey = lkey;
 			sent_size += length;
+			/* Set up WR. */
+			wr->sg_list  = sge;
+			wr->num_sge  = segs;
+			wr->opcode   = IBV_WR_SEND;
+			wr->send_flags = send_flags;
+			wr->next     = NULL;
+			/* post the pkt for sending */
+			err = mlx4_post_send(txq, wr, &wr_bad);
+			if (unlikely(err)) {
+				if (unlikely(wr_bad->send_flags &
+					     IBV_SEND_SIGNALED)) {
+					elts_comp_cd = 1;
+					--elts_comp;
+				}
+				elt->buf = NULL;
+				goto stop;
+			}
+			sent_size += length;
 		} else {
 			err = -1;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
 		elts_head = elts_head_next;
 		/* Increment sent bytes counter. */
 		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
@@ -309,38 +506,7 @@
 	/* Increment sent packets counter. */
 	txq->stats.opackets += i;
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	mlx4_send_flush(txq);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
 	txq->elts_comp_cd = elts_comp_cd;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fec998a..e442730 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -41,6 +41,7 @@
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
 #include <infiniband/verbs.h>
+#include <infiniband/mlx4dv.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
@@ -90,7 +91,7 @@ struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
-/** Rx queue counters. */
+/** Tx queue counters. */
 struct mlx4_txq_stats {
 	unsigned int idx; /**< Mapping index. */
 	uint64_t opackets; /**< Total of successfully sent packets. */
@@ -98,6 +99,31 @@ struct mlx4_txq_stats {
 	uint64_t odropped; /**< Total of packets not sent when Tx ring full. */
 };
 
+/** TXQ Info */
+struct mlx4_sq {
+	char	*buf;  /**< SQ buffer. */
+	uint32_t size; /**< SQ size in bytes. */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt;       /**< Num of WQEBB in the Q (should be ^2). */
+	uint32_t txbb_shift;     /**< The log2 size of the basic block. */
+	uint32_t txbb_cnt_mask;	 /**< txbbs_cnt mask (txbb_cnt is ^2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db;            /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn;   /**< qp number to write to the doorbell. */
+};
+
+struct mlx4_cq {
+	char	 *buf;       /**< CQ buffer. */
+	uint32_t size;       /**< CQ size in bytes. */
+	uint32_t cqe_cnt;    /**< Num of entries the CQ has. */
+	int	 cqe_size;   /**< size (in bytes) of a CQE. */
+	uint32_t *set_ci_db; /**< Pointer to the consumer-index doorbell. */
+	uint32_t cons_index; /**< last CQE entry that was handled. */
+	uint32_t factor;     /**< CQ data location in a CQE. */
+	int	 cqn;        /**< CQ number */
+};
+
 /** Tx queue descriptor. */
 struct txq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -118,6 +144,8 @@ struct txq {
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 };
 
 /* mlx4_rxq.c */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index e0245b0..1273738 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -62,6 +62,7 @@
 #include "mlx4_autoconf.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
+#include "mlx4_prm.h"
 
 /**
  * Allocate Tx queue elements.
@@ -109,7 +110,8 @@
 	assert(ret == 0);
 	return 0;
 error:
-	rte_free(elts);
+	if (elts != NULL)
+		rte_free(elts);
 	DEBUG("%p: failed, freed everything", (void *)txq);
 	assert(ret > 0);
 	rte_errno = ret;
@@ -241,6 +243,36 @@ struct txq_mp2mr_mbuf_check_data {
 	mlx4_txq_mp2mr(txq, mp);
 }
 
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+
+	sq->buf = ((char *)dqp->buf.buf) + dqp->sq.offset;
+	/* Total len, including headroom and spare WQEs. */
+	sq->size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_shift = TXBB_SHIFT;
+	sq->txbb_cnt = (dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs = (2048 + (1 << dqp->sq.wqe_shift)) >> TXBB_SHIFT;
+
+	/* Save CQ params */
+	cq->buf = dcq->buf.buf;
+	cq->size = dcq->buf.length;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->cqn = dcq->cqn;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_size = dcq->cqe_size;
+	cq->factor = cq->cqe_size > 32 ? 1 : 0;
+}
+
 /**
  * Configure a Tx queue.
  *
@@ -263,7 +295,10 @@ struct txq_mp2mr_mbuf_check_data {
 	       unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct txq tmpl = {
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
+		struct txq tmpl = {
 		.priv = priv,
 		.socket = socket
 	};
@@ -308,6 +343,8 @@ struct txq_mp2mr_mbuf_check_data {
 			/* Max number of scatter/gather elements in a WR. */
 			.max_send_sge = 1,
 			.max_inline_data = MLX4_PMD_MAX_INLINE,
+			.max_recv_wr = 0,
+			.max_recv_sge = 0,
 		},
 		.qp_type = IBV_QPT_RAW_PACKET,
 		/*
@@ -370,6 +407,17 @@ struct txq_mp2mr_mbuf_check_data {
 	DEBUG("%p: txq updated with %p", (void *)txq, (void *)&tmpl);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
+	/* Retrieve device Q info */
+	mlxdv.cq.in = txq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: Failed to retrieve device obj info", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	return 0;
 error:
 	ret = rte_errno;
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index c25fdd9..2f1286e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -128,7 +128,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
 endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LIO_PMD)        += -lrte_pmd_lio
-_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs -lmlx4
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -libverbs
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NFP_PMD)        += -lrte_pmd_nfp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_NULL)       += -lrte_pmd_null
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 2/5] net/mlx4: support multi-segments Tx
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 1/5] net/mlx4: add simple Tx " Moti Haimovsky
@ 2017-08-24 15:54 ` Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 3/5] net/mlx4: refine setting Tx completion flag Moti Haimovsky
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Moti Haimovsky @ 2017-08-24 15:54 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

The PMD now supports transmitting packets that span an arbitrary
number of buffers.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  |  16 +---
 drivers/net/mlx4/mlx4_rxtx.c | 213 +++++++++++++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |   3 +-
 drivers/net/mlx4/mlx4_txq.c  |  12 ++-
 4 files changed, 170 insertions(+), 74 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index c5ce33b..8b0248a 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -61,7 +61,7 @@
 #define MLX4_OPCODE_SEND	0x0a
 #define MLX4_EN_BIT_WQE_OWN	0x80000000
 
-#define SIZE_TO_TXBBS(size)     (RTE_ALIGN((size), (TXBB_SIZE)) / (TXBB_SIZE))
+#define SIZE_TO_TXBBS(size)	(RTE_ALIGN((size), (TXBB_SIZE)) / (TXBB_SIZE))
 
 /**
  * Update the HW with the new CQ consumer value.
@@ -148,6 +148,7 @@
 
 /**
  * Fills the ctrl segment of a WQE with info needed for transmitting the packet.
+ * Owner field is filled later.
  *
  * @param seg
  *   Pointer to the control structure in the WQE.
@@ -161,8 +162,8 @@
  *   Immediate data/Invalidation key.
  */
 static inline void
-mlx4_set_ctrl_seg(struct mlx4_wqe_ctrl_seg *seg, uint32_t owner,
-	     uint8_t fence_size, uint32_t srcrb_flags, uint32_t imm)
+mlx4_set_ctrl_seg(struct mlx4_wqe_ctrl_seg *seg, uint8_t fence_size,
+		  uint32_t srcrb_flags, uint32_t imm)
 {
 	seg->fence_size = fence_size;
 	seg->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
@@ -173,13 +174,6 @@
 	 * For the IBV_WR_SEND_WITH_INV, it should be htobe32(imm).
 	 */
 	seg->imm = imm;
-	/*
-	 * Make sure descriptor is fully written before
-	 * setting ownership bit (because HW can start
-	 * executing as soon as we do).
-	 */
-	rte_wmb();
-	seg->owner_opcode = rte_cpu_to_be_32(owner);
 }
 
 /**
@@ -241,7 +235,7 @@
  *   The number of data-segments the WQE contains.
  *
  * @return
- *   WQE size in bytes.
+ *   The calculated WQE size in bytes.
  */
 static inline int
 mlx4_wqe_calc_real_size(unsigned int count)
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 0720e34..e41ea9e 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -309,6 +309,101 @@
 }
 
 /**
+ * Copy a WQE written in the bounce buffer back to the SQ.
+ * This routine is used when a WQE wraps around the SQ and therefore needs
+ * special attention. Note that the WQE is written backward into the SQ.
+ *
+ * @param txq
+ *   Pointer to mlx4 Tx queue structure.
+ * @param index
+ *   First SQ TXBB index for this WQE.
+ * @param desc_size
+ *   TXBB-aligned size of the WQE.
+ *
+ * @return
+ *   A pointer to the control segment of this WQE in the SQ.
+ */
+static struct mlx4_wqe_ctrl_seg
+*mlx4_bounce_to_desc(struct txq *txq,
+		     uint32_t index,
+		     unsigned int desc_size)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	uint32_t copy = (sq->txbb_cnt - index) * TXBB_SIZE;
+	int i;
+
+	for (i = desc_size - copy - 4; i >= 0; i -= 4) {
+		if ((i & (TXBB_SIZE - 1)) == 0)
+			rte_wmb();
+		*((uint32_t *)(sq->buf + i)) =
+			*((uint32_t *)(txq->bounce_buf + copy + i));
+	}
+	for (i = copy - 4; i >= 4; i -= 4) {
+		if ((i & (TXBB_SIZE - 1)) == 0)
+			rte_wmb();
+		*((uint32_t *)(sq->buf + index * TXBB_SIZE + i)) =
+		*((uint32_t *)(txq->bounce_buf + i));
+	}
+	/* Return real descriptor location */
+	return (struct mlx4_wqe_ctrl_seg *)(sq->buf + index * TXBB_SIZE);
+}
+
+/**
+ * Handle address translation of scattered buffers for mlx4_tx_burst().
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param[in] buf
+ *   Packet (mbuf chain) to process.
+ * @param[out] sges
+ *   Array filled with SGEs on success.
+ * @param segs
+ *   Number of segments in buf.
+ *
+ * @return
+ *   0 on success, -1 in case of failure.
+ */
+static inline int
+mlx4_tx_sg_virt_to_lkey(struct txq *txq, struct rte_mbuf *buf,
+			struct ibv_sge *sges, unsigned int segs)
+{
+	unsigned int j;
+
+	/* Register segments as SGEs. */
+	for (j = 0; (j != segs); ++j) {
+		struct ibv_sge *sge = &sges[j];
+		uint32_t lkey;
+
+		/* Retrieve Memory Region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			goto stop;
+		}
+		/* Update SGE. */
+		sge->addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		if (txq->priv->vf)
+			rte_prefetch0((volatile void *)
+				      (uintptr_t)sge->addr);
+		sge->length = buf->data_len;
+		sge->lkey = lkey;
+		buf = buf->next;
+	}
+	return 0;
+stop:
+	return -1;
+}
+
+
+/**
  * Posts a single work request to a send queue.
  *
  * @param txq
@@ -323,36 +418,53 @@
  */
 static int
 mlx4_post_send(struct txq *txq,
+	       struct rte_mbuf *pkt,
 	       struct ibv_send_wr *wr,
 	       struct ibv_send_wr **bad_wr)
 {
 	struct mlx4_wqe_ctrl_seg *ctrl;
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
+	struct ibv_sge sge[wr->num_sge];
 	uint32_t srcrb_flags;
 	uint8_t fence_size;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t owner_opcode;
-	int wqe_real_size, nr_txbbs;
+	int wqe_real_size, wqe_size, nr_txbbs, i;
+	bool bounce = FALSE;
 
-	/* for now we support pkts with one buf only */
-	if (wr->num_sge != 1)
+	if (unlikely(mlx4_tx_sg_virt_to_lkey(txq, pkt, sge, wr->num_sge)))
 		goto err;
+	wr->sg_list = sge;
 	/* Calc the needed wqe size for this packet */
 	wqe_real_size = mlx4_wqe_calc_real_size(wr->num_sge);
 	if (unlikely(!wqe_real_size))
 		goto err;
+	wqe_size = RTE_ALIGN(wqe_real_size, TXBB_SIZE);
 	nr_txbbs = SIZE_TO_TXBBS(wqe_real_size);
 	/* Are we too big to handle ? */
 	if (unlikely(mlx4_wq_overflow(sq, nr_txbbs)))
 		goto err;
-	/* Get ctrl and single-data wqe entries */
-	ctrl = mlx4_get_send_wqe(sq, head_idx);
+	/* Get ctrl entry */
+	if (likely(head_idx + nr_txbbs <= sq->txbb_cnt)) {
+		ctrl = mlx4_get_send_wqe(sq, head_idx);
+	} else {
+		/* Handle a WQE that wraps around the SQ by building it in
+		 * a side buffer and copying it back to the SQ when done.
+		 */
+		ctrl = (struct mlx4_wqe_ctrl_seg *)txq->bounce_buf;
+		bounce = TRUE;
+	}
+	/* Get data-seg entry */
 	dseg = (struct mlx4_wqe_data_seg *)(((char *)ctrl) +
 		sizeof(struct mlx4_wqe_ctrl_seg));
-	mlx4_set_data_seg(dseg, wr->sg_list);
-	/* For raw eth, the SOLICIT flag is used
-	 * to indicate that no icrc should be calculated
+	/* Fill in data from last to first */
+	for (i = wr->num_sge  - 1; i >= 0; --i)
+		mlx4_set_data_seg(dseg + i,  wr->sg_list + i);
+	/* Handle control info
+	 *
+	 * For raw eth, the SOLICIT flag is used to indicate that
+	 * no icrc should be calculated
 	 */
 	srcrb_flags = MLX4_WQE_CTRL_SOLICIT |
 		      ((wr->send_flags & IBV_SEND_SIGNALED) ?
@@ -361,7 +473,19 @@
 		MLX4_WQE_CTRL_FENCE : 0) | ((wqe_real_size / 16) & 0x3f);
 	owner_opcode = MLX4_OPCODE_SEND |
 		       ((sq->head & sq->txbb_cnt) ? MLX4_EN_BIT_WQE_OWN : 0);
-	mlx4_set_ctrl_seg(ctrl, owner_opcode, fence_size, srcrb_flags, 0);
+	/* fill in ctrl info but ownership */
+	mlx4_set_ctrl_seg(ctrl, fence_size, srcrb_flags, 0);
+       /* If we used a bounce buffer then copy wqe back into sq */
+	if (unlikely(bounce))
+		ctrl = mlx4_bounce_to_desc(txq, head_idx, wqe_size);
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
+	 */
+	 rte_wmb();
+	 ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
+
 	sq->head += nr_txbbs;
 	rte_wmb();
 	return 0;
@@ -439,62 +563,31 @@
 		/* Request Tx completion. */
 		if (unlikely(--elts_comp_cd == 0)) {
 			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
 			send_flags |= IBV_SEND_SIGNALED;
 		}
-		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
-				elt->buf = NULL;
-				goto stop;
-			}
-			if (buf->pkt_len <= txq->max_inline)
-				send_flags |= IBV_SEND_INLINE;
-			/* Update element. */
-			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
+		if (buf->pkt_len <= txq->max_inline)
+			send_flags |= IBV_SEND_INLINE;
+		/* Update element. */
+		elt->buf = buf;
+		if (txq->priv->vf)
 			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
-			/* Set up WR. */
-			wr->sg_list  = sge;
-			wr->num_sge  = segs;
-			wr->opcode   = IBV_WR_SEND;
-			wr->send_flags = send_flags;
-			wr->next     = NULL;
-			/* post the pkt for sending */
-			err = mlx4_post_send(txq, wr, &wr_bad);
-			if (unlikely(err)) {
-				if (unlikely(wr_bad->send_flags &
-					     IBV_SEND_SIGNALED)) {
-					elts_comp_cd = 1;
-					--elts_comp;
-				}
-				elt->buf = NULL;
-				goto stop;
-			}
-			sent_size += length;
-		} else {
-			err = -1;
+		/* Set up WR. */
+		wr->sg_list  = NULL; /* handled in post_send */
+		wr->num_sge  = segs;
+		wr->opcode   = IBV_WR_SEND;
+		wr->send_flags = send_flags;
+		wr->next     = NULL;
+		/* post the pkt for sending */
+		err = mlx4_post_send(txq, buf, wr, &wr_bad);
+		if (unlikely(err)) {
+			if (unlikely(wr_bad->send_flags &
+				     IBV_SEND_SIGNALED))
+				elts_comp_cd = 1;
+			elt->buf = NULL;
 			goto stop;
 		}
+		++elts_comp;
+		sent_size += buf->pkt_len;
 		elts_head = elts_head_next;
 		/* Increment sent bytes counter. */
 		txq->stats.obytes += sent_size;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index e442730..7cae7e2 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -139,13 +139,14 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
+	unsigned int elts_comp; /**< Number of pkts waiting for completion. */
 	unsigned int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
+	char *bounce_buf; /**< Side memory to be used when wqe wraps around */
 };
 
 /* mlx4_rxq.c */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 1273738..6f6ea9c 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -83,8 +83,14 @@
 		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->socket);
 	int ret = 0;
 
-	if (elts == NULL) {
-		ERROR("%p: can't allocate packets array", (void *)txq);
+	/* Allocate Bounce-buf memory */
+	txq->bounce_buf = (char *)rte_zmalloc_socket("TXQ",
+						     MAX_WQE_SIZE,
+						     RTE_CACHE_LINE_MIN_SIZE,
+						     txq->socket);
+
+	if ((elts == NULL) || (txq->bounce_buf == NULL)) {
+		ERROR("%p: can't allocate TXQ memory", (void *)txq);
 		ret = ENOMEM;
 		goto error;
 	}
@@ -110,6 +116,8 @@
 	assert(ret == 0);
 	return 0;
 error:
+	if (txq->bounce_buf != NULL)
+		rte_free(txq->bounce_buf);
 	if (elts != NULL)
 		rte_free(elts);
 	DEBUG("%p: failed, freed everything", (void *)txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 3/5] net/mlx4: refine setting Tx completion flag
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 1/5] net/mlx4: add simple Tx " Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 2/5] net/mlx4: support multi-segments Tx Moti Haimovsky
@ 2017-08-24 15:54 ` Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 4/5] net/mlx4: add Tx checksum offloads Moti Haimovsky
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Moti Haimovsky @ 2017-08-24 15:54 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

The PMD now takes into consideration the number of TxQ entries a packet
occupies when deciding whether or not to set the report-completion flag
for the chip.
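
In other words, the hunk below changes the completion request logic to
the following (same statements as in the diff; the numbers in the
comment are only an illustration):

	/*
	 * The completion countdown is now decremented by the number of
	 * TXBBs a packet occupies rather than by one per packet. For
	 * example, with elts_comp_cd_init == 64 and packets occupying
	 * 4 TXBBs each, a completion is requested about every 16 packets
	 * instead of every 64, keeping the amount of unreclaimed queue
	 * space bounded regardless of packet size.
	 */
	txq->elts_comp_cd -= nr_txbbs;
	if (unlikely(txq->elts_comp_cd <= 0)) {
		txq->elts_comp_cd = txq->elts_comp_cd_init;
		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
	} else {
		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
	}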

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 30 +++++++++++-------------------
 drivers/net/mlx4/mlx4_rxtx.h |  2 +-
 2 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index e41ea9e..dae0e47 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -461,14 +461,16 @@
 	/* Fill in data from last to first */
 	for (i = wr->num_sge  - 1; i >= 0; --i)
 		mlx4_set_data_seg(dseg + i,  wr->sg_list + i);
-	/* Handle control info
-	 *
-	 * For raw eth, the SOLICIT flag is used to indicate that
-	 * no icrc should be calculated
-	 */
-	srcrb_flags = MLX4_WQE_CTRL_SOLICIT |
-		      ((wr->send_flags & IBV_SEND_SIGNALED) ?
-				      MLX4_WQE_CTRL_CQ_UPDATE : 0);
+	/* Handle control info */
+	/* For raw eth always set the SOLICIT flag */
+	/*  Request Tx completion. */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+	} else {
+		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
+	}
 	fence_size = (wr->send_flags & IBV_SEND_FENCE ?
 		MLX4_WQE_CTRL_FENCE : 0) | ((wqe_real_size / 16) & 0x3f);
 	owner_opcode = MLX4_OPCODE_SEND |
@@ -514,13 +516,12 @@
 	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -560,11 +561,6 @@
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			send_flags |= IBV_SEND_SIGNALED;
-		}
 		if (buf->pkt_len <= txq->max_inline)
 			send_flags |= IBV_SEND_INLINE;
 		/* Update element. */
@@ -580,9 +576,6 @@
 		/* post the pkt for sending */
 		err = mlx4_post_send(txq, buf, wr, &wr_bad);
 		if (unlikely(err)) {
-			if (unlikely(wr_bad->send_flags &
-				     IBV_SEND_SIGNALED))
-				elts_comp_cd = 1;
 			elt->buf = NULL;
 			goto stop;
 		}
@@ -602,7 +595,6 @@
 	mlx4_send_flush(txq);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 7cae7e2..35e0de7 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -140,7 +140,7 @@ struct txq {
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
 	unsigned int elts_comp; /**< Number of pkts waiting for completion. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 4/5] net/mlx4: add Tx checksum offloads
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
                   ` (2 preceding siblings ...)
  2017-08-24 15:54 ` [PATCH 3/5] net/mlx4: refine setting Tx completion flag Moti Haimovsky
@ 2017-08-24 15:54 ` Moti Haimovsky
  2017-08-24 15:54 ` [PATCH 5/5] net/mlx4: add loopback Tx from VF Moti Haimovsky
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Moti Haimovsky @ 2017-08-24 15:54 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

The PMD now supports offloading IP and TCP/UDP header checksum calculation
(including for tunneled packets) to the hardware.
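
For context, an application requests these offloads per mbuf through the
usual DPDK fields before calling the Tx burst function. An illustrative
sketch, not part of this patch (assumes <rte_mbuf.h>, <rte_ether.h> and
<rte_ip.h>; app_request_udp_csum is a hypothetical helper):

static void
app_request_udp_csum(struct rte_mbuf *m)
{
	/* Offload IPv4 and UDP checksum computation for this packet. */
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_UDP_CKSUM;
}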

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4.c        |  7 +++++++
 drivers/net/mlx4/mlx4.h        |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h    |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 25 +++++++++++++++++++++----
 drivers/net/mlx4/mlx4_rxtx.h   |  2 ++
 drivers/net/mlx4/mlx4_txq.c    |  4 +++-
 7 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index b084903..3149be6 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -397,6 +397,7 @@ struct mlx4_conf {
 		.ports.present = 0,
 	};
 	unsigned int vf;
+	unsigned int tunnel_en;
 	int i;
 
 	(void)pci_drv;
@@ -456,6 +457,9 @@ struct mlx4_conf {
 		rte_errno = ENODEV;
 		goto error;
 	}
+	/* Only cx3-pro supports L3 tunneling */
+	tunnel_en = (device_attr.vendor_part_id ==
+		     PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
 	INFO("%u port(s) detected", device_attr.phys_port_cnt);
 	conf.ports.present |= (UINT64_C(1) << device_attr.phys_port_cnt) - 1;
 	if (mlx4_args(pci_dev->device.devargs, &conf)) {
@@ -529,6 +533,9 @@ struct mlx4_conf {
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->tunnel_en = tunnel_en;
+		priv->hw_csum =
+		     !!(device_attr.device_cap_flags & IBV_DEVICE_RAW_IP_CSUM);
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 93e5502..439a828 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -104,6 +104,8 @@ struct priv {
 	unsigned int vf:1; /* This is a VF device. */
 	unsigned int intr_alarm:1; /* An interrupt alarm is scheduled. */
 	unsigned int isolated:1; /* Toggle isolated mode. */
+	unsigned int hw_csum:1; /* Checksum offload is supported. */
+	unsigned int tunnel_en:1; /* Device tunneling is enabled */
 	struct rte_intr_handle intr_handle; /* Port interrupt handle. */
 	struct rte_flow_drop *flow_drop_queue; /* Flow drop queue. */
 	LIST_HEAD(mlx4_flows, rte_flow) flows;
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a9e8059..e4ecbfa 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,6 +553,12 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM  |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->tunnel_en)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->speed_capa =
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 8b0248a..38e9a45 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -60,6 +60,8 @@
 /* WQE flags */
 #define MLX4_OPCODE_SEND	0x0a
 #define MLX4_EN_BIT_WQE_OWN	0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define SIZE_TO_TXBBS(size)	(RTE_ALIGN((size), (TXBB_SIZE)) / (TXBB_SIZE))
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index dae0e47..3415f63 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -475,9 +475,27 @@
 		MLX4_WQE_CTRL_FENCE : 0) | ((wqe_real_size / 16) & 0x3f);
 	owner_opcode = MLX4_OPCODE_SEND |
 		       ((sq->head & sq->txbb_cnt) ? MLX4_EN_BIT_WQE_OWN : 0);
+	/* Should we enable HW CKSUM offload ? */
+	if (txq->priv->hw_csum &&
+	    (pkt->ol_flags &
+	    (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = pkt->ol_flags &
+					     (PKT_TX_TUNNEL_GRE |
+					      PKT_TX_TUNNEL_VXLAN);
+
+		if (is_tunneled && txq->tunnel_en) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
+		} else {
+			srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
+				      MLX4_WQE_CTRL_TCP_UDP_CSUM;
+		}
+	}
 	/* fill in ctrl info but ownership */
 	mlx4_set_ctrl_seg(ctrl, fence_size, srcrb_flags, 0);
-       /* If we used a bounce buffer then copy wqe back into sq */
+	/* If we used a bounce buffer then copy wqe back into sq */
 	if (unlikely(bounce))
 		ctrl = mlx4_bounce_to_desc(txq, head_idx, wqe_size);
 	/*
@@ -485,9 +503,8 @@
 	 * setting ownership bit (because HW can start
 	 * executing as soon as we do).
 	 */
-	 rte_wmb();
-	 ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
-
+	rte_wmb();
+	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 	sq->head += nr_txbbs;
 	rte_wmb();
 	return 0;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 35e0de7..b4675b7 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -146,6 +146,8 @@ struct txq {
 	unsigned int socket; /**< CPU socket ID for allocations. */
 	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
+	uint16_t tunnel_en:1;
+	/* When set, Tx offloads for tunneled packets are supported. */
 	char *bounce_buf; /**< Side memory to be used when wqe wraps around */
 };
 
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 6f6ea9c..cecd5e8 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -306,7 +306,7 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4dv_obj mlxdv;
 	struct mlx4dv_qp dv_qp;
 	struct mlx4dv_cq dv_cq;
-		struct txq tmpl = {
+	struct txq tmpl = {
 		.priv = priv,
 		.socket = socket
 	};
@@ -334,6 +334,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	if (priv->tunnel_en)
+		tmpl.tunnel_en = 1;
 	DEBUG("priv->device_attr.max_qp_wr is %d",
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 5/5] net/mlx4: add loopback Tx from VF
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
                   ` (3 preceding siblings ...)
  2017-08-24 15:54 ` [PATCH 4/5] net/mlx4: add Tx checksum offloads Moti Haimovsky
@ 2017-08-24 15:54 ` Moti Haimovsky
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
  2017-10-24 11:56 ` [PATCH 0/5] new mlx4 Tx " Nélio Laranjeiro
  6 siblings, 0 replies; 61+ messages in thread
From: Moti Haimovsky @ 2017-08-24 15:54 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

Added loopback functionality, used when the device is a VF, in order to
enable packet transmission between VFs and between VFs and the PF.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  |  2 +-
 drivers/net/mlx4/mlx4_rxtx.c | 28 ++++++++++++++++++++++------
 drivers/net/mlx4/mlx4_rxtx.h |  2 ++
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 38e9a45..e328cff 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -168,7 +168,7 @@
 		  uint32_t srcrb_flags, uint32_t imm)
 {
 	seg->fence_size = fence_size;
-	seg->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
+	seg->srcrb_flags = srcrb_flags;
 	/*
 	 * The caller should prepare "imm" in advance based on WR opcode.
 	 * For IBV_WR_SEND_WITH_IMM and IBV_WR_RDMA_WRITE_WITH_IMM,
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 3415f63..ed19c72 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -426,7 +426,11 @@
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct ibv_sge sge[wr->num_sge];
-	uint32_t srcrb_flags;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
+	uint32_t imm = 0;
 	uint8_t fence_size;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t owner_opcode;
@@ -466,10 +470,10 @@
 	/*  Request Tx completion. */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
-		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
+		srcrb.flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
 	} else {
-		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
+		srcrb.flags = MLX4_WQE_CTRL_SOLICIT;
 	}
 	fence_size = (wr->send_flags & IBV_SEND_FENCE ?
 		MLX4_WQE_CTRL_FENCE : 0) | ((wqe_real_size / 16) & 0x3f);
@@ -487,14 +491,26 @@
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
+				srcrb.flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
 		} else {
-			srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
 				      MLX4_WQE_CTRL_TCP_UDP_CSUM;
 		}
 	}
+	/* Convert flags to BE before adding the destination MAC address
+	 * (if any) to them.
+	 */
+	srcrb.flags = rte_cpu_to_be_32(srcrb.flags);
+	/* Copy dst mac address to wqe. This allows loopback in eSwitch,
+	 * so that VFs and PF can communicate with each other
+	 */
+	if (txq->lb) {
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						sizeof(uint16_t)));
+	}
 	/* fill in ctrl info but ownership */
-	mlx4_set_ctrl_seg(ctrl, fence_size, srcrb_flags, 0);
+	mlx4_set_ctrl_seg(ctrl, fence_size, srcrb.flags, imm);
 	/* If we used a bounce buffer then copy wqe back into sq */
 	if (unlikely(bounce))
 		ctrl = mlx4_bounce_to_desc(txq, head_idx, wqe_size);
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index b4675b7..8e407f5 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -148,6 +148,8 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	uint16_t tunnel_en:1;
 	/* When set, Tx offloads for tunneled packets are supported. */
+	uint16_t lb:1;
+	/* Whether packets should be looped back by the eSwitch or not. */
 	char *bounce_buf; /**< Side memory to be used when wqe wraps around */
 };
 
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index cecd5e8..296d72d 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -410,6 +410,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* If a VF device - need to loopback xmitted packets */
+	tmpl.lb = !!(priv->vf);
 	/* Clean up txq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old txq just in case", (void *)txq);
 	mlx4_txq_cleanup(txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
                   ` (4 preceding siblings ...)
  2017-08-24 15:54 ` [PATCH 5/5] net/mlx4: add loopback Tx from VF Moti Haimovsky
@ 2017-10-03 10:48 ` Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 1/6] net/mlx4: add simple Tx " Matan Azrad
                     ` (8 more replies)
  2017-10-24 11:56 ` [PATCH 0/5] new mlx4 Tx " Nélio Laranjeiro
  6 siblings, 9 replies; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky

v2:
Rearrange patches.
Semantic fixes.
Enhancements.
Fix compilation issues.

Moti Haimovsky (6):
  net/mlx4: add simple Tx bypassing ibverbs
  net/mlx4: get back Rx flow functionality
  net/mlx4: support multi-segments Tx
  net/mlx4: get back Tx checksum offloads
  net/mlx4: get back Rx checksum offloads
  net/mlx4: add loopback Tx from VF

 drivers/net/mlx4/mlx4.c        |  11 +
 drivers/net/mlx4/mlx4.h        |  13 +-
 drivers/net/mlx4/mlx4_ethdev.c |  10 +
 drivers/net/mlx4/mlx4_prm.h    | 129 +++++++
 drivers/net/mlx4/mlx4_rxq.c    | 181 ++++++----
 drivers/net/mlx4/mlx4_rxtx.c   | 787 ++++++++++++++++++++++++++++++-----------
 drivers/net/mlx4/mlx4_rxtx.h   |  61 ++--
 drivers/net/mlx4/mlx4_txq.c    | 104 +++++-
 drivers/net/mlx4/mlx4_utils.h  |  20 ++
 mk/rte.app.mk                  |   2 +-
 10 files changed, 989 insertions(+), 329 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 1/6] net/mlx4: add simple Tx bypassing ibverbs
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
@ 2017-10-03 10:48   ` Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 2/6] net/mlx4: get back Rx flow functionality Matan Azrad
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

Modify the PMD to send single-buffer packets directly to the device,
bypassing the ibv Tx post and poll routines.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 108 +++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 353 ++++++++++++++++++++++++++++++++-----------
 drivers/net/mlx4/mlx4_rxtx.h |  32 ++--
 drivers/net/mlx4/mlx4_txq.c  |  90 ++++++++---
 mk/rte.app.mk                |   2 +-
 5 files changed, 463 insertions(+), 122 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..6d1800a
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,108 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef RTE_PMD_MLX4_PRM_H_
+#define RTE_PMD_MLX4_PRM_H_
+
+#include <rte_byteorder.h>
+#include <rte_branch_prediction.h>
+#include <rte_atomic.h>
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx4dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* ConnectX-3 Tx queue basic block. */
+#define MLX4_TXBB_SHIFT 6
+#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes. */
+#define MLX4_MAX_WQE_SIZE 512
+#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+
+/* Send queue stamping/invalidating information. */
+#define MLX4_SQ_STAMP_STRIDE 64
+#define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
+#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_STAMP_VAL 0x7fffffff
+
+/* Work queue element (WQE) flags. */
+#define MLX4_BIT_WQE_OWN 0x80000000
+
+#define MLX4_SIZE_TO_TXBBS(size) \
+		(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
+
+/* Send queue information. */
+struct mlx4_sq {
+	char *buf; /**< SQ buffer. */
+	char *eob; /**< End of SQ buffer */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt; /**< Num of WQEBBs in the SQ (must be a power of 2). */
+	uint32_t txbb_cnt_mask; /**< txbb_cnt mask (txbb_cnt is a power of 2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db; /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
+};
+
+#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
+
+/* Completion queue information. */
+struct mlx4_cq {
+	char *buf; /**< Pointer to the completion queue buffer. */
+	uint32_t cqe_cnt; /**< Number of entries in the queue. */
+	uint32_t cqe_64:1; /**< CQ entry size is 64 bytes. */
+	uint32_t cons_index; /**< Last queue entry that was handled. */
+	uint32_t *set_ci_db; /**< Pointer to the completion queue doorbell. */
+};
+
+/*
+ * cqe = cq->buf + cons_index * cqe_size + cqe_offset
+ * Where cqe_size is 32 or 64 bytes and
+ * cqe_offset is 0 or 32 (depending on cqe_size).
+ */
+#define mlx4_get_cqe(cq, n) (__extension__({ \
+				typeof(cq) q = (cq); \
+				(q)->buf + \
+				(((n) & ((q)->cqe_cnt - 1)) << \
+				 (5 + (q)->cqe_64)) + \
+				((q)->cqe_64 << 5); \
+			    }))
+
+#endif /* RTE_PMD_MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b5e7777..55c8e9a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -40,6 +40,7 @@
 #include <inttypes.h>
 #include <stdint.h>
 #include <string.h>
+#include <stdbool.h>
 
 /* Verbs headers do not support -pedantic. */
 #ifdef PEDANTIC
@@ -52,15 +53,76 @@
 
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
+#include <rte_io.h>
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
 /**
+ * Stamp a WQE so it won't be reused by the HW.
+ * Routine is used when freeing WQE used by the chip or when failing
+ * to build a WQ entry, leaving partial information on the queue.
+ *
+ * @param sq
+ *   Pointer to the sq structure.
+ * @param index
+ *   Index of the freed WQE.
+ *   (The number of TXBBs to stamp is read from the size field written
+ *   in the WQE's control segment rather than being passed as a
+ *   parameter.)
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of TX basic blocks (TXBB) the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp =
+		rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+				 (!!owner << MLX4_SQ_STAMP_SHIFT));
+	char *wqe = mlx4_get_send_wqe(sq, (index & sq->txbb_cnt_mask));
+	uint32_t *ptr = (uint32_t *)wqe;
+	int i;
+	int txbbs_size;
+	int num_txbbs;
+
+	/* Extract the size from the control segment of the WQE. */
+	num_txbbs = MLX4_SIZE_TO_TXBBS((((struct mlx4_wqe_ctrl_seg *)
+					 wqe)->fence_size & 0x3f) << 4);
+	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	/* Optimize the common case when there is no wrap-around */
+	if (wqe + txbbs_size <= sq->eob) {
+		/* Stamp the freed descriptor */
+		for (i = 0;
+		     i < txbbs_size;
+		     i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor */
+		for (i = 0;
+		     i < txbbs_size;
+		     i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+			if ((char *)ptr >= sq->eob) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= RTE_BE32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
  * Manage Tx completions.
  *
  * When sending a burst, mlx4_tx_burst() posts several WRs.
@@ -80,26 +142,73 @@
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
-	int wcs_n;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
-	if (unlikely(wcs_n == 0))
+	/*
+	 * Traverse over all CQ entries reported and handle each WQ entry
+	 * reported by them.
+	 */
+	do {
+		cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cons_index);
+		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		    !!(cons_index & cq->cqe_cnt)))
+			break;
+		/*
+		 * make sure we read the CQE after we read the
+		 * Make sure we read the CQE after we read the
+		 * ownership bit.
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      (void *)txq, cqe_err->vendor_err, cqe_err->syndrome);
+		}
+		/* Get WQE index reported in the CQE. */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* free next descriptor */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		cons_index++;
+	} while (true);
+	if (unlikely(pkts == 0))
 		return 0;
-	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
-		      (void *)txq, wcs_n);
-		return -1;
-	}
-	elts_comp -= wcs_n;
+	/*
+	 * Update CQ.
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	/*
+	 * Update the list of packets posted for transmission.
+	 */
+	elts_comp -= pkts;
 	assert(elts_comp <= txq->elts_comp);
 	/*
-	 * Assume WC status is successful as nothing can be done about it
-	 * anyway.
+	 * Assume completion status is successful as nothing can be done about
+	 * it anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -117,7 +226,7 @@
  * @return
  *   Memory pool where data is located for given mbuf.
  */
-static struct rte_mempool *
+static inline struct rte_mempool *
 mlx4_txq_mb2mp(struct rte_mbuf *buf)
 {
 	if (unlikely(RTE_MBUF_INDIRECT(buf)))
@@ -158,7 +267,7 @@
 	/* Add a new entry, register MR first. */
 	DEBUG("%p: discovered new memory pool \"%s\" (%p)",
 	      (void *)txq, mp->name, (void *)mp);
-	mr = mlx4_mp2mr(txq->priv->pd, mp);
+	mr = mlx4_mp2mr(txq->ctrl.priv->pd, mp);
 	if (unlikely(mr == NULL)) {
 		DEBUG("%p: unable to configure MR, ibv_reg_mr() failed.",
 		      (void *)txq);
@@ -183,6 +292,124 @@
 }
 
 /**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   The Tx queue to post to.
+ * @param wr
+ *   The work request to handle.
+ * @param bad_wr
+ *   The wr in case that posting had failed.
+ *
+ * @return
+ *   0 - success, negative errno value otherwise and rte_errno is set.
+ */
+static inline int
+mlx4_post_send(struct txq *txq,
+	       struct rte_mbuf *pkt,
+	       uint32_t send_flags)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t lkey;
+	uintptr_t addr;
+	int wqe_real_size;
+	int nr_txbbs;
+	int rc;
+
+	/* Calculate the needed work queue entry size for this packet. */
+	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+	/* Check that there is room for this WQE in the send queue and
+	 * that the WQE size is legal.
+	 */
+	if (likely(((sq->head - sq->tail) + nr_txbbs +
+		    sq->headroom_txbbs >= sq->txbb_cnt) ||
+		   (nr_txbbs > MLX4_MAX_WQE_TXBBS))) {
+		rc = ENOSPC;
+		goto err;
+	}
+	/* Get the control and single-data entries of the WQE */
+	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)(((char *)ctrl) +
+		sizeof(struct mlx4_wqe_ctrl_seg));
+	/*
+	 * Fill the data segment with buffer information.
+	 */
+	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
+	rte_prefetch0((volatile void *)addr);
+	dseg->addr = rte_cpu_to_be_64(addr);
+	/* Memory region key for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		/* MR does not exist. */
+		DEBUG("%p: unable to get MP <-> MR"
+		      " association", (void *)txq);
+		/*
+		 * Restamp entry in case of failure.
+		 * Make sure that size is written correctly.
+		 * Note that we give ownership to the SW, not the HW.
+		 */
+		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+		mlx4_txq_stamp_freed_wqe(sq, head_idx,
+					 (sq->head & sq->txbb_cnt) ? 0 : 1);
+		rc = EFAULT;
+		goto err;
+	}
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * Need a barrier here before writing the byte_count field to
+	 * make sure that all the data is visible before the
+	 * byte_count field is set.  Otherwise, if the segment begins
+	 * a new cacheline, the HCA prefetcher could grab the 64-byte
+	 * chunk and get a valid (!= * 0xffffffff) byte count but
+	 * stale data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+	if (likely(pkt->data_len))
+		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
+	else
+		/*
+		 * Zero length segment is treated as inline segment
+		 * with zero data.
+		 */
+		dseg->byte_count = RTE_BE32(0x80000000);
+	/*
+	 * Fill the control parameters for this packet.
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
+	 * should be calculated
+	 */
+	ctrl->srcrb_flags =
+		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
+				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+	/*
+	 * The caller should prepare "imm" in advance in order to support
+	 * VF to VF communication (when the device is a virtual-function
+	 * device (VF)).
+	 */
+	ctrl->imm = 0;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
+	 */
+	rte_wmb();
+	ctrl->owner_opcode =
+		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
+				 ((sq->head & sq->txbb_cnt) ?
+				  MLX4_BIT_WQE_OWN : 0));
+	sq->head += nr_txbbs;
+	return 0;
+err:
+	rte_errno = rc;
+	return -rc;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,13 +426,11 @@
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
-	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
 	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
+	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
@@ -229,9 +454,7 @@
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		struct ibv_send_wr *wr = &elt->wr;
 		unsigned int segs = buf->nb_segs;
-		unsigned int sent_size = 0;
 		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
@@ -254,93 +477,43 @@
 		if (unlikely(--elts_comp_cd == 0)) {
 			elts_comp_cd = txq->elts_comp_cd_init;
 			++elts_comp;
-			send_flags |= IBV_SEND_SIGNALED;
+			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
 		}
 		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
+			/* Update element. */
+			elt->buf = buf;
+			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+			/* post the pkt for sending */
+			err = mlx4_post_send(txq, buf, send_flags);
+			if (unlikely(err)) {
+				if (unlikely(send_flags &
+					     MLX4_WQE_CTRL_CQ_UPDATE)) {
+					elts_comp_cd = 1;
+					--elts_comp;
+				}
 				elt->buf = NULL;
 				goto stop;
 			}
-			/* Update element. */
 			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
+			bytes_sent += buf->pkt_len;
 		} else {
-			err = -1;
+			err = -EINVAL;
+			rte_errno = -err;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
 		elts_head = elts_head_next;
-		/* Increment sent bytes counter. */
-		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
-	/* Increment sent packets counter. */
+	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
+	txq->stats.obytes += bytes_sent;
+	/* Make sure that descriptors are written before doorbell record. */
+	rte_wmb();
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
 	txq->elts_comp_cd = elts_comp_cd;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fec998a..b515472 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -40,6 +40,7 @@
 #ifdef PEDANTIC
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
+#include <infiniband/mlx4dv.h>
 #include <infiniband/verbs.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
@@ -50,6 +51,7 @@
 #include <rte_mempool.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 
 /** Rx queue counters. */
 struct mlx4_rxq_stats {
@@ -85,8 +87,6 @@ struct rxq {
 
 /** Tx element. */
 struct txq_elt {
-	struct ibv_send_wr wr; /* Work request. */
-	struct ibv_sge sge; /* Scatter/gather element. */
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
@@ -100,24 +100,28 @@ struct mlx4_txq_stats {
 
 /** Tx queue descriptor. */
 struct txq {
-	struct priv *priv; /**< Back pointer to private data. */
-	struct {
-		const struct rte_mempool *mp; /**< Cached memory pool. */
-		struct ibv_mr *mr; /**< Memory region (for mp). */
-		uint32_t lkey; /**< mr->lkey copy. */
-	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
-	struct ibv_cq *cq; /**< Completion queue. */
-	struct ibv_qp *qp; /**< Queue pair. */
-	uint32_t max_inline; /**< Max inline send size. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	struct txq_elt (*elts)[]; /**< Tx elements. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
 	unsigned int elts_comp; /**< Number of completion requests. */
 	unsigned int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
+	unsigned int elts_n; /**< (*elts)[] length. */
+	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
-	unsigned int socket; /**< CPU socket ID for allocations. */
+	uint32_t max_inline; /**< Max inline send size. */
+	struct {
+		const struct rte_mempool *mp; /**< Cached memory pool. */
+		struct ibv_mr *mr; /**< Memory region (for mp). */
+		uint32_t lkey; /**< mr->lkey copy. */
+	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
+	struct {
+		struct priv *priv; /**< Back pointer to private data. */
+		unsigned int socket; /**< CPU socket ID for allocations. */
+		struct ibv_cq *cq; /**< Completion queue. */
+		struct ibv_qp *qp; /**< Queue pair. */
+	} ctrl;
 };
 
 /* mlx4_rxq.c */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index e0245b0..492779f 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -62,6 +62,7 @@
 #include "mlx4_autoconf.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
+#include "mlx4_prm.h"
 
 /**
  * Allocate Tx queue elements.
@@ -79,7 +80,7 @@
 {
 	unsigned int i;
 	struct txq_elt (*elts)[elts_n] =
-		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->socket);
+		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->ctrl.socket);
 	int ret = 0;
 
 	if (elts == NULL) {
@@ -170,10 +171,10 @@
 
 	DEBUG("cleaning up %p", (void *)txq);
 	mlx4_txq_free_elts(txq);
-	if (txq->qp != NULL)
-		claim_zero(ibv_destroy_qp(txq->qp));
-	if (txq->cq != NULL)
-		claim_zero(ibv_destroy_cq(txq->cq));
+	if (txq->ctrl.qp != NULL)
+		claim_zero(ibv_destroy_qp(txq->ctrl.qp));
+	if (txq->ctrl.cq != NULL)
+		claim_zero(ibv_destroy_cq(txq->ctrl.cq));
 	for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) {
 		if (txq->mp2mr[i].mp == NULL)
 			break;
@@ -242,6 +243,42 @@ struct txq_mp2mr_mbuf_check_data {
 }
 
 /**
+ * Retrieves information needed in order to directly access the Tx queue.
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param mlxdv
+ *   Pointer to device information for this Tx queue.
+ */
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+	/* Total SQ length, including headroom and spare WQEs. */
+	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+
+	sq->buf = ((char *)dqp->buf.buf) + dqp->sq.offset;
+	/* Total length, including headroom and spare WQEs. */
+	sq->eob = sq->buf + sq_size;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_cnt =
+		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs =
+		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
+	cq->buf = dcq->buf.buf;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_64 = (dcq->cqe_size & 64) ? 1 : 0;
+}
+
+/**
  * Configure a Tx queue.
  *
  * @param dev
@@ -263,9 +300,15 @@ struct txq_mp2mr_mbuf_check_data {
 	       unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
+
 	struct txq tmpl = {
-		.priv = priv,
-		.socket = socket
+		.ctrl = {
+			.priv = priv,
+			.socket = socket
+		},
 	};
 	union {
 		struct ibv_qp_init_attr init;
@@ -284,8 +327,8 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	/* MRs will be registered in mp2mr[] later. */
-	tmpl.cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
-	if (tmpl.cq == NULL) {
+	tmpl.ctrl.cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
+	if (tmpl.ctrl.cq == NULL) {
 		rte_errno = ENOMEM;
 		ERROR("%p: CQ creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -297,9 +340,9 @@ struct txq_mp2mr_mbuf_check_data {
 	      priv->device_attr.max_sge);
 	attr.init = (struct ibv_qp_init_attr){
 		/* CQ to be associated with the send queue. */
-		.send_cq = tmpl.cq,
+		.send_cq = tmpl.ctrl.cq,
 		/* CQ to be associated with the receive queue. */
-		.recv_cq = tmpl.cq,
+		.recv_cq = tmpl.ctrl.cq,
 		.cap = {
 			/* Max number of outstanding WRs. */
 			.max_send_wr = ((priv->device_attr.max_qp_wr < desc) ?
@@ -316,8 +359,8 @@ struct txq_mp2mr_mbuf_check_data {
 		 */
 		.sq_sig_all = 0,
 	};
-	tmpl.qp = ibv_create_qp(priv->pd, &attr.init);
-	if (tmpl.qp == NULL) {
+	tmpl.ctrl.qp = ibv_create_qp(priv->pd, &attr.init);
+	if (tmpl.ctrl.qp == NULL) {
 		rte_errno = errno ? errno : EINVAL;
 		ERROR("%p: QP creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -331,7 +374,8 @@ struct txq_mp2mr_mbuf_check_data {
 		/* Primary port number. */
 		.port_num = priv->port
 	};
-	ret = ibv_modify_qp(tmpl.qp, &attr.mod, IBV_QP_STATE | IBV_QP_PORT);
+	ret = ibv_modify_qp(tmpl.ctrl.qp, &attr.mod,
+			    IBV_QP_STATE | IBV_QP_PORT);
 	if (ret) {
 		rte_errno = ret;
 		ERROR("%p: QP state to IBV_QPS_INIT failed: %s",
@@ -348,7 +392,7 @@ struct txq_mp2mr_mbuf_check_data {
 	attr.mod = (struct ibv_qp_attr){
 		.qp_state = IBV_QPS_RTR
 	};
-	ret = ibv_modify_qp(tmpl.qp, &attr.mod, IBV_QP_STATE);
+	ret = ibv_modify_qp(tmpl.ctrl.qp, &attr.mod, IBV_QP_STATE);
 	if (ret) {
 		rte_errno = ret;
 		ERROR("%p: QP state to IBV_QPS_RTR failed: %s",
@@ -356,7 +400,7 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	attr.mod.qp_state = IBV_QPS_RTS;
-	ret = ibv_modify_qp(tmpl.qp, &attr.mod, IBV_QP_STATE);
+	ret = ibv_modify_qp(tmpl.ctrl.qp, &attr.mod, IBV_QP_STATE);
 	if (ret) {
 		rte_errno = ret;
 		ERROR("%p: QP state to IBV_QPS_RTS failed: %s",
@@ -370,6 +414,18 @@ struct txq_mp2mr_mbuf_check_data {
 	DEBUG("%p: txq updated with %p", (void *)txq, (void *)&tmpl);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
+	/* Retrieve device Q info */
+	mlxdv.cq.in = txq->ctrl.cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->ctrl.qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: Failed to obtain information needed for "
+		      "accessing the device queues", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	return 0;
 error:
 	ret = rte_errno;
@@ -459,7 +515,7 @@ struct txq_mp2mr_mbuf_check_data {
 
 	if (txq == NULL)
 		return;
-	priv = txq->priv;
+	priv = txq->ctrl.priv;
 	for (i = 0; i != priv->dev->data->nb_tx_queues; ++i)
 		if (priv->dev->data->tx_queues[i] == txq) {
 			DEBUG("%p: removing Tx queue %p from list",
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index c25fdd9..2f1286e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -128,7 +128,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
 endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LIO_PMD)        += -lrte_pmd_lio
-_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs -lmlx4
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -libverbs
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NFP_PMD)        += -lrte_pmd_nfp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_NULL)       += -lrte_pmd_null
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 2/6] net/mlx4: get back Rx flow functionality
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 1/6] net/mlx4: add simple Tx " Matan Azrad
@ 2017-10-03 10:48   ` Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 3/6] net/mlx4: support multi-segments Tx Matan Azrad
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for accessing the hardware directly when handling
Rx packets, eliminating the need to use verbs in the Rx datapath.

Now the number of scatters is calculated on the fly, according to the
maximum expected packet size.
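
As a worked example of that on-the-fly calculation (all numbers are
illustrative and assume the default 128-byte mbuf headroom), a 9000-byte
maximum Rx packet size with a 2048-byte mbuf data room resolves to
8 SGEs per packet, mirroring the logic added to the Rx queue setup below:

/* Worked example of the SGE calculation; all values are illustrative. */
unsigned int mb_len = 2048;      /* rte_pktmbuf_data_room_size(mp) */
unsigned int rx_pkt_len = 9000;  /* rxmode.max_rx_pkt_len (jumbo frame) */
unsigned int sges_n;

/* Only the first mbuf of a packet keeps its headroom. */
rx_pkt_len = rx_pkt_len - mb_len + RTE_PKTMBUF_HEADROOM; /* 7080 */
/* SGEs needed for a full packet... */
sges_n = (rx_pkt_len / mb_len) + !!(rx_pkt_len % mb_len) + 1; /* 5 */
/* ...rounded up to the next power of two: 1 << log2above(5) = 8 SGEs. */

The requested descriptor count is then divided by the per-packet SGE
count, as can be seen in the diff below.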

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
---
 drivers/net/mlx4/mlx4.h       |  11 --
 drivers/net/mlx4/mlx4_rxq.c   | 176 +++++++++++++++++++-------------
 drivers/net/mlx4/mlx4_rxtx.c  | 229 ++++++++++++++++++++++++------------------
 drivers/net/mlx4/mlx4_rxtx.h  |  19 ++--
 drivers/net/mlx4/mlx4_utils.h |  20 ++++
 5 files changed, 268 insertions(+), 187 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 93e5502..b6e1ef2 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -57,17 +57,6 @@
 /* Maximum size for inline data. */
 #define MLX4_PMD_MAX_INLINE 0
 
-/*
- * Maximum number of cached Memory Pools (MPs) per TX queue. Each RTE MP
- * from which buffers are to be transmitted will have to be mapped by this
- * driver to their own Memory Region (MR). This is a slow operation.
- *
- * This value is always 1 for RX queues.
- */
-#ifndef MLX4_PMD_TX_MP_CACHE
-#define MLX4_PMD_TX_MP_CACHE 8
-#endif
-
 /* Interrupt alarm timeout value in microseconds. */
 #define MLX4_INTR_ALARM_TIMEOUT 100000
 
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 409983f..d7447c4 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -51,6 +51,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
 
+#include <rte_byteorder.h>
 #include <rte_common.h>
 #include <rte_errno.h>
 #include <rte_ethdev.h>
@@ -77,60 +78,63 @@
 mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 {
 	unsigned int i;
-	struct rxq_elt (*elts)[elts_n] =
-		rte_calloc_socket("RXQ elements", 1, sizeof(*elts), 0,
-				  rxq->socket);
+	const unsigned int sge_n = 1 << rxq->sge_n;
+	struct rte_mbuf *(*elts)[elts_n] =
+		rte_calloc_socket("RXQ", 1, sizeof(*elts), 0, rxq->socket);
 
 	if (elts == NULL) {
+		elts_n = 0;
 		rte_errno = ENOMEM;
 		ERROR("%p: can't allocate packets array", (void *)rxq);
 		goto error;
 	}
-	/* For each WR (packet). */
-	for (i = 0; (i != elts_n); ++i) {
-		struct rxq_elt *elt = &(*elts)[i];
-		struct ibv_recv_wr *wr = &elt->wr;
-		struct ibv_sge *sge = &(*elts)[i].sge;
-		struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);
+	rxq->elts = elts;
+	for (i = 0; i != elts_n; ++i) {
+		struct rte_mbuf *buf;
+		volatile struct mlx4_wqe_data_seg *scat =
+			&(*rxq->hw.wqes)[i];
 
+		buf = rte_pktmbuf_alloc(rxq->mp);
 		if (buf == NULL) {
 			rte_errno = ENOMEM;
 			ERROR("%p: empty mbuf pool", (void *)rxq);
 			goto error;
 		}
-		elt->buf = buf;
-		wr->next = &(*elts)[(i + 1)].wr;
-		wr->sg_list = sge;
-		wr->num_sge = 1;
 		/* Headroom is reserved by rte_pktmbuf_alloc(). */
 		assert(buf->data_off == RTE_PKTMBUF_HEADROOM);
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
-		/* sge->addr must be able to store a pointer. */
-		assert(sizeof(sge->addr) >= sizeof(uintptr_t));
-		/* SGE keeps its headroom. */
-		sge->addr = (uintptr_t)
-			((uint8_t *)buf->buf_addr + RTE_PKTMBUF_HEADROOM);
-		sge->length = (buf->buf_len - RTE_PKTMBUF_HEADROOM);
-		sge->lkey = rxq->mr->lkey;
-		/* Redundant check for tailroom. */
-		assert(sge->length == rte_pktmbuf_tailroom(buf));
+		assert(!buf->next);
+		/* Only the first segment keeps headroom. */
+		if (i % sge_n)
+			buf->data_off = 0;
+		buf->port = rxq->port_id;
+		buf->data_len = rte_pktmbuf_tailroom(buf);
+		buf->pkt_len = rte_pktmbuf_tailroom(buf);
+		buf->nb_segs = 1;
+		/* scat->addr must be able to store a pointer. */
+		assert(sizeof(scat->addr) >= sizeof(uintptr_t));
+		*scat = (struct mlx4_wqe_data_seg){
+			.addr =
+			    rte_cpu_to_be_64(rte_pktmbuf_mtod(buf, uintptr_t)),
+			.byte_count = rte_cpu_to_be_32(buf->data_len),
+			.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
+		};
+		(*rxq->elts)[i] = buf;
 	}
-	/* The last WR pointer must be NULL. */
-	(*elts)[(i - 1)].wr.next = NULL;
-	DEBUG("%p: allocated and configured %u single-segment WRs",
-	      (void *)rxq, elts_n);
-	rxq->elts_n = elts_n;
-	rxq->elts_head = 0;
-	rxq->elts = elts;
+	DEBUG("%p: allocated and configured %u segments (max %u packets)",
+	      (void *)rxq, elts_n, elts_n >> rxq->sge_n);
+	rxq->elts_n = log2above(elts_n);
 	return 0;
 error:
-	if (elts != NULL) {
-		for (i = 0; (i != RTE_DIM(*elts)); ++i)
-			rte_pktmbuf_free_seg((*elts)[i].buf);
-		rte_free(elts);
+	for (i = 0; i != elts_n; ++i) {
+		if ((*rxq->elts)[i] != NULL)
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
+		(*rxq->elts)[i] = NULL;
 	}
+	rte_free(rxq->elts);
+	rxq->elts = NULL;
 	DEBUG("%p: failed, freed everything", (void *)rxq);
 	assert(rte_errno > 0);
 	return -rte_errno;
@@ -146,17 +150,18 @@
 mlx4_rxq_free_elts(struct rxq *rxq)
 {
 	unsigned int i;
-	unsigned int elts_n = rxq->elts_n;
-	struct rxq_elt (*elts)[elts_n] = rxq->elts;
 
 	DEBUG("%p: freeing WRs", (void *)rxq);
+	if (rxq->elts == NULL)
+		return;
+
+	for (i = 0; i != (1u << rxq->elts_n); ++i) {
+		if ((*rxq->elts)[i] != NULL)
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
+	}
+	rte_free(rxq->elts);
 	rxq->elts_n = 0;
 	rxq->elts = NULL;
-	if (elts == NULL)
-		return;
-	for (i = 0; (i != RTE_DIM(*elts)); ++i)
-		rte_pktmbuf_free_seg((*elts)[i].buf);
-	rte_free(elts);
 }
 
 /**
@@ -198,7 +203,8 @@
  *   QP pointer or NULL in case of error and rte_errno is set.
  */
 static struct ibv_qp *
-mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc)
+mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq,
+		  uint16_t desc, unsigned int sge_n)
 {
 	struct ibv_qp *qp;
 	struct ibv_qp_init_attr attr = {
@@ -212,7 +218,7 @@
 					priv->device_attr.max_qp_wr :
 					desc),
 			/* Max number of scatter/gather elements in a WR. */
-			.max_recv_sge = 1,
+			.max_recv_sge = sge_n,
 		},
 		.qp_type = IBV_QPT_RAW_PACKET,
 	};
@@ -248,32 +254,43 @@
 	       struct rte_mempool *mp)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct rxq tmpl = {
 		.priv = priv,
 		.mp = mp,
 		.socket = socket
 	};
 	struct ibv_qp_attr mod;
-	struct ibv_recv_wr *bad_wr;
 	unsigned int mb_len;
 	int ret;
 
 	(void)conf; /* Thresholds configuration (ignored). */
 	mb_len = rte_pktmbuf_data_room_size(mp);
-	if (desc == 0) {
-		rte_errno = EINVAL;
-		ERROR("%p: invalid number of Rx descriptors", (void *)dev);
-		goto error;
-	}
 	/* Enable scattered packets support for this queue if necessary. */
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
 	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
-		;
+		tmpl.sge_n = 0;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
-		WARN("%p: scattered mode has been requested but is"
-		     " not supported, this may lead to packet loss",
-		     (void *)dev);
+		unsigned int sges_n;
+		unsigned int rx_pkt_len =
+				dev->data->dev_conf.rxmode.jumbo_frame ?
+				dev->data->dev_conf.rxmode.max_rx_pkt_len :
+				ETHER_MTU;
+
+		if (rx_pkt_len < ETHER_MTU)
+			rx_pkt_len = ETHER_MTU;
+		/* Only the first mbuf has a headroom */
+		rx_pkt_len = rx_pkt_len - mb_len + RTE_PKTMBUF_HEADROOM;
+		/*
+		 * Determine the number of SGEs needed for a full packet
+		 * and round it to the next power of two.
+		 */
+		sges_n = (rx_pkt_len / mb_len) + !!(rx_pkt_len % mb_len) + 1;
+		tmpl.sge_n = log2above(sges_n);
+		desc >>= tmpl.sge_n;
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -282,6 +299,8 @@
 		     dev->data->dev_conf.rxmode.max_rx_pkt_len,
 		     mb_len - RTE_PKTMBUF_HEADROOM);
 	}
+	DEBUG("%p: number of sges %u (%u WRs)",
+	      (void *)dev, 1 << tmpl.sge_n, desc);
 	/* Use the entire Rx mempool as the memory region. */
 	tmpl.mr = mlx4_mp2mr(priv->pd, mp);
 	if (tmpl.mr == NULL) {
@@ -317,7 +336,7 @@
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
 	      priv->device_attr.max_sge);
-	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc);
+	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc, 1 << tmpl.sge_n);
 	if (tmpl.qp == NULL) {
 		ERROR("%p: QP creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -336,21 +355,6 @@
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
-	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
-	if (ret) {
-		ERROR("%p: RXQ allocation failed: %s",
-		      (void *)dev, strerror(rte_errno));
-		goto error;
-	}
-	ret = ibv_post_recv(tmpl.qp, &(*tmpl.elts)[0].wr, &bad_wr);
-	if (ret) {
-		rte_errno = ret;
-		ERROR("%p: ibv_post_recv() failed for WR %p: %s",
-		      (void *)dev,
-		      (void *)bad_wr,
-		      strerror(rte_errno));
-		goto error;
-	}
 	mod = (struct ibv_qp_attr){
 		.qp_state = IBV_QPS_RTR
 	};
@@ -361,14 +365,44 @@
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Get HW dependent info. */
+	mlxdv.cq.in = tmpl.cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = tmpl.qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: Failed to retrieve device obj info", (void *)dev);
+		goto error;
+	}
+	/* Init HW dependent fields. */
+	tmpl.hw.wqes =
+		(volatile struct mlx4_wqe_data_seg (*)[])
+		((char *)dv_qp.buf.buf + dv_qp.rq.offset);
+	tmpl.hw.rq_db = dv_qp.rdb;
+	tmpl.hw.rq_ci = 0;
+	tmpl.mcq.buf = dv_cq.buf.buf;
+	tmpl.mcq.cqe_cnt = dv_cq.cqe_cnt;
+	tmpl.mcq.set_ci_db = dv_cq.set_ci_db;
+	tmpl.mcq.cqe_64 = (dv_cq.cqe_size & 64) ? 1 : 0;
 	/* Save port ID. */
 	tmpl.port_id = dev->data->port_id;
 	DEBUG("%p: RTE port ID: %u", (void *)rxq, tmpl.port_id);
+	ret = mlx4_rxq_alloc_elts(&tmpl, desc << tmpl.sge_n);
+	if (ret) {
+		ERROR("%p: RXQ allocation failed: %s",
+		      (void *)dev, strerror(rte_errno));
+		goto error;
+	}
 	/* Clean up rxq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old rxq just in case", (void *)rxq);
 	mlx4_rxq_cleanup(rxq);
 	*rxq = tmpl;
 	DEBUG("%p: rxq updated with %p", (void *)rxq, (void *)&tmpl);
+	/* Update doorbell counter. */
+	rxq->hw.rq_ci = desc;
+	rte_wmb();
+	*rxq->hw.rq_db = rte_cpu_to_be_32(rxq->hw.rq_ci);
 	return 0;
 error:
 	ret = rte_errno;
@@ -406,6 +440,12 @@
 	struct rxq *rxq = dev->data->rx_queues[idx];
 	int ret;
 
+	if (!rte_is_power_of_2(desc)) {
+		desc = 1 << log2above(desc);
+		WARN("%p: increased number of descriptors in RX queue %u"
+		     " to the next power of two (%d)",
+		     (void *)dev, idx, desc);
+	}
 	DEBUG("%p: configuring queue %u for %u descriptors",
 	      (void *)dev, idx, desc);
 	if (idx >= dev->data->nb_rx_queues) {
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 55c8e9a..e45bb3b 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -521,9 +521,45 @@
 }
 
 /**
- * DPDK callback for Rx.
+ * Poll one CQE from CQ.
  *
- * The following function doesn't manage scattered packets.
+ * @param rxq
+ *   Pointer to the receive queue structure.
+ * @param[out] out
+ *   Just polled CQE.
+ *
+ * @return
+ *   Number of bytes in the completed packet, 0 if there is no completion.
+ */
+static unsigned int
+mlx4_cq_poll_one(struct rxq *rxq,
+		 struct mlx4_cqe **out)
+{
+	int ret = 0;
+	struct mlx4_cqe *cqe = NULL;
+	struct mlx4_cq *cq = &rxq->mcq;
+
+	cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cq->cons_index);
+	if (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+	    !!(cq->cons_index & cq->cqe_cnt))
+		goto out;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rte_rmb();
+	assert(!(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK));
+	assert((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) !=
+	       MLX4_CQE_OPCODE_ERROR);
+	ret = rte_be_to_cpu_32(cqe->byte_cnt);
+	++cq->cons_index;
+out:
+	*out = cqe;
+	return ret;
+}
+
+/**
+ * DPDK callback for RX with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -538,112 +574,109 @@
 uint16_t
 mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
-	struct rxq *rxq = (struct rxq *)dpdk_rxq;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
-	const unsigned int elts_n = rxq->elts_n;
-	unsigned int elts_head = rxq->elts_head;
-	struct ibv_wc wcs[pkts_n];
-	struct ibv_recv_wr *wr_head = NULL;
-	struct ibv_recv_wr **wr_next = &wr_head;
-	struct ibv_recv_wr *wr_bad = NULL;
-	unsigned int i;
-	unsigned int pkts_ret = 0;
-	int ret;
+	struct rxq *rxq = dpdk_rxq;
+	const unsigned int wr_cnt = (1 << rxq->elts_n) - 1;
+	const unsigned int sge_n = rxq->sge_n;
+	struct rte_mbuf *pkt = NULL;
+	struct rte_mbuf *seg = NULL;
+	unsigned int i = 0;
+	unsigned int rq_ci = (rxq->hw.rq_ci << sge_n);
+	int len = 0;
 
-	ret = ibv_poll_cq(rxq->cq, pkts_n, wcs);
-	if (unlikely(ret == 0))
-		return 0;
-	if (unlikely(ret < 0)) {
-		DEBUG("rxq=%p, ibv_poll_cq() failed (wc_n=%d)",
-		      (void *)rxq, ret);
-		return 0;
-	}
-	assert(ret <= (int)pkts_n);
-	/* For each work completion. */
-	for (i = 0; i != (unsigned int)ret; ++i) {
-		struct ibv_wc *wc = &wcs[i];
-		struct rxq_elt *elt = &(*elts)[elts_head];
-		struct ibv_recv_wr *wr = &elt->wr;
-		uint32_t len = wc->byte_len;
-		struct rte_mbuf *seg = elt->buf;
-		struct rte_mbuf *rep;
+	while (pkts_n) {
+		struct mlx4_cqe *cqe;
+		unsigned int idx = rq_ci & wr_cnt;
+		struct rte_mbuf *rep = (*rxq->elts)[idx];
+		volatile struct mlx4_wqe_data_seg *scat =
+					&(*rxq->hw.wqes)[idx];
 
-		/* Sanity checks. */
-		assert(wr->sg_list == &elt->sge);
-		assert(wr->num_sge == 1);
-		assert(elts_head < rxq->elts_n);
-		assert(rxq->elts_head < rxq->elts_n);
-		/*
-		 * Fetch initial bytes of packet descriptor into a
-		 * cacheline while allocating rep.
-		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
-		/* Link completed WRs together for repost. */
-		*wr_next = wr;
-		wr_next = &wr->next;
-		if (unlikely(wc->status != IBV_WC_SUCCESS)) {
-			/* Whatever, just repost the offending WR. */
-			DEBUG("rxq=%p: bad work completion status (%d): %s",
-			      (void *)rxq, wc->status,
-			      ibv_wc_status_str(wc->status));
-			/* Increment dropped packets counter. */
-			++rxq->stats.idropped;
-			goto repost;
-		}
+		/* Update the 'next' pointer of the previous segment. */
+		if (pkt)
+			seg->next = rep;
+		seg = rep;
+		rte_prefetch0(seg);
+		rte_prefetch0(scat);
 		rep = rte_mbuf_raw_alloc(rxq->mp);
 		if (unlikely(rep == NULL)) {
-			/*
-			 * Unable to allocate a replacement mbuf,
-			 * repost WR.
-			 */
-			DEBUG("rxq=%p: can't allocate a new mbuf",
-			      (void *)rxq);
-			/* Increase out of memory counters. */
 			++rxq->stats.rx_nombuf;
-			++rxq->priv->dev->data->rx_mbuf_alloc_failed;
-			goto repost;
+			if (!pkt) {
+				/*
+				 * No buffers before we even started,
+				 * bail out silently.
+				 */
+				break;
+			}
+			while (pkt != seg) {
+				assert(pkt != (*rxq->elts)[idx]);
+				rep = pkt->next;
+				pkt->next = NULL;
+				pkt->nb_segs = 1;
+				rte_mbuf_raw_free(pkt);
+				pkt = rep;
+			}
+			break;
+		}
+		if (!pkt) {
+			/* Looking for the new packet */
+			len = mlx4_cq_poll_one(rxq, &cqe);
+			if (!len) {
+				rte_mbuf_raw_free(rep);
+				break;
+			}
+			if (unlikely(len < 0)) {
+				/* RX error, packet is likely too large. */
+				rte_mbuf_raw_free(rep);
+				++rxq->stats.idropped;
+				goto skip;
+			}
+			pkt = seg;
+			pkt->packet_type = 0;
+			pkt->ol_flags = 0;
+			pkt->pkt_len = len;
 		}
-		/* Reconfigure sge to use rep instead of seg. */
-		elt->sge.addr = (uintptr_t)rep->buf_addr + RTE_PKTMBUF_HEADROOM;
-		assert(elt->sge.lkey == rxq->mr->lkey);
-		elt->buf = rep;
-		/* Update seg information. */
-		seg->data_off = RTE_PKTMBUF_HEADROOM;
-		seg->nb_segs = 1;
-		seg->port = rxq->port_id;
-		seg->next = NULL;
-		seg->pkt_len = len;
+		rep->nb_segs = 1;
+		rep->port = rxq->port_id;
+		rep->data_len = seg->data_len;
+		rep->data_off = seg->data_off;
+		(*rxq->elts)[idx] = rep;
+		/*
+		 * Fill NIC descriptor with the new buffer. The lkey and size
+		 * of the buffers are already known, only the buffer address
+		 * changes.
+		 */
+		scat->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
+		if (len > seg->data_len) {
+			len -= seg->data_len;
+			++pkt->nb_segs;
+			++rq_ci;
+			continue;
+		}
+		/* The last segment. */
 		seg->data_len = len;
-		seg->packet_type = 0;
-		seg->ol_flags = 0;
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += pkt->pkt_len;
 		/* Return packet. */
-		*(pkts++) = seg;
-		++pkts_ret;
-		/* Increase bytes counter. */
-		rxq->stats.ibytes += len;
-repost:
-		if (++elts_head >= elts_n)
-			elts_head = 0;
-		continue;
+		*(pkts++) = pkt;
+		pkt = NULL;
+		--pkts_n;
+		++i;
+skip:
+		/* Align consumer index to the next stride. */
+		rq_ci >>= sge_n;
+		++rq_ci;
+		rq_ci <<= sge_n;
 	}
-	if (unlikely(i == 0))
+	if (unlikely((i == 0) && ((rq_ci >> sge_n) == rxq->hw.rq_ci)))
 		return 0;
-	/* Repost WRs. */
-	*wr_next = NULL;
-	assert(wr_head);
-	ret = ibv_post_recv(rxq->qp, wr_head, &wr_bad);
-	if (unlikely(ret)) {
-		/* Inability to repost WRs is fatal. */
-		DEBUG("%p: recv_burst(): failed (ret=%d)",
-		      (void *)rxq->priv,
-		      ret);
-		abort();
-	}
-	rxq->elts_head = elts_head;
-	/* Increase packets counter. */
-	rxq->stats.ipackets += pkts_ret;
-	return pkts_ret;
+	/* Update the consumer index. */
+	rxq->hw.rq_ci = rq_ci >> sge_n;
+	rte_wmb();
+	*rxq->hw.rq_db = rte_cpu_to_be_32(rxq->hw.rq_ci);
+	*rxq->mcq.set_ci_db =
+		rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+	return i;
 }
 
 /**
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index b515472..df83552 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -62,13 +62,6 @@ struct mlx4_rxq_stats {
 	uint64_t rx_nombuf; /**< Total of Rx mbuf allocation failures. */
 };
 
-/** Rx element. */
-struct rxq_elt {
-	struct ibv_recv_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
-	struct rte_mbuf *buf; /**< Buffer. */
-};
-
 /** Rx queue descriptor. */
 struct rxq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -78,9 +71,15 @@ struct rxq {
 	struct ibv_qp *qp; /**< Queue pair. */
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
 	unsigned int port_id; /**< Port ID for incoming packets. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	struct rxq_elt (*elts)[]; /**< Rx elements. */
+	unsigned int elts_n; /**< Log 2 of Mbufs. */
+	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
+	struct {
+	struct {
+		volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< RQ WQEs. */
+		volatile uint32_t *rq_db; /**< RQ doorbell record. */
+		uint16_t rq_ci; /**< RQ consumer index. */
+	} hw;
+	unsigned int sge_n; /**< Log 2 of SGEs number. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
diff --git a/drivers/net/mlx4/mlx4_utils.h b/drivers/net/mlx4/mlx4_utils.h
index 0fbdc71..d6f729f 100644
--- a/drivers/net/mlx4/mlx4_utils.h
+++ b/drivers/net/mlx4/mlx4_utils.h
@@ -108,4 +108,24 @@
 
 int mlx4_fd_set_non_blocking(int fd);
 
+/**
+ * Return logarithm of the nearest power of two above input value.
+ *
+ * @param v
+ *   Input value.
+ *
+ * @return
+ *   Logarithm of the nearest power of two above input value.
+ */
+static inline unsigned int
+log2above(unsigned int v)
+{
+	unsigned int l;
+	unsigned int r;
+
+	for (l = 0, r = 0; (v >> 1); ++l, v >>= 1)
+		r |= (v & 1);
+	return l + r;
+}
+
 #endif /* MLX4_UTILS_H_ */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 3/6] net/mlx4: support multi-segments Tx
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 1/6] net/mlx4: add simple Tx " Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 2/6] net/mlx4: get back Rx flow functionality Matan Azrad
@ 2017-10-03 10:48   ` Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 4/6] net/mlx4: get back Tx checksum offloads Matan Azrad
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for transmitting packets spanning multiple
buffers.
It also takes into account the number of entries a packet occupies
in the Tx queue when setting the chip's report-completion flag.
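
To make the TXBB accounting concrete, the sketch below shows how many
Tx basic blocks a packet consumes and therefore by how much the
completion countdown is decremented. The four-segment packet and the
16-byte control/data segment sizes (per the mlx4 PRM structures used in
this series) are assumptions for the example only:

/* Worked example of the TXBB accounting; the segment count is illustrative. */
int nb_segs = 4;
int wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +          /* 16 bytes */
		    nb_segs * sizeof(struct mlx4_wqe_data_seg); /* + 64 bytes */
int nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size); /* 80 bytes -> 2 TXBBs */

/*
 * txq->elts_comp_cd is decremented by 2 for this packet; a completion
 * is requested when it drops to zero or below, i.e. based on ring
 * occupancy rather than on a fixed packet count.
 */

This is also why elts_comp_cd becomes a signed countdown in this patch:
a single multi-segment post may push it below zero.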

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 208 ++++++++++++++++++++++++-------------------
 drivers/net/mlx4/mlx4_rxtx.h |   6 +-
 drivers/net/mlx4/mlx4_txq.c  |  12 ++-
 3 files changed, 129 insertions(+), 97 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index e45bb3b..4200716 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -63,6 +63,16 @@
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
+/*
+ * Pointer-value pair structure
+ * used in mlx4_post_send for saving the first DWORD (32 bits)
+ * of a TXBB.
+ */
+struct pv {
+	struct mlx4_wqe_data_seg *dseg;
+	uint32_t val;
+};
+
 /**
  * Stamp a WQE so it won't be reused by the HW.
  * Routine is used when freeing WQE used by the chip or when failing
@@ -296,34 +306,38 @@
  *
  * @param txq
  *   The Tx queue to post to.
- * @param wr
- *   The work request to handle.
- * @param bad_wr
- *   The wr in case that posting had failed.
+ * @param pkt
+ *   The packet to transmit.
  *
  * @return
  *   0 - success, negative errno value otherwise and rte_errno is set.
  */
 static inline int
 mlx4_post_send(struct txq *txq,
-	       struct rte_mbuf *pkt,
-	       uint32_t send_flags)
+	       struct rte_mbuf *pkt)
 {
 	struct mlx4_wqe_ctrl_seg *ctrl;
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
+	struct rte_mbuf *buf;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
+	uint32_t srcrb_flags;
+	uint32_t owner_opcode = MLX4_OPCODE_SEND;
+	uint32_t byte_count;
 	int wqe_real_size;
 	int nr_txbbs;
 	int rc;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
+	int pv_counter = 0;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
 			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
 	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
-	/* Check that there is room for this WQE in the send queue and
+	/*
+	 * Check that there is room for this WQE in the send queue and
 	 * that the WQE size is legal.
 	 */
 	if (likely(((sq->head - sq->tail) + nr_txbbs +
@@ -332,76 +346,108 @@
 		rc = ENOSPC;
 		goto err;
 	}
-	/* Get the control and single-data entries of the WQE */
+	/* Get the control and data entries of the WQE. */
 	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
 	dseg = (struct mlx4_wqe_data_seg *)(((char *)ctrl) +
 		sizeof(struct mlx4_wqe_ctrl_seg));
-	/*
-	 * Fill the data segment with buffer information.
-	 */
-	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
-	rte_prefetch0((volatile void *)addr);
-	dseg->addr = rte_cpu_to_be_64(addr);
-	/* Memory region key for this memory pool. */
-	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
-	if (unlikely(lkey == (uint32_t)-1)) {
-		/* MR does not exist. */
-		DEBUG("%p: unable to get MP <-> MR"
-		      " association", (void *)txq);
-		/*
-		 * Restamp entry in case of failure.
-		 * Make sure that size is written correctly.
-		 * Note that we give ownership to the SW, not the HW.
+	/* Fill the data segments with buffer information. */
+	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
+		addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		rte_prefetch0((volatile void *)addr);
+		/* Handle WQE wraparound. */
+		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
+			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+		dseg->addr = rte_cpu_to_be_64(addr);
+		/* Memory region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR"
+			      " association", (void *)txq);
+			/*
+			 * Restamp entry in case of failure.
+			 * Make sure that size is written correctly
+			 * Note that we give ownership to the SW, not the HW.
+			 */
+			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+			mlx4_txq_stamp_freed_wqe(sq, head_idx,
+				     (sq->head & sq->txbb_cnt) ? 0 : 1);
+			rc = EFAULT;
+			goto err;
+		}
+		dseg->lkey = rte_cpu_to_be_32(lkey);
+		if (likely(buf->data_len))
+			byte_count = rte_cpu_to_be_32(buf->data_len);
+		else
+			/*
+			 * Zero length segment is treated as inline segment
+			 * with zero data.
+			 */
+			byte_count = RTE_BE32(0x80000000);
+		/* If the data segment is not at the beginning of a
+		 * Tx basic block (TXBB), then write the byte count,
+		 * else postpone the writing to just before updating the
+		 * control segment.
 		 */
-		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-		mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					 (sq->head & sq->txbb_cnt) ? 0 : 1);
-		rc = EFAULT;
-		goto err;
+		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			/*
+			 * Need a barrier here before writing the byte_count
+			 * fields to make sure that all the data is visible
+			 * before the byte_count field is set.
+			 * Otherwise, if the segment begins a new cacheline,
+			 * the HCA prefetcher could grab the 64-byte chunk and
+			 * get a valid (!= * 0xffffffff) byte count but stale
+			 * data, and end up sending the wrong data.
+			 */
+			rte_io_wmb();
+			dseg->byte_count = byte_count;
+		} else {
+			/*
+			 * This data segment starts at the beginning of a new
+			 * TXBB, so we need to postpone its byte_count writing
+			 * for later.
+			 */
+			pv[pv_counter].dseg = dseg;
+			pv[pv_counter++].val = byte_count;
+		}
 	}
-	dseg->lkey = rte_cpu_to_be_32(lkey);
-	/*
-	 * Need a barrier here before writing the byte_count field to
-	 * make sure that all the data is visible before the
-	 * byte_count field is set.  Otherwise, if the segment begins
-	 * a new cacheline, the HCA prefetcher could grab the 64-byte
-	 * chunk and get a valid (!= * 0xffffffff) byte count but
-	 * stale data, and end up sending the wrong data.
-	 */
-	rte_io_wmb();
-	if (likely(pkt->data_len))
-		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
-	else
-		/*
-		 * Zero length segment is treated as inline segment
-		 * with zero data.
-		 */
-		dseg->byte_count = RTE_BE32(0x80000000);
-	/*
-	 * Fill the control parameters for this packet.
-	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
-	 * should be calculated
-	 */
-	ctrl->srcrb_flags =
-		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
-				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	/* Write the first DWORD of each TXBB saved earlier. */
+	if (pv_counter) {
+		/* Need a barrier here before writing the byte_count. */
+		rte_io_wmb();
+		for (--pv_counter; pv_counter  >= 0; pv_counter--)
+			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
+	}
+	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
 	 * The caller should prepare "imm" in advance in order to support
 	 * VF to VF communication (when the device is a virtual-function
 	 * device (VF)).
-	 */
+	*/
 	ctrl->imm = 0;
 	/*
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
+	 * should be calculated.
+	 */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+				       MLX4_WQE_CTRL_CQ_UPDATE);
+	} else {
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+	}
+	ctrl->srcrb_flags = srcrb_flags;
+	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
 	 * executing as soon as we do).
 	 */
-	rte_wmb();
-	ctrl->owner_opcode =
-		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
-				 ((sq->head & sq->txbb_cnt) ?
-				  MLX4_BIT_WQE_OWN : 0));
+	 rte_wmb();
+	 ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					       ((sq->head & sq->txbb_cnt) ?
+					       MLX4_BIT_WQE_OWN : 0));
 	sq->head += nr_txbbs;
 	return 0;
 err:
@@ -428,14 +474,13 @@
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -454,8 +499,6 @@
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		unsigned int segs = buf->nb_segs;
-		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -473,34 +516,16 @@
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
-			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
-		}
-		if (likely(segs == 1)) {
-			/* Update element. */
-			elt->buf = buf;
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			/* post the pkt for sending */
-			err = mlx4_post_send(txq, buf, send_flags);
-			if (unlikely(err)) {
-				if (unlikely(send_flags &
-					     MLX4_WQE_CTRL_CQ_UPDATE)) {
-					elts_comp_cd = 1;
-					--elts_comp;
-				}
-				elt->buf = NULL;
-				goto stop;
-			}
-			elt->buf = buf;
-			bytes_sent += buf->pkt_len;
-		} else {
-			err = -EINVAL;
-			rte_errno = -err;
+		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+		/* post the packet for sending. */
+		err = mlx4_post_send(txq, buf);
+		if (unlikely(err)) {
+			elt->buf = NULL;
 			goto stop;
 		}
+		elt->buf = buf;
+		bytes_sent += buf->pkt_len;
+		++elts_comp;
 		elts_head = elts_head_next;
 	}
 stop:
@@ -516,7 +541,6 @@
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index df83552..1b90533 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -103,13 +103,15 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
+	unsigned int elts_comp; /**< Number of pkts waiting for completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	char *bounce_buf;
+	/**< memory used for storing the first DWORD of data TXBBs. */
 	struct {
 		const struct rte_mempool *mp; /**< Cached memory pool. */
 		struct ibv_mr *mr; /**< Memory region (for mp). */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 492779f..9333311 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -83,8 +83,14 @@
 		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->ctrl.socket);
 	int ret = 0;
 
-	if (elts == NULL) {
-		ERROR("%p: can't allocate packets array", (void *)txq);
+	/* Allocate Bounce-buf memory */
+	txq->bounce_buf = (char *)rte_zmalloc_socket("TXQ",
+						     MLX4_MAX_WQE_SIZE,
+						     RTE_CACHE_LINE_MIN_SIZE,
+						     txq->ctrl.socket);
+
+	if ((elts == NULL) || (txq->bounce_buf == NULL)) {
+		ERROR("%p: can't allocate TXQ memory", (void *)txq);
 		ret = ENOMEM;
 		goto error;
 	}
@@ -110,6 +116,7 @@
 	assert(ret == 0);
 	return 0;
 error:
+	rte_free(txq->bounce_buf);
 	rte_free(elts);
 	DEBUG("%p: failed, freed everything", (void *)txq);
 	assert(ret > 0);
@@ -303,7 +310,6 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4dv_obj mlxdv;
 	struct mlx4dv_qp dv_qp;
 	struct mlx4dv_cq dv_cq;
-
 	struct txq tmpl = {
 		.ctrl = {
 			.priv = priv,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 4/6] net/mlx4: get back Tx checksum offloads
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
                     ` (2 preceding siblings ...)
  2017-10-03 10:48   ` [PATCH v2 3/6] net/mlx4: support multi-segments Tx Matan Azrad
@ 2017-10-03 10:48   ` Matan Azrad
  2017-10-03 10:48   ` [PATCH v2 5/6] net/mlx4: get back Rx " Matan Azrad
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPV4, UDP and TCP
checksum calculation.
This commit also includes support for offloading IPV4, UDP and TCP
tunnel checksum calculation to the hardware.
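
For reference, an application requests these offloads per packet through the
mbuf ol_flags field before calling the Tx burst function. The snippet below is
an illustrative, application-side sketch against the mbuf API of this DPDK
era; it is not part of this patch and the helper name is made up:

/* Request IPv4 and TCP checksum offload for one outgoing mbuf. */
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

static void
request_tx_csum(struct rte_mbuf *m)
{
	/* The generic API expects the header lengths to be filled in. */
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	/*
	 * Depending on the PMD, the L4 checksum field may also have to be
	 * primed with the pseudo-header checksum (rte_ipv4_phdr_cksum()).
	 */
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
}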

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4.c        |  9 +++++++++
 drivers/net/mlx4/mlx4.h        |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h    |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 25 +++++++++++++++++++++----
 drivers/net/mlx4/mlx4_rxtx.h   |  2 ++
 drivers/net/mlx4/mlx4_txq.c    |  2 ++
 7 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index b084903..a0e76ee 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -397,6 +397,7 @@ struct mlx4_conf {
 		.ports.present = 0,
 	};
 	unsigned int vf;
+	unsigned int tunnel_en;
 	int i;
 
 	(void)pci_drv;
@@ -456,6 +457,9 @@ struct mlx4_conf {
 		rte_errno = ENODEV;
 		goto error;
 	}
+	/* Only cx3-pro supports L3 tunneling */
+	tunnel_en = (device_attr.vendor_part_id ==
+		     PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
 	INFO("%u port(s) detected", device_attr.phys_port_cnt);
 	conf.ports.present |= (UINT64_C(1) << device_attr.phys_port_cnt) - 1;
 	if (mlx4_args(pci_dev->device.devargs, &conf)) {
@@ -529,6 +533,11 @@ struct mlx4_conf {
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->hw_csum =
+		     !!(device_attr.device_cap_flags & IBV_DEVICE_RAW_IP_CSUM);
+		priv->hw_csum_l2tun = tunnel_en;
+		DEBUG("L2 tunnel checksum offloads are %ssupported",
+		      (priv->hw_csum_l2tun ? "" : "not "));
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index b6e1ef2..d0bce91 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -93,6 +93,8 @@ struct priv {
 	unsigned int vf:1; /* This is a VF device. */
 	unsigned int intr_alarm:1; /* An interrupt alarm is scheduled. */
 	unsigned int isolated:1; /* Toggle isolated mode. */
+	unsigned int hw_csum:1; /* Checksum offload is supported. */
+	unsigned int hw_csum_l2tun:1; /* Checksum support for L2 tunnels. */
 	struct rte_intr_handle intr_handle; /* Port interrupt handle. */
 	struct rte_flow_drop *flow_drop_queue; /* Flow drop queue. */
 	LIST_HEAD(mlx4_flows, rte_flow) flows;
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a9e8059..95cc6e4 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,6 +553,12 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM  |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->hw_csum_l2tun)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->speed_capa =
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 6d1800a..57f5a46 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -64,6 +64,8 @@
 
 /* Work queue element (WQE) flags. */
 #define MLX4_BIT_WQE_OWN 0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define MLX4_SIZE_TO_TXBBS(size) \
 		(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 4200716..2757aec 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -433,12 +433,29 @@ struct pv {
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
-				       MLX4_WQE_CTRL_CQ_UPDATE);
+		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
 	} else {
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
 	}
-	ctrl->srcrb_flags = srcrb_flags;
+	/* Enable HW checksum offload if requested */
+	if (txq->csum &&
+	    (pkt->ol_flags &
+	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = pkt->ol_flags &
+					     (PKT_TX_TUNNEL_GRE |
+					      PKT_TX_TUNNEL_VXLAN);
+
+		if (is_tunneled && txq->csum_l2tun) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
+		} else {
+			srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
+				      MLX4_WQE_CTRL_TCP_UDP_CSUM;
+		}
+	}
+	ctrl->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 1b90533..dc283e1 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -110,6 +110,8 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint32_t csum:1; /**< Checksum is supported and enabled */
+	uint32_t csum_l2tun:1; /**< L2 tun Checksum is supported and enabled */
 	char *bounce_buf;
 	/**< memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 9333311..2d776eb 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -340,6 +340,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	tmpl.csum = priv->hw_csum;
+	tmpl.csum_l2tun = priv->hw_csum_l2tun;
 	DEBUG("priv->device_attr.max_qp_wr is %d",
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 5/6] net/mlx4: get back Rx checksum offloads
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
                     ` (3 preceding siblings ...)
  2017-10-03 10:48   ` [PATCH v2 4/6] net/mlx4: get back Tx checksum offloads Matan Azrad
@ 2017-10-03 10:48   ` Matan Azrad
  2017-10-03 22:26     ` Ferruh Yigit
  2017-10-03 10:48   ` [PATCH v2 6/6] net/mlx4: add loopback Tx from VF Matan Azrad
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPV4, UDP and TCP
checksum verification.
This commit also includes support for offloading IPV4, UDP and TCP tunnel
checksum verification to the hardware.
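
On the receive side, the result of this verification is reported through the
ol_flags of each returned mbuf, provided rxmode.hw_ip_checksum was enabled at
configuration time (the knob checked in mlx4_rxq_setup() below). A minimal,
application-side sketch (illustrative only, helper name made up):

#include <stdbool.h>
#include <rte_mbuf.h>

/* Return true when hardware validated both the IP and L4 checksums. */
static bool
rx_csum_ok(const struct rte_mbuf *m)
{
	return (m->ol_flags & PKT_RX_IP_CKSUM_GOOD) &&
	       (m->ol_flags & PKT_RX_L4_CKSUM_GOOD);
}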

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
---
 drivers/net/mlx4/mlx4.c        |   2 +
 drivers/net/mlx4/mlx4_ethdev.c |   8 ++-
 drivers/net/mlx4/mlx4_prm.h    |  19 +++++++
 drivers/net/mlx4/mlx4_rxq.c    |   5 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 120 ++++++++++++++++++++++++++++++++++++++++-
 drivers/net/mlx4/mlx4_rxtx.h   |   2 +
 6 files changed, 152 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index a0e76ee..865ffdd 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -535,6 +535,8 @@ struct mlx4_conf {
 		priv->vf = vf;
 		priv->hw_csum =
 		     !!(device_attr.device_cap_flags & IBV_DEVICE_RAW_IP_CSUM);
+		DEBUG("checksum offloading is %ssupported",
+		      (priv->hw_csum ? "" : "not "));
 		priv->hw_csum_l2tun = tunnel_en;
 		DEBUG("L2 tunnel checksum offloads are %ssupported",
 		      (priv->hw_csum_l2tun ? "" : "not "));
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index 95cc6e4..6dbf273 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,10 +553,14 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
-	if (priv->hw_csum)
+	if (priv->hw_csum) {
 		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
-					  DEV_TX_OFFLOAD_UDP_CKSUM  |
+					  DEV_TX_OFFLOAD_UDP_CKSUM |
 					  DEV_TX_OFFLOAD_TCP_CKSUM);
+		info->rx_offload_capa |= (DEV_RX_OFFLOAD_IPV4_CKSUM |
+					  DEV_RX_OFFLOAD_UDP_CKSUM |
+					  DEV_RX_OFFLOAD_TCP_CKSUM);
+	}
 	if (priv->hw_csum_l2tun)
 		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 57f5a46..b4c14b9 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -70,6 +70,25 @@
 #define MLX4_SIZE_TO_TXBBS(size) \
 		(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
 
+/* Generic macro to convert MLX4 to IBV flags. */
+#define MLX4_TRANSPOSE(val, from, to) \
+		(__extension__({ \
+			typeof(val) _val = (val); \
+			typeof(from) _from = (from); \
+			typeof(to) _to = (to); \
+			(((_from) >= (_to)) ? \
+			(((_val) & (_from)) / ((_from) / (_to))) : \
+			(((_val) & (_from)) * ((_to) / (_from)))); \
+		}))
+
+/* CQE checksum flags */
+enum {
+	MLX4_CQE_L2_TUNNEL_IPV4    = (int)(1U << 25),
+	MLX4_CQE_L2_TUNNEL_L4_CSUM = (int)(1U << 26),
+	MLX4_CQE_L2_TUNNEL         = (int)(1U << 27),
+	MLX4_CQE_L2_TUNNEL_IPOK    = (int)(1U << 31),
+};
+
 /* Send queue information. */
 struct mlx4_sq {
 	char *buf; /**< SQ buffer. */
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index d7447c4..053318f 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -267,6 +267,11 @@
 	int ret;
 
 	(void)conf; /* Thresholds configuration (ignored). */
+	/* Toggle Rx checksum offload if hardware supports it. */
+	if (priv->hw_csum)
+		tmpl.csum = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
+	if (priv->hw_csum_l2tun)
+		tmpl.csum_l2tun = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
 	mb_len = rte_pktmbuf_data_room_size(mp);
 	/* Enable scattered packets support for this queue if necessary. */
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 2757aec..1e91aaf 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -562,6 +562,110 @@ struct pv {
 }
 
 /**
+ * Translate Rx completion flags to packet type.
+ *
+ * @param flags
+ *   Rx completion flags returned by poll_length_flags().
+ *
+ * @return
+ *   Packet type for struct rte_mbuf.
+ */
+static inline uint32_t
+rxq_cq_to_pkt_type(uint32_t flags)
+{
+	uint32_t pkt_type;
+
+	if (flags & MLX4_CQE_L2_TUNNEL)
+		pkt_type =
+			MLX4_TRANSPOSE(flags,
+			       (uint32_t)MLX4_CQE_L2_TUNNEL_IPV4,
+			       (uint32_t)RTE_PTYPE_L3_IPV4_EXT_UNKNOWN) |
+			MLX4_TRANSPOSE(flags,
+			       (uint32_t)MLX4_CQE_STATUS_IPV4_PKT,
+			       (uint32_t)RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN);
+	else
+		pkt_type =
+			MLX4_TRANSPOSE(flags,
+			       (uint32_t)MLX4_CQE_STATUS_IPV4_PKT,
+			       (uint32_t)RTE_PTYPE_L3_IPV4_EXT_UNKNOWN);
+	ERROR("pkt_type 0x%x", pkt_type); //
+	return pkt_type;
+}
+
+/**
+ * Translate Rx completion flags to offload flags.
+ *
+ * @param  flags
+ *   Rx completion flags returned by poll_length_flags().
+ * @param csum
+ *   Rx checksum enable flag
+ * @param csum_l2tun
+ *   Rx L2-tun checksum enable flag
+ *
+ * @return
+ *   Offload flags (ol_flags) for struct rte_mbuf.
+ */
+static inline uint32_t
+rxq_cq_to_ol_flags(uint32_t flags, unsigned int csum, unsigned int csum_l2tun)
+{
+	uint32_t ol_flags = 0;
+
+	if (csum)
+		ol_flags |=
+			MLX4_TRANSPOSE(flags,
+				(uint64_t)MLX4_CQE_STATUS_IP_HDR_CSUM_OK,
+				PKT_RX_IP_CKSUM_GOOD) |
+			MLX4_TRANSPOSE(flags,
+				(uint64_t)MLX4_CQE_STATUS_TCP_UDP_CSUM_OK,
+				PKT_RX_L4_CKSUM_GOOD);
+	if ((flags & MLX4_CQE_L2_TUNNEL) && csum_l2tun)
+		ol_flags |=
+			MLX4_TRANSPOSE(flags,
+				       (uint64_t)MLX4_CQE_L2_TUNNEL_IPOK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			MLX4_TRANSPOSE(flags,
+				       (uint64_t)MLX4_CQE_L2_TUNNEL_L4_CSUM,
+				       PKT_RX_L4_CKSUM_GOOD);
+	return ol_flags;
+}
+
+/**
+ * Get Rx checksum CQE flags.
+ *
+ * @param cqe
+ *   Pointer to cqe structure.
+ * @param csum
+ *   Rx checksum enable flag
+ * @param csum_l2tun
+ *   RX L2-tun checksum enable flag
+ *
+ * @return
+ *   CQE's flags.
+ */
+static inline uint32_t
+mlx4_cqe_flags(struct mlx4_cqe *cqe,
+	       int csum, unsigned int csum_l2tun)
+{
+	uint32_t flags = 0;
+
+	/*
+	 * The relevant bits are in different locations on their
+	 * CQE fields therefore we can join them in one 32bit
+	 * variable.
+	 */
+	if (csum)
+		flags = rte_be_to_cpu_32(cqe->status) &
+			MLX4_CQE_STATUS_IPV4_CSUM_OK;
+	if (csum_l2tun)
+		flags |= rte_be_to_cpu_32(cqe->vlan_my_qpn) &
+			 (MLX4_CQE_L2_TUNNEL |
+			  MLX4_CQE_L2_TUNNEL_IPOK |
+			  MLX4_CQE_L2_TUNNEL_L4_CSUM |
+			  MLX4_CQE_L2_TUNNEL_IPV4);
+		return flags;
+}
+
+/**
  * Poll one CQE from CQ.
  *
  * @param rxq
@@ -600,7 +704,7 @@ struct pv {
 }
 
 /**
- * DPDK callback for RX with scattered packets support.
+ * DPDK callback for Rx with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -665,7 +769,7 @@ struct pv {
 				break;
 			}
 			if (unlikely(len < 0)) {
-				/* RX error, packet is likely too large. */
+				/* Rx error, packet is likely too large. */
 				rte_mbuf_raw_free(rep);
 				++rxq->stats.idropped;
 				goto skip;
@@ -673,6 +777,18 @@ struct pv {
 			pkt = seg;
 			pkt->packet_type = 0;
 			pkt->ol_flags = 0;
+			if (rxq->csum | rxq->csum_l2tun) {
+				uint32_t flags =
+					mlx4_cqe_flags(cqe, rxq->csum,
+						       rxq->csum_l2tun);
+				pkt->ol_flags =
+					rxq_cq_to_ol_flags(flags, rxq->csum,
+							   rxq->csum_l2tun);
+				pkt->packet_type = rxq_cq_to_pkt_type(flags);
+			} else {
+				pkt->packet_type = 0;
+				pkt->ol_flags = 0;
+			}
 			pkt->pkt_len = len;
 		}
 		rep->nb_segs = 1;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index dc283e1..75c98c1 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -80,6 +80,8 @@ struct rxq {
 	} hw;
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	unsigned int sge_n; /**< Log 2 of SGEs number. */
+	unsigned int csum:1; /**< Enable checksum offloading. */
+	unsigned int csum_l2tun:1; /**< Enable checksum for L2 tunnels. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 6/6] net/mlx4: add loopback Tx from VF
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
                     ` (4 preceding siblings ...)
  2017-10-03 10:48   ` [PATCH v2 5/6] net/mlx4: get back Rx " Matan Azrad
@ 2017-10-03 10:48   ` Matan Azrad
  2017-10-03 22:27   ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Ferruh Yigit
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 61+ messages in thread
From: Matan Azrad @ 2017-10-03 10:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds the loopback functionality needed when the device is a VF,
in order to enable packet transmission between VFs as well as between VFs and
the PF.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 38 ++++++++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |  2 ++
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 1e91aaf..85fb6d7 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -320,10 +320,13 @@ struct pv {
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct rte_mbuf *buf;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
-	uint32_t srcrb_flags;
 	uint32_t owner_opcode = MLX4_OPCODE_SEND;
 	uint32_t byte_count;
 	int wqe_real_size;
@@ -421,21 +424,15 @@ struct pv {
 	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
-	 * The caller should prepare "imm" in advance in order to support
-	 * VF to VF communication (when the device is a virtual-function
-	 * device (VF)).
-	*/
-	ctrl->imm = 0;
-	/*
 	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
 	 * should be calculated.
 	 */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
+		srcrb.flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
 	} else {
-		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
+		srcrb.flags = MLX4_WQE_CTRL_SOLICIT;
 	}
 	/* Enable HW checksum offload if requested */
 	if (txq->csum &&
@@ -449,13 +446,30 @@ struct pv {
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
+				srcrb.flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
 		} else {
-			srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
 				      MLX4_WQE_CTRL_TCP_UDP_CSUM;
 		}
 	}
-	ctrl->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
+	/*
+	 * Convert flags to big-endian before the destination MAC address
+	 * is (optionally) merged into them for VF loopback below.
+	 */
+	srcrb.flags = rte_cpu_to_be_32(srcrb.flags);
+	if (txq->lb) {
+		/*
+		 * Copy destination mac address to the wqe,
+		 * this allows loopback in eSwitch, so that VFs and PF
+		 * can communicate with each other.
+		 */
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						      sizeof(uint16_t)));
+	} else {
+		ctrl->imm = 0;
+	}
+	ctrl->srcrb_flags = srcrb.flags;
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 75c98c1..6f33d1c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -114,6 +114,8 @@ struct txq {
 	uint32_t max_inline; /**< Max inline send size. */
 	uint32_t csum:1; /**< Checksum is supported and enabled */
 	uint32_t csum_l2tun:1; /**< L2 tun Checksum is supported and enabled */
+	uint32_t lb:1;
+	/**< Whether pkts should be looped-back by eswitch or not */
 	char *bounce_buf;
 	/**< memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 2d776eb..fd1dce0 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -415,6 +415,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* If a VF device - need to loopback xmitted packets */
+	tmpl.lb = !!(priv->vf);
 	/* Clean up txq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old txq just in case", (void *)txq);
 	mlx4_txq_cleanup(txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 5/6] net/mlx4: get back Rx checksum offloads
  2017-10-03 10:48   ` [PATCH v2 5/6] net/mlx4: get back Rx " Matan Azrad
@ 2017-10-03 22:26     ` Ferruh Yigit
  0 siblings, 0 replies; 61+ messages in thread
From: Ferruh Yigit @ 2017-10-03 22:26 UTC (permalink / raw)
  To: Matan Azrad, Adrien Mazarguil; +Cc: dev, Moti Haimovsky, Vasily Philipov

On 10/3/2017 11:48 AM, Matan Azrad wrote:
> From: Moti Haimovsky <motih@mellanox.com>
> 
> This patch adds hardware offloading support for IPV4, UDP and TCP
> checksum verification.
> This commit also includes support for offloading IPV4, UDP and TCP tunnel
> checksum verification to the hardware.
> 
> Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>

<...>

> +/**
> + * Get Rx checksum CQE flags.
> + *
> + * @param cqe
> + *   Pointer to cqe structure.
> + * @param csum
> + *   Rx checksum enable flag
> + * @param csum_l2tun
> + *   RX L2-tun checksum enable flag
> + *
> + * @return
> + *   CQE's flags.
> + */
> +static inline uint32_t
> +mlx4_cqe_flags(struct mlx4_cqe *cqe,
> +	       int csum, unsigned int csum_l2tun)
> +{
> +	uint32_t flags = 0;
> +
> +	/*
> +	 * The relevant bits are in different locations on their
> +	 * CQE fields therefore we can join them in one 32bit
> +	 * variable.
> +	 */
> +	if (csum)
> +		flags = rte_be_to_cpu_32(cqe->status) &
> +			MLX4_CQE_STATUS_IPV4_CSUM_OK;
> +	if (csum_l2tun)
> +		flags |= rte_be_to_cpu_32(cqe->vlan_my_qpn) &
> +			 (MLX4_CQE_L2_TUNNEL |
> +			  MLX4_CQE_L2_TUNNEL_IPOK |
> +			  MLX4_CQE_L2_TUNNEL_L4_CSUM |
> +			  MLX4_CQE_L2_TUNNEL_IPV4);
> +		return flags;

Wrong indentation is triggering a compiler warning here:

.../dpdk/drivers/net/mlx4/mlx4_rxtx.c: In function ‘mlx4_cqe_flags’:
.../dpdk/drivers/net/mlx4/mlx4_rxtx.c:673:2: error: this ‘if’ clause
does not guard... [-Werror=misleading-indentation]
  if (csum_l2tun)
  ^~
.../dpdk/drivers/net/mlx4/mlx4_rxtx.c:679:3: note: ...this statement,
but the latter is misleadingly indented as if it were guarded by the ‘if’
   return flags;
   ^~~~~~

<...>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
                     ` (5 preceding siblings ...)
  2017-10-03 10:48   ` [PATCH v2 6/6] net/mlx4: add loopback Tx from VF Matan Azrad
@ 2017-10-03 22:27   ` Ferruh Yigit
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
  8 siblings, 0 replies; 61+ messages in thread
From: Ferruh Yigit @ 2017-10-03 22:27 UTC (permalink / raw)
  To: Matan Azrad, Adrien Mazarguil; +Cc: dev, Moti Haimovsky

On 10/3/2017 11:48 AM, Matan Azrad wrote:
> v2:
> Rearrange patches.
> Semantics.
> Enhancements.
> Fix compilation issues. 
> 
> Moti Haimovsky (6):
>   net/mlx4: add simple Tx bypassing ibverbs
>   net/mlx4: get back Rx flow functionality
>   net/mlx4: support multi-segments Tx
>   net/mlx4: get back Tx checksum offloads
>   net/mlx4: get back Rx checksum offloads
>   net/mlx4: add loopback Tx from VF

While sending the new version, can you please rebase on the latest next-net?
It currently causes a merge conflict, although it is easy to resolve.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v3 0/6] new mlx4 datapath bypassing ibverbs
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
                     ` (6 preceding siblings ...)
  2017-10-03 22:27   ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Ferruh Yigit
@ 2017-10-04 18:48   ` Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 1/6] net/mlx4: add simple Tx bypassing Verbs Adrien Mazarguil
                       ` (6 more replies)
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
  8 siblings, 7 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad

Took me a while to finally review this series. Since there is not much time
left, I'm taking care of v3 with several minor changes summarized below and
my ack included directly.

v3 (Adrien):
- Drop a few unrelated or unnecessary changes such as the removal of
  MLX4_PMD_TX_MP_CACHE.
- Move device checksum support detection code to its previous location.
- Fix include guard in mlx4_prm.h.
- Reorder #includes alphabetically.
- Replace MLX4_TRANSPOSE() macro with documented inline function (see the
  sketch after this list).
- Remove extra spaces and blank lines.
- Use uint8_t * instead of char * for buffers.
- Replace mlx4_get_cqe() macro with a documented inline function.
- Replace several unsigned int with uint32_t.
- Add consistency to field names (sge_n => sges_n).
- Make mbuf size checks in RX queue setup function similar to mlx5.
- Update various comments.
- Fix indentation.
- Replace run-time endian conversion with static ones where possible.
- Reorder fields in struct rxq and struct txq for consistency, remove
  one level of unnecessary inner structures.
- Fix memory leak on Tx bounce buffer.
- Update commit logs.
- Fix remaining checkpatch warnings.
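
As an illustration of the MLX4_TRANSPOSE() change mentioned above, an inline
equivalent of the macro could look like the sketch below (derived from the v2
macro earlier in this thread; the actual v3 helper may differ in name, type
and documentation):

/*
 * Move the bits selected by "from" into the position selected by "to" by
 * scaling with the ratio of the two masks (each mask is assumed to cover a
 * single contiguous bit field).
 */
static inline uint64_t
mlx4_transpose(uint64_t val, uint64_t from, uint64_t to)
{
	return (from >= to) ?
		(val & from) / (from / to) :
		(val & from) * (to / from);
}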

v2 (Matan):
Rearrange patches.
Semantics.
Enhancements.
Fix compilation issues.

Moti Haimovsky (6):
  net/mlx4: add simple Tx bypassing Verbs
  net/mlx4: restore full Rx support bypassing Verbs
  net/mlx4: restore Tx gather support
  net/mlx4: restore Tx checksum offloads
  net/mlx4: restore Rx offloads
  net/mlx4: add loopback Tx from VF

 drivers/net/mlx4/mlx4.c        |  11 +
 drivers/net/mlx4/mlx4.h        |   2 +
 drivers/net/mlx4/mlx4_ethdev.c |  10 +
 drivers/net/mlx4/mlx4_prm.h    | 152 +++++++
 drivers/net/mlx4/mlx4_rxq.c    | 179 ++++++---
 drivers/net/mlx4/mlx4_rxtx.c   | 768 ++++++++++++++++++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.h   |  54 +--
 drivers/net/mlx4/mlx4_txq.c    |  67 +++-
 drivers/net/mlx4/mlx4_utils.h  |  20 +
 mk/rte.app.mk                  |   2 +-
 10 files changed, 975 insertions(+), 290 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
2.1.4

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v3 1/6] net/mlx4: add simple Tx bypassing Verbs
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
@ 2017-10-04 18:48     ` Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 2/6] net/mlx4: restore full Rx support " Adrien Mazarguil
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad

From: Moti Haimovsky <motih@mellanox.com>

Modify PMD to send single-buffer packets directly to the device bypassing
the Verbs Tx post and poll routines.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 120 ++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 337 ++++++++++++++++++++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.h |  28 ++--
 drivers/net/mlx4/mlx4_txq.c  |  51 ++++++
 mk/rte.app.mk                |   2 +-
 5 files changed, 436 insertions(+), 102 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..085a595
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,120 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef MLX4_PRM_H_
+#define MLX4_PRM_H_
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_byteorder.h>
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/mlx4dv.h>
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* ConnectX-3 Tx queue basic block. */
+#define MLX4_TXBB_SHIFT 6
+#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes. */
+#define MLX4_MAX_WQE_SIZE 512
+#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+
+/* Send queue stamping/invalidating information. */
+#define MLX4_SQ_STAMP_STRIDE 64
+#define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
+#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_STAMP_VAL 0x7fffffff
+
+/* Work queue element (WQE) flags. */
+#define MLX4_BIT_WQE_OWN 0x80000000
+
+#define MLX4_SIZE_TO_TXBBS(size) \
+	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
+
+/* Send queue information. */
+struct mlx4_sq {
+	uint8_t *buf; /**< SQ buffer. */
+	uint8_t *eob; /**< End of SQ buffer */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
+	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db; /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
+};
+
+#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
+
+/* Completion queue information. */
+struct mlx4_cq {
+	uint8_t *buf; /**< Pointer to the completion queue buffer. */
+	uint32_t cqe_cnt; /**< Number of entries in the queue. */
+	uint32_t cqe_64:1; /**< CQ entry size is 64 bytes. */
+	uint32_t cons_index; /**< Last queue entry that was handled. */
+	uint32_t *set_ci_db; /**< Pointer to the completion queue doorbell. */
+};
+
+/**
+ * Retrieve a CQE entry from a CQ.
+ *
+ * cqe = cq->buf + cons_index * cqe_size + cqe_offset
+ *
+ * Where cqe_size is 32 or 64 bytes and cqe_offset is 0 or 32 (depending on
+ * cqe_size).
+ *
+ * @param cq
+ *   CQ to retrieve entry from.
+ * @param index
+ *   Entry index.
+ *
+ * @return
+ *   Pointer to CQE entry.
+ */
+static inline struct mlx4_cqe *
+mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
+{
+	return (struct mlx4_cqe *)(cq->buf +
+				   ((index & (cq->cqe_cnt - 1)) <<
+				    (5 + cq->cqe_64)) +
+				   (cq->cqe_64 << 5));
+}
+
+#endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b5e7777..35367a2 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -52,15 +52,72 @@
 
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
+#include <rte_io.h>
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
 /**
+ * Stamp a WQE so it won't be reused by the HW.
+ *
+ * The routine is used when freeing a WQE that the chip has finished with, or
+ * when building a WQE has failed, leaving partial information on the queue.
+ *
+ * @param sq
+ *   Pointer to the SQ structure.
+ * @param index
+ *   Index of the freed WQE.
+ * @param num_txbbs
+ *   Number of blocks to stamp.
+ *   If < 0 the routine will use the size written in the WQ entry.
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of Tx basic blocks (TXBB) the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+					  (!!owner << MLX4_SQ_STAMP_SHIFT));
+	uint8_t *wqe = mlx4_get_send_wqe(sq, (index & sq->txbb_cnt_mask));
+	uint32_t *ptr = (uint32_t *)wqe;
+	int i;
+	int txbbs_size;
+	int num_txbbs;
+
+	/* Extract the size from the control segment of the WQE. */
+	num_txbbs = MLX4_SIZE_TO_TXBBS((((struct mlx4_wqe_ctrl_seg *)
+					 wqe)->fence_size & 0x3f) << 4);
+	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	/* Optimize the common case when there is no wrap-around. */
+	if (wqe + txbbs_size <= sq->eob) {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+			if ((uint8_t *)ptr >= sq->eob) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= RTE_BE32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
  * Manage Tx completions.
  *
  * When sending a burst, mlx4_tx_burst() posts several WRs.
@@ -80,26 +137,71 @@ mlx4_txq_complete(struct txq *txq)
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
-	int wcs_n;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
-	if (unlikely(wcs_n == 0))
+	/*
+	 * Traverse over all CQ entries reported and handle each WQ entry
+	 * reported by them.
+	 */
+	do {
+		cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cons_index);
+		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		    !!(cons_index & cq->cqe_cnt)))
+			break;
+		/*
+		 * Make sure we read the CQE after we read the ownership bit.
+		 */
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      (void *)txq, cqe_err->vendor_err,
+			      cqe_err->syndrome);
+		}
+		/* Get WQE index reported in the CQE. */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* Free next descriptor. */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		cons_index++;
+	} while (1);
+	if (unlikely(pkts == 0))
 		return 0;
-	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
-		      (void *)txq, wcs_n);
-		return -1;
-	}
-	elts_comp -= wcs_n;
+	/*
+	 * Update CQ.
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	/* Update the list of packets posted for transmission. */
+	elts_comp -= pkts;
 	assert(elts_comp <= txq->elts_comp);
 	/*
-	 * Assume WC status is successful as nothing can be done about it
-	 * anyway.
+	 * Assume completion status is successful as nothing can be done about
+	 * it anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -183,6 +285,119 @@ mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
 }
 
 /**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   Target Tx queue.
+ * @param pkt
+ *   Packet to transmit.
+ * @param send_flags
+ *   @p MLX4_WQE_CTRL_CQ_UPDATE to request completion on this packet.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+static inline int
+mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt, uint32_t send_flags)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t lkey;
+	uintptr_t addr;
+	int wqe_real_size;
+	int nr_txbbs;
+	int rc;
+
+	/* Calculate the needed work queue entry size for this packet. */
+	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+	/*
+	 * Check that there is room for this WQE in the send queue and that
+	 * the WQE size is legal.
+	 */
+	if (((sq->head - sq->tail) + nr_txbbs +
+	     sq->headroom_txbbs) >= sq->txbb_cnt ||
+	    nr_txbbs > MLX4_MAX_WQE_TXBBS) {
+		rc = ENOSPC;
+		goto err;
+	}
+	/* Get the control and single-data entries of the WQE. */
+	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
+					    sizeof(struct mlx4_wqe_ctrl_seg));
+	/* Fill the data segment with buffer information. */
+	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
+	rte_prefetch0((volatile void *)addr);
+	dseg->addr = rte_cpu_to_be_64(addr);
+	/* Memory region key for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		/* MR does not exist. */
+		DEBUG("%p: unable to get MP <-> MR association", (void *)txq);
+		/*
+		 * Restamp entry in case of failure, make sure that size is
+		 * written correctly.
+		 * Note that we give ownership to the SW, not the HW.
+		 */
+		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+		mlx4_txq_stamp_freed_wqe(sq, head_idx,
+					 (sq->head & sq->txbb_cnt) ? 0 : 1);
+		rc = EFAULT;
+		goto err;
+	}
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * Need a barrier here before writing the byte_count field to
+	 * make sure that all the data is visible before the
+	 * byte_count field is set. Otherwise, if the segment begins
+	 * a new cache line, the HCA prefetcher could grab the 64-byte
+	 * chunk and get a valid (!= 0xffffffff) byte count but
+	 * stale data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+	if (likely(pkt->data_len))
+		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
+	else
+		/*
+		 * Zero length segment is treated as inline segment
+		 * with zero data.
+		 */
+		dseg->byte_count = RTE_BE32(0x80000000);
+	/*
+	 * Fill the control parameters for this packet.
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
+	 * should be calculated.
+	 */
+	ctrl->srcrb_flags =
+		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
+				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+	/*
+	 * The caller should prepare "imm" in advance in order to support
+	 * VF to VF communication (when the device is a virtual-function
+	 * device (VF)).
+	 */
+	ctrl->imm = 0;
+	/*
+	 * Make sure descriptor is fully written before setting ownership
+	 * bit (because HW can start executing as soon as we do).
+	 */
+	rte_wmb();
+	ctrl->owner_opcode =
+		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
+				 ((sq->head & sq->txbb_cnt) ?
+				  MLX4_BIT_WQE_OWN : 0));
+	sq->head += nr_txbbs;
+	return 0;
+err:
+	rte_errno = rc;
+	return -rc;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,13 +414,11 @@ uint16_t
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
-	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
 	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
+	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
@@ -229,9 +442,7 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		struct ibv_send_wr *wr = &elt->wr;
 		unsigned int segs = buf->nb_segs;
-		unsigned int sent_size = 0;
 		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
@@ -254,93 +465,43 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		if (unlikely(--elts_comp_cd == 0)) {
 			elts_comp_cd = txq->elts_comp_cd_init;
 			++elts_comp;
-			send_flags |= IBV_SEND_SIGNALED;
+			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
 		}
 		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
+			/* Update element. */
+			elt->buf = buf;
+			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+			/* Post the packet for sending. */
+			err = mlx4_post_send(txq, buf, send_flags);
+			if (unlikely(err)) {
+				if (unlikely(send_flags &
+					     MLX4_WQE_CTRL_CQ_UPDATE)) {
+					elts_comp_cd = 1;
+					--elts_comp;
+				}
 				elt->buf = NULL;
 				goto stop;
 			}
-			/* Update element. */
 			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
+			bytes_sent += buf->pkt_len;
 		} else {
-			err = -1;
+			err = -EINVAL;
+			rte_errno = -err;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
 		elts_head = elts_head_next;
-		/* Increment sent bytes counter. */
-		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
-	/* Increment sent packets counter. */
+	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
+	txq->stats.obytes += bytes_sent;
+	/* Make sure that descriptors are written before doorbell record. */
+	rte_wmb();
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
 	txq->elts_comp_cd = elts_comp_cd;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fec998a..cc5951c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -40,6 +40,7 @@
 #ifdef PEDANTIC
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
+#include <infiniband/mlx4dv.h>
 #include <infiniband/verbs.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
@@ -50,6 +51,7 @@
 #include <rte_mempool.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 
 /** Rx queue counters. */
 struct mlx4_rxq_stats {
@@ -85,8 +87,6 @@ struct rxq {
 
 /** Tx element. */
 struct txq_elt {
-	struct ibv_send_wr wr; /* Work request. */
-	struct ibv_sge sge; /* Scatter/gather element. */
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
@@ -100,24 +100,26 @@ struct mlx4_txq_stats {
 
 /** Tx queue descriptor. */
 struct txq {
-	struct priv *priv; /**< Back pointer to private data. */
-	struct {
-		const struct rte_mempool *mp; /**< Cached memory pool. */
-		struct ibv_mr *mr; /**< Memory region (for mp). */
-		uint32_t lkey; /**< mr->lkey copy. */
-	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
-	struct ibv_cq *cq; /**< Completion queue. */
-	struct ibv_qp *qp; /**< Queue pair. */
-	uint32_t max_inline; /**< Max inline send size. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	struct txq_elt (*elts)[]; /**< Tx elements. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
 	unsigned int elts_comp; /**< Number of completion requests. */
 	unsigned int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
+	unsigned int elts_n; /**< (*elts)[] length. */
+	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
+	uint32_t max_inline; /**< Max inline send size. */
+	struct {
+		const struct rte_mempool *mp; /**< Cached memory pool. */
+		struct ibv_mr *mr; /**< Memory region (for mp). */
+		uint32_t lkey; /**< mr->lkey copy. */
+	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
+	struct priv *priv; /**< Back pointer to private data. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
+	struct ibv_cq *cq; /**< Completion queue. */
+	struct ibv_qp *qp; /**< Queue pair. */
 };
 
 /* mlx4_rxq.c */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index e0245b0..fb28ef2 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -62,6 +62,7 @@
 #include "mlx4_autoconf.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
+#include "mlx4_prm.h"
 
 /**
  * Allocate Tx queue elements.
@@ -242,6 +243,41 @@ mlx4_txq_mp2mr_iter(struct rte_mempool *mp, void *arg)
 }
 
 /**
+ * Retrieves information needed in order to directly access the Tx queue.
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param mlxdv
+ *   Pointer to device information for this Tx queue.
+ */
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	/* Total length, including headroom and spare WQEs. */
+	sq->eob = sq->buf + sq_size;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_cnt =
+		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs =
+		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
+	cq->buf = dcq->buf.buf;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_64 = (dcq->cqe_size & 64) ? 1 : 0;
+}
+
+/**
  * Configure a Tx queue.
  *
  * @param dev
@@ -263,6 +299,9 @@ mlx4_txq_setup(struct rte_eth_dev *dev, struct txq *txq, uint16_t desc,
 	       unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct txq tmpl = {
 		.priv = priv,
 		.socket = socket
@@ -370,6 +409,18 @@ mlx4_txq_setup(struct rte_eth_dev *dev, struct txq *txq, uint16_t desc,
 	DEBUG("%p: txq updated with %p", (void *)txq, (void *)&tmpl);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = txq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: failed to obtain information needed for"
+		      " accessing the device queues", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	return 0;
 error:
 	ret = rte_errno;
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 29507dc..1435cb6 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -133,7 +133,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
 endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LIO_PMD)        += -lrte_pmd_lio
-_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs -lmlx4
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -libverbs -lmlx5
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NFP_PMD)        += -lrte_pmd_nfp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_NULL)       += -lrte_pmd_null
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 2/6] net/mlx4: restore full Rx support bypassing Verbs
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 1/6] net/mlx4: add simple Tx bypassing Verbs Adrien Mazarguil
@ 2017-10-04 18:48     ` Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 3/6] net/mlx4: restore Tx gather support Adrien Mazarguil
                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for accessing the hardware directly when handling
Rx packets, eliminating the need to use Verbs in the Rx data path.

The number of scatters is calculated on the fly, according to the maximum
expected packet size.
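
To illustrate that calculation (a rough sketch only, not the exact code from
this patch): with only the first segment keeping its headroom, the queue needs
about ceil((headroom + max_rx_pkt_len) / data_room) segments, rounded up to a
power of two because the value is stored as a log2 (sges_n):

#include <rte_common.h>
#include <rte_mbuf.h>

/* Illustrative helper: log2 of the number of Rx SGEs needed per packet. */
static unsigned int
rx_sges_n_log2(uint32_t max_rx_pkt_len, uint32_t data_room)
{
	uint32_t size = RTE_PKTMBUF_HEADROOM + max_rx_pkt_len;
	uint32_t segs = (size + data_room - 1) / data_room;
	uint32_t pow2 = rte_align32pow2(segs);
	unsigned int n = 0;

	while ((1u << n) < pow2)
		++n;
	return n;
}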

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxq.c   | 174 ++++++++++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.c  | 226 +++++++++++++++++++++----------------
 drivers/net/mlx4/mlx4_rxtx.h  |  19 ++--
 drivers/net/mlx4/mlx4_utils.h |  20 ++++
 4 files changed, 270 insertions(+), 169 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 409983f..44d095d 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -51,6 +51,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
 
+#include <rte_byteorder.h>
 #include <rte_common.h>
 #include <rte_errno.h>
 #include <rte_ethdev.h>
@@ -77,20 +78,18 @@ static int
 mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 {
 	unsigned int i;
-	struct rxq_elt (*elts)[elts_n] =
-		rte_calloc_socket("RXQ elements", 1, sizeof(*elts), 0,
-				  rxq->socket);
+	const uint32_t sges_n = 1 << rxq->sges_n;
+	struct rte_mbuf *(*elts)[elts_n] =
+		rte_calloc_socket("RXQ", 1, sizeof(*elts), 0, rxq->socket);
 
+	assert(rte_is_power_of_2(elts_n));
 	if (elts == NULL) {
 		rte_errno = ENOMEM;
 		ERROR("%p: can't allocate packets array", (void *)rxq);
 		goto error;
 	}
-	/* For each WR (packet). */
 	for (i = 0; (i != elts_n); ++i) {
-		struct rxq_elt *elt = &(*elts)[i];
-		struct ibv_recv_wr *wr = &elt->wr;
-		struct ibv_sge *sge = &(*elts)[i].sge;
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[i];
 		struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);
 
 		if (buf == NULL) {
@@ -98,37 +97,35 @@ mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 			ERROR("%p: empty mbuf pool", (void *)rxq);
 			goto error;
 		}
-		elt->buf = buf;
-		wr->next = &(*elts)[(i + 1)].wr;
-		wr->sg_list = sge;
-		wr->num_sge = 1;
 		/* Headroom is reserved by rte_pktmbuf_alloc(). */
 		assert(buf->data_off == RTE_PKTMBUF_HEADROOM);
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
-		/* sge->addr must be able to store a pointer. */
-		assert(sizeof(sge->addr) >= sizeof(uintptr_t));
-		/* SGE keeps its headroom. */
-		sge->addr = (uintptr_t)
-			((uint8_t *)buf->buf_addr + RTE_PKTMBUF_HEADROOM);
-		sge->length = (buf->buf_len - RTE_PKTMBUF_HEADROOM);
-		sge->lkey = rxq->mr->lkey;
-		/* Redundant check for tailroom. */
-		assert(sge->length == rte_pktmbuf_tailroom(buf));
+		/* Only the first segment keeps headroom. */
+		if (i % sges_n)
+			buf->data_off = 0;
+		buf->port = rxq->port_id;
+		buf->data_len = rte_pktmbuf_tailroom(buf);
+		buf->pkt_len = rte_pktmbuf_tailroom(buf);
+		buf->nb_segs = 1;
+		*scat = (struct mlx4_wqe_data_seg){
+			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
+								  uintptr_t)),
+			.byte_count = rte_cpu_to_be_32(buf->data_len),
+			.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
+		};
+		(*elts)[i] = buf;
 	}
-	/* The last WR pointer must be NULL. */
-	(*elts)[(i - 1)].wr.next = NULL;
-	DEBUG("%p: allocated and configured %u single-segment WRs",
-	      (void *)rxq, elts_n);
-	rxq->elts_n = elts_n;
-	rxq->elts_head = 0;
+	DEBUG("%p: allocated and configured %u segments (max %u packets)",
+	      (void *)rxq, elts_n, elts_n >> rxq->sges_n);
+	rxq->elts_n = log2above(elts_n);
 	rxq->elts = elts;
 	return 0;
 error:
 	if (elts != NULL) {
 		for (i = 0; (i != RTE_DIM(*elts)); ++i)
-			rte_pktmbuf_free_seg((*elts)[i].buf);
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
 		rte_free(elts);
 	}
 	DEBUG("%p: failed, freed everything", (void *)rxq);
@@ -146,17 +143,16 @@ static void
 mlx4_rxq_free_elts(struct rxq *rxq)
 {
 	unsigned int i;
-	unsigned int elts_n = rxq->elts_n;
-	struct rxq_elt (*elts)[elts_n] = rxq->elts;
 
-	DEBUG("%p: freeing WRs", (void *)rxq);
+	if (rxq->elts == NULL)
+		return;
+	DEBUG("%p: freeing Rx queue elements", (void *)rxq);
+	for (i = 0; i != (1u << rxq->elts_n); ++i)
+		if ((*rxq->elts)[i] != NULL)
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
+	rte_free(rxq->elts);
 	rxq->elts_n = 0;
 	rxq->elts = NULL;
-	if (elts == NULL)
-		return;
-	for (i = 0; (i != RTE_DIM(*elts)); ++i)
-		rte_pktmbuf_free_seg((*elts)[i].buf);
-	rte_free(elts);
 }
 
 /**
@@ -193,12 +189,15 @@ mlx4_rxq_cleanup(struct rxq *rxq)
  *   Completion queue to associate with QP.
  * @param desc
  *   Number of descriptors in QP (hint only).
+ * @param sges_n
+ *   Maximum number of segments per packet.
  *
  * @return
  *   QP pointer or NULL in case of error and rte_errno is set.
  */
 static struct ibv_qp *
-mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc)
+mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc,
+		  uint32_t sges_n)
 {
 	struct ibv_qp *qp;
 	struct ibv_qp_init_attr attr = {
@@ -211,8 +210,8 @@ mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc)
 			.max_recv_wr = ((priv->device_attr.max_qp_wr < desc) ?
 					priv->device_attr.max_qp_wr :
 					desc),
-			/* Max number of scatter/gather elements in a WR. */
-			.max_recv_sge = 1,
+			/* Maximum number of segments per packet. */
+			.max_recv_sge = sges_n,
 		},
 		.qp_type = IBV_QPT_RAW_PACKET,
 	};
@@ -248,13 +247,15 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 	       struct rte_mempool *mp)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct rxq tmpl = {
 		.priv = priv,
 		.mp = mp,
 		.socket = socket
 	};
 	struct ibv_qp_attr mod;
-	struct ibv_recv_wr *bad_wr;
 	unsigned int mb_len;
 	int ret;
 
@@ -269,11 +270,31 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
 	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
-		;
+		tmpl.sges_n = 0;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
-		WARN("%p: scattered mode has been requested but is"
-		     " not supported, this may lead to packet loss",
-		     (void *)dev);
+		uint32_t size =
+			RTE_PKTMBUF_HEADROOM +
+			dev->data->dev_conf.rxmode.max_rx_pkt_len;
+		uint32_t sges_n;
+
+		/*
+		 * Determine the number of SGEs needed for a full packet
+		 * and round it to the next power of two.
+		 */
+		sges_n = log2above((size / mb_len) + !!(size % mb_len));
+		tmpl.sges_n = sges_n;
+		/* Make sure sges_n did not overflow. */
+		size = mb_len * (1 << tmpl.sges_n);
+		size -= RTE_PKTMBUF_HEADROOM;
+		if (size < dev->data->dev_conf.rxmode.max_rx_pkt_len) {
+			rte_errno = EOVERFLOW;
+			ERROR("%p: too many SGEs (%u) needed to handle"
+			      " requested maximum packet size %u",
+			      (void *)dev,
+			      1 << sges_n,
+			      dev->data->dev_conf.rxmode.max_rx_pkt_len);
+			goto error;
+		}
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -282,6 +303,17 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 		     dev->data->dev_conf.rxmode.max_rx_pkt_len,
 		     mb_len - RTE_PKTMBUF_HEADROOM);
 	}
+	DEBUG("%p: maximum number of segments per packet: %u",
+	      (void *)dev, 1 << tmpl.sges_n);
+	if (desc % (1 << tmpl.sges_n)) {
+		rte_errno = EINVAL;
+		ERROR("%p: number of RX queue descriptors (%u) is not a"
+		      " multiple of maximum segments per packet (%u)",
+		      (void *)dev,
+		      desc,
+		      1 << tmpl.sges_n);
+		goto error;
+	}
 	/* Use the entire Rx mempool as the memory region. */
 	tmpl.mr = mlx4_mp2mr(priv->pd, mp);
 	if (tmpl.mr == NULL) {
@@ -306,7 +338,8 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 			goto error;
 		}
 	}
-	tmpl.cq = ibv_create_cq(priv->ctx, desc, NULL, tmpl.channel, 0);
+	tmpl.cq = ibv_create_cq(priv->ctx, desc >> tmpl.sges_n, NULL,
+				tmpl.channel, 0);
 	if (tmpl.cq == NULL) {
 		rte_errno = ENOMEM;
 		ERROR("%p: CQ creation failure: %s",
@@ -317,7 +350,8 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
 	      priv->device_attr.max_sge);
-	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc);
+	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc >> tmpl.sges_n,
+				    1 << tmpl.sges_n);
 	if (tmpl.qp == NULL) {
 		ERROR("%p: QP creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -336,21 +370,6 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
-	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
-	if (ret) {
-		ERROR("%p: RXQ allocation failed: %s",
-		      (void *)dev, strerror(rte_errno));
-		goto error;
-	}
-	ret = ibv_post_recv(tmpl.qp, &(*tmpl.elts)[0].wr, &bad_wr);
-	if (ret) {
-		rte_errno = ret;
-		ERROR("%p: ibv_post_recv() failed for WR %p: %s",
-		      (void *)dev,
-		      (void *)bad_wr,
-		      strerror(rte_errno));
-		goto error;
-	}
 	mod = (struct ibv_qp_attr){
 		.qp_state = IBV_QPS_RTR
 	};
@@ -361,14 +380,43 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = tmpl.cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = tmpl.qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: failed to obtain device information", (void *)dev);
+		goto error;
+	}
+	tmpl.wqes =
+		(volatile struct mlx4_wqe_data_seg (*)[])
+		((uintptr_t)dv_qp.buf.buf + dv_qp.rq.offset);
+	tmpl.rq_db = dv_qp.rdb;
+	tmpl.rq_ci = 0;
+	tmpl.mcq.buf = dv_cq.buf.buf;
+	tmpl.mcq.cqe_cnt = dv_cq.cqe_cnt;
+	tmpl.mcq.set_ci_db = dv_cq.set_ci_db;
+	tmpl.mcq.cqe_64 = (dv_cq.cqe_size & 64) ? 1 : 0;
 	/* Save port ID. */
 	tmpl.port_id = dev->data->port_id;
 	DEBUG("%p: RTE port ID: %u", (void *)rxq, tmpl.port_id);
+	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
+	if (ret) {
+		ERROR("%p: RXQ allocation failed: %s",
+		      (void *)dev, strerror(rte_errno));
+		goto error;
+	}
 	/* Clean up rxq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old rxq just in case", (void *)rxq);
 	mlx4_rxq_cleanup(rxq);
 	*rxq = tmpl;
 	DEBUG("%p: rxq updated with %p", (void *)rxq, (void *)&tmpl);
+	/* Update doorbell counter. */
+	rxq->rq_ci = desc >> rxq->sges_n;
+	rte_wmb();
+	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 	return 0;
 error:
 	ret = rte_errno;
@@ -406,6 +454,12 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	struct rxq *rxq = dev->data->rx_queues[idx];
 	int ret;
 
+	if (!rte_is_power_of_2(desc)) {
+		desc = 1 << log2above(desc);
+		WARN("%p: increased number of descriptors in RX queue %u"
+		     " to the next power of two (%d)",
+		     (void *)dev, idx, desc);
+	}
 	DEBUG("%p: configuring queue %u for %u descriptors",
 	      (void *)dev, idx, desc);
 	if (idx >= dev->data->nb_rx_queues) {
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 35367a2..fd8ef7b 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -509,9 +509,44 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
- * DPDK callback for Rx.
+ * Poll one CQE from CQ.
  *
- * The following function doesn't manage scattered packets.
+ * @param rxq
+ *   Pointer to the receive queue structure.
+ * @param[out] out
+ *   Just polled CQE.
+ *
+ * @return
+ *   Number of bytes of the CQE, 0 in case there is no completion.
+ */
+static unsigned int
+mlx4_cq_poll_one(struct rxq *rxq, struct mlx4_cqe **out)
+{
+	int ret = 0;
+	struct mlx4_cqe *cqe = NULL;
+	struct mlx4_cq *cq = &rxq->mcq;
+
+	cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cq->cons_index);
+	if (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+	    !!(cq->cons_index & cq->cqe_cnt))
+		goto out;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rte_rmb();
+	assert(!(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK));
+	assert((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) !=
+	       MLX4_CQE_OPCODE_ERROR);
+	ret = rte_be_to_cpu_32(cqe->byte_cnt);
+	++cq->cons_index;
+out:
+	*out = cqe;
+	return ret;
+}
+
+/**
+ * DPDK callback for Rx with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -526,112 +561,107 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 uint16_t
 mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
-	struct rxq *rxq = (struct rxq *)dpdk_rxq;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
-	const unsigned int elts_n = rxq->elts_n;
-	unsigned int elts_head = rxq->elts_head;
-	struct ibv_wc wcs[pkts_n];
-	struct ibv_recv_wr *wr_head = NULL;
-	struct ibv_recv_wr **wr_next = &wr_head;
-	struct ibv_recv_wr *wr_bad = NULL;
-	unsigned int i;
-	unsigned int pkts_ret = 0;
-	int ret;
+	struct rxq *rxq = dpdk_rxq;
+	const uint32_t wr_cnt = (1 << rxq->elts_n) - 1;
+	const uint16_t sges_n = rxq->sges_n;
+	struct rte_mbuf *pkt = NULL;
+	struct rte_mbuf *seg = NULL;
+	unsigned int i = 0;
+	uint32_t rq_ci = rxq->rq_ci << sges_n;
+	int len = 0;
 
-	ret = ibv_poll_cq(rxq->cq, pkts_n, wcs);
-	if (unlikely(ret == 0))
-		return 0;
-	if (unlikely(ret < 0)) {
-		DEBUG("rxq=%p, ibv_poll_cq() failed (wc_n=%d)",
-		      (void *)rxq, ret);
-		return 0;
-	}
-	assert(ret <= (int)pkts_n);
-	/* For each work completion. */
-	for (i = 0; i != (unsigned int)ret; ++i) {
-		struct ibv_wc *wc = &wcs[i];
-		struct rxq_elt *elt = &(*elts)[elts_head];
-		struct ibv_recv_wr *wr = &elt->wr;
-		uint32_t len = wc->byte_len;
-		struct rte_mbuf *seg = elt->buf;
-		struct rte_mbuf *rep;
+	while (pkts_n) {
+		struct mlx4_cqe *cqe;
+		uint32_t idx = rq_ci & wr_cnt;
+		struct rte_mbuf *rep = (*rxq->elts)[idx];
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[idx];
 
-		/* Sanity checks. */
-		assert(wr->sg_list == &elt->sge);
-		assert(wr->num_sge == 1);
-		assert(elts_head < rxq->elts_n);
-		assert(rxq->elts_head < rxq->elts_n);
-		/*
-		 * Fetch initial bytes of packet descriptor into a
-		 * cacheline while allocating rep.
-		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
-		/* Link completed WRs together for repost. */
-		*wr_next = wr;
-		wr_next = &wr->next;
-		if (unlikely(wc->status != IBV_WC_SUCCESS)) {
-			/* Whatever, just repost the offending WR. */
-			DEBUG("rxq=%p: bad work completion status (%d): %s",
-			      (void *)rxq, wc->status,
-			      ibv_wc_status_str(wc->status));
-			/* Increment dropped packets counter. */
-			++rxq->stats.idropped;
-			goto repost;
-		}
+		/* Update the 'next' pointer of the previous segment. */
+		if (pkt)
+			seg->next = rep;
+		seg = rep;
+		rte_prefetch0(seg);
+		rte_prefetch0(scat);
 		rep = rte_mbuf_raw_alloc(rxq->mp);
 		if (unlikely(rep == NULL)) {
-			/*
-			 * Unable to allocate a replacement mbuf,
-			 * repost WR.
-			 */
-			DEBUG("rxq=%p: can't allocate a new mbuf",
-			      (void *)rxq);
-			/* Increase out of memory counters. */
 			++rxq->stats.rx_nombuf;
-			++rxq->priv->dev->data->rx_mbuf_alloc_failed;
-			goto repost;
+			if (!pkt) {
+				/*
+				 * No buffers before we even started,
+				 * bail out silently.
+				 */
+				break;
+			}
+			while (pkt != seg) {
+				assert(pkt != (*rxq->elts)[idx]);
+				rep = pkt->next;
+				pkt->next = NULL;
+				pkt->nb_segs = 1;
+				rte_mbuf_raw_free(pkt);
+				pkt = rep;
+			}
+			break;
+		}
+		if (!pkt) {
+			/* Looking for the new packet. */
+			len = mlx4_cq_poll_one(rxq, &cqe);
+			if (!len) {
+				rte_mbuf_raw_free(rep);
+				break;
+			}
+			if (unlikely(len < 0)) {
+				/* Rx error, packet is likely too large. */
+				rte_mbuf_raw_free(rep);
+				++rxq->stats.idropped;
+				goto skip;
+			}
+			pkt = seg;
+			pkt->packet_type = 0;
+			pkt->ol_flags = 0;
+			pkt->pkt_len = len;
 		}
-		/* Reconfigure sge to use rep instead of seg. */
-		elt->sge.addr = (uintptr_t)rep->buf_addr + RTE_PKTMBUF_HEADROOM;
-		assert(elt->sge.lkey == rxq->mr->lkey);
-		elt->buf = rep;
-		/* Update seg information. */
-		seg->data_off = RTE_PKTMBUF_HEADROOM;
-		seg->nb_segs = 1;
-		seg->port = rxq->port_id;
-		seg->next = NULL;
-		seg->pkt_len = len;
+		rep->nb_segs = 1;
+		rep->port = rxq->port_id;
+		rep->data_len = seg->data_len;
+		rep->data_off = seg->data_off;
+		(*rxq->elts)[idx] = rep;
+		/*
+		 * Fill NIC descriptor with the new buffer. The lkey and size
+		 * of the buffers are already known, only the buffer address
+		 * changes.
+		 */
+		scat->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
+		if (len > seg->data_len) {
+			len -= seg->data_len;
+			++pkt->nb_segs;
+			++rq_ci;
+			continue;
+		}
+		/* The last segment. */
 		seg->data_len = len;
-		seg->packet_type = 0;
-		seg->ol_flags = 0;
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += pkt->pkt_len;
 		/* Return packet. */
-		*(pkts++) = seg;
-		++pkts_ret;
-		/* Increase bytes counter. */
-		rxq->stats.ibytes += len;
-repost:
-		if (++elts_head >= elts_n)
-			elts_head = 0;
-		continue;
+		*(pkts++) = pkt;
+		pkt = NULL;
+		--pkts_n;
+		++i;
+skip:
+		/* Align consumer index to the next stride. */
+		rq_ci >>= sges_n;
+		++rq_ci;
+		rq_ci <<= sges_n;
 	}
-	if (unlikely(i == 0))
+	if (unlikely(i == 0 && (rq_ci >> sges_n) == rxq->rq_ci))
 		return 0;
-	/* Repost WRs. */
-	*wr_next = NULL;
-	assert(wr_head);
-	ret = ibv_post_recv(rxq->qp, wr_head, &wr_bad);
-	if (unlikely(ret)) {
-		/* Inability to repost WRs is fatal. */
-		DEBUG("%p: recv_burst(): failed (ret=%d)",
-		      (void *)rxq->priv,
-		      ret);
-		abort();
-	}
-	rxq->elts_head = elts_head;
-	/* Increase packets counter. */
-	rxq->stats.ipackets += pkts_ret;
-	return pkts_ret;
+	/* Update the consumer index. */
+	rxq->rq_ci = rq_ci >> sges_n;
+	rte_wmb();
+	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
+	*rxq->mcq.set_ci_db = rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+	return i;
 }
 
 /**
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index cc5951c..ac84177 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -62,13 +62,6 @@ struct mlx4_rxq_stats {
 	uint64_t rx_nombuf; /**< Total of Rx mbuf allocation failures. */
 };
 
-/** Rx element. */
-struct rxq_elt {
-	struct ibv_recv_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
-	struct rte_mbuf *buf; /**< Buffer. */
-};
-
 /** Rx queue descriptor. */
 struct rxq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -77,10 +70,14 @@ struct rxq {
 	struct ibv_cq *cq; /**< Completion queue. */
 	struct ibv_qp *qp; /**< Queue pair. */
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
-	unsigned int port_id; /**< Port ID for incoming packets. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	struct rxq_elt (*elts)[]; /**< Rx elements. */
+	uint16_t rq_ci; /**< Saved RQ consumer index. */
+	uint16_t port_id; /**< Port ID for incoming packets. */
+	uint16_t sges_n; /**< Number of segments per packet (log2 value). */
+	uint16_t elts_n; /**< Mbuf queue size (log2 value). */
+	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
+	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
+	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
diff --git a/drivers/net/mlx4/mlx4_utils.h b/drivers/net/mlx4/mlx4_utils.h
index 0fbdc71..d6f729f 100644
--- a/drivers/net/mlx4/mlx4_utils.h
+++ b/drivers/net/mlx4/mlx4_utils.h
@@ -108,4 +108,24 @@ pmd_drv_log_basename(const char *s)
 
 int mlx4_fd_set_non_blocking(int fd);
 
+/**
+ * Return nearest power of two above input value.
+ *
+ * @param v
+ *   Input value.
+ *
+ * @return
+ *   Nearest power of two above input value.
+ */
+static inline unsigned int
+log2above(unsigned int v)
+{
+	unsigned int l;
+	unsigned int r;
+
+	for (l = 0, r = 0; (v >> 1); ++l, v >>= 1)
+		r |= (v & 1);
+	return l + r;
+}
+
 #endif /* MLX4_UTILS_H_ */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 3/6] net/mlx4: restore Tx gather support
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 1/6] net/mlx4: add simple Tx bypassing Verbs Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 2/6] net/mlx4: restore full Rx support " Adrien Mazarguil
@ 2017-10-04 18:48     ` Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 4/6] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for transmitting packets spanning multiple
buffers.

It also takes into account the number of Tx queue entries a packet
occupies when setting the chip's report-completion flag.
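
As a rough standalone sketch of that accounting (field names follow the
txq structure in this patch, everything else is assumed for illustration),
the completion countdown is decremented by the number of TXBBs a packet
consumed rather than by one per packet:

#include <stdbool.h>

struct txq_cd {
	int elts_comp_cd;      /* countdown until next completion request */
	int elts_comp_cd_init; /* initial countdown value */
};

/* Return true when this WQE must request a completion (CQ_UPDATE). */
static bool
tx_request_completion(struct txq_cd *txq, int nr_txbbs)
{
	txq->elts_comp_cd -= nr_txbbs;
	if (txq->elts_comp_cd <= 0) {
		txq->elts_comp_cd = txq->elts_comp_cd_init;
		return true;
	}
	return false;
}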

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 197 ++++++++++++++++++++++----------------
 drivers/net/mlx4/mlx4_rxtx.h |   6 +-
 drivers/net/mlx4/mlx4_txq.c  |  12 ++-
 3 files changed, 127 insertions(+), 88 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index fd8ef7b..cc0baaa 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -63,6 +63,15 @@
 #include "mlx4_utils.h"
 
 /**
+ * Pointer-value pair structure used in tx_post_send for saving the first
+ * DWORD (32 byte) of a TXBB.
+ */
+struct pv {
+	struct mlx4_wqe_data_seg *dseg;
+	uint32_t val;
+};
+
+/**
  * Stamp a WQE so it won't be reused by the HW.
  *
  * Routine is used when freeing WQE used by the chip or when failing
@@ -291,24 +300,28 @@ mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
  *   Target Tx queue.
  * @param pkt
  *   Packet to transmit.
- * @param send_flags
- *   @p MLX4_WQE_CTRL_CQ_UPDATE to request completion on this packet.
  *
  * @return
  *   0 on success, negative errno value otherwise and rte_errno is set.
  */
 static inline int
-mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt, uint32_t send_flags)
+mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 {
 	struct mlx4_wqe_ctrl_seg *ctrl;
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
+	struct rte_mbuf *buf;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
+	uint32_t srcrb_flags;
+	uint32_t owner_opcode = MLX4_OPCODE_SEND;
+	uint32_t byte_count;
 	int wqe_real_size;
 	int nr_txbbs;
 	int rc;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
+	int pv_counter = 0;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
@@ -324,56 +337,81 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt, uint32_t send_flags)
 		rc = ENOSPC;
 		goto err;
 	}
-	/* Get the control and single-data entries of the WQE. */
+	/* Get the control and data entries of the WQE. */
 	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
 	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
 					    sizeof(struct mlx4_wqe_ctrl_seg));
-	/* Fill the data segment with buffer information. */
-	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
-	rte_prefetch0((volatile void *)addr);
-	dseg->addr = rte_cpu_to_be_64(addr);
-	/* Memory region key for this memory pool. */
-	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
-	if (unlikely(lkey == (uint32_t)-1)) {
-		/* MR does not exist. */
-		DEBUG("%p: unable to get MP <-> MR association", (void *)txq);
+	/* Fill the data segments with buffer information. */
+	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
+		addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		rte_prefetch0((volatile void *)addr);
+		/* Handle WQE wraparound. */
+		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
+			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+		dseg->addr = rte_cpu_to_be_64(addr);
+		/* Memory region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			/*
+			 * Restamp entry in case of failure.
+			 * Make sure that size is written correctly
+			 * Note that we give ownership to the SW, not the HW.
+			 */
+			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+			mlx4_txq_stamp_freed_wqe(sq, head_idx,
+				     (sq->head & sq->txbb_cnt) ? 0 : 1);
+			rc = EFAULT;
+			goto err;
+		}
+		dseg->lkey = rte_cpu_to_be_32(lkey);
+		if (likely(buf->data_len)) {
+			byte_count = rte_cpu_to_be_32(buf->data_len);
+		} else {
+			/*
+			 * Zero length segment is treated as inline segment
+			 * with zero data.
+			 */
+			byte_count = RTE_BE32(0x80000000);
+		}
 		/*
-		 * Restamp entry in case of failure, make sure that size is
-		 * written correctly.
-		 * Note that we give ownership to the SW, not the HW.
+		 * If the data segment is not at the beginning of a
+		 * Tx basic block (TXBB) then write the byte count,
+		 * else postpone the writing to just before updating the
+		 * control segment.
 		 */
-		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-		mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					 (sq->head & sq->txbb_cnt) ? 0 : 1);
-		rc = EFAULT;
-		goto err;
+		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			/*
+			 * Need a barrier here before writing the byte_count
+			 * fields to make sure that all the data is visible
+			 * before the byte_count field is set.
+			 * Otherwise, if the segment begins a new cacheline,
+			 * the HCA prefetcher could grab the 64-byte chunk and
+			 * get a valid (!= 0xffffffff) byte count but stale
+			 * data, and end up sending the wrong data.
+			 */
+			rte_io_wmb();
+			dseg->byte_count = byte_count;
+		} else {
+			/*
+			 * This data segment starts at the beginning of a new
+			 * TXBB, so we need to postpone its byte_count writing
+			 * for later.
+			 */
+			pv[pv_counter].dseg = dseg;
+			pv[pv_counter++].val = byte_count;
+		}
 	}
-	dseg->lkey = rte_cpu_to_be_32(lkey);
-	/*
-	 * Need a barrier here before writing the byte_count field to
-	 * make sure that all the data is visible before the
-	 * byte_count field is set. Otherwise, if the segment begins
-	 * a new cache line, the HCA prefetcher could grab the 64-byte
-	 * chunk and get a valid (!= 0xffffffff) byte count but
-	 * stale data, and end up sending the wrong data.
-	 */
-	rte_io_wmb();
-	if (likely(pkt->data_len))
-		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
-	else
-		/*
-		 * Zero length segment is treated as inline segment
-		 * with zero data.
-		 */
-		dseg->byte_count = RTE_BE32(0x80000000);
-	/*
-	 * Fill the control parameters for this packet.
-	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
-	 * should be calculated.
-	 */
-	ctrl->srcrb_flags =
-		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
-				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	/* Write the first DWORD of each TXBB save earlier. */
+	if (pv_counter) {
+		/* Need a barrier here before writing the byte_count. */
+		rte_io_wmb();
+		for (--pv_counter; pv_counter  >= 0; pv_counter--)
+			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
+	}
+	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
 	 * The caller should prepare "imm" in advance in order to support
@@ -382,14 +420,27 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt, uint32_t send_flags)
 	 */
 	ctrl->imm = 0;
 	/*
-	 * Make sure descriptor is fully written before setting ownership
-	 * bit (because HW can start executing as soon as we do).
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
+	 * should be calculated.
+	 */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+				       MLX4_WQE_CTRL_CQ_UPDATE);
+	} else {
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+	}
+	ctrl->srcrb_flags = srcrb_flags;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
 	 */
 	rte_wmb();
-	ctrl->owner_opcode =
-		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
-				 ((sq->head & sq->txbb_cnt) ?
-				  MLX4_BIT_WQE_OWN : 0));
+	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					      ((sq->head & sq->txbb_cnt) ?
+					       MLX4_BIT_WQE_OWN : 0));
 	sq->head += nr_txbbs;
 	return 0;
 err:
@@ -416,14 +467,13 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -442,8 +492,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		unsigned int segs = buf->nb_segs;
-		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -461,34 +509,16 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
-			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
-		}
-		if (likely(segs == 1)) {
-			/* Update element. */
-			elt->buf = buf;
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			/* Post the packet for sending. */
-			err = mlx4_post_send(txq, buf, send_flags);
-			if (unlikely(err)) {
-				if (unlikely(send_flags &
-					     MLX4_WQE_CTRL_CQ_UPDATE)) {
-					elts_comp_cd = 1;
-					--elts_comp;
-				}
-				elt->buf = NULL;
-				goto stop;
-			}
-			elt->buf = buf;
-			bytes_sent += buf->pkt_len;
-		} else {
-			err = -EINVAL;
-			rte_errno = -err;
+		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+		/* Post the packet for sending. */
+		err = mlx4_post_send(txq, buf);
+		if (unlikely(err)) {
+			elt->buf = NULL;
 			goto stop;
 		}
+		elt->buf = buf;
+		bytes_sent += buf->pkt_len;
+		++elts_comp;
 		elts_head = elts_head_next;
 	}
 stop:
@@ -504,7 +534,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index ac84177..528e286 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -101,13 +101,15 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
+	unsigned int elts_comp; /**< Number of packets awaiting completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint8_t *bounce_buf;
+	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
 		const struct rte_mempool *mp; /**< Cached memory pool. */
 		struct ibv_mr *mr; /**< Memory region (for mp). */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index fb28ef2..7552a88 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -83,8 +83,13 @@ mlx4_txq_alloc_elts(struct txq *txq, unsigned int elts_n)
 		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->socket);
 	int ret = 0;
 
-	if (elts == NULL) {
-		ERROR("%p: can't allocate packets array", (void *)txq);
+	/* Allocate bounce buffer. */
+	txq->bounce_buf = rte_zmalloc_socket("TXQ",
+					     MLX4_MAX_WQE_SIZE,
+					     RTE_CACHE_LINE_MIN_SIZE,
+					     txq->socket);
+	if (!elts || !txq->bounce_buf) {
+		ERROR("%p: can't allocate TXQ memory", (void *)txq);
 		ret = ENOMEM;
 		goto error;
 	}
@@ -110,6 +115,8 @@ mlx4_txq_alloc_elts(struct txq *txq, unsigned int elts_n)
 	assert(ret == 0);
 	return 0;
 error:
+	rte_free(txq->bounce_buf);
+	txq->bounce_buf = NULL;
 	rte_free(elts);
 	DEBUG("%p: failed, freed everything", (void *)txq);
 	assert(ret > 0);
@@ -175,6 +182,7 @@ mlx4_txq_cleanup(struct txq *txq)
 		claim_zero(ibv_destroy_qp(txq->qp));
 	if (txq->cq != NULL)
 		claim_zero(ibv_destroy_cq(txq->cq));
+	rte_free(txq->bounce_buf);
 	for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) {
 		if (txq->mp2mr[i].mp == NULL)
 			break;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 4/6] net/mlx4: restore Tx checksum offloads
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
                       ` (2 preceding siblings ...)
  2017-10-04 18:48     ` [PATCH v3 3/6] net/mlx4: restore Tx gather support Adrien Mazarguil
@ 2017-10-04 18:48     ` Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 5/6] net/mlx4: restore Rx offloads Adrien Mazarguil
                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
calculation, including inner/outer checksums on supported tunnel types.
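
The following is a minimal decision sketch, not the exact driver code, of
which checksum bits end up being requested depending on the mbuf flags and
the txq->csum / txq->csum_l2tun capabilities added here; the structure and
function names are illustrative assumptions:

#include <stdbool.h>

struct csum_decision {
	bool inner_ip_l4; /* request inner IP + L4 checksum (tunnel case) */
	bool outer_ip;    /* request (outer) IP header checksum */
	bool l4;          /* request TCP/UDP checksum on the outer headers */
};

static struct csum_decision
tx_csum_decide(bool hw_csum, bool hw_csum_l2tun, bool want_csum,
	       bool is_tunneled, bool want_outer_ip_csum)
{
	struct csum_decision d = { false, false, false };

	if (!hw_csum || !want_csum)
		return d;
	if (is_tunneled && hw_csum_l2tun) {
		d.inner_ip_l4 = true;            /* IIP/IL4 bits in owner_opcode */
		d.outer_ip = want_outer_ip_csum; /* IP_HDR_CSUM in srcrb_flags */
	} else {
		d.outer_ip = true;               /* IP_HDR_CSUM ... */
		d.l4 = true;                     /* ... and TCP_UDP_CSUM */
	}
	return d;
}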

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4.c        | 11 +++++++++++
 drivers/net/mlx4/mlx4.h        |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h    |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 19 +++++++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.h   |  2 ++
 drivers/net/mlx4/mlx4_txq.c    |  2 ++
 7 files changed, 44 insertions(+)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index b084903..385ddaa 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -529,6 +529,17 @@ mlx4_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->hw_csum =	!!(device_attr.device_cap_flags &
+				   IBV_DEVICE_RAW_IP_CSUM);
+		DEBUG("checksum offloading is %ssupported",
+		      (priv->hw_csum ? "" : "not "));
+		/* Only ConnectX-3 Pro supports tunneling. */
+		priv->hw_csum_l2tun =
+			priv->hw_csum &&
+			(device_attr.vendor_part_id ==
+			 PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
+		DEBUG("L2 tunnel checksum offloads are %ssupported",
+		      (priv->hw_csum_l2tun ? "" : "not "));
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 93e5502..0b71867 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -104,6 +104,8 @@ struct priv {
 	unsigned int vf:1; /* This is a VF device. */
 	unsigned int intr_alarm:1; /* An interrupt alarm is scheduled. */
 	unsigned int isolated:1; /* Toggle isolated mode. */
+	unsigned int hw_csum:1; /* Checksum offload is supported. */
+	unsigned int hw_csum_l2tun:1; /* Checksum support for L2 tunnels. */
 	struct rte_intr_handle intr_handle; /* Port interrupt handle. */
 	struct rte_flow_drop *flow_drop_queue; /* Flow drop queue. */
 	LIST_HEAD(mlx4_flows, rte_flow) flows;
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a9e8059..bec1787 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,6 +553,12 @@ mlx4_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->hw_csum_l2tun)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->speed_capa =
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 085a595..df5a6b4 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -64,6 +64,8 @@
 
 /* Work queue element (WQE) flags. */
 #define MLX4_BIT_WQE_OWN 0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index cc0baaa..fe7d5d0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -431,6 +431,25 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	} else {
 		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
+	/* Enable HW checksum offload if requested */
+	if (txq->csum &&
+	    (pkt->ol_flags &
+	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = (pkt->ol_flags &
+					      (PKT_TX_TUNNEL_GRE |
+					       PKT_TX_TUNNEL_VXLAN));
+
+		if (is_tunneled && txq->csum_l2tun) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |=
+					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
+		} else {
+			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+						MLX4_WQE_CTRL_TCP_UDP_CSUM);
+		}
+	}
 	ctrl->srcrb_flags = srcrb_flags;
 	/*
 	 * Make sure descriptor is fully written before
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 528e286..a742f61 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -108,6 +108,8 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7552a88..96429bc 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -338,6 +338,8 @@ mlx4_txq_setup(struct rte_eth_dev *dev, struct txq *txq, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	tmpl.csum = priv->hw_csum;
+	tmpl.csum_l2tun = priv->hw_csum_l2tun;
 	DEBUG("priv->device_attr.max_qp_wr is %d",
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 5/6] net/mlx4: restore Rx offloads
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
                       ` (3 preceding siblings ...)
  2017-10-04 18:48     ` [PATCH v3 4/6] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
@ 2017-10-04 18:48     ` Adrien Mazarguil
  2017-10-04 18:48     ` [PATCH v3 6/6] net/mlx4: add loopback Tx from VF Adrien Mazarguil
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
verification, including inner/outer checksums on supported tunnel types.

It also restores packet type recognition support.
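
For reference, here is a standalone illustration of the mlx4_transpose()
helper this patch relies on to move CQE status bits into mbuf flag positions
without branches; the FROM/TO bit positions are arbitrary example values:

#include <stdint.h>
#include <stdio.h>

/* Same helper as in mlx4_prm.h. */
static inline uint64_t
mlx4_transpose(uint64_t val, uint64_t from, uint64_t to)
{
	return (from >= to ?
		(val & from) / (from / to) :
		(val & from) * (to / from));
}

int
main(void)
{
	const uint64_t FROM = UINT64_C(1) << 25; /* e.g. a CQE status bit */
	const uint64_t TO = UINT64_C(1) << 4;    /* e.g. an mbuf ol_flags bit */
	uint64_t cqe_flags = FROM;               /* bit set in the CQE word */

	/* Prints 1: the bit is carried over to the destination position. */
	printf("%d\n", (mlx4_transpose(cqe_flags, FROM, TO) & TO) != 0);
	return 0;
}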

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_ethdev.c |   6 +-
 drivers/net/mlx4/mlx4_prm.h    |  30 +++++++++
 drivers/net/mlx4/mlx4_rxq.c    |   5 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 118 +++++++++++++++++++++++++++++++++++-
 drivers/net/mlx4/mlx4_rxtx.h   |   2 +
 5 files changed, 158 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index bec1787..6dbf273 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,10 +553,14 @@ mlx4_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
-	if (priv->hw_csum)
+	if (priv->hw_csum) {
 		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
 					  DEV_TX_OFFLOAD_UDP_CKSUM |
 					  DEV_TX_OFFLOAD_TCP_CKSUM);
+		info->rx_offload_capa |= (DEV_RX_OFFLOAD_IPV4_CKSUM |
+					  DEV_RX_OFFLOAD_UDP_CKSUM |
+					  DEV_RX_OFFLOAD_TCP_CKSUM);
+	}
 	if (priv->hw_csum_l2tun)
 		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index df5a6b4..0d76a73 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -70,6 +70,14 @@
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
 
+/* CQE checksum flags. */
+enum {
+	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
+	MLX4_CQE_L2_TUNNEL_L4_CSUM = (int)(1u << 26),
+	MLX4_CQE_L2_TUNNEL = (int)(1u << 27),
+	MLX4_CQE_L2_TUNNEL_IPOK = (int)(1u << 31),
+};
+
 /* Send queue information. */
 struct mlx4_sq {
 	uint8_t *buf; /**< SQ buffer. */
@@ -119,4 +127,26 @@ mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
 				   (cq->cqe_64 << 5));
 }
 
+/**
+ * Transpose a flag in a value.
+ *
+ * @param val
+ *   Input value.
+ * @param from
+ *   Flag to retrieve from input value.
+ * @param to
+ *   Flag to set in output value.
+ *
+ * @return
+ *   Output value with transposed flag enabled if present on input.
+ */
+static inline uint64_t
+mlx4_transpose(uint64_t val, uint64_t from, uint64_t to)
+{
+	return (from >= to ?
+		(val & from) / (from / to) :
+		(val & from) * (to / from));
+}
+
+
 #endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 44d095d..a021a32 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -260,6 +260,11 @@ mlx4_rxq_setup(struct rte_eth_dev *dev, struct rxq *rxq, uint16_t desc,
 	int ret;
 
 	(void)conf; /* Thresholds configuration (ignored). */
+	/* Toggle Rx checksum offload if hardware supports it. */
+	if (priv->hw_csum)
+		tmpl.csum = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
+	if (priv->hw_csum_l2tun)
+		tmpl.csum_l2tun = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
 	mb_len = rte_pktmbuf_data_room_size(mp);
 	if (desc == 0) {
 		rte_errno = EINVAL;
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index fe7d5d0..87c5261 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -557,6 +557,107 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
+ * Translate Rx completion flags to packet type.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ *
+ * @return
+ *   Packet type in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_pkt_type(uint32_t flags)
+{
+	uint32_t pkt_type;
+
+	if (flags & MLX4_CQE_L2_TUNNEL)
+		pkt_type =
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPV4,
+				       RTE_PTYPE_L3_IPV4_EXT_UNKNOWN) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IPV4_PKT,
+				       RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN);
+	else
+		pkt_type = mlx4_transpose(flags,
+					  MLX4_CQE_STATUS_IPV4_PKT,
+					  RTE_PTYPE_L3_IPV4_EXT_UNKNOWN);
+	return pkt_type;
+}
+
+/**
+ * Translate Rx completion flags to offload flags.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   Offload flags (ol_flags) in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_ol_flags(uint32_t flags, int csum, int csum_l2tun)
+{
+	uint32_t ol_flags = 0;
+
+	if (csum)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IP_HDR_CSUM_OK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_TCP_UDP_CSUM_OK,
+				       PKT_RX_L4_CKSUM_GOOD);
+	if ((flags & MLX4_CQE_L2_TUNNEL) && csum_l2tun)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPOK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_L4_CSUM,
+				       PKT_RX_L4_CKSUM_GOOD);
+	return ol_flags;
+}
+
+/**
+ * Extract checksum information from CQE flags.
+ *
+ * @param cqe
+ *   Pointer to CQE structure.
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   CQE checksum information.
+ */
+static inline uint32_t
+mlx4_cqe_flags(struct mlx4_cqe *cqe, int csum, int csum_l2tun)
+{
+	uint32_t flags = 0;
+
+	/*
+	 * The relevant bits are in different locations on their
+	 * CQE fields therefore we can join them in one 32bit
+	 * variable.
+	 */
+	if (csum)
+		flags = (rte_be_to_cpu_32(cqe->status) &
+			 MLX4_CQE_STATUS_IPV4_CSUM_OK);
+	if (csum_l2tun)
+		flags |= (rte_be_to_cpu_32(cqe->vlan_my_qpn) &
+			  (MLX4_CQE_L2_TUNNEL |
+			   MLX4_CQE_L2_TUNNEL_IPOK |
+			   MLX4_CQE_L2_TUNNEL_L4_CSUM |
+			   MLX4_CQE_L2_TUNNEL_IPV4));
+	return flags;
+}
+
+/**
  * Poll one CQE from CQ.
  *
  * @param rxq
@@ -664,8 +765,21 @@ mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				goto skip;
 			}
 			pkt = seg;
-			pkt->packet_type = 0;
-			pkt->ol_flags = 0;
+			if (rxq->csum | rxq->csum_l2tun) {
+				uint32_t flags =
+					mlx4_cqe_flags(cqe,
+						       rxq->csum,
+						       rxq->csum_l2tun);
+
+				pkt->ol_flags =
+					rxq_cq_to_ol_flags(flags,
+							   rxq->csum,
+							   rxq->csum_l2tun);
+				pkt->packet_type = rxq_cq_to_pkt_type(flags);
+			} else {
+				pkt->packet_type = 0;
+				pkt->ol_flags = 0;
+			}
 			pkt->pkt_len = len;
 		}
 		rep->nb_segs = 1;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index a742f61..6aad41a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -77,6 +77,8 @@ struct rxq {
 	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
 	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
 	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 6/6] net/mlx4: add loopback Tx from VF
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
                       ` (4 preceding siblings ...)
  2017-10-04 18:48     ` [PATCH v3 5/6] net/mlx4: restore Rx offloads Adrien Mazarguil
@ 2017-10-04 18:48     ` Adrien Mazarguil
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-04 18:48 UTC (permalink / raw)
  To: dev; +Cc: Moti Haimovsky, Matan Azrad

From: Moti Haimovsky <motih@mellanox.com>

This patch adds the loopback functionality needed when the device is a VF,
enabling packet transmission between VFs and the PF.
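
A minimal sketch of the MAC copy involved, assuming the destination MAC sits
in the first six bytes of the frame (the struct and function names below are
illustrative, not the driver's): two bytes go into one half of srcrb_flags
and the remaining four bytes into the immediate field of the control segment.

#include <stdint.h>
#include <string.h>

struct ctrl_seg_sketch {
	uint16_t srcrb_flags16[2]; /* low/high halves of srcrb_flags */
	uint32_t imm;              /* immediate field of the WQE control segment */
};

static void
fill_loopback_mac(struct ctrl_seg_sketch *ctrl, const uint8_t *eth_frame)
{
	/* Destination MAC is the first 6 bytes of the Ethernet frame. */
	memcpy(&ctrl->srcrb_flags16[0], eth_frame, sizeof(uint16_t));
	memcpy(&ctrl->imm, eth_frame + sizeof(uint16_t), sizeof(uint32_t));
}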

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 33 +++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |  1 +
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 87c5261..36173ad 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -311,10 +311,13 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct rte_mbuf *buf;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
-	uint32_t srcrb_flags;
 	uint32_t owner_opcode = MLX4_OPCODE_SEND;
 	uint32_t byte_count;
 	int wqe_real_size;
@@ -414,22 +417,16 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
-	 * The caller should prepare "imm" in advance in order to support
-	 * VF to VF communication (when the device is a virtual-function
-	 * device (VF)).
-	 */
-	ctrl->imm = 0;
-	/*
 	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
 	 * should be calculated.
 	 */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 				       MLX4_WQE_CTRL_CQ_UPDATE);
 	} else {
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
 	/* Enable HW checksum offload if requested */
 	if (txq->csum &&
@@ -443,14 +440,26 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |=
+				srcrb.flags |=
 					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
 		} else {
-			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
 						MLX4_WQE_CTRL_TCP_UDP_CSUM);
 		}
 	}
-	ctrl->srcrb_flags = srcrb_flags;
+	if (txq->lb) {
+		/*
+		 * Copy destination MAC address to the WQE, this allows
+		 * loopback in eSwitch, so that VFs and PF can communicate
+		 * with each other.
+		 */
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						      sizeof(uint16_t)));
+	} else {
+		ctrl->imm = 0;
+	}
+	ctrl->srcrb_flags = srcrb.flags;
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 6aad41a..37f31f4 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -112,6 +112,7 @@ struct txq {
 	uint32_t max_inline; /**< Max inline send size. */
 	uint32_t csum:1; /**< Enable checksum offloading. */
 	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
+	uint32_t lb:1; /**< Whether packets should be looped back by eSwitch. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 96429bc..9d1be95 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -412,6 +412,8 @@ mlx4_txq_setup(struct rte_eth_dev *dev, struct txq *txq, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Enable Tx loopback for VF devices. */
+	tmpl.lb = !!(priv->vf);
 	/* Clean up txq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old txq just in case", (void *)txq);
 	mlx4_txq_cleanup(txq);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
                     ` (7 preceding siblings ...)
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
@ 2017-10-04 21:48   ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 1/7] net/mlx4: add simple Tx " Ophir Munk
                       ` (7 more replies)
  8 siblings, 8 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:48 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Ophir Munk

Changes from v2:
* Split "net/mlx4: support multi-segments Rx" commit from "net/mlx4: get back Rx flow functionality" commit
* Semantic fixes and coding style cleanups
* Fix check-git-log warnings
* Fix checkpatches warnings

Next (currently not included) changes:
* Replacing the MLX4_TRANSPOSE() macro (a generic macro to convert MLX4 to IBV flags) with a look-up table as in mlx5,
for example the mlx5_set_ptype_table() function, in order to improve performance; a rough sketch of the idea is shown after this list.
This change is delicate and should be verified with regression tests first.

* PMD documentation update for when MLNX_OFED is no longer required.
Documentation updates require specific kernel, rdma_core and FW versions as well as installation procedures.
These details should be supplied by the regression team.
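
Rough illustration of the look-up table idea mentioned above (purely
hypothetical; the index bits and table layout are assumptions, not the
mlx5 implementation):

#include <stdint.h>

#define SKETCH_IDX_IPV4   0x1u /* assumed: "IPv4 packet" bit of the index */
#define SKETCH_IDX_TUNNEL 0x2u /* assumed: "L2 tunnel" bit of the index */

/* Table filled once at startup, similar in spirit to mlx5_set_ptype_table(). */
static uint32_t ptype_table[4];

static void
ptype_table_init(uint32_t ptype_ipv4, uint32_t ptype_inner_ipv4)
{
	ptype_table[0] = 0;
	ptype_table[SKETCH_IDX_IPV4] = ptype_ipv4;
	ptype_table[SKETCH_IDX_TUNNEL] = 0;
	ptype_table[SKETCH_IDX_TUNNEL | SKETCH_IDX_IPV4] =
		ptype_ipv4 | ptype_inner_ipv4;
}

/* Hot path: a single load instead of per-flag tests and arithmetic. */
static uint32_t
cqe_to_pkt_type(uint32_t idx)
{
	return ptype_table[idx & 0x3u];
}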

Moti Haimovsky (6):
  net/mlx4: add simple Tx bypassing ibverbs
  net/mlx4: get back Rx flow functionality
  net/mlx4: support multi-segments Tx
  net/mlx4: get back Tx checksum offloads
  net/mlx4: get back Rx checksum offloads
  net/mlx4: add loopback Tx from VF

Vasily Philipov (1):
  net/mlx4: support multi-segments Rx

 drivers/net/mlx4/mlx4.c        |  11 +
 drivers/net/mlx4/mlx4.h        |  13 +-
 drivers/net/mlx4/mlx4_ethdev.c |  10 +
 drivers/net/mlx4/mlx4_prm.h    | 129 +++++++
 drivers/net/mlx4/mlx4_rxq.c    | 181 ++++++----
 drivers/net/mlx4/mlx4_rxtx.c   | 788 ++++++++++++++++++++++++++++++-----------
 drivers/net/mlx4/mlx4_rxtx.h   |  61 ++--
 drivers/net/mlx4/mlx4_txq.c    | 104 +++++-
 drivers/net/mlx4/mlx4_utils.h  |  20 ++
 mk/rte.app.mk                  |   2 +-
 10 files changed, 990 insertions(+), 329 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v3 1/7] net/mlx4: add simple Tx bypassing ibverbs
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 2/7] net/mlx4: get back Rx flow functionality Ophir Munk
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

Modify the PMD to send single-buffer packets directly to the device,
bypassing the ibv Tx post and poll routines.
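
As a small illustration of the direct-posting handshake (a sketch under
assumptions, not the driver code itself), the ownership bit written into
owner_opcode simply toggles every time the send queue wraps around:

#include <stdint.h>

#define MLX4_BIT_WQE_OWN 0x80000000u
#define MLX4_OPCODE_SEND 0x0au /* assumed opcode value for this example */

/* Compute the owner_opcode word for the WQE at the current SQ head. */
static uint32_t
wqe_owner_opcode(uint32_t sq_head, uint32_t txbb_cnt)
{
	/* HW and SW alternate the ownership bit on every queue wrap. */
	return MLX4_OPCODE_SEND |
	       ((sq_head & txbb_cnt) ? MLX4_BIT_WQE_OWN : 0);
}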

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 108 +++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 354 ++++++++++++++++++++++++++++++++-----------
 drivers/net/mlx4/mlx4_rxtx.h |  32 ++--
 drivers/net/mlx4/mlx4_txq.c  |  90 ++++++++---
 mk/rte.app.mk                |   2 +-
 5 files changed, 464 insertions(+), 122 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..6d1800a
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,108 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef RTE_PMD_MLX4_PRM_H_
+#define RTE_PMD_MLX4_PRM_H_
+
+#include <rte_byteorder.h>
+#include <rte_branch_prediction.h>
+#include <rte_atomic.h>
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/verbs.h>
+#include <infiniband/mlx4dv.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* ConnectX-3 Tx queue basic block. */
+#define MLX4_TXBB_SHIFT 6
+#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes. */
+#define MLX4_MAX_WQE_SIZE 512
+#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+
+/* Send queue stamping/invalidating information. */
+#define MLX4_SQ_STAMP_STRIDE 64
+#define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
+#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_STAMP_VAL 0x7fffffff
+
+/* Work queue element (WQE) flags. */
+#define MLX4_BIT_WQE_OWN 0x80000000
+
+#define MLX4_SIZE_TO_TXBBS(size) \
+		(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
+
+/* Send queue information. */
+struct mlx4_sq {
+	char *buf; /**< SQ buffer. */
+	char *eob; /**< End of SQ buffer */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt; /**< Num of TXBBs in the SQ (must be a power of 2). */
+	uint32_t txbb_cnt_mask; /**< txbb_cnt mask (txbb_cnt is a power of 2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db; /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
+};
+
+#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
+
+/* Completion queue information. */
+struct mlx4_cq {
+	char *buf; /**< Pointer to the completion queue buffer. */
+	uint32_t cqe_cnt; /**< Number of entries in the queue. */
+	uint32_t cqe_64:1; /**< CQ entry size is 64 bytes. */
+	uint32_t cons_index; /**< Last queue entry that was handled. */
+	uint32_t *set_ci_db; /**< Pointer to the completion queue doorbell. */
+};
+
+/*
+ * cqe = cq->buf + cons_index * cqe_size + cqe_offset
+ * Where cqe_size is 32 or 64 bytes and
+ * cqe_offset is 0 or 32 (depending on cqe_size).
+ */
+#define mlx4_get_cqe(cq, n) (__extension__({ \
+				typeof(cq) q = (cq); \
+				(q)->buf + \
+				(((n) & ((q)->cqe_cnt - 1)) << \
+				 (5 + (q)->cqe_64)) + \
+				((q)->cqe_64 << 5); \
+			    }))
+
+#endif /* RTE_PMD_MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b5e7777..c7c190d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -40,6 +40,7 @@
 #include <inttypes.h>
 #include <stdint.h>
 #include <string.h>
+#include <stdbool.h>
 
 /* Verbs headers do not support -pedantic. */
 #ifdef PEDANTIC
@@ -52,15 +53,76 @@
 
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
+#include <rte_io.h>
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
 /**
+ * Stamp a WQE so it won't be reused by the HW.
+ * Routine is used when freeing a WQE that was used by the chip or when
+ * building a WQ entry has failed, leaving partial information on the queue.
+ *
+ * @param sq
+ *   Pointer to the sq structure.
+ * @param index
+ *   Index of the freed WQE.
+ *   The number of TXBBs to stamp is not taken as a parameter; it is
+ *   read back from the WQE size stored in the control segment of the
+ *   stamped WQE.
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of Tx basic blocks (TXBBs) the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp =
+		rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+				 (!!owner << MLX4_SQ_STAMP_SHIFT));
+	char *wqe = mlx4_get_send_wqe(sq, (index & sq->txbb_cnt_mask));
+	uint32_t *ptr = (uint32_t *)wqe;
+	int i;
+	int txbbs_size;
+	int num_txbbs;
+
+	/* Extract the size from the control segment of the WQE. */
+	num_txbbs = MLX4_SIZE_TO_TXBBS((((struct mlx4_wqe_ctrl_seg *)
+					 wqe)->fence_size & 0x3f) << 4);
+	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	/* Optimize the common case when there is no wrap-around */
+	if (wqe + txbbs_size <= sq->eob) {
+		/* Stamp the freed descriptor */
+		for (i = 0;
+		     i < txbbs_size;
+		     i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor */
+		for (i = 0;
+		     i < txbbs_size;
+		     i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+			if ((char *)ptr >= sq->eob) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= RTE_BE32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
  * Manage Tx completions.
  *
  * When sending a burst, mlx4_tx_burst() posts several WRs.
@@ -80,26 +142,74 @@
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
-	int wcs_n;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
-	if (unlikely(wcs_n == 0))
+	/*
+	 * Traverse over all CQ entries reported and handle each WQ entry
+	 * reported by them.
+	 */
+	do {
+		cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cons_index);
+		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		    !!(cons_index & cq->cqe_cnt)))
+			break;
+		/*
+		 * make sure we read the CQE after we read the
+		 * ownership bit
+		 */
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      (void *)txq, cqe_err->vendor_err,
+			      cqe_err->syndrome);
+		}
+		/* Get WQE index reported in the CQE. */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* free next descriptor */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		cons_index++;
+	} while (true);
+	if (unlikely(pkts == 0))
 		return 0;
-	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
-		      (void *)txq, wcs_n);
-		return -1;
-	}
-	elts_comp -= wcs_n;
+	/*
+	 * Update CQ.
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	/*
+	 * Update the list of packets posted for transmission.
+	 */
+	elts_comp -= pkts;
 	assert(elts_comp <= txq->elts_comp);
 	/*
-	 * Assume WC status is successful as nothing can be done about it
-	 * anyway.
+	 * Assume completion status is successful as nothing can be done about
+	 * it anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -117,7 +227,7 @@
  * @return
  *   Memory pool where data is located for given mbuf.
  */
-static struct rte_mempool *
+static inline struct rte_mempool *
 mlx4_txq_mb2mp(struct rte_mbuf *buf)
 {
 	if (unlikely(RTE_MBUF_INDIRECT(buf)))
@@ -158,7 +268,7 @@
 	/* Add a new entry, register MR first. */
 	DEBUG("%p: discovered new memory pool \"%s\" (%p)",
 	      (void *)txq, mp->name, (void *)mp);
-	mr = mlx4_mp2mr(txq->priv->pd, mp);
+	mr = mlx4_mp2mr(txq->ctrl.priv->pd, mp);
 	if (unlikely(mr == NULL)) {
 		DEBUG("%p: unable to configure MR, ibv_reg_mr() failed.",
 		      (void *)txq);
@@ -183,6 +293,124 @@
 }
 
 /**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   The Tx queue to post to.
+ * @param wr
+ *   The work request to handle.
+ * @param bad_wr
+ *   The wr in case that posting had failed.
+ *
+ * @return
+ *   0 - success, negative errno value otherwise and rte_errno is set.
+ */
+static inline int
+mlx4_post_send(struct txq *txq,
+	       struct rte_mbuf *pkt,
+	       uint32_t send_flags)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t lkey;
+	uintptr_t addr;
+	int wqe_real_size;
+	int nr_txbbs;
+	int rc;
+
+	/* Calculate the needed work queue entry size for this packet. */
+	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+	/* Check that there is room for this WQE in the send queue and
+	 * that the WQE size is legal.
+	 */
+	if (likely(((sq->head - sq->tail) + nr_txbbs +
+		    sq->headroom_txbbs >= sq->txbb_cnt) ||
+		    nr_txbbs > MLX4_MAX_WQE_TXBBS)) {
+		rc = ENOSPC;
+		goto err;
+	}
+	/* Get the control and single-data entries of the WQE */
+	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)(((char *)ctrl) +
+		sizeof(struct mlx4_wqe_ctrl_seg));
+	/*
+	 * Fill the data segment with buffer information.
+	 */
+	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
+	rte_prefetch0((volatile void *)addr);
+	dseg->addr = rte_cpu_to_be_64(addr);
+	/* Memory region key for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		/* MR does not exist. */
+		DEBUG("%p: unable to get MP <-> MR"
+		      " association", (void *)txq);
+		/*
+		 * Restamp entry in case of failure.
+		 * Make sure that size is written correctly.
+		 * Note that we give ownership to the SW, not the HW.
+		 */
+		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+		mlx4_txq_stamp_freed_wqe(sq, head_idx,
+					 (sq->head & sq->txbb_cnt) ? 0 : 1);
+		rc = EFAULT;
+		goto err;
+	}
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * Need a barrier here before writing the byte_count field to
+	 * make sure that all the data is visible before the
+	 * byte_count field is set.  Otherwise, if the segment begins
+	 * a new cacheline, the HCA prefetcher could grab the 64-byte
+	 * chunk and get a valid (!= * 0xffffffff) byte count but
+	 * stale data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+	if (likely(pkt->data_len))
+		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
+	else
+		/*
+		 * Zero length segment is treated as inline segment
+		 * with zero data.
+		 */
+		dseg->byte_count = RTE_BE32(0x80000000);
+	/*
+	 * Fill the control parameters for this packet.
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
+	 * should be calculated
+	 */
+	ctrl->srcrb_flags =
+		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
+				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+	/*
+	 * The caller should prepare "imm" in advance in order to support
+	 * VF to VF communication (when the device is a virtual-function
+	 * device (VF)).
+	 */
+	ctrl->imm = 0;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
+	 */
+	rte_wmb();
+	ctrl->owner_opcode =
+		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
+				 ((sq->head & sq->txbb_cnt) ?
+				  MLX4_BIT_WQE_OWN : 0));
+	sq->head += nr_txbbs;
+	return 0;
+err:
+	rte_errno = rc;
+	return -rc;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,13 +427,11 @@
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
-	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
 	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
+	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
@@ -229,9 +455,7 @@
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		struct ibv_send_wr *wr = &elt->wr;
 		unsigned int segs = buf->nb_segs;
-		unsigned int sent_size = 0;
 		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
@@ -254,93 +478,43 @@
 		if (unlikely(--elts_comp_cd == 0)) {
 			elts_comp_cd = txq->elts_comp_cd_init;
 			++elts_comp;
-			send_flags |= IBV_SEND_SIGNALED;
+			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
 		}
 		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
+			/* Update element. */
+			elt->buf = buf;
+			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+			/* post the pkt for sending */
+			err = mlx4_post_send(txq, buf, send_flags);
+			if (unlikely(err)) {
+				if (unlikely(send_flags &
+					     MLX4_WQE_CTRL_CQ_UPDATE)) {
+					elts_comp_cd = 1;
+					--elts_comp;
+				}
 				elt->buf = NULL;
 				goto stop;
 			}
-			/* Update element. */
 			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
+			bytes_sent += buf->pkt_len;
 		} else {
-			err = -1;
+			err = -EINVAL;
+			rte_errno = -err;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
 		elts_head = elts_head_next;
-		/* Increment sent bytes counter. */
-		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
-	/* Increment sent packets counter. */
+	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
+	txq->stats.obytes += bytes_sent;
+	/* Make sure that descriptors are written before doorbell record. */
+	rte_wmb();
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
 	txq->elts_comp_cd = elts_comp_cd;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fec998a..b515472 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -40,6 +40,7 @@
 #ifdef PEDANTIC
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
+#include <infiniband/mlx4dv.h>
 #include <infiniband/verbs.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
@@ -50,6 +51,7 @@
 #include <rte_mempool.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 
 /** Rx queue counters. */
 struct mlx4_rxq_stats {
@@ -85,8 +87,6 @@ struct rxq {
 
 /** Tx element. */
 struct txq_elt {
-	struct ibv_send_wr wr; /* Work request. */
-	struct ibv_sge sge; /* Scatter/gather element. */
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
@@ -100,24 +100,28 @@ struct mlx4_txq_stats {
 
 /** Tx queue descriptor. */
 struct txq {
-	struct priv *priv; /**< Back pointer to private data. */
-	struct {
-		const struct rte_mempool *mp; /**< Cached memory pool. */
-		struct ibv_mr *mr; /**< Memory region (for mp). */
-		uint32_t lkey; /**< mr->lkey copy. */
-	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
-	struct ibv_cq *cq; /**< Completion queue. */
-	struct ibv_qp *qp; /**< Queue pair. */
-	uint32_t max_inline; /**< Max inline send size. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	struct txq_elt (*elts)[]; /**< Tx elements. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
 	unsigned int elts_comp; /**< Number of completion requests. */
 	unsigned int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
+	unsigned int elts_n; /**< (*elts)[] length. */
+	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
-	unsigned int socket; /**< CPU socket ID for allocations. */
+	uint32_t max_inline; /**< Max inline send size. */
+	struct {
+		const struct rte_mempool *mp; /**< Cached memory pool. */
+		struct ibv_mr *mr; /**< Memory region (for mp). */
+		uint32_t lkey; /**< mr->lkey copy. */
+	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
+	struct {
+		struct priv *priv; /**< Back pointer to private data. */
+		unsigned int socket; /**< CPU socket ID for allocations. */
+		struct ibv_cq *cq; /**< Completion queue. */
+		struct ibv_qp *qp; /**< Queue pair. */
+	} ctrl;
 };
 
 /* mlx4_rxq.c */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index e0245b0..492779f 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -62,6 +62,7 @@
 #include "mlx4_autoconf.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
+#include "mlx4_prm.h"
 
 /**
  * Allocate Tx queue elements.
@@ -79,7 +80,7 @@
 {
 	unsigned int i;
 	struct txq_elt (*elts)[elts_n] =
-		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->socket);
+		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->ctrl.socket);
 	int ret = 0;
 
 	if (elts == NULL) {
@@ -170,10 +171,10 @@
 
 	DEBUG("cleaning up %p", (void *)txq);
 	mlx4_txq_free_elts(txq);
-	if (txq->qp != NULL)
-		claim_zero(ibv_destroy_qp(txq->qp));
-	if (txq->cq != NULL)
-		claim_zero(ibv_destroy_cq(txq->cq));
+	if (txq->ctrl.qp != NULL)
+		claim_zero(ibv_destroy_qp(txq->ctrl.qp));
+	if (txq->ctrl.cq != NULL)
+		claim_zero(ibv_destroy_cq(txq->ctrl.cq));
 	for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) {
 		if (txq->mp2mr[i].mp == NULL)
 			break;
@@ -242,6 +243,42 @@ struct txq_mp2mr_mbuf_check_data {
 }
 
 /**
+ * Retrieves information needed in order to directly access the Tx queue.
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param mlxdv
+ *   Pointer to device information for this Tx queue.
+ */
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+	/* Total SQ length, including headroom and spare WQEs. */
+	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+
+	sq->buf = ((char *)dqp->buf.buf) + dqp->sq.offset;
+	/* Total length, including headroom and spare WQEs. */
+	sq->eob = sq->buf + sq_size;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_cnt =
+		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs =
+		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
+	cq->buf = dcq->buf.buf;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_64 = (dcq->cqe_size & 64) ? 1 : 0;
+}
+
+/**
  * Configure a Tx queue.
  *
  * @param dev
@@ -263,9 +300,15 @@ struct txq_mp2mr_mbuf_check_data {
 	       unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
+
 	struct txq tmpl = {
-		.priv = priv,
-		.socket = socket
+		.ctrl = {
+			.priv = priv,
+			.socket = socket
+		},
 	};
 	union {
 		struct ibv_qp_init_attr init;
@@ -284,8 +327,8 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	/* MRs will be registered in mp2mr[] later. */
-	tmpl.cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
-	if (tmpl.cq == NULL) {
+	tmpl.ctrl.cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
+	if (tmpl.ctrl.cq == NULL) {
 		rte_errno = ENOMEM;
 		ERROR("%p: CQ creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -297,9 +340,9 @@ struct txq_mp2mr_mbuf_check_data {
 	      priv->device_attr.max_sge);
 	attr.init = (struct ibv_qp_init_attr){
 		/* CQ to be associated with the send queue. */
-		.send_cq = tmpl.cq,
+		.send_cq = tmpl.ctrl.cq,
 		/* CQ to be associated with the receive queue. */
-		.recv_cq = tmpl.cq,
+		.recv_cq = tmpl.ctrl.cq,
 		.cap = {
 			/* Max number of outstanding WRs. */
 			.max_send_wr = ((priv->device_attr.max_qp_wr < desc) ?
@@ -316,8 +359,8 @@ struct txq_mp2mr_mbuf_check_data {
 		 */
 		.sq_sig_all = 0,
 	};
-	tmpl.qp = ibv_create_qp(priv->pd, &attr.init);
-	if (tmpl.qp == NULL) {
+	tmpl.ctrl.qp = ibv_create_qp(priv->pd, &attr.init);
+	if (tmpl.ctrl.qp == NULL) {
 		rte_errno = errno ? errno : EINVAL;
 		ERROR("%p: QP creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -331,7 +374,8 @@ struct txq_mp2mr_mbuf_check_data {
 		/* Primary port number. */
 		.port_num = priv->port
 	};
-	ret = ibv_modify_qp(tmpl.qp, &attr.mod, IBV_QP_STATE | IBV_QP_PORT);
+	ret = ibv_modify_qp(tmpl.ctrl.qp, &attr.mod,
+			    IBV_QP_STATE | IBV_QP_PORT);
 	if (ret) {
 		rte_errno = ret;
 		ERROR("%p: QP state to IBV_QPS_INIT failed: %s",
@@ -348,7 +392,7 @@ struct txq_mp2mr_mbuf_check_data {
 	attr.mod = (struct ibv_qp_attr){
 		.qp_state = IBV_QPS_RTR
 	};
-	ret = ibv_modify_qp(tmpl.qp, &attr.mod, IBV_QP_STATE);
+	ret = ibv_modify_qp(tmpl.ctrl.qp, &attr.mod, IBV_QP_STATE);
 	if (ret) {
 		rte_errno = ret;
 		ERROR("%p: QP state to IBV_QPS_RTR failed: %s",
@@ -356,7 +400,7 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	attr.mod.qp_state = IBV_QPS_RTS;
-	ret = ibv_modify_qp(tmpl.qp, &attr.mod, IBV_QP_STATE);
+	ret = ibv_modify_qp(tmpl.ctrl.qp, &attr.mod, IBV_QP_STATE);
 	if (ret) {
 		rte_errno = ret;
 		ERROR("%p: QP state to IBV_QPS_RTS failed: %s",
@@ -370,6 +414,18 @@ struct txq_mp2mr_mbuf_check_data {
 	DEBUG("%p: txq updated with %p", (void *)txq, (void *)&tmpl);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
+	/* Retrieve device Q info */
+	mlxdv.cq.in = txq->ctrl.cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->ctrl.qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: Failed to obtain information needed for "
+		      "accessing the device queues", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	return 0;
 error:
 	ret = rte_errno;
@@ -459,7 +515,7 @@ struct txq_mp2mr_mbuf_check_data {
 
 	if (txq == NULL)
 		return;
-	priv = txq->priv;
+	priv = txq->ctrl.priv;
 	for (i = 0; i != priv->dev->data->nb_tx_queues; ++i)
 		if (priv->dev->data->tx_queues[i] == txq) {
 			DEBUG("%p: removing Tx queue %p from list",
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index c25fdd9..2f1286e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -128,7 +128,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
 endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LIO_PMD)        += -lrte_pmd_lio
-_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs -lmlx4
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -libverbs
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NFP_PMD)        += -lrte_pmd_nfp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_NULL)       += -lrte_pmd_null
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 2/7] net/mlx4: get back Rx flow functionality
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 1/7] net/mlx4: add simple Tx " Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 3/7] net/mlx4: support multi-segments Rx Ophir Munk
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky,
	Vasily Philipov, Ophir Munk

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for accessing the hardware directly when
handling Rx packets, eliminating the need to use verbs in the Rx
datapath.

The number of scatters is limited to one.
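
Illustration (not part of the patch): the datapath repeatedly tests a CQE
ownership bit against a phase bit derived from the consumer index; this is
the same check used by mlx4_txq_complete() and mlx4_cq_poll_one() in this
series. A minimal standalone C sketch of that predicate, with illustrative
names and an assumed owner-bit position:

#include <stdint.h>
#include <stdio.h>

/* Toy CQE holding only the owner/opcode byte; the real layout is larger. */
struct toy_cqe {
	uint8_t owner_sr_opcode;
};

#define TOY_CQE_OWNER_MASK 0x80 /* assumed owner-bit position */

/*
 * A CQE is ready for software when its owner bit equals the phase bit
 * (cons_index & cqe_cnt); cqe_cnt must be a power of two, so the phase
 * flips once per wrap of the CQ ring.
 */
static int
toy_cqe_owned_by_sw(const struct toy_cqe *cqe, uint32_t cons_index,
		    uint32_t cqe_cnt)
{
	return !(!!(cqe->owner_sr_opcode & TOY_CQE_OWNER_MASK) ^
		 !!(cons_index & cqe_cnt));
}

int
main(void)
{
	struct toy_cqe cqe = { .owner_sr_opcode = 0 };
	const uint32_t cqe_cnt = 4; /* 4-entry CQ ring */

	/* First pass (phase 0): owner bit 0 makes the predicate true. */
	printf("pass 0 ready: %d\n", toy_cqe_owned_by_sw(&cqe, 1, cqe_cnt));
	/* Second pass (phase 1): the same owner bit now reads as "not ready". */
	printf("pass 1 ready: %d\n", toy_cqe_owned_by_sw(&cqe, 5, cqe_cnt));
	return 0;
}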

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/mlx4/mlx4.h       |  11 ---
 drivers/net/mlx4/mlx4_rxq.c   | 149 ++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.c  | 225 ++++++++++++++++++++++++------------------
 drivers/net/mlx4/mlx4_rxtx.h  |  18 ++--
 drivers/net/mlx4/mlx4_utils.h |  20 ++++
 5 files changed, 242 insertions(+), 181 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 93e5502..b6e1ef2 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -57,17 +57,6 @@
 /* Maximum size for inline data. */
 #define MLX4_PMD_MAX_INLINE 0
 
-/*
- * Maximum number of cached Memory Pools (MPs) per TX queue. Each RTE MP
- * from which buffers are to be transmitted will have to be mapped by this
- * driver to their own Memory Region (MR). This is a slow operation.
- *
- * This value is always 1 for RX queues.
- */
-#ifndef MLX4_PMD_TX_MP_CACHE
-#define MLX4_PMD_TX_MP_CACHE 8
-#endif
-
 /* Interrupt alarm timeout value in microseconds. */
 #define MLX4_INTR_ALARM_TIMEOUT 100000
 
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 409983f..cb18f20 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -51,6 +51,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
 
+#include <rte_byteorder.h>
 #include <rte_common.h>
 #include <rte_errno.h>
 #include <rte_ethdev.h>
@@ -77,60 +78,59 @@
 mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 {
 	unsigned int i;
-	struct rxq_elt (*elts)[elts_n] =
-		rte_calloc_socket("RXQ elements", 1, sizeof(*elts), 0,
-				  rxq->socket);
+	struct rte_mbuf *(*elts)[elts_n] =
+		rte_calloc_socket("RXQ", 1, sizeof(*elts), 0, rxq->socket);
 
 	if (elts == NULL) {
+		elts_n = 0;
 		rte_errno = ENOMEM;
 		ERROR("%p: can't allocate packets array", (void *)rxq);
 		goto error;
 	}
-	/* For each WR (packet). */
-	for (i = 0; (i != elts_n); ++i) {
-		struct rxq_elt *elt = &(*elts)[i];
-		struct ibv_recv_wr *wr = &elt->wr;
-		struct ibv_sge *sge = &(*elts)[i].sge;
-		struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);
+	rxq->elts = elts;
+	for (i = 0; i != elts_n; ++i) {
+		struct rte_mbuf *buf;
+		volatile struct mlx4_wqe_data_seg *scat =
+			&(*rxq->hw.wqes)[i];
 
+		buf = rte_pktmbuf_alloc(rxq->mp);
 		if (buf == NULL) {
 			rte_errno = ENOMEM;
 			ERROR("%p: empty mbuf pool", (void *)rxq);
 			goto error;
 		}
-		elt->buf = buf;
-		wr->next = &(*elts)[(i + 1)].wr;
-		wr->sg_list = sge;
-		wr->num_sge = 1;
 		/* Headroom is reserved by rte_pktmbuf_alloc(). */
 		assert(buf->data_off == RTE_PKTMBUF_HEADROOM);
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
-		/* sge->addr must be able to store a pointer. */
-		assert(sizeof(sge->addr) >= sizeof(uintptr_t));
-		/* SGE keeps its headroom. */
-		sge->addr = (uintptr_t)
-			((uint8_t *)buf->buf_addr + RTE_PKTMBUF_HEADROOM);
-		sge->length = (buf->buf_len - RTE_PKTMBUF_HEADROOM);
-		sge->lkey = rxq->mr->lkey;
-		/* Redundant check for tailroom. */
-		assert(sge->length == rte_pktmbuf_tailroom(buf));
+		assert(!buf->next);
+		buf->port = rxq->port_id;
+		buf->data_len = rte_pktmbuf_tailroom(buf);
+		buf->pkt_len = rte_pktmbuf_tailroom(buf);
+		buf->nb_segs = 1;
+		/* scat->addr must be able to store a pointer. */
+		assert(sizeof(scat->addr) >= sizeof(uintptr_t));
+		*scat = (struct mlx4_wqe_data_seg){
+			.addr =
+			    rte_cpu_to_be_64(rte_pktmbuf_mtod(buf, uintptr_t)),
+			.byte_count = rte_cpu_to_be_32(buf->data_len),
+			.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
+		};
+		(*rxq->elts)[i] = buf;
 	}
-	/* The last WR pointer must be NULL. */
-	(*elts)[(i - 1)].wr.next = NULL;
 	DEBUG("%p: allocated and configured %u single-segment WRs",
 	      (void *)rxq, elts_n);
-	rxq->elts_n = elts_n;
-	rxq->elts_head = 0;
-	rxq->elts = elts;
+	rxq->elts_n = log2above(elts_n);
 	return 0;
 error:
-	if (elts != NULL) {
-		for (i = 0; (i != RTE_DIM(*elts)); ++i)
-			rte_pktmbuf_free_seg((*elts)[i].buf);
-		rte_free(elts);
+	for (i = 0; i != elts_n; ++i) {
+		if ((*rxq->elts)[i] != NULL)
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
+		(*rxq->elts)[i] = NULL;
 	}
+	rte_free(rxq->elts);
+	rxq->elts = NULL;
 	DEBUG("%p: failed, freed everything", (void *)rxq);
 	assert(rte_errno > 0);
 	return -rte_errno;
@@ -146,17 +146,18 @@
 mlx4_rxq_free_elts(struct rxq *rxq)
 {
 	unsigned int i;
-	unsigned int elts_n = rxq->elts_n;
-	struct rxq_elt (*elts)[elts_n] = rxq->elts;
 
 	DEBUG("%p: freeing WRs", (void *)rxq);
+	if (rxq->elts == NULL)
+		return;
+
+	for (i = 0; i != (1u << rxq->elts_n); ++i) {
+		if ((*rxq->elts)[i] != NULL)
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
+	}
+	rte_free(rxq->elts);
 	rxq->elts_n = 0;
 	rxq->elts = NULL;
-	if (elts == NULL)
-		return;
-	for (i = 0; (i != RTE_DIM(*elts)); ++i)
-		rte_pktmbuf_free_seg((*elts)[i].buf);
-	rte_free(elts);
 }
 
 /**
@@ -248,32 +249,35 @@
 	       struct rte_mempool *mp)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct rxq tmpl = {
 		.priv = priv,
 		.mp = mp,
 		.socket = socket
 	};
 	struct ibv_qp_attr mod;
-	struct ibv_recv_wr *bad_wr;
 	unsigned int mb_len;
 	int ret;
 
 	(void)conf; /* Thresholds configuration (ignored). */
 	mb_len = rte_pktmbuf_data_room_size(mp);
-	if (desc == 0) {
-		rte_errno = EINVAL;
-		ERROR("%p: invalid number of Rx descriptors", (void *)dev);
-		goto error;
-	}
 	/* Enable scattered packets support for this queue if necessary. */
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
 	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
 		;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
-		WARN("%p: scattered mode has been requested but is"
-		     " not supported, this may lead to packet loss",
-		     (void *)dev);
+		unsigned int rx_pkt_len =
+				dev->data->dev_conf.rxmode.jumbo_frame ?
+				dev->data->dev_conf.rxmode.max_rx_pkt_len :
+				ETHER_MTU;
+
+		if (rx_pkt_len < ETHER_MTU)
+			rx_pkt_len = ETHER_MTU;
+		/* Only the first mbuf has a headroom */
+		rx_pkt_len = rx_pkt_len - mb_len + RTE_PKTMBUF_HEADROOM;
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -336,21 +340,6 @@
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
-	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
-	if (ret) {
-		ERROR("%p: RXQ allocation failed: %s",
-		      (void *)dev, strerror(rte_errno));
-		goto error;
-	}
-	ret = ibv_post_recv(tmpl.qp, &(*tmpl.elts)[0].wr, &bad_wr);
-	if (ret) {
-		rte_errno = ret;
-		ERROR("%p: ibv_post_recv() failed for WR %p: %s",
-		      (void *)dev,
-		      (void *)bad_wr,
-		      strerror(rte_errno));
-		goto error;
-	}
 	mod = (struct ibv_qp_attr){
 		.qp_state = IBV_QPS_RTR
 	};
@@ -361,14 +350,44 @@
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Get HW-dependent info. */
+	mlxdv.cq.in = tmpl.cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = tmpl.qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: Failed to retrieve device obj info", (void *)dev);
+		goto error;
+	}
+	/* Init HW-dependent fields. */
+	tmpl.hw.wqes =
+		(volatile struct mlx4_wqe_data_seg (*)[])
+		((char *)dv_qp.buf.buf + dv_qp.rq.offset);
+	tmpl.hw.rq_db = dv_qp.rdb;
+	tmpl.hw.rq_ci = 0;
+	tmpl.mcq.buf = dv_cq.buf.buf;
+	tmpl.mcq.cqe_cnt = dv_cq.cqe_cnt;
+	tmpl.mcq.set_ci_db = dv_cq.set_ci_db;
+	tmpl.mcq.cqe_64 = (dv_cq.cqe_size & 64) ? 1 : 0;
 	/* Save port ID. */
 	tmpl.port_id = dev->data->port_id;
 	DEBUG("%p: RTE port ID: %u", (void *)rxq, tmpl.port_id);
+	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
+	if (ret) {
+		ERROR("%p: RXQ allocation failed: %s",
+		      (void *)dev, strerror(rte_errno));
+		goto error;
+	}
 	/* Clean up rxq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old rxq just in case", (void *)rxq);
 	mlx4_rxq_cleanup(rxq);
 	*rxq = tmpl;
 	DEBUG("%p: rxq updated with %p", (void *)rxq, (void *)&tmpl);
+	/* Update doorbell counter. */
+	rxq->hw.rq_ci = desc;
+	rte_wmb();
+	*rxq->hw.rq_db = rte_cpu_to_be_32(rxq->hw.rq_ci);
 	return 0;
 error:
 	ret = rte_errno;
@@ -406,6 +425,12 @@
 	struct rxq *rxq = dev->data->rx_queues[idx];
 	int ret;
 
+	if (!rte_is_power_of_2(desc)) {
+		desc = 1 << log2above(desc);
+		WARN("%p: increased number of descriptors in RX queue %u"
+		     " to the next power of two (%d)",
+		     (void *)dev, idx, desc);
+	}
 	DEBUG("%p: configuring queue %u for %u descriptors",
 	      (void *)dev, idx, desc);
 	if (idx >= dev->data->nb_rx_queues) {
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index c7c190d..b4391bf 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -522,9 +522,45 @@
 }
 
 /**
- * DPDK callback for Rx.
+ * Poll one CQE from CQ.
  *
- * The following function doesn't manage scattered packets.
+ * @param rxq
+ *   Pointer to the receive queue structure.
+ * @param[out] out
+ *   Just polled CQE.
+ *
+ * @return
+ *   Number of bytes of the completed packet, 0 if there is no completion.
+ */
+static unsigned int
+mlx4_cq_poll_one(struct rxq *rxq,
+		 struct mlx4_cqe **out)
+{
+	int ret = 0;
+	struct mlx4_cqe *cqe = NULL;
+	struct mlx4_cq *cq = &rxq->mcq;
+
+	cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cq->cons_index);
+	if (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+	    !!(cq->cons_index & cq->cqe_cnt))
+		goto out;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rte_rmb();
+	assert(!(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK));
+	assert((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) !=
+	       MLX4_CQE_OPCODE_ERROR);
+	ret = rte_be_to_cpu_32(cqe->byte_cnt);
+	++cq->cons_index;
+out:
+	*out = cqe;
+	return ret;
+}
+
+/**
+ * DPDK callback for RX with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -539,112 +575,105 @@
 uint16_t
 mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
-	struct rxq *rxq = (struct rxq *)dpdk_rxq;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
-	const unsigned int elts_n = rxq->elts_n;
-	unsigned int elts_head = rxq->elts_head;
-	struct ibv_wc wcs[pkts_n];
-	struct ibv_recv_wr *wr_head = NULL;
-	struct ibv_recv_wr **wr_next = &wr_head;
-	struct ibv_recv_wr *wr_bad = NULL;
-	unsigned int i;
-	unsigned int pkts_ret = 0;
-	int ret;
+	struct rxq *rxq = dpdk_rxq;
+	const unsigned int wr_cnt = (1 << rxq->elts_n) - 1;
+	struct rte_mbuf *pkt = NULL;
+	struct rte_mbuf *seg = NULL;
+	unsigned int i = 0;
+	unsigned int rq_ci = (rxq->hw.rq_ci);
+	int len = 0;
 
-	ret = ibv_poll_cq(rxq->cq, pkts_n, wcs);
-	if (unlikely(ret == 0))
-		return 0;
-	if (unlikely(ret < 0)) {
-		DEBUG("rxq=%p, ibv_poll_cq() failed (wc_n=%d)",
-		      (void *)rxq, ret);
-		return 0;
-	}
-	assert(ret <= (int)pkts_n);
-	/* For each work completion. */
-	for (i = 0; i != (unsigned int)ret; ++i) {
-		struct ibv_wc *wc = &wcs[i];
-		struct rxq_elt *elt = &(*elts)[elts_head];
-		struct ibv_recv_wr *wr = &elt->wr;
-		uint32_t len = wc->byte_len;
-		struct rte_mbuf *seg = elt->buf;
-		struct rte_mbuf *rep;
+	while (pkts_n) {
+		struct mlx4_cqe *cqe;
+		unsigned int idx = rq_ci & wr_cnt;
+		struct rte_mbuf *rep = (*rxq->elts)[idx];
+		volatile struct mlx4_wqe_data_seg *scat =
+					&(*rxq->hw.wqes)[idx];
 
-		/* Sanity checks. */
-		assert(wr->sg_list == &elt->sge);
-		assert(wr->num_sge == 1);
-		assert(elts_head < rxq->elts_n);
-		assert(rxq->elts_head < rxq->elts_n);
-		/*
-		 * Fetch initial bytes of packet descriptor into a
-		 * cacheline while allocating rep.
-		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
-		/* Link completed WRs together for repost. */
-		*wr_next = wr;
-		wr_next = &wr->next;
-		if (unlikely(wc->status != IBV_WC_SUCCESS)) {
-			/* Whatever, just repost the offending WR. */
-			DEBUG("rxq=%p: bad work completion status (%d): %s",
-			      (void *)rxq, wc->status,
-			      ibv_wc_status_str(wc->status));
-			/* Increment dropped packets counter. */
-			++rxq->stats.idropped;
-			goto repost;
-		}
+		/* Update the 'next' pointer of the previous segment. */
+		if (pkt)
+			seg->next = rep;
+		seg = rep;
+		rte_prefetch0(seg);
+		rte_prefetch0(scat);
 		rep = rte_mbuf_raw_alloc(rxq->mp);
 		if (unlikely(rep == NULL)) {
-			/*
-			 * Unable to allocate a replacement mbuf,
-			 * repost WR.
-			 */
-			DEBUG("rxq=%p: can't allocate a new mbuf",
-			      (void *)rxq);
-			/* Increase out of memory counters. */
 			++rxq->stats.rx_nombuf;
-			++rxq->priv->dev->data->rx_mbuf_alloc_failed;
-			goto repost;
+			if (!pkt) {
+				/*
+				 * No buffers before we even started,
+				 * bail out silently.
+				 */
+				break;
+			}
+			while (pkt != seg) {
+				assert(pkt != (*rxq->elts)[idx]);
+				rep = pkt->next;
+				pkt->next = NULL;
+				pkt->nb_segs = 1;
+				rte_mbuf_raw_free(pkt);
+				pkt = rep;
+			}
+			break;
+		}
+		if (!pkt) {
+			/* Looking for the new packet */
+			len = mlx4_cq_poll_one(rxq, &cqe);
+			if (!len) {
+				rte_mbuf_raw_free(rep);
+				break;
+			}
+			if (unlikely(len < 0)) {
+				/* RX error, packet is likely too large. */
+				rte_mbuf_raw_free(rep);
+				++rxq->stats.idropped;
+				goto skip;
+			}
+			pkt = seg;
+			pkt->packet_type = 0;
+			pkt->ol_flags = 0;
+			pkt->pkt_len = len;
 		}
-		/* Reconfigure sge to use rep instead of seg. */
-		elt->sge.addr = (uintptr_t)rep->buf_addr + RTE_PKTMBUF_HEADROOM;
-		assert(elt->sge.lkey == rxq->mr->lkey);
-		elt->buf = rep;
-		/* Update seg information. */
-		seg->data_off = RTE_PKTMBUF_HEADROOM;
-		seg->nb_segs = 1;
-		seg->port = rxq->port_id;
-		seg->next = NULL;
-		seg->pkt_len = len;
+		rep->nb_segs = 1;
+		rep->port = rxq->port_id;
+		rep->data_len = seg->data_len;
+		rep->data_off = seg->data_off;
+		(*rxq->elts)[idx] = rep;
+		/*
+		 * Fill NIC descriptor with the new buffer. The lkey and size
+		 * of the buffers are already known, only the buffer address
+		 * changes.
+		 */
+		scat->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
+		if (len > seg->data_len) {
+			len -= seg->data_len;
+			++pkt->nb_segs;
+			++rq_ci;
+			continue;
+		}
+		/* The last segment. */
 		seg->data_len = len;
-		seg->packet_type = 0;
-		seg->ol_flags = 0;
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += pkt->pkt_len;
 		/* Return packet. */
-		*(pkts++) = seg;
-		++pkts_ret;
-		/* Increase bytes counter. */
-		rxq->stats.ibytes += len;
-repost:
-		if (++elts_head >= elts_n)
-			elts_head = 0;
-		continue;
+		*(pkts++) = pkt;
+		pkt = NULL;
+		--pkts_n;
+		++i;
+skip:
+		++rq_ci;
 	}
-	if (unlikely(i == 0))
+	if (unlikely(i == 0 && rq_ci == rxq->hw.rq_ci))
 		return 0;
-	/* Repost WRs. */
-	*wr_next = NULL;
-	assert(wr_head);
-	ret = ibv_post_recv(rxq->qp, wr_head, &wr_bad);
-	if (unlikely(ret)) {
-		/* Inability to repost WRs is fatal. */
-		DEBUG("%p: recv_burst(): failed (ret=%d)",
-		      (void *)rxq->priv,
-		      ret);
-		abort();
-	}
-	rxq->elts_head = elts_head;
-	/* Increase packets counter. */
-	rxq->stats.ipackets += pkts_ret;
-	return pkts_ret;
+	/* Update the consumer index. */
+	rxq->hw.rq_ci = rq_ci;
+	rte_wmb();
+	*rxq->hw.rq_db = rte_cpu_to_be_32(rxq->hw.rq_ci);
+	*rxq->mcq.set_ci_db =
+		rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+	return i;
 }
 
 /**
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index b515472..fa2481c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -62,13 +62,6 @@ struct mlx4_rxq_stats {
 	uint64_t rx_nombuf; /**< Total of Rx mbuf allocation failures. */
 };
 
-/** Rx element. */
-struct rxq_elt {
-	struct ibv_recv_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
-	struct rte_mbuf *buf; /**< Buffer. */
-};
-
 /** Rx queue descriptor. */
 struct rxq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -78,9 +71,14 @@ struct rxq {
 	struct ibv_qp *qp; /**< Queue pair. */
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
 	unsigned int port_id; /**< Port ID for incoming packets. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	struct rxq_elt (*elts)[]; /**< Rx elements. */
+	unsigned int elts_n; /**< Log 2 of Mbufs. */
+	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
+	struct {
+		volatile struct mlx4_wqe_data_seg(*wqes)[];
+		volatile uint32_t *rq_db;
+		uint16_t rq_ci;
+	} hw;
+	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
diff --git a/drivers/net/mlx4/mlx4_utils.h b/drivers/net/mlx4/mlx4_utils.h
index 0fbdc71..d6f729f 100644
--- a/drivers/net/mlx4/mlx4_utils.h
+++ b/drivers/net/mlx4/mlx4_utils.h
@@ -108,4 +108,24 @@
 
 int mlx4_fd_set_non_blocking(int fd);
 
+/**
+ * Return the logarithm of the nearest power of two above input value.
+ *
+ * @param v
+ *   Input value.
+ *
+ * @return
+ *   Logarithm of the nearest power of two above input value.
+ */
+static inline unsigned int
+log2above(unsigned int v)
+{
+	unsigned int l;
+	unsigned int r;
+
+	for (l = 0, r = 0; (v >> 1); ++l, v >>= 1)
+		r |= (v & 1);
+	return l + r;
+}
+
 #endif /* MLX4_UTILS_H_ */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 3/7] net/mlx4: support multi-segments Rx
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 1/7] net/mlx4: add simple Tx " Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 2/7] net/mlx4: get back Rx flow functionality Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 4/7] net/mlx4: support multi-segments Tx Ophir Munk
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Vasily Philipov,
	Ophir Munk

From: Vasily Philipov <vasilyf@mellanox.com>

Access the hardware directly on the Rx fast path without verbs calls.

Now the number of scatters is calculated on the fly according to the
maximum expected packet size.
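
Illustration (not part of the patch): a standalone sketch of the arithmetic
the Rx queue setup code now performs, using hypothetical example values
(2048-byte mbuf data room, 128-byte headroom, 9000-byte jumbo frames and
512 requested descriptors); log2above() is copied from the helper this
series adds to mlx4_utils.h.

#include <stdio.h>

/* Hypothetical example values; the real ones come from the mempool and rxmode. */
#define EX_MB_LEN      2048u /* rte_pktmbuf_data_room_size(mp) */
#define EX_HEADROOM     128u /* RTE_PKTMBUF_HEADROOM */
#define EX_MAX_RX_PKT  9000u /* rxmode.max_rx_pkt_len, jumbo frames enabled */

/* Same logic as the log2above() helper added to mlx4_utils.h. */
static unsigned int
log2above(unsigned int v)
{
	unsigned int l;
	unsigned int r;

	for (l = 0, r = 0; (v >> 1); ++l, v >>= 1)
		r |= (v & 1);
	return l + r;
}

int
main(void)
{
	unsigned int desc = 512; /* requested Rx descriptors */
	unsigned int rx_pkt_len = EX_MAX_RX_PKT;
	unsigned int sge_n;
	unsigned int log_sge_n;

	/* Only the first mbuf of a packet keeps its headroom. */
	rx_pkt_len = rx_pkt_len - EX_MB_LEN + EX_HEADROOM;      /* 7080 */
	/* Mbufs for the remainder, rounded up, plus the first mbuf. */
	sge_n = (rx_pkt_len / EX_MB_LEN) +
		!!(rx_pkt_len % EX_MB_LEN) + 1;                 /* 5 */
	log_sge_n = log2above(sge_n);                           /* 3, i.e. 8 SGEs */
	/* Each packet consumes 2^log_sge_n WQ entries. */
	desc >>= log_sge_n;                                     /* 512 -> 64 WRs */
	printf("sge_n=%u (rounded to %u), desc=%u\n",
	       sge_n, 1u << log_sge_n, desc);
	return 0;
}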

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
This commit is a split from a previous commit
"net/mlx4: get back Rx flow functionality"

 drivers/net/mlx4/mlx4_rxq.c  | 29 ++++++++++++++++++++++-------
 drivers/net/mlx4/mlx4_rxtx.c | 10 +++++++---
 drivers/net/mlx4/mlx4_rxtx.h |  1 +
 3 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index cb18f20..7d13121 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -78,6 +78,7 @@
 mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 {
 	unsigned int i;
+	const unsigned int sge_n = 1 << rxq->sge_n;
 	struct rte_mbuf *(*elts)[elts_n] =
 		rte_calloc_socket("RXQ", 1, sizeof(*elts), 0, rxq->socket);
 
@@ -105,6 +106,9 @@
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
 		assert(!buf->next);
+		/* Only the first segment keeps headroom. */
+		if (i % sge_n)
+			buf->data_off = 0;
 		buf->port = rxq->port_id;
 		buf->data_len = rte_pktmbuf_tailroom(buf);
 		buf->pkt_len = rte_pktmbuf_tailroom(buf);
@@ -119,8 +123,8 @@
 		};
 		(*rxq->elts)[i] = buf;
 	}
-	DEBUG("%p: allocated and configured %u single-segment WRs",
-	      (void *)rxq, elts_n);
+	DEBUG("%p: allocated and configured %u segments (max %u packets)",
+	      (void *)rxq, elts_n, elts_n >> rxq->sge_n);
 	rxq->elts_n = log2above(elts_n);
 	return 0;
 error:
@@ -199,7 +203,8 @@
  *   QP pointer or NULL in case of error and rte_errno is set.
  */
 static struct ibv_qp *
-mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc)
+mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq,
+		  uint16_t desc, unsigned int sge_n)
 {
 	struct ibv_qp *qp;
 	struct ibv_qp_init_attr attr = {
@@ -213,7 +218,7 @@
 					priv->device_attr.max_qp_wr :
 					desc),
 			/* Max number of scatter/gather elements in a WR. */
-			.max_recv_sge = 1,
+			.max_recv_sge = sge_n,
 		},
 		.qp_type = IBV_QPT_RAW_PACKET,
 	};
@@ -267,8 +272,9 @@
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
 	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
-		;
+		tmpl.sge_n = 0;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
+		unsigned int sge_n;
 		unsigned int rx_pkt_len =
 				dev->data->dev_conf.rxmode.jumbo_frame ?
 				dev->data->dev_conf.rxmode.max_rx_pkt_len :
@@ -278,6 +284,13 @@
 			rx_pkt_len = ETHER_MTU;
 		/* Only the first mbuf has a headroom */
 		rx_pkt_len = rx_pkt_len - mb_len + RTE_PKTMBUF_HEADROOM;
+		/*
+		 * Determine the number of SGEs needed for a full packet
+		 * and round it to the next power of two.
+		 */
+		sge_n = (rx_pkt_len / mb_len) + !!(rx_pkt_len % mb_len) + 1;
+		tmpl.sge_n = log2above(sge_n);
+		desc >>= tmpl.sge_n;
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -286,6 +299,8 @@
 		     dev->data->dev_conf.rxmode.max_rx_pkt_len,
 		     mb_len - RTE_PKTMBUF_HEADROOM);
 	}
+	DEBUG("%p: number of sges %u (%u WRs)",
+	      (void *)dev, 1 << tmpl.sge_n, desc);
 	/* Use the entire Rx mempool as the memory region. */
 	tmpl.mr = mlx4_mp2mr(priv->pd, mp);
 	if (tmpl.mr == NULL) {
@@ -321,7 +336,7 @@
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
 	      priv->device_attr.max_sge);
-	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc);
+	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc, 1 << tmpl.sge_n);
 	if (tmpl.qp == NULL) {
 		ERROR("%p: QP creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -373,7 +388,7 @@
 	/* Save port ID. */
 	tmpl.port_id = dev->data->port_id;
 	DEBUG("%p: RTE port ID: %u", (void *)rxq, tmpl.port_id);
-	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
+	ret = mlx4_rxq_alloc_elts(&tmpl, desc << tmpl.sge_n);
 	if (ret) {
 		ERROR("%p: RXQ allocation failed: %s",
 		      (void *)dev, strerror(rte_errno));
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b4391bf..f517505 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -577,10 +577,11 @@
 {
 	struct rxq *rxq = dpdk_rxq;
 	const unsigned int wr_cnt = (1 << rxq->elts_n) - 1;
+	const unsigned int sge_n = rxq->sge_n;
 	struct rte_mbuf *pkt = NULL;
 	struct rte_mbuf *seg = NULL;
 	unsigned int i = 0;
-	unsigned int rq_ci = (rxq->hw.rq_ci);
+	unsigned int rq_ci = (rxq->hw.rq_ci << sge_n);
 	int len = 0;
 
 	while (pkts_n) {
@@ -661,12 +662,15 @@
 		--pkts_n;
 		++i;
 skip:
+		/* Align consumer index to the next stride. */
+		rq_ci >>= sge_n;
 		++rq_ci;
+		rq_ci <<= sge_n;
 	}
-	if (unlikely(i == 0 && rq_ci == rxq->hw.rq_ci))
+	if (unlikely(i == 0 && (rq_ci >> sge_n) == rxq->hw.rq_ci))
 		return 0;
 	/* Update the consumer index. */
-	rxq->hw.rq_ci = rq_ci;
+	rxq->hw.rq_ci = rq_ci >> sge_n;
 	rte_wmb();
 	*rxq->hw.rq_db = rte_cpu_to_be_32(rxq->hw.rq_ci);
 	*rxq->mcq.set_ci_db =
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fa2481c..df83552 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -79,6 +79,7 @@ struct rxq {
 		uint16_t rq_ci;
 	} hw;
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
+	unsigned int sge_n; /**< Log 2 of SGEs number. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 4/7] net/mlx4: support multi-segments Tx
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
                       ` (2 preceding siblings ...)
  2017-10-04 21:49     ` [PATCH v3 3/7] net/mlx4: support multi-segments Rx Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 5/7] net/mlx4: get back Tx checksum offloads Ophir Munk
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for transmitting packets spanning multiple
buffers.
It also takes into consideration the number of entries a packet
occupies in the Tx queue when setting the chip's report-completion
flag.
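
Illustration (not part of the patch): a standalone sketch of how the number
of segments in an mbuf chain maps to WQE size and TXBB count, and therefore
to how fast elts_comp_cd is consumed. The 16-byte control/data segment sizes
are assumptions consistent with the 64-byte TXBB math, and ALIGN_UP stands
in for RTE_ALIGN to keep the sketch self-contained.

#include <stdio.h>

/* Constants mirroring mlx4_prm.h in this series. */
#define MLX4_TXBB_SHIFT 6
#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
#define ALIGN_UP(size, align) (((size) + (align) - 1) & ~((align) - 1))
#define MLX4_SIZE_TO_TXBBS(size) \
	(ALIGN_UP((size), MLX4_TXBB_SIZE) >> MLX4_TXBB_SHIFT)

/* Assumed per-segment sizes (16 bytes each). */
#define CTRL_SEG_SIZE 16
#define DATA_SEG_SIZE 16

int
main(void)
{
	unsigned int nb_segs;

	for (nb_segs = 1; nb_segs <= 8; ++nb_segs) {
		unsigned int wqe_real_size =
			CTRL_SEG_SIZE + nb_segs * DATA_SEG_SIZE;
		unsigned int nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);

		/*
		 * nr_txbbs is what the patch subtracts from elts_comp_cd,
		 * so completions are requested per consumed TXBB rather
		 * than per packet.
		 */
		printf("%u segs -> %3u-byte WQE -> %u TXBB(s)\n",
		       nb_segs, wqe_real_size, nr_txbbs);
	}
	return 0;
}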

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 208 ++++++++++++++++++++++++-------------------
 drivers/net/mlx4/mlx4_rxtx.h |   6 +-
 drivers/net/mlx4/mlx4_txq.c  |  12 ++-
 3 files changed, 129 insertions(+), 97 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index f517505..bc0e353 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -63,6 +63,16 @@
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
+/*
+ * Pointer-value pair structure
+ * used in mlx4_post_send() for saving the first DWORD (32 bits)
+ * of a TXBB.
+ */
+struct pv {
+	struct mlx4_wqe_data_seg *dseg;
+	uint32_t val;
+};
+
 /**
  * Stamp a WQE so it won't be reused by the HW.
  * Routine is used when freeing WQE used by the chip or when failing
@@ -297,34 +307,38 @@
  *
  * @param txq
  *   The Tx queue to post to.
- * @param wr
- *   The work request to handle.
- * @param bad_wr
- *   The wr in case that posting had failed.
+ * @param pkt
+ *   The packet to transmit.
  *
  * @return
  *   0 - success, negative errno value otherwise and rte_errno is set.
  */
 static inline int
 mlx4_post_send(struct txq *txq,
-	       struct rte_mbuf *pkt,
-	       uint32_t send_flags)
+	       struct rte_mbuf *pkt)
 {
 	struct mlx4_wqe_ctrl_seg *ctrl;
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
+	struct rte_mbuf *buf;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
+	uint32_t srcrb_flags;
+	uint32_t owner_opcode = MLX4_OPCODE_SEND;
+	uint32_t byte_count;
 	int wqe_real_size;
 	int nr_txbbs;
 	int rc;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
+	int pv_counter = 0;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
 			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
 	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
-	/* Check that there is room for this WQE in the send queue and
+	/*
+	 * Check that there is room for this WQE in the send queue and
 	 * that the WQE size is legal.
 	 */
 	if (likely(((sq->head - sq->tail) + nr_txbbs +
@@ -333,76 +347,108 @@
 		rc = ENOSPC;
 		goto err;
 	}
-	/* Get the control and single-data entries of the WQE */
+	/* Get the control and data entries of the WQE. */
 	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
 	dseg = (struct mlx4_wqe_data_seg *)(((char *)ctrl) +
 		sizeof(struct mlx4_wqe_ctrl_seg));
-	/*
-	 * Fill the data segment with buffer information.
-	 */
-	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
-	rte_prefetch0((volatile void *)addr);
-	dseg->addr = rte_cpu_to_be_64(addr);
-	/* Memory region key for this memory pool. */
-	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
-	if (unlikely(lkey == (uint32_t)-1)) {
-		/* MR does not exist. */
-		DEBUG("%p: unable to get MP <-> MR"
-		      " association", (void *)txq);
-		/*
-		 * Restamp entry in case of failure.
-		 * Make sure that size is written correctly.
-		 * Note that we give ownership to the SW, not the HW.
+	/* Fill the data segments with buffer information. */
+	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
+		addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		rte_prefetch0((volatile void *)addr);
+		/* Handle WQE wraparound. */
+		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
+			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+		dseg->addr = rte_cpu_to_be_64(addr);
+		/* Memory region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR"
+			      " association", (void *)txq);
+			/*
+			 * Restamp entry in case of failure.
+			 * Make sure that size is written correctly.
+			 * Note that we give ownership to the SW, not the HW.
+			 */
+			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+			mlx4_txq_stamp_freed_wqe(sq, head_idx,
+				     (sq->head & sq->txbb_cnt) ? 0 : 1);
+			rc = EFAULT;
+			goto err;
+		}
+		dseg->lkey = rte_cpu_to_be_32(lkey);
+		if (likely(buf->data_len))
+			byte_count = rte_cpu_to_be_32(buf->data_len);
+		else
+			/*
+			 * Zero length segment is treated as inline segment
+			 * with zero data.
+			 */
+			byte_count = RTE_BE32(0x80000000);
+		/* If the data segment is not at the beginning of a
+		 * Tx basic block (TXBB) then write the byte count,
+		 * else postpone the writing to just before updating the
+		 * control segment.
 		 */
-		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-		mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					 (sq->head & sq->txbb_cnt) ? 0 : 1);
-		rc = EFAULT;
-		goto err;
+		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			/*
+			 * Need a barrier here before writing the byte_count
+			 * fields to make sure that all the data is visible
+			 * before the byte_count field is set.
+			 * Otherwise, if the segment begins a new cacheline,
+			 * the HCA prefetcher could grab the 64-byte chunk and
+			 * get a valid (!= * 0xffffffff) byte count but stale
+			 * data, and end up sending the wrong data.
+			 */
+			rte_io_wmb();
+			dseg->byte_count = byte_count;
+		} else {
+			/*
+			 * This data segment starts at the beginning of a new
+			 * TXBB, so we need to postpone its byte_count writing
+			 * for later.
+			 */
+			pv[pv_counter].dseg = dseg;
+			pv[pv_counter++].val = byte_count;
+		}
 	}
-	dseg->lkey = rte_cpu_to_be_32(lkey);
-	/*
-	 * Need a barrier here before writing the byte_count field to
-	 * make sure that all the data is visible before the
-	 * byte_count field is set.  Otherwise, if the segment begins
-	 * a new cacheline, the HCA prefetcher could grab the 64-byte
-	 * chunk and get a valid (!= * 0xffffffff) byte count but
-	 * stale data, and end up sending the wrong data.
-	 */
-	rte_io_wmb();
-	if (likely(pkt->data_len))
-		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
-	else
-		/*
-		 * Zero length segment is treated as inline segment
-		 * with zero data.
-		 */
-		dseg->byte_count = RTE_BE32(0x80000000);
-	/*
-	 * Fill the control parameters for this packet.
-	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
-	 * should be calculated
-	 */
-	ctrl->srcrb_flags =
-		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
-				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	/* Write the first DWORD of each TXBB saved earlier. */
+	if (pv_counter) {
+		/* Need a barrier here before writing the byte_count. */
+		rte_io_wmb();
+		for (--pv_counter; pv_counter  >= 0; pv_counter--)
+			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
+	}
+	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
 	 * The caller should prepare "imm" in advance in order to support
 	 * VF to VF communication (when the device is a virtual-function
 	 * device (VF)).
-	 */
+	*/
 	ctrl->imm = 0;
 	/*
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
+	 * should be calculated.
+	 */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+				       MLX4_WQE_CTRL_CQ_UPDATE);
+	} else {
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+	}
+	ctrl->srcrb_flags = srcrb_flags;
+	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
 	 * executing as soon as we do).
 	 */
-	rte_wmb();
-	ctrl->owner_opcode =
-		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
-				 ((sq->head & sq->txbb_cnt) ?
-				  MLX4_BIT_WQE_OWN : 0));
+	 rte_wmb();
+	 ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					       ((sq->head & sq->txbb_cnt) ?
+					       MLX4_BIT_WQE_OWN : 0));
 	sq->head += nr_txbbs;
 	return 0;
 err:
@@ -429,14 +475,13 @@
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -455,8 +500,6 @@
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		unsigned int segs = buf->nb_segs;
-		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -474,34 +517,16 @@
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
-			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
-		}
-		if (likely(segs == 1)) {
-			/* Update element. */
-			elt->buf = buf;
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			/* post the pkt for sending */
-			err = mlx4_post_send(txq, buf, send_flags);
-			if (unlikely(err)) {
-				if (unlikely(send_flags &
-					     MLX4_WQE_CTRL_CQ_UPDATE)) {
-					elts_comp_cd = 1;
-					--elts_comp;
-				}
-				elt->buf = NULL;
-				goto stop;
-			}
-			elt->buf = buf;
-			bytes_sent += buf->pkt_len;
-		} else {
-			err = -EINVAL;
-			rte_errno = -err;
+		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+		/* post the packet for sending. */
+		err = mlx4_post_send(txq, buf);
+		if (unlikely(err)) {
+			elt->buf = NULL;
 			goto stop;
 		}
+		elt->buf = buf;
+		bytes_sent += buf->pkt_len;
+		++elts_comp;
 		elts_head = elts_head_next;
 	}
 stop:
@@ -517,7 +542,6 @@
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index df83552..1b90533 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -103,13 +103,15 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
+	unsigned int elts_comp; /**< Number of pkts waiting for completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	char *bounce_buf;
+	/**< memory used for storing the first DWORD of data TXBBs. */
 	struct {
 		const struct rte_mempool *mp; /**< Cached memory pool. */
 		struct ibv_mr *mr; /**< Memory region (for mp). */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 492779f..bbdeda3 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -83,8 +83,14 @@
 		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->ctrl.socket);
 	int ret = 0;
 
-	if (elts == NULL) {
-		ERROR("%p: can't allocate packets array", (void *)txq);
+	/* Allocate Bounce-buf memory */
+	txq->bounce_buf = (char *)rte_zmalloc_socket("TXQ",
+						     MLX4_MAX_WQE_SIZE,
+						     RTE_CACHE_LINE_MIN_SIZE,
+						     txq->ctrl.socket);
+
+	if (elts == NULL || txq->bounce_buf == NULL) {
+		ERROR("%p: can't allocate TXQ memory", (void *)txq);
 		ret = ENOMEM;
 		goto error;
 	}
@@ -110,6 +116,7 @@
 	assert(ret == 0);
 	return 0;
 error:
+	rte_free(txq->bounce_buf);
 	rte_free(elts);
 	DEBUG("%p: failed, freed everything", (void *)txq);
 	assert(ret > 0);
@@ -303,7 +310,6 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4dv_obj mlxdv;
 	struct mlx4dv_qp dv_qp;
 	struct mlx4dv_cq dv_cq;
-
 	struct txq tmpl = {
 		.ctrl = {
 			.priv = priv,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 5/7] net/mlx4: get back Tx checksum offloads
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
                       ` (3 preceding siblings ...)
  2017-10-04 21:49     ` [PATCH v3 4/7] net/mlx4: support multi-segments Tx Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 6/7] net/mlx4: get back Rx " Ophir Munk
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPV4, UDP and TCP
checksum calculation.
This commit also includes support for offloading IPV4, UDP and TCP
tunnel checksum calculation to the hardware.
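
For reference, a minimal sketch (not part of this patch) of how an
application is expected to request these offloads per packet before
calling rte_eth_tx_burst(). The flags and fields are the standard
rte_mbuf ones; prepare_ipv4_tcp_csum() is a hypothetical helper:

#include <rte_mbuf.h>

/* Hypothetical helper: mark an IPv4/TCP packet for HW checksum offload. */
static void
prepare_ipv4_tcp_csum(struct rte_mbuf *m, uint16_t l2_len, uint16_t l3_len)
{
	/* Header lengths let the HW locate the IP and TCP headers. */
	m->l2_len = l2_len;
	m->l3_len = l3_len;
	/* Request HW to fill in the IPv4 header and TCP checksums. */
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
}

The PMD maps these flags to the MLX4_WQE_CTRL_IP_HDR_CSUM and
MLX4_WQE_CTRL_TCP_UDP_CSUM bits of the WQE control segment, as done in
the Tx datapath changes below.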

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4.c        |  9 +++++++++
 drivers/net/mlx4/mlx4.h        |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h    |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 25 +++++++++++++++++++++----
 drivers/net/mlx4/mlx4_rxtx.h   |  2 ++
 drivers/net/mlx4/mlx4_txq.c    |  2 ++
 7 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index b084903..a0e76ee 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -397,6 +397,7 @@ struct mlx4_conf {
 		.ports.present = 0,
 	};
 	unsigned int vf;
+	unsigned int tunnel_en;
 	int i;
 
 	(void)pci_drv;
@@ -456,6 +457,9 @@ struct mlx4_conf {
 		rte_errno = ENODEV;
 		goto error;
 	}
+	/* Only cx3-pro supports L3 tunneling */
+	tunnel_en = (device_attr.vendor_part_id ==
+		     PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
 	INFO("%u port(s) detected", device_attr.phys_port_cnt);
 	conf.ports.present |= (UINT64_C(1) << device_attr.phys_port_cnt) - 1;
 	if (mlx4_args(pci_dev->device.devargs, &conf)) {
@@ -529,6 +533,11 @@ struct mlx4_conf {
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->hw_csum =
+		     !!(device_attr.device_cap_flags & IBV_DEVICE_RAW_IP_CSUM);
+		priv->hw_csum_l2tun = tunnel_en;
+		DEBUG("L2 tunnel checksum offloads are %ssupported",
+		      (priv->hw_csum_l2tun ? "" : "not "));
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index b6e1ef2..d0bce91 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -93,6 +93,8 @@ struct priv {
 	unsigned int vf:1; /* This is a VF device. */
 	unsigned int intr_alarm:1; /* An interrupt alarm is scheduled. */
 	unsigned int isolated:1; /* Toggle isolated mode. */
+	unsigned int hw_csum:1; /* Checksum offload is supported. */
+	unsigned int hw_csum_l2tun:1; /* Checksum support for L2 tunnels. */
 	struct rte_intr_handle intr_handle; /* Port interrupt handle. */
 	struct rte_flow_drop *flow_drop_queue; /* Flow drop queue. */
 	LIST_HEAD(mlx4_flows, rte_flow) flows;
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a9e8059..95cc6e4 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,6 +553,12 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM  |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->hw_csum_l2tun)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->speed_capa =
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 6d1800a..57f5a46 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -64,6 +64,8 @@
 
 /* Work queue element (WQE) flags. */
 #define MLX4_BIT_WQE_OWN 0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define MLX4_SIZE_TO_TXBBS(size) \
 		(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index bc0e353..ea92ebb 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -434,12 +434,29 @@ struct pv {
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
-				       MLX4_WQE_CTRL_CQ_UPDATE);
+		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
 	} else {
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
 	}
-	ctrl->srcrb_flags = srcrb_flags;
+	/* Enable HW checksum offload if requested */
+	if (txq->csum &&
+	    (pkt->ol_flags &
+	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = pkt->ol_flags &
+					     (PKT_TX_TUNNEL_GRE |
+					      PKT_TX_TUNNEL_VXLAN);
+
+		if (is_tunneled && txq->csum_l2tun) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
+		} else {
+			srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
+				      MLX4_WQE_CTRL_TCP_UDP_CSUM;
+		}
+	}
+	ctrl->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 1b90533..dc283e1 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -110,6 +110,8 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint32_t csum:1; /**< Checksum is supported and enabled */
+	uint32_t csum_l2tun:1; /**< L2 tun Checksum is supported and enabled */
 	char *bounce_buf;
 	/**< memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index bbdeda3..4eb739c 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -340,6 +340,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	tmpl.csum = priv->hw_csum;
+	tmpl.csum_l2tun = priv->hw_csum_l2tun;
 	DEBUG("priv->device_attr.max_qp_wr is %d",
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 6/7] net/mlx4: get back Rx checksum offloads
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
                       ` (4 preceding siblings ...)
  2017-10-04 21:49     ` [PATCH v3 5/7] net/mlx4: get back Tx checksum offloads Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 21:49     ` [PATCH v3 7/7] net/mlx4: add loopback Tx from VF Ophir Munk
  2017-10-04 22:37     ` [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs Ferruh Yigit
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky,
	Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPV4, UDP and TCP
checksum verification.
This commit also includes support for offloading IPV4, UDP and TCP tunnel
checksum verification to the hardware.
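
For reference, a minimal sketch (not part of this patch) of how an
application consumes these flags once rxmode.hw_ip_checksum is set in
struct rte_eth_conf, which is what toggles the csum fields below. The
port and queue identifiers are placeholders; ex_rx_and_check_csum() is
a hypothetical helper:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define EX_PORT 0  /* placeholder port identifier */
#define EX_QUEUE 0 /* placeholder queue identifier */

static void
ex_rx_and_check_csum(void)
{
	struct rte_mbuf *pkts[32];
	const uint64_t good = PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD;
	uint16_t n = rte_eth_rx_burst(EX_PORT, EX_QUEUE, pkts, 32);
	uint16_t i;

	for (i = 0; i != n; ++i) {
		if ((pkts[i]->ol_flags & good) != good) {
			/* HW did not validate both checksums; drop. */
			rte_pktmbuf_free(pkts[i]);
			continue;
		}
		/* ... process the validated packet ... */
		rte_pktmbuf_free(pkts[i]);
	}
}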

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
---
 drivers/net/mlx4/mlx4.c        |   2 +
 drivers/net/mlx4/mlx4_ethdev.c |   8 ++-
 drivers/net/mlx4/mlx4_prm.h    |  19 +++++++
 drivers/net/mlx4/mlx4_rxq.c    |   5 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 120 ++++++++++++++++++++++++++++++++++++++++-
 drivers/net/mlx4/mlx4_rxtx.h   |   2 +
 6 files changed, 152 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index a0e76ee..865ffdd 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -535,6 +535,8 @@ struct mlx4_conf {
 		priv->vf = vf;
 		priv->hw_csum =
 		     !!(device_attr.device_cap_flags & IBV_DEVICE_RAW_IP_CSUM);
+		DEBUG("checksum offloading is %ssupported",
+		      (priv->hw_csum ? "" : "not "));
 		priv->hw_csum_l2tun = tunnel_en;
 		DEBUG("L2 tunnel checksum offloads are %ssupported",
 		      (priv->hw_csum_l2tun ? "" : "not "));
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index 95cc6e4..6dbf273 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,10 +553,14 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
-	if (priv->hw_csum)
+	if (priv->hw_csum) {
 		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
-					  DEV_TX_OFFLOAD_UDP_CKSUM  |
+					  DEV_TX_OFFLOAD_UDP_CKSUM |
 					  DEV_TX_OFFLOAD_TCP_CKSUM);
+		info->rx_offload_capa |= (DEV_RX_OFFLOAD_IPV4_CKSUM |
+					  DEV_RX_OFFLOAD_UDP_CKSUM |
+					  DEV_RX_OFFLOAD_TCP_CKSUM);
+	}
 	if (priv->hw_csum_l2tun)
 		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 57f5a46..73c3d55 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -70,6 +70,25 @@
 #define MLX4_SIZE_TO_TXBBS(size) \
 		(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
 
+/* Generic macro to convert MLX4 to IBV flags. */
+#define MLX4_TRANSPOSE(val, from, to) \
+		(__extension__({ \
+			typeof(val) _val = (val); \
+			typeof(from) _from = (from); \
+			typeof(to) _to = (to); \
+			(((_from) >= (_to)) ? \
+			(((_val) & (_from)) / ((_from) / (_to))) : \
+			(((_val) & (_from)) * ((_to) / (_from)))); \
+		}))
+
+/* CQE checksum flags */
+enum {
+	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1U << 25),
+	MLX4_CQE_L2_TUNNEL_L4_CSUM = (int)(1U << 26),
+	MLX4_CQE_L2_TUNNEL = (int)(1U << 27),
+	MLX4_CQE_L2_TUNNEL_IPOK = (int)(1U << 31),
+};
+
 /* Send queue information. */
 struct mlx4_sq {
 	char *buf; /**< SQ buffer. */
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 7d13121..889f05c 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -267,6 +267,11 @@
 	int ret;
 
 	(void)conf; /* Thresholds configuration (ignored). */
+	/* Toggle Rx checksum offload if hardware supports it. */
+	if (priv->hw_csum)
+		tmpl.csum = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
+	if (priv->hw_csum_l2tun)
+		tmpl.csum_l2tun = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
 	mb_len = rte_pktmbuf_data_room_size(mp);
 	/* Enable scattered packets support for this queue if necessary. */
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index ea92ebb..ca66b1d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -563,6 +563,110 @@ struct pv {
 }
 
 /**
+ * Translate Rx completion flags to packet type.
+ *
+ * @param flags
+ *   Rx completion flags returned by poll_length_flags().
+ *
+ * @return
+ *   Packet type for struct rte_mbuf.
+ */
+static inline uint32_t
+rxq_cq_to_pkt_type(uint32_t flags)
+{
+	uint32_t pkt_type;
+
+	if (flags & MLX4_CQE_L2_TUNNEL)
+		pkt_type =
+			MLX4_TRANSPOSE(flags,
+			       (uint32_t)MLX4_CQE_L2_TUNNEL_IPV4,
+			       (uint32_t)RTE_PTYPE_L3_IPV4_EXT_UNKNOWN) |
+			MLX4_TRANSPOSE(flags,
+			       (uint32_t)MLX4_CQE_STATUS_IPV4_PKT,
+			       (uint32_t)RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN);
+	else
+		pkt_type =
+			MLX4_TRANSPOSE(flags,
+			       (uint32_t)MLX4_CQE_STATUS_IPV4_PKT,
+			       (uint32_t)RTE_PTYPE_L3_IPV4_EXT_UNKNOWN);
+	ERROR("pkt_type 0x%x", pkt_type);
+	return pkt_type;
+}
+
+/**
+ * Translate Rx completion flags to offload flags.
+ *
+ * @param  flags
+ *   Rx completion flags returned by poll_length_flags().
+ * @param csum
+ *   Rx checksum enable flag
+ * @param csum_l2tun
+ *   Rx L2 tunnel checksum enable flag
+ *
+ * @return
+ *   Offload flags (ol_flags) for struct rte_mbuf.
+ */
+static inline uint32_t
+rxq_cq_to_ol_flags(uint32_t flags, unsigned int csum, unsigned int csum_l2tun)
+{
+	uint32_t ol_flags = 0;
+
+	if (csum)
+		ol_flags |=
+			MLX4_TRANSPOSE(flags,
+				(uint64_t)MLX4_CQE_STATUS_IP_HDR_CSUM_OK,
+				PKT_RX_IP_CKSUM_GOOD) |
+			MLX4_TRANSPOSE(flags,
+				(uint64_t)MLX4_CQE_STATUS_TCP_UDP_CSUM_OK,
+				PKT_RX_L4_CKSUM_GOOD);
+	if ((flags & MLX4_CQE_L2_TUNNEL) && csum_l2tun)
+		ol_flags |=
+			MLX4_TRANSPOSE(flags,
+				       (uint64_t)MLX4_CQE_L2_TUNNEL_IPOK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			MLX4_TRANSPOSE(flags,
+				       (uint64_t)MLX4_CQE_L2_TUNNEL_L4_CSUM,
+				       PKT_RX_L4_CKSUM_GOOD);
+	return ol_flags;
+}
+
+/**
+ * Get Rx checksum CQE flags.
+ *
+ * @param cqe
+ *   Pointer to cqe structure.
+ * @param csum
+ *   Rx checksum enable flag
+ * @param csum_l2tun
+ *   Rx L2 tunnel checksum enable flag
+ *
+ * @return
+ *   CQE flags in CPU order
+ */
+static inline uint32_t
+mlx4_cqe_flags(struct mlx4_cqe *cqe,
+	       int csum, unsigned int csum_l2tun)
+{
+	uint32_t flags = 0;
+
+	/*
+	 * The relevant bits are in different locations within the
+	 * CQE fields, therefore they can be joined into one 32-bit
+	 * variable.
+	 */
+	if (csum)
+		flags = rte_be_to_cpu_32(cqe->status) &
+			MLX4_CQE_STATUS_IPV4_CSUM_OK;
+	if (csum_l2tun)
+		flags |= rte_be_to_cpu_32(cqe->vlan_my_qpn) &
+			 (MLX4_CQE_L2_TUNNEL |
+			  MLX4_CQE_L2_TUNNEL_IPOK |
+			  MLX4_CQE_L2_TUNNEL_L4_CSUM |
+			  MLX4_CQE_L2_TUNNEL_IPV4);
+	return flags;
+}
+
+/**
  * Poll one CQE from CQ.
  *
  * @param rxq
@@ -601,7 +705,7 @@ struct pv {
 }
 
 /**
- * DPDK callback for RX with scattered packets support.
+ * DPDK callback for Rx with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -666,7 +770,7 @@ struct pv {
 				break;
 			}
 			if (unlikely(len < 0)) {
-				/* RX error, packet is likely too large. */
+				/* Rx error, packet is likely too large. */
 				rte_mbuf_raw_free(rep);
 				++rxq->stats.idropped;
 				goto skip;
@@ -674,6 +778,18 @@ struct pv {
 			pkt = seg;
 			pkt->packet_type = 0;
 			pkt->ol_flags = 0;
+			if (rxq->csum | rxq->csum_l2tun) {
+				uint32_t flags =
+					mlx4_cqe_flags(cqe, rxq->csum,
+						       rxq->csum_l2tun);
+				pkt->ol_flags =
+					rxq_cq_to_ol_flags(flags, rxq->csum,
+							   rxq->csum_l2tun);
+				pkt->packet_type = rxq_cq_to_pkt_type(flags);
+			} else {
+				pkt->packet_type = 0;
+				pkt->ol_flags = 0;
+			}
 			pkt->pkt_len = len;
 		}
 		rep->nb_segs = 1;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index dc283e1..75c98c1 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -80,6 +80,8 @@ struct rxq {
 	} hw;
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	unsigned int sge_n; /**< Log 2 of SGEs number. */
+	unsigned int csum:1; /**< Enable checksum offloading. */
+	unsigned int csum_l2tun:1; /**< Enable checksum for L2 tunnels. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 7/7] net/mlx4: add loopback Tx from VF
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
                       ` (5 preceding siblings ...)
  2017-10-04 21:49     ` [PATCH v3 6/7] net/mlx4: get back Rx " Ophir Munk
@ 2017-10-04 21:49     ` Ophir Munk
  2017-10-04 22:37     ` [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs Ferruh Yigit
  7 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-04 21:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds loopback functionality used when the chip is a VF
in order to enable packet transmission between VFs and between VFs and PF.
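
As an illustration only (not part of the patch), the sketch below
mirrors how the Tx routine in this patch splits the 6-byte destination
MAC taken from the start of the frame: the first two bytes are written
over the first half of srcrb_flags and the remaining four go into the
immediate field. ex_fill_loopback_dmac() is a hypothetical helper:

#include <rte_mbuf.h>

/* Hypothetical helper mirroring the DMAC copy done in the Tx routine. */
static void
ex_fill_loopback_dmac(struct rte_mbuf *pkt, uint16_t *flags16, uint32_t *imm)
{
	/* Bytes 0-1 of the destination MAC. */
	flags16[0] = *rte_pktmbuf_mtod(pkt, uint16_t *);
	/* Bytes 2-5 of the destination MAC. */
	*imm = *rte_pktmbuf_mtod_offset(pkt, uint32_t *, sizeof(uint16_t));
}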

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 38 ++++++++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |  2 ++
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index ca66b1d..87c4c38 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -321,10 +321,13 @@ struct pv {
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct rte_mbuf *buf;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
-	uint32_t srcrb_flags;
 	uint32_t owner_opcode = MLX4_OPCODE_SEND;
 	uint32_t byte_count;
 	int wqe_real_size;
@@ -422,21 +425,15 @@ struct pv {
 	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
-	 * The caller should prepare "imm" in advance in order to support
-	 * VF to VF communication (when the device is a virtual-function
-	 * device (VF)).
-	*/
-	ctrl->imm = 0;
-	/*
 	 * For raw Ethernet, the SOLICIT flag is used to indicate that no icrc
 	 * should be calculated.
 	 */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
+		srcrb.flags = MLX4_WQE_CTRL_SOLICIT | MLX4_WQE_CTRL_CQ_UPDATE;
 	} else {
-		srcrb_flags = MLX4_WQE_CTRL_SOLICIT;
+		srcrb.flags = MLX4_WQE_CTRL_SOLICIT;
 	}
 	/* Enable HW checksum offload if requested */
 	if (txq->csum &&
@@ -450,13 +447,30 @@ struct pv {
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
+				srcrb.flags |= MLX4_WQE_CTRL_IP_HDR_CSUM;
 		} else {
-			srcrb_flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= MLX4_WQE_CTRL_IP_HDR_CSUM |
 				      MLX4_WQE_CTRL_TCP_UDP_CSUM;
 		}
 	}
-	ctrl->srcrb_flags = rte_cpu_to_be_32(srcrb_flags);
+	/*
+	 * Convert flags to big endian before adding the MAC address
+	 * (if any) to it.
+	 */
+	srcrb.flags = rte_cpu_to_be_32(srcrb.flags);
+	if (txq->lb) {
+		/*
+		 * Copy destination mac address to the wqe,
+		 * this allows loopback in eSwitch, so that VFs and PF
+		 * can communicate with each other.
+		 */
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						      sizeof(uint16_t)));
+	} else {
+		ctrl->imm = 0;
+	}
+	ctrl->srcrb_flags = srcrb.flags;
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 75c98c1..6f33d1c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -114,6 +114,8 @@ struct txq {
 	uint32_t max_inline; /**< Max inline send size. */
 	uint32_t csum:1; /**< Checksum is supported and enabled */
 	uint32_t csum_l2tun:1; /**< L2 tun Checksum is supported and enabled */
+	uint32_t lb:1;
+	/**< Whether pkts should be looped-back by eswitch or not */
 	char *bounce_buf;
 	/**< memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 4eb739c..8205647 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -415,6 +415,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* If a VF device - need to loopback xmitted packets */
+	tmpl.lb = !!(priv->vf);
 	/* Clean up txq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old txq just in case", (void *)txq);
 	mlx4_txq_cleanup(txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
                       ` (6 preceding siblings ...)
  2017-10-04 21:49     ` [PATCH v3 7/7] net/mlx4: add loopback Tx from VF Ophir Munk
@ 2017-10-04 22:37     ` Ferruh Yigit
  2017-10-04 22:46       ` Thomas Monjalon
  7 siblings, 1 reply; 61+ messages in thread
From: Ferruh Yigit @ 2017-10-04 22:37 UTC (permalink / raw)
  To: Ophir Munk, Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad

On 10/4/2017 10:48 PM, Ophir Munk wrote:
> Changes from v2:
> * Split "net/mlx4: support multi-segments Rx" commit from "net/mlx4: get back Rx flow functionality" commit
> * Semantics, code styling
> * Fix check-git-log warnings
> * Fix checkpatches warnings
> 
> Next (currently not included) changes:
> * Replacing MLX4_TRANSPOSE() macro (Generic macro to convert MLX4 to IBV flags) with a look-up table as in mlx5
> for example: mlx5_set_ptype_table() function - in order to improve performance.
> This change is delicate and should be verified first with regression tests
> 
> * PMD documentation update when no longer working with MLNX_OFED
> Documentation updates require specific kernel, rdma_core and FW versions as well as installation procedures.
> These details should be supplied by regression team.
> 
> Moti Haimovsky (6):
>   net/mlx4: add simple Tx bypassing ibverbs
>   net/mlx4: get back Rx flow functionality
>   net/mlx4: support multi-segments Tx
>   net/mlx4: get back Tx checksum offloads
>   net/mlx4: get back Rx checksum offloads
>   net/mlx4: add loopback Tx from VF
> 
> Vasily Philipov (1):
>   net/mlx4: support multi-segments Rx

Adrien looks like he beat you by a few hours :)

Can you please clarify which set is the valid one, thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-04 22:37     ` [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs Ferruh Yigit
@ 2017-10-04 22:46       ` Thomas Monjalon
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Monjalon @ 2017-10-04 22:46 UTC (permalink / raw)
  To: Ferruh Yigit, Ophir Munk, Adrien Mazarguil; +Cc: dev, Olga Shern, Matan Azrad

05/10/2017 00:37, Ferruh Yigit:
> On 10/4/2017 10:48 PM, Ophir Munk wrote:
> > Moti Haimovsky (6):
> >   net/mlx4: add simple Tx bypassing ibverbs
> >   net/mlx4: get back Rx flow functionality
> >   net/mlx4: support multi-segments Tx
> >   net/mlx4: get back Tx checksum offloads
> >   net/mlx4: get back Rx checksum offloads
> >   net/mlx4: add loopback Tx from VF
> > 
> > Vasily Philipov (1):
> >   net/mlx4: support multi-segments Rx
> 
> Adrien looks like he beat you by a few hours :)
> 
> Can you please clarify which set is the valid one, thanks.

I think we need a v4 merging both v3.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
                       ` (5 preceding siblings ...)
  2017-10-04 18:48     ` [PATCH v3 6/6] net/mlx4: add loopback Tx from VF Adrien Mazarguil
@ 2017-10-05  9:33     ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 1/7] net/mlx4: add simple Tx bypassing Verbs Ophir Munk
                         ` (9 more replies)
  6 siblings, 10 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Ophir Munk

v4 (Ophir):
- Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4: 
  restore full Rx support bypassing Verbs" commit

v3 (Adrien):
- Drop a few unrelated or unnecessary changes such as the removal of
  MLX4_PMD_TX_MP_CACHE.
- Move device checksum support detection code to its previous location.
- Fix include guard in mlx4_prm.h.
- Reorder #includes alphabetically.
- Replace MLX4_TRANSPOSE() macro with documented inline function.
- Remove extra spaces and blank lines.
- Use uint8_t * instead of char * for buffers.
- Replace mlx4_get_cqe() macro with a documented inline function.
- Replace several unsigned int with uint32_t.
- Add consistency to field names (sge_n => sges_n).
- Make mbuf size checks in RX queue setup function similar to mlx5.
- Update various comments.
- Fix indentation.
- Replace run-time endian conversion with static ones where possible.
- Reorder fields in struct rxq and struct txq for consistency, remove
  one level of unnecessary inner structures.
- Fix memory leak on Tx bounce buffer.
- Update commit logs.
- Fix remaining checkpatch warnings.

v2 (Matan):
Rearange patches.
Semantics.
Enhancements.
Fix compilation issues.

Moti Haimovsky (6):
  net/mlx4: add simple Tx bypassing Verbs
  net/mlx4: restore full Rx support bypassing Verbs
  net/mlx4: restore Tx gather support
  net/mlx4: restore Tx checksum offloads
  net/mlx4: restore Rx offloads
  net/mlx4: add loopback Tx from VF

Ophir Munk (1):
  net/mlx4: restore Rx scatter support

 drivers/net/mlx4/mlx4.c        |  11 +
 drivers/net/mlx4/mlx4.h        |   2 +
 drivers/net/mlx4/mlx4_ethdev.c |  10 +
 drivers/net/mlx4/mlx4_prm.h    | 152 ++++++++
 drivers/net/mlx4/mlx4_rxq.c    | 179 ++++++----
 drivers/net/mlx4/mlx4_rxtx.c   | 768 ++++++++++++++++++++++++++++++-----------
 drivers/net/mlx4/mlx4_rxtx.h   |  54 +--
 drivers/net/mlx4/mlx4_txq.c    |  67 +++-
 drivers/net/mlx4/mlx4_utils.h  |  20 ++
 mk/rte.app.mk                  |   2 +-
 10 files changed, 975 insertions(+), 290 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v4 1/7] net/mlx4: add simple Tx bypassing Verbs
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 2/7] net/mlx4: restore full Rx support " Ophir Munk
                         ` (8 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

Modify PMD to send single-buffer packets directly to the device bypassing
the Verbs Tx post and poll routines.
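
As a worked sizing example (illustration only), the snippet below
reuses the TXBB macros this patch introduces in mlx4_prm.h to show that
a single-segment packet, i.e. one control segment plus one data segment
of 16 bytes each, occupies exactly one 64-byte Tx basic block:

#include <stdio.h>
#include <rte_common.h>

/* Same definitions as mlx4_prm.h below. */
#define MLX4_TXBB_SHIFT 6
#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT) /* 64 bytes */
#define MLX4_SIZE_TO_TXBBS(size) \
	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))

int
main(void)
{
	/* Control segment (16 B) + one data segment (16 B). */
	int wqe_real_size = 16 + 1 * 16;

	/* Prints "txbbs=1": the WQE rounds up to a single TXBB. */
	printf("txbbs=%d\n", MLX4_SIZE_TO_TXBBS(wqe_real_size));
	return 0;
}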

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 120 +++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 337 ++++++++++++++++++++++++++++++++-----------
 drivers/net/mlx4/mlx4_rxtx.h |  28 ++--
 drivers/net/mlx4/mlx4_txq.c  |  51 +++++++
 mk/rte.app.mk                |   2 +-
 5 files changed, 436 insertions(+), 102 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..085a595
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,120 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef MLX4_PRM_H_
+#define MLX4_PRM_H_
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_byteorder.h>
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/mlx4dv.h>
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* ConnectX-3 Tx queue basic block. */
+#define MLX4_TXBB_SHIFT 6
+#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes. */
+#define MLX4_MAX_WQE_SIZE 512
+#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+
+/* Send queue stamping/invalidating information. */
+#define MLX4_SQ_STAMP_STRIDE 64
+#define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
+#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_STAMP_VAL 0x7fffffff
+
+/* Work queue element (WQE) flags. */
+#define MLX4_BIT_WQE_OWN 0x80000000
+
+#define MLX4_SIZE_TO_TXBBS(size) \
+	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
+
+/* Send queue information. */
+struct mlx4_sq {
+	uint8_t *buf; /**< SQ buffer. */
+	uint8_t *eob; /**< End of SQ buffer */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
+	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db; /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
+};
+
+#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
+
+/* Completion queue information. */
+struct mlx4_cq {
+	uint8_t *buf; /**< Pointer to the completion queue buffer. */
+	uint32_t cqe_cnt; /**< Number of entries in the queue. */
+	uint32_t cqe_64:1; /**< CQ entry size is 64 bytes. */
+	uint32_t cons_index; /**< Last queue entry that was handled. */
+	uint32_t *set_ci_db; /**< Pointer to the completion queue doorbell. */
+};
+
+/**
+ * Retrieve a CQE entry from a CQ.
+ *
+ * cqe = cq->buf + cons_index * cqe_size + cqe_offset
+ *
+ * Where cqe_size is 32 or 64 bytes and cqe_offset is 0 or 32 (depending on
+ * cqe_size).
+ *
+ * @param cq
+ *   CQ to retrieve entry from.
+ * @param index
+ *   Entry index.
+ *
+ * @return
+ *   Pointer to CQE entry.
+ */
+static inline struct mlx4_cqe *
+mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
+{
+	return (struct mlx4_cqe *)(cq->buf +
+				   ((index & (cq->cqe_cnt - 1)) <<
+				    (5 + cq->cqe_64)) +
+				   (cq->cqe_64 << 5));
+}
+
+#endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b5e7777..35367a2 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -52,15 +52,72 @@
 
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
+#include <rte_io.h>
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
 /**
+ * Stamp a WQE so it won't be reused by the HW.
+ *
+ * This routine is used when freeing a WQE that the chip has consumed, or when
+ * building a WQ entry has failed and left partial information on the queue.
+ *
+ * @param sq
+ *   Pointer to the SQ structure.
+ * @param index
+ *   Index of the freed WQE.
+ * @param num_txbbs
+ *   Number of blocks to stamp.
+ *   If < 0 the routine will use the size written in the WQ entry.
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of Tx basic blocks (TXBB) the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+					  (!!owner << MLX4_SQ_STAMP_SHIFT));
+	uint8_t *wqe = mlx4_get_send_wqe(sq, (index & sq->txbb_cnt_mask));
+	uint32_t *ptr = (uint32_t *)wqe;
+	int i;
+	int txbbs_size;
+	int num_txbbs;
+
+	/* Extract the size from the control segment of the WQE. */
+	num_txbbs = MLX4_SIZE_TO_TXBBS((((struct mlx4_wqe_ctrl_seg *)
+					 wqe)->fence_size & 0x3f) << 4);
+	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	/* Optimize the common case when there is no wrap-around. */
+	if (wqe + txbbs_size <= sq->eob) {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+			if ((uint8_t *)ptr >= sq->eob) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= RTE_BE32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
  * Manage Tx completions.
  *
  * When sending a burst, mlx4_tx_burst() posts several WRs.
@@ -80,26 +137,71 @@
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
-	int wcs_n;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
-	if (unlikely(wcs_n == 0))
+	/*
+	 * Traverse over all CQ entries reported and handle each WQ entry
+	 * reported by them.
+	 */
+	do {
+		cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cons_index);
+		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		    !!(cons_index & cq->cqe_cnt)))
+			break;
+		/*
+		 * Make sure we read the CQE after we read the ownership bit.
+		 */
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      (void *)txq, cqe_err->vendor_err,
+			      cqe_err->syndrome);
+		}
+		/* Get WQE index reported in the CQE. */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* Free next descriptor. */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		cons_index++;
+	} while (1);
+	if (unlikely(pkts == 0))
 		return 0;
-	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
-		      (void *)txq, wcs_n);
-		return -1;
-	}
-	elts_comp -= wcs_n;
+	/*
+	 * Update CQ.
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	/* Update the list of packets posted for transmission. */
+	elts_comp -= pkts;
 	assert(elts_comp <= txq->elts_comp);
 	/*
-	 * Assume WC status is successful as nothing can be done about it
-	 * anyway.
+	 * Assume completion status is successful as nothing can be done about
+	 * it anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -183,6 +285,119 @@
 }
 
 /**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   Target Tx queue.
+ * @param pkt
+ *   Packet to transmit.
+ * @param send_flags
+ *   @p MLX4_WQE_CTRL_CQ_UPDATE to request completion on this packet.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+static inline int
+mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt, uint32_t send_flags)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t lkey;
+	uintptr_t addr;
+	int wqe_real_size;
+	int nr_txbbs;
+	int rc;
+
+	/* Calculate the needed work queue entry size for this packet. */
+	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+	/*
+	 * Check that there is room for this WQE in the send queue and that
+	 * the WQE size is legal.
+	 */
+	if (((sq->head - sq->tail) + nr_txbbs +
+	     sq->headroom_txbbs) >= sq->txbb_cnt ||
+	    nr_txbbs > MLX4_MAX_WQE_TXBBS) {
+		rc = ENOSPC;
+		goto err;
+	}
+	/* Get the control and single-data entries of the WQE. */
+	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
+					    sizeof(struct mlx4_wqe_ctrl_seg));
+	/* Fill the data segment with buffer information. */
+	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
+	rte_prefetch0((volatile void *)addr);
+	dseg->addr = rte_cpu_to_be_64(addr);
+	/* Memory region key for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		/* MR does not exist. */
+		DEBUG("%p: unable to get MP <-> MR association", (void *)txq);
+		/*
+		 * Restamp entry in case of failure, make sure that size is
+		 * written correctly.
+		 * Note that we give ownership to the SW, not the HW.
+		 */
+		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+		mlx4_txq_stamp_freed_wqe(sq, head_idx,
+					 (sq->head & sq->txbb_cnt) ? 0 : 1);
+		rc = EFAULT;
+		goto err;
+	}
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * Need a barrier here before writing the byte_count field to
+	 * make sure that all the data is visible before the
+	 * byte_count field is set. Otherwise, if the segment begins
+	 * a new cache line, the HCA prefetcher could grab the 64-byte
+	 * chunk and get a valid (!= 0xffffffff) byte count but
+	 * stale data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+	if (likely(pkt->data_len))
+		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
+	else
+		/*
+		 * Zero length segment is treated as inline segment
+		 * with zero data.
+		 */
+		dseg->byte_count = RTE_BE32(0x80000000);
+	/*
+	 * Fill the control parameters for this packet.
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
+	 * should be calculated.
+	 */
+	ctrl->srcrb_flags =
+		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
+				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+	/*
+	 * The caller should prepare "imm" in advance in order to support
+	 * VF to VF communication (when the device is a virtual-function
+	 * device (VF)).
+	 */
+	ctrl->imm = 0;
+	/*
+	 * Make sure descriptor is fully written before setting ownership
+	 * bit (because HW can start executing as soon as we do).
+	 */
+	rte_wmb();
+	ctrl->owner_opcode =
+		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
+				 ((sq->head & sq->txbb_cnt) ?
+				  MLX4_BIT_WQE_OWN : 0));
+	sq->head += nr_txbbs;
+	return 0;
+err:
+	rte_errno = rc;
+	return -rc;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,13 +414,11 @@
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
-	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
 	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
+	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
@@ -229,9 +442,7 @@
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		struct ibv_send_wr *wr = &elt->wr;
 		unsigned int segs = buf->nb_segs;
-		unsigned int sent_size = 0;
 		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
@@ -254,93 +465,43 @@
 		if (unlikely(--elts_comp_cd == 0)) {
 			elts_comp_cd = txq->elts_comp_cd_init;
 			++elts_comp;
-			send_flags |= IBV_SEND_SIGNALED;
+			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
 		}
 		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
+			/* Update element. */
+			elt->buf = buf;
+			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+			/* Post the packet for sending. */
+			err = mlx4_post_send(txq, buf, send_flags);
+			if (unlikely(err)) {
+				if (unlikely(send_flags &
+					     MLX4_WQE_CTRL_CQ_UPDATE)) {
+					elts_comp_cd = 1;
+					--elts_comp;
+				}
 				elt->buf = NULL;
 				goto stop;
 			}
-			/* Update element. */
 			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
+			bytes_sent += buf->pkt_len;
 		} else {
-			err = -1;
+			err = -EINVAL;
+			rte_errno = -err;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
 		elts_head = elts_head_next;
-		/* Increment sent bytes counter. */
-		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
-	/* Increment sent packets counter. */
+	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
+	txq->stats.obytes += bytes_sent;
+	/* Make sure that descriptors are written before doorbell record. */
+	rte_wmb();
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
 	txq->elts_comp_cd = elts_comp_cd;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fec998a..cc5951c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -40,6 +40,7 @@
 #ifdef PEDANTIC
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
+#include <infiniband/mlx4dv.h>
 #include <infiniband/verbs.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
@@ -50,6 +51,7 @@
 #include <rte_mempool.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 
 /** Rx queue counters. */
 struct mlx4_rxq_stats {
@@ -85,8 +87,6 @@ struct rxq {
 
 /** Tx element. */
 struct txq_elt {
-	struct ibv_send_wr wr; /* Work request. */
-	struct ibv_sge sge; /* Scatter/gather element. */
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
@@ -100,24 +100,26 @@ struct mlx4_txq_stats {
 
 /** Tx queue descriptor. */
 struct txq {
-	struct priv *priv; /**< Back pointer to private data. */
-	struct {
-		const struct rte_mempool *mp; /**< Cached memory pool. */
-		struct ibv_mr *mr; /**< Memory region (for mp). */
-		uint32_t lkey; /**< mr->lkey copy. */
-	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
-	struct ibv_cq *cq; /**< Completion queue. */
-	struct ibv_qp *qp; /**< Queue pair. */
-	uint32_t max_inline; /**< Max inline send size. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	struct txq_elt (*elts)[]; /**< Tx elements. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
 	unsigned int elts_comp; /**< Number of completion requests. */
 	unsigned int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
+	unsigned int elts_n; /**< (*elts)[] length. */
+	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
+	uint32_t max_inline; /**< Max inline send size. */
+	struct {
+		const struct rte_mempool *mp; /**< Cached memory pool. */
+		struct ibv_mr *mr; /**< Memory region (for mp). */
+		uint32_t lkey; /**< mr->lkey copy. */
+	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
+	struct priv *priv; /**< Back pointer to private data. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
+	struct ibv_cq *cq; /**< Completion queue. */
+	struct ibv_qp *qp; /**< Queue pair. */
 };
 
 /* mlx4_rxq.c */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index e0245b0..fb28ef2 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -62,6 +62,7 @@
 #include "mlx4_autoconf.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
+#include "mlx4_prm.h"
 
 /**
  * Allocate Tx queue elements.
@@ -242,6 +243,41 @@ struct txq_mp2mr_mbuf_check_data {
 }
 
 /**
+ * Retrieves information needed in order to directly access the Tx queue.
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param mlxdv
+ *   Pointer to device information for this Tx queue.
+ */
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	/* Total length, including headroom and spare WQEs. */
+	sq->eob = sq->buf + sq_size;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_cnt =
+		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs =
+		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
+	cq->buf = dcq->buf.buf;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_64 = (dcq->cqe_size & 64) ? 1 : 0;
+}
+
+/**
  * Configure a Tx queue.
  *
  * @param dev
@@ -263,6 +299,9 @@ struct txq_mp2mr_mbuf_check_data {
 	       unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct txq tmpl = {
 		.priv = priv,
 		.socket = socket
@@ -370,6 +409,18 @@ struct txq_mp2mr_mbuf_check_data {
 	DEBUG("%p: txq updated with %p", (void *)txq, (void *)&tmpl);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = txq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: failed to obtain information needed for"
+		      " accessing the device queues", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	return 0;
 error:
 	ret = rte_errno;
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 29507dc..1435cb6 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -133,7 +133,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
 endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LIO_PMD)        += -lrte_pmd_lio
-_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -libverbs -lmlx4
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -libverbs -lmlx5
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NFP_PMD)        += -lrte_pmd_nfp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_NULL)       += -lrte_pmd_null
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 2/7] net/mlx4: restore full Rx support bypassing Verbs
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 1/7] net/mlx4: add simple Tx bypassing Verbs Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 3/7] net/mlx4: restore Rx scatter support Ophir Munk
                         ` (7 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky,
	Vasily Philipov, Ophir Munk

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for accessing the hardware directly when handling
Rx packets eliminating the need to use Verbs in the Rx data path.

The number of scatters is limited to one.
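
For illustration only, the sketch below shows the kind of per-slot
refill this direct data path performs: the data segment the HW will use
next is rewritten in place with a fresh mbuf. It relies on the struct
rxq fields (wqes, elts, mr, mp) introduced by this patch;
ex_refill_rx_slot() is a hypothetical helper that would live next to
the Rx queue code:

/* Hypothetical helper; relies on the driver's struct rxq from this patch. */
static int
ex_refill_rx_slot(struct rxq *rxq, unsigned int idx)
{
	volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[idx];
	struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);

	if (buf == NULL)
		return -ENOMEM;
	/* Expose the whole tailroom to the HW, as mlx4_rxq_alloc_elts() does. */
	buf->data_len = rte_pktmbuf_tailroom(buf);
	*scat = (struct mlx4_wqe_data_seg){
		.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf, uintptr_t)),
		.byte_count = rte_cpu_to_be_32(buf->data_len),
		.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
	};
	(*rxq->elts)[idx] = buf;
	return 0;
}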

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxq.c   | 110 +++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.c  | 223 +++++++++++++++++++++++-------------------
 drivers/net/mlx4/mlx4_rxtx.h  |  18 ++--
 drivers/net/mlx4/mlx4_utils.h |  20 ++++
 4 files changed, 212 insertions(+), 159 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 409983f..9b98d86 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -51,6 +51,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
 
+#include <rte_byteorder.h>
 #include <rte_common.h>
 #include <rte_errno.h>
 #include <rte_ethdev.h>
@@ -77,20 +78,17 @@
 mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 {
 	unsigned int i;
-	struct rxq_elt (*elts)[elts_n] =
-		rte_calloc_socket("RXQ elements", 1, sizeof(*elts), 0,
-				  rxq->socket);
+	struct rte_mbuf *(*elts)[elts_n] =
+		rte_calloc_socket("RXQ", 1, sizeof(*elts), 0, rxq->socket);
 
+	assert(rte_is_power_of_2(elts_n));
 	if (elts == NULL) {
 		rte_errno = ENOMEM;
 		ERROR("%p: can't allocate packets array", (void *)rxq);
 		goto error;
 	}
-	/* For each WR (packet). */
 	for (i = 0; (i != elts_n); ++i) {
-		struct rxq_elt *elt = &(*elts)[i];
-		struct ibv_recv_wr *wr = &elt->wr;
-		struct ibv_sge *sge = &(*elts)[i].sge;
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[i];
 		struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);
 
 		if (buf == NULL) {
@@ -98,37 +96,32 @@
 			ERROR("%p: empty mbuf pool", (void *)rxq);
 			goto error;
 		}
-		elt->buf = buf;
-		wr->next = &(*elts)[(i + 1)].wr;
-		wr->sg_list = sge;
-		wr->num_sge = 1;
 		/* Headroom is reserved by rte_pktmbuf_alloc(). */
 		assert(buf->data_off == RTE_PKTMBUF_HEADROOM);
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
-		/* sge->addr must be able to store a pointer. */
-		assert(sizeof(sge->addr) >= sizeof(uintptr_t));
-		/* SGE keeps its headroom. */
-		sge->addr = (uintptr_t)
-			((uint8_t *)buf->buf_addr + RTE_PKTMBUF_HEADROOM);
-		sge->length = (buf->buf_len - RTE_PKTMBUF_HEADROOM);
-		sge->lkey = rxq->mr->lkey;
-		/* Redundant check for tailroom. */
-		assert(sge->length == rte_pktmbuf_tailroom(buf));
+		buf->port = rxq->port_id;
+		buf->data_len = rte_pktmbuf_tailroom(buf);
+		buf->pkt_len = rte_pktmbuf_tailroom(buf);
+		buf->nb_segs = 1;
+		*scat = (struct mlx4_wqe_data_seg){
+			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
+								  uintptr_t)),
+			.byte_count = rte_cpu_to_be_32(buf->data_len),
+			.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
+		};
+		(*elts)[i] = buf;
 	}
-	/* The last WR pointer must be NULL. */
-	(*elts)[(i - 1)].wr.next = NULL;
 	DEBUG("%p: allocated and configured %u single-segment WRs",
 	      (void *)rxq, elts_n);
-	rxq->elts_n = elts_n;
-	rxq->elts_head = 0;
+	rxq->elts_n = log2above(elts_n);
 	rxq->elts = elts;
 	return 0;
 error:
 	if (elts != NULL) {
 		for (i = 0; (i != RTE_DIM(*elts)); ++i)
-			rte_pktmbuf_free_seg((*elts)[i].buf);
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
 		rte_free(elts);
 	}
 	DEBUG("%p: failed, freed everything", (void *)rxq);
@@ -146,17 +139,16 @@
 mlx4_rxq_free_elts(struct rxq *rxq)
 {
 	unsigned int i;
-	unsigned int elts_n = rxq->elts_n;
-	struct rxq_elt (*elts)[elts_n] = rxq->elts;
 
-	DEBUG("%p: freeing WRs", (void *)rxq);
+	if (rxq->elts == NULL)
+		return;
+	DEBUG("%p: freeing Rx queue elements", (void *)rxq);
+	for (i = 0; i != (1u << rxq->elts_n); ++i)
+		if ((*rxq->elts)[i] != NULL)
+			rte_pktmbuf_free_seg((*rxq->elts)[i]);
+	rte_free(rxq->elts);
 	rxq->elts_n = 0;
 	rxq->elts = NULL;
-	if (elts == NULL)
-		return;
-	for (i = 0; (i != RTE_DIM(*elts)); ++i)
-		rte_pktmbuf_free_seg((*elts)[i].buf);
-	rte_free(elts);
 }
 
 /**
@@ -211,7 +203,7 @@
 			.max_recv_wr = ((priv->device_attr.max_qp_wr < desc) ?
 					priv->device_attr.max_qp_wr :
 					desc),
-			/* Max number of scatter/gather elements in a WR. */
+			/* Maximum number of segments per packet. */
 			.max_recv_sge = 1,
 		},
 		.qp_type = IBV_QPT_RAW_PACKET,
@@ -248,13 +240,15 @@
 	       struct rte_mempool *mp)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct rxq tmpl = {
 		.priv = priv,
 		.mp = mp,
 		.socket = socket
 	};
 	struct ibv_qp_attr mod;
-	struct ibv_recv_wr *bad_wr;
 	unsigned int mb_len;
 	int ret;
 
@@ -336,21 +330,6 @@
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
-	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
-	if (ret) {
-		ERROR("%p: RXQ allocation failed: %s",
-		      (void *)dev, strerror(rte_errno));
-		goto error;
-	}
-	ret = ibv_post_recv(tmpl.qp, &(*tmpl.elts)[0].wr, &bad_wr);
-	if (ret) {
-		rte_errno = ret;
-		ERROR("%p: ibv_post_recv() failed for WR %p: %s",
-		      (void *)dev,
-		      (void *)bad_wr,
-		      strerror(rte_errno));
-		goto error;
-	}
 	mod = (struct ibv_qp_attr){
 		.qp_state = IBV_QPS_RTR
 	};
@@ -361,9 +340,34 @@
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = tmpl.cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = tmpl.qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		ERROR("%p: failed to obtain device information", (void *)dev);
+		goto error;
+	}
+	tmpl.wqes =
+		(volatile struct mlx4_wqe_data_seg (*)[])
+		((uintptr_t)dv_qp.buf.buf + dv_qp.rq.offset);
+	tmpl.rq_db = dv_qp.rdb;
+	tmpl.rq_ci = 0;
+	tmpl.mcq.buf = dv_cq.buf.buf;
+	tmpl.mcq.cqe_cnt = dv_cq.cqe_cnt;
+	tmpl.mcq.set_ci_db = dv_cq.set_ci_db;
+	tmpl.mcq.cqe_64 = (dv_cq.cqe_size & 64) ? 1 : 0;
 	/* Save port ID. */
 	tmpl.port_id = dev->data->port_id;
 	DEBUG("%p: RTE port ID: %u", (void *)rxq, tmpl.port_id);
+	ret = mlx4_rxq_alloc_elts(&tmpl, desc);
+	if (ret) {
+		ERROR("%p: RXQ allocation failed: %s",
+		      (void *)dev, strerror(rte_errno));
+		goto error;
+	}
 	/* Clean up rxq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old rxq just in case", (void *)rxq);
 	mlx4_rxq_cleanup(rxq);
@@ -406,6 +410,12 @@
 	struct rxq *rxq = dev->data->rx_queues[idx];
 	int ret;
 
+	if (!rte_is_power_of_2(desc)) {
+		desc = 1 << log2above(desc);
+		WARN("%p: increased number of descriptors in RX queue %u"
+		     " to the next power of two (%d)",
+		     (void *)dev, idx, desc);
+	}
 	DEBUG("%p: configuring queue %u for %u descriptors",
 	      (void *)dev, idx, desc);
 	if (idx >= dev->data->nb_rx_queues) {
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 35367a2..5c1b8ef 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -509,9 +509,44 @@
 }
 
 /**
- * DPDK callback for Rx.
+ * Poll one CQE from CQ.
  *
- * The following function doesn't manage scattered packets.
+ * @param rxq
+ *   Pointer to the receive queue structure.
+ * @param[out] out
+ *   Just polled CQE.
+ *
+ * @return
+ *   Number of bytes of the CQE, 0 in case there is no completion.
+ */
+static unsigned int
+mlx4_cq_poll_one(struct rxq *rxq, struct mlx4_cqe **out)
+{
+	int ret = 0;
+	struct mlx4_cqe *cqe = NULL;
+	struct mlx4_cq *cq = &rxq->mcq;
+
+	cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cq->cons_index);
+	if (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+	    !!(cq->cons_index & cq->cqe_cnt))
+		goto out;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rte_rmb();
+	assert(!(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK));
+	assert((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) !=
+	       MLX4_CQE_OPCODE_ERROR);
+	ret = rte_be_to_cpu_32(cqe->byte_cnt);
+	++cq->cons_index;
+out:
+	*out = cqe;
+	return ret;
+}
+
+/**
+ * DPDK callback for Rx with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -526,112 +561,104 @@
 uint16_t
 mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
-	struct rxq *rxq = (struct rxq *)dpdk_rxq;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
-	const unsigned int elts_n = rxq->elts_n;
-	unsigned int elts_head = rxq->elts_head;
-	struct ibv_wc wcs[pkts_n];
-	struct ibv_recv_wr *wr_head = NULL;
-	struct ibv_recv_wr **wr_next = &wr_head;
-	struct ibv_recv_wr *wr_bad = NULL;
-	unsigned int i;
-	unsigned int pkts_ret = 0;
-	int ret;
+	struct rxq *rxq = dpdk_rxq;
+	const uint32_t wr_cnt = (1 << rxq->elts_n) - 1;
+	struct rte_mbuf *pkt = NULL;
+	struct rte_mbuf *seg = NULL;
+	unsigned int i = 0;
+	uint32_t rq_ci = rxq->rq_ci;
+	int len = 0;
 
-	ret = ibv_poll_cq(rxq->cq, pkts_n, wcs);
-	if (unlikely(ret == 0))
-		return 0;
-	if (unlikely(ret < 0)) {
-		DEBUG("rxq=%p, ibv_poll_cq() failed (wc_n=%d)",
-		      (void *)rxq, ret);
-		return 0;
-	}
-	assert(ret <= (int)pkts_n);
-	/* For each work completion. */
-	for (i = 0; i != (unsigned int)ret; ++i) {
-		struct ibv_wc *wc = &wcs[i];
-		struct rxq_elt *elt = &(*elts)[elts_head];
-		struct ibv_recv_wr *wr = &elt->wr;
-		uint32_t len = wc->byte_len;
-		struct rte_mbuf *seg = elt->buf;
-		struct rte_mbuf *rep;
+	while (pkts_n) {
+		struct mlx4_cqe *cqe;
+		uint32_t idx = rq_ci & wr_cnt;
+		struct rte_mbuf *rep = (*rxq->elts)[idx];
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[idx];
 
-		/* Sanity checks. */
-		assert(wr->sg_list == &elt->sge);
-		assert(wr->num_sge == 1);
-		assert(elts_head < rxq->elts_n);
-		assert(rxq->elts_head < rxq->elts_n);
-		/*
-		 * Fetch initial bytes of packet descriptor into a
-		 * cacheline while allocating rep.
-		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
-		/* Link completed WRs together for repost. */
-		*wr_next = wr;
-		wr_next = &wr->next;
-		if (unlikely(wc->status != IBV_WC_SUCCESS)) {
-			/* Whatever, just repost the offending WR. */
-			DEBUG("rxq=%p: bad work completion status (%d): %s",
-			      (void *)rxq, wc->status,
-			      ibv_wc_status_str(wc->status));
-			/* Increment dropped packets counter. */
-			++rxq->stats.idropped;
-			goto repost;
-		}
+		/* Update the 'next' pointer of the previous segment. */
+		if (pkt)
+			seg->next = rep;
+		seg = rep;
+		rte_prefetch0(seg);
+		rte_prefetch0(scat);
 		rep = rte_mbuf_raw_alloc(rxq->mp);
 		if (unlikely(rep == NULL)) {
-			/*
-			 * Unable to allocate a replacement mbuf,
-			 * repost WR.
-			 */
-			DEBUG("rxq=%p: can't allocate a new mbuf",
-			      (void *)rxq);
-			/* Increase out of memory counters. */
 			++rxq->stats.rx_nombuf;
-			++rxq->priv->dev->data->rx_mbuf_alloc_failed;
-			goto repost;
+			if (!pkt) {
+				/*
+				 * No buffers before we even started,
+				 * bail out silently.
+				 */
+				break;
+			}
+			while (pkt != seg) {
+				assert(pkt != (*rxq->elts)[idx]);
+				rep = pkt->next;
+				pkt->next = NULL;
+				pkt->nb_segs = 1;
+				rte_mbuf_raw_free(pkt);
+				pkt = rep;
+			}
+			break;
+		}
+		if (!pkt) {
+			/* Looking for the new packet. */
+			len = mlx4_cq_poll_one(rxq, &cqe);
+			if (!len) {
+				rte_mbuf_raw_free(rep);
+				break;
+			}
+			if (unlikely(len < 0)) {
+				/* Rx error, packet is likely too large. */
+				rte_mbuf_raw_free(rep);
+				++rxq->stats.idropped;
+				goto skip;
+			}
+			pkt = seg;
+			pkt->packet_type = 0;
+			pkt->ol_flags = 0;
+			pkt->pkt_len = len;
 		}
-		/* Reconfigure sge to use rep instead of seg. */
-		elt->sge.addr = (uintptr_t)rep->buf_addr + RTE_PKTMBUF_HEADROOM;
-		assert(elt->sge.lkey == rxq->mr->lkey);
-		elt->buf = rep;
-		/* Update seg information. */
-		seg->data_off = RTE_PKTMBUF_HEADROOM;
-		seg->nb_segs = 1;
-		seg->port = rxq->port_id;
-		seg->next = NULL;
-		seg->pkt_len = len;
+		rep->nb_segs = 1;
+		rep->port = rxq->port_id;
+		rep->data_len = seg->data_len;
+		rep->data_off = seg->data_off;
+		(*rxq->elts)[idx] = rep;
+		/*
+		 * Fill NIC descriptor with the new buffer. The lkey and size
+		 * of the buffers are already known, only the buffer address
+		 * changes.
+		 */
+		scat->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
+		if (len > seg->data_len) {
+			len -= seg->data_len;
+			++pkt->nb_segs;
+			++rq_ci;
+			continue;
+		}
+		/* The last segment. */
 		seg->data_len = len;
-		seg->packet_type = 0;
-		seg->ol_flags = 0;
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += pkt->pkt_len;
 		/* Return packet. */
-		*(pkts++) = seg;
-		++pkts_ret;
-		/* Increase bytes counter. */
-		rxq->stats.ibytes += len;
-repost:
-		if (++elts_head >= elts_n)
-			elts_head = 0;
-		continue;
+		*(pkts++) = pkt;
+		pkt = NULL;
+		--pkts_n;
+		++i;
+skip:
+		/* Update consumer index */
+		++rq_ci;
 	}
-	if (unlikely(i == 0))
+	if (unlikely(i == 0 && rq_ci == rxq->rq_ci))
 		return 0;
-	/* Repost WRs. */
-	*wr_next = NULL;
-	assert(wr_head);
-	ret = ibv_post_recv(rxq->qp, wr_head, &wr_bad);
-	if (unlikely(ret)) {
-		/* Inability to repost WRs is fatal. */
-		DEBUG("%p: recv_burst(): failed (ret=%d)",
-		      (void *)rxq->priv,
-		      ret);
-		abort();
-	}
-	rxq->elts_head = elts_head;
-	/* Increase packets counter. */
-	rxq->stats.ipackets += pkts_ret;
-	return pkts_ret;
+	/* Update the consumer index. */
+	rxq->rq_ci = rq_ci;
+	rte_wmb();
+	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
+	*rxq->mcq.set_ci_db = rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+	return i;
 }
 
 /**
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index cc5951c..939ae75 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -62,13 +62,6 @@ struct mlx4_rxq_stats {
 	uint64_t rx_nombuf; /**< Total of Rx mbuf allocation failures. */
 };
 
-/** Rx element. */
-struct rxq_elt {
-	struct ibv_recv_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
-	struct rte_mbuf *buf; /**< Buffer. */
-};
-
 /** Rx queue descriptor. */
 struct rxq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -77,10 +70,13 @@ struct rxq {
 	struct ibv_cq *cq; /**< Completion queue. */
 	struct ibv_qp *qp; /**< Queue pair. */
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
-	unsigned int port_id; /**< Port ID for incoming packets. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	struct rxq_elt (*elts)[]; /**< Rx elements. */
+	uint16_t rq_ci; /**< Saved RQ consumer index. */
+	uint16_t port_id; /**< Port ID for incoming packets. */
+	uint16_t elts_n; /**< Mbuf queue size (log2 value). */
+	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
+	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
+	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 };
diff --git a/drivers/net/mlx4/mlx4_utils.h b/drivers/net/mlx4/mlx4_utils.h
index 0fbdc71..d6f729f 100644
--- a/drivers/net/mlx4/mlx4_utils.h
+++ b/drivers/net/mlx4/mlx4_utils.h
@@ -108,4 +108,24 @@
 
 int mlx4_fd_set_non_blocking(int fd);
 
+/**
+ * Return nearest power of two above input value.
+ *
+ * @param v
+ *   Input value.
+ *
+ * @return
+ *   Nearest power of two above input value.
+ */
+static inline unsigned int
+log2above(unsigned int v)
+{
+	unsigned int l;
+	unsigned int r;
+
+	for (l = 0, r = 0; (v >> 1); ++l, v >>= 1)
+		r |= (v & 1);
+	return l + r;
+}
+
 #endif /* MLX4_UTILS_H_ */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 3/7] net/mlx4: restore Rx scatter support
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 1/7] net/mlx4: add simple Tx bypassing Verbs Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 2/7] net/mlx4: restore full Rx support " Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 4/7] net/mlx4: restore Tx gather support Ophir Munk
                         ` (6 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Ophir Munk,
	Vasily Philipov

Calculate the number of scatters on the fly according to
the maximum expected packet size.

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxq.c  | 64 +++++++++++++++++++++++++++++++++++++-------
 drivers/net/mlx4/mlx4_rxtx.c | 11 +++++---
 drivers/net/mlx4/mlx4_rxtx.h |  1 +
 3 files changed, 62 insertions(+), 14 deletions(-)
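
To make the sizing arithmetic concrete, here is a rough standalone sketch
of how the per-packet SGE count is derived in mlx4_rxq_setup() below; it
reuses the log2above() helper added earlier in the series, and the sizes
in the comment are only illustrative defaults:

#include <stdint.h>

/* Nearest power of two above the input value (as in mlx4_utils.h). */
static unsigned int
log2above(unsigned int v)
{
	unsigned int l;
	unsigned int r;

	for (l = 0, r = 0; (v >> 1); ++l, v >>= 1)
		r |= (v & 1);
	return l + r;
}

/*
 * Example: max_rx_pkt_len = 9000, headroom = 128, mb_len = 2176
 * => size = 9128, 9128 / 2176 = 4 remainder 424 => 5 segments
 * => sges_n = log2above(5) = 3, i.e. 8 SGEs reserved per packet.
 */
static uint32_t
rx_sges_log2(uint32_t max_rx_pkt_len, uint32_t headroom, uint32_t mb_len)
{
	uint32_t size = headroom + max_rx_pkt_len;

	return log2above(size / mb_len + !!(size % mb_len));
}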

diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 9b98d86..44d095d 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -78,6 +78,7 @@
 mlx4_rxq_alloc_elts(struct rxq *rxq, unsigned int elts_n)
 {
 	unsigned int i;
+	const uint32_t sges_n = 1 << rxq->sges_n;
 	struct rte_mbuf *(*elts)[elts_n] =
 		rte_calloc_socket("RXQ", 1, sizeof(*elts), 0, rxq->socket);
 
@@ -101,6 +102,9 @@
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
+		/* Only the first segment keeps headroom. */
+		if (i % sges_n)
+			buf->data_off = 0;
 		buf->port = rxq->port_id;
 		buf->data_len = rte_pktmbuf_tailroom(buf);
 		buf->pkt_len = rte_pktmbuf_tailroom(buf);
@@ -113,8 +117,8 @@
 		};
 		(*elts)[i] = buf;
 	}
-	DEBUG("%p: allocated and configured %u single-segment WRs",
-	      (void *)rxq, elts_n);
+	DEBUG("%p: allocated and configured %u segments (max %u packets)",
+	      (void *)rxq, elts_n, elts_n >> rxq->sges_n);
 	rxq->elts_n = log2above(elts_n);
 	rxq->elts = elts;
 	return 0;
@@ -185,12 +189,15 @@
  *   Completion queue to associate with QP.
  * @param desc
  *   Number of descriptors in QP (hint only).
+ * @param sges_n
+ *   Maximum number of segments per packet.
  *
  * @return
  *   QP pointer or NULL in case of error and rte_errno is set.
  */
 static struct ibv_qp *
-mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc)
+mlx4_rxq_setup_qp(struct priv *priv, struct ibv_cq *cq, uint16_t desc,
+		  uint32_t sges_n)
 {
 	struct ibv_qp *qp;
 	struct ibv_qp_init_attr attr = {
@@ -204,7 +211,7 @@
 					priv->device_attr.max_qp_wr :
 					desc),
 			/* Maximum number of segments per packet. */
-			.max_recv_sge = 1,
+			.max_recv_sge = sges_n,
 		},
 		.qp_type = IBV_QPT_RAW_PACKET,
 	};
@@ -263,11 +270,31 @@
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
 	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
-		;
+		tmpl.sges_n = 0;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
-		WARN("%p: scattered mode has been requested but is"
-		     " not supported, this may lead to packet loss",
-		     (void *)dev);
+		uint32_t size =
+			RTE_PKTMBUF_HEADROOM +
+			dev->data->dev_conf.rxmode.max_rx_pkt_len;
+		uint32_t sges_n;
+
+		/*
+		 * Determine the number of SGEs needed for a full packet
+		 * and round it to the next power of two.
+		 */
+		sges_n = log2above((size / mb_len) + !!(size % mb_len));
+		tmpl.sges_n = sges_n;
+		/* Make sure sges_n did not overflow. */
+		size = mb_len * (1 << tmpl.sges_n);
+		size -= RTE_PKTMBUF_HEADROOM;
+		if (size < dev->data->dev_conf.rxmode.max_rx_pkt_len) {
+			rte_errno = EOVERFLOW;
+			ERROR("%p: too many SGEs (%u) needed to handle"
+			      " requested maximum packet size %u",
+			      (void *)dev,
+			      1 << sges_n,
+			      dev->data->dev_conf.rxmode.max_rx_pkt_len);
+			goto error;
+		}
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -276,6 +303,17 @@
 		     dev->data->dev_conf.rxmode.max_rx_pkt_len,
 		     mb_len - RTE_PKTMBUF_HEADROOM);
 	}
+	DEBUG("%p: maximum number of segments per packet: %u",
+	      (void *)dev, 1 << tmpl.sges_n);
+	if (desc % (1 << tmpl.sges_n)) {
+		rte_errno = EINVAL;
+		ERROR("%p: number of RX queue descriptors (%u) is not a"
+		      " multiple of maximum segments per packet (%u)",
+		      (void *)dev,
+		      desc,
+		      1 << tmpl.sges_n);
+		goto error;
+	}
 	/* Use the entire Rx mempool as the memory region. */
 	tmpl.mr = mlx4_mp2mr(priv->pd, mp);
 	if (tmpl.mr == NULL) {
@@ -300,7 +338,8 @@
 			goto error;
 		}
 	}
-	tmpl.cq = ibv_create_cq(priv->ctx, desc, NULL, tmpl.channel, 0);
+	tmpl.cq = ibv_create_cq(priv->ctx, desc >> tmpl.sges_n, NULL,
+				tmpl.channel, 0);
 	if (tmpl.cq == NULL) {
 		rte_errno = ENOMEM;
 		ERROR("%p: CQ creation failure: %s",
@@ -311,7 +350,8 @@
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
 	      priv->device_attr.max_sge);
-	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc);
+	tmpl.qp = mlx4_rxq_setup_qp(priv, tmpl.cq, desc >> tmpl.sges_n,
+				    1 << tmpl.sges_n);
 	if (tmpl.qp == NULL) {
 		ERROR("%p: QP creation failure: %s",
 		      (void *)dev, strerror(rte_errno));
@@ -373,6 +413,10 @@
 	mlx4_rxq_cleanup(rxq);
 	*rxq = tmpl;
 	DEBUG("%p: rxq updated with %p", (void *)rxq, (void *)&tmpl);
+	/* Update doorbell counter. */
+	rxq->rq_ci = desc >> rxq->sges_n;
+	rte_wmb();
+	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 	return 0;
 error:
 	ret = rte_errno;
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 5c1b8ef..fd8ef7b 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -563,10 +563,11 @@
 {
 	struct rxq *rxq = dpdk_rxq;
 	const uint32_t wr_cnt = (1 << rxq->elts_n) - 1;
+	const uint16_t sges_n = rxq->sges_n;
 	struct rte_mbuf *pkt = NULL;
 	struct rte_mbuf *seg = NULL;
 	unsigned int i = 0;
-	uint32_t rq_ci = rxq->rq_ci;
+	uint32_t rq_ci = rxq->rq_ci << sges_n;
 	int len = 0;
 
 	while (pkts_n) {
@@ -646,13 +647,15 @@
 		--pkts_n;
 		++i;
 skip:
-		/* Update consumer index */
+		/* Align consumer index to the next stride. */
+		rq_ci >>= sges_n;
 		++rq_ci;
+		rq_ci <<= sges_n;
 	}
-	if (unlikely(i == 0 && rq_ci == rxq->rq_ci))
+	if (unlikely(i == 0 && (rq_ci >> sges_n) == rxq->rq_ci))
 		return 0;
 	/* Update the consumer index. */
-	rxq->rq_ci = rq_ci;
+	rxq->rq_ci = rq_ci >> sges_n;
 	rte_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 	*rxq->mcq.set_ci_db = rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 939ae75..ac84177 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -72,6 +72,7 @@ struct rxq {
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
 	uint16_t rq_ci; /**< Saved RQ consumer index. */
 	uint16_t port_id; /**< Port ID for incoming packets. */
+	uint16_t sges_n; /**< Number of segments per packet (log2 value). */
 	uint16_t elts_n; /**< Mbuf queue size (log2 value). */
 	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
 	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 4/7] net/mlx4: restore Tx gather support
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (2 preceding siblings ...)
  2017-10-05  9:33       ` [PATCH v4 3/7] net/mlx4: restore Rx scatter support Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 5/7] net/mlx4: restore Tx checksum offloads Ophir Munk
                         ` (5 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for transmitting packets spanning multiple
buffers.

It also takes into account the number of Tx queue entries a packet
occupies when setting the chip's report-completion flag.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 197 +++++++++++++++++++++++++------------------
 drivers/net/mlx4/mlx4_rxtx.h |   6 +-
 drivers/net/mlx4/mlx4_txq.c  |  12 ++-
 3 files changed, 127 insertions(+), 88 deletions(-)
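
The report-completion handling boils down to a countdown decremented by
the number of TXBBs each packet consumes instead of by one per packet.
A minimal sketch, using simplified stand-ins for the txq fields:

#include <stdbool.h>

struct toy_txq {
	int elts_comp_cd;      /* countdown until next completion request */
	int elts_comp_cd_init; /* countdown reload value */
};

/* Returns true when this WQE must carry MLX4_WQE_CTRL_CQ_UPDATE. */
static bool
toy_tx_request_completion(struct toy_txq *txq, int nr_txbbs)
{
	txq->elts_comp_cd -= nr_txbbs;
	if (txq->elts_comp_cd <= 0) {
		txq->elts_comp_cd = txq->elts_comp_cd_init;
		return true;
	}
	return false;
}

Large multi-segment packets therefore trigger completion requests sooner,
keeping the amount of unsignaled work bounded by TXBBs rather than by
packet count.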

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index fd8ef7b..cc0baaa 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -63,6 +63,15 @@
 #include "mlx4_utils.h"
 
 /**
+ * Pointer-value pair structure used in tx_post_send for saving the first
+ * DWORD (32 byte) of a TXBB.
+ */
+struct pv {
+	struct mlx4_wqe_data_seg *dseg;
+	uint32_t val;
+};
+
+/**
  * Stamp a WQE so it won't be reused by the HW.
  *
  * Routine is used when freeing WQE used by the chip or when failing
@@ -291,24 +300,28 @@
  *   Target Tx queue.
  * @param pkt
  *   Packet to transmit.
- * @param send_flags
- *   @p MLX4_WQE_CTRL_CQ_UPDATE to request completion on this packet.
  *
  * @return
  *   0 on success, negative errno value otherwise and rte_errno is set.
  */
 static inline int
-mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt, uint32_t send_flags)
+mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 {
 	struct mlx4_wqe_ctrl_seg *ctrl;
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
+	struct rte_mbuf *buf;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
+	uint32_t srcrb_flags;
+	uint32_t owner_opcode = MLX4_OPCODE_SEND;
+	uint32_t byte_count;
 	int wqe_real_size;
 	int nr_txbbs;
 	int rc;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
+	int pv_counter = 0;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
@@ -324,56 +337,81 @@
 		rc = ENOSPC;
 		goto err;
 	}
-	/* Get the control and single-data entries of the WQE. */
+	/* Get the control and data entries of the WQE. */
 	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
 	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
 					    sizeof(struct mlx4_wqe_ctrl_seg));
-	/* Fill the data segment with buffer information. */
-	addr = rte_pktmbuf_mtod(pkt, uintptr_t);
-	rte_prefetch0((volatile void *)addr);
-	dseg->addr = rte_cpu_to_be_64(addr);
-	/* Memory region key for this memory pool. */
-	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(pkt));
-	if (unlikely(lkey == (uint32_t)-1)) {
-		/* MR does not exist. */
-		DEBUG("%p: unable to get MP <-> MR association", (void *)txq);
+	/* Fill the data segments with buffer information. */
+	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
+		addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		rte_prefetch0((volatile void *)addr);
+		/* Handle WQE wraparound. */
+		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
+			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+		dseg->addr = rte_cpu_to_be_64(addr);
+		/* Memory region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			/*
+			 * Restamp entry in case of failure.
+			 * Make sure that size is written correctly
+			 * Note that we give ownership to the SW, not the HW.
+			 */
+			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+			mlx4_txq_stamp_freed_wqe(sq, head_idx,
+				     (sq->head & sq->txbb_cnt) ? 0 : 1);
+			rc = EFAULT;
+			goto err;
+		}
+		dseg->lkey = rte_cpu_to_be_32(lkey);
+		if (likely(buf->data_len)) {
+			byte_count = rte_cpu_to_be_32(buf->data_len);
+		} else {
+			/*
+			 * Zero length segment is treated as inline segment
+			 * with zero data.
+			 */
+			byte_count = RTE_BE32(0x80000000);
+		}
 		/*
-		 * Restamp entry in case of failure, make sure that size is
-		 * written correctly.
-		 * Note that we give ownership to the SW, not the HW.
+		 * If the data segment is not at the beginning of a
+		 * Tx basic block (TXBB) then write the byte count,
+		 * else postpone the writing to just before updating the
+		 * control segment.
 		 */
-		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-		mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					 (sq->head & sq->txbb_cnt) ? 0 : 1);
-		rc = EFAULT;
-		goto err;
+		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			/*
+			 * Need a barrier here before writing the byte_count
+			 * fields to make sure that all the data is visible
+			 * before the byte_count field is set.
+			 * Otherwise, if the segment begins a new cacheline,
+			 * the HCA prefetcher could grab the 64-byte chunk and
+			 * get a valid (!= 0xffffffff) byte count but stale
+			 * data, and end up sending the wrong data.
+			 */
+			rte_io_wmb();
+			dseg->byte_count = byte_count;
+		} else {
+			/*
+			 * This data segment starts at the beginning of a new
+			 * TXBB, so we need to postpone its byte_count writing
+			 * for later.
+			 */
+			pv[pv_counter].dseg = dseg;
+			pv[pv_counter++].val = byte_count;
+		}
 	}
-	dseg->lkey = rte_cpu_to_be_32(lkey);
-	/*
-	 * Need a barrier here before writing the byte_count field to
-	 * make sure that all the data is visible before the
-	 * byte_count field is set. Otherwise, if the segment begins
-	 * a new cache line, the HCA prefetcher could grab the 64-byte
-	 * chunk and get a valid (!= 0xffffffff) byte count but
-	 * stale data, and end up sending the wrong data.
-	 */
-	rte_io_wmb();
-	if (likely(pkt->data_len))
-		dseg->byte_count = rte_cpu_to_be_32(pkt->data_len);
-	else
-		/*
-		 * Zero length segment is treated as inline segment
-		 * with zero data.
-		 */
-		dseg->byte_count = RTE_BE32(0x80000000);
-	/*
-	 * Fill the control parameters for this packet.
-	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
-	 * should be calculated.
-	 */
-	ctrl->srcrb_flags =
-		rte_cpu_to_be_32(MLX4_WQE_CTRL_SOLICIT |
-				 (send_flags & MLX4_WQE_CTRL_CQ_UPDATE));
+	/* Write the first DWORD of each TXBB save earlier. */
+	if (pv_counter) {
+		/* Need a barrier here before writing the byte_count. */
+		rte_io_wmb();
+		for (--pv_counter; pv_counter  >= 0; pv_counter--)
+			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
+	}
+	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
 	 * The caller should prepare "imm" in advance in order to support
@@ -382,14 +420,27 @@
 	 */
 	ctrl->imm = 0;
 	/*
-	 * Make sure descriptor is fully written before setting ownership
-	 * bit (because HW can start executing as soon as we do).
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
+	 * should be calculated.
+	 */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+				       MLX4_WQE_CTRL_CQ_UPDATE);
+	} else {
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+	}
+	ctrl->srcrb_flags = srcrb_flags;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
 	 */
 	rte_wmb();
-	ctrl->owner_opcode =
-		rte_cpu_to_be_32(MLX4_OPCODE_SEND |
-				 ((sq->head & sq->txbb_cnt) ?
-				  MLX4_BIT_WQE_OWN : 0));
+	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					      ((sq->head & sq->txbb_cnt) ?
+					       MLX4_BIT_WQE_OWN : 0));
 	sq->head += nr_txbbs;
 	return 0;
 err:
@@ -416,14 +467,13 @@
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -442,8 +492,6 @@
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		unsigned int segs = buf->nb_segs;
-		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -461,34 +509,16 @@
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
-			send_flags |= MLX4_WQE_CTRL_CQ_UPDATE;
-		}
-		if (likely(segs == 1)) {
-			/* Update element. */
-			elt->buf = buf;
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			/* Post the packet for sending. */
-			err = mlx4_post_send(txq, buf, send_flags);
-			if (unlikely(err)) {
-				if (unlikely(send_flags &
-					     MLX4_WQE_CTRL_CQ_UPDATE)) {
-					elts_comp_cd = 1;
-					--elts_comp;
-				}
-				elt->buf = NULL;
-				goto stop;
-			}
-			elt->buf = buf;
-			bytes_sent += buf->pkt_len;
-		} else {
-			err = -EINVAL;
-			rte_errno = -err;
+		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+		/* Post the packet for sending. */
+		err = mlx4_post_send(txq, buf);
+		if (unlikely(err)) {
+			elt->buf = NULL;
 			goto stop;
 		}
+		elt->buf = buf;
+		bytes_sent += buf->pkt_len;
+		++elts_comp;
 		elts_head = elts_head_next;
 	}
 stop:
@@ -504,7 +534,6 @@
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index ac84177..528e286 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -101,13 +101,15 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
+	unsigned int elts_comp; /**< Number of packets awaiting completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint8_t *bounce_buf;
+	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
 		const struct rte_mempool *mp; /**< Cached memory pool. */
 		struct ibv_mr *mr; /**< Memory region (for mp). */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index fb28ef2..7552a88 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -83,8 +83,13 @@
 		rte_calloc_socket("TXQ", 1, sizeof(*elts), 0, txq->socket);
 	int ret = 0;
 
-	if (elts == NULL) {
-		ERROR("%p: can't allocate packets array", (void *)txq);
+	/* Allocate bounce buffer. */
+	txq->bounce_buf = rte_zmalloc_socket("TXQ",
+					     MLX4_MAX_WQE_SIZE,
+					     RTE_CACHE_LINE_MIN_SIZE,
+					     txq->socket);
+	if (!elts || !txq->bounce_buf) {
+		ERROR("%p: can't allocate TXQ memory", (void *)txq);
 		ret = ENOMEM;
 		goto error;
 	}
@@ -110,6 +115,8 @@
 	assert(ret == 0);
 	return 0;
 error:
+	rte_free(txq->bounce_buf);
+	txq->bounce_buf = NULL;
 	rte_free(elts);
 	DEBUG("%p: failed, freed everything", (void *)txq);
 	assert(ret > 0);
@@ -175,6 +182,7 @@
 		claim_zero(ibv_destroy_qp(txq->qp));
 	if (txq->cq != NULL)
 		claim_zero(ibv_destroy_cq(txq->cq));
+	rte_free(txq->bounce_buf);
 	for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) {
 		if (txq->mp2mr[i].mp == NULL)
 			break;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 5/7] net/mlx4: restore Tx checksum offloads
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (3 preceding siblings ...)
  2017-10-05  9:33       ` [PATCH v4 4/7] net/mlx4: restore Tx gather support Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 6/7] net/mlx4: restore Rx offloads Ophir Munk
                         ` (4 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
calculation, including inner/outer checksums on supported tunnel types.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4.c        | 11 +++++++++++
 drivers/net/mlx4/mlx4.h        |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h    |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 19 +++++++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.h   |  2 ++
 drivers/net/mlx4/mlx4_txq.c    |  2 ++
 7 files changed, 44 insertions(+)
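
The offload decision in mlx4_post_send() below distinguishes tunneled from
plain packets: inner checksums are requested through the opcode word,
outer/plain checksums through srcrb_flags. A simplified sketch of that
decision, using stand-in flag values rather than the real PRM constants:

#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag values; the real ones live in mlx4_prm.h. */
#define TOY_IIP_HDR_CSUM  (1u << 28) /* inner IP checksum (opcode word) */
#define TOY_IL4_HDR_CSUM  (1u << 27) /* inner L4 checksum (opcode word) */
#define TOY_IP_HDR_CSUM   (1u << 0)  /* outer/plain IP checksum */
#define TOY_TCP_UDP_CSUM  (1u << 1)  /* outer/plain L4 checksum */

static void
toy_tx_csum_flags(bool want_csum, bool tunneled, bool outer_ip,
		  bool csum, bool csum_l2tun,
		  uint32_t *owner_opcode, uint32_t *srcrb_flags)
{
	if (!csum || !want_csum)
		return;
	if (tunneled && csum_l2tun) {
		/* Inner headers are requested via the opcode word. */
		*owner_opcode |= TOY_IIP_HDR_CSUM | TOY_IL4_HDR_CSUM;
		if (outer_ip)
			*srcrb_flags |= TOY_IP_HDR_CSUM;
	} else {
		*srcrb_flags |= TOY_IP_HDR_CSUM | TOY_TCP_UDP_CSUM;
	}
}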

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index b084903..385ddaa 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -529,6 +529,17 @@ struct mlx4_conf {
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->hw_csum =	!!(device_attr.device_cap_flags &
+				   IBV_DEVICE_RAW_IP_CSUM);
+		DEBUG("checksum offloading is %ssupported",
+		      (priv->hw_csum ? "" : "not "));
+		/* Only ConnectX-3 Pro supports tunneling. */
+		priv->hw_csum_l2tun =
+			priv->hw_csum &&
+			(device_attr.vendor_part_id ==
+			 PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
+		DEBUG("L2 tunnel checksum offloads are %ssupported",
+		      (priv->hw_csum_l2tun ? "" : "not "));
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 93e5502..0b71867 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -104,6 +104,8 @@ struct priv {
 	unsigned int vf:1; /* This is a VF device. */
 	unsigned int intr_alarm:1; /* An interrupt alarm is scheduled. */
 	unsigned int isolated:1; /* Toggle isolated mode. */
+	unsigned int hw_csum:1; /* Checksum offload is supported. */
+	unsigned int hw_csum_l2tun:1; /* Checksum support for L2 tunnels. */
 	struct rte_intr_handle intr_handle; /* Port interrupt handle. */
 	struct rte_flow_drop *flow_drop_queue; /* Flow drop queue. */
 	LIST_HEAD(mlx4_flows, rte_flow) flows;
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a9e8059..bec1787 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,6 +553,12 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->hw_csum_l2tun)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->speed_capa =
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 085a595..df5a6b4 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -64,6 +64,8 @@
 
 /* Work queue element (WQE) flags. */
 #define MLX4_BIT_WQE_OWN 0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index cc0baaa..fe7d5d0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -431,6 +431,25 @@ struct pv {
 	} else {
 		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
+	/* Enable HW checksum offload if requested */
+	if (txq->csum &&
+	    (pkt->ol_flags &
+	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = (pkt->ol_flags &
+					      (PKT_TX_TUNNEL_GRE |
+					       PKT_TX_TUNNEL_VXLAN));
+
+		if (is_tunneled && txq->csum_l2tun) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |=
+					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
+		} else {
+			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+						MLX4_WQE_CTRL_TCP_UDP_CSUM);
+		}
+	}
 	ctrl->srcrb_flags = srcrb_flags;
 	/*
 	 * Make sure descriptor is fully written before
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 528e286..a742f61 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -108,6 +108,8 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7552a88..96429bc 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -338,6 +338,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	tmpl.csum = priv->hw_csum;
+	tmpl.csum_l2tun = priv->hw_csum_l2tun;
 	DEBUG("priv->device_attr.max_qp_wr is %d",
 	      priv->device_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 6/7] net/mlx4: restore Rx offloads
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (4 preceding siblings ...)
  2017-10-05  9:33       ` [PATCH v4 5/7] net/mlx4: restore Tx checksum offloads Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05  9:33       ` [PATCH v4 7/7] net/mlx4: add loopback Tx from VF Ophir Munk
                         ` (3 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky,
	Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
verification, including inner/outer checksums on supported tunnel types.

It also restores packet type recognition support.

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_ethdev.c |   6 ++-
 drivers/net/mlx4/mlx4_prm.h    |  30 +++++++++++
 drivers/net/mlx4/mlx4_rxq.c    |   5 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 118 ++++++++++++++++++++++++++++++++++++++++-
 drivers/net/mlx4/mlx4_rxtx.h   |   2 +
 5 files changed, 158 insertions(+), 3 deletions(-)
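
Packet type and ol_flags recovery relies on the mlx4_transpose() helper
added to mlx4_prm.h below, which moves a flag bit between bit positions
with a single multiply or divide instead of a branch per flag. A small
self-checking example (the bit positions are arbitrary; both flags must
be single bits):

#include <assert.h>
#include <stdint.h>

static inline uint64_t
transpose(uint64_t val, uint64_t from, uint64_t to)
{
	return (from >= to ?
		(val & from) / (from / to) :
		(val & from) * (to / from));
}

int
main(void)
{
	/* Bit 4 in the input becomes bit 1 in the output. */
	assert(transpose(0x10, 0x10, 0x02) == 0x02);
	/* An absent flag stays absent. */
	assert(transpose(0x00, 0x10, 0x02) == 0x00);
	/* It also works from a low bit to a higher one. */
	assert(transpose(0x02, 0x02, 0x10) == 0x10);
	return 0;
}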

diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index bec1787..6dbf273 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -553,10 +553,14 @@
 	info->max_mac_addrs = 1;
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
-	if (priv->hw_csum)
+	if (priv->hw_csum) {
 		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
 					  DEV_TX_OFFLOAD_UDP_CKSUM |
 					  DEV_TX_OFFLOAD_TCP_CKSUM);
+		info->rx_offload_capa |= (DEV_RX_OFFLOAD_IPV4_CKSUM |
+					  DEV_RX_OFFLOAD_UDP_CKSUM |
+					  DEV_RX_OFFLOAD_TCP_CKSUM);
+	}
 	if (priv->hw_csum_l2tun)
 		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index df5a6b4..0d76a73 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -70,6 +70,14 @@
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
 
+/* CQE checksum flags. */
+enum {
+	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
+	MLX4_CQE_L2_TUNNEL_L4_CSUM = (int)(1u << 26),
+	MLX4_CQE_L2_TUNNEL = (int)(1u << 27),
+	MLX4_CQE_L2_TUNNEL_IPOK = (int)(1u << 31),
+};
+
 /* Send queue information. */
 struct mlx4_sq {
 	uint8_t *buf; /**< SQ buffer. */
@@ -119,4 +127,26 @@ struct mlx4_cq {
 				   (cq->cqe_64 << 5));
 }
 
+/**
+ * Transpose a flag in a value.
+ *
+ * @param val
+ *   Input value.
+ * @param from
+ *   Flag to retrieve from input value.
+ * @param to
+ *   Flag to set in output value.
+ *
+ * @return
+ *   Output value with transposed flag enabled if present on input.
+ */
+static inline uint64_t
+mlx4_transpose(uint64_t val, uint64_t from, uint64_t to)
+{
+	return (from >= to ?
+		(val & from) / (from / to) :
+		(val & from) * (to / from));
+}
+
+
 #endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 44d095d..a021a32 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -260,6 +260,11 @@
 	int ret;
 
 	(void)conf; /* Thresholds configuration (ignored). */
+	/* Toggle Rx checksum offload if hardware supports it. */
+	if (priv->hw_csum)
+		tmpl.csum = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
+	if (priv->hw_csum_l2tun)
+		tmpl.csum_l2tun = !!dev->data->dev_conf.rxmode.hw_ip_checksum;
 	mb_len = rte_pktmbuf_data_room_size(mp);
 	if (desc == 0) {
 		rte_errno = EINVAL;
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index fe7d5d0..87c5261 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -557,6 +557,107 @@ struct pv {
 }
 
 /**
+ * Translate Rx completion flags to packet type.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ *
+ * @return
+ *   Packet type in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_pkt_type(uint32_t flags)
+{
+	uint32_t pkt_type;
+
+	if (flags & MLX4_CQE_L2_TUNNEL)
+		pkt_type =
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPV4,
+				       RTE_PTYPE_L3_IPV4_EXT_UNKNOWN) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IPV4_PKT,
+				       RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN);
+	else
+		pkt_type = mlx4_transpose(flags,
+					  MLX4_CQE_STATUS_IPV4_PKT,
+					  RTE_PTYPE_L3_IPV4_EXT_UNKNOWN);
+	return pkt_type;
+}
+
+/**
+ * Translate Rx completion flags to offload flags.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   Offload flags (ol_flags) in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_ol_flags(uint32_t flags, int csum, int csum_l2tun)
+{
+	uint32_t ol_flags = 0;
+
+	if (csum)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IP_HDR_CSUM_OK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_TCP_UDP_CSUM_OK,
+				       PKT_RX_L4_CKSUM_GOOD);
+	if ((flags & MLX4_CQE_L2_TUNNEL) && csum_l2tun)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPOK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_L4_CSUM,
+				       PKT_RX_L4_CKSUM_GOOD);
+	return ol_flags;
+}
+
+/**
+ * Extract checksum information from CQE flags.
+ *
+ * @param cqe
+ *   Pointer to CQE structure.
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   CQE checksum information.
+ */
+static inline uint32_t
+mlx4_cqe_flags(struct mlx4_cqe *cqe, int csum, int csum_l2tun)
+{
+	uint32_t flags = 0;
+
+	/*
+	 * The relevant bits are in different locations on their
+	 * CQE fields therefore we can join them in one 32bit
+	 * variable.
+	 */
+	if (csum)
+		flags = (rte_be_to_cpu_32(cqe->status) &
+			 MLX4_CQE_STATUS_IPV4_CSUM_OK);
+	if (csum_l2tun)
+		flags |= (rte_be_to_cpu_32(cqe->vlan_my_qpn) &
+			  (MLX4_CQE_L2_TUNNEL |
+			   MLX4_CQE_L2_TUNNEL_IPOK |
+			   MLX4_CQE_L2_TUNNEL_L4_CSUM |
+			   MLX4_CQE_L2_TUNNEL_IPV4));
+	return flags;
+}
+
+/**
  * Poll one CQE from CQ.
  *
  * @param rxq
@@ -664,8 +765,21 @@ struct pv {
 				goto skip;
 			}
 			pkt = seg;
-			pkt->packet_type = 0;
-			pkt->ol_flags = 0;
+			if (rxq->csum | rxq->csum_l2tun) {
+				uint32_t flags =
+					mlx4_cqe_flags(cqe,
+						       rxq->csum,
+						       rxq->csum_l2tun);
+
+				pkt->ol_flags =
+					rxq_cq_to_ol_flags(flags,
+							   rxq->csum,
+							   rxq->csum_l2tun);
+				pkt->packet_type = rxq_cq_to_pkt_type(flags);
+			} else {
+				pkt->packet_type = 0;
+				pkt->ol_flags = 0;
+			}
 			pkt->pkt_len = len;
 		}
 		rep->nb_segs = 1;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index a742f61..6aad41a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -77,6 +77,8 @@ struct rxq {
 	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
 	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
 	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 7/7] net/mlx4: add loopback Tx from VF
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (5 preceding siblings ...)
  2017-10-05  9:33       ` [PATCH v4 6/7] net/mlx4: restore Rx offloads Ophir Munk
@ 2017-10-05  9:33       ` Ophir Munk
  2017-10-05 11:40       ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
                         ` (2 subsequent siblings)
  9 siblings, 0 replies; 61+ messages in thread
From: Ophir Munk @ 2017-10-05  9:33 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds the loopback functionality required when the chip is a VF,
enabling packet transmission between VFs and the PF.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 33 +++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |  1 +
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 3 files changed, 24 insertions(+), 12 deletions(-)
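
Loopback works by handing the destination MAC address back to the device
inside the WQE control segment: the first two bytes overlay part of
srcrb_flags and the remaining four go into the immediate field, as done
in mlx4_post_send() below. A minimal sketch with a stand-in control
segment layout:

#include <stdint.h>
#include <string.h>

/* Stand-in for the relevant part of the WQE control segment. */
struct toy_ctrl_seg {
	union {
		uint32_t flags;
		uint16_t flags16[2];
	} srcrb;
	uint32_t imm;
};

static void
toy_set_loopback_dmac(struct toy_ctrl_seg *ctrl, const uint8_t *frame)
{
	/* Destination MAC is the first six bytes of the Ethernet frame. */
	memcpy(&ctrl->srcrb.flags16[0], frame, sizeof(uint16_t));
	memcpy(&ctrl->imm, frame + sizeof(uint16_t), sizeof(uint32_t));
}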

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 87c5261..36173ad 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -311,10 +311,13 @@ struct pv {
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct rte_mbuf *buf;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
-	uint32_t srcrb_flags;
 	uint32_t owner_opcode = MLX4_OPCODE_SEND;
 	uint32_t byte_count;
 	int wqe_real_size;
@@ -414,22 +417,16 @@ struct pv {
 	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
-	 * The caller should prepare "imm" in advance in order to support
-	 * VF to VF communication (when the device is a virtual-function
-	 * device (VF)).
-	 */
-	ctrl->imm = 0;
-	/*
 	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
 	 * should be calculated.
 	 */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 				       MLX4_WQE_CTRL_CQ_UPDATE);
 	} else {
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
 	/* Enable HW checksum offload if requested */
 	if (txq->csum &&
@@ -443,14 +440,26 @@ struct pv {
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |=
+				srcrb.flags |=
 					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
 		} else {
-			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
 						MLX4_WQE_CTRL_TCP_UDP_CSUM);
 		}
 	}
-	ctrl->srcrb_flags = srcrb_flags;
+	if (txq->lb) {
+		/*
+		 * Copy destination MAC address to the WQE, this allows
+		 * loopback in eSwitch, so that VFs and PF can communicate
+		 * with each other.
+		 */
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						      sizeof(uint16_t)));
+	} else {
+		ctrl->imm = 0;
+	}
+	ctrl->srcrb_flags = srcrb.flags;
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 6aad41a..37f31f4 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -112,6 +112,7 @@ struct txq {
 	uint32_t max_inline; /**< Max inline send size. */
 	uint32_t csum:1; /**< Enable checksum offloading. */
 	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
+	uint32_t lb:1; /**< Whether packets should be looped back by eSwitch. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 96429bc..9d1be95 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -412,6 +412,8 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Enable Tx loopback for VF devices. */
+	tmpl.lb = !!(priv->vf);
 	/* Clean up txq in case we're reinitializing it. */
 	DEBUG("%p: cleaning-up old txq just in case", (void *)txq);
 	mlx4_txq_cleanup(txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (6 preceding siblings ...)
  2017-10-05  9:33       ` [PATCH v4 7/7] net/mlx4: add loopback Tx from VF Ophir Munk
@ 2017-10-05 11:40       ` Adrien Mazarguil
  2017-10-05 18:48       ` Ferruh Yigit
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
  9 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-05 11:40 UTC (permalink / raw)
  To: Ophir Munk; +Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad

On Thu, Oct 05, 2017 at 09:33:05AM +0000, Ophir Munk wrote:
> v4 (Ophir):
> - Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4: 
>   restore full Rx support bypassing Verbs" commit
> 
> v3 (Adrien):
> - Drop a few unrelated or unnecessary changes such as the removal of
>   MLX4_PMD_TX_MP_CACHE.
> - Move device checksum support detection code to its previous location.
> - Fix include guard in mlx4_prm.h.
> - Reorder #includes alphabetically.
> - Replace MLX4_TRANSPOSE() macro with documented inline function.
> - Remove extra spaces and blank lines.
> - Use uint8_t * instead of char * for buffers.
> - Replace mlx4_get_cqe() macro with a documented inline function.
> - Replace several unsigned int with uint32_t.
> - Add consistency to field names (sge_n => sges_n).
> - Make mbuf size checks in RX queue setup function similar to mlx5.
> - Update various comments.
> - Fix indentation.
> - Replace run-time endian conversion with static ones where possible.
> - Reorder fields in struct rxq and struct txq for consistency, remove
>   one level of unnecessary inner structures.
> - Fix memory leak on Tx bounce buffer.
> - Update commit logs.
> - Fix remaining checkpatch warnings.
> 
> v2 (Matan):
> Rearrange patches.
> Semantics.
> Enhancements.
> Fix compilation issues.
> 
> Moti Haimovsky (6):
>   net/mlx4: add simple Tx bypassing Verbs
>   net/mlx4: restore full Rx support bypassing Verbs
>   net/mlx4: restore Tx gather support
>   net/mlx4: restore Tx checksum offloads
>   net/mlx4: restore Rx offloads
>   net/mlx4: add loopback Tx from VF
> 
> Ophir Munk (1):
>   net/mlx4: restore Rx scatter support

Thanks Ophir for merging both v3's.

Ferruh, v4 supersedes all prior revisions (Moti's v1, Matan's v2, my own v3
and Ophir's v3-bis, I can't update patchwork for all of them).

For the entire series:

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (7 preceding siblings ...)
  2017-10-05 11:40       ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
@ 2017-10-05 18:48       ` Ferruh Yigit
  2017-10-05 18:54         ` Ferruh Yigit
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
  9 siblings, 1 reply; 61+ messages in thread
From: Ferruh Yigit @ 2017-10-05 18:48 UTC (permalink / raw)
  To: Ophir Munk, Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad

On 10/5/2017 10:33 AM, Ophir Munk wrote:
> v4 (Ophir):
> - Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4: 
>   restore full Rx support bypassing Verbs" commit
> 
> v3 (Adrien):
> - Drop a few unrelated or unnecessary changes such as the removal of
>   MLX4_PMD_TX_MP_CACHE.
> - Move device checksum support detection code to its previous location.
> - Fix include guard in mlx4_prm.h.
> - Reorder #includes alphabetically.
> - Replace MLX4_TRANSPOSE() macro with documented inline function.
> - Remove extra spaces and blank lines.
> - Use uint8_t * instead of char * for buffers.
> - Replace mlx4_get_cqe() macro with a documented inline function.
> - Replace several unsigned int with uint32_t.
> - Add consistency to field names (sge_n => sges_n).
> - Make mbuf size checks in RX queue setup function similar to mlx5.
> - Update various comments.
> - Fix indentation.
> - Replace run-time endian conversion with static ones where possible.
> - Reorder fields in struct rxq and struct txq for consistency, remove
>   one level of unnecessary inner structures.
> - Fix memory leak on Tx bounce buffer.
> - Update commit logs.
> - Fix remaining checkpatch warnings.
> 
> v2 (Matan):
> Rearrange patches.
> Semantics.
> Enhancements.
> Fix compilation issues.
> 
> Moti Haimovsky (6):
>   net/mlx4: add simple Tx bypassing Verbs
>   net/mlx4: restore full Rx support bypassing Verbs
>   net/mlx4: restore Tx gather support
>   net/mlx4: restore Tx checksum offloads
>   net/mlx4: restore Rx offloads
>   net/mlx4: add loopback Tx from VF
> 
> Ophir Munk (1):
>   net/mlx4: restore Rx scatter support

Hi Ophir,

I am a little confused, can you please help me?

Currently both mlx4 and mlx5 should support both rdma-core and MLX-OFED,
is this correct?

When I try to compile these patches with rdma-core, it gives warnings when
building the shared library [1].

If I try to compile with MLX-OFED, I get a missing header error [2].

What is the dependency for the mlx4 driver now?


[1]
mlx4_rxq.o: In function `mlx4_rxq_setup':
.../dpdk/drivers/net/mlx4/mlx4_rxq.c:393: undefined reference to
`mlx4dv_init_obj'
mlx4_txq.o: In function `mlx4_txq_setup':
.../dpdk/drivers/net/mlx4/mlx4_txq.c:429: undefined reference to
`mlx4dv_init_obj'


[2]
In file included from .../drivers/net/mlx4/mlx4_flow.c:66:
.../drivers/net/mlx4/mlx4_rxtx.h:43:10: fatal error:
'infiniband/mlx4dv.h' file not found
#include <infiniband/mlx4dv.h>
         ^~~~~~~~~~~~~~~~~~~~~

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs
  2017-10-05 18:48       ` Ferruh Yigit
@ 2017-10-05 18:54         ` Ferruh Yigit
  0 siblings, 0 replies; 61+ messages in thread
From: Ferruh Yigit @ 2017-10-05 18:54 UTC (permalink / raw)
  To: Ophir Munk, Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Olga Shern, Matan Azrad

On 10/5/2017 7:48 PM, Ferruh Yigit wrote:
> On 10/5/2017 10:33 AM, Ophir Munk wrote:
>> v4 (Ophir):
>> - Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4: 
>>   restore full Rx support bypassing Verbs" commit
>>
>> v3 (Adrien):
>> - Drop a few unrelated or unnecessary changes such as the removal of
>>   MLX4_PMD_TX_MP_CACHE.
>> - Move device checksum support detection code to its previous location.
>> - Fix include guard in mlx4_prm.h.
>> - Reorder #includes alphabetically.
>> - Replace MLX4_TRANSPOSE() macro with documented inline function.
>> - Remove extra spaces and blank lines.
>> - Use uint8_t * instead of char * for buffers.
>> - Replace mlx4_get_cqe() macro with a documented inline function.
>> - Replace several unsigned int with uint32_t.
>> - Add consistency to field names (sge_n => sges_n).
>> - Make mbuf size checks in RX queue setup function similar to mlx5.
>> - Update various comments.
>> - Fix indentation.
>> - Replace run-time endian conversion with static ones where possible.
>> - Reorder fields in struct rxq and struct txq for consistency, remove
>>   one level of unnecessary inner structures.
>> - Fix memory leak on Tx bounce buffer.
>> - Update commit logs.
>> - Fix remaining checkpatch warnings.
>>
>> v2 (Matan):
>> Rearrange patches.
>> Semantics.
>> Enhancements.
>> Fix compilation issues.
>>
>> Moti Haimovsky (6):
>>   net/mlx4: add simple Tx bypassing Verbs
>>   net/mlx4: restore full Rx support bypassing Verbs
>>   net/mlx4: restore Tx gather support
>>   net/mlx4: restore Tx checksum offloads
>>   net/mlx4: restore Rx offloads
>>   net/mlx4: add loopback Tx from VF
>>
>> Ophir Munk (1):
>>   net/mlx4: restore Rx scatter support
> 
> Hi Ophir,
> 
> I am a little confused, can you please help me?
> 
> Currently both mlx4 and mlx5 should support both rdma-core and MLX-OFED,
> is this correct?
> 
> When I try to compile these patches with rdma-core, it gives warnings when
> building the shared library [1].

Ah, this is because of a missing library in the Makefile. Can you please send a
new version to fix this?

   diff --git a/drivers/net/mlx4/Makefile b/drivers/net/mlx4/Makefile
   index 0515cd7ef..3b3a02047 100644
   --- a/drivers/net/mlx4/Makefile
   +++ b/drivers/net/mlx4/Makefile
   @@ -54,7 +54,7 @@ CFLAGS += -D_BSD_SOURCE
    CFLAGS += -D_DEFAULT_SOURCE
    CFLAGS += -D_XOPEN_SOURCE=600
    CFLAGS += $(WERROR_FLAGS)
   -LDLIBS += -libverbs
   +LDLIBS += -libverbs -lmlx4

    # A few warnings cannot be avoided in external headers.
    CFLAGS += -Wno-error=cast-qual

> 
> If I try to compile with MLX-OFED, I get a missing header error [2].

Can you please clarify this?

> 
> What is the dependency for the mlx4 driver now?
> 
> 
> [1]
> mlx4_rxq.o: In function `mlx4_rxq_setup':
> .../dpdk/drivers/net/mlx4/mlx4_rxq.c:393: undefined reference to
> `mlx4dv_init_obj'
> mlx4_txq.o: In function `mlx4_txq_setup':
> .../dpdk/drivers/net/mlx4/mlx4_txq.c:429: undefined reference to
> `mlx4dv_init_obj'
> 
> 
> [2]
> In file included from .../drivers/net/mlx4/mlx4_flow.c:66:
> .../drivers/net/mlx4/mlx4_rxtx.h:43:10: fatal error:
> 'infiniband/mlx4dv.h' file not found
> #include <infiniband/mlx4dv.h>
>          ^~~~~~~~~~~~~~~~~~~~~
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v5 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
                         ` (8 preceding siblings ...)
  2017-10-05 18:48       ` Ferruh Yigit
@ 2017-10-11 18:31       ` Adrien Mazarguil
  2017-10-11 18:31         ` [PATCH v5 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
                           ` (5 more replies)
  9 siblings, 6 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-11 18:31 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

Hopefully the last iteration for this series.

v5 (Ophir & Adrien):
- Merged Rx scatter/Tx gather code back into individual Rx/Tx commits
  for consistency due to a couple of issues with gather-less Tx.
- Rebased on top of the latest mlx4 control path changes (RSS support).

v4 (Ophir):
- Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4:
  restore full Rx support bypassing Verbs" commit

v3 (Adrien):
- Drop a few unrelated or unnecessary changes such as the removal of
  MLX4_PMD_TX_MP_CACHE.
- Move device checksum support detection code to its previous location.
- Fix include guard in mlx4_prm.h.
- Reorder #includes alphabetically.
- Replace MLX4_TRANSPOSE() macro with documented inline function.
- Remove extra spaces and blank lines.
- Use uint8_t * instead of char * for buffers.
- Replace mlx4_get_cqe() macro with a documented inline function.
- Replace several unsigned int with uint32_t.
- Add consistency to field names (sge_n => sges_n).
- Make mbuf size checks in RX queue setup function similar to mlx5.
- Update various comments.
- Fix indentation.
- Replace run-time endian conversion with static ones where possible.
- Reorder fields in struct rxq and struct txq for consistency, remove
  one level of unnecessary inner structures.
- Fix memory leak on Tx bounce buffer.
- Update commit logs.
- Fix remaining checkpatch warnings.

v2 (Matan):
Rearrange patches.
Semantics.
Enhancements.
Fix compilation issues.

Moti Haimovsky (5):
  net/mlx4: add Tx bypassing Verbs
  net/mlx4: add Rx bypassing Verbs
  net/mlx4: restore Tx checksum offloads
  net/mlx4: restore Rx offloads
  net/mlx4: add loopback Tx from VF

 drivers/net/mlx4/mlx4.c        |  11 +
 drivers/net/mlx4/mlx4.h        |   2 +
 drivers/net/mlx4/mlx4_ethdev.c |  10 +
 drivers/net/mlx4/mlx4_prm.h    | 151 +++++++
 drivers/net/mlx4/mlx4_rxq.c    | 156 +++++---
 drivers/net/mlx4/mlx4_rxtx.c   | 768 ++++++++++++++++++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.h   |  54 +--
 drivers/net/mlx4/mlx4_txq.c    |  63 +++
 8 files changed, 942 insertions(+), 273 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
2.1.4

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v5 1/5] net/mlx4: add Tx bypassing Verbs
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
@ 2017-10-11 18:31         ` Adrien Mazarguil
  2017-10-11 18:31         ` [PATCH v5 2/5] net/mlx4: add Rx " Adrien Mazarguil
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-11 18:31 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

Modify PMD to send single-buffer packets directly to the device
bypassing the Verbs Tx post and poll routines.

Tx gather support: add support for transmitting packets spanning
multiple buffers.

Take into consideration the number of entries a packet occupies
in the TxQ when setting the report-completion flag of the chip.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 120 ++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 398 ++++++++++++++++++++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.h |  30 +--
 drivers/net/mlx4/mlx4_txq.c  |  59 ++++++
 4 files changed, 490 insertions(+), 117 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..085a595
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,120 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef MLX4_PRM_H_
+#define MLX4_PRM_H_
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_byteorder.h>
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/mlx4dv.h>
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* ConnectX-3 Tx queue basic block. */
+#define MLX4_TXBB_SHIFT 6
+#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes. */
+#define MLX4_MAX_WQE_SIZE 512
+#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+
+/* Send queue stamping/invalidating information. */
+#define MLX4_SQ_STAMP_STRIDE 64
+#define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
+#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_STAMP_VAL 0x7fffffff
+
+/* Work queue element (WQE) flags. */
+#define MLX4_BIT_WQE_OWN 0x80000000
+
+#define MLX4_SIZE_TO_TXBBS(size) \
+	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
+
+/* Send queue information. */
+struct mlx4_sq {
+	uint8_t *buf; /**< SQ buffer. */
+	uint8_t *eob; /**< End of SQ buffer */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt; /**< Number of TXBBs in the queue (power of 2). */
+	uint32_t txbb_cnt_mask; /**< txbb_cnt mask (txbb_cnt is a power of 2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db; /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
+};
+
+#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
+
+/* Completion queue information. */
+struct mlx4_cq {
+	uint8_t *buf; /**< Pointer to the completion queue buffer. */
+	uint32_t cqe_cnt; /**< Number of entries in the queue. */
+	uint32_t cqe_64:1; /**< CQ entry size is 64 bytes. */
+	uint32_t cons_index; /**< Last queue entry that was handled. */
+	uint32_t *set_ci_db; /**< Pointer to the completion queue doorbell. */
+};
+
+/**
+ * Retrieve a CQE entry from a CQ.
+ *
+ * cqe = cq->buf + cons_index * cqe_size + cqe_offset
+ *
+ * Where cqe_size is 32 or 64 bytes and cqe_offset is 0 or 32 (depending on
+ * cqe_size).
+ *
+ * @param cq
+ *   CQ to retrieve entry from.
+ * @param index
+ *   Entry index.
+ *
+ * @return
+ *   Pointer to CQE entry.
+ */
+static inline struct mlx4_cqe *
+mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
+{
+	return (struct mlx4_cqe *)(cq->buf +
+				   ((index & (cq->cqe_cnt - 1)) <<
+				    (5 + cq->cqe_64)) +
+				   (cq->cqe_64 << 5));
+}
+
+#endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 859f1bd..38b87a0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -52,15 +52,81 @@
 
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
+#include <rte_io.h>
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
 /**
+ * Pointer-value pair structure used in mlx4_post_send() for saving the first
+ * 32-bit DWORD of a TXBB.
+ */
+struct pv {
+	struct mlx4_wqe_data_seg *dseg;
+	uint32_t val;
+};
+
+/**
+ * Stamp a WQE so it won't be reused by the HW.
+ *
+ * Routine is used when freeing a WQE that was used by the chip or when a WQ
+ * entry could not be fully built, leaving partial information on the queue.
+ *
+ * @param sq
+ *   Pointer to the SQ structure.
+ * @param index
+ *   Index of the freed WQE.
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of Tx basic blocks (TXBBs) the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+					  (!!owner << MLX4_SQ_STAMP_SHIFT));
+	uint8_t *wqe = mlx4_get_send_wqe(sq, (index & sq->txbb_cnt_mask));
+	uint32_t *ptr = (uint32_t *)wqe;
+	int i;
+	int txbbs_size;
+	int num_txbbs;
+
+	/* Extract the size from the control segment of the WQE. */
+	num_txbbs = MLX4_SIZE_TO_TXBBS((((struct mlx4_wqe_ctrl_seg *)
+					 wqe)->fence_size & 0x3f) << 4);
+	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	/* Optimize the common case when there is no wrap-around. */
+	if (wqe + txbbs_size <= sq->eob) {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+			if ((uint8_t *)ptr >= sq->eob) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= RTE_BE32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
  * Manage Tx completions.
  *
  * When sending a burst, mlx4_tx_burst() posts several WRs.
@@ -80,26 +146,71 @@ mlx4_txq_complete(struct txq *txq)
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
-	int wcs_n;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
-	if (unlikely(wcs_n == 0))
+	/*
+	 * Traverse all reported CQ entries and handle each WQ entry they
+	 * reference.
+	 */
+	do {
+		cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cons_index);
+		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		    !!(cons_index & cq->cqe_cnt)))
+			break;
+		/*
+		 * Make sure we read the CQE after we read the ownership bit.
+		 */
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      (void *)txq, cqe_err->vendor_err,
+			      cqe_err->syndrome);
+		}
+		/* Get WQE index reported in the CQE. */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* Free next descriptor. */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		cons_index++;
+	} while (1);
+	if (unlikely(pkts == 0))
 		return 0;
-	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
-		      (void *)txq, wcs_n);
-		return -1;
-	}
-	elts_comp -= wcs_n;
+	/*
+	 * Update CQ.
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	/* Update the list of packets posted for transmission. */
+	elts_comp -= pkts;
 	assert(elts_comp <= txq->elts_comp);
 	/*
-	 * Assume WC status is successful as nothing can be done about it
-	 * anyway.
+	 * Assume completion status is successful as nothing can be done about
+	 * it anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -183,6 +294,161 @@ mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
 }
 
 /**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   Target Tx queue.
+ * @param pkt
+ *   Packet to transmit.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+static inline int
+mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	struct rte_mbuf *buf;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t lkey;
+	uintptr_t addr;
+	uint32_t srcrb_flags;
+	uint32_t owner_opcode = MLX4_OPCODE_SEND;
+	uint32_t byte_count;
+	int wqe_real_size;
+	int nr_txbbs;
+	int rc;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
+	int pv_counter = 0;
+
+	/* Calculate the needed work queue entry size for this packet. */
+	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+	/*
+	 * Check that there is room for this WQE in the send queue and that
+	 * the WQE size is legal.
+	 */
+	if (((sq->head - sq->tail) + nr_txbbs +
+	     sq->headroom_txbbs) >= sq->txbb_cnt ||
+	    nr_txbbs > MLX4_MAX_WQE_TXBBS) {
+		rc = ENOSPC;
+		goto err;
+	}
+	/* Get the control and data entries of the WQE. */
+	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
+					    sizeof(struct mlx4_wqe_ctrl_seg));
+	/* Fill the data segments with buffer information. */
+	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
+		addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		rte_prefetch0((volatile void *)addr);
+		/* Handle WQE wraparound. */
+		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
+			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+		dseg->addr = rte_cpu_to_be_64(addr);
+		/* Memory region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			/*
+			 * Restamp the entry in case of failure.
+			 * Make sure that the size is written correctly.
+			 * Note that ownership is given to the SW, not the HW.
+			 */
+			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+			mlx4_txq_stamp_freed_wqe(sq, head_idx,
+				     (sq->head & sq->txbb_cnt) ? 0 : 1);
+			rc = EFAULT;
+			goto err;
+		}
+		dseg->lkey = rte_cpu_to_be_32(lkey);
+		if (likely(buf->data_len)) {
+			byte_count = rte_cpu_to_be_32(buf->data_len);
+		} else {
+			/*
+			 * Zero length segment is treated as inline segment
+			 * with zero data.
+			 */
+			byte_count = RTE_BE32(0x80000000);
+		}
+		/*
+		 * If the data segment is not at the beginning of a
+		 * Tx basic block (TXBB) then write the byte count,
+		 * else postpone the writing to just before updating the
+		 * control segment.
+		 */
+		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			/*
+			 * Need a barrier here before writing the byte_count
+			 * fields to make sure that all the data is visible
+			 * before the byte_count field is set.
+			 * Otherwise, if the segment begins a new cacheline,
+			 * the HCA prefetcher could grab the 64-byte chunk and
+			 * get a valid (!= 0xffffffff) byte count but stale
+			 * data, and end up sending the wrong data.
+			 */
+			rte_io_wmb();
+			dseg->byte_count = byte_count;
+		} else {
+			/*
+			 * This data segment starts at the beginning of a new
+			 * TXBB, so we need to postpone its byte_count writing
+			 * for later.
+			 */
+			pv[pv_counter].dseg = dseg;
+			pv[pv_counter++].val = byte_count;
+		}
+	}
+	/* Write the first DWORD of each TXBB saved earlier. */
+	if (pv_counter) {
+		/* Need a barrier here before writing the byte_count. */
+		rte_io_wmb();
+		for (--pv_counter; pv_counter  >= 0; pv_counter--)
+			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
+	}
+	/* Fill the control parameters for this packet. */
+	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+	/*
+	 * The caller should prepare "imm" in advance in order to support
+	 * VF to VF communication (when the device is a virtual-function
+	 * device (VF)).
+	 */
+	ctrl->imm = 0;
+	/*
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
+	 * should be calculated.
+	 */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+				       MLX4_WQE_CTRL_CQ_UPDATE);
+	} else {
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+	}
+	ctrl->srcrb_flags = srcrb_flags;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
+	 */
+	rte_wmb();
+	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					      ((sq->head & sq->txbb_cnt) ?
+					       MLX4_BIT_WQE_OWN : 0));
+	sq->head += nr_txbbs;
+	return 0;
+err:
+	rte_errno = rc;
+	return -rc;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,18 +465,15 @@ uint16_t
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
-	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
+	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -229,10 +492,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		struct ibv_send_wr *wr = &elt->wr;
-		unsigned int segs = buf->nb_segs;
-		unsigned int sent_size = 0;
-		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -250,100 +509,31 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
-			send_flags |= IBV_SEND_SIGNALED;
-		}
-		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
-				elt->buf = NULL;
-				goto stop;
-			}
-			/* Update element. */
-			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
-		} else {
-			err = -1;
+		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+		/* Post the packet for sending. */
+		err = mlx4_post_send(txq, buf);
+		if (unlikely(err)) {
+			elt->buf = NULL;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
+		elt->buf = buf;
+		bytes_sent += buf->pkt_len;
+		++elts_comp;
 		elts_head = elts_head_next;
-		/* Increment sent bytes counter. */
-		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
-	/* Increment sent packets counter. */
+	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
+	txq->stats.obytes += bytes_sent;
+	/* Make sure that descriptors are written before doorbell record. */
+	rte_wmb();
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index eca966f..ff27126 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -41,6 +41,7 @@
 #ifdef PEDANTIC
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
+#include <infiniband/mlx4dv.h>
 #include <infiniband/verbs.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
@@ -51,6 +52,7 @@
 #include <rte_mempool.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 
 /** Rx queue counters. */
 struct mlx4_rxq_stats {
@@ -101,8 +103,6 @@ struct mlx4_rss {
 
 /** Tx element. */
 struct txq_elt {
-	struct ibv_send_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
@@ -116,24 +116,28 @@ struct mlx4_txq_stats {
 
 /** Tx queue descriptor. */
 struct txq {
-	struct priv *priv; /**< Back pointer to private data. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
+	unsigned int elts_head; /**< Current index in (*elts)[]. */
+	unsigned int elts_tail; /**< First element awaiting completion. */
+	unsigned int elts_comp; /**< Number of packets awaiting completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
+	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
+	unsigned int elts_n; /**< (*elts)[] length. */
+	struct txq_elt (*elts)[]; /**< Tx elements. */
+	struct mlx4_txq_stats stats; /**< Tx queue counters. */
+	uint32_t max_inline; /**< Max inline send size. */
+	uint8_t *bounce_buf;
+	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
 		const struct rte_mempool *mp; /**< Cached memory pool. */
 		struct ibv_mr *mr; /**< Memory region (for mp). */
 		uint32_t lkey; /**< mr->lkey copy. */
 	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
+	struct priv *priv; /**< Back pointer to private data. */
+	unsigned int socket; /**< CPU socket ID for allocations. */
 	struct ibv_cq *cq; /**< Completion queue. */
 	struct ibv_qp *qp; /**< Queue pair. */
-	uint32_t max_inline; /**< Max inline send size. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	struct txq_elt (*elts)[]; /**< Tx elements. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
-	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
-	struct mlx4_txq_stats stats; /**< Tx queue counters. */
-	unsigned int socket; /**< CPU socket ID for allocations. */
 	uint8_t data[]; /**< Remaining queue resources. */
 };
 
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7042cd9..4258513 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -60,6 +60,7 @@
 
 #include "mlx4.h"
 #include "mlx4_autoconf.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
@@ -148,6 +149,41 @@ mlx4_txq_mp2mr_iter(struct rte_mempool *mp, void *arg)
 }
 
 /**
+ * Retrieves information needed in order to directly access the Tx queue.
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param mlxdv
+ *   Pointer to device information for this Tx queue.
+ */
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	/* Total length, including headroom and spare WQEs. */
+	sq->eob = sq->buf + sq_size;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_cnt =
+		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs =
+		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
+	cq->buf = dcq->buf.buf;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_64 = (dcq->cqe_size & 64) ? 1 : 0;
+}
+
+/**
  * DPDK callback to configure a Tx queue.
  *
  * @param dev
@@ -169,9 +205,13 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct txq_elt (*elts)[desc];
 	struct ibv_qp_init_attr qp_init_attr;
 	struct txq *txq;
+	uint8_t *bounce_buf;
 	struct rte_malloc_vec vec[] = {
 		{
 			.align = RTE_CACHE_LINE_SIZE,
@@ -183,6 +223,11 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			.size = sizeof(*elts),
 			.addr = (void **)&elts,
 		},
+		{
+			.align = RTE_CACHE_LINE_SIZE,
+			.size = MLX4_MAX_WQE_SIZE,
+			.addr = (void **)&bounce_buf,
+		},
 	};
 	int ret;
 
@@ -231,6 +276,7 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
 		.elts_comp_cd_init =
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
+		.bounce_buf = bounce_buf,
 	};
 	txq->cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
 	if (!txq->cq) {
@@ -297,6 +343,19 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = txq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		rte_errno = EINVAL;
+		ERROR("%p: failed to obtain information needed for"
+		      " accessing the device queues", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
 	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v5 2/5] net/mlx4: add Rx bypassing Verbs
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
  2017-10-11 18:31         ` [PATCH v5 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
@ 2017-10-11 18:31         ` Adrien Mazarguil
  2017-10-11 18:32         ` [PATCH v5 3/5] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-11 18:31 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for accessing the hardware directly when
handling Rx packets, eliminating the need to use Verbs in the Rx data
path.

Rx scatter support: calculate the number of scatters on the fly
according to the maximum expected packet size.

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxq.c  | 151 +++++++++++++++++--------
 drivers/net/mlx4/mlx4_rxtx.c | 226 +++++++++++++++++++++-----------------
 drivers/net/mlx4/mlx4_rxtx.h |  19 ++--
 3 files changed, 241 insertions(+), 155 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index e7bde2e..fb6c080 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -51,6 +51,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
 
+#include <rte_byteorder.h>
 #include <rte_common.h>
 #include <rte_errno.h>
 #include <rte_ethdev.h>
@@ -312,45 +313,46 @@ void mlx4_rss_detach(struct mlx4_rss *rss)
 static int
 mlx4_rxq_alloc_elts(struct rxq *rxq)
 {
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
+	const uint32_t elts_n = 1 << rxq->elts_n;
+	const uint32_t sges_n = 1 << rxq->sges_n;
+	struct rte_mbuf *(*elts)[elts_n] = rxq->elts;
 	unsigned int i;
 
-	/* For each WR (packet). */
+	assert(rte_is_power_of_2(elts_n));
 	for (i = 0; i != RTE_DIM(*elts); ++i) {
-		struct rxq_elt *elt = &(*elts)[i];
-		struct ibv_recv_wr *wr = &elt->wr;
-		struct ibv_sge *sge = &(*elts)[i].sge;
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[i];
 		struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);
 
 		if (buf == NULL) {
 			while (i--) {
-				rte_pktmbuf_free_seg((*elts)[i].buf);
-				(*elts)[i].buf = NULL;
+				rte_pktmbuf_free_seg((*elts)[i]);
+				(*elts)[i] = NULL;
 			}
 			rte_errno = ENOMEM;
 			return -rte_errno;
 		}
-		elt->buf = buf;
-		wr->next = &(*elts)[(i + 1)].wr;
-		wr->sg_list = sge;
-		wr->num_sge = 1;
 		/* Headroom is reserved by rte_pktmbuf_alloc(). */
 		assert(buf->data_off == RTE_PKTMBUF_HEADROOM);
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
-		/* sge->addr must be able to store a pointer. */
-		assert(sizeof(sge->addr) >= sizeof(uintptr_t));
-		/* SGE keeps its headroom. */
-		sge->addr = (uintptr_t)
-			((uint8_t *)buf->buf_addr + RTE_PKTMBUF_HEADROOM);
-		sge->length = (buf->buf_len - RTE_PKTMBUF_HEADROOM);
-		sge->lkey = rxq->mr->lkey;
-		/* Redundant check for tailroom. */
-		assert(sge->length == rte_pktmbuf_tailroom(buf));
+		/* Only the first segment keeps headroom. */
+		if (i % sges_n)
+			buf->data_off = 0;
+		buf->port = rxq->port_id;
+		buf->data_len = rte_pktmbuf_tailroom(buf);
+		buf->pkt_len = rte_pktmbuf_tailroom(buf);
+		buf->nb_segs = 1;
+		*scat = (struct mlx4_wqe_data_seg){
+			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
+								  uintptr_t)),
+			.byte_count = rte_cpu_to_be_32(buf->data_len),
+			.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
+		};
+		(*elts)[i] = buf;
 	}
-	/* The last WR pointer must be NULL. */
-	(*elts)[(i - 1)].wr.next = NULL;
+	DEBUG("%p: allocated and configured %u segments (max %u packets)",
+	      (void *)rxq, elts_n, elts_n / sges_n);
 	return 0;
 }
 
@@ -364,14 +366,14 @@ static void
 mlx4_rxq_free_elts(struct rxq *rxq)
 {
 	unsigned int i;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
+	struct rte_mbuf *(*elts)[1 << rxq->elts_n] = rxq->elts;
 
-	DEBUG("%p: freeing WRs", (void *)rxq);
+	DEBUG("%p: freeing Rx queue elements", (void *)rxq);
 	for (i = 0; (i != RTE_DIM(*elts)); ++i) {
-		if (!(*elts)[i].buf)
+		if (!(*elts)[i])
 			continue;
-		rte_pktmbuf_free_seg((*elts)[i].buf);
-		(*elts)[i].buf = NULL;
+		rte_pktmbuf_free_seg((*elts)[i]);
+		(*elts)[i] = NULL;
 	}
 }
 
@@ -400,8 +402,11 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    struct rte_mempool *mp)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_rwq dv_rwq;
+	struct mlx4dv_cq dv_cq;
 	uint32_t mb_len = rte_pktmbuf_data_room_size(mp);
-	struct rxq_elt (*elts)[desc];
+	struct rte_mbuf *(*elts)[rte_align32pow2(desc)];
 	struct rte_flow_error error;
 	struct rxq *rxq;
 	struct rte_malloc_vec vec[] = {
@@ -439,6 +444,12 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		ERROR("%p: invalid number of Rx descriptors", (void *)dev);
 		return -rte_errno;
 	}
+	if (desc != RTE_DIM(*elts)) {
+		desc = RTE_DIM(*elts);
+		WARN("%p: increased number of descriptors in Rx queue %u"
+		     " to the next power of two (%u)",
+		     (void *)dev, idx, desc);
+	}
 	/* Allocate and initialize Rx queue. */
 	rte_zmallocv_socket("RXQ", vec, RTE_DIM(vec), socket);
 	if (!rxq) {
@@ -450,8 +461,8 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		.priv = priv,
 		.mp = mp,
 		.port_id = dev->data->port_id,
-		.elts_n = desc,
-		.elts_head = 0,
+		.sges_n = 0,
+		.elts_n = rte_log2_u32(desc),
 		.elts = elts,
 		.stats.idx = idx,
 		.socket = socket,
@@ -462,9 +473,29 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
 		;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
-		WARN("%p: scattered mode has been requested but is"
-		     " not supported, this may lead to packet loss",
-		     (void *)dev);
+		uint32_t size =
+			RTE_PKTMBUF_HEADROOM +
+			dev->data->dev_conf.rxmode.max_rx_pkt_len;
+		uint32_t sges_n;
+
+		/*
+		 * Determine the number of SGEs needed for a full packet
+		 * and round it to the next power of two.
+		 */
+		sges_n = rte_log2_u32((size / mb_len) + !!(size % mb_len));
+		rxq->sges_n = sges_n;
+		/* Make sure sges_n did not overflow. */
+		size = mb_len * (1 << rxq->sges_n);
+		size -= RTE_PKTMBUF_HEADROOM;
+		if (size < dev->data->dev_conf.rxmode.max_rx_pkt_len) {
+			rte_errno = EOVERFLOW;
+			ERROR("%p: too many SGEs (%u) needed to handle"
+			      " requested maximum packet size %u",
+			      (void *)dev,
+			      1 << sges_n,
+			      dev->data->dev_conf.rxmode.max_rx_pkt_len);
+			goto error;
+		}
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -473,6 +504,17 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		     dev->data->dev_conf.rxmode.max_rx_pkt_len,
 		     mb_len - RTE_PKTMBUF_HEADROOM);
 	}
+	DEBUG("%p: maximum number of segments per packet: %u",
+	      (void *)dev, 1 << rxq->sges_n);
+	if (desc % (1 << rxq->sges_n)) {
+		rte_errno = EINVAL;
+		ERROR("%p: number of Rx queue descriptors (%u) is not a"
+		      " multiple of maximum segments per packet (%u)",
+		      (void *)dev,
+		      desc,
+		      1 << rxq->sges_n);
+		goto error;
+	}
 	/* Use the entire Rx mempool as the memory region. */
 	rxq->mr = mlx4_mp2mr(priv->pd, mp);
 	if (!rxq->mr) {
@@ -497,7 +539,8 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			goto error;
 		}
 	}
-	rxq->cq = ibv_create_cq(priv->ctx, desc, NULL, rxq->channel, 0);
+	rxq->cq = ibv_create_cq(priv->ctx, desc >> rxq->sges_n, NULL,
+				rxq->channel, 0);
 	if (!rxq->cq) {
 		rte_errno = ENOMEM;
 		ERROR("%p: CQ creation failure: %s",
@@ -508,8 +551,8 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		(priv->ctx,
 		 &(struct ibv_wq_init_attr){
 			.wq_type = IBV_WQT_RQ,
-			.max_wr = RTE_MIN(priv->device_attr.max_qp_wr, desc),
-			.max_sge = 1,
+			.max_wr = desc >> rxq->sges_n,
+			.max_sge = 1 << rxq->sges_n,
 			.pd = priv->pd,
 			.cq = rxq->cq,
 		 });
@@ -531,27 +574,43 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
-	ret = mlx4_rxq_alloc_elts(rxq);
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = rxq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.rwq.in = rxq->wq;
+	mlxdv.rwq.out = &dv_rwq;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_RWQ | MLX4DV_OBJ_CQ);
 	if (ret) {
-		ERROR("%p: RXQ allocation failed: %s",
-		      (void *)dev, strerror(rte_errno));
+		rte_errno = EINVAL;
+		ERROR("%p: failed to obtain device information", (void *)dev);
 		goto error;
 	}
-	ret = ibv_post_wq_recv(rxq->wq, &(*rxq->elts)[0].wr,
-			       &(struct ibv_recv_wr *){ NULL });
+	rxq->wqes =
+		(volatile struct mlx4_wqe_data_seg (*)[])
+		((uintptr_t)dv_rwq.buf.buf + dv_rwq.rq.offset);
+	rxq->rq_db = dv_rwq.rdb;
+	rxq->rq_ci = 0;
+	rxq->mcq.buf = dv_cq.buf.buf;
+	rxq->mcq.cqe_cnt = dv_cq.cqe_cnt;
+	rxq->mcq.set_ci_db = dv_cq.set_ci_db;
+	rxq->mcq.cqe_64 = (dv_cq.cqe_size & 64) ? 1 : 0;
+	ret = mlx4_rxq_alloc_elts(rxq);
 	if (ret) {
-		rte_errno = ret;
-		ERROR("%p: ibv_post_recv() failed: %s",
-		      (void *)dev,
-		      strerror(rte_errno));
+		ERROR("%p: RXQ allocation failed: %s",
+		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
 	DEBUG("%p: adding Rx queue %p to list", (void *)dev, (void *)rxq);
 	dev->data->rx_queues[idx] = rxq;
 	/* Enable associated flows. */
 	ret = mlx4_flow_sync(priv, &error);
-	if (!ret)
+	if (!ret) {
+		/* Update doorbell counter. */
+		rxq->rq_ci = desc >> rxq->sges_n;
+		rte_wmb();
+		*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 		return 0;
+	}
 	ERROR("cannot re-attach flow rules to queue %u"
 	      " (code %d, \"%s\"), flow error type %d, cause %p, message: %s",
 	      idx, -ret, strerror(-ret), error.type, error.cause,
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 38b87a0..cc0baaa 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -538,9 +538,44 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
- * DPDK callback for Rx.
+ * Poll one CQE from CQ.
  *
- * The following function doesn't manage scattered packets.
+ * @param rxq
+ *   Pointer to the receive queue structure.
+ * @param[out] out
+ *   Just polled CQE.
+ *
+ * @return
+ *   Packet size in bytes reported by the CQE, 0 if there is no completion.
+ */
+static unsigned int
+mlx4_cq_poll_one(struct rxq *rxq, struct mlx4_cqe **out)
+{
+	int ret = 0;
+	struct mlx4_cqe *cqe = NULL;
+	struct mlx4_cq *cq = &rxq->mcq;
+
+	cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cq->cons_index);
+	if (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+	    !!(cq->cons_index & cq->cqe_cnt))
+		goto out;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rte_rmb();
+	assert(!(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK));
+	assert((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) !=
+	       MLX4_CQE_OPCODE_ERROR);
+	ret = rte_be_to_cpu_32(cqe->byte_cnt);
+	++cq->cons_index;
+out:
+	*out = cqe;
+	return ret;
+}
+
+/**
+ * DPDK callback for Rx with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -555,112 +590,107 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 uint16_t
 mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
-	struct rxq *rxq = (struct rxq *)dpdk_rxq;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
-	const unsigned int elts_n = rxq->elts_n;
-	unsigned int elts_head = rxq->elts_head;
-	struct ibv_wc wcs[pkts_n];
-	struct ibv_recv_wr *wr_head = NULL;
-	struct ibv_recv_wr **wr_next = &wr_head;
-	struct ibv_recv_wr *wr_bad = NULL;
-	unsigned int i;
-	unsigned int pkts_ret = 0;
-	int ret;
+	struct rxq *rxq = dpdk_rxq;
+	const uint32_t wr_cnt = (1 << rxq->elts_n) - 1;
+	const uint16_t sges_n = rxq->sges_n;
+	struct rte_mbuf *pkt = NULL;
+	struct rte_mbuf *seg = NULL;
+	unsigned int i = 0;
+	uint32_t rq_ci = rxq->rq_ci << sges_n;
+	int len = 0;
 
-	ret = ibv_poll_cq(rxq->cq, pkts_n, wcs);
-	if (unlikely(ret == 0))
-		return 0;
-	if (unlikely(ret < 0)) {
-		DEBUG("rxq=%p, ibv_poll_cq() failed (wc_n=%d)",
-		      (void *)rxq, ret);
-		return 0;
-	}
-	assert(ret <= (int)pkts_n);
-	/* For each work completion. */
-	for (i = 0; i != (unsigned int)ret; ++i) {
-		struct ibv_wc *wc = &wcs[i];
-		struct rxq_elt *elt = &(*elts)[elts_head];
-		struct ibv_recv_wr *wr = &elt->wr;
-		uint32_t len = wc->byte_len;
-		struct rte_mbuf *seg = elt->buf;
-		struct rte_mbuf *rep;
+	while (pkts_n) {
+		struct mlx4_cqe *cqe;
+		uint32_t idx = rq_ci & wr_cnt;
+		struct rte_mbuf *rep = (*rxq->elts)[idx];
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[idx];
 
-		/* Sanity checks. */
-		assert(wr->sg_list == &elt->sge);
-		assert(wr->num_sge == 1);
-		assert(elts_head < rxq->elts_n);
-		assert(rxq->elts_head < rxq->elts_n);
-		/*
-		 * Fetch initial bytes of packet descriptor into a
-		 * cacheline while allocating rep.
-		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
-		/* Link completed WRs together for repost. */
-		*wr_next = wr;
-		wr_next = &wr->next;
-		if (unlikely(wc->status != IBV_WC_SUCCESS)) {
-			/* Whatever, just repost the offending WR. */
-			DEBUG("rxq=%p: bad work completion status (%d): %s",
-			      (void *)rxq, wc->status,
-			      ibv_wc_status_str(wc->status));
-			/* Increment dropped packets counter. */
-			++rxq->stats.idropped;
-			goto repost;
-		}
+		/* Update the 'next' pointer of the previous segment. */
+		if (pkt)
+			seg->next = rep;
+		seg = rep;
+		rte_prefetch0(seg);
+		rte_prefetch0(scat);
 		rep = rte_mbuf_raw_alloc(rxq->mp);
 		if (unlikely(rep == NULL)) {
-			/*
-			 * Unable to allocate a replacement mbuf,
-			 * repost WR.
-			 */
-			DEBUG("rxq=%p: can't allocate a new mbuf",
-			      (void *)rxq);
-			/* Increase out of memory counters. */
 			++rxq->stats.rx_nombuf;
-			++rxq->priv->dev->data->rx_mbuf_alloc_failed;
-			goto repost;
+			if (!pkt) {
+				/*
+				 * No buffers before we even started,
+				 * bail out silently.
+				 */
+				break;
+			}
+			while (pkt != seg) {
+				assert(pkt != (*rxq->elts)[idx]);
+				rep = pkt->next;
+				pkt->next = NULL;
+				pkt->nb_segs = 1;
+				rte_mbuf_raw_free(pkt);
+				pkt = rep;
+			}
+			break;
+		}
+		if (!pkt) {
+			/* Looking for the new packet. */
+			len = mlx4_cq_poll_one(rxq, &cqe);
+			if (!len) {
+				rte_mbuf_raw_free(rep);
+				break;
+			}
+			if (unlikely(len < 0)) {
+				/* Rx error, packet is likely too large. */
+				rte_mbuf_raw_free(rep);
+				++rxq->stats.idropped;
+				goto skip;
+			}
+			pkt = seg;
+			pkt->packet_type = 0;
+			pkt->ol_flags = 0;
+			pkt->pkt_len = len;
+		}
+		rep->nb_segs = 1;
+		rep->port = rxq->port_id;
+		rep->data_len = seg->data_len;
+		rep->data_off = seg->data_off;
+		(*rxq->elts)[idx] = rep;
+		/*
+		 * Fill NIC descriptor with the new buffer. The lkey and size
+		 * of the buffers are already known, only the buffer address
+		 * changes.
+		 */
+		scat->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
+		if (len > seg->data_len) {
+			len -= seg->data_len;
+			++pkt->nb_segs;
+			++rq_ci;
+			continue;
 		}
-		/* Reconfigure sge to use rep instead of seg. */
-		elt->sge.addr = (uintptr_t)rep->buf_addr + RTE_PKTMBUF_HEADROOM;
-		assert(elt->sge.lkey == rxq->mr->lkey);
-		elt->buf = rep;
-		/* Update seg information. */
-		seg->data_off = RTE_PKTMBUF_HEADROOM;
-		seg->nb_segs = 1;
-		seg->port = rxq->port_id;
-		seg->next = NULL;
-		seg->pkt_len = len;
+		/* The last segment. */
 		seg->data_len = len;
-		seg->packet_type = 0;
-		seg->ol_flags = 0;
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += pkt->pkt_len;
 		/* Return packet. */
-		*(pkts++) = seg;
-		++pkts_ret;
-		/* Increase bytes counter. */
-		rxq->stats.ibytes += len;
-repost:
-		if (++elts_head >= elts_n)
-			elts_head = 0;
-		continue;
+		*(pkts++) = pkt;
+		pkt = NULL;
+		--pkts_n;
+		++i;
+skip:
+		/* Align consumer index to the next stride. */
+		rq_ci >>= sges_n;
+		++rq_ci;
+		rq_ci <<= sges_n;
 	}
-	if (unlikely(i == 0))
+	if (unlikely(i == 0 && (rq_ci >> sges_n) == rxq->rq_ci))
 		return 0;
-	/* Repost WRs. */
-	*wr_next = NULL;
-	assert(wr_head);
-	ret = ibv_post_wq_recv(rxq->wq, wr_head, &wr_bad);
-	if (unlikely(ret)) {
-		/* Inability to repost WRs is fatal. */
-		DEBUG("%p: recv_burst(): failed (ret=%d)",
-		      (void *)rxq->priv,
-		      ret);
-		abort();
-	}
-	rxq->elts_head = elts_head;
-	/* Increase packets counter. */
-	rxq->stats.ipackets += pkts_ret;
-	return pkts_ret;
+	/* Update the consumer index. */
+	rxq->rq_ci = rq_ci >> sges_n;
+	rte_wmb();
+	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
+	*rxq->mcq.set_ci_db = rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+	return i;
 }
 
 /**
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index ff27126..fa5738f 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -63,13 +63,6 @@ struct mlx4_rxq_stats {
 	uint64_t rx_nombuf; /**< Total of Rx mbuf allocation failures. */
 };
 
-/** Rx element. */
-struct rxq_elt {
-	struct ibv_recv_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
-	struct rte_mbuf *buf; /**< Buffer. */
-};
-
 /** Rx queue descriptor. */
 struct rxq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -78,10 +71,14 @@ struct rxq {
 	struct ibv_cq *cq; /**< Completion queue. */
 	struct ibv_wq *wq; /**< Work queue. */
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
-	unsigned int port_id; /**< Port ID for incoming packets. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	struct rxq_elt (*elts)[]; /**< Rx elements. */
+	uint16_t rq_ci; /**< Saved RQ consumer index. */
+	uint16_t port_id; /**< Port ID for incoming packets. */
+	uint16_t sges_n; /**< Number of segments per packet (log2 value). */
+	uint16_t elts_n; /**< Mbuf queue size (log2 value). */
+	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
+	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
+	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 	uint8_t data[]; /**< Remaining queue resources. */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v5 3/5] net/mlx4: restore Tx checksum offloads
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
  2017-10-11 18:31         ` [PATCH v5 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
  2017-10-11 18:31         ` [PATCH v5 2/5] net/mlx4: add Rx " Adrien Mazarguil
@ 2017-10-11 18:32         ` Adrien Mazarguil
  2017-10-11 18:32         ` [PATCH v5 4/5] net/mlx4: restore Rx offloads Adrien Mazarguil
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-11 18:32 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
calculation, including inner/outer checksums on supported tunnel types.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4.c        | 11 +++++++++++
 drivers/net/mlx4/mlx4.h        |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h    |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 19 +++++++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.h   |  2 ++
 drivers/net/mlx4/mlx4_txq.c    |  2 ++
 7 files changed, 44 insertions(+)
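
For context, a hypothetical application-side snippet showing how the Tx
checksum offloads enabled by this commit are typically requested through
mbuf flags (assumes a DPDK 17.x-era environment and an Ethernet/IPv4/TCP
frame without IP options; the helper name is an illustrative assumption,
not part of this patch):

#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_tcp.h>

/* Ask the NIC to compute the IPv4 and TCP checksums for this mbuf. */
static void
request_tx_csum_offload(struct rte_mbuf *m)
{
	struct ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
						      sizeof(struct ether_hdr));
	struct tcp_hdr *tcp = (struct tcp_hdr *)(ip + 1);

	/* Header lengths let the NIC locate what to checksum. */
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
	/* Hardware expects a zeroed IP csum and a pseudo-header TCP csum. */
	ip->hdr_checksum = 0;
	tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);
}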

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index 0db9a19..a297b9a 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -566,6 +566,17 @@ mlx4_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->hw_csum =	!!(device_attr.device_cap_flags &
+				   IBV_DEVICE_RAW_IP_CSUM);
+		DEBUG("checksum offloading is %ssupported",
+		      (priv->hw_csum ? "" : "not "));
+		/* Only ConnectX-3 Pro supports tunneling. */
+		priv->hw_csum_l2tun =
+			priv->hw_csum &&
+			(device_attr.vendor_part_id ==
+			 PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
+		DEBUG("L2 tunnel checksum offloads are %ssupported",
+		      (priv->hw_csum_l2tun ? "" : "not "));
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index f4da8c6..e0a9853 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -113,6 +113,8 @@ struct priv {
 	uint32_t vf:1; /**< This is a VF device. */
 	uint32_t intr_alarm:1; /**< An interrupt alarm is scheduled. */
 	uint32_t isolated:1; /**< Toggle isolated mode. */
+	uint32_t hw_csum:1; /* Checksum offload is supported. */
+	uint32_t hw_csum_l2tun:1; /* Checksum support for L2 tunnels. */
 	struct rte_intr_handle intr_handle; /**< Port interrupt handle. */
 	struct mlx4_drop *drop; /**< Shared resources for drop flow rules. */
 	LIST_HEAD(, mlx4_rss) rss; /**< Shared targets for Rx flow rules. */
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index 3623909..a8c0ee2 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -767,6 +767,12 @@ mlx4_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->max_mac_addrs = RTE_DIM(priv->mac);
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->hw_csum_l2tun)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->hash_key_size = MLX4_RSS_HASH_KEY_SIZE;
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 085a595..df5a6b4 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -64,6 +64,8 @@
 
 /* Work queue element (WQE) flags. */
 #define MLX4_BIT_WQE_OWN 0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index cc0baaa..fe7d5d0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -431,6 +431,25 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	} else {
 		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
+	/* Enable HW checksum offload if requested */
+	if (txq->csum &&
+	    (pkt->ol_flags &
+	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = (pkt->ol_flags &
+					      (PKT_TX_TUNNEL_GRE |
+					       PKT_TX_TUNNEL_VXLAN));
+
+		if (is_tunneled && txq->csum_l2tun) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |=
+					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
+		} else {
+			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+						MLX4_WQE_CTRL_TCP_UDP_CSUM);
+		}
+	}
 	ctrl->srcrb_flags = srcrb_flags;
 	/*
 	 * Make sure descriptor is fully written before
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fa5738f..6c88efb 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -124,6 +124,8 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 4258513..41cdc4d 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -276,6 +276,8 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
 		.elts_comp_cd_init =
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
+		.csum = priv->hw_csum,
+		.csum_l2tun = priv->hw_csum_l2tun,
 		.bounce_buf = bounce_buf,
 	};
 	txq->cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v5 4/5] net/mlx4: restore Rx offloads
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
                           ` (2 preceding siblings ...)
  2017-10-11 18:32         ` [PATCH v5 3/5] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
@ 2017-10-11 18:32         ` Adrien Mazarguil
  2017-10-11 18:32         ` [PATCH v5 5/5] net/mlx4: add loopback Tx from VF Adrien Mazarguil
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
  5 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-11 18:32 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
verification, including inner/outer checksums on supported tunnel types.

It also restores packet type recognition support.

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_ethdev.c |   6 +-
 drivers/net/mlx4/mlx4_prm.h    |  29 +++++++++
 drivers/net/mlx4/mlx4_rxq.c    |   5 ++
 drivers/net/mlx4/mlx4_rxtx.c   | 118 +++++++++++++++++++++++++++++++++++-
 drivers/net/mlx4/mlx4_rxtx.h   |   2 +
 5 files changed, 157 insertions(+), 3 deletions(-)
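
For illustration, a short sketch (not part of this patch) of how the restored
Rx offloads surface to an application through the mbuf ol_flags and
packet_type fields filled in by mlx4_rx_burst(); count_good_csum() is a
hypothetical helper:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Count received IPv4 packets whose hardware-verified checksums are good. */
static unsigned int
count_good_csum(struct rte_mbuf **pkts, uint16_t n)
{
	unsigned int good = 0;
	uint16_t i;

	for (i = 0; i != n; ++i) {
		uint64_t f = pkts[i]->ol_flags;

		/* Packet type recognition is restored by this patch. */
		if (!RTE_ETH_IS_IPV4_HDR(pkts[i]->packet_type))
			continue;
		if ((f & PKT_RX_IP_CKSUM_GOOD) && (f & PKT_RX_L4_CKSUM_GOOD))
			++good;
	}
	return good;
}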

diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a8c0ee2..ca2170e 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -767,10 +767,14 @@ mlx4_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->max_mac_addrs = RTE_DIM(priv->mac);
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
-	if (priv->hw_csum)
+	if (priv->hw_csum) {
 		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
 					  DEV_TX_OFFLOAD_UDP_CKSUM |
 					  DEV_TX_OFFLOAD_TCP_CKSUM);
+		info->rx_offload_capa |= (DEV_RX_OFFLOAD_IPV4_CKSUM |
+					  DEV_RX_OFFLOAD_UDP_CKSUM |
+					  DEV_RX_OFFLOAD_TCP_CKSUM);
+	}
 	if (priv->hw_csum_l2tun)
 		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index df5a6b4..3a77502 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -70,6 +70,14 @@
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
 
+/* CQE checksum flags. */
+enum {
+	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
+	MLX4_CQE_L2_TUNNEL_L4_CSUM = (int)(1u << 26),
+	MLX4_CQE_L2_TUNNEL = (int)(1u << 27),
+	MLX4_CQE_L2_TUNNEL_IPOK = (int)(1u << 31),
+};
+
 /* Send queue information. */
 struct mlx4_sq {
 	uint8_t *buf; /**< SQ buffer. */
@@ -119,4 +127,25 @@ mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
 				   (cq->cqe_64 << 5));
 }
 
+/**
+ * Transpose a flag in a value.
+ *
+ * @param val
+ *   Input value.
+ * @param from
+ *   Flag to retrieve from input value.
+ * @param to
+ *   Flag to set in output value.
+ *
+ * @return
+ *   Output value with transposed flag enabled if present on input.
+ */
+static inline uint64_t
+mlx4_transpose(uint64_t val, uint64_t from, uint64_t to)
+{
+	return (from >= to ?
+		(val & from) / (from / to) :
+		(val & from) * (to / from));
+}
+
 #endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index fb6c080..800ec2e 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -464,6 +464,11 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		.sges_n = 0,
 		.elts_n = rte_log2_u32(desc),
 		.elts = elts,
+		/* Toggle Rx checksum offload if hardware supports it. */
+		.csum = (priv->hw_csum &&
+			 dev->data->dev_conf.rxmode.hw_ip_checksum),
+		.csum_l2tun = (priv->hw_csum_l2tun &&
+			       dev->data->dev_conf.rxmode.hw_ip_checksum),
 		.stats.idx = idx,
 		.socket = socket,
 	};
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index fe7d5d0..87c5261 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -557,6 +557,107 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
+ * Translate Rx completion flags to packet type.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ *
+ * @return
+ *   Packet type in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_pkt_type(uint32_t flags)
+{
+	uint32_t pkt_type;
+
+	if (flags & MLX4_CQE_L2_TUNNEL)
+		pkt_type =
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPV4,
+				       RTE_PTYPE_L3_IPV4_EXT_UNKNOWN) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IPV4_PKT,
+				       RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN);
+	else
+		pkt_type = mlx4_transpose(flags,
+					  MLX4_CQE_STATUS_IPV4_PKT,
+					  RTE_PTYPE_L3_IPV4_EXT_UNKNOWN);
+	return pkt_type;
+}
+
+/**
+ * Translate Rx completion flags to offload flags.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   Offload flags (ol_flags) in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_ol_flags(uint32_t flags, int csum, int csum_l2tun)
+{
+	uint32_t ol_flags = 0;
+
+	if (csum)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IP_HDR_CSUM_OK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_TCP_UDP_CSUM_OK,
+				       PKT_RX_L4_CKSUM_GOOD);
+	if ((flags & MLX4_CQE_L2_TUNNEL) && csum_l2tun)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPOK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_L4_CSUM,
+				       PKT_RX_L4_CKSUM_GOOD);
+	return ol_flags;
+}
+
+/**
+ * Extract checksum information from CQE flags.
+ *
+ * @param cqe
+ *   Pointer to CQE structure.
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   CQE checksum information.
+ */
+static inline uint32_t
+mlx4_cqe_flags(struct mlx4_cqe *cqe, int csum, int csum_l2tun)
+{
+	uint32_t flags = 0;
+
+	/*
+	 * The relevant bits are at different locations within their
+	 * CQE fields, therefore they can be joined in a single 32-bit
+	 * variable.
+	 */
+	if (csum)
+		flags = (rte_be_to_cpu_32(cqe->status) &
+			 MLX4_CQE_STATUS_IPV4_CSUM_OK);
+	if (csum_l2tun)
+		flags |= (rte_be_to_cpu_32(cqe->vlan_my_qpn) &
+			  (MLX4_CQE_L2_TUNNEL |
+			   MLX4_CQE_L2_TUNNEL_IPOK |
+			   MLX4_CQE_L2_TUNNEL_L4_CSUM |
+			   MLX4_CQE_L2_TUNNEL_IPV4));
+	return flags;
+}
+
+/**
  * Poll one CQE from CQ.
  *
  * @param rxq
@@ -664,8 +765,21 @@ mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				goto skip;
 			}
 			pkt = seg;
-			pkt->packet_type = 0;
-			pkt->ol_flags = 0;
+			if (rxq->csum | rxq->csum_l2tun) {
+				uint32_t flags =
+					mlx4_cqe_flags(cqe,
+						       rxq->csum,
+						       rxq->csum_l2tun);
+
+				pkt->ol_flags =
+					rxq_cq_to_ol_flags(flags,
+							   rxq->csum,
+							   rxq->csum_l2tun);
+				pkt->packet_type = rxq_cq_to_pkt_type(flags);
+			} else {
+				pkt->packet_type = 0;
+				pkt->ol_flags = 0;
+			}
 			pkt->pkt_len = len;
 		}
 		rep->nb_segs = 1;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 6c88efb..51af69c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -78,6 +78,8 @@ struct rxq {
 	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
 	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
 	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v5 5/5] net/mlx4: add loopback Tx from VF
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
                           ` (3 preceding siblings ...)
  2017-10-11 18:32         ` [PATCH v5 4/5] net/mlx4: restore Rx offloads Adrien Mazarguil
@ 2017-10-11 18:32         ` Adrien Mazarguil
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
  5 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-11 18:32 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds the loopback functionality needed when the device is a VF
in order to enable packet transmission between VFs and the PF.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 33 +++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |  1 +
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 3 files changed, 24 insertions(+), 12 deletions(-)
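
For illustration, a standalone sketch (not part of this patch) of how the
destination MAC address gets split across the WQE control segment so the
eSwitch can loop the packet back; struct wqe_ctrl_mac and split_dmac() are
hypothetical stand-ins for the srcrb.flags16[0] and imm fields written by
mlx4_post_send() below:

#include <stdint.h>
#include <string.h>

/* First 2 bytes of the destination MAC go into the 16-bit half of
 * srcrb_flags, the remaining 4 bytes into the "imm" field. */
struct wqe_ctrl_mac {
	uint16_t srcrb_flags16_0;
	uint32_t imm;
};

static void
split_dmac(const uint8_t *frame, struct wqe_ctrl_mac *out)
{
	memcpy(&out->srcrb_flags16_0, frame, sizeof(uint16_t));
	memcpy(&out->imm, frame + sizeof(uint16_t), sizeof(uint32_t));
}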

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 87c5261..36173ad 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -311,10 +311,13 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct rte_mbuf *buf;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
-	uint32_t srcrb_flags;
 	uint32_t owner_opcode = MLX4_OPCODE_SEND;
 	uint32_t byte_count;
 	int wqe_real_size;
@@ -414,22 +417,16 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
-	 * The caller should prepare "imm" in advance in order to support
-	 * VF to VF communication (when the device is a virtual-function
-	 * device (VF)).
-	 */
-	ctrl->imm = 0;
-	/*
 	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
 	 * should be calculated.
 	 */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 				       MLX4_WQE_CTRL_CQ_UPDATE);
 	} else {
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
 	/* Enable HW checksum offload if requested */
 	if (txq->csum &&
@@ -443,14 +440,26 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |=
+				srcrb.flags |=
 					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
 		} else {
-			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
 						MLX4_WQE_CTRL_TCP_UDP_CSUM);
 		}
 	}
-	ctrl->srcrb_flags = srcrb_flags;
+	if (txq->lb) {
+		/*
+		 * Copy destination MAC address to the WQE, this allows
+		 * loopback in eSwitch, so that VFs and PF can communicate
+		 * with each other.
+		 */
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						      sizeof(uint16_t)));
+	} else {
+		ctrl->imm = 0;
+	}
+	ctrl->srcrb_flags = srcrb.flags;
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 51af69c..e10bbca 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -128,6 +128,7 @@ struct txq {
 	uint32_t max_inline; /**< Max inline send size. */
 	uint32_t csum:1; /**< Enable checksum offloading. */
 	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
+	uint32_t lb:1; /**< Whether packets should be looped back by eSwitch. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 41cdc4d..df4feb5 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -278,6 +278,8 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
 		.csum = priv->hw_csum,
 		.csum_l2tun = priv->hw_csum_l2tun,
+		/* Enable Tx loopback for VF devices. */
+		.lb = !!priv->vf,
 		.bounce_buf = bounce_buf,
 	};
 	txq->cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
                           ` (4 preceding siblings ...)
  2017-10-11 18:32         ` [PATCH v5 5/5] net/mlx4: add loopback Tx from VF Adrien Mazarguil
@ 2017-10-12 12:29         ` Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
                             ` (6 more replies)
  5 siblings, 7 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-12 12:29 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

Hopefully the last iteration for this series.

v6 (Adrien):
- Updated features documentation (mlx4.ini) in the relevant patches.
- Rebased on the latest changes brought by RSS support v2 series.

v5 (Ophir & Adrien):
- Merged Rx scatter/Tx gather code back into individual Rx/Tx commits
  for consistency due to a couple of issues with gather-less Tx.
- Rebased on top of the latest mlx4 control path changes (RSS support).

v4 (Ophir):
- Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4:
  restore full Rx support bypassing Verbs" commit.

v3 (Adrien):
- Drop a few unrelated or unnecessary changes such as the removal of
  MLX4_PMD_TX_MP_CACHE.
- Move device checksum support detection code to its previous location.
- Fix include guard in mlx4_prm.h.
- Reorder #includes alphabetically.
- Replace MLX4_TRANSPOSE() macro with documented inline function.
- Remove extra spaces and blank lines.
- Use uint8_t * instead of char * for buffers.
- Replace mlx4_get_cqe() macro with a documented inline function.
- Replace several unsigned int with uint32_t.
- Add consistency to field names (sge_n => sges_n).
- Make mbuf size checks in RX queue setup function similar to mlx5.
- Update various comments.
- Fix indentation.
- Replace run-time endian conversion with static ones where possible.
- Reorder fields in struct rxq and struct txq for consistency, remove
  one level of unnecessary inner structures.
- Fix memory leak on Tx bounce buffer.
- Update commit logs.
- Fix remaining checkpatch warnings.

v2 (Matan):
Rearrange patches.
Semantics.
Enhancements.
Fix compilation issues.

Moti Haimovsky (5):
  net/mlx4: add Tx bypassing Verbs
  net/mlx4: add Rx bypassing Verbs
  net/mlx4: restore Tx checksum offloads
  net/mlx4: restore Rx offloads
  net/mlx4: add loopback Tx from VF

 doc/guides/nics/features/mlx4.ini |   6 +
 drivers/net/mlx4/mlx4.c           |  11 +
 drivers/net/mlx4/mlx4.h           |   2 +
 drivers/net/mlx4/mlx4_ethdev.c    |  10 +
 drivers/net/mlx4/mlx4_prm.h       | 151 +++++++
 drivers/net/mlx4/mlx4_rxq.c       | 156 +++++--
 drivers/net/mlx4/mlx4_rxtx.c      | 768 ++++++++++++++++++++++++---------
 drivers/net/mlx4/mlx4_rxtx.h      |  54 +--
 drivers/net/mlx4/mlx4_txq.c       |  63 +++
 9 files changed, 948 insertions(+), 273 deletions(-)
 create mode 100644 drivers/net/mlx4/mlx4_prm.h

-- 
2.1.4

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v6 1/5] net/mlx4: add Tx bypassing Verbs
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
@ 2017-10-12 12:29           ` Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 2/5] net/mlx4: add Rx " Adrien Mazarguil
                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-12 12:29 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

Modify PMD to send single-buffer packets directly to the device
bypassing the Verbs Tx post and poll routines.

Tx gather support: add support for transmitting packets spanning
multiple buffers.

Take into account the number of TxQ entries a packet occupies when
setting the chip's report-completion flag.

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_prm.h  | 120 ++++++++++++
 drivers/net/mlx4/mlx4_rxtx.c | 398 ++++++++++++++++++++++++++++----------
 drivers/net/mlx4/mlx4_rxtx.h |  30 +--
 drivers/net/mlx4/mlx4_txq.c  |  59 ++++++
 4 files changed, 490 insertions(+), 117 deletions(-)
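
For context, a standalone sketch (not part of this patch) of how the number
of 64-byte Tx basic blocks (TXBBs) occupied by a packet follows from its
segment count; TXBB_SIZE, CTRL_SEG_SIZE and DATA_SEG_SIZE mirror the
constants and structure sizes assumed by mlx4_prm.h below:

#include <stdint.h>

#define TXBB_SIZE 64 /* ConnectX-3 Tx basic block (MLX4_TXBB_SIZE). */
#define CTRL_SEG_SIZE 16 /* Assumed sizeof(struct mlx4_wqe_ctrl_seg). */
#define DATA_SEG_SIZE 16 /* Assumed sizeof(struct mlx4_wqe_data_seg). */

/* TXBBs needed by a WQE carrying nb_segs data segments. */
static inline unsigned int
txbbs_for_packet(unsigned int nb_segs)
{
	unsigned int wqe_size = CTRL_SEG_SIZE + nb_segs * DATA_SEG_SIZE;

	return (wqe_size + TXBB_SIZE - 1) / TXBB_SIZE;
}

A single-segment packet thus needs one TXBB (a 32-byte WQE rounded up to 64
bytes), a four-segment packet needs two, and this per-packet TXBB count is
what the completion-request countdown (elts_comp_cd) is decremented by.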

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
new file mode 100644
index 0000000..085a595
--- /dev/null
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -0,0 +1,120 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef MLX4_PRM_H_
+#define MLX4_PRM_H_
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_byteorder.h>
+
+/* Verbs headers do not support -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include <infiniband/mlx4dv.h>
+#include <infiniband/verbs.h>
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+/* ConnectX-3 Tx queue basic block. */
+#define MLX4_TXBB_SHIFT 6
+#define MLX4_TXBB_SIZE (1 << MLX4_TXBB_SHIFT)
+
+/* Typical TSO descriptor with 16 gather entries is 352 bytes. */
+#define MLX4_MAX_WQE_SIZE 512
+#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+
+/* Send queue stamping/invalidating information. */
+#define MLX4_SQ_STAMP_STRIDE 64
+#define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
+#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_STAMP_VAL 0x7fffffff
+
+/* Work queue element (WQE) flags. */
+#define MLX4_BIT_WQE_OWN 0x80000000
+
+#define MLX4_SIZE_TO_TXBBS(size) \
+	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
+
+/* Send queue information. */
+struct mlx4_sq {
+	uint8_t *buf; /**< SQ buffer. */
+	uint8_t *eob; /**< End of SQ buffer */
+	uint32_t head; /**< SQ head counter in units of TXBBS. */
+	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
+	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
+	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
+	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t *db; /**< Pointer to the doorbell. */
+	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
+};
+
+#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
+
+/* Completion queue information. */
+struct mlx4_cq {
+	uint8_t *buf; /**< Pointer to the completion queue buffer. */
+	uint32_t cqe_cnt; /**< Number of entries in the queue. */
+	uint32_t cqe_64:1; /**< CQ entry size is 64 bytes. */
+	uint32_t cons_index; /**< Last queue entry that was handled. */
+	uint32_t *set_ci_db; /**< Pointer to the completion queue doorbell. */
+};
+
+/**
+ * Retrieve a CQE entry from a CQ.
+ *
+ * cqe = cq->buf + cons_index * cqe_size + cqe_offset
+ *
+ * Where cqe_size is 32 or 64 bytes and cqe_offset is 0 or 32 (depending on
+ * cqe_size).
+ *
+ * @param cq
+ *   CQ to retrieve entry from.
+ * @param index
+ *   Entry index.
+ *
+ * @return
+ *   Pointer to CQE entry.
+ */
+static inline struct mlx4_cqe *
+mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
+{
+	return (struct mlx4_cqe *)(cq->buf +
+				   ((index & (cq->cqe_cnt - 1)) <<
+				    (5 + cq->cqe_64)) +
+				   (cq->cqe_64 << 5));
+}
+
+#endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 859f1bd..38b87a0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -52,15 +52,81 @@
 
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
+#include <rte_io.h>
 #include <rte_mbuf.h>
 #include <rte_mempool.h>
 #include <rte_prefetch.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
 /**
+ * Pointer-value pair structure used in mlx4_post_send() for saving the first
+ * DWORD (32 bits) of a TXBB.
+ */
+struct pv {
+	struct mlx4_wqe_data_seg *dseg;
+	uint32_t val;
+};
+
+/**
+ * Stamp a WQE so it won't be reused by the HW.
+ *
+ * Routine is used when freeing a WQE used by the chip or when building
+ * a WQ entry has failed, leaving partial information on the queue.
+ *
+ * @param sq
+ *   Pointer to the SQ structure.
+ * @param index
+ *   Index of the freed WQE.
+ *   The number of TXBBs to stamp is not passed as a parameter; the
+ *   routine reads it back from the size stored in the control segment
+ *   of the WQ entry itself.
+ * @param owner
+ *   The value of the WQE owner bit to use in the stamp.
+ *
+ * @return
+ *   The number of Tx basic blocks (TXBBs) the WQE contained.
+ */
+static int
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+{
+	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+					  (!!owner << MLX4_SQ_STAMP_SHIFT));
+	uint8_t *wqe = mlx4_get_send_wqe(sq, (index & sq->txbb_cnt_mask));
+	uint32_t *ptr = (uint32_t *)wqe;
+	int i;
+	int txbbs_size;
+	int num_txbbs;
+
+	/* Extract the size from the control segment of the WQE. */
+	num_txbbs = MLX4_SIZE_TO_TXBBS((((struct mlx4_wqe_ctrl_seg *)
+					 wqe)->fence_size & 0x3f) << 4);
+	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	/* Optimize the common case when there is no wrap-around. */
+	if (wqe + txbbs_size <= sq->eob) {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+		}
+	} else {
+		/* Stamp the freed descriptor. */
+		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
+			*ptr = stamp;
+			ptr += MLX4_SQ_STAMP_DWORDS;
+			if ((uint8_t *)ptr >= sq->eob) {
+				ptr = (uint32_t *)sq->buf;
+				stamp ^= RTE_BE32(0x80000000);
+			}
+		}
+	}
+	return num_txbbs;
+}
+
+/**
  * Manage Tx completions.
  *
  * When sending a burst, mlx4_tx_burst() posts several WRs.
@@ -80,26 +146,71 @@ mlx4_txq_complete(struct txq *txq)
 	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
 	const unsigned int elts_n = txq->elts_n;
-	struct ibv_wc wcs[elts_comp];
-	int wcs_n;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cqe *cqe;
+	uint32_t cons_index = cq->cons_index;
+	uint16_t new_index;
+	uint16_t nr_txbbs = 0;
+	int pkts = 0;
 
 	if (unlikely(elts_comp == 0))
 		return 0;
-	wcs_n = ibv_poll_cq(txq->cq, elts_comp, wcs);
-	if (unlikely(wcs_n == 0))
+	/*
+	 * Traverse over all CQ entries reported and handle each WQ entry
+	 * reported by them.
+	 */
+	do {
+		cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cons_index);
+		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+		    !!(cons_index & cq->cqe_cnt)))
+			break;
+		/*
+		 * Make sure we read the CQE after we read the ownership bit.
+		 */
+		rte_rmb();
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR)) {
+			struct mlx4_err_cqe *cqe_err =
+				(struct mlx4_err_cqe *)cqe;
+			ERROR("%p CQE error - vendor syndrome: 0x%x"
+			      " syndrome: 0x%x\n",
+			      (void *)txq, cqe_err->vendor_err,
+			      cqe_err->syndrome);
+		}
+		/* Get WQE index reported in the CQE. */
+		new_index =
+			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		do {
+			/* Free next descriptor. */
+			nr_txbbs +=
+				mlx4_txq_stamp_freed_wqe(sq,
+				     (sq->tail + nr_txbbs) & sq->txbb_cnt_mask,
+				     !!((sq->tail + nr_txbbs) & sq->txbb_cnt));
+			pkts++;
+		} while (((sq->tail + nr_txbbs) & sq->txbb_cnt_mask) !=
+			 new_index);
+		cons_index++;
+	} while (1);
+	if (unlikely(pkts == 0))
 		return 0;
-	if (unlikely(wcs_n < 0)) {
-		DEBUG("%p: ibv_poll_cq() failed (wcs_n=%d)",
-		      (void *)txq, wcs_n);
-		return -1;
-	}
-	elts_comp -= wcs_n;
+	/*
+	 * Update CQ.
+	 * To prevent CQ overflow we first update CQ consumer and only then
+	 * the ring consumer.
+	 */
+	cq->cons_index = cons_index;
+	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & 0xffffff);
+	rte_wmb();
+	sq->tail = sq->tail + nr_txbbs;
+	/* Update the list of packets posted for transmission. */
+	elts_comp -= pkts;
 	assert(elts_comp <= txq->elts_comp);
 	/*
-	 * Assume WC status is successful as nothing can be done about it
-	 * anyway.
+	 * Assume completion status is successful as nothing can be done about
+	 * it anyway.
 	 */
-	elts_tail += wcs_n * txq->elts_comp_cd_init;
+	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
@@ -183,6 +294,161 @@ mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
 }
 
 /**
+ * Posts a single work request to a send queue.
+ *
+ * @param txq
+ *   Target Tx queue.
+ * @param pkt
+ *   Packet to transmit.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+static inline int
+mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
+{
+	struct mlx4_wqe_ctrl_seg *ctrl;
+	struct mlx4_wqe_data_seg *dseg;
+	struct mlx4_sq *sq = &txq->msq;
+	struct rte_mbuf *buf;
+	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+	uint32_t lkey;
+	uintptr_t addr;
+	uint32_t srcrb_flags;
+	uint32_t owner_opcode = MLX4_OPCODE_SEND;
+	uint32_t byte_count;
+	int wqe_real_size;
+	int nr_txbbs;
+	int rc;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
+	int pv_counter = 0;
+
+	/* Calculate the needed work queue entry size for this packet. */
+	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+	/*
+	 * Check that there is room for this WQE in the send queue and that
+	 * the WQE size is legal.
+	 */
+	if (((sq->head - sq->tail) + nr_txbbs +
+	     sq->headroom_txbbs) >= sq->txbb_cnt ||
+	    nr_txbbs > MLX4_MAX_WQE_TXBBS) {
+		rc = ENOSPC;
+		goto err;
+	}
+	/* Get the control and data entries of the WQE. */
+	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
+	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
+					    sizeof(struct mlx4_wqe_ctrl_seg));
+	/* Fill the data segments with buffer information. */
+	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
+		addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		rte_prefetch0((volatile void *)addr);
+		/* Handle WQE wraparound. */
+		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
+			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+		dseg->addr = rte_cpu_to_be_64(addr);
+		/* Memory region key for this memory pool. */
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			/* MR does not exist. */
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			/*
+			 * Restamp entry in case of failure.
+			 * Make sure that size is written correctly
+			 * Note that we give ownership to the SW, not the HW.
+			 */
+			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+			mlx4_txq_stamp_freed_wqe(sq, head_idx,
+				     (sq->head & sq->txbb_cnt) ? 0 : 1);
+			rc = EFAULT;
+			goto err;
+		}
+		dseg->lkey = rte_cpu_to_be_32(lkey);
+		if (likely(buf->data_len)) {
+			byte_count = rte_cpu_to_be_32(buf->data_len);
+		} else {
+			/*
+			 * Zero length segment is treated as inline segment
+			 * with zero data.
+			 */
+			byte_count = RTE_BE32(0x80000000);
+		}
+		/*
+		 * If the data segment is not at the beginning of a
+		 * Tx basic block (TXBB) then write the byte count,
+		 * else postpone the writing to just before updating the
+		 * control segment.
+		 */
+		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			/*
+			 * Need a barrier here before writing the byte_count
+			 * fields to make sure that all the data is visible
+			 * before the byte_count field is set.
+			 * Otherwise, if the segment begins a new cacheline,
+			 * the HCA prefetcher could grab the 64-byte chunk and
+			 * get a valid (!= 0xffffffff) byte count but stale
+			 * data, and end up sending the wrong data.
+			 */
+			rte_io_wmb();
+			dseg->byte_count = byte_count;
+		} else {
+			/*
+			 * This data segment starts at the beginning of a new
+			 * TXBB, so we need to postpone its byte_count writing
+			 * for later.
+			 */
+			pv[pv_counter].dseg = dseg;
+			pv[pv_counter++].val = byte_count;
+		}
+	}
+	/* Write the first DWORD of each TXBB saved earlier. */
+	if (pv_counter) {
+		/* Need a barrier here before writing the byte_count. */
+		rte_io_wmb();
+		for (--pv_counter; pv_counter  >= 0; pv_counter--)
+			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
+	}
+	/* Fill the control parameters for this packet. */
+	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+	/*
+	 * The caller should prepare "imm" in advance in order to support
+	 * VF to VF communication (when the device is a virtual-function
+	 * device (VF)).
+	 */
+	ctrl->imm = 0;
+	/*
+	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
+	 * should be calculated.
+	 */
+	txq->elts_comp_cd -= nr_txbbs;
+	if (unlikely(txq->elts_comp_cd <= 0)) {
+		txq->elts_comp_cd = txq->elts_comp_cd_init;
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+				       MLX4_WQE_CTRL_CQ_UPDATE);
+	} else {
+		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+	}
+	ctrl->srcrb_flags = srcrb_flags;
+	/*
+	 * Make sure descriptor is fully written before
+	 * setting ownership bit (because HW can start
+	 * executing as soon as we do).
+	 */
+	rte_wmb();
+	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					      ((sq->head & sq->txbb_cnt) ?
+					       MLX4_BIT_WQE_OWN : 0));
+	sq->head += nr_txbbs;
+	return 0;
+err:
+	rte_errno = rc;
+	return -rc;
+}
+
+/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -199,18 +465,15 @@ uint16_t
 mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	struct txq *txq = (struct txq *)dpdk_txq;
-	struct ibv_send_wr *wr_head = NULL;
-	struct ibv_send_wr **wr_next = &wr_head;
-	struct ibv_send_wr *wr_bad = NULL;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
-	unsigned int elts_comp_cd = txq->elts_comp_cd;
 	unsigned int elts_comp = 0;
+	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
 	int err;
 
-	assert(elts_comp_cd != 0);
+	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
 	max = (elts_n - (elts_head - txq->elts_tail));
 	if (max > elts_n)
@@ -229,10 +492,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		struct ibv_send_wr *wr = &elt->wr;
-		unsigned int segs = buf->nb_segs;
-		unsigned int sent_size = 0;
-		uint32_t send_flags = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -250,100 +509,31 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				tmp = next;
 			} while (tmp != NULL);
 		}
-		/* Request Tx completion. */
-		if (unlikely(--elts_comp_cd == 0)) {
-			elts_comp_cd = txq->elts_comp_cd_init;
-			++elts_comp;
-			send_flags |= IBV_SEND_SIGNALED;
-		}
-		if (likely(segs == 1)) {
-			struct ibv_sge *sge = &elt->sge;
-			uintptr_t addr;
-			uint32_t length;
-			uint32_t lkey;
-
-			/* Retrieve buffer information. */
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			length = buf->data_len;
-			/* Retrieve memory region key for this memory pool. */
-			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			if (unlikely(lkey == (uint32_t)-1)) {
-				/* MR does not exist. */
-				DEBUG("%p: unable to get MP <-> MR"
-				      " association", (void *)txq);
-				/* Clean up Tx element. */
-				elt->buf = NULL;
-				goto stop;
-			}
-			/* Update element. */
-			elt->buf = buf;
-			if (txq->priv->vf)
-				rte_prefetch0((volatile void *)
-					      (uintptr_t)addr);
-			RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-			sge->addr = addr;
-			sge->length = length;
-			sge->lkey = lkey;
-			sent_size += length;
-		} else {
-			err = -1;
+		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
+		/* Post the packet for sending. */
+		err = mlx4_post_send(txq, buf);
+		if (unlikely(err)) {
+			elt->buf = NULL;
 			goto stop;
 		}
-		if (sent_size <= txq->max_inline)
-			send_flags |= IBV_SEND_INLINE;
+		elt->buf = buf;
+		bytes_sent += buf->pkt_len;
+		++elts_comp;
 		elts_head = elts_head_next;
-		/* Increment sent bytes counter. */
-		txq->stats.obytes += sent_size;
-		/* Set up WR. */
-		wr->sg_list = &elt->sge;
-		wr->num_sge = segs;
-		wr->opcode = IBV_WR_SEND;
-		wr->send_flags = send_flags;
-		*wr_next = wr;
-		wr_next = &wr->next;
 	}
 stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
-	/* Increment sent packets counter. */
+	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
+	txq->stats.obytes += bytes_sent;
+	/* Make sure that descriptors are written before doorbell record. */
+	rte_wmb();
 	/* Ring QP doorbell. */
-	*wr_next = NULL;
-	assert(wr_head);
-	err = ibv_post_send(txq->qp, wr_head, &wr_bad);
-	if (unlikely(err)) {
-		uint64_t obytes = 0;
-		uint64_t opackets = 0;
-
-		/* Rewind bad WRs. */
-		while (wr_bad != NULL) {
-			int j;
-
-			/* Force completion request if one was lost. */
-			if (wr_bad->send_flags & IBV_SEND_SIGNALED) {
-				elts_comp_cd = 1;
-				--elts_comp;
-			}
-			++opackets;
-			for (j = 0; j < wr_bad->num_sge; ++j)
-				obytes += wr_bad->sg_list[j].length;
-			elts_head = (elts_head ? elts_head : elts_n) - 1;
-			wr_bad = wr_bad->next;
-		}
-		txq->stats.opackets -= opackets;
-		txq->stats.obytes -= obytes;
-		i -= opackets;
-		DEBUG("%p: ibv_post_send() failed, %" PRIu64 " packets"
-		      " (%" PRIu64 " bytes) rejected: %s",
-		      (void *)txq,
-		      opackets,
-		      obytes,
-		      (err <= -1) ? "Internal error" : strerror(err));
-	}
+	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head = elts_head;
 	txq->elts_comp += elts_comp;
-	txq->elts_comp_cd = elts_comp_cd;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index eca966f..ff27126 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -41,6 +41,7 @@
 #ifdef PEDANTIC
 #pragma GCC diagnostic ignored "-Wpedantic"
 #endif
+#include <infiniband/mlx4dv.h>
 #include <infiniband/verbs.h>
 #ifdef PEDANTIC
 #pragma GCC diagnostic error "-Wpedantic"
@@ -51,6 +52,7 @@
 #include <rte_mempool.h>
 
 #include "mlx4.h"
+#include "mlx4_prm.h"
 
 /** Rx queue counters. */
 struct mlx4_rxq_stats {
@@ -101,8 +103,6 @@ struct mlx4_rss {
 
 /** Tx element. */
 struct txq_elt {
-	struct ibv_send_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
 	struct rte_mbuf *buf; /**< Buffer. */
 };
 
@@ -116,24 +116,28 @@ struct mlx4_txq_stats {
 
 /** Tx queue descriptor. */
 struct txq {
-	struct priv *priv; /**< Back pointer to private data. */
+	struct mlx4_sq msq; /**< Info for directly manipulating the SQ. */
+	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
+	unsigned int elts_head; /**< Current index in (*elts)[]. */
+	unsigned int elts_tail; /**< First element awaiting completion. */
+	unsigned int elts_comp; /**< Number of packets awaiting completion. */
+	int elts_comp_cd; /**< Countdown for next completion. */
+	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
+	unsigned int elts_n; /**< (*elts)[] length. */
+	struct txq_elt (*elts)[]; /**< Tx elements. */
+	struct mlx4_txq_stats stats; /**< Tx queue counters. */
+	uint32_t max_inline; /**< Max inline send size. */
+	uint8_t *bounce_buf;
+	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
 		const struct rte_mempool *mp; /**< Cached memory pool. */
 		struct ibv_mr *mr; /**< Memory region (for mp). */
 		uint32_t lkey; /**< mr->lkey copy. */
 	} mp2mr[MLX4_PMD_TX_MP_CACHE]; /**< MP to MR translation table. */
+	struct priv *priv; /**< Back pointer to private data. */
+	unsigned int socket; /**< CPU socket ID for allocations. */
 	struct ibv_cq *cq; /**< Completion queue. */
 	struct ibv_qp *qp; /**< Queue pair. */
-	uint32_t max_inline; /**< Max inline send size. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	struct txq_elt (*elts)[]; /**< Tx elements. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of completion requests. */
-	unsigned int elts_comp_cd; /**< Countdown for next completion. */
-	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
-	struct mlx4_txq_stats stats; /**< Tx queue counters. */
-	unsigned int socket; /**< CPU socket ID for allocations. */
 	uint8_t data[]; /**< Remaining queue resources. */
 };
 
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 915f8d7..fbb028a 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -60,6 +60,7 @@
 
 #include "mlx4.h"
 #include "mlx4_autoconf.h"
+#include "mlx4_prm.h"
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
@@ -148,6 +149,41 @@ mlx4_txq_mp2mr_iter(struct rte_mempool *mp, void *arg)
 }
 
 /**
+ * Retrieves information needed in order to directly access the Tx queue.
+ *
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param mlxdv
+ *   Pointer to device information for this Tx queue.
+ */
+static void
+mlx4_txq_fill_dv_obj_info(struct txq *txq, struct mlx4dv_obj *mlxdv)
+{
+	struct mlx4_sq *sq = &txq->msq;
+	struct mlx4_cq *cq = &txq->mcq;
+	struct mlx4dv_qp *dqp = mlxdv->qp.out;
+	struct mlx4dv_cq *dcq = mlxdv->cq.out;
+	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	/* Total length, including headroom and spare WQEs. */
+	sq->eob = sq->buf + sq_size;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->txbb_cnt =
+		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
+	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->db = dqp->sdb;
+	sq->doorbell_qpn = dqp->doorbell_qpn;
+	sq->headroom_txbbs =
+		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
+	cq->buf = dcq->buf.buf;
+	cq->cqe_cnt = dcq->cqe_cnt;
+	cq->set_ci_db = dcq->set_ci_db;
+	cq->cqe_64 = (dcq->cqe_size & 64) ? 1 : 0;
+}
+
+/**
  * DPDK callback to configure a Tx queue.
  *
  * @param dev
@@ -169,9 +205,13 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    unsigned int socket, const struct rte_eth_txconf *conf)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_qp dv_qp;
+	struct mlx4dv_cq dv_cq;
 	struct txq_elt (*elts)[desc];
 	struct ibv_qp_init_attr qp_init_attr;
 	struct txq *txq;
+	uint8_t *bounce_buf;
 	struct mlx4_malloc_vec vec[] = {
 		{
 			.align = RTE_CACHE_LINE_SIZE,
@@ -183,6 +223,11 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			.size = sizeof(*elts),
 			.addr = (void **)&elts,
 		},
+		{
+			.align = RTE_CACHE_LINE_SIZE,
+			.size = MLX4_MAX_WQE_SIZE,
+			.addr = (void **)&bounce_buf,
+		},
 	};
 	int ret;
 
@@ -231,6 +276,7 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
 		.elts_comp_cd_init =
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
+		.bounce_buf = bounce_buf,
 	};
 	txq->cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
 	if (!txq->cq) {
@@ -297,6 +343,19 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = txq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.qp.in = txq->qp;
+	mlxdv.qp.out = &dv_qp;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_QP | MLX4DV_OBJ_CQ);
+	if (ret) {
+		rte_errno = EINVAL;
+		ERROR("%p: failed to obtain information needed for"
+		      " accessing the device queues", (void *)dev);
+		goto error;
+	}
+	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
 	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 2/5] net/mlx4: add Rx bypassing Verbs
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
@ 2017-10-12 12:29           ` Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 3/5] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
                             ` (4 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-12 12:29 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds support for accessing the hardware directly when
handling Rx packets eliminating the need to use Verbs in the Rx data
path.

Rx scatter support: calculate the number of scatter/gather entries
(SGEs) per packet on the fly according to the maximum expected packet
size.

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 doc/guides/nics/features/mlx4.ini |   1 +
 drivers/net/mlx4/mlx4_rxq.c       | 151 +++++++++++++++-------
 drivers/net/mlx4/mlx4_rxtx.c      | 226 +++++++++++++++++++--------------
 drivers/net/mlx4/mlx4_rxtx.h      |  19 ++-
 4 files changed, 242 insertions(+), 155 deletions(-)
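
To illustrate, a standalone sketch (not part of this patch) of the per-packet
SGE calculation described above, mirroring the logic added to
mlx4_rx_queue_setup() below; rx_sges_log2() is a hypothetical helper and its
arguments stand for RTE_PKTMBUF_HEADROOM, rxmode.max_rx_pkt_len and the
mempool data room size:

#include <stdint.h>
#include <rte_common.h> /* rte_log2_u32() */

/* log2 of the number of SGEs needed per packet, rounded up to a power of
 * two as required when sizing the work queue. */
static inline uint32_t
rx_sges_log2(uint32_t headroom, uint32_t max_rx_pkt_len, uint32_t mb_len)
{
	uint32_t size = headroom + max_rx_pkt_len;
	uint32_t sges = size / mb_len + !!(size % mb_len);

	return rte_log2_u32(sges);
}

For instance, assuming 2048-byte buffers and a 9000-byte maximum packet
length, this yields 2^3 = 8 SGEs per packet; the queue setup code then checks
that the resulting total buffer space still covers max_rx_pkt_len.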

diff --git a/doc/guides/nics/features/mlx4.ini b/doc/guides/nics/features/mlx4.ini
index 9750ebf..19ae688 100644
--- a/doc/guides/nics/features/mlx4.ini
+++ b/doc/guides/nics/features/mlx4.ini
@@ -12,6 +12,7 @@ Rx interrupt         = Y
 Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
+Scattered Rx         = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 483fe9b..39c83bc 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -51,6 +51,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif
 
+#include <rte_byteorder.h>
 #include <rte_common.h>
 #include <rte_errno.h>
 #include <rte_ethdev.h>
@@ -312,45 +313,46 @@ void mlx4_rss_detach(struct mlx4_rss *rss)
 static int
 mlx4_rxq_alloc_elts(struct rxq *rxq)
 {
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
+	const uint32_t elts_n = 1 << rxq->elts_n;
+	const uint32_t sges_n = 1 << rxq->sges_n;
+	struct rte_mbuf *(*elts)[elts_n] = rxq->elts;
 	unsigned int i;
 
-	/* For each WR (packet). */
+	assert(rte_is_power_of_2(elts_n));
 	for (i = 0; i != RTE_DIM(*elts); ++i) {
-		struct rxq_elt *elt = &(*elts)[i];
-		struct ibv_recv_wr *wr = &elt->wr;
-		struct ibv_sge *sge = &(*elts)[i].sge;
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[i];
 		struct rte_mbuf *buf = rte_pktmbuf_alloc(rxq->mp);
 
 		if (buf == NULL) {
 			while (i--) {
-				rte_pktmbuf_free_seg((*elts)[i].buf);
-				(*elts)[i].buf = NULL;
+				rte_pktmbuf_free_seg((*elts)[i]);
+				(*elts)[i] = NULL;
 			}
 			rte_errno = ENOMEM;
 			return -rte_errno;
 		}
-		elt->buf = buf;
-		wr->next = &(*elts)[(i + 1)].wr;
-		wr->sg_list = sge;
-		wr->num_sge = 1;
 		/* Headroom is reserved by rte_pktmbuf_alloc(). */
 		assert(buf->data_off == RTE_PKTMBUF_HEADROOM);
 		/* Buffer is supposed to be empty. */
 		assert(rte_pktmbuf_data_len(buf) == 0);
 		assert(rte_pktmbuf_pkt_len(buf) == 0);
-		/* sge->addr must be able to store a pointer. */
-		assert(sizeof(sge->addr) >= sizeof(uintptr_t));
-		/* SGE keeps its headroom. */
-		sge->addr = (uintptr_t)
-			((uint8_t *)buf->buf_addr + RTE_PKTMBUF_HEADROOM);
-		sge->length = (buf->buf_len - RTE_PKTMBUF_HEADROOM);
-		sge->lkey = rxq->mr->lkey;
-		/* Redundant check for tailroom. */
-		assert(sge->length == rte_pktmbuf_tailroom(buf));
+		/* Only the first segment keeps headroom. */
+		if (i % sges_n)
+			buf->data_off = 0;
+		buf->port = rxq->port_id;
+		buf->data_len = rte_pktmbuf_tailroom(buf);
+		buf->pkt_len = rte_pktmbuf_tailroom(buf);
+		buf->nb_segs = 1;
+		*scat = (struct mlx4_wqe_data_seg){
+			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
+								  uintptr_t)),
+			.byte_count = rte_cpu_to_be_32(buf->data_len),
+			.lkey = rte_cpu_to_be_32(rxq->mr->lkey),
+		};
+		(*elts)[i] = buf;
 	}
-	/* The last WR pointer must be NULL. */
-	(*elts)[(i - 1)].wr.next = NULL;
+	DEBUG("%p: allocated and configured %u segments (max %u packets)",
+	      (void *)rxq, elts_n, elts_n / sges_n);
 	return 0;
 }
 
@@ -364,14 +366,14 @@ static void
 mlx4_rxq_free_elts(struct rxq *rxq)
 {
 	unsigned int i;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
+	struct rte_mbuf *(*elts)[1 << rxq->elts_n] = rxq->elts;
 
-	DEBUG("%p: freeing WRs", (void *)rxq);
+	DEBUG("%p: freeing Rx queue elements", (void *)rxq);
 	for (i = 0; (i != RTE_DIM(*elts)); ++i) {
-		if (!(*elts)[i].buf)
+		if (!(*elts)[i])
 			continue;
-		rte_pktmbuf_free_seg((*elts)[i].buf);
-		(*elts)[i].buf = NULL;
+		rte_pktmbuf_free_seg((*elts)[i]);
+		(*elts)[i] = NULL;
 	}
 }
 
@@ -400,8 +402,11 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    struct rte_mempool *mp)
 {
 	struct priv *priv = dev->data->dev_private;
+	struct mlx4dv_obj mlxdv;
+	struct mlx4dv_rwq dv_rwq;
+	struct mlx4dv_cq dv_cq;
 	uint32_t mb_len = rte_pktmbuf_data_room_size(mp);
-	struct rxq_elt (*elts)[desc];
+	struct rte_mbuf *(*elts)[rte_align32pow2(desc)];
 	struct rte_flow_error error;
 	struct rxq *rxq;
 	struct mlx4_malloc_vec vec[] = {
@@ -439,6 +444,12 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		ERROR("%p: invalid number of Rx descriptors", (void *)dev);
 		return -rte_errno;
 	}
+	if (desc != RTE_DIM(*elts)) {
+		desc = RTE_DIM(*elts);
+		WARN("%p: increased number of descriptors in Rx queue %u"
+		     " to the next power of two (%u)",
+		     (void *)dev, idx, desc);
+	}
 	/* Allocate and initialize Rx queue. */
 	mlx4_zmallocv_socket("RXQ", vec, RTE_DIM(vec), socket);
 	if (!rxq) {
@@ -450,8 +461,8 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		.priv = priv,
 		.mp = mp,
 		.port_id = dev->data->port_id,
-		.elts_n = desc,
-		.elts_head = 0,
+		.sges_n = 0,
+		.elts_n = rte_log2_u32(desc),
 		.elts = elts,
 		.stats.idx = idx,
 		.socket = socket,
@@ -462,9 +473,29 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
 		;
 	} else if (dev->data->dev_conf.rxmode.enable_scatter) {
-		WARN("%p: scattered mode has been requested but is"
-		     " not supported, this may lead to packet loss",
-		     (void *)dev);
+		uint32_t size =
+			RTE_PKTMBUF_HEADROOM +
+			dev->data->dev_conf.rxmode.max_rx_pkt_len;
+		uint32_t sges_n;
+
+		/*
+		 * Determine the number of SGEs needed for a full packet
+		 * and round it to the next power of two.
+		 */
+		sges_n = rte_log2_u32((size / mb_len) + !!(size % mb_len));
+		rxq->sges_n = sges_n;
+		/* Make sure sges_n did not overflow. */
+		size = mb_len * (1 << rxq->sges_n);
+		size -= RTE_PKTMBUF_HEADROOM;
+		if (size < dev->data->dev_conf.rxmode.max_rx_pkt_len) {
+			rte_errno = EOVERFLOW;
+			ERROR("%p: too many SGEs (%u) needed to handle"
+			      " requested maximum packet size %u",
+			      (void *)dev,
+			      1 << sges_n,
+			      dev->data->dev_conf.rxmode.max_rx_pkt_len);
+			goto error;
+		}
 	} else {
 		WARN("%p: the requested maximum Rx packet size (%u) is"
 		     " larger than a single mbuf (%u) and scattered"
@@ -473,6 +504,17 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		     dev->data->dev_conf.rxmode.max_rx_pkt_len,
 		     mb_len - RTE_PKTMBUF_HEADROOM);
 	}
+	DEBUG("%p: maximum number of segments per packet: %u",
+	      (void *)dev, 1 << rxq->sges_n);
+	if (desc % (1 << rxq->sges_n)) {
+		rte_errno = EINVAL;
+		ERROR("%p: number of Rx queue descriptors (%u) is not a"
+		      " multiple of maximum segments per packet (%u)",
+		      (void *)dev,
+		      desc,
+		      1 << rxq->sges_n);
+		goto error;
+	}
 	/* Use the entire Rx mempool as the memory region. */
 	rxq->mr = mlx4_mp2mr(priv->pd, mp);
 	if (!rxq->mr) {
@@ -497,7 +539,8 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			goto error;
 		}
 	}
-	rxq->cq = ibv_create_cq(priv->ctx, desc, NULL, rxq->channel, 0);
+	rxq->cq = ibv_create_cq(priv->ctx, desc >> rxq->sges_n, NULL,
+				rxq->channel, 0);
 	if (!rxq->cq) {
 		rte_errno = ENOMEM;
 		ERROR("%p: CQ creation failure: %s",
@@ -508,8 +551,8 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		(priv->ctx,
 		 &(struct ibv_wq_init_attr){
 			.wq_type = IBV_WQT_RQ,
-			.max_wr = RTE_MIN(priv->device_attr.max_qp_wr, desc),
-			.max_sge = 1,
+			.max_wr = desc >> rxq->sges_n,
+			.max_sge = 1 << rxq->sges_n,
 			.pd = priv->pd,
 			.cq = rxq->cq,
 		 });
@@ -531,27 +574,43 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
-	ret = mlx4_rxq_alloc_elts(rxq);
+	/* Retrieve device queue information. */
+	mlxdv.cq.in = rxq->cq;
+	mlxdv.cq.out = &dv_cq;
+	mlxdv.rwq.in = rxq->wq;
+	mlxdv.rwq.out = &dv_rwq;
+	ret = mlx4dv_init_obj(&mlxdv, MLX4DV_OBJ_RWQ | MLX4DV_OBJ_CQ);
 	if (ret) {
-		ERROR("%p: RXQ allocation failed: %s",
-		      (void *)dev, strerror(rte_errno));
+		rte_errno = EINVAL;
+		ERROR("%p: failed to obtain device information", (void *)dev);
 		goto error;
 	}
-	ret = ibv_post_wq_recv(rxq->wq, &(*rxq->elts)[0].wr,
-			       &(struct ibv_recv_wr *){ NULL });
+	rxq->wqes =
+		(volatile struct mlx4_wqe_data_seg (*)[])
+		((uintptr_t)dv_rwq.buf.buf + dv_rwq.rq.offset);
+	rxq->rq_db = dv_rwq.rdb;
+	rxq->rq_ci = 0;
+	rxq->mcq.buf = dv_cq.buf.buf;
+	rxq->mcq.cqe_cnt = dv_cq.cqe_cnt;
+	rxq->mcq.set_ci_db = dv_cq.set_ci_db;
+	rxq->mcq.cqe_64 = (dv_cq.cqe_size & 64) ? 1 : 0;
+	ret = mlx4_rxq_alloc_elts(rxq);
 	if (ret) {
-		rte_errno = ret;
-		ERROR("%p: ibv_post_recv() failed: %s",
-		      (void *)dev,
-		      strerror(rte_errno));
+		ERROR("%p: RXQ allocation failed: %s",
+		      (void *)dev, strerror(rte_errno));
 		goto error;
 	}
 	DEBUG("%p: adding Rx queue %p to list", (void *)dev, (void *)rxq);
 	dev->data->rx_queues[idx] = rxq;
 	/* Enable associated flows. */
 	ret = mlx4_flow_sync(priv, &error);
-	if (!ret)
+	if (!ret) {
+		/* Update doorbell counter. */
+		rxq->rq_ci = desc >> rxq->sges_n;
+		rte_wmb();
+		*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 		return 0;
+	}
 	ERROR("cannot re-attach flow rules to queue %u"
 	      " (code %d, \"%s\"), flow error type %d, cause %p, message: %s",
 	      idx, -ret, strerror(-ret), error.type, error.cause,
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 38b87a0..cc0baaa 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -538,9 +538,44 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
- * DPDK callback for Rx.
+ * Poll one CQE from CQ.
  *
- * The following function doesn't manage scattered packets.
+ * @param rxq
+ *   Pointer to the receive queue structure.
+ * @param[out] out
+ *   Just polled CQE.
+ *
+ * @return
+ *   Number of bytes of the CQE, 0 in case there is no completion.
+ */
+static unsigned int
+mlx4_cq_poll_one(struct rxq *rxq, struct mlx4_cqe **out)
+{
+	int ret = 0;
+	struct mlx4_cqe *cqe = NULL;
+	struct mlx4_cq *cq = &rxq->mcq;
+
+	cqe = (struct mlx4_cqe *)mlx4_get_cqe(cq, cq->cons_index);
+	if (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
+	    !!(cq->cons_index & cq->cqe_cnt))
+		goto out;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rte_rmb();
+	assert(!(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK));
+	assert((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) !=
+	       MLX4_CQE_OPCODE_ERROR);
+	ret = rte_be_to_cpu_32(cqe->byte_cnt);
+	++cq->cons_index;
+out:
+	*out = cqe;
+	return ret;
+}
+
+/**
+ * DPDK callback for Rx with scattered packets support.
  *
  * @param dpdk_rxq
  *   Generic pointer to Rx queue structure.
@@ -555,112 +590,107 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 uint16_t
 mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
-	struct rxq *rxq = (struct rxq *)dpdk_rxq;
-	struct rxq_elt (*elts)[rxq->elts_n] = rxq->elts;
-	const unsigned int elts_n = rxq->elts_n;
-	unsigned int elts_head = rxq->elts_head;
-	struct ibv_wc wcs[pkts_n];
-	struct ibv_recv_wr *wr_head = NULL;
-	struct ibv_recv_wr **wr_next = &wr_head;
-	struct ibv_recv_wr *wr_bad = NULL;
-	unsigned int i;
-	unsigned int pkts_ret = 0;
-	int ret;
+	struct rxq *rxq = dpdk_rxq;
+	const uint32_t wr_cnt = (1 << rxq->elts_n) - 1;
+	const uint16_t sges_n = rxq->sges_n;
+	struct rte_mbuf *pkt = NULL;
+	struct rte_mbuf *seg = NULL;
+	unsigned int i = 0;
+	uint32_t rq_ci = rxq->rq_ci << sges_n;
+	int len = 0;
 
-	ret = ibv_poll_cq(rxq->cq, pkts_n, wcs);
-	if (unlikely(ret == 0))
-		return 0;
-	if (unlikely(ret < 0)) {
-		DEBUG("rxq=%p, ibv_poll_cq() failed (wc_n=%d)",
-		      (void *)rxq, ret);
-		return 0;
-	}
-	assert(ret <= (int)pkts_n);
-	/* For each work completion. */
-	for (i = 0; i != (unsigned int)ret; ++i) {
-		struct ibv_wc *wc = &wcs[i];
-		struct rxq_elt *elt = &(*elts)[elts_head];
-		struct ibv_recv_wr *wr = &elt->wr;
-		uint32_t len = wc->byte_len;
-		struct rte_mbuf *seg = elt->buf;
-		struct rte_mbuf *rep;
+	while (pkts_n) {
+		struct mlx4_cqe *cqe;
+		uint32_t idx = rq_ci & wr_cnt;
+		struct rte_mbuf *rep = (*rxq->elts)[idx];
+		volatile struct mlx4_wqe_data_seg *scat = &(*rxq->wqes)[idx];
 
-		/* Sanity checks. */
-		assert(wr->sg_list == &elt->sge);
-		assert(wr->num_sge == 1);
-		assert(elts_head < rxq->elts_n);
-		assert(rxq->elts_head < rxq->elts_n);
-		/*
-		 * Fetch initial bytes of packet descriptor into a
-		 * cacheline while allocating rep.
-		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
-		/* Link completed WRs together for repost. */
-		*wr_next = wr;
-		wr_next = &wr->next;
-		if (unlikely(wc->status != IBV_WC_SUCCESS)) {
-			/* Whatever, just repost the offending WR. */
-			DEBUG("rxq=%p: bad work completion status (%d): %s",
-			      (void *)rxq, wc->status,
-			      ibv_wc_status_str(wc->status));
-			/* Increment dropped packets counter. */
-			++rxq->stats.idropped;
-			goto repost;
-		}
+		/* Update the 'next' pointer of the previous segment. */
+		if (pkt)
+			seg->next = rep;
+		seg = rep;
+		rte_prefetch0(seg);
+		rte_prefetch0(scat);
 		rep = rte_mbuf_raw_alloc(rxq->mp);
 		if (unlikely(rep == NULL)) {
-			/*
-			 * Unable to allocate a replacement mbuf,
-			 * repost WR.
-			 */
-			DEBUG("rxq=%p: can't allocate a new mbuf",
-			      (void *)rxq);
-			/* Increase out of memory counters. */
 			++rxq->stats.rx_nombuf;
-			++rxq->priv->dev->data->rx_mbuf_alloc_failed;
-			goto repost;
+			if (!pkt) {
+				/*
+				 * No buffers before we even started,
+				 * bail out silently.
+				 */
+				break;
+			}
+			while (pkt != seg) {
+				assert(pkt != (*rxq->elts)[idx]);
+				rep = pkt->next;
+				pkt->next = NULL;
+				pkt->nb_segs = 1;
+				rte_mbuf_raw_free(pkt);
+				pkt = rep;
+			}
+			break;
+		}
+		if (!pkt) {
+			/* Looking for the new packet. */
+			len = mlx4_cq_poll_one(rxq, &cqe);
+			if (!len) {
+				rte_mbuf_raw_free(rep);
+				break;
+			}
+			if (unlikely(len < 0)) {
+				/* Rx error, packet is likely too large. */
+				rte_mbuf_raw_free(rep);
+				++rxq->stats.idropped;
+				goto skip;
+			}
+			pkt = seg;
+			pkt->packet_type = 0;
+			pkt->ol_flags = 0;
+			pkt->pkt_len = len;
+		}
+		rep->nb_segs = 1;
+		rep->port = rxq->port_id;
+		rep->data_len = seg->data_len;
+		rep->data_off = seg->data_off;
+		(*rxq->elts)[idx] = rep;
+		/*
+		 * Fill NIC descriptor with the new buffer. The lkey and size
+		 * of the buffers are already known, only the buffer address
+		 * changes.
+		 */
+		scat->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
+		if (len > seg->data_len) {
+			len -= seg->data_len;
+			++pkt->nb_segs;
+			++rq_ci;
+			continue;
 		}
-		/* Reconfigure sge to use rep instead of seg. */
-		elt->sge.addr = (uintptr_t)rep->buf_addr + RTE_PKTMBUF_HEADROOM;
-		assert(elt->sge.lkey == rxq->mr->lkey);
-		elt->buf = rep;
-		/* Update seg information. */
-		seg->data_off = RTE_PKTMBUF_HEADROOM;
-		seg->nb_segs = 1;
-		seg->port = rxq->port_id;
-		seg->next = NULL;
-		seg->pkt_len = len;
+		/* The last segment. */
 		seg->data_len = len;
-		seg->packet_type = 0;
-		seg->ol_flags = 0;
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += pkt->pkt_len;
 		/* Return packet. */
-		*(pkts++) = seg;
-		++pkts_ret;
-		/* Increase bytes counter. */
-		rxq->stats.ibytes += len;
-repost:
-		if (++elts_head >= elts_n)
-			elts_head = 0;
-		continue;
+		*(pkts++) = pkt;
+		pkt = NULL;
+		--pkts_n;
+		++i;
+skip:
+		/* Align consumer index to the next stride. */
+		rq_ci >>= sges_n;
+		++rq_ci;
+		rq_ci <<= sges_n;
 	}
-	if (unlikely(i == 0))
+	if (unlikely(i == 0 && (rq_ci >> sges_n) == rxq->rq_ci))
 		return 0;
-	/* Repost WRs. */
-	*wr_next = NULL;
-	assert(wr_head);
-	ret = ibv_post_wq_recv(rxq->wq, wr_head, &wr_bad);
-	if (unlikely(ret)) {
-		/* Inability to repost WRs is fatal. */
-		DEBUG("%p: recv_burst(): failed (ret=%d)",
-		      (void *)rxq->priv,
-		      ret);
-		abort();
-	}
-	rxq->elts_head = elts_head;
-	/* Increase packets counter. */
-	rxq->stats.ipackets += pkts_ret;
-	return pkts_ret;
+	/* Update the consumer index. */
+	rxq->rq_ci = rq_ci >> sges_n;
+	rte_wmb();
+	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
+	*rxq->mcq.set_ci_db = rte_cpu_to_be_32(rxq->mcq.cons_index & 0xffffff);
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+	return i;
 }
 
 /**
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index ff27126..fa5738f 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -63,13 +63,6 @@ struct mlx4_rxq_stats {
 	uint64_t rx_nombuf; /**< Total of Rx mbuf allocation failures. */
 };
 
-/** Rx element. */
-struct rxq_elt {
-	struct ibv_recv_wr wr; /**< Work request. */
-	struct ibv_sge sge; /**< Scatter/gather element. */
-	struct rte_mbuf *buf; /**< Buffer. */
-};
-
 /** Rx queue descriptor. */
 struct rxq {
 	struct priv *priv; /**< Back pointer to private data. */
@@ -78,10 +71,14 @@ struct rxq {
 	struct ibv_cq *cq; /**< Completion queue. */
 	struct ibv_wq *wq; /**< Work queue. */
 	struct ibv_comp_channel *channel; /**< Rx completion channel. */
-	unsigned int port_id; /**< Port ID for incoming packets. */
-	unsigned int elts_n; /**< (*elts)[] length. */
-	unsigned int elts_head; /**< Current index in (*elts)[]. */
-	struct rxq_elt (*elts)[]; /**< Rx elements. */
+	uint16_t rq_ci; /**< Saved RQ consumer index. */
+	uint16_t port_id; /**< Port ID for incoming packets. */
+	uint16_t sges_n; /**< Number of segments per packet (log2 value). */
+	uint16_t elts_n; /**< Mbuf queue size (log2 value). */
+	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
+	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
+	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
 	uint8_t data[]; /**< Remaining queue resources. */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 3/5] net/mlx4: restore Tx checksum offloads
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 2/5] net/mlx4: add Rx " Adrien Mazarguil
@ 2017-10-12 12:29           ` Adrien Mazarguil
  2017-10-12 12:29           ` [PATCH v6 4/5] net/mlx4: restore Rx offloads Adrien Mazarguil
                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-12 12:29 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
calculation, including inner/outer checksums on supported tunnel types.
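
As a reader aid (not part of this patch): a hypothetical application-side
sketch of the per-packet requests that mlx4_post_send() translates into the
WQE checksum flags added below. The helper name and the plain IPv4/TCP
layout are illustrative assumptions only.

#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

/* Sketch: ask the PMD to compute IPv4 and TCP checksums for one mbuf. */
static void
app_request_tx_csum(struct rte_mbuf *m)
{
        /* The mbuf API expects the header lengths alongside the flags. */
        m->l2_len = sizeof(struct ether_hdr);
        m->l3_len = sizeof(struct ipv4_hdr);
        /* Mapped below to MLX4_WQE_CTRL_IP_HDR_CSUM and
         * MLX4_WQE_CTRL_TCP_UDP_CSUM in the send WQE. */
        m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
}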

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 doc/guides/nics/features/mlx4.ini |  4 ++++
 drivers/net/mlx4/mlx4.c           | 11 +++++++++++
 drivers/net/mlx4/mlx4.h           |  2 ++
 drivers/net/mlx4/mlx4_ethdev.c    |  6 ++++++
 drivers/net/mlx4/mlx4_prm.h       |  2 ++
 drivers/net/mlx4/mlx4_rxtx.c      | 19 +++++++++++++++++++
 drivers/net/mlx4/mlx4_rxtx.h      |  2 ++
 drivers/net/mlx4/mlx4_txq.c       |  2 ++
 8 files changed, 48 insertions(+)

diff --git a/doc/guides/nics/features/mlx4.ini b/doc/guides/nics/features/mlx4.ini
index 19ae688..366e051 100644
--- a/doc/guides/nics/features/mlx4.ini
+++ b/doc/guides/nics/features/mlx4.ini
@@ -20,6 +20,10 @@ Multicast MAC filter = Y
 RSS hash             = Y
 SR-IOV               = Y
 VLAN filter          = Y
+L3 checksum offload  = Y
+L4 checksum offload  = Y
+Inner L3 checksum    = Y
+Inner L4 checksum    = Y
 Basic stats          = Y
 Stats per queue      = Y
 Other kdrv           = Y
diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index 0db9a19..a297b9a 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -566,6 +566,17 @@ mlx4_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 		priv->pd = pd;
 		priv->mtu = ETHER_MTU;
 		priv->vf = vf;
+		priv->hw_csum =	!!(device_attr.device_cap_flags &
+				   IBV_DEVICE_RAW_IP_CSUM);
+		DEBUG("checksum offloading is %ssupported",
+		      (priv->hw_csum ? "" : "not "));
+		/* Only ConnectX-3 Pro supports tunneling. */
+		priv->hw_csum_l2tun =
+			priv->hw_csum &&
+			(device_attr.vendor_part_id ==
+			 PCI_DEVICE_ID_MELLANOX_CONNECTX3PRO);
+		DEBUG("L2 tunnel checksum offloads are %ssupported",
+		      (priv->hw_csum_l2tun ? "" : "not "));
 		/* Configure the first MAC address by default. */
 		if (mlx4_get_mac(priv, &mac.addr_bytes)) {
 			ERROR("cannot get MAC address, is mlx4_en loaded?"
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index f4da8c6..e0a9853 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -113,6 +113,8 @@ struct priv {
 	uint32_t vf:1; /**< This is a VF device. */
 	uint32_t intr_alarm:1; /**< An interrupt alarm is scheduled. */
 	uint32_t isolated:1; /**< Toggle isolated mode. */
+	uint32_t hw_csum:1; /* Checksum offload is supported. */
+	uint32_t hw_csum_l2tun:1; /* Checksum support for L2 tunnels. */
 	struct rte_intr_handle intr_handle; /**< Port interrupt handle. */
 	struct mlx4_drop *drop; /**< Shared resources for drop flow rules. */
 	LIST_HEAD(, mlx4_rss) rss; /**< Shared targets for Rx flow rules. */
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index 3623909..a8c0ee2 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -767,6 +767,12 @@ mlx4_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->max_mac_addrs = RTE_DIM(priv->mac);
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
+	if (priv->hw_csum)
+		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
+					  DEV_TX_OFFLOAD_UDP_CKSUM |
+					  DEV_TX_OFFLOAD_TCP_CKSUM);
+	if (priv->hw_csum_l2tun)
+		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
 		info->if_index = if_nametoindex(ifname);
 	info->hash_key_size = MLX4_RSS_HASH_KEY_SIZE;
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index 085a595..df5a6b4 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -64,6 +64,8 @@
 
 /* Work queue element (WQE) flags. */
 #define MLX4_BIT_WQE_OWN 0x80000000
+#define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
+#define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index cc0baaa..fe7d5d0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -431,6 +431,25 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	} else {
 		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
+	/* Enable HW checksum offload if requested */
+	if (txq->csum &&
+	    (pkt->ol_flags &
+	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+		const uint64_t is_tunneled = (pkt->ol_flags &
+					      (PKT_TX_TUNNEL_GRE |
+					       PKT_TX_TUNNEL_VXLAN));
+
+		if (is_tunneled && txq->csum_l2tun) {
+			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+					MLX4_WQE_CTRL_IL4_HDR_CSUM;
+			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				srcrb_flags |=
+					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
+		} else {
+			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+						MLX4_WQE_CTRL_TCP_UDP_CSUM);
+		}
+	}
 	ctrl->srcrb_flags = srcrb_flags;
 	/*
 	 * Make sure descriptor is fully written before
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index fa5738f..6c88efb 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -124,6 +124,8 @@ struct txq {
 	struct txq_elt (*elts)[]; /**< Tx elements. */
 	struct mlx4_txq_stats stats; /**< Tx queue counters. */
 	uint32_t max_inline; /**< Max inline send size. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index fbb028a..0e27df2 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -276,6 +276,8 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
 		.elts_comp_cd_init =
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
+		.csum = priv->hw_csum,
+		.csum_l2tun = priv->hw_csum_l2tun,
 		.bounce_buf = bounce_buf,
 	};
 	txq->cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 4/5] net/mlx4: restore Rx offloads
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
                             ` (2 preceding siblings ...)
  2017-10-12 12:29           ` [PATCH v6 3/5] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
@ 2017-10-12 12:29           ` Adrien Mazarguil
  2017-10-12 12:30           ` [PATCH v6 5/5] net/mlx4: add loopback Tx from VF Adrien Mazarguil
                             ` (2 subsequent siblings)
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-12 12:29 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky, Vasily Philipov

From: Moti Haimovsky <motih@mellanox.com>

This patch adds hardware offloading support for IPv4, UDP and TCP checksum
verification, including inner/outer checksums on supported tunnel types.

It also restores packet type recognition support.
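
As a reader aid (not part of this patch): a hypothetical application-side
sketch of how the restored information is consumed on received mbufs. It
assumes hw_ip_checksum was enabled in the Rx mode configuration; the helper
name is an illustrative assumption.

#include <rte_mbuf.h>

/* Sketch: accept only packets whose checksums were validated by the NIC. */
static int
app_rx_pkt_ok(const struct rte_mbuf *m)
{
        if (!(m->ol_flags & PKT_RX_IP_CKSUM_GOOD) ||
            !(m->ol_flags & PKT_RX_L4_CKSUM_GOOD))
                return 0;
        /* Restored packet type recognition, e.g. to fast-path IPv4. */
        return RTE_ETH_IS_IPV4_HDR(m->packet_type);
}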

Signed-off-by: Vasily Philipov <vasilyf@mellanox.com>
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 doc/guides/nics/features/mlx4.ini |   1 +
 drivers/net/mlx4/mlx4_ethdev.c    |   6 +-
 drivers/net/mlx4/mlx4_prm.h       |  29 ++++++++
 drivers/net/mlx4/mlx4_rxq.c       |   5 ++
 drivers/net/mlx4/mlx4_rxtx.c      | 118 ++++++++++++++++++++++++++++++++-
 drivers/net/mlx4/mlx4_rxtx.h      |   2 +
 6 files changed, 158 insertions(+), 3 deletions(-)

diff --git a/doc/guides/nics/features/mlx4.ini b/doc/guides/nics/features/mlx4.ini
index 366e051..f6efd21 100644
--- a/doc/guides/nics/features/mlx4.ini
+++ b/doc/guides/nics/features/mlx4.ini
@@ -24,6 +24,7 @@ L3 checksum offload  = Y
 L4 checksum offload  = Y
 Inner L3 checksum    = Y
 Inner L4 checksum    = Y
+Packet type parsing  = Y
 Basic stats          = Y
 Stats per queue      = Y
 Other kdrv           = Y
diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c
index a8c0ee2..ca2170e 100644
--- a/drivers/net/mlx4/mlx4_ethdev.c
+++ b/drivers/net/mlx4/mlx4_ethdev.c
@@ -767,10 +767,14 @@ mlx4_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->max_mac_addrs = RTE_DIM(priv->mac);
 	info->rx_offload_capa = 0;
 	info->tx_offload_capa = 0;
-	if (priv->hw_csum)
+	if (priv->hw_csum) {
 		info->tx_offload_capa |= (DEV_TX_OFFLOAD_IPV4_CKSUM |
 					  DEV_TX_OFFLOAD_UDP_CKSUM |
 					  DEV_TX_OFFLOAD_TCP_CKSUM);
+		info->rx_offload_capa |= (DEV_RX_OFFLOAD_IPV4_CKSUM |
+					  DEV_RX_OFFLOAD_UDP_CKSUM |
+					  DEV_RX_OFFLOAD_TCP_CKSUM);
+	}
 	if (priv->hw_csum_l2tun)
 		info->tx_offload_capa |= DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
 	if (mlx4_get_ifname(priv, &ifname) == 0)
diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index df5a6b4..3a77502 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -70,6 +70,14 @@
 #define MLX4_SIZE_TO_TXBBS(size) \
 	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
 
+/* CQE checksum flags. */
+enum {
+	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
+	MLX4_CQE_L2_TUNNEL_L4_CSUM = (int)(1u << 26),
+	MLX4_CQE_L2_TUNNEL = (int)(1u << 27),
+	MLX4_CQE_L2_TUNNEL_IPOK = (int)(1u << 31),
+};
+
 /* Send queue information. */
 struct mlx4_sq {
 	uint8_t *buf; /**< SQ buffer. */
@@ -119,4 +127,25 @@ mlx4_get_cqe(struct mlx4_cq *cq, uint32_t index)
 				   (cq->cqe_64 << 5));
 }
 
+/**
+ * Transpose a flag in a value.
+ *
+ * @param val
+ *   Input value.
+ * @param from
+ *   Flag to retrieve from input value.
+ * @param to
+ *   Flag to set in output value.
+ *
+ * @return
+ *   Output value with transposed flag enabled if present on input.
+ */
+static inline uint64_t
+mlx4_transpose(uint64_t val, uint64_t from, uint64_t to)
+{
+	return (from >= to ?
+		(val & from) / (from / to) :
+		(val & from) * (to / from));
+}
+
 #endif /* MLX4_PRM_H_ */
diff --git a/drivers/net/mlx4/mlx4_rxq.c b/drivers/net/mlx4/mlx4_rxq.c
index 39c83bc..7ce5b26 100644
--- a/drivers/net/mlx4/mlx4_rxq.c
+++ b/drivers/net/mlx4/mlx4_rxq.c
@@ -464,6 +464,11 @@ mlx4_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		.sges_n = 0,
 		.elts_n = rte_log2_u32(desc),
 		.elts = elts,
+		/* Toggle Rx checksum offload if hardware supports it. */
+		.csum = (priv->hw_csum &&
+			 dev->data->dev_conf.rxmode.hw_ip_checksum),
+		.csum_l2tun = (priv->hw_csum_l2tun &&
+			       dev->data->dev_conf.rxmode.hw_ip_checksum),
 		.stats.idx = idx,
 		.socket = socket,
 	};
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index fe7d5d0..87c5261 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -557,6 +557,107 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
+ * Translate Rx completion flags to packet type.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ *
+ * @return
+ *   Packet type in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_pkt_type(uint32_t flags)
+{
+	uint32_t pkt_type;
+
+	if (flags & MLX4_CQE_L2_TUNNEL)
+		pkt_type =
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPV4,
+				       RTE_PTYPE_L3_IPV4_EXT_UNKNOWN) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IPV4_PKT,
+				       RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN);
+	else
+		pkt_type = mlx4_transpose(flags,
+					  MLX4_CQE_STATUS_IPV4_PKT,
+					  RTE_PTYPE_L3_IPV4_EXT_UNKNOWN);
+	return pkt_type;
+}
+
+/**
+ * Translate Rx completion flags to offload flags.
+ *
+ * @param flags
+ *   Rx completion flags returned by mlx4_cqe_flags().
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   Offload flags (ol_flags) in mbuf format.
+ */
+static inline uint32_t
+rxq_cq_to_ol_flags(uint32_t flags, int csum, int csum_l2tun)
+{
+	uint32_t ol_flags = 0;
+
+	if (csum)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_IP_HDR_CSUM_OK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_STATUS_TCP_UDP_CSUM_OK,
+				       PKT_RX_L4_CKSUM_GOOD);
+	if ((flags & MLX4_CQE_L2_TUNNEL) && csum_l2tun)
+		ol_flags |=
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_IPOK,
+				       PKT_RX_IP_CKSUM_GOOD) |
+			mlx4_transpose(flags,
+				       MLX4_CQE_L2_TUNNEL_L4_CSUM,
+				       PKT_RX_L4_CKSUM_GOOD);
+	return ol_flags;
+}
+
+/**
+ * Extract checksum information from CQE flags.
+ *
+ * @param cqe
+ *   Pointer to CQE structure.
+ * @param csum
+ *   Whether Rx checksums are enabled.
+ * @param csum_l2tun
+ *   Whether Rx L2 tunnel checksums are enabled.
+ *
+ * @return
+ *   CQE checksum information.
+ */
+static inline uint32_t
+mlx4_cqe_flags(struct mlx4_cqe *cqe, int csum, int csum_l2tun)
+{
+	uint32_t flags = 0;
+
+	/*
+	 * The relevant bits are in different locations on their
+	 * CQE fields therefore we can join them in one 32bit
+	 * variable.
+	 */
+	if (csum)
+		flags = (rte_be_to_cpu_32(cqe->status) &
+			 MLX4_CQE_STATUS_IPV4_CSUM_OK);
+	if (csum_l2tun)
+		flags |= (rte_be_to_cpu_32(cqe->vlan_my_qpn) &
+			  (MLX4_CQE_L2_TUNNEL |
+			   MLX4_CQE_L2_TUNNEL_IPOK |
+			   MLX4_CQE_L2_TUNNEL_L4_CSUM |
+			   MLX4_CQE_L2_TUNNEL_IPV4));
+	return flags;
+}
+
+/**
  * Poll one CQE from CQ.
  *
  * @param rxq
@@ -664,8 +765,21 @@ mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				goto skip;
 			}
 			pkt = seg;
-			pkt->packet_type = 0;
-			pkt->ol_flags = 0;
+			if (rxq->csum | rxq->csum_l2tun) {
+				uint32_t flags =
+					mlx4_cqe_flags(cqe,
+						       rxq->csum,
+						       rxq->csum_l2tun);
+
+				pkt->ol_flags =
+					rxq_cq_to_ol_flags(flags,
+							   rxq->csum,
+							   rxq->csum_l2tun);
+				pkt->packet_type = rxq_cq_to_pkt_type(flags);
+			} else {
+				pkt->packet_type = 0;
+				pkt->ol_flags = 0;
+			}
 			pkt->pkt_len = len;
 		}
 		rep->nb_segs = 1;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 6c88efb..51af69c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -78,6 +78,8 @@ struct rxq {
 	struct rte_mbuf *(*elts)[]; /**< Rx elements. */
 	volatile struct mlx4_wqe_data_seg (*wqes)[]; /**< HW queue entries. */
 	volatile uint32_t *rq_db; /**< RQ doorbell record. */
+	uint32_t csum:1; /**< Enable checksum offloading. */
+	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
 	struct mlx4_cq mcq;  /**< Info for directly manipulating the CQ. */
 	struct mlx4_rxq_stats stats; /**< Rx queue counters. */
 	unsigned int socket; /**< CPU socket ID for allocations. */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 5/5] net/mlx4: add loopback Tx from VF
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
                             ` (3 preceding siblings ...)
  2017-10-12 12:29           ` [PATCH v6 4/5] net/mlx4: restore Rx offloads Adrien Mazarguil
@ 2017-10-12 12:30           ` Adrien Mazarguil
  2017-10-24  6:29           ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs gowrishankar muthukrishnan
  2017-10-24 16:59           ` Ferruh Yigit
  6 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-12 12:30 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

From: Moti Haimovsky <motih@mellanox.com>

This patch adds the loopback functionality used when the device is a VF, in
order to enable packet transmission between VFs and the PF.
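
As a side note (illustration only, not part of this patch): the byte-level
effect of the destination MAC copies introduced below can be sketched as
follows, with hypothetical parameter names; the actual code operates
directly on the WQE control segment.

#include <stdint.h>
#include <string.h>

/* d0..d5 denote the six destination MAC bytes at the start of the frame;
 * they are copied verbatim (no byte swapping) so the eSwitch can read them
 * back and loop the packet to the matching VF or to the PF. */
static void
sketch_loopback_mac(const uint8_t *frame, uint16_t *srcrb_flags16_0,
                    uint32_t *imm)
{
        memcpy(srcrb_flags16_0, frame, 2);      /* d0 d1 */
        memcpy(imm, frame + 2, 4);              /* d2 d3 d4 d5 */
}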

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 33 +++++++++++++++++++++------------
 drivers/net/mlx4/mlx4_rxtx.h |  1 +
 drivers/net/mlx4/mlx4_txq.c  |  2 ++
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 87c5261..36173ad 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -311,10 +311,13 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	struct mlx4_wqe_data_seg *dseg;
 	struct mlx4_sq *sq = &txq->msq;
 	struct rte_mbuf *buf;
+	union {
+		uint32_t flags;
+		uint16_t flags16[2];
+	} srcrb;
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	uint32_t lkey;
 	uintptr_t addr;
-	uint32_t srcrb_flags;
 	uint32_t owner_opcode = MLX4_OPCODE_SEND;
 	uint32_t byte_count;
 	int wqe_real_size;
@@ -414,22 +417,16 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 	/* Fill the control parameters for this packet. */
 	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
 	/*
-	 * The caller should prepare "imm" in advance in order to support
-	 * VF to VF communication (when the device is a virtual-function
-	 * device (VF)).
-	 */
-	ctrl->imm = 0;
-	/*
 	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
 	 * should be calculated.
 	 */
 	txq->elts_comp_cd -= nr_txbbs;
 	if (unlikely(txq->elts_comp_cd <= 0)) {
 		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 				       MLX4_WQE_CTRL_CQ_UPDATE);
 	} else {
-		srcrb_flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
+		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 	}
 	/* Enable HW checksum offload if requested */
 	if (txq->csum &&
@@ -443,14 +440,26 @@ mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
 			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
 					MLX4_WQE_CTRL_IL4_HDR_CSUM;
 			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb_flags |=
+				srcrb.flags |=
 					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
 		} else {
-			srcrb_flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+			srcrb.flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
 						MLX4_WQE_CTRL_TCP_UDP_CSUM);
 		}
 	}
-	ctrl->srcrb_flags = srcrb_flags;
+	if (txq->lb) {
+		/*
+		 * Copy destination MAC address to the WQE, this allows
+		 * loopback in eSwitch, so that VFs and PF can communicate
+		 * with each other.
+		 */
+		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
+		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
+						      sizeof(uint16_t)));
+	} else {
+		ctrl->imm = 0;
+	}
+	ctrl->srcrb_flags = srcrb.flags;
 	/*
 	 * Make sure descriptor is fully written before
 	 * setting ownership bit (because HW can start
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 51af69c..e10bbca 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -128,6 +128,7 @@ struct txq {
 	uint32_t max_inline; /**< Max inline send size. */
 	uint32_t csum:1; /**< Enable checksum offloading. */
 	uint32_t csum_l2tun:1; /**< Same for L2 tunnels. */
+	uint32_t lb:1; /**< Whether packets should be looped back by eSwitch. */
 	uint8_t *bounce_buf;
 	/**< Memory used for storing the first DWORD of data TXBBs. */
 	struct {
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 0e27df2..6d3dd78 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -278,6 +278,8 @@ mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			RTE_MIN(MLX4_PMD_TX_PER_COMP_REQ, desc / 4),
 		.csum = priv->hw_csum,
 		.csum_l2tun = priv->hw_csum_l2tun,
+		/* Enable Tx loopback for VF devices. */
+		.lb = !!priv->vf,
 		.bounce_buf = bounce_buf,
 	};
 	txq->cq = ibv_create_cq(priv->ctx, desc, NULL, NULL, 0);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
                             ` (4 preceding siblings ...)
  2017-10-12 12:30           ` [PATCH v6 5/5] net/mlx4: add loopback Tx from VF Adrien Mazarguil
@ 2017-10-24  6:29           ` gowrishankar muthukrishnan
  2017-10-24  8:49             ` gowrishankar muthukrishnan
  2017-10-24 16:59           ` Ferruh Yigit
  6 siblings, 1 reply; 61+ messages in thread
From: gowrishankar muthukrishnan @ 2017-10-24  6:29 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: Ferruh Yigit, dev, Matan Azrad, Ophir Munk, Moti Haimovsky,
	Pradeep Satyanarayana

Hi Adrien,
I am trying to compile the mlx4 (and later the mlx5) PMD on RHEL 7.4
(ppc64le) without Mellanox OFED, using current master (which includes the
patch series below).

When I do so, I hit the compile error below:

   dpdk/drivers/net/mlx4/mlx4.c:53:31: fatal error: infiniband/mlx4dv.h: 
No such file or directory
    #include <infiniband/mlx4dv.h>
                                  ^
   compilation terminated.
   make[6]: *** [mlx4.o] Error 1

I tried to find an RPM providing this Direct Verbs header, but I could not.
Could you advise whether I have to install any additional RPM on RHEL 7.4?
Note that I already have rdma-core and libibverbs installed.

Thanks,
Gowrishankar


On Thursday 12 October 2017 05:59 PM, Adrien Mazarguil wrote:
> Hopefully the last iteration for this series.
>
> v6 (Adrien):
> - Updated features documentation (mlx4.ini) in the relevant patches.
> - Rebased on the latest changes brought by RSS support v2 series.
>
> v5 (Ophir & Adrien):
> - Merged Rx scatter/Tx gather code back into individual Rx/Tx commits
>    for consistency due to a couple of issues with gather-less Tx.
> - Rebased on top of the latest mlx4 control path changes (RSS support).
>
> v4 (Ophir):
> - Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4:
>    restore full Rx support bypassing Verbs" commit
>
> v3 (Adrien):
> - Drop a few unrelated or unnecessary changes such as the removal of
>    MLX4_PMD_TX_MP_CACHE.
> - Move device checksum support detection code to its previous location.
> - Fix include guard in mlx4_prm.h.
> - Reorder #includes alphabetically.
> - Replace MLX4_TRANSPOSE() macro with documented inline function.
> - Remove extra spaces and blank lines.
> - Use uint8_t * instead of char * for buffers.
> - Replace mlx4_get_cqe() macro with a documented inline function.
> - Replace several unsigned int with uint32_t.
> - Add consistency to field names (sge_n => sges_n).
> - Make mbuf size checks in RX queue setup function similar to mlx5.
> - Update various comments.
> - Fix indentation.
> - Replace run-time endian conversion with static ones where possible.
> - Reorder fields in struct rxq and struct txq for consistency, remove
>    one level of unnecessary inner structures.
> - Fix memory leak on Tx bounce buffer.
> - Update commit logs.
> - Fix remaining checkpatch warnings.
>
> v2 (Matan):
> Rearrange patches.
> Semantics.
> Enhancements.
> Fix compilation issues.
>
> Moti Haimovsky (5):
>    net/mlx4: add Tx bypassing Verbs
>    net/mlx4: add Rx bypassing Verbs
>    net/mlx4: restore Tx checksum offloads
>    net/mlx4: restore Rx offloads
>    net/mlx4: add loopback Tx from VF
>
>   doc/guides/nics/features/mlx4.ini |   6 +
>   drivers/net/mlx4/mlx4.c           |  11 +
>   drivers/net/mlx4/mlx4.h           |   2 +
>   drivers/net/mlx4/mlx4_ethdev.c    |  10 +
>   drivers/net/mlx4/mlx4_prm.h       | 151 +++++++
>   drivers/net/mlx4/mlx4_rxq.c       | 156 +++++--
>   drivers/net/mlx4/mlx4_rxtx.c      | 768 ++++++++++++++++++++++++---------
>   drivers/net/mlx4/mlx4_rxtx.h      |  54 +--
>   drivers/net/mlx4/mlx4_txq.c       |  63 +++
>   9 files changed, 948 insertions(+), 273 deletions(-)
>   create mode 100644 drivers/net/mlx4/mlx4_prm.h
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-24  6:29           ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs gowrishankar muthukrishnan
@ 2017-10-24  8:49             ` gowrishankar muthukrishnan
  2017-10-24  9:55               ` Nélio Laranjeiro
  0 siblings, 1 reply; 61+ messages in thread
From: gowrishankar muthukrishnan @ 2017-10-24  8:49 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: Ferruh Yigit, dev, Matan Azrad, Ophir Munk, Moti Haimovsky,
	Pradeep Satyanarayana

On Tuesday 24 October 2017 11:59 AM, gowrishankar muthukrishnan wrote:
> Hi Adrien,
> I am trying to compile the mlx4 (and later the mlx5) PMD on RHEL 7.4
> (ppc64le) without Mellanox OFED, using current master (which includes the
> patch series below).
>
> When I do so, I hit the compile error below:
>
>   dpdk/drivers/net/mlx4/mlx4.c:53:31: fatal error: 
> infiniband/mlx4dv.h: No such file or directory
>    #include <infiniband/mlx4dv.h>
>                                  ^
>   compilation terminated.
>   make[6]: *** [mlx4.o] Error 1
>
> I tried to find an RPM providing this Direct Verbs header, but I could not.
> Could you advise whether I have to install any additional RPM on RHEL 7.4?
> Note that I already have rdma-core and libibverbs installed.
>

I see we need at least rdma-core v15 for the mlx5 PMD (whereas the version
available in RHEL 7.4 is v13).

   https://patchwork.kernel.org/patch/9937201/

A similar dependency probably exists for mlx4 as well, so I think it is not
possible to compile it on RHEL 7.4 without upgrading the dependent RPMs.

> Thanks,
> Gowrishankar
>
>
<snip>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-24  8:49             ` gowrishankar muthukrishnan
@ 2017-10-24  9:55               ` Nélio Laranjeiro
  2017-10-24 10:01                 ` Adrien Mazarguil
  0 siblings, 1 reply; 61+ messages in thread
From: Nélio Laranjeiro @ 2017-10-24  9:55 UTC (permalink / raw)
  To: gowrishankar muthukrishnan
  Cc: Adrien Mazarguil, Ferruh Yigit, dev, Matan Azrad, Ophir Munk,
	Moti Haimovsky, Pradeep Satyanarayana

Hi,

On Tue, Oct 24, 2017 at 02:19:51PM +0530, gowrishankar muthukrishnan wrote:
> On Tuesday 24 October 2017 11:59 AM, gowrishankar muthukrishnan wrote:
> > Hi Adrien,
> > I am trying to compile the mlx4 (and later the mlx5) PMD on RHEL 7.4
> > (ppc64le) without Mellanox OFED, using current master (which includes
> > the patch series below).
> > 
> > When I do so, I hit the compile error below:
> > 
> >   dpdk/drivers/net/mlx4/mlx4.c:53:31: fatal error: infiniband/mlx4dv.h:
> > No such file or directory
> >    #include <infiniband/mlx4dv.h>
> >                                  ^
> >   compilation terminated.
> >   make[6]: *** [mlx4.o] Error 1
> > 
> > I tried to find an RPM providing this Direct Verbs header, but I could
> > not. Could you advise whether I have to install any additional RPM on
> > RHEL 7.4?
> > Note that I already have rdma-core and libibverbs installed.
> > 
> 
> > I see we need at least rdma-core v15 for the mlx5 PMD (whereas the
> > version available in RHEL 7.4 is v13).
> 
>   https://patchwork.kernel.org/patch/9937201/
> 
> > A similar dependency probably exists for mlx4 as well, so I think it is
> > not possible to compile it on RHEL 7.4 without upgrading the dependent
> > RPMs.

The procedure is described in the mlx5 documentation [1].
It should be the same for mlx4.

<snip>

Regards,

[1] http://dpdk.org/doc/guides/nics/mlx5.html#installation

-- 
Nélio Laranjeiro
6WIND

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-24  9:55               ` Nélio Laranjeiro
@ 2017-10-24 10:01                 ` Adrien Mazarguil
  0 siblings, 0 replies; 61+ messages in thread
From: Adrien Mazarguil @ 2017-10-24 10:01 UTC (permalink / raw)
  To: gowrishankar muthukrishnan, Nelio Laranjeiro
  Cc: Ferruh Yigit, dev, Matan Azrad, Ophir Munk, Moti Haimovsky,
	Pradeep Satyanarayana

On Tue, Oct 24, 2017 at 11:55:42AM +0200, Nélio Laranjeiro wrote:
> Hi,
> 
> On Tue, Oct 24, 2017 at 02:19:51PM +0530, gowrishankar muthukrishnan wrote:
> > On Tuesday 24 October 2017 11:59 AM, gowrishankar muthukrishnan wrote:
> > > Hi Adrien,
> > > I am trying to compile the mlx4 (and later the mlx5) PMD on RHEL 7.4
> > > (ppc64le) without Mellanox OFED, using current master (which includes
> > > the patch series below).
> > > 
> > > When I do so, I hit the compile error below:
> > > 
> > >   dpdk/drivers/net/mlx4/mlx4.c:53:31: fatal error: infiniband/mlx4dv.h:
> > > No such file or directory
> > >    #include <infiniband/mlx4dv.h>
> > >                                  ^
> > >   compilation terminated.
> > >   make[6]: *** [mlx4.o] Error 1
> > > 
> > > I tried to find an RPM providing this Direct Verbs header, but I
> > > could not. Could you advise whether I have to install any additional
> > > RPM on RHEL 7.4?
> > > Note that I already have rdma-core and libibverbs installed.
> > > 
> > 
> > I see we need at least rdma-core v15 for the mlx5 PMD (whereas the
> > version available in RHEL 7.4 is v13).
> > 
> >   https://patchwork.kernel.org/patch/9937201/
> > 
> > A similar dependency probably exists for mlx4 as well, so I think it is
> > not possible to compile it on RHEL 7.4 without upgrading the dependent
> > RPMs.
> 
> The procedure is described in the mlx5 documentation [1].
> It should be the same for mlx4.
> Regards,
<snip>
> 
> [1] http://dpdk.org/doc/guides/nics/mlx5.html#installation

Yes, I confirm the mlx4 documentation is not yet up to date regarding its
dependencies; it will be updated soon.

Thanks for the report.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs
  2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
                   ` (5 preceding siblings ...)
  2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
@ 2017-10-24 11:56 ` Nélio Laranjeiro
  6 siblings, 0 replies; 61+ messages in thread
From: Nélio Laranjeiro @ 2017-10-24 11:56 UTC (permalink / raw)
  To: Moti Haimovsky; +Cc: adrien.mazarguil, dev

Hi Moti,

Small comments/question to clarify the situation for any user of this
series.

On Thu, Aug 24, 2017 at 06:54:05PM +0300, Moti Haimovsky wrote:
> This series of patches implements the mlx4-pmd with Tx data path that directly
> access the device queues for transmitting packets, bypassing the ibverbs Tx
> data path altogether.
> Using this scheme allows the PMD to work with upstream rdma-core package
> instead of the Mellanox OFED one without sacrificing Tx functionality.
> 
> These patches should be applied in the order listed below as each depends on
> its predecessor to work.
> 
> This implementation allows rapid deployment of new features without the need to
> update the underlying OFED.

This explanation seems wrong: the PMD still relies on Verbs for
control-plane configuration, so new features will still require an update
of the underlying layers, i.e. rdma-core or MLNX_OFED >= 4.2 and the
associated Linux kernel Mellanox drivers.

> This work depends on
>         http://dpdk.org/ml/archives/dev/2017-August/072281.html
>         [dpdk-dev] [PATCH v1 00/48] net/mlx4: trim and refactor entire PMD
> by Adrien Mazarguil
> 
> It had been built and tested using rdma-core-15-1 from
>  https://github.com/linux-rdma/rdma-core
> and kernel-ml-4.12.0-1.el7.elrepo.x86_64
> 
> It had been built and tested using rdma-core-15-1 from
>  https://github.com/linux-rdma/rdma-core

This version (15-1) does not seem to exist in this repository, the latest
being v15.
Does this series compile and work with that version, or should the user
use some commit above v15?

> and kernel-ml-4.12.0-1.el7.elrepo.x86_64
> 
> Moti Haimovsky (5):
>   net/mlx4: add simple Tx bypassing ibverbs
>   net/mlx4: support multi-segments Tx
>   net/mlx4: refine setting Tx completion flag
>   net/mlx4: add Tx checksum offloads
>   net/mlx4: add loopback Tx from VF
> 
>  drivers/net/mlx4/mlx4.c        |   7 +
>  drivers/net/mlx4/mlx4.h        |   2 +
>  drivers/net/mlx4/mlx4_ethdev.c |   6 +
>  drivers/net/mlx4/mlx4_prm.h    | 249 ++++++++++++++++++++++
>  drivers/net/mlx4/mlx4_rxtx.c   | 456 +++++++++++++++++++++++++++++++++--------
>  drivers/net/mlx4/mlx4_rxtx.h   |  39 +++-
>  drivers/net/mlx4/mlx4_txq.c    |  66 +++++-
>  mk/rte.app.mk                  |   2 +-
>  8 files changed, 734 insertions(+), 93 deletions(-)
>  create mode 100644 drivers/net/mlx4/mlx4_prm.h
> 
> -- 
> 1.8.3.1

Thanks,

-- 
Nélio Laranjeiro
6WIND

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs
  2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
                             ` (5 preceding siblings ...)
  2017-10-24  6:29           ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs gowrishankar muthukrishnan
@ 2017-10-24 16:59           ` Ferruh Yigit
  6 siblings, 0 replies; 61+ messages in thread
From: Ferruh Yigit @ 2017-10-24 16:59 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Matan Azrad, Ophir Munk, Moti Haimovsky

On 10/12/2017 5:29 AM, Adrien Mazarguil wrote:
> Hopefully the last iteration for this series.
> 
> v6 (Adrien):
> - Updated features documentation (mlx4.ini) in the relevant patches.
> - Rebased on the latest changes brought by RSS support v2 series.
> 
> v5 (Ophir & Adrien):
> - Merged Rx scatter/Tx gather code back into individual Rx/Tx commits
>   for consistency due to a couple of issues with gather-less Tx.
> - Rebased on top of the latest mlx4 control path changes (RSS support).
> 
> v4 (Ophir):
> - Split "net/mlx4: restore Rx scatter support" commit from "net/mlx4:
>   restore full Rx support bypassing Verbs" commit
> 
> v3 (Adrien):
> - Drop a few unrelated or unnecessary changes such as the removal of
>   MLX4_PMD_TX_MP_CACHE.
> - Move device checksum support detection code to its previous location.
> - Fix include guard in mlx4_prm.h.
> - Reorder #includes alphabetically.
> - Replace MLX4_TRANSPOSE() macro with documented inline function.
> - Remove extra spaces and blank lines.
> - Use uint8_t * instead of char * for buffers.
> - Replace mlx4_get_cqe() macro with a documented inline function.
> - Replace several unsigned int with uint32_t.
> - Add consistency to field names (sge_n => sges_n).
> - Make mbuf size checks in RX queue setup function similar to mlx5.
> - Update various comments.
> - Fix indentation.
> - Replace run-time endian conversion with static ones where possible.
> - Reorder fields in struct rxq and struct txq for consistency, remove
>   one level of unnecessary inner structures.
> - Fix memory leak on Tx bounce buffer.
> - Update commit logs.
> - Fix remaining checkpatch warnings.
> 
> v2 (Matan):
> Rearrange patches.
> Semantics.
> Enhancements.
> Fix compilation issues.
> 
> Moti Haimovsky (5):
>   net/mlx4: add Tx bypassing Verbs
>   net/mlx4: add Rx bypassing Verbs
>   net/mlx4: restore Tx checksum offloads
>   net/mlx4: restore Rx offloads
>   net/mlx4: add loopback Tx from VF

Series applied to dpdk-next-net/master, thanks.

(The patches were applied a while ago, but it seems the notification email
was forgotten.)

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2017-10-24 16:59 UTC | newest]

Thread overview: 61+ messages
-- links below jump to the message on this page --
2017-08-24 15:54 [PATCH 0/5] new mlx4 Tx datapath bypassing ibverbs Moti Haimovsky
2017-08-24 15:54 ` [PATCH 1/5] net/mlx4: add simple Tx " Moti Haimovsky
2017-08-24 15:54 ` [PATCH 2/5] net/mlx4: support multi-segments Tx Moti Haimovsky
2017-08-24 15:54 ` [PATCH 3/5] net/mlx4: refine setting Tx completion flag Moti Haimovsky
2017-08-24 15:54 ` [PATCH 4/5] net/mlx4: add Tx checksum offloads Moti Haimovsky
2017-08-24 15:54 ` [PATCH 5/5] net/mlx4: add loopback Tx from VF Moti Haimovsky
2017-10-03 10:48 ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Matan Azrad
2017-10-03 10:48   ` [PATCH v2 1/6] net/mlx4: add simple Tx " Matan Azrad
2017-10-03 10:48   ` [PATCH v2 2/6] net/mlx4: get back Rx flow functionality Matan Azrad
2017-10-03 10:48   ` [PATCH v2 3/6] net/mlx4: support multi-segments Tx Matan Azrad
2017-10-03 10:48   ` [PATCH v2 4/6] net/mlx4: get back Tx checksum offloads Matan Azrad
2017-10-03 10:48   ` [PATCH v2 5/6] net/mlx4: get back Rx " Matan Azrad
2017-10-03 22:26     ` Ferruh Yigit
2017-10-03 10:48   ` [PATCH v2 6/6] net/mlx4: add loopback Tx from VF Matan Azrad
2017-10-03 22:27   ` [PATCH v2 0/6] new mlx4 datapath bypassing ibverbs Ferruh Yigit
2017-10-04 18:48   ` [PATCH v3 " Adrien Mazarguil
2017-10-04 18:48     ` [PATCH v3 1/6] net/mlx4: add simple Tx bypassing Verbs Adrien Mazarguil
2017-10-04 18:48     ` [PATCH v3 2/6] net/mlx4: restore full Rx support " Adrien Mazarguil
2017-10-04 18:48     ` [PATCH v3 3/6] net/mlx4: restore Tx gather support Adrien Mazarguil
2017-10-04 18:48     ` [PATCH v3 4/6] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
2017-10-04 18:48     ` [PATCH v3 5/6] net/mlx4: restore Rx offloads Adrien Mazarguil
2017-10-04 18:48     ` [PATCH v3 6/6] net/mlx4: add loopback Tx from VF Adrien Mazarguil
2017-10-05  9:33     ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Ophir Munk
2017-10-05  9:33       ` [PATCH v4 1/7] net/mlx4: add simple Tx bypassing Verbs Ophir Munk
2017-10-05  9:33       ` [PATCH v4 2/7] net/mlx4: restore full Rx support " Ophir Munk
2017-10-05  9:33       ` [PATCH v4 3/7] net/mlx4: restore Rx scatter support Ophir Munk
2017-10-05  9:33       ` [PATCH v4 4/7] net/mlx4: restore Tx gather support Ophir Munk
2017-10-05  9:33       ` [PATCH v4 5/7] net/mlx4: restore Tx checksum offloads Ophir Munk
2017-10-05  9:33       ` [PATCH v4 6/7] net/mlx4: restore Rx offloads Ophir Munk
2017-10-05  9:33       ` [PATCH v4 7/7] net/mlx4: add loopback Tx from VF Ophir Munk
2017-10-05 11:40       ` [PATCH v4 0/7] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
2017-10-05 18:48       ` Ferruh Yigit
2017-10-05 18:54         ` Ferruh Yigit
2017-10-11 18:31       ` [PATCH v5 0/5] " Adrien Mazarguil
2017-10-11 18:31         ` [PATCH v5 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
2017-10-11 18:31         ` [PATCH v5 2/5] net/mlx4: add Rx " Adrien Mazarguil
2017-10-11 18:32         ` [PATCH v5 3/5] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
2017-10-11 18:32         ` [PATCH v5 4/5] net/mlx4: restore Rx offloads Adrien Mazarguil
2017-10-11 18:32         ` [PATCH v5 5/5] net/mlx4: add loopback Tx from VF Adrien Mazarguil
2017-10-12 12:29         ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs Adrien Mazarguil
2017-10-12 12:29           ` [PATCH v6 1/5] net/mlx4: add Tx bypassing Verbs Adrien Mazarguil
2017-10-12 12:29           ` [PATCH v6 2/5] net/mlx4: add Rx " Adrien Mazarguil
2017-10-12 12:29           ` [PATCH v6 3/5] net/mlx4: restore Tx checksum offloads Adrien Mazarguil
2017-10-12 12:29           ` [PATCH v6 4/5] net/mlx4: restore Rx offloads Adrien Mazarguil
2017-10-12 12:30           ` [PATCH v6 5/5] net/mlx4: add loopback Tx from VF Adrien Mazarguil
2017-10-24  6:29           ` [PATCH v6 0/5] new mlx4 datapath bypassing ibverbs gowrishankar muthukrishnan
2017-10-24  8:49             ` gowrishankar muthukrishnan
2017-10-24  9:55               ` Nélio Laranjeiro
2017-10-24 10:01                 ` Adrien Mazarguil
2017-10-24 16:59           ` Ferruh Yigit
2017-10-04 21:48   ` [PATCH v3 0/7] " Ophir Munk
2017-10-04 21:49     ` [PATCH v3 1/7] net/mlx4: add simple Tx " Ophir Munk
2017-10-04 21:49     ` [PATCH v3 2/7] net/mlx4: get back Rx flow functionality Ophir Munk
2017-10-04 21:49     ` [PATCH v3 3/7] net/mlx4: support multi-segments Rx Ophir Munk
2017-10-04 21:49     ` [PATCH v3 4/7] net/mlx4: support multi-segments Tx Ophir Munk
2017-10-04 21:49     ` [PATCH v3 5/7] net/mlx4: get back Tx checksum offloads Ophir Munk
2017-10-04 21:49     ` [PATCH v3 6/7] net/mlx4: get back Rx " Ophir Munk
2017-10-04 21:49     ` [PATCH v3 7/7] net/mlx4: add loopback Tx from VF Ophir Munk
2017-10-04 22:37     ` [PATCH v3 0/7] new mlx4 datapath bypassing ibverbs Ferruh Yigit
2017-10-04 22:46       ` Thomas Monjalon
2017-10-24 11:56 ` [PATCH 0/5] new mlx4 Tx " Nélio Laranjeiro
