* [PATCH 0/8] improve mlx4 Tx performance
@ 2017-11-28 12:19 Matan Azrad
  2017-11-28 12:19 ` [PATCH 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
                   ` (8 more replies)
  0 siblings, 9 replies; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

This series improves mlx4 Tx performance and fixes and cleans up some of the Tx code.
1. 10% MPPS improvement for 1 queue, 1 core, 64B packets, txonly mode.
2. 20% MPPS improvement for 1 queue, 1 core, 32B*4(segs) packets, txonly mode.

Matan Azrad (8):
  net/mlx4: fix Tx packet drop application report
  net/mlx4: remove unnecessary Tx wraparound checks
  net/mlx4: remove restamping from Tx error path
  net/mlx4: optimize Tx multi-segment case
  net/mlx4: merge Tx queue rings management
  net/mlx4: mitigate Tx send entry size calculations
  net/mlx4: align Tx descriptors number
  net/mlx4: remove Tx completion elements counter

 drivers/net/mlx4/mlx4_prm.h  |  20 +-
 drivers/net/mlx4/mlx4_rxtx.c | 443 +++++++++++++++++++------------------------
 drivers/net/mlx4/mlx4_rxtx.h |  38 +++-
 drivers/net/mlx4/mlx4_txq.c  |  44 +++--
 4 files changed, 265 insertions(+), 280 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/8] net/mlx4: fix Tx packet drop application report
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:57   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, stable

When an invalid lkey is sent to the HW, the HW raises an error
notification in the completion function.

The previous code would not crash but did not report anything to the
application on a completion error, so the application could not know
that a packet was actually dropped because of an invalid lkey.

Bring the lkey validation back to the Tx path.
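
The restored check has the following shape (copied from the
surrounding Tx path; it fails the WQE build so the caller can report
the drop instead of handing an invalid key to the HW):

  lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
  dseg->lkey = rte_cpu_to_be_32(lkey);
  if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
          /* MR does not exist: abort this WQE and report the drop. */
          DEBUG("%p: unable to get MP <-> MR association", (void *)txq);
          return -1;
  }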

Fixes: 2eee458746bc ("net/mlx4: remove error flows from Tx fast path")
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 2bfa8b1..0d008ed 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -468,7 +468,6 @@ struct pv {
 		/* Memory region key (big endian) for this memory pool. */
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
 		dseg->lkey = rte_cpu_to_be_32(lkey);
-#ifndef NDEBUG
 		/* Calculate the needed work queue entry size for this packet */
 		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
@@ -486,7 +485,6 @@ struct pv {
 					(sq->head & sq->txbb_cnt) ? 0 : 1);
 			return -1;
 		}
-#endif /* NDEBUG */
 		if (likely(sbuf->data_len)) {
 			byte_count = rte_cpu_to_be_32(sbuf->data_len);
 		} else {
@@ -636,7 +634,6 @@ struct pv {
 			/* Memory region key (big endian). */
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
 			dseg->lkey = rte_cpu_to_be_32(lkey);
-#ifndef NDEBUG
 			if (unlikely(dseg->lkey ==
 				rte_cpu_to_be_32((uint32_t)-1))) {
 				/* MR does not exist. */
@@ -655,7 +652,6 @@ struct pv {
 				elt->buf = NULL;
 				break;
 			}
-#endif /* NDEBUG */
 			/* Never be TXBB aligned, no need compiler barrier. */
 			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
 			/* Fill the control parameters for this packet. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 2/8] net/mlx4: remove unnecessary Tx wraparound checks
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
  2017-11-28 12:19 ` [PATCH 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:57   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

There is no need to check Tx queue wraparound for segments which are
not at the beginning of a Tx block. Especially relevant in a single
segment case.

Remove unnecessary aforementioned checks from Tx path.
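
The distinction rests on the TXBB alignment of the data segment
pointer; only a segment that starts a new TXBB can reach the end of
the SQ buffer (structure taken from the hunks below):

  if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
          /* Segment sits inside the current TXBB and cannot cross
           * sq->eob, so no wraparound check is needed. */
          dseg->addr = rte_cpu_to_be_64(addr);
          dseg->lkey = rte_cpu_to_be_32(lkey);
  } else {
          /* Segment opens a new TXBB; wrap around only here. */
          if (dseg >= (volatile struct mlx4_wqe_data_seg *)sq->eob)
                  dseg = (volatile struct mlx4_wqe_data_seg *)sq->buf;
          dseg->addr = rte_cpu_to_be_64(addr);
          dseg->lkey = rte_cpu_to_be_32(lkey);
  }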

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 0d008ed..9a32b3f 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -461,15 +461,11 @@ struct pv {
 	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
 		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
 		rte_prefetch0((volatile void *)addr);
-		/* Handle WQE wraparound. */
-		if (dseg >= (volatile struct mlx4_wqe_data_seg *)sq->eob)
-			dseg = (volatile struct mlx4_wqe_data_seg *)sq->buf;
-		dseg->addr = rte_cpu_to_be_64(addr);
 		/* Memory region key (big endian) for this memory pool. */
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
 		dseg->lkey = rte_cpu_to_be_32(lkey);
 		/* Calculate the needed work queue entry size for this packet */
-		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
+		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
 			DEBUG("%p: unable to get MP <-> MR association",
 					(void *)txq);
@@ -501,6 +497,8 @@ struct pv {
 		 * control segment.
 		 */
 		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			dseg->addr = rte_cpu_to_be_64(addr);
+			dseg->lkey = rte_cpu_to_be_32(lkey);
 #if RTE_CACHE_LINE_SIZE < 64
 			/*
 			 * Need a barrier here before writing the byte_count
@@ -520,6 +518,13 @@ struct pv {
 			 * TXBB, so we need to postpone its byte_count writing
 			 * for later.
 			 */
+			/* Handle WQE wraparound. */
+			if (dseg >=
+			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
+				dseg = (volatile struct mlx4_wqe_data_seg *)
+					sq->buf;
+			dseg->addr = rte_cpu_to_be_64(addr);
+			dseg->lkey = rte_cpu_to_be_32(lkey);
 			pv[pv_counter].dseg = dseg;
 			pv[pv_counter++].val = byte_count;
 		}
@@ -625,11 +630,6 @@ struct pv {
 					sizeof(struct mlx4_wqe_ctrl_seg));
 			addr = rte_pktmbuf_mtod(buf, uintptr_t);
 			rte_prefetch0((volatile void *)addr);
-			/* Handle WQE wraparound. */
-			if (dseg >=
-				(volatile struct mlx4_wqe_data_seg *)sq->eob)
-				dseg = (volatile struct mlx4_wqe_data_seg *)
-						sq->buf;
 			dseg->addr = rte_cpu_to_be_64(addr);
 			/* Memory region key (big endian). */
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 3/8] net/mlx4: remove restamping from Tx error path
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
  2017-11-28 12:19 ` [PATCH 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
  2017-11-28 12:19 ` [PATCH 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:58   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

At error time, the first 4 bytes of each WQE Tx block have not been
written yet, so there is no need to stamp them; they are still stamped.
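
For context, "stamping" writes a marker with an invalid byte count and
the software ownership bit into the first dword of each freed TXBB; on
this error path that dword has not been overwritten yet. The marker
used elsewhere in the driver is:

  /* First 32 bits of every freed TXBB: an invalid byte count plus the
   * ownership bit handed back to software. */
  uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
                                    (!!owner << MLX4_SQ_STAMP_SHIFT));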

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 9a32b3f..1d8240a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -468,17 +468,7 @@ struct pv {
 		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
 			DEBUG("%p: unable to get MP <-> MR association",
-					(void *)txq);
-			/*
-			 * Restamp entry in case of failure.
-			 * Make sure that size is written correctly
-			 * Note that we give ownership to the SW, not the HW.
-			 */
-			wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
-				buf->nb_segs * sizeof(struct mlx4_wqe_data_seg);
-			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-			mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					(sq->head & sq->txbb_cnt) ? 0 : 1);
+			      (void *)txq);
 			return -1;
 		}
 		if (likely(sbuf->data_len)) {
@@ -639,16 +629,6 @@ struct pv {
 				/* MR does not exist. */
 				DEBUG("%p: unable to get MP <-> MR association",
 				      (void *)txq);
-				/*
-				 * Restamp entry in case of failure.
-				 * Make sure that size is written correctly
-				 * Note that we give ownership to the SW,
-				 * not the HW.
-				 */
-				ctrl->fence_size =
-					(WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
-				mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					     (sq->head & sq->txbb_cnt) ? 0 : 1);
 				elt->buf = NULL;
 				break;
 			}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
                   ` (2 preceding siblings ...)
  2017-11-28 12:19 ` [PATCH 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:58   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

mlx4 Tx block can handle up to 4 data segments or control segment + up
to 3 data segments. The first data segment in each not first Tx block
must validate Tx queue wraparound and must use IO memory barrier before
writing the byte count.

The previous multi-segment code used "for" loop to iterate over all
packet segments and separated first Tx block data case by "if"
statments.

Use switch case and unconditional branches instead of "for" loop can
optimize the case and prevents the unnecessary checks for each data
segment; This hints to compiler to create opitimized jump table.

Optimize this case by switch case and unconditional branches usage.
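
As a standalone illustration of the dispatch pattern (simplified, not
driver code; consume() stands in for the per-segment lkey lookup and
data segment write):

  #include <stdio.h>

  static void consume(int idx, int head)
  {
          printf("segment %d (%s)\n", idx, head ? "TXBB head" : "TXBB tail");
  }

  int main(void)
  {
          int nb_segs = 7; /* arbitrary example */
          int idx = 0;

          goto tail_segs; /* first TXBB head holds the control segment */
  head_seg:
          /* First data segment of a new TXBB. */
          consume(idx++, 1);
          nb_segs--;
  tail_segs:
          switch (nb_segs) {
          default:
                  consume(idx++, 0);
                  nb_segs--;
                  /* fallthrough */
          case 2:
                  consume(idx++, 0);
                  nb_segs--;
                  /* fallthrough */
          case 1:
                  consume(idx++, 0);
                  if (--nb_segs)
                          goto head_seg;
                  /* fallthrough */
          case 0:
                  break;
          }
          return 0;
  }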

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 165 +++++++++++++++++++++++++------------------
 drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
 2 files changed, 128 insertions(+), 70 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 1d8240a..b9cb2fc 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -432,15 +432,14 @@ struct pv {
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	volatile struct mlx4_wqe_ctrl_seg *ctrl;
 	volatile struct mlx4_wqe_data_seg *dseg;
-	struct rte_mbuf *sbuf;
+	struct rte_mbuf *sbuf = buf;
 	uint32_t lkey;
-	uintptr_t addr;
-	uint32_t byte_count;
 	int pv_counter = 0;
+	int nb_segs = buf->nb_segs;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
-		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
+		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
 	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
 	/*
 	 * Check that there is room for this WQE in the send queue and that
@@ -457,67 +456,99 @@ struct pv {
 	dseg = (volatile struct mlx4_wqe_data_seg *)
 			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
 	*pctrl = ctrl;
-	/* Fill the data segments with buffer information. */
-	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
-		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
-		rte_prefetch0((volatile void *)addr);
-		/* Memory region key (big endian) for this memory pool. */
+	/*
+	 * Fill the data segments with buffer information.
+	 * First WQE TXBB head segment is always control segment,
+	 * so jump to tail TXBB data segments code for the first
+	 * WQE data segments filling.
+	 */
+	goto txbb_tail_segs;
+txbb_head_seg:
+	/* Memory region key (big endian) for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		DEBUG("%p: unable to get MP <-> MR association",
+		      (void *)txq);
+		return -1;
+	}
+	/* Handle WQE wraparound. */
+	if (dseg >=
+		(volatile struct mlx4_wqe_data_seg *)sq->eob)
+		dseg = (volatile struct mlx4_wqe_data_seg *)
+			sq->buf;
+	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf, uintptr_t));
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * This data segment starts at the beginning of a new
+	 * TXBB, so we need to postpone its byte_count writing
+	 * for later.
+	 */
+	pv[pv_counter].dseg = dseg;
+	/*
+	 * Zero length segment is treated as inline segment
+	 * with zero data.
+	 */
+	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
+						sbuf->data_len : 0x80000000);
+	sbuf = sbuf->next;
+	dseg++;
+	nb_segs--;
+txbb_tail_segs:
+	/* Jump to default if there are more than two segments remaining. */
+	switch (nb_segs) {
+	default:
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
-		dseg->lkey = rte_cpu_to_be_32(lkey);
-		/* Calculate the needed work queue entry size for this packet */
-		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
-			/* MR does not exist. */
+		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
 			return -1;
 		}
-		if (likely(sbuf->data_len)) {
-			byte_count = rte_cpu_to_be_32(sbuf->data_len);
-		} else {
-			/*
-			 * Zero length segment is treated as inline segment
-			 * with zero data.
-			 */
-			byte_count = RTE_BE32(0x80000000);
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		sbuf = sbuf->next;
+		dseg++;
+		nb_segs--;
+		/* fallthrough */
+	case 2:
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			return -1;
 		}
-		/*
-		 * If the data segment is not at the beginning of a
-		 * Tx basic block (TXBB) then write the byte count,
-		 * else postpone the writing to just before updating the
-		 * control segment.
-		 */
-		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
-			dseg->addr = rte_cpu_to_be_64(addr);
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-#if RTE_CACHE_LINE_SIZE < 64
-			/*
-			 * Need a barrier here before writing the byte_count
-			 * fields to make sure that all the data is visible
-			 * before the byte_count field is set.
-			 * Otherwise, if the segment begins a new cacheline,
-			 * the HCA prefetcher could grab the 64-byte chunk and
-			 * get a valid (!= 0xffffffff) byte count but stale
-			 * data, and end up sending the wrong data.
-			 */
-			rte_io_wmb();
-#endif /* RTE_CACHE_LINE_SIZE */
-			dseg->byte_count = byte_count;
-		} else {
-			/*
-			 * This data segment starts at the beginning of a new
-			 * TXBB, so we need to postpone its byte_count writing
-			 * for later.
-			 */
-			/* Handle WQE wraparound. */
-			if (dseg >=
-			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
-				dseg = (volatile struct mlx4_wqe_data_seg *)
-					sq->buf;
-			dseg->addr = rte_cpu_to_be_64(addr);
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-			pv[pv_counter].dseg = dseg;
-			pv[pv_counter++].val = byte_count;
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		sbuf = sbuf->next;
+		dseg++;
+		nb_segs--;
+		/* fallthrough */
+	case 1:
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			return -1;
+		}
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		nb_segs--;
+		if (nb_segs) {
+			sbuf = sbuf->next;
+			dseg++;
+			goto txbb_head_seg;
 		}
+		/* fallthrough */
+	case 0:
+		break;
 	}
 	/* Write the first DWORD of each TXBB save earlier. */
 	if (pv_counter) {
@@ -583,7 +614,6 @@ struct pv {
 		} srcrb;
 		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 		uint32_t lkey;
-		uintptr_t addr;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -618,24 +648,19 @@ struct pv {
 			dseg = (volatile struct mlx4_wqe_data_seg *)
 					((uintptr_t)ctrl +
 					sizeof(struct mlx4_wqe_ctrl_seg));
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			rte_prefetch0((volatile void *)addr);
-			dseg->addr = rte_cpu_to_be_64(addr);
-			/* Memory region key (big endian). */
+
+			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-			if (unlikely(dseg->lkey ==
-				rte_cpu_to_be_32((uint32_t)-1))) {
+			if (unlikely(lkey == (uint32_t)-1)) {
 				/* MR does not exist. */
 				DEBUG("%p: unable to get MP <-> MR association",
 				      (void *)txq);
 				elt->buf = NULL;
 				break;
 			}
-			/* Never be TXBB aligned, no need compiler barrier. */
-			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
-			/* Fill the control parameters for this packet. */
-			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
+			mlx4_fill_tx_data_seg(dseg, lkey,
+					      rte_pktmbuf_mtod(buf, uintptr_t),
+					      rte_cpu_to_be_32(buf->data_len));
 			nr_txbbs = 1;
 		} else {
 			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 463df2b..8207232 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 	return mlx4_txq_add_mr(txq, mp, i);
 }
 
+/**
+ * Write Tx data segment to the SQ.
+ *
+ * @param dseg
+ *   Pointer to data segment in SQ.
+ * @param lkey
+ *   Memory region lkey.
+ * @param addr
+ *   Data address.
+ * @param byte_count
+ *   Big Endian bytes count of the data to send.
+ */
+static inline void
+mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
+		       uint32_t lkey, uintptr_t addr, uint32_t byte_count)
+{
+	dseg->addr = rte_cpu_to_be_64(addr);
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+#if RTE_CACHE_LINE_SIZE < 64
+	/*
+	 * Need a barrier here before writing the byte_count
+	 * fields to make sure that all the data is visible
+	 * before the byte_count field is set.
+	 * Otherwise, if the segment begins a new cacheline,
+	 * the HCA prefetcher could grab the 64-byte chunk and
+	 * get a valid (!= 0xffffffff) byte count but stale
+	 * data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+#endif /* RTE_CACHE_LINE_SIZE */
+	dseg->byte_count = byte_count;
+}
+
 #endif /* MLX4_RXTX_H_ */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 5/8] net/mlx4: merge Tx queue rings management
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
                   ` (3 preceding siblings ...)
  2017-11-28 12:19 ` [PATCH 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:58   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

The Tx queue send ring was managed by Tx block head, tail, count and
mask variables used to track the remaining send queue space and the
next locations of empty or completed work queue entries.

This method required recalculating actual addresses for every packet,
performed unnecessary Tx-block-based computations and maintained an
expensive dual management of the Tx rings.

Base the send queue ring calculation on actual addresses while
managing it through the descriptor ring indexes.

Add a work queue entry pointer to each descriptor element to hold its
corresponding entry in the send queue.
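
Concretely, the per-WQE room check turns into a byte comparison
against a running remain_size instead of head/tail arithmetic in TXBB
units (excerpted from the hunks below):

  /* Old check, in TXBB units:
   *   if (((sq->head - sq->tail) + nr_txbbs + sq->headroom_txbbs) >=
   *       sq->txbb_cnt || nr_txbbs > MLX4_MAX_WQE_TXBBS)
   *           return -1;
   * New check, in bytes: */
  if (sq->remain_size < wqe_size || wqe_size > MLX4_MAX_WQE_SIZE)
          return NULL;
  /* ... on posting the WQE ... */
  sq->remain_size -= wqe_size;
  /* ... and on completion, when the freed WQE is stamped ... */
  sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, &next_wqe);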

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  |  20 ++--
 drivers/net/mlx4/mlx4_rxtx.c | 241 +++++++++++++++++++------------------------
 drivers/net/mlx4/mlx4_rxtx.h |   1 +
 drivers/net/mlx4/mlx4_txq.c  |  23 +++--
 4 files changed, 126 insertions(+), 159 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index fcc7c12..2ca303a 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -54,22 +54,18 @@
 
 /* Typical TSO descriptor with 16 gather entries is 352 bytes. */
 #define MLX4_MAX_WQE_SIZE 512
-#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+#define MLX4_SEG_SHIFT 4
 
 /* Send queue stamping/invalidating information. */
 #define MLX4_SQ_STAMP_STRIDE 64
 #define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
-#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_OWNER_BIT 31
 #define MLX4_SQ_STAMP_VAL 0x7fffffff
 
 /* Work queue element (WQE) flags. */
-#define MLX4_BIT_WQE_OWN 0x80000000
 #define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
 #define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
-#define MLX4_SIZE_TO_TXBBS(size) \
-	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
-
 /* CQE checksum flags. */
 enum {
 	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
@@ -98,17 +94,15 @@ enum {
 struct mlx4_sq {
 	volatile uint8_t *buf; /**< SQ buffer. */
 	volatile uint8_t *eob; /**< End of SQ buffer */
-	uint32_t head; /**< SQ head counter in units of TXBBS. */
-	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
-	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
-	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
-	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t size; /**< SQ size includes headroom. */
+	int32_t remain_size; /**< Remain WQE size in SQ. */
+	/**< Default owner opcode with HW valid owner bit. */
+	uint32_t owner_opcode;
+	uint32_t stamp; /**< Stamp value with an invalid HW owner bit. */
 	volatile uint32_t *db; /**< Pointer to the doorbell. */
 	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
 };
 
-#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
-
 /* Completion queue events, numbers and masks. */
 #define MLX4_CQ_DB_GEQ_N_MASK 0x3
 #define MLX4_CQ_DOORBELL 0x20
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b9cb2fc..0a8ef93 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -61,9 +61,6 @@
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
-#define WQE_ONE_DATA_SEG_SIZE \
-	(sizeof(struct mlx4_wqe_ctrl_seg) + sizeof(struct mlx4_wqe_data_seg))
-
 /**
  * Pointer-value pair structure used in tx_post_send for saving the first
  * DWORD (32 byte) of a TXBB.
@@ -268,52 +265,48 @@ struct pv {
  *
  * @param sq
  *   Pointer to the SQ structure.
- * @param index
- *   Index of the freed WQE.
- * @param num_txbbs
- *   Number of blocks to stamp.
- *   If < 0 the routine will use the size written in the WQ entry.
- * @param owner
- *   The value of the WQE owner bit to use in the stamp.
+ * @param wqe
+ *   Pointer of WQE to stamp.
  *
  * @return
- *   The number of Tx basic blocs (TXBB) the WQE contained.
+ *   WQE size.
  */
-static int
-mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+static uint32_t
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
 {
-	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
-					  (!!owner << MLX4_SQ_STAMP_SHIFT));
-	volatile uint8_t *wqe = mlx4_get_send_wqe(sq,
-						(index & sq->txbb_cnt_mask));
-	volatile uint32_t *ptr = (volatile uint32_t *)wqe;
-	int i;
-	int txbbs_size;
-	int num_txbbs;
-
+	uint32_t stamp = sq->stamp;
+	volatile uint32_t *next_txbb = *wqe;
 	/* Extract the size from the control segment of the WQE. */
-	num_txbbs = MLX4_SIZE_TO_TXBBS((((volatile struct mlx4_wqe_ctrl_seg *)
-					 wqe)->fence_size & 0x3f) << 4);
-	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	uint32_t size = RTE_ALIGN((uint32_t)
+				  ((((volatile struct mlx4_wqe_ctrl_seg *)
+				     next_txbb)->fence_size & 0x3f) << 4),
+				  MLX4_TXBB_SIZE);
+	uint32_t size_cd = size;
+
 	/* Optimize the common case when there is no wrap-around. */
-	if (wqe + txbbs_size <= sq->eob) {
+	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
 		/* Stamp the freed descriptor. */
-		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
-			*ptr = stamp;
-			ptr += MLX4_SQ_STAMP_DWORDS;
-		}
+		do {
+			*next_txbb = stamp;
+			next_txbb += MLX4_SQ_STAMP_DWORDS;
+			size_cd -= MLX4_TXBB_SIZE;
+		} while (size_cd);
 	} else {
 		/* Stamp the freed descriptor. */
-		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
-			*ptr = stamp;
-			ptr += MLX4_SQ_STAMP_DWORDS;
-			if ((volatile uint8_t *)ptr >= sq->eob) {
-				ptr = (volatile uint32_t *)sq->buf;
-				stamp ^= RTE_BE32(0x80000000);
+		do {
+			*next_txbb = stamp;
+			next_txbb += MLX4_SQ_STAMP_DWORDS;
+			if ((volatile uint8_t *)next_txbb >= sq->eob) {
+				next_txbb = (volatile uint32_t *)sq->buf;
+				/* Flip invalid stamping ownership. */
+				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
+				sq->stamp = stamp;
 			}
-		}
+			size_cd -= MLX4_TXBB_SIZE;
+		} while (size_cd);
 	}
-	return num_txbbs;
+	*wqe = next_txbb;
+	return size;
 }
 
 /**
@@ -326,24 +319,22 @@ struct pv {
  *
  * @param txq
  *   Pointer to Tx queue structure.
- *
- * @return
- *   0 on success, -1 on failure.
  */
-static int
+static void
 mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
 				  struct mlx4_sq *sq)
 {
-	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
-	unsigned int sq_tail = sq->tail;
 	struct mlx4_cq *cq = &txq->mcq;
 	volatile struct mlx4_cqe *cqe;
 	uint32_t cons_index = cq->cons_index;
-	uint16_t new_index;
-	uint16_t nr_txbbs = 0;
-	int pkts = 0;
-
+	volatile uint32_t *first_wqe;
+	volatile uint32_t *next_wqe = (volatile uint32_t *)
+			((&(*txq->elts)[elts_tail])->wqe);
+	volatile uint32_t *last_wqe;
+	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
+			 MLX4_TXBB_SHIFT) - 1;
+	uint32_t pkts = 0;
 	/*
 	 * Traverse over all CQ entries reported and handle each WQ entry
 	 * reported by them.
@@ -353,11 +344,11 @@ struct pv {
 		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
 		    !!(cons_index & cq->cqe_cnt)))
 			break;
+#ifndef NDEBUG
 		/*
 		 * Make sure we read the CQE after we read the ownership bit.
 		 */
 		rte_io_rmb();
-#ifndef NDEBUG
 		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
 			     MLX4_CQE_OPCODE_ERROR)) {
 			volatile struct mlx4_err_cqe *cqe_err =
@@ -366,41 +357,32 @@ struct pv {
 			      " syndrome: 0x%x\n",
 			      (void *)txq, cqe_err->vendor_err,
 			      cqe_err->syndrome);
+			break;
 		}
 #endif /* NDEBUG */
-		/* Get WQE index reported in the CQE. */
-		new_index =
-			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		/* Get WQE address buy index from the CQE. */
+		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
+			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
+			 MLX4_TXBB_SHIFT));
 		do {
 			/* Free next descriptor. */
-			sq_tail += nr_txbbs;
-			nr_txbbs =
-				mlx4_txq_stamp_freed_wqe(sq,
-				     sq_tail & sq->txbb_cnt_mask,
-				     !!(sq_tail & sq->txbb_cnt));
+			first_wqe = next_wqe;
+			sq->remain_size +=
+				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
 			pkts++;
-		} while ((sq_tail & sq->txbb_cnt_mask) != new_index);
+		} while (first_wqe != last_wqe);
 		cons_index++;
 	} while (1);
 	if (unlikely(pkts == 0))
-		return 0;
-	/* Update CQ. */
+		return;
+	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
-	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
-	sq->tail = sq_tail + nr_txbbs;
-	/* Update the list of packets posted for transmission. */
-	elts_comp -= pkts;
-	assert(elts_comp <= txq->elts_comp);
-	/*
-	 * Assume completion status is successful as nothing can be done about
-	 * it anyway.
-	 */
+	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
+	txq->elts_comp -= pkts;
 	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
-	txq->elts_comp = elts_comp;
-	return 0;
 }
 
 /**
@@ -421,41 +403,27 @@ struct pv {
 	return buf->pool;
 }
 
-static int
+static volatile struct mlx4_wqe_ctrl_seg *
 mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
-		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
+		   volatile struct mlx4_wqe_ctrl_seg *ctrl)
 {
-	int wqe_real_size;
-	int nr_txbbs;
 	struct pv *pv = (struct pv *)txq->bounce_buf;
 	struct mlx4_sq *sq = &txq->msq;
-	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
-	volatile struct mlx4_wqe_ctrl_seg *ctrl;
-	volatile struct mlx4_wqe_data_seg *dseg;
 	struct rte_mbuf *sbuf = buf;
 	uint32_t lkey;
 	int pv_counter = 0;
 	int nb_segs = buf->nb_segs;
+	int32_t wqe_size;
+	volatile struct mlx4_wqe_data_seg *dseg =
+		(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
 
-	/* Calculate the needed work queue entry size for this packet. */
-	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
-		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
-	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
-	/*
-	 * Check that there is room for this WQE in the send queue and that
-	 * the WQE size is legal.
-	 */
-	if (((sq->head - sq->tail) + nr_txbbs +
-				sq->headroom_txbbs) >= sq->txbb_cnt ||
-			nr_txbbs > MLX4_MAX_WQE_TXBBS) {
-		return -1;
-	}
-	/* Get the control and data entries of the WQE. */
-	ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
-			mlx4_get_send_wqe(sq, head_idx);
-	dseg = (volatile struct mlx4_wqe_data_seg *)
-			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
-	*pctrl = ctrl;
+	ctrl->fence_size = 1 + nb_segs;
+	wqe_size = RTE_ALIGN((int32_t)(ctrl->fence_size << MLX4_SEG_SHIFT),
+			     MLX4_TXBB_SIZE);
+	/* Validate WQE size and WQE space in the send queue. */
+	if (sq->remain_size < wqe_size ||
+	    wqe_size > MLX4_MAX_WQE_SIZE)
+		return NULL;
 	/*
 	 * Fill the data segments with buffer information.
 	 * First WQE TXBB head segment is always control segment,
@@ -469,7 +437,7 @@ struct pv {
 	if (unlikely(lkey == (uint32_t)-1)) {
 		DEBUG("%p: unable to get MP <-> MR association",
 		      (void *)txq);
-		return -1;
+		return NULL;
 	}
 	/* Handle WQE wraparound. */
 	if (dseg >=
@@ -501,7 +469,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -517,7 +485,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -533,7 +501,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -557,9 +525,10 @@ struct pv {
 		for (--pv_counter; pv_counter  >= 0; pv_counter--)
 			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
 	}
-	/* Fill the control parameters for this packet. */
-	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-	return nr_txbbs;
+	sq->remain_size -= wqe_size;
+	/* Align next WQE address to the next TXBB. */
+	return (volatile struct mlx4_wqe_ctrl_seg *)
+		((volatile uint8_t *)ctrl + wqe_size);
 }
 
 /**
@@ -585,7 +554,8 @@ struct pv {
 	unsigned int i;
 	unsigned int max;
 	struct mlx4_sq *sq = &txq->msq;
-	int nr_txbbs;
+	volatile struct mlx4_wqe_ctrl_seg *ctrl;
+	struct txq_elt *elt;
 
 	assert(txq->elts_comp_cd != 0);
 	if (likely(txq->elts_comp != 0))
@@ -599,29 +569,30 @@ struct pv {
 	--max;
 	if (max > pkts_n)
 		max = pkts_n;
+	elt = &(*txq->elts)[elts_head];
+	/* Each element saves its appropriate work queue. */
+	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
 		unsigned int elts_head_next =
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
-		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		uint32_t owner_opcode = MLX4_OPCODE_SEND;
-		volatile struct mlx4_wqe_ctrl_seg *ctrl;
-		volatile struct mlx4_wqe_data_seg *dseg;
+		uint32_t owner_opcode = sq->owner_opcode;
+		volatile struct mlx4_wqe_data_seg *dseg =
+				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
+		volatile struct mlx4_wqe_ctrl_seg *ctrl_next;
 		union {
 			uint32_t flags;
 			uint16_t flags16[2];
 		} srcrb;
-		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 		uint32_t lkey;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
 			struct rte_mbuf *tmp = elt->buf;
-
 #ifndef NDEBUG
 			/* Poisoning. */
-			memset(elt, 0x66, sizeof(*elt));
+			elt->buf = (struct rte_mbuf *)0x6666666666666666;
 #endif
 			/* Faster than rte_pktmbuf_free(). */
 			do {
@@ -633,23 +604,11 @@ struct pv {
 		}
 		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
 		if (buf->nb_segs == 1) {
-			/*
-			 * Check that there is room for this WQE in the send
-			 * queue and that the WQE size is legal
-			 */
-			if (((sq->head - sq->tail) + 1 + sq->headroom_txbbs) >=
-			     sq->txbb_cnt || 1 > MLX4_MAX_WQE_TXBBS) {
+			/* Validate WQE space in the send queue. */
+			if (sq->remain_size < MLX4_TXBB_SIZE) {
 				elt->buf = NULL;
 				break;
 			}
-			/* Get the control and data entries of the WQE. */
-			ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
-					mlx4_get_send_wqe(sq, head_idx);
-			dseg = (volatile struct mlx4_wqe_data_seg *)
-					((uintptr_t)ctrl +
-					sizeof(struct mlx4_wqe_ctrl_seg));
-
-			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
 			if (unlikely(lkey == (uint32_t)-1)) {
 				/* MR does not exist. */
@@ -658,23 +617,33 @@ struct pv {
 				elt->buf = NULL;
 				break;
 			}
-			mlx4_fill_tx_data_seg(dseg, lkey,
+			mlx4_fill_tx_data_seg(dseg++, lkey,
 					      rte_pktmbuf_mtod(buf, uintptr_t),
 					      rte_cpu_to_be_32(buf->data_len));
-			nr_txbbs = 1;
+			/* Set WQE size in 16-byte units. */
+			ctrl->fence_size = 0x2;
+			sq->remain_size -= MLX4_TXBB_SIZE;
+			/* Align next WQE address to the next TXBB. */
+			ctrl_next = ctrl + 0x4;
 		} else {
-			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
-			if (nr_txbbs < 0) {
+			ctrl_next = mlx4_tx_burst_segs(buf, txq, ctrl);
+			if (!ctrl_next) {
 				elt->buf = NULL;
 				break;
 			}
 		}
+		/* Hold SQ ring wrap around. */
+		if ((volatile uint8_t *)ctrl_next >= sq->eob) {
+			ctrl_next = (volatile struct mlx4_wqe_ctrl_seg *)
+				((volatile uint8_t *)ctrl_next - sq->size);
+			/* Flip HW valid ownership. */
+			sq->owner_opcode ^= 0x1 << MLX4_SQ_OWNER_BIT;
+		}
 		/*
 		 * For raw Ethernet, the SOLICIT flag is used to indicate
 		 * that no ICRC should be calculated.
 		 */
-		txq->elts_comp_cd -= nr_txbbs;
-		if (unlikely(txq->elts_comp_cd <= 0)) {
+		if (--txq->elts_comp_cd == 0) {
 			txq->elts_comp_cd = txq->elts_comp_cd_init;
 			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 					       MLX4_WQE_CTRL_CQ_UPDATE);
@@ -720,13 +689,13 @@ struct pv {
 		 * executing as soon as we do).
 		 */
 		rte_io_wmb();
-		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
-					      ((sq->head & sq->txbb_cnt) ?
-						       MLX4_BIT_WQE_OWN : 0));
-		sq->head += nr_txbbs;
+		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		elts_head = elts_head_next;
+		elt_next->wqe = ctrl_next;
+		ctrl = ctrl_next;
+		elt = elt_next;
 	}
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 8207232..c092afa 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -105,6 +105,7 @@ struct mlx4_rss {
 /** Tx element. */
 struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
+	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
 };
 
 /** Rx queue counters. */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7882a4d..4c7b62a 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -84,6 +84,7 @@
 		assert(elt->buf != NULL);
 		rte_pktmbuf_free(elt->buf);
 		elt->buf = NULL;
+		elt->wqe = NULL;
 		if (++elts_tail == RTE_DIM(*elts))
 			elts_tail = 0;
 	}
@@ -163,20 +164,19 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4_cq *cq = &txq->mcq;
 	struct mlx4dv_qp *dqp = mlxdv->qp.out;
 	struct mlx4dv_cq *dcq = mlxdv->cq.out;
-	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
 
-	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
 	/* Total length, including headroom and spare WQEs. */
-	sq->eob = sq->buf + sq_size;
-	sq->head = 0;
-	sq->tail = 0;
-	sq->txbb_cnt =
-		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
-	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	sq->eob = sq->buf + sq->size;
+	uint32_t headroom_size = 2048 + (1 << dqp->sq.wqe_shift);
+	/* Continuous headroom size bytes must always stay freed. */
+	sq->remain_size = sq->size - headroom_size;
+	sq->owner_opcode = MLX4_OPCODE_SEND | (0 << MLX4_SQ_OWNER_BIT);
+	sq->stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+				     (0 << MLX4_SQ_OWNER_BIT));
 	sq->db = dqp->sdb;
 	sq->doorbell_qpn = dqp->doorbell_qpn;
-	sq->headroom_txbbs =
-		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
 	cq->buf = dcq->buf.buf;
 	cq->cqe_cnt = dcq->cqe_cnt;
 	cq->set_ci_db = dcq->set_ci_db;
@@ -362,6 +362,9 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
+	/* Save first wqe pointer in the first element. */
+	(&(*txq->elts)[0])->wqe =
+		(volatile struct mlx4_wqe_ctrl_seg *)txq->msq.buf;
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
 	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 6/8] net/mlx4: mitigate Tx send entry size calculations
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
                   ` (4 preceding siblings ...)
  2017-11-28 12:19 ` [PATCH 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:59   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 7/8] net/mlx4: align Tx descriptors number Matan Azrad
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

The previous code read the send queue entry size used for stamping
from the send queue entry pointed to by the completion queue entry;
these two reads were done per packet in the completion stage.

The number of packets per completion burst is a fixed value stored in
the Tx queue, so each valid completion entry is known to free exactly
that fixed number of packets.

Since the descriptor ring holds the send queue entries, the size of
all entries in a completion burst can be inferred by a simple
calculation, avoiding per-packet calculations.

Adjust the completion functions to free whole completion bursts at
once and avoid per-packet work queue entry reads and calculations.

Save only the send queue entry pointers that start a completion burst
or a Tx burst in the appropriate descriptor element.
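
The resulting completion accounting reduces to one multiplication and
a single burst-wide stamping call (taken from the hunks below):

  /* Completions are requested every elts_comp_cd_init packets, so the
   * number of freed elements follows from the new CQE count alone. */
  completed = (cons_index - cq->cons_index) * txq->elts_comp_cd_init;
  /* Stamp the whole freed range at once, from the end of the previous
   * burst to the end address saved in the new tail element. */
  first_txbb = (&(*txq->elts)[elts_tail])->eocb;
  elts_tail += completed;
  if (elts_tail >= elts_n)
          elts_tail -= elts_n;
  sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
          (&(*txq->elts)[elts_tail])->eocb);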

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 105 +++++++++++++++++++------------------------
 drivers/net/mlx4/mlx4_rxtx.h |   5 ++-
 2 files changed, 50 insertions(+), 60 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 0a8ef93..30f2930 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -258,54 +258,47 @@ struct pv {
 };
 
 /**
- * Stamp a WQE so it won't be reused by the HW.
+ * Stamp TXBB burst so it won't be reused by the HW.
  *
  * Routine is used when freeing WQE used by the chip or when failing
  * building an WQ entry has failed leaving partial information on the queue.
  *
  * @param sq
  *   Pointer to the SQ structure.
- * @param wqe
- *   Pointer of WQE to stamp.
+ * @param start
+ *   Pointer to the first TXBB to stamp.
+ * @param end
+ *   Pointer to the followed end TXBB to stamp.
  *
  * @return
- *   WQE size.
+ *   Stamping burst size in byte units.
  */
-static uint32_t
-mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
+static int32_t
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t *start,
+			 volatile uint32_t *end)
 {
 	uint32_t stamp = sq->stamp;
-	volatile uint32_t *next_txbb = *wqe;
-	/* Extract the size from the control segment of the WQE. */
-	uint32_t size = RTE_ALIGN((uint32_t)
-				  ((((volatile struct mlx4_wqe_ctrl_seg *)
-				     next_txbb)->fence_size & 0x3f) << 4),
-				  MLX4_TXBB_SIZE);
-	uint32_t size_cd = size;
+	int32_t size = (intptr_t)end - (intptr_t)start;
 
-	/* Optimize the common case when there is no wrap-around. */
-	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
-		/* Stamp the freed descriptor. */
+	assert(start != end);
+	/* Hold SQ ring wrap around. */
+	if (size < 0) {
+		size = (int32_t)sq->size + size;
 		do {
-			*next_txbb = stamp;
-			next_txbb += MLX4_SQ_STAMP_DWORDS;
-			size_cd -= MLX4_TXBB_SIZE;
-		} while (size_cd);
-	} else {
-		/* Stamp the freed descriptor. */
-		do {
-			*next_txbb = stamp;
-			next_txbb += MLX4_SQ_STAMP_DWORDS;
-			if ((volatile uint8_t *)next_txbb >= sq->eob) {
-				next_txbb = (volatile uint32_t *)sq->buf;
-				/* Flip invalid stamping ownership. */
-				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
-				sq->stamp = stamp;
-			}
-			size_cd -= MLX4_TXBB_SIZE;
-		} while (size_cd);
+			*start = stamp;
+			start += MLX4_SQ_STAMP_DWORDS;
+		} while (start != (volatile uint32_t *)sq->eob);
+		start = (volatile uint32_t *)sq->buf;
+		/* Flip invalid stamping ownership. */
+		stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
+		sq->stamp = stamp;
+		if (start == end)
+			return size;
 	}
-	*wqe = next_txbb;
+	do {
+		*start = stamp;
+		start += MLX4_SQ_STAMP_DWORDS;
+	} while (start != end);
 	return size;
 }
 
@@ -327,14 +320,10 @@ struct pv {
 	unsigned int elts_tail = txq->elts_tail;
 	struct mlx4_cq *cq = &txq->mcq;
 	volatile struct mlx4_cqe *cqe;
+	uint32_t completed;
 	uint32_t cons_index = cq->cons_index;
-	volatile uint32_t *first_wqe;
-	volatile uint32_t *next_wqe = (volatile uint32_t *)
-			((&(*txq->elts)[elts_tail])->wqe);
-	volatile uint32_t *last_wqe;
-	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
-			 MLX4_TXBB_SHIFT) - 1;
-	uint32_t pkts = 0;
+	volatile uint32_t *first_txbb;
+
 	/*
 	 * Traverse over all CQ entries reported and handle each WQ entry
 	 * reported by them.
@@ -360,28 +349,23 @@ struct pv {
 			break;
 		}
 #endif /* NDEBUG */
-		/* Get WQE address buy index from the CQE. */
-		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
-			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
-			 MLX4_TXBB_SHIFT));
-		do {
-			/* Free next descriptor. */
-			first_wqe = next_wqe;
-			sq->remain_size +=
-				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
-			pkts++;
-		} while (first_wqe != last_wqe);
 		cons_index++;
 	} while (1);
-	if (unlikely(pkts == 0))
+	completed = (cons_index - cq->cons_index) * txq->elts_comp_cd_init;
+	if (unlikely(!completed))
 		return;
+	/* First stamping address is the end of the last one. */
+	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
+	elts_tail += completed;
+	if (elts_tail >= elts_n)
+		elts_tail -= elts_n;
+	/* The new tail element holds the end address. */
+	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
+		(&(*txq->elts)[elts_tail])->eocb);
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
-	txq->elts_comp -= pkts;
-	elts_tail += pkts;
-	if (elts_tail >= elts_n)
-		elts_tail -= elts_n;
+	txq->elts_comp -= completed;
 	txq->elts_tail = elts_tail;
 }
 
@@ -570,7 +554,7 @@ struct pv {
 	if (max > pkts_n)
 		max = pkts_n;
 	elt = &(*txq->elts)[elts_head];
-	/* Each element saves its appropriate work queue. */
+	/* First Tx burst element saves the next WQE control segment. */
 	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
@@ -644,6 +628,8 @@ struct pv {
 		 * that no ICRC should be calculated.
 		 */
 		if (--txq->elts_comp_cd == 0) {
+			/* Save the completion burst end address. */
+			elt_next->eocb = (volatile uint32_t *)ctrl_next;
 			txq->elts_comp_cd = txq->elts_comp_cd_init;
 			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 					       MLX4_WQE_CTRL_CQ_UPDATE);
@@ -693,13 +679,14 @@ struct pv {
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		elts_head = elts_head_next;
-		elt_next->wqe = ctrl_next;
 		ctrl = ctrl_next;
 		elt = elt_next;
 	}
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
+	/* Save WQE address of the next Tx burst element. */
+	elt->wqe = ctrl;
 	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
 	txq->stats.obytes += bytes_sent;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index c092afa..9d83aeb 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -105,7 +105,10 @@ struct mlx4_rss {
 /** Tx element. */
 struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
-	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
+	union {
+		volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
+		volatile uint32_t *eocb; /**< End of completion burst. */
+	};
 };
 
 /** Rx queue counters. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 7/8] net/mlx4: align Tx descriptors number
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
                   ` (5 preceding siblings ...)
  2017-11-28 12:19 ` [PATCH 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:59   ` Adrien Mazarguil
  2017-11-28 12:19 ` [PATCH 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Using a power-of-two number of descriptors makes ring management
easier and allows using a mask operation instead of wraparound
conditions.

Adjust the Tx descriptor number to a power of two and change the
index calculations to use a mask accordingly.
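
With elts_n a power of two, elts_m = elts_n - 1 becomes an index mask
(as used in the hunks below):

  const unsigned int elts_m = elts_n - 1; /* e.g. 256 - 1 = 0xff */

  /* A monotonically increasing index folds into the ring by masking,
   * replacing the explicit wraparound test: */
  elt = &(*txq->elts)[elts_head & elts_m];
  /* instead of:
   *   elts_head_next = (elts_head + 1 == elts_n) ? 0 : elts_head + 1; */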

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 22 ++++++++--------------
 drivers/net/mlx4/mlx4_txq.c  | 20 ++++++++++++--------
 2 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 30f2930..b5aaf4c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -314,7 +314,7 @@ struct pv {
  *   Pointer to Tx queue structure.
  */
 static void
-mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
+mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
 				  struct mlx4_sq *sq)
 {
 	unsigned int elts_tail = txq->elts_tail;
@@ -355,13 +355,11 @@ struct pv {
 	if (unlikely(!completed))
 		return;
 	/* First stamping address is the end of the last one. */
-	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
+	first_txbb = (&(*txq->elts)[elts_tail & elts_m])->eocb;
 	elts_tail += completed;
-	if (elts_tail >= elts_n)
-		elts_tail -= elts_n;
 	/* The new tail element holds the end address. */
 	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
-		(&(*txq->elts)[elts_tail])->eocb);
+		(&(*txq->elts)[elts_tail & elts_m])->eocb);
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
@@ -534,6 +532,7 @@ struct pv {
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
+	const unsigned int elts_m = elts_n - 1;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
@@ -543,24 +542,20 @@ struct pv {
 
 	assert(txq->elts_comp_cd != 0);
 	if (likely(txq->elts_comp != 0))
-		mlx4_txq_complete(txq, elts_n, sq);
+		mlx4_txq_complete(txq, elts_m, sq);
 	max = (elts_n - (elts_head - txq->elts_tail));
-	if (max > elts_n)
-		max -= elts_n;
 	assert(max >= 1);
 	assert(max <= elts_n);
 	/* Always leave one free entry in the ring. */
 	--max;
 	if (max > pkts_n)
 		max = pkts_n;
-	elt = &(*txq->elts)[elts_head];
+	elt = &(*txq->elts)[elts_head & elts_m];
 	/* First Tx burst element saves the next WQE control segment. */
 	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
-		unsigned int elts_head_next =
-			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
-		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
+		struct txq_elt *elt_next = &(*txq->elts)[++elts_head & elts_m];
 		uint32_t owner_opcode = sq->owner_opcode;
 		volatile struct mlx4_wqe_data_seg *dseg =
 				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
@@ -678,7 +673,6 @@ struct pv {
 		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
-		elts_head = elts_head_next;
 		ctrl = ctrl_next;
 		elt = elt_next;
 	}
@@ -694,7 +688,7 @@ struct pv {
 	rte_wmb();
 	/* Ring QP doorbell. */
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
-	txq->elts_head = elts_head;
+	txq->elts_head += i;
 	txq->elts_comp += i;
 	return i;
 }
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 4c7b62a..253075a 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -76,17 +76,16 @@
 	unsigned int elts_head = txq->elts_head;
 	unsigned int elts_tail = txq->elts_tail;
 	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
+	unsigned int elts_m = txq->elts_n - 1;
 
 	DEBUG("%p: freeing WRs", (void *)txq);
 	while (elts_tail != elts_head) {
-		struct txq_elt *elt = &(*elts)[elts_tail];
+		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
 
 		assert(elt->buf != NULL);
 		rte_pktmbuf_free(elt->buf);
 		elt->buf = NULL;
 		elt->wqe = NULL;
-		if (++elts_tail == RTE_DIM(*elts))
-			elts_tail = 0;
 	}
 	txq->elts_tail = txq->elts_head;
 }
@@ -208,7 +207,9 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4dv_obj mlxdv;
 	struct mlx4dv_qp dv_qp;
 	struct mlx4dv_cq dv_cq;
-	struct txq_elt (*elts)[desc];
+	uint32_t elts_size = desc > 0x1000 ? 0x1000 :
+		rte_align32pow2((uint32_t)desc);
+	struct txq_elt (*elts)[elts_size];
 	struct ibv_qp_init_attr qp_init_attr;
 	struct txq *txq;
 	uint8_t *bounce_buf;
@@ -247,11 +248,14 @@ struct txq_mp2mr_mbuf_check_data {
 		      (void *)dev, idx);
 		return -rte_errno;
 	}
-	if (!desc) {
-		rte_errno = EINVAL;
-		ERROR("%p: invalid number of Tx descriptors", (void *)dev);
-		return -rte_errno;
+	if ((uint32_t)desc != elts_size) {
+		desc = (uint16_t)elts_size;
+		WARN("%p: changed number of descriptors in TX queue %u"
+		     " to be power of two (%d)",
+		     (void *)dev, idx, desc);
 	}
+	DEBUG("%p: configuring queue %u for %u descriptors",
+	      (void *)dev, idx, desc);
 	/* Allocate and initialize Tx queue. */
 	mlx4_zmallocv_socket("TXQ", vec, RTE_DIM(vec), socket);
 	if (!txq) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 8/8] net/mlx4: remove Tx completion elements counter
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
                   ` (6 preceding siblings ...)
  2017-11-28 12:19 ` [PATCH 7/8] net/mlx4: align Tx descriptors number Matan Azrad
@ 2017-11-28 12:19 ` Matan Azrad
  2017-12-06 10:59   ` Adrien Mazarguil
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-11-28 12:19 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

This counter tracked the descriptor elements waiting to be completed
and was used to decide whether the completion function should be
called.

The same check can be derived from the other element management
variables, which makes maintaining this counter unnecessary.

Remove the counter and base the completion check on the existing
element management variables instead.
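
The replacement check relies only on the existing head and tail
indexes (excerpted from the change below):

  /* Outstanding elements are simply head - tail; polling the CQ is
   * only useful once at least one completion burst
   * (elts_comp_cd_init packets) may have completed. */
  unsigned int max = elts_head - txq->elts_tail;

  if (likely(max >= txq->elts_comp_cd_init))
          mlx4_txq_complete(txq, elts_m, sq);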

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 8 +++-----
 drivers/net/mlx4/mlx4_rxtx.h | 1 -
 drivers/net/mlx4/mlx4_txq.c  | 1 -
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index b5aaf4c..b7b8489 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -363,7 +363,6 @@ struct pv {
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
-	txq->elts_comp -= completed;
 	txq->elts_tail = elts_tail;
 }
 
@@ -535,15 +534,15 @@ struct pv {
 	const unsigned int elts_m = elts_n - 1;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
-	unsigned int max;
+	unsigned int max = elts_head - txq->elts_tail;
 	struct mlx4_sq *sq = &txq->msq;
 	volatile struct mlx4_wqe_ctrl_seg *ctrl;
 	struct txq_elt *elt;
 
 	assert(txq->elts_comp_cd != 0);
-	if (likely(txq->elts_comp != 0))
+	if (likely(max >= txq->elts_comp_cd_init))
 		mlx4_txq_complete(txq, elts_m, sq);
-	max = (elts_n - (elts_head - txq->elts_tail));
+	max = elts_n - max;
 	assert(max >= 1);
 	assert(max <= elts_n);
 	/* Always leave one free entry in the ring. */
@@ -689,7 +688,6 @@ struct pv {
 	/* Ring QP doorbell. */
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head += i;
-	txq->elts_comp += i;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 9d83aeb..096a569 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -125,7 +125,6 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of packets awaiting completion. */
 	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 253075a..b310aee 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -273,7 +273,6 @@ struct txq_mp2mr_mbuf_check_data {
 		.elts = elts,
 		.elts_head = 0,
 		.elts_tail = 0,
-		.elts_comp = 0,
 		/*
 		 * Request send completion every MLX4_PMD_TX_PER_COMP_REQ
 		 * packets or at least 4 times per ring.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/8] net/mlx4: fix Tx packet drop application report
  2017-11-28 12:19 ` [PATCH 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
@ 2017-12-06 10:57   ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:57 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev, stable

On Tue, Nov 28, 2017 at 12:19:23PM +0000, Matan Azrad wrote:
> When an invalid lkey is sent to the HW, the HW raises an error
> notification in the completion function.
> 
> The previous code would not crash but did not report anything to the
> application on a completion error, so the application could not know
> that a packet was actually dropped because of an invalid lkey.
> 
> Bring the lkey validation back to the Tx path.
> 
> Fixes: 2eee458746bc ("net/mlx4: remove error flows from Tx fast path")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] net/mlx4: remove unnecessary Tx wraparound checks
  2017-11-28 12:19 ` [PATCH 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
@ 2017-12-06 10:57   ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:57 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:24PM +0000, Matan Azrad wrote:
> There is no need to check Tx queue wraparound for segments which are
> not at the beginning of a Tx block. Especially relevant in a single
> segment case.
> 
> Remove unnecessary aforementioned checks from Tx path.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/8] net/mlx4: remove restamping from Tx error path
  2017-11-28 12:19 ` [PATCH 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
@ 2017-12-06 10:58   ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:58 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:25PM +0000, Matan Azrad wrote:
> At error time, the first 4 bytes of each WQE Tx block still have not
> writen, so no need to stamp them because they are already stamped.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
  2017-11-28 12:19 ` [PATCH 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
@ 2017-12-06 10:58   ` Adrien Mazarguil
  2017-12-06 11:29     ` Matan Azrad
  0 siblings, 1 reply; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:58 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote:
> mlx4 Tx block can handle up to 4 data segments or control segment + up
> to 3 data segments. The first data segment in each not first Tx block
> must validate Tx queue wraparound and must use IO memory barrier before
> writing the byte count.
> 
> The previous multi-segment code used "for" loop to iterate over all
> packet segments and separated first Tx block data case by "if"
> statments.

statments => statements

> 
> Use switch case and unconditional branches instead of "for" loop can
> optimize the case and prevents the unnecessary checks for each data
> segment; This hints to compiler to create opitimized jump table.

opitimized => optimized

> 
> Optimize this case by switch case and unconditional branches usage.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 165 +++++++++++++++++++++++++------------------
>  drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
>  2 files changed, 128 insertions(+), 70 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index 1d8240a..b9cb2fc 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -432,15 +432,14 @@ struct pv {
>  	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
>  	volatile struct mlx4_wqe_ctrl_seg *ctrl;
>  	volatile struct mlx4_wqe_data_seg *dseg;
> -	struct rte_mbuf *sbuf;
> +	struct rte_mbuf *sbuf = buf;
>  	uint32_t lkey;
> -	uintptr_t addr;
> -	uint32_t byte_count;
>  	int pv_counter = 0;
> +	int nb_segs = buf->nb_segs;
>  
>  	/* Calculate the needed work queue entry size for this packet. */
>  	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
> -		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> +		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
>  	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
>  	/*
>  	 * Check that there is room for this WQE in the send queue and that
> @@ -457,67 +456,99 @@ struct pv {
>  	dseg = (volatile struct mlx4_wqe_data_seg *)
>  			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
>  	*pctrl = ctrl;
> -	/* Fill the data segments with buffer information. */
> -	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> -		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> -		rte_prefetch0((volatile void *)addr);
> -		/* Memory region key (big endian) for this memory pool. */
> +	/*
> +	 * Fill the data segments with buffer information.
> +	 * First WQE TXBB head segment is always control segment,
> +	 * so jump to tail TXBB data segments code for the first
> +	 * WQE data segments filling.
> +	 */
> +	goto txbb_tail_segs;
> +txbb_head_seg:

I'm not fundamentally opposed to "goto" unlike a lot of people out there,
but this doesn't look good. It's OK to use goto for error cases and to
extricate yourself when trapped in an inner loop, also in some optimization
scenarios where it sometimes makes sense, but not when the same can be
achieved through standard loop constructs and keywords.

In this case I'm under the impression you could have managed with a
do { ... } while (...) construct. You need to try harder to reorganize these
changes or prove it can't be done without negatively impacting performance.

Doing so should make this patch shorter as well.
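
A rough, untested sketch of such a construct, keeping the switch but
dropping the labels (the per-segment fill bodies, i.e. the lkey lookup
followed by mlx4_fill_tx_data_seg(), are left as comments since they would
be the same as in this patch):

    unsigned int first_txbb = 1;

    do {
        if (!first_txbb) {
            /*
             * First data segment of a new TXBB: check SQ wraparound
             * and postpone byte_count through pv[], as in the
             * txbb_head_seg: block above.
             */
            sbuf = sbuf->next;
            dseg++;
            nb_segs--;
        }
        first_txbb = 0;
        /* Up to three data segments complete the current TXBB. */
        switch (nb_segs < 3 ? nb_segs : 3) {
        case 3:
            /* lkey lookup + mlx4_fill_tx_data_seg(). */
            sbuf = sbuf->next;
            dseg++;
            nb_segs--;
            /* fallthrough */
        case 2:
            /* lkey lookup + mlx4_fill_tx_data_seg(). */
            sbuf = sbuf->next;
            dseg++;
            nb_segs--;
            /* fallthrough */
        case 1:
            /* lkey lookup + mlx4_fill_tx_data_seg(). */
            nb_segs--;
            if (nb_segs) {
                sbuf = sbuf->next;
                dseg++;
            }
            break;
        }
    } while (nb_segs);

Whether the compiler still emits the same jump table for this shape is of
course worth checking.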

> +	/* Memory region key (big endian) for this memory pool. */
> +	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> +	if (unlikely(lkey == (uint32_t)-1)) {
> +		DEBUG("%p: unable to get MP <-> MR association",
> +		      (void *)txq);
> +		return -1;
> +	}
> +	/* Handle WQE wraparound. */
> +	if (dseg >=
> +		(volatile struct mlx4_wqe_data_seg *)sq->eob)
> +		dseg = (volatile struct mlx4_wqe_data_seg *)
> +			sq->buf;
> +	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf, uintptr_t));
> +	dseg->lkey = rte_cpu_to_be_32(lkey);
> +	/*
> +	 * This data segment starts at the beginning of a new
> +	 * TXBB, so we need to postpone its byte_count writing
> +	 * for later.
> +	 */
> +	pv[pv_counter].dseg = dseg;
> +	/*
> +	 * Zero length segment is treated as inline segment
> +	 * with zero data.
> +	 */
> +	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
> +						sbuf->data_len : 0x80000000);
> +	sbuf = sbuf->next;
> +	dseg++;
> +	nb_segs--;
> +txbb_tail_segs:
> +	/* Jump to default if there are more than two segments remaining. */
> +	switch (nb_segs) {
> +	default:
>  		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> -		dseg->lkey = rte_cpu_to_be_32(lkey);
> -		/* Calculate the needed work queue entry size for this packet */
> -		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
> -			/* MR does not exist. */
> +		if (unlikely(lkey == (uint32_t)-1)) {
>  			DEBUG("%p: unable to get MP <-> MR association",
>  			      (void *)txq);
>  			return -1;
>  		}
> -		if (likely(sbuf->data_len)) {
> -			byte_count = rte_cpu_to_be_32(sbuf->data_len);
> -		} else {
> -			/*
> -			 * Zero length segment is treated as inline segment
> -			 * with zero data.
> -			 */
> -			byte_count = RTE_BE32(0x80000000);
> +		mlx4_fill_tx_data_seg(dseg, lkey,
> +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> +				      rte_cpu_to_be_32(sbuf->data_len ?
> +						       sbuf->data_len :
> +						       0x80000000));
> +		sbuf = sbuf->next;
> +		dseg++;
> +		nb_segs--;
> +		/* fallthrough */
> +	case 2:
> +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> +		if (unlikely(lkey == (uint32_t)-1)) {
> +			DEBUG("%p: unable to get MP <-> MR association",
> +			      (void *)txq);
> +			return -1;
>  		}
> -		/*
> -		 * If the data segment is not at the beginning of a
> -		 * Tx basic block (TXBB) then write the byte count,
> -		 * else postpone the writing to just before updating the
> -		 * control segment.
> -		 */
> -		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> -			dseg->addr = rte_cpu_to_be_64(addr);
> -			dseg->lkey = rte_cpu_to_be_32(lkey);
> -#if RTE_CACHE_LINE_SIZE < 64
> -			/*
> -			 * Need a barrier here before writing the byte_count
> -			 * fields to make sure that all the data is visible
> -			 * before the byte_count field is set.
> -			 * Otherwise, if the segment begins a new cacheline,
> -			 * the HCA prefetcher could grab the 64-byte chunk and
> -			 * get a valid (!= 0xffffffff) byte count but stale
> -			 * data, and end up sending the wrong data.
> -			 */
> -			rte_io_wmb();
> -#endif /* RTE_CACHE_LINE_SIZE */
> -			dseg->byte_count = byte_count;
> -		} else {
> -			/*
> -			 * This data segment starts at the beginning of a new
> -			 * TXBB, so we need to postpone its byte_count writing
> -			 * for later.
> -			 */
> -			/* Handle WQE wraparound. */
> -			if (dseg >=
> -			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
> -				dseg = (volatile struct mlx4_wqe_data_seg *)
> -					sq->buf;
> -			dseg->addr = rte_cpu_to_be_64(addr);
> -			dseg->lkey = rte_cpu_to_be_32(lkey);
> -			pv[pv_counter].dseg = dseg;
> -			pv[pv_counter++].val = byte_count;
> +		mlx4_fill_tx_data_seg(dseg, lkey,
> +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> +				      rte_cpu_to_be_32(sbuf->data_len ?
> +						       sbuf->data_len :
> +						       0x80000000));
> +		sbuf = sbuf->next;
> +		dseg++;
> +		nb_segs--;
> +		/* fallthrough */
> +	case 1:
> +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> +		if (unlikely(lkey == (uint32_t)-1)) {
> +			DEBUG("%p: unable to get MP <-> MR association",
> +			      (void *)txq);
> +			return -1;
> +		}
> +		mlx4_fill_tx_data_seg(dseg, lkey,
> +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> +				      rte_cpu_to_be_32(sbuf->data_len ?
> +						       sbuf->data_len :
> +						       0x80000000));
> +		nb_segs--;
> +		if (nb_segs) {
> +			sbuf = sbuf->next;
> +			dseg++;
> +			goto txbb_head_seg;
>  		}
> +		/* fallthrough */
> +	case 0:
> +		break;
>  	}

I think this "switch (nb_segs)" idea is an interesting approach, but should
occur inside a loop construct as previously described.

>  	/* Write the first DWORD of each TXBB save earlier. */
>  	if (pv_counter) {
> @@ -583,7 +614,6 @@ struct pv {
>  		} srcrb;
>  		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
>  		uint32_t lkey;
> -		uintptr_t addr;
>  
>  		/* Clean up old buffer. */
>  		if (likely(elt->buf != NULL)) {
> @@ -618,24 +648,19 @@ struct pv {
>  			dseg = (volatile struct mlx4_wqe_data_seg *)
>  					((uintptr_t)ctrl +
>  					sizeof(struct mlx4_wqe_ctrl_seg));
> -			addr = rte_pktmbuf_mtod(buf, uintptr_t);
> -			rte_prefetch0((volatile void *)addr);
> -			dseg->addr = rte_cpu_to_be_64(addr);
> -			/* Memory region key (big endian). */
> +
> +			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
>  			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
> -			dseg->lkey = rte_cpu_to_be_32(lkey);
> -			if (unlikely(dseg->lkey ==
> -				rte_cpu_to_be_32((uint32_t)-1))) {
> +			if (unlikely(lkey == (uint32_t)-1)) {
>  				/* MR does not exist. */
>  				DEBUG("%p: unable to get MP <-> MR association",
>  				      (void *)txq);
>  				elt->buf = NULL;
>  				break;
>  			}
> -			/* Never be TXBB aligned, no need compiler barrier. */
> -			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> -			/* Fill the control parameters for this packet. */
> -			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
> +			mlx4_fill_tx_data_seg(dseg, lkey,
> +					      rte_pktmbuf_mtod(buf, uintptr_t),
> +					      rte_cpu_to_be_32(buf->data_len));
>  			nr_txbbs = 1;
>  		} else {
>  			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
> diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
> index 463df2b..8207232 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.h
> +++ b/drivers/net/mlx4/mlx4_rxtx.h
> @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
>  	return mlx4_txq_add_mr(txq, mp, i);
>  }
>  
> +/**
> + * Write Tx data segment to the SQ.
> + *
> + * @param dseg
> + *   Pointer to data segment in SQ.
> + * @param lkey
> + *   Memory region lkey.
> + * @param addr
> + *   Data address.
> + * @param byte_count
> + *   Big Endian bytes count of the data to send.

Big Endian => Big endian

How about using the dedicated type to properly document it?
See rte_be32_t from rte_byteorder.h.

> + */
> +static inline void
> +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
> +		       uint32_t lkey, uintptr_t addr, uint32_t byte_count)
> +{
> +	dseg->addr = rte_cpu_to_be_64(addr);
> +	dseg->lkey = rte_cpu_to_be_32(lkey);
> +#if RTE_CACHE_LINE_SIZE < 64
> +	/*
> +	 * Need a barrier here before writing the byte_count
> +	 * fields to make sure that all the data is visible
> +	 * before the byte_count field is set.
> +	 * Otherwise, if the segment begins a new cacheline,
> +	 * the HCA prefetcher could grab the 64-byte chunk and
> +	 * get a valid (!= 0xffffffff) byte count but stale
> +	 * data, and end up sending the wrong data.
> +	 */
> +	rte_io_wmb();
> +#endif /* RTE_CACHE_LINE_SIZE */
> +	dseg->byte_count = byte_count;
> +}
> +

No need to expose this function in a header file. Note that rte_cpu_*() and
rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h
respectively.
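
In other words, whichever file ends up defining the helper needs both
headers, and byte_count could carry the dedicated big endian type, e.g.
(same body as in this patch, only the parameter type changes):

    #include <rte_atomic.h>     /* rte_io_wmb() */
    #include <rte_byteorder.h>  /* rte_cpu_to_be_*(), rte_be32_t */

    static inline void
    mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
                          uint32_t lkey, uintptr_t addr,
                          rte_be32_t byte_count)
    {
            dseg->addr = rte_cpu_to_be_64(addr);
            dseg->lkey = rte_cpu_to_be_32(lkey);
    #if RTE_CACHE_LINE_SIZE < 64
            /* See the barrier rationale in the original comment. */
            rte_io_wmb();
    #endif
            dseg->byte_count = byte_count;
    }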

>  #endif /* MLX4_RXTX_H_ */
> -- 
> 1.8.3.1
> 

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] net/mlx4: merge Tx queue rings management
  2017-11-28 12:19 ` [PATCH 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
@ 2017-12-06 10:58   ` Adrien Mazarguil
  2017-12-06 11:43     ` Matan Azrad
  0 siblings, 1 reply; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:58 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:27PM +0000, Matan Azrad wrote:
> The Tx queue send ring was managed by Tx block head,tail,count and mask
> management variables which were used for managing the send queue remain
> space and next places of empty or completted work queue entries.

completted => completed

> 
> This method suffered from an actual addresses recalculation per packet,
> an unnecessary Tx block based calculations and an expensive dual
> management of Tx rings.
> 
> Move send queue ring calculation to be based on actual addresses while
> managing it by descriptors ring indexes.
> 
> Add new work queue entry pointer to the descriptor element to hold the
> appropriate entry in the send queue.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  drivers/net/mlx4/mlx4_prm.h  |  20 ++--
>  drivers/net/mlx4/mlx4_rxtx.c | 241 +++++++++++++++++++------------------------
>  drivers/net/mlx4/mlx4_rxtx.h |   1 +
>  drivers/net/mlx4/mlx4_txq.c  |  23 +++--
>  4 files changed, 126 insertions(+), 159 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
> index fcc7c12..2ca303a 100644
> --- a/drivers/net/mlx4/mlx4_prm.h
> +++ b/drivers/net/mlx4/mlx4_prm.h
> @@ -54,22 +54,18 @@
>  
>  /* Typical TSO descriptor with 16 gather entries is 352 bytes. */
>  #define MLX4_MAX_WQE_SIZE 512
> -#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
> +#define MLX4_SEG_SHIFT 4
>  
>  /* Send queue stamping/invalidating information. */
>  #define MLX4_SQ_STAMP_STRIDE 64
>  #define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
> -#define MLX4_SQ_STAMP_SHIFT 31
> +#define MLX4_SQ_OWNER_BIT 31
>  #define MLX4_SQ_STAMP_VAL 0x7fffffff
>  
>  /* Work queue element (WQE) flags. */
> -#define MLX4_BIT_WQE_OWN 0x80000000
>  #define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
>  #define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
>  
> -#define MLX4_SIZE_TO_TXBBS(size) \
> -	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
> -
>  /* CQE checksum flags. */
>  enum {
>  	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
> @@ -98,17 +94,15 @@ enum {
>  struct mlx4_sq {
>  	volatile uint8_t *buf; /**< SQ buffer. */
>  	volatile uint8_t *eob; /**< End of SQ buffer */
> -	uint32_t head; /**< SQ head counter in units of TXBBS. */
> -	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
> -	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
> -	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
> -	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
> +	uint32_t size; /**< SQ size includes headroom. */
> +	int32_t remain_size; /**< Remain WQE size in SQ. */

Remain => Remaining?

By "size", do you mean "room" as there could be several WQEs in there?

Note before reviewing the rest of this patch, the fact it's a signed integer
bothers me; it's probably a mistake. You should standardize on unsigned
values everywhere.

> +	/**< Default owner opcode with HW valid owner bit. */

The "/**<" syntax requires the comment to come after the documented
field. You should either move this line below "owner_opcode" or use "/**".

> +	uint32_t owner_opcode;
> +	uint32_t stamp; /**< Stamp value with an invalid HW owner bit. */
>  	volatile uint32_t *db; /**< Pointer to the doorbell. */
>  	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
>  };
>  
> -#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
> -
>  /* Completion queue events, numbers and masks. */
>  #define MLX4_CQ_DB_GEQ_N_MASK 0x3
>  #define MLX4_CQ_DOORBELL 0x20
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index b9cb2fc..0a8ef93 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -61,9 +61,6 @@
>  #include "mlx4_rxtx.h"
>  #include "mlx4_utils.h"
>  
> -#define WQE_ONE_DATA_SEG_SIZE \
> -	(sizeof(struct mlx4_wqe_ctrl_seg) + sizeof(struct mlx4_wqe_data_seg))
> -
>  /**
>   * Pointer-value pair structure used in tx_post_send for saving the first
>   * DWORD (32 byte) of a TXBB.
> @@ -268,52 +265,48 @@ struct pv {
>   *
>   * @param sq
>   *   Pointer to the SQ structure.
> - * @param index
> - *   Index of the freed WQE.
> - * @param num_txbbs
> - *   Number of blocks to stamp.
> - *   If < 0 the routine will use the size written in the WQ entry.
> - * @param owner
> - *   The value of the WQE owner bit to use in the stamp.
> + * @param wqe
> + *   Pointer of WQE to stamp.

Looks like it's not just a simple pointer to the WQE to stamp, since this
function also stores the address of the next WQE in the provided buffer
(uint32_t **wqe). It's not documented as such.
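
Something along these lines would cover it (wording is only a suggestion):

 * @param[in, out] wqe
 *   Pointer to the address of the WQE to stamp. On return it is updated
 *   to the address of the next WQE in the send queue.
 *
 * @return
 *   Size in bytes of the stamped WQE.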

>   *
>   * @return
> - *   The number of Tx basic blocs (TXBB) the WQE contained.
> + *   WQE size.
>   */
> -static int
> -mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
> +static uint32_t
> +mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
>  {
> -	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
> -					  (!!owner << MLX4_SQ_STAMP_SHIFT));
> -	volatile uint8_t *wqe = mlx4_get_send_wqe(sq,
> -						(index & sq->txbb_cnt_mask));
> -	volatile uint32_t *ptr = (volatile uint32_t *)wqe;
> -	int i;
> -	int txbbs_size;
> -	int num_txbbs;
> -
> +	uint32_t stamp = sq->stamp;
> +	volatile uint32_t *next_txbb = *wqe;
>  	/* Extract the size from the control segment of the WQE. */
> -	num_txbbs = MLX4_SIZE_TO_TXBBS((((volatile struct mlx4_wqe_ctrl_seg *)
> -					 wqe)->fence_size & 0x3f) << 4);
> -	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
> +	uint32_t size = RTE_ALIGN((uint32_t)
> +				  ((((volatile struct mlx4_wqe_ctrl_seg *)
> +				     next_txbb)->fence_size & 0x3f) << 4),
> +				  MLX4_TXBB_SIZE);
> +	uint32_t size_cd = size;
> +
>  	/* Optimize the common case when there is no wrap-around. */
> -	if (wqe + txbbs_size <= sq->eob) {
> +	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
>  		/* Stamp the freed descriptor. */
> -		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
> -			*ptr = stamp;
> -			ptr += MLX4_SQ_STAMP_DWORDS;
> -		}
> +		do {
> +			*next_txbb = stamp;
> +			next_txbb += MLX4_SQ_STAMP_DWORDS;
> +			size_cd -= MLX4_TXBB_SIZE;
> +		} while (size_cd);
>  	} else {
>  		/* Stamp the freed descriptor. */
> -		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
> -			*ptr = stamp;
> -			ptr += MLX4_SQ_STAMP_DWORDS;
> -			if ((volatile uint8_t *)ptr >= sq->eob) {
> -				ptr = (volatile uint32_t *)sq->buf;
> -				stamp ^= RTE_BE32(0x80000000);
> +		do {
> +			*next_txbb = stamp;
> +			next_txbb += MLX4_SQ_STAMP_DWORDS;
> +			if ((volatile uint8_t *)next_txbb >= sq->eob) {
> +				next_txbb = (volatile uint32_t *)sq->buf;
> +				/* Flip invalid stamping ownership. */
> +				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
> +				sq->stamp = stamp;
>  			}
> -		}
> +			size_cd -= MLX4_TXBB_SIZE;
> +		} while (size_cd);
>  	}
> -	return num_txbbs;
> +	*wqe = next_txbb;
> +	return size;
>  }
>  
>  /**
> @@ -326,24 +319,22 @@ struct pv {
>   *
>   * @param txq
>   *   Pointer to Tx queue structure.
> - *
> - * @return
> - *   0 on success, -1 on failure.
>   */
> -static int
> +static void
>  mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
>  				  struct mlx4_sq *sq)
>  {
> -	unsigned int elts_comp = txq->elts_comp;
>  	unsigned int elts_tail = txq->elts_tail;
> -	unsigned int sq_tail = sq->tail;
>  	struct mlx4_cq *cq = &txq->mcq;
>  	volatile struct mlx4_cqe *cqe;
>  	uint32_t cons_index = cq->cons_index;
> -	uint16_t new_index;
> -	uint16_t nr_txbbs = 0;
> -	int pkts = 0;
> -
> +	volatile uint32_t *first_wqe;
> +	volatile uint32_t *next_wqe = (volatile uint32_t *)
> +			((&(*txq->elts)[elts_tail])->wqe);
> +	volatile uint32_t *last_wqe;
> +	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
> +			 MLX4_TXBB_SHIFT) - 1;
> +	uint32_t pkts = 0;
>  	/*
>  	 * Traverse over all CQ entries reported and handle each WQ entry
>  	 * reported by them.
> @@ -353,11 +344,11 @@ struct pv {
>  		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
>  		    !!(cons_index & cq->cqe_cnt)))
>  			break;
> +#ifndef NDEBUG
>  		/*
>  		 * Make sure we read the CQE after we read the ownership bit.
>  		 */
>  		rte_io_rmb();
> -#ifndef NDEBUG
>  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
>  			     MLX4_CQE_OPCODE_ERROR)) {
>  			volatile struct mlx4_err_cqe *cqe_err =
> @@ -366,41 +357,32 @@ struct pv {
>  			      " syndrome: 0x%x\n",
>  			      (void *)txq, cqe_err->vendor_err,
>  			      cqe_err->syndrome);
> +			break;
>  		}
>  #endif /* NDEBUG */
> -		/* Get WQE index reported in the CQE. */
> -		new_index =
> -			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
> +		/* Get WQE address buy index from the CQE. */
> +		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
> +			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
> +			 MLX4_TXBB_SHIFT));
>  		do {
>  			/* Free next descriptor. */
> -			sq_tail += nr_txbbs;
> -			nr_txbbs =
> -				mlx4_txq_stamp_freed_wqe(sq,
> -				     sq_tail & sq->txbb_cnt_mask,
> -				     !!(sq_tail & sq->txbb_cnt));
> +			first_wqe = next_wqe;
> +			sq->remain_size +=
> +				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
>  			pkts++;
> -		} while ((sq_tail & sq->txbb_cnt_mask) != new_index);
> +		} while (first_wqe != last_wqe);
>  		cons_index++;
>  	} while (1);
>  	if (unlikely(pkts == 0))
> -		return 0;
> -	/* Update CQ. */
> +		return;
> +	/* Update CQ consumer index. */
>  	cq->cons_index = cons_index;
> -	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> -	sq->tail = sq_tail + nr_txbbs;
> -	/* Update the list of packets posted for transmission. */
> -	elts_comp -= pkts;
> -	assert(elts_comp <= txq->elts_comp);
> -	/*
> -	 * Assume completion status is successful as nothing can be done about
> -	 * it anyway.
> -	 */
> +	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
> +	txq->elts_comp -= pkts;
>  	elts_tail += pkts;
>  	if (elts_tail >= elts_n)
>  		elts_tail -= elts_n;
>  	txq->elts_tail = elts_tail;
> -	txq->elts_comp = elts_comp;
> -	return 0;
>  }
>  
>  /**
> @@ -421,41 +403,27 @@ struct pv {
>  	return buf->pool;
>  }
>  
> -static int
> +static volatile struct mlx4_wqe_ctrl_seg *
>  mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
> -		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
> +		   volatile struct mlx4_wqe_ctrl_seg *ctrl)

Can you use this opportunity to document this function?
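
For instance (only a sketch based on the new signature):

/**
 * Write data segments and control segment information of a multi-segment
 * packet into the send queue.
 *
 * @param buf
 *   Pointer to the first packet mbuf.
 * @param txq
 *   Pointer to Tx queue structure.
 * @param ctrl
 *   Pointer to the WQE control segment of this packet.
 *
 * @return
 *   Pointer to the control segment of the next WQE on success,
 *   NULL on failure.
 */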

>  {
> -	int wqe_real_size;
> -	int nr_txbbs;
>  	struct pv *pv = (struct pv *)txq->bounce_buf;
>  	struct mlx4_sq *sq = &txq->msq;
> -	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> -	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> -	volatile struct mlx4_wqe_data_seg *dseg;
>  	struct rte_mbuf *sbuf = buf;
>  	uint32_t lkey;
>  	int pv_counter = 0;
>  	int nb_segs = buf->nb_segs;
> +	int32_t wqe_size;
> +	volatile struct mlx4_wqe_data_seg *dseg =
> +		(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
>  
> -	/* Calculate the needed work queue entry size for this packet. */
> -	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
> -		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> -	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
> -	/*
> -	 * Check that there is room for this WQE in the send queue and that
> -	 * the WQE size is legal.
> -	 */
> -	if (((sq->head - sq->tail) + nr_txbbs +
> -				sq->headroom_txbbs) >= sq->txbb_cnt ||
> -			nr_txbbs > MLX4_MAX_WQE_TXBBS) {
> -		return -1;
> -	}
> -	/* Get the control and data entries of the WQE. */
> -	ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
> -			mlx4_get_send_wqe(sq, head_idx);
> -	dseg = (volatile struct mlx4_wqe_data_seg *)
> -			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
> -	*pctrl = ctrl;
> +	ctrl->fence_size = 1 + nb_segs;
> +	wqe_size = RTE_ALIGN((int32_t)(ctrl->fence_size << MLX4_SEG_SHIFT),
> +			     MLX4_TXBB_SIZE);
> +	/* Validate WQE size and WQE space in the send queue. */
> +	if (sq->remain_size < wqe_size ||
> +	    wqe_size > MLX4_MAX_WQE_SIZE)
> +		return NULL;
>  	/*
>  	 * Fill the data segments with buffer information.
>  	 * First WQE TXBB head segment is always control segment,
> @@ -469,7 +437,7 @@ struct pv {
>  	if (unlikely(lkey == (uint32_t)-1)) {
>  		DEBUG("%p: unable to get MP <-> MR association",
>  		      (void *)txq);
> -		return -1;
> +		return NULL;
>  	}
>  	/* Handle WQE wraparound. */
>  	if (dseg >=
> @@ -501,7 +469,7 @@ struct pv {
>  		if (unlikely(lkey == (uint32_t)-1)) {
>  			DEBUG("%p: unable to get MP <-> MR association",
>  			      (void *)txq);
> -			return -1;
> +			return NULL;
>  		}
>  		mlx4_fill_tx_data_seg(dseg, lkey,
>  				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> @@ -517,7 +485,7 @@ struct pv {
>  		if (unlikely(lkey == (uint32_t)-1)) {
>  			DEBUG("%p: unable to get MP <-> MR association",
>  			      (void *)txq);
> -			return -1;
> +			return NULL;
>  		}
>  		mlx4_fill_tx_data_seg(dseg, lkey,
>  				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> @@ -533,7 +501,7 @@ struct pv {
>  		if (unlikely(lkey == (uint32_t)-1)) {
>  			DEBUG("%p: unable to get MP <-> MR association",
>  			      (void *)txq);
> -			return -1;
> +			return NULL;
>  		}
>  		mlx4_fill_tx_data_seg(dseg, lkey,
>  				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> @@ -557,9 +525,10 @@ struct pv {
>  		for (--pv_counter; pv_counter  >= 0; pv_counter--)
>  			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
>  	}
> -	/* Fill the control parameters for this packet. */
> -	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
> -	return nr_txbbs;
> +	sq->remain_size -= wqe_size;
> +	/* Align next WQE address to the next TXBB. */
> +	return (volatile struct mlx4_wqe_ctrl_seg *)
> +		((volatile uint8_t *)ctrl + wqe_size);
>  }
>  
>  /**
> @@ -585,7 +554,8 @@ struct pv {
>  	unsigned int i;
>  	unsigned int max;
>  	struct mlx4_sq *sq = &txq->msq;
> -	int nr_txbbs;
> +	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> +	struct txq_elt *elt;
>  
>  	assert(txq->elts_comp_cd != 0);
>  	if (likely(txq->elts_comp != 0))
> @@ -599,29 +569,30 @@ struct pv {
>  	--max;
>  	if (max > pkts_n)
>  		max = pkts_n;
> +	elt = &(*txq->elts)[elts_head];
> +	/* Each element saves its appropriate work queue. */
> +	ctrl = elt->wqe;
>  	for (i = 0; (i != max); ++i) {
>  		struct rte_mbuf *buf = pkts[i];
>  		unsigned int elts_head_next =
>  			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
>  		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
> -		struct txq_elt *elt = &(*txq->elts)[elts_head];
> -		uint32_t owner_opcode = MLX4_OPCODE_SEND;
> -		volatile struct mlx4_wqe_ctrl_seg *ctrl;
> -		volatile struct mlx4_wqe_data_seg *dseg;
> +		uint32_t owner_opcode = sq->owner_opcode;
> +		volatile struct mlx4_wqe_data_seg *dseg =
> +				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
> +		volatile struct mlx4_wqe_ctrl_seg *ctrl_next;
>  		union {
>  			uint32_t flags;
>  			uint16_t flags16[2];
>  		} srcrb;
> -		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
>  		uint32_t lkey;
>  
>  		/* Clean up old buffer. */
>  		if (likely(elt->buf != NULL)) {
>  			struct rte_mbuf *tmp = elt->buf;
> -

Empty line following variable declarations should stay.

>  #ifndef NDEBUG
>  			/* Poisoning. */
> -			memset(elt, 0x66, sizeof(*elt));
> +			elt->buf = (struct rte_mbuf *)0x6666666666666666;

Note this address depends on pointer size, which may in turn trigger a
compilation warning/error. Keep memset() on elt->buf.
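
That is, keep poisoning through memset() but limit it to the buf field,
e.g.:

#ifndef NDEBUG
	/* Poisoning. */
	memset(&elt->buf, 0x66, sizeof(elt->buf));
#endif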

>  #endif
>  			/* Faster than rte_pktmbuf_free(). */
>  			do {
> @@ -633,23 +604,11 @@ struct pv {
>  		}
>  		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
>  		if (buf->nb_segs == 1) {
> -			/*
> -			 * Check that there is room for this WQE in the send
> -			 * queue and that the WQE size is legal
> -			 */
> -			if (((sq->head - sq->tail) + 1 + sq->headroom_txbbs) >=
> -			     sq->txbb_cnt || 1 > MLX4_MAX_WQE_TXBBS) {
> +			/* Validate WQE space in the send queue. */
> +			if (sq->remain_size < MLX4_TXBB_SIZE) {
>  				elt->buf = NULL;
>  				break;
>  			}
> -			/* Get the control and data entries of the WQE. */
> -			ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
> -					mlx4_get_send_wqe(sq, head_idx);
> -			dseg = (volatile struct mlx4_wqe_data_seg *)
> -					((uintptr_t)ctrl +
> -					sizeof(struct mlx4_wqe_ctrl_seg));
> -
> -			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
>  			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
>  			if (unlikely(lkey == (uint32_t)-1)) {
>  				/* MR does not exist. */
> @@ -658,23 +617,33 @@ struct pv {
>  				elt->buf = NULL;
>  				break;
>  			}
> -			mlx4_fill_tx_data_seg(dseg, lkey,
> +			mlx4_fill_tx_data_seg(dseg++, lkey,
>  					      rte_pktmbuf_mtod(buf, uintptr_t),
>  					      rte_cpu_to_be_32(buf->data_len));
> -			nr_txbbs = 1;
> +			/* Set WQE size in 16-byte units. */
> +			ctrl->fence_size = 0x2;
> +			sq->remain_size -= MLX4_TXBB_SIZE;
> +			/* Align next WQE address to the next TXBB. */
> +			ctrl_next = ctrl + 0x4;
>  		} else {
> -			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
> -			if (nr_txbbs < 0) {
> +			ctrl_next = mlx4_tx_burst_segs(buf, txq, ctrl);
> +			if (!ctrl_next) {
>  				elt->buf = NULL;
>  				break;
>  			}
>  		}
> +		/* Hold SQ ring wrap around. */
> +		if ((volatile uint8_t *)ctrl_next >= sq->eob) {
> +			ctrl_next = (volatile struct mlx4_wqe_ctrl_seg *)
> +				((volatile uint8_t *)ctrl_next - sq->size);
> +			/* Flip HW valid ownership. */
> +			sq->owner_opcode ^= 0x1 << MLX4_SQ_OWNER_BIT;
> +		}
>  		/*
>  		 * For raw Ethernet, the SOLICIT flag is used to indicate
>  		 * that no ICRC should be calculated.
>  		 */
> -		txq->elts_comp_cd -= nr_txbbs;
> -		if (unlikely(txq->elts_comp_cd <= 0)) {
> +		if (--txq->elts_comp_cd == 0) {
>  			txq->elts_comp_cd = txq->elts_comp_cd_init;
>  			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
>  					       MLX4_WQE_CTRL_CQ_UPDATE);
> @@ -720,13 +689,13 @@ struct pv {
>  		 * executing as soon as we do).
>  		 */
>  		rte_io_wmb();
> -		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> -					      ((sq->head & sq->txbb_cnt) ?
> -						       MLX4_BIT_WQE_OWN : 0));
> -		sq->head += nr_txbbs;
> +		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
>  		elt->buf = buf;
>  		bytes_sent += buf->pkt_len;
>  		elts_head = elts_head_next;
> +		elt_next->wqe = ctrl_next;
> +		ctrl = ctrl_next;
> +		elt = elt_next;
>  	}
>  	/* Take a shortcut if nothing must be sent. */
>  	if (unlikely(i == 0))
> diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
> index 8207232..c092afa 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.h
> +++ b/drivers/net/mlx4/mlx4_rxtx.h
> @@ -105,6 +105,7 @@ struct mlx4_rss {
>  /** Tx element. */
>  struct txq_elt {
>  	struct rte_mbuf *buf; /**< Buffer. */
> +	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
>  };
>  
>  /** Rx queue counters. */
> diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
> index 7882a4d..4c7b62a 100644
> --- a/drivers/net/mlx4/mlx4_txq.c
> +++ b/drivers/net/mlx4/mlx4_txq.c
> @@ -84,6 +84,7 @@
>  		assert(elt->buf != NULL);
>  		rte_pktmbuf_free(elt->buf);
>  		elt->buf = NULL;
> +		elt->wqe = NULL;
>  		if (++elts_tail == RTE_DIM(*elts))
>  			elts_tail = 0;
>  	}
> @@ -163,20 +164,19 @@ struct txq_mp2mr_mbuf_check_data {
>  	struct mlx4_cq *cq = &txq->mcq;
>  	struct mlx4dv_qp *dqp = mlxdv->qp.out;
>  	struct mlx4dv_cq *dcq = mlxdv->cq.out;
> -	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
>  
> -	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
>  	/* Total length, including headroom and spare WQEs. */
> -	sq->eob = sq->buf + sq_size;
> -	sq->head = 0;
> -	sq->tail = 0;
> -	sq->txbb_cnt =
> -		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
> -	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
> +	sq->size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
> +	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
> +	sq->eob = sq->buf + sq->size;
> +	uint32_t headroom_size = 2048 + (1 << dqp->sq.wqe_shift);
> +	/* Continuous headroom size bytes must always stay freed. */
> +	sq->remain_size = sq->size - headroom_size;
> +	sq->owner_opcode = MLX4_OPCODE_SEND | (0 << MLX4_SQ_OWNER_BIT);
> +	sq->stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
> +				     (0 << MLX4_SQ_OWNER_BIT));
>  	sq->db = dqp->sdb;
>  	sq->doorbell_qpn = dqp->doorbell_qpn;
> -	sq->headroom_txbbs =
> -		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
>  	cq->buf = dcq->buf.buf;
>  	cq->cqe_cnt = dcq->cqe_cnt;
>  	cq->set_ci_db = dcq->set_ci_db;
> @@ -362,6 +362,9 @@ struct txq_mp2mr_mbuf_check_data {
>  		goto error;
>  	}
>  	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
> +	/* Save first wqe pointer in the first element. */
> +	(&(*txq->elts)[0])->wqe =
> +		(volatile struct mlx4_wqe_ctrl_seg *)txq->msq.buf;
>  	/* Pre-register known mempools. */
>  	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
>  	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
> -- 
> 1.8.3.1
> 

Otherwise this patch looks OK.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/8] net/mlx4: mitigate Tx send entry size calculations
  2017-11-28 12:19 ` [PATCH 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
@ 2017-12-06 10:59   ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:59 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:28PM +0000, Matan Azrad wrote:
> The previuse code took a send queue entry size for stamping from the
> send queue entry pointed by completion queue entry; This 2 reads were
> done per packet in completion stage.
> 
> The completion burst packets number is managed by fixed size stored in
> Tx queue, so we can infer that each valid completion entry actually frees
> the next fixed number packets.
> 
> The descriptors ring holds the send queue entry, so we just can infer
> all the completion burst packet entries size by simple calculation and
> prevent calculations per packet.
> 
> Adjust completion functions to free full completion bursts packets
> by one time and prevent per packet work queue entry reads and
> calculations.
> 
> Save only start of completion burst or Tx burst send queue entry
> pointers in the appropriate descriptor element.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/8] net/mlx4: align Tx descriptors number
  2017-11-28 12:19 ` [PATCH 7/8] net/mlx4: align Tx descriptors number Matan Azrad
@ 2017-12-06 10:59   ` Adrien Mazarguil
  2017-12-06 11:44     ` Matan Azrad
  0 siblings, 1 reply; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:59 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:29PM +0000, Matan Azrad wrote:
> Using power of 2 descriptors number makes the ring management easier
> and allows to use mask operation instead of wraparound conditions.
> 
> Adjust Tx descriptor number to be power of 2 and change calculation to
> use mask accordingly.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

A few minor comments, see below.

> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 22 ++++++++--------------
>  drivers/net/mlx4/mlx4_txq.c  | 20 ++++++++++++--------
>  2 files changed, 20 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index 30f2930..b5aaf4c 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -314,7 +314,7 @@ struct pv {
>   *   Pointer to Tx queue structure.
>   */
>  static void
> -mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
> +mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
>  				  struct mlx4_sq *sq)

Documentation needs updating.
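
E.g. the parameter block could gain something like (sketch only):

 * @param elts_m
 *   Tx elements number mask (elts_n - 1).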

>  {
>  	unsigned int elts_tail = txq->elts_tail;
> @@ -355,13 +355,11 @@ struct pv {
>  	if (unlikely(!completed))
>  		return;
>  	/* First stamping address is the end of the last one. */
> -	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
> +	first_txbb = (&(*txq->elts)[elts_tail & elts_m])->eocb;
>  	elts_tail += completed;
> -	if (elts_tail >= elts_n)
> -		elts_tail -= elts_n;
>  	/* The new tail element holds the end address. */
>  	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
> -		(&(*txq->elts)[elts_tail])->eocb);
> +		(&(*txq->elts)[elts_tail & elts_m])->eocb);
>  	/* Update CQ consumer index. */
>  	cq->cons_index = cons_index;
>  	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
> @@ -534,6 +532,7 @@ struct pv {
>  	struct txq *txq = (struct txq *)dpdk_txq;
>  	unsigned int elts_head = txq->elts_head;
>  	const unsigned int elts_n = txq->elts_n;
> +	const unsigned int elts_m = elts_n - 1;
>  	unsigned int bytes_sent = 0;
>  	unsigned int i;
>  	unsigned int max;
> @@ -543,24 +542,20 @@ struct pv {
>  
>  	assert(txq->elts_comp_cd != 0);
>  	if (likely(txq->elts_comp != 0))
> -		mlx4_txq_complete(txq, elts_n, sq);
> +		mlx4_txq_complete(txq, elts_m, sq);
>  	max = (elts_n - (elts_head - txq->elts_tail));
> -	if (max > elts_n)
> -		max -= elts_n;
>  	assert(max >= 1);
>  	assert(max <= elts_n);
>  	/* Always leave one free entry in the ring. */
>  	--max;
>  	if (max > pkts_n)
>  		max = pkts_n;
> -	elt = &(*txq->elts)[elts_head];
> +	elt = &(*txq->elts)[elts_head & elts_m];
>  	/* First Tx burst element saves the next WQE control segment. */
>  	ctrl = elt->wqe;
>  	for (i = 0; (i != max); ++i) {
>  		struct rte_mbuf *buf = pkts[i];
> -		unsigned int elts_head_next =
> -			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
> -		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
> +		struct txq_elt *elt_next = &(*txq->elts)[++elts_head & elts_m];
>  		uint32_t owner_opcode = sq->owner_opcode;
>  		volatile struct mlx4_wqe_data_seg *dseg =
>  				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
> @@ -678,7 +673,6 @@ struct pv {
>  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
>  		elt->buf = buf;
>  		bytes_sent += buf->pkt_len;
> -		elts_head = elts_head_next;
>  		ctrl = ctrl_next;
>  		elt = elt_next;
>  	}
> @@ -694,7 +688,7 @@ struct pv {
>  	rte_wmb();
>  	/* Ring QP doorbell. */
>  	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
> -	txq->elts_head = elts_head;
> +	txq->elts_head += i;
>  	txq->elts_comp += i;
>  	return i;
>  }
> diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
> index 4c7b62a..253075a 100644
> --- a/drivers/net/mlx4/mlx4_txq.c
> +++ b/drivers/net/mlx4/mlx4_txq.c
> @@ -76,17 +76,16 @@
>  	unsigned int elts_head = txq->elts_head;
>  	unsigned int elts_tail = txq->elts_tail;
>  	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
> +	unsigned int elts_m = txq->elts_n - 1;
>  
>  	DEBUG("%p: freeing WRs", (void *)txq);
>  	while (elts_tail != elts_head) {
> -		struct txq_elt *elt = &(*elts)[elts_tail];
> +		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
>  
>  		assert(elt->buf != NULL);
>  		rte_pktmbuf_free(elt->buf);
>  		elt->buf = NULL;
>  		elt->wqe = NULL;
> -		if (++elts_tail == RTE_DIM(*elts))
> -			elts_tail = 0;
>  	}
>  	txq->elts_tail = txq->elts_head;
>  }
> @@ -208,7 +207,9 @@ struct txq_mp2mr_mbuf_check_data {
>  	struct mlx4dv_obj mlxdv;
>  	struct mlx4dv_qp dv_qp;
>  	struct mlx4dv_cq dv_cq;
> -	struct txq_elt (*elts)[desc];
> +	uint32_t elts_size = desc > 0x1000 ? 0x1000 :
> +		rte_align32pow2((uint32_t)desc);

Where is that magical 0x1000 value coming from? It should at least come
through a macro definition.
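
E.g. (the macro name below is only a suggestion):

/* Maximum number of Tx descriptors. */
#define MLX4_TX_MAX_DESC 4096

	uint32_t elts_size = desc > MLX4_TX_MAX_DESC ? MLX4_TX_MAX_DESC :
		rte_align32pow2((uint32_t)desc);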

> +	struct txq_elt (*elts)[elts_size];
>  	struct ibv_qp_init_attr qp_init_attr;
>  	struct txq *txq;
>  	uint8_t *bounce_buf;
> @@ -247,11 +248,14 @@ struct txq_mp2mr_mbuf_check_data {
>  		      (void *)dev, idx);
>  		return -rte_errno;
>  	}
> -	if (!desc) {
> -		rte_errno = EINVAL;
> -		ERROR("%p: invalid number of Tx descriptors", (void *)dev);
> -		return -rte_errno;
> +	if ((uint32_t)desc != elts_size) {
> +		desc = (uint16_t)elts_size;
> +		WARN("%p: changed number of descriptors in TX queue %u"
> +		     " to be power of two (%d)",
> +		     (void *)dev, idx, desc);

You should display the same message as in mlx4_rxq.c for consistency
(also TX => Tx).

>  	}
> +	DEBUG("%p: configuring queue %u for %u descriptors",
> +	      (void *)dev, idx, desc);

Unnecessary duplicated debugging message already printed at the beginning of
this function. Yes this is a different value but WARN() made that pretty
clear.

>  	/* Allocate and initialize Tx queue. */
>  	mlx4_zmallocv_socket("TXQ", vec, RTE_DIM(vec), socket);
>  	if (!txq) {
> -- 
> 1.8.3.1
> 

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] net/mlx4: remove Tx completion elements counter
  2017-11-28 12:19 ` [PATCH 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
@ 2017-12-06 10:59   ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 10:59 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Tue, Nov 28, 2017 at 12:19:30PM +0000, Matan Azrad wrote:
> This counter saved the descriptor elements which are waiting to be
> completted and was used to know if completion function should be

completted => completed

> called.
> 
> This completion check can be done by other elements management
> variables and we can prevent this counter management.
> 
> Remove this counter and derive the completion check from other elements
> management variables.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

It's nice to finally get rid of this useless counter,

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
  2017-12-06 10:58   ` Adrien Mazarguil
@ 2017-12-06 11:29     ` Matan Azrad
  2017-12-06 11:55       ` Adrien Mazarguil
  0 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 11:29 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, December 6, 2017 12:59 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
> 
> On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote:
> > mlx4 Tx block can handle up to 4 data segments or control segment + up
> > to 3 data segments. The first data segment in each not first Tx block
> > must validate Tx queue wraparound and must use IO memory barrier
> > before writing the byte count.
> >
> > The previous multi-segment code used "for" loop to iterate over all
> > packet segments and separated first Tx block data case by "if"
> > statments.
> 
> statments => statements
> 
> >
> > Use switch case and unconditional branches instead of "for" loop can
> > optimize the case and prevents the unnecessary checks for each data
> > segment; This hints to compiler to create opitimized jump table.
> 
> opitimized => optimized
> 
> >
> > Optimize this case by switch case and unconditional branches usage.
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 165
> > +++++++++++++++++++++++++------------------
> >  drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
> >  2 files changed, 128 insertions(+), 70 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > b/drivers/net/mlx4/mlx4_rxtx.c index 1d8240a..b9cb2fc 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -432,15 +432,14 @@ struct pv {
> >  	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> >  	volatile struct mlx4_wqe_data_seg *dseg;
> > -	struct rte_mbuf *sbuf;
> > +	struct rte_mbuf *sbuf = buf;
> >  	uint32_t lkey;
> > -	uintptr_t addr;
> > -	uint32_t byte_count;
> >  	int pv_counter = 0;
> > +	int nb_segs = buf->nb_segs;
> >
> >  	/* Calculate the needed work queue entry size for this packet. */
> >  	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
> > -		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> > +		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> >  	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
> >  	/*
> >  	 * Check that there is room for this WQE in the send queue and that
> > @@ -457,67 +456,99 @@ struct pv {
> >  	dseg = (volatile struct mlx4_wqe_data_seg *)
> >  			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
> >  	*pctrl = ctrl;
> > -	/* Fill the data segments with buffer information. */
> > -	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> > -		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> > -		rte_prefetch0((volatile void *)addr);
> > -		/* Memory region key (big endian) for this memory pool. */
> > +	/*
> > +	 * Fill the data segments with buffer information.
> > +	 * First WQE TXBB head segment is always control segment,
> > +	 * so jump to tail TXBB data segments code for the first
> > +	 * WQE data segments filling.
> > +	 */
> > +	goto txbb_tail_segs;
> > +txbb_head_seg:
> 
> I'm not fundamentally opposed to "goto" unlike a lot of people out there,
> but this doesn't look good. It's OK to use goto for error cases and to extricate
> yourself when trapped in an inner loop, also in some optimization scenarios
> where it sometimes makes sense, but not when the same can be achieved
> through standard loop constructs and keywords.
> 
> In this case I'm under the impression you could have managed with a do { ... }
> while (...) construct. You need to try harder to reorganize these changes or
> prove it can't be done without negatively impacting performance.
> 
> Doing so should make this patch shorter as well.
> 

I noticed this could be done with a loop and without unconditional branches, but when I checked it I found a nice performance improvement doing it this way.
When I used a loop instead, performance degraded.
 
> > +	/* Memory region key (big endian) for this memory pool. */
> > +	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +	if (unlikely(lkey == (uint32_t)-1)) {
> > +		DEBUG("%p: unable to get MP <-> MR association",
> > +		      (void *)txq);
> > +		return -1;
> > +	}
> > +	/* Handle WQE wraparound. */
> > +	if (dseg >=
> > +		(volatile struct mlx4_wqe_data_seg *)sq->eob)
> > +		dseg = (volatile struct mlx4_wqe_data_seg *)
> > +			sq->buf;
> > +	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf,
> uintptr_t));
> > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > +	/*
> > +	 * This data segment starts at the beginning of a new
> > +	 * TXBB, so we need to postpone its byte_count writing
> > +	 * for later.
> > +	 */
> > +	pv[pv_counter].dseg = dseg;
> > +	/*
> > +	 * Zero length segment is treated as inline segment
> > +	 * with zero data.
> > +	 */
> > +	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
> > +						sbuf->data_len :
> 0x80000000);
> > +	sbuf = sbuf->next;
> > +	dseg++;
> > +	nb_segs--;
> > +txbb_tail_segs:
> > +	/* Jump to default if there are more than two segments remaining.
> */
> > +	switch (nb_segs) {
> > +	default:
> >  		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > -		dseg->lkey = rte_cpu_to_be_32(lkey);
> > -		/* Calculate the needed work queue entry size for this
> packet */
> > -		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
> > -			/* MR does not exist. */
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> >  			DEBUG("%p: unable to get MP <-> MR association",
> >  			      (void *)txq);
> >  			return -1;
> >  		}
> > -		if (likely(sbuf->data_len)) {
> > -			byte_count = rte_cpu_to_be_32(sbuf->data_len);
> > -		} else {
> > -			/*
> > -			 * Zero length segment is treated as inline segment
> > -			 * with zero data.
> > -			 */
> > -			byte_count = RTE_BE32(0x80000000);
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		sbuf = sbuf->next;
> > +		dseg++;
> > +		nb_segs--;
> > +		/* fallthrough */
> > +	case 2:
> > +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> > +			DEBUG("%p: unable to get MP <-> MR association",
> > +			      (void *)txq);
> > +			return -1;
> >  		}
> > -		/*
> > -		 * If the data segment is not at the beginning of a
> > -		 * Tx basic block (TXBB) then write the byte count,
> > -		 * else postpone the writing to just before updating the
> > -		 * control segment.
> > -		 */
> > -		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -#if RTE_CACHE_LINE_SIZE < 64
> > -			/*
> > -			 * Need a barrier here before writing the byte_count
> > -			 * fields to make sure that all the data is visible
> > -			 * before the byte_count field is set.
> > -			 * Otherwise, if the segment begins a new cacheline,
> > -			 * the HCA prefetcher could grab the 64-byte chunk
> and
> > -			 * get a valid (!= 0xffffffff) byte count but stale
> > -			 * data, and end up sending the wrong data.
> > -			 */
> > -			rte_io_wmb();
> > -#endif /* RTE_CACHE_LINE_SIZE */
> > -			dseg->byte_count = byte_count;
> > -		} else {
> > -			/*
> > -			 * This data segment starts at the beginning of a new
> > -			 * TXBB, so we need to postpone its byte_count
> writing
> > -			 * for later.
> > -			 */
> > -			/* Handle WQE wraparound. */
> > -			if (dseg >=
> > -			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
> > -				dseg = (volatile struct mlx4_wqe_data_seg *)
> > -					sq->buf;
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -			pv[pv_counter].dseg = dseg;
> > -			pv[pv_counter++].val = byte_count;
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		sbuf = sbuf->next;
> > +		dseg++;
> > +		nb_segs--;
> > +		/* fallthrough */
> > +	case 1:
> > +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> > +			DEBUG("%p: unable to get MP <-> MR association",
> > +			      (void *)txq);
> > +			return -1;
> > +		}
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		nb_segs--;
> > +		if (nb_segs) {
> > +			sbuf = sbuf->next;
> > +			dseg++;
> > +			goto txbb_head_seg;
> >  		}
> > +		/* fallthrough */
> > +	case 0:
> > +		break;
> >  	}
> 
> I think this "switch (nb_segs)" idea is an interesting approach, but should
> occur inside a loop construct as previously described.
> 

Same comment as above.

> >  	/* Write the first DWORD of each TXBB save earlier. */
> >  	if (pv_counter) {
> > @@ -583,7 +614,6 @@ struct pv {
> >  		} srcrb;
> >  		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  		uint32_t lkey;
> > -		uintptr_t addr;
> >
> >  		/* Clean up old buffer. */
> >  		if (likely(elt->buf != NULL)) {
> > @@ -618,24 +648,19 @@ struct pv {
> >  			dseg = (volatile struct mlx4_wqe_data_seg *)
> >  					((uintptr_t)ctrl +
> >  					sizeof(struct mlx4_wqe_ctrl_seg));
> > -			addr = rte_pktmbuf_mtod(buf, uintptr_t);
> > -			rte_prefetch0((volatile void *)addr);
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			/* Memory region key (big endian). */
> > +
> > +			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4)
> & 0x3f;
> >  			lkey = mlx4_txq_mp2mr(txq,
> mlx4_txq_mb2mp(buf));
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -			if (unlikely(dseg->lkey ==
> > -				rte_cpu_to_be_32((uint32_t)-1))) {
> > +			if (unlikely(lkey == (uint32_t)-1)) {
> >  				/* MR does not exist. */
> >  				DEBUG("%p: unable to get MP <-> MR
> association",
> >  				      (void *)txq);
> >  				elt->buf = NULL;
> >  				break;
> >  			}
> > -			/* Never be TXBB aligned, no need compiler barrier.
> */
> > -			dseg->byte_count = rte_cpu_to_be_32(buf-
> >data_len);
> > -			/* Fill the control parameters for this packet. */
> > -			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4)
> & 0x3f;
> > +			mlx4_fill_tx_data_seg(dseg, lkey,
> > +					      rte_pktmbuf_mtod(buf,
> uintptr_t),
> > +					      rte_cpu_to_be_32(buf-
> >data_len));
> >  			nr_txbbs = 1;
> >  		} else {
> >  			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl); diff -
> -git
> > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h index
> > 463df2b..8207232 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev
> *dev, uint16_t idx,
> >  	return mlx4_txq_add_mr(txq, mp, i);
> >  }
> >
> > +/**
> > + * Write Tx data segment to the SQ.
> > + *
> > + * @param dseg
> > + *   Pointer to data segment in SQ.
> > + * @param lkey
> > + *   Memory region lkey.
> > + * @param addr
> > + *   Data address.
> > + * @param byte_count
> > + *   Big Endian bytes count of the data to send.
> 
> Big Endian => Big endian
> 
> How about using the dedicated type to properly document it?
> See rte_be32_t from rte_byteorder.h.
> 
Learned something new, thanks, will check it.

> > + */
> > +static inline void
> > +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
> > +		       uint32_t lkey, uintptr_t addr, uint32_t byte_count) {
> > +	dseg->addr = rte_cpu_to_be_64(addr);
> > +	dseg->lkey = rte_cpu_to_be_32(lkey); #if RTE_CACHE_LINE_SIZE <
> 64
> > +	/*
> > +	 * Need a barrier here before writing the byte_count
> > +	 * fields to make sure that all the data is visible
> > +	 * before the byte_count field is set.
> > +	 * Otherwise, if the segment begins a new cacheline,
> > +	 * the HCA prefetcher could grab the 64-byte chunk and
> > +	 * get a valid (!= 0xffffffff) byte count but stale
> > +	 * data, and end up sending the wrong data.
> > +	 */
> > +	rte_io_wmb();
> > +#endif /* RTE_CACHE_LINE_SIZE */
> > +	dseg->byte_count = byte_count;
> > +}
> > +
> 
> No need to expose this function in a header file. Note that rte_cpu_*() and
> rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h
> respectively.
> 

Shouldn't inline functions be in header files?

> >  #endif /* MLX4_RXTX_H_ */
> > --
> > 1.8.3.1
> >
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] net/mlx4: merge Tx queue rings management
  2017-12-06 10:58   ` Adrien Mazarguil
@ 2017-12-06 11:43     ` Matan Azrad
  2017-12-06 12:09       ` Adrien Mazarguil
  0 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 11:43 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, December 6, 2017 12:59 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH 5/8] net/mlx4: merge Tx queue rings management
> 
> On Tue, Nov 28, 2017 at 12:19:27PM +0000, Matan Azrad wrote:
> > The Tx queue send ring was managed by Tx block head,tail,count and
> > mask management variables which were used for managing the send
> queue
> > remain space and next places of empty or completted work queue entries.
> 
> completted => completed
> 
> >
> > This method suffered from an actual addresses recalculation per
> > packet, an unnecessary Tx block based calculations and an expensive
> > dual management of Tx rings.
> >
> > Move send queue ring calculation to be based on actual addresses while
> > managing it by descriptors ring indexes.
> >
> > Add new work queue entry pointer to the descriptor element to hold the
> > appropriate entry in the send queue.
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  drivers/net/mlx4/mlx4_prm.h  |  20 ++--  drivers/net/mlx4/mlx4_rxtx.c
> > | 241 +++++++++++++++++++------------------------
> >  drivers/net/mlx4/mlx4_rxtx.h |   1 +
> >  drivers/net/mlx4/mlx4_txq.c  |  23 +++--
> >  4 files changed, 126 insertions(+), 159 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
> > index fcc7c12..2ca303a 100644
> > --- a/drivers/net/mlx4/mlx4_prm.h
> > +++ b/drivers/net/mlx4/mlx4_prm.h
> > @@ -54,22 +54,18 @@
> >
> >  /* Typical TSO descriptor with 16 gather entries is 352 bytes. */
> >  #define MLX4_MAX_WQE_SIZE 512
> > -#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
> > +#define MLX4_SEG_SHIFT 4
> >
> >  /* Send queue stamping/invalidating information. */
> >  #define MLX4_SQ_STAMP_STRIDE 64
> >  #define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
> > -#define MLX4_SQ_STAMP_SHIFT 31
> > +#define MLX4_SQ_OWNER_BIT 31
> >  #define MLX4_SQ_STAMP_VAL 0x7fffffff
> >
> >  /* Work queue element (WQE) flags. */
> > -#define MLX4_BIT_WQE_OWN 0x80000000
> >  #define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
> >  #define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
> >
> > -#define MLX4_SIZE_TO_TXBBS(size) \
> > -	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
> > -
> >  /* CQE checksum flags. */
> >  enum {
> >  	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
> > @@ -98,17 +94,15 @@ enum {
> >  struct mlx4_sq {
> >  	volatile uint8_t *buf; /**< SQ buffer. */
> >  	volatile uint8_t *eob; /**< End of SQ buffer */
> > -	uint32_t head; /**< SQ head counter in units of TXBBS. */
> > -	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
> > -	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
> > -	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
> > -	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept
> free. */
> > +	uint32_t size; /**< SQ size includes headroom. */
> > +	int32_t remain_size; /**< Remain WQE size in SQ. */
> 
> Remain => Remaining?
> 
OK
> By "size", do you mean "room" as there could be several WQEs in there?
> 
Size in bytes.
remaining size | remaining space | remaining room | remaining bytes: which do you prefer?

> Note before reviewing the rest of this patch, the fact it's a signed integer
> bothers me; it's probably a mistake.

There are places in the code where this variable can be used for signed calculations.

> You should standardize on unsigned values everywhere.

Why?
Each field should use the most appropriate type.

> 

> > +	/**< Default owner opcode with HW valid owner bit. */
> 
> The "/**<" syntax requires the comment to come after the documented
> field. You should either move this line below "owner_opcode" or use "/**".
> 
OK

> > +	uint32_t owner_opcode;
> > +	uint32_t stamp; /**< Stamp value with an invalid HW owner bit. */
> >  	volatile uint32_t *db; /**< Pointer to the doorbell. */
> >  	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
> > };
> >
> > -#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) *
> > (MLX4_TXBB_SIZE)))
> > -
> >  /* Completion queue events, numbers and masks. */
> >  #define MLX4_CQ_DB_GEQ_N_MASK 0x3
> >  #define MLX4_CQ_DOORBELL 0x20
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> > index b9cb2fc..0a8ef93 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -61,9 +61,6 @@
> >  #include "mlx4_rxtx.h"
> >  #include "mlx4_utils.h"
> >
> > -#define WQE_ONE_DATA_SEG_SIZE \
> > -	(sizeof(struct mlx4_wqe_ctrl_seg) + sizeof(struct
> mlx4_wqe_data_seg))
> > -
> >  /**
> >   * Pointer-value pair structure used in tx_post_send for saving the first
> >   * DWORD (32 byte) of a TXBB.
> > @@ -268,52 +265,48 @@ struct pv {
> >   *
> >   * @param sq
> >   *   Pointer to the SQ structure.
> > - * @param index
> > - *   Index of the freed WQE.
> > - * @param num_txbbs
> > - *   Number of blocks to stamp.
> > - *   If < 0 the routine will use the size written in the WQ entry.
> > - * @param owner
> > - *   The value of the WQE owner bit to use in the stamp.
> > + * @param wqe
> > + *   Pointer of WQE to stamp.
> 
> Looks like it's not just a simple pointer to the WQE to stamp seeing this
> function also stores the address of the next WQE in the provided buffer
> (uint32_t **wqe). It's not documented as such.
> 
Yes, you're right, I will change it; it is going to be changed in the next series patch :)

> >   *
> >   * @return
> > - *   The number of Tx basic blocs (TXBB) the WQE contained.
> > + *   WQE size.
> >   */
> > -static int
> > -mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t
> > owner)
> > +static uint32_t
> > +mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t
> **wqe)
> >  {
> > -	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
> > -					  (!!owner <<
> MLX4_SQ_STAMP_SHIFT));
> > -	volatile uint8_t *wqe = mlx4_get_send_wqe(sq,
> > -						(index & sq->txbb_cnt_mask));
> > -	volatile uint32_t *ptr = (volatile uint32_t *)wqe;
> > -	int i;
> > -	int txbbs_size;
> > -	int num_txbbs;
> > -
> > +	uint32_t stamp = sq->stamp;
> > +	volatile uint32_t *next_txbb = *wqe;
> >  	/* Extract the size from the control segment of the WQE. */
> > -	num_txbbs = MLX4_SIZE_TO_TXBBS((((volatile struct
> mlx4_wqe_ctrl_seg *)
> > -					 wqe)->fence_size & 0x3f) << 4);
> > -	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
> > +	uint32_t size = RTE_ALIGN((uint32_t)
> > +				  ((((volatile struct mlx4_wqe_ctrl_seg *)
> > +				     next_txbb)->fence_size & 0x3f) << 4),
> > +				  MLX4_TXBB_SIZE);
> > +	uint32_t size_cd = size;
> > +
> >  	/* Optimize the common case when there is no wrap-around. */
> > -	if (wqe + txbbs_size <= sq->eob) {
> > +	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
> >  		/* Stamp the freed descriptor. */
> > -		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
> > -			*ptr = stamp;
> > -			ptr += MLX4_SQ_STAMP_DWORDS;
> > -		}
> > +		do {
> > +			*next_txbb = stamp;
> > +			next_txbb += MLX4_SQ_STAMP_DWORDS;
> > +			size_cd -= MLX4_TXBB_SIZE;
> > +		} while (size_cd);
> >  	} else {
> >  		/* Stamp the freed descriptor. */
> > -		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
> > -			*ptr = stamp;
> > -			ptr += MLX4_SQ_STAMP_DWORDS;
> > -			if ((volatile uint8_t *)ptr >= sq->eob) {
> > -				ptr = (volatile uint32_t *)sq->buf;
> > -				stamp ^= RTE_BE32(0x80000000);
> > +		do {
> > +			*next_txbb = stamp;
> > +			next_txbb += MLX4_SQ_STAMP_DWORDS;
> > +			if ((volatile uint8_t *)next_txbb >= sq->eob) {
> > +				next_txbb = (volatile uint32_t *)sq->buf;
> > +				/* Flip invalid stamping ownership. */
> > +				stamp ^= RTE_BE32(0x1 <<
> MLX4_SQ_OWNER_BIT);
> > +				sq->stamp = stamp;
> >  			}
> > -		}
> > +			size_cd -= MLX4_TXBB_SIZE;
> > +		} while (size_cd);
> >  	}
> > -	return num_txbbs;
> > +	*wqe = next_txbb;
> > +	return size;
> >  }
> >
> >  /**
> > @@ -326,24 +319,22 @@ struct pv {
> >   *
> >   * @param txq
> >   *   Pointer to Tx queue structure.
> > - *
> > - * @return
> > - *   0 on success, -1 on failure.
> >   */
> > -static int
> > +static void
> >  mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
> >  				  struct mlx4_sq *sq)
> >  {
> > -	unsigned int elts_comp = txq->elts_comp;
> >  	unsigned int elts_tail = txq->elts_tail;
> > -	unsigned int sq_tail = sq->tail;
> >  	struct mlx4_cq *cq = &txq->mcq;
> >  	volatile struct mlx4_cqe *cqe;
> >  	uint32_t cons_index = cq->cons_index;
> > -	uint16_t new_index;
> > -	uint16_t nr_txbbs = 0;
> > -	int pkts = 0;
> > -
> > +	volatile uint32_t *first_wqe;
> > +	volatile uint32_t *next_wqe = (volatile uint32_t *)
> > +			((&(*txq->elts)[elts_tail])->wqe);
> > +	volatile uint32_t *last_wqe;
> > +	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
> > +			 MLX4_TXBB_SHIFT) - 1;
> > +	uint32_t pkts = 0;
> >  	/*
> >  	 * Traverse over all CQ entries reported and handle each WQ entry
> >  	 * reported by them.
> > @@ -353,11 +344,11 @@ struct pv {
> >  		if (unlikely(!!(cqe->owner_sr_opcode &
> MLX4_CQE_OWNER_MASK) ^
> >  		    !!(cons_index & cq->cqe_cnt)))
> >  			break;
> > +#ifndef NDEBUG
> >  		/*
> >  		 * Make sure we read the CQE after we read the ownership
> bit.
> >  		 */
> >  		rte_io_rmb();
> > -#ifndef NDEBUG
> >  		if (unlikely((cqe->owner_sr_opcode &
> MLX4_CQE_OPCODE_MASK) ==
> >  			     MLX4_CQE_OPCODE_ERROR)) {
> >  			volatile struct mlx4_err_cqe *cqe_err =
> > @@ -366,41 +357,32 @@ struct pv {
> >  			      " syndrome: 0x%x\n",
> >  			      (void *)txq, cqe_err->vendor_err,
> >  			      cqe_err->syndrome);
> > +			break;
> >  		}
> >  #endif /* NDEBUG */
> > -		/* Get WQE index reported in the CQE. */
> > -		new_index =
> > -			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
> > +		/* Get WQE address by index from the CQE. */
> > +		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
> > +			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
> > +			 MLX4_TXBB_SHIFT));
> >  		do {
> >  			/* Free next descriptor. */
> > -			sq_tail += nr_txbbs;
> > -			nr_txbbs =
> > -				mlx4_txq_stamp_freed_wqe(sq,
> > -				     sq_tail & sq->txbb_cnt_mask,
> > -				     !!(sq_tail & sq->txbb_cnt));
> > +			first_wqe = next_wqe;
> > +			sq->remain_size +=
> > +				mlx4_txq_stamp_freed_wqe(sq,
> &next_wqe);
> >  			pkts++;
> > -		} while ((sq_tail & sq->txbb_cnt_mask) != new_index);
> > +		} while (first_wqe != last_wqe);
> >  		cons_index++;
> >  	} while (1);
> >  	if (unlikely(pkts == 0))
> > -		return 0;
> > -	/* Update CQ. */
> > +		return;
> > +	/* Update CQ consumer index. */
> >  	cq->cons_index = cons_index;
> > -	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index &
> MLX4_CQ_DB_CI_MASK);
> > -	sq->tail = sq_tail + nr_txbbs;
> > -	/* Update the list of packets posted for transmission. */
> > -	elts_comp -= pkts;
> > -	assert(elts_comp <= txq->elts_comp);
> > -	/*
> > -	 * Assume completion status is successful as nothing can be done
> about
> > -	 * it anyway.
> > -	 */
> > +	*cq->set_ci_db = rte_cpu_to_be_32(cons_index &
> MLX4_CQ_DB_CI_MASK);
> > +	txq->elts_comp -= pkts;
> >  	elts_tail += pkts;
> >  	if (elts_tail >= elts_n)
> >  		elts_tail -= elts_n;
> >  	txq->elts_tail = elts_tail;
> > -	txq->elts_comp = elts_comp;
> > -	return 0;
> >  }
> >
> >  /**
> > @@ -421,41 +403,27 @@ struct pv {
> >  	return buf->pool;
> >  }
> >
> > -static int
> > +static volatile struct mlx4_wqe_ctrl_seg *
> >  mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
> > -		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
> > +		   volatile struct mlx4_wqe_ctrl_seg *ctrl)
> 
> Can you use this opportunity to document this function?
> 
Sure, new patch for this?

> >  {
> > -	int wqe_real_size;
> > -	int nr_txbbs;
> >  	struct pv *pv = (struct pv *)txq->bounce_buf;
> >  	struct mlx4_sq *sq = &txq->msq;
> > -	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> > -	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> > -	volatile struct mlx4_wqe_data_seg *dseg;
> >  	struct rte_mbuf *sbuf = buf;
> >  	uint32_t lkey;
> >  	int pv_counter = 0;
> >  	int nb_segs = buf->nb_segs;
> > +	int32_t wqe_size;
> > +	volatile struct mlx4_wqe_data_seg *dseg =
> > +		(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
> >
> > -	/* Calculate the needed work queue entry size for this packet. */
> > -	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
> > -		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> > -	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
> > -	/*
> > -	 * Check that there is room for this WQE in the send queue and that
> > -	 * the WQE size is legal.
> > -	 */
> > -	if (((sq->head - sq->tail) + nr_txbbs +
> > -				sq->headroom_txbbs) >= sq->txbb_cnt ||
> > -			nr_txbbs > MLX4_MAX_WQE_TXBBS) {
> > -		return -1;
> > -	}
> > -	/* Get the control and data entries of the WQE. */
> > -	ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
> > -			mlx4_get_send_wqe(sq, head_idx);
> > -	dseg = (volatile struct mlx4_wqe_data_seg *)
> > -			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
> > -	*pctrl = ctrl;
> > +	ctrl->fence_size = 1 + nb_segs;
> > +	wqe_size = RTE_ALIGN((int32_t)(ctrl->fence_size <<
> MLX4_SEG_SHIFT),
> > +			     MLX4_TXBB_SIZE);
> > +	/* Validate WQE size and WQE space in the send queue. */
> > +	if (sq->remain_size < wqe_size ||
> > +	    wqe_size > MLX4_MAX_WQE_SIZE)
> > +		return NULL;
> >  	/*
> >  	 * Fill the data segments with buffer information.
> >  	 * First WQE TXBB head segment is always control segment,
> > @@ -469,7 +437,7 @@ struct pv {
> >  	if (unlikely(lkey == (uint32_t)-1)) {
> >  		DEBUG("%p: unable to get MP <-> MR association",
> >  		      (void *)txq);
> > -		return -1;
> > +		return NULL;
> >  	}
> >  	/* Handle WQE wraparound. */
> >  	if (dseg >=
> > @@ -501,7 +469,7 @@ struct pv {
> >  		if (unlikely(lkey == (uint32_t)-1)) {
> >  			DEBUG("%p: unable to get MP <-> MR association",
> >  			      (void *)txq);
> > -			return -1;
> > +			return NULL;
> >  		}
> >  		mlx4_fill_tx_data_seg(dseg, lkey,
> >  				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > @@ -517,7 +485,7 @@ struct pv {
> >  		if (unlikely(lkey == (uint32_t)-1)) {
> >  			DEBUG("%p: unable to get MP <-> MR association",
> >  			      (void *)txq);
> > -			return -1;
> > +			return NULL;
> >  		}
> >  		mlx4_fill_tx_data_seg(dseg, lkey,
> >  				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > @@ -533,7 +501,7 @@ struct pv {
> >  		if (unlikely(lkey == (uint32_t)-1)) {
> >  			DEBUG("%p: unable to get MP <-> MR association",
> >  			      (void *)txq);
> > -			return -1;
> > +			return NULL;
> >  		}
> >  		mlx4_fill_tx_data_seg(dseg, lkey,
> >  				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > @@ -557,9 +525,10 @@ struct pv {
> >  		for (--pv_counter; pv_counter  >= 0; pv_counter--)
> >  			pv[pv_counter].dseg->byte_count =
> pv[pv_counter].val;
> >  	}
> > -	/* Fill the control parameters for this packet. */
> > -	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
> > -	return nr_txbbs;
> > +	sq->remain_size -= wqe_size;
> > +	/* Align next WQE address to the next TXBB. */
> > +	return (volatile struct mlx4_wqe_ctrl_seg *)
> > +		((volatile uint8_t *)ctrl + wqe_size);
> >  }
> >
> >  /**
> > @@ -585,7 +554,8 @@ struct pv {
> >  	unsigned int i;
> >  	unsigned int max;
> >  	struct mlx4_sq *sq = &txq->msq;
> > -	int nr_txbbs;
> > +	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> > +	struct txq_elt *elt;
> >
> >  	assert(txq->elts_comp_cd != 0);
> >  	if (likely(txq->elts_comp != 0))
> > @@ -599,29 +569,30 @@ struct pv {
> >  	--max;
> >  	if (max > pkts_n)
> >  		max = pkts_n;
> > +	elt = &(*txq->elts)[elts_head];
> > +	/* Each element saves its appropriate work queue. */
> > +	ctrl = elt->wqe;
> >  	for (i = 0; (i != max); ++i) {
> >  		struct rte_mbuf *buf = pkts[i];
> >  		unsigned int elts_head_next =
> >  			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
> >  		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
> > -		struct txq_elt *elt = &(*txq->elts)[elts_head];
> > -		uint32_t owner_opcode = MLX4_OPCODE_SEND;
> > -		volatile struct mlx4_wqe_ctrl_seg *ctrl;
> > -		volatile struct mlx4_wqe_data_seg *dseg;
> > +		uint32_t owner_opcode = sq->owner_opcode;
> > +		volatile struct mlx4_wqe_data_seg *dseg =
> > +				(volatile struct mlx4_wqe_data_seg *)(ctrl +
> 1);
> > +		volatile struct mlx4_wqe_ctrl_seg *ctrl_next;
> >  		union {
> >  			uint32_t flags;
> >  			uint16_t flags16[2];
> >  		} srcrb;
> > -		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  		uint32_t lkey;
> >
> >  		/* Clean up old buffer. */
> >  		if (likely(elt->buf != NULL)) {
> >  			struct rte_mbuf *tmp = elt->buf;
> > -
> 
> Empty line following variable declarations should stay.
> 
> >  #ifndef NDEBUG
> >  			/* Poisoning. */
> > -			memset(elt, 0x66, sizeof(*elt));
> > +			elt->buf = (struct rte_mbuf *)0x6666666666666666;
> 
> Note this address depends on pointer size, which may in turn trigger a
> compilation warning/error. Keep memset() on elt->buf.
>

OK

> >  #endif
> >  			/* Faster than rte_pktmbuf_free(). */
> >  			do {
> > @@ -633,23 +604,11 @@ struct pv {
> >  		}
> >  		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
> >  		if (buf->nb_segs == 1) {
> > -			/*
> > -			 * Check that there is room for this WQE in the send
> > -			 * queue and that the WQE size is legal
> > -			 */
> > -			if (((sq->head - sq->tail) + 1 + sq->headroom_txbbs)
> >=
> > -			     sq->txbb_cnt || 1 > MLX4_MAX_WQE_TXBBS) {
> > +			/* Validate WQE space in the send queue. */
> > +			if (sq->remain_size < MLX4_TXBB_SIZE) {
> >  				elt->buf = NULL;
> >  				break;
> >  			}
> > -			/* Get the control and data entries of the WQE. */
> > -			ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
> > -					mlx4_get_send_wqe(sq, head_idx);
> > -			dseg = (volatile struct mlx4_wqe_data_seg *)
> > -					((uintptr_t)ctrl +
> > -					sizeof(struct mlx4_wqe_ctrl_seg));
> > -
> > -			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4)
> & 0x3f;
> >  			lkey = mlx4_txq_mp2mr(txq,
> mlx4_txq_mb2mp(buf));
> >  			if (unlikely(lkey == (uint32_t)-1)) {
> >  				/* MR does not exist. */
> > @@ -658,23 +617,33 @@ struct pv {
> >  				elt->buf = NULL;
> >  				break;
> >  			}
> > -			mlx4_fill_tx_data_seg(dseg, lkey,
> > +			mlx4_fill_tx_data_seg(dseg++, lkey,
> >  					      rte_pktmbuf_mtod(buf, uintptr_t),
> >  					      rte_cpu_to_be_32(buf->data_len));
> > -			nr_txbbs = 1;
> > +			/* Set WQE size in 16-byte units. */
> > +			ctrl->fence_size = 0x2;
> > +			sq->remain_size -= MLX4_TXBB_SIZE;
> > +			/* Align next WQE address to the next TXBB. */
> > +			ctrl_next = ctrl + 0x4;
> >  		} else {
> > -			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
> > -			if (nr_txbbs < 0) {
> > +			ctrl_next = mlx4_tx_burst_segs(buf, txq, ctrl);
> > +			if (!ctrl_next) {
> >  				elt->buf = NULL;
> >  				break;
> >  			}
> >  		}
> > +		/* Hold SQ ring wrap around. */
> > +		if ((volatile uint8_t *)ctrl_next >= sq->eob) {
> > +			ctrl_next = (volatile struct mlx4_wqe_ctrl_seg *)
> > +				((volatile uint8_t *)ctrl_next - sq->size);
> > +			/* Flip HW valid ownership. */
> > +			sq->owner_opcode ^= 0x1 <<
> MLX4_SQ_OWNER_BIT;
> > +		}
> >  		/*
> >  		 * For raw Ethernet, the SOLICIT flag is used to indicate
> >  		 * that no ICRC should be calculated.
> >  		 */
> > -		txq->elts_comp_cd -= nr_txbbs;
> > -		if (unlikely(txq->elts_comp_cd <= 0)) {
> > +		if (--txq->elts_comp_cd == 0) {
> >  			txq->elts_comp_cd = txq->elts_comp_cd_init;
> >  			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
> >  					       MLX4_WQE_CTRL_CQ_UPDATE);
> > @@ -720,13 +689,13 @@ struct pv {
> >  		 * executing as soon as we do).
> >  		 */
> >  		rte_io_wmb();
> > -		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> > -					      ((sq->head & sq->txbb_cnt) ?
> > -						       MLX4_BIT_WQE_OWN :
> 0));
> > -		sq->head += nr_txbbs;
> > +		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
> >  		elt->buf = buf;
> >  		bytes_sent += buf->pkt_len;
> >  		elts_head = elts_head_next;
> > +		elt_next->wqe = ctrl_next;
> > +		ctrl = ctrl_next;
> > +		elt = elt_next;
> >  	}
> >  	/* Take a shortcut if nothing must be sent. */
> >  	if (unlikely(i == 0))
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
> > index 8207232..c092afa 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > @@ -105,6 +105,7 @@ struct mlx4_rss {
> >  /** Tx element. */
> >  struct txq_elt {
> >  	struct rte_mbuf *buf; /**< Buffer. */
> > +	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
> >  };
> >
> >  /** Rx queue counters. */
> > diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
> > index 7882a4d..4c7b62a 100644
> > --- a/drivers/net/mlx4/mlx4_txq.c
> > +++ b/drivers/net/mlx4/mlx4_txq.c
> > @@ -84,6 +84,7 @@
> >  		assert(elt->buf != NULL);
> >  		rte_pktmbuf_free(elt->buf);
> >  		elt->buf = NULL;
> > +		elt->wqe = NULL;
> >  		if (++elts_tail == RTE_DIM(*elts))
> >  			elts_tail = 0;
> >  	}
> > @@ -163,20 +164,19 @@ struct txq_mp2mr_mbuf_check_data {
> >  	struct mlx4_cq *cq = &txq->mcq;
> >  	struct mlx4dv_qp *dqp = mlxdv->qp.out;
> >  	struct mlx4dv_cq *dcq = mlxdv->cq.out;
> > -	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp-
> >sq.offset;
> >
> > -	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
> >  	/* Total length, including headroom and spare WQEs. */
> > -	sq->eob = sq->buf + sq_size;
> > -	sq->head = 0;
> > -	sq->tail = 0;
> > -	sq->txbb_cnt =
> > -		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >>
> MLX4_TXBB_SHIFT;
> > -	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
> > +	sq->size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
> > +	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
> > +	sq->eob = sq->buf + sq->size;
> > +	uint32_t headroom_size = 2048 + (1 << dqp->sq.wqe_shift);
> > +	/* Continuous headroom size bytes must always stay freed. */
> > +	sq->remain_size = sq->size - headroom_size;
> > +	sq->owner_opcode = MLX4_OPCODE_SEND | (0 <<
> MLX4_SQ_OWNER_BIT);
> > +	sq->stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
> > +				     (0 << MLX4_SQ_OWNER_BIT));
> >  	sq->db = dqp->sdb;
> >  	sq->doorbell_qpn = dqp->doorbell_qpn;
> > -	sq->headroom_txbbs =
> > -		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
> >  	cq->buf = dcq->buf.buf;
> >  	cq->cqe_cnt = dcq->cqe_cnt;
> >  	cq->set_ci_db = dcq->set_ci_db;
> > @@ -362,6 +362,9 @@ struct txq_mp2mr_mbuf_check_data {
> >  		goto error;
> >  	}
> >  	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
> > +	/* Save first wqe pointer in the first element. */
> > +	(&(*txq->elts)[0])->wqe =
> > +		(volatile struct mlx4_wqe_ctrl_seg *)txq->msq.buf;
> >  	/* Pre-register known mempools. */
> >  	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
> >  	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
> > --
> > 1.8.3.1
> >
> 
> Otherwise this patch looks OK.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/8] net/mlx4: align Tx descriptors number
  2017-12-06 10:59   ` Adrien Mazarguil
@ 2017-12-06 11:44     ` Matan Azrad
  0 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 11:44 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, December 6, 2017 12:59 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH 7/8] net/mlx4: align Tx descriptors number
> 
> On Tue, Nov 28, 2017 at 12:19:29PM +0000, Matan Azrad wrote:
> > Using power of 2 descriptors number makes the ring management easier
> > and allows to use mask operation instead of wraparound conditions.
> >
> > Adjust Tx descriptor number to be power of 2 and change calculation to
> > use mask accordingly.
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> 
> A few minor comments, see below.
> 
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 22 ++++++++--------------
> > drivers/net/mlx4/mlx4_txq.c  | 20 ++++++++++++--------
> >  2 files changed, 20 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> > index 30f2930..b5aaf4c 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -314,7 +314,7 @@ struct pv {
> >   *   Pointer to Tx queue structure.
> >   */
> >  static void
> > -mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
> > +mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
> >  				  struct mlx4_sq *sq)
> 
> Documentation needs updating.
> 

OK

> >  {
> >  	unsigned int elts_tail = txq->elts_tail; @@ -355,13 +355,11 @@
> > struct pv {
> >  	if (unlikely(!completed))
> >  		return;
> >  	/* First stamping address is the end of the last one. */
> > -	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
> > +	first_txbb = (&(*txq->elts)[elts_tail & elts_m])->eocb;
> >  	elts_tail += completed;
> > -	if (elts_tail >= elts_n)
> > -		elts_tail -= elts_n;
> >  	/* The new tail element holds the end address. */
> >  	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
> > -		(&(*txq->elts)[elts_tail])->eocb);
> > +		(&(*txq->elts)[elts_tail & elts_m])->eocb);
> >  	/* Update CQ consumer index. */
> >  	cq->cons_index = cons_index;
> >  	*cq->set_ci_db = rte_cpu_to_be_32(cons_index &
> MLX4_CQ_DB_CI_MASK);
> > @@ -534,6 +532,7 @@ struct pv {
> >  	struct txq *txq = (struct txq *)dpdk_txq;
> >  	unsigned int elts_head = txq->elts_head;
> >  	const unsigned int elts_n = txq->elts_n;
> > +	const unsigned int elts_m = elts_n - 1;
> >  	unsigned int bytes_sent = 0;
> >  	unsigned int i;
> >  	unsigned int max;
> > @@ -543,24 +542,20 @@ struct pv {
> >
> >  	assert(txq->elts_comp_cd != 0);
> >  	if (likely(txq->elts_comp != 0))
> > -		mlx4_txq_complete(txq, elts_n, sq);
> > +		mlx4_txq_complete(txq, elts_m, sq);
> >  	max = (elts_n - (elts_head - txq->elts_tail));
> > -	if (max > elts_n)
> > -		max -= elts_n;
> >  	assert(max >= 1);
> >  	assert(max <= elts_n);
> >  	/* Always leave one free entry in the ring. */
> >  	--max;
> >  	if (max > pkts_n)
> >  		max = pkts_n;
> > -	elt = &(*txq->elts)[elts_head];
> > +	elt = &(*txq->elts)[elts_head & elts_m];
> >  	/* First Tx burst element saves the next WQE control segment. */
> >  	ctrl = elt->wqe;
> >  	for (i = 0; (i != max); ++i) {
> >  		struct rte_mbuf *buf = pkts[i];
> > -		unsigned int elts_head_next =
> > -			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
> > -		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
> > +		struct txq_elt *elt_next = &(*txq->elts)[++elts_head &
> elts_m];
> >  		uint32_t owner_opcode = sq->owner_opcode;
> >  		volatile struct mlx4_wqe_data_seg *dseg =
> >  				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
> > @@ -678,7 +673,6 @@ struct pv {
> >  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
> >  		elt->buf = buf;
> >  		bytes_sent += buf->pkt_len;
> > -		elts_head = elts_head_next;
> >  		ctrl = ctrl_next;
> >  		elt = elt_next;
> >  	}
> > @@ -694,7 +688,7 @@ struct pv {
> >  	rte_wmb();
> >  	/* Ring QP doorbell. */
> >  	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
> > -	txq->elts_head = elts_head;
> > +	txq->elts_head += i;
> >  	txq->elts_comp += i;
> >  	return i;
> >  }
> > diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
> > index 4c7b62a..253075a 100644
> > --- a/drivers/net/mlx4/mlx4_txq.c
> > +++ b/drivers/net/mlx4/mlx4_txq.c
> > @@ -76,17 +76,16 @@
> >  	unsigned int elts_head = txq->elts_head;
> >  	unsigned int elts_tail = txq->elts_tail;
> >  	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
> > +	unsigned int elts_m = txq->elts_n - 1;
> >
> >  	DEBUG("%p: freeing WRs", (void *)txq);
> >  	while (elts_tail != elts_head) {
> > -		struct txq_elt *elt = &(*elts)[elts_tail];
> > +		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
> >
> >  		assert(elt->buf != NULL);
> >  		rte_pktmbuf_free(elt->buf);
> >  		elt->buf = NULL;
> >  		elt->wqe = NULL;
> > -		if (++elts_tail == RTE_DIM(*elts))
> > -			elts_tail = 0;
> >  	}
> >  	txq->elts_tail = txq->elts_head;
> >  }
> > @@ -208,7 +207,9 @@ struct txq_mp2mr_mbuf_check_data {
> >  	struct mlx4dv_obj mlxdv;
> >  	struct mlx4dv_qp dv_qp;
> >  	struct mlx4dv_cq dv_cq;
> > -	struct txq_elt (*elts)[desc];
> > +	uint32_t elts_size = desc > 0x1000 ? 0x1000 :
> > +		rte_align32pow2((uint32_t)desc);
> 
> Where is that magical 0x1000 value coming from? It should at least come
> through a macro definition.
> 
> > +	struct txq_elt (*elts)[elts_size];
> >  	struct ibv_qp_init_attr qp_init_attr;
> >  	struct txq *txq;
> >  	uint8_t *bounce_buf;
> > @@ -247,11 +248,14 @@ struct txq_mp2mr_mbuf_check_data {
> >  		      (void *)dev, idx);
> >  		return -rte_errno;
> >  	}
> > -	if (!desc) {
> > -		rte_errno = EINVAL;
> > -		ERROR("%p: invalid number of Tx descriptors", (void *)dev);
> > -		return -rte_errno;
> > +	if ((uint32_t)desc != elts_size) {
> > +		desc = (uint16_t)elts_size;
> > +		WARN("%p: changed number of descriptors in TX queue %u"
> > +		     " to be power of two (%d)",
> > +		     (void *)dev, idx, desc);
> 
> You should display the same message as in mlx4_rxq.c for consistency (also
> TX => Tx).
> 
OK

> >  	}
> > +	DEBUG("%p: configuring queue %u for %u descriptors",
> > +	      (void *)dev, idx, desc);
> 
> Unnecessary duplicated debugging message already printed at the beginning
> of this function. Yes this is a different value but WARN() made that pretty
> clear.
> 
> >  	/* Allocate and initialize Tx queue. */
> >  	mlx4_zmallocv_socket("TXQ", vec, RTE_DIM(vec), socket);
> >  	if (!txq) {
> > --
> > 1.8.3.1
> >
> 
> --
> Adrien Mazarguil
> 6WIND
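
For reference, a minimal sketch of the mask-based wraparound the commit
message above relies on (illustration only, with hypothetical helpers; this
is not the mlx4 code):

    #include <stdint.h>

    /* Wraparound by condition: works for any ring size n. */
    static inline uint32_t
    next_cond(uint32_t head, uint32_t n)
    {
        return (head + 1 == n) ? 0 : head + 1;
    }

    /*
     * Wraparound by mask: requires n to be a power of two, with
     * mask == n - 1; indexes can then be free-running and are only
     * masked when the ring is dereferenced.
     */
    static inline uint32_t
    next_mask(uint32_t head, uint32_t mask)
    {
        return (head + 1) & mask;
    }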

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
  2017-12-06 11:29     ` Matan Azrad
@ 2017-12-06 11:55       ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 11:55 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 11:29:38AM +0000, Matan Azrad wrote:
> Hi Adrien
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Wednesday, December 6, 2017 12:59 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
> > 
> > On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote:
> > > mlx4 Tx block can handle up to 4 data segments or control segment + up
> > > to 3 data segments. The first data segment in each not first Tx block
> > > must validate Tx queue wraparound and must use IO memory barrier
> > > before writing the byte count.
> > >
> > > The previous multi-segment code used "for" loop to iterate over all
> > > packet segments and separated first Tx block data case by "if"
> > > statments.
> > 
> > statments => statements
> > 
> > >
> > > Use switch case and unconditional branches instead of "for" loop can
> > > optimize the case and prevents the unnecessary checks for each data
> > > segment; This hints to compiler to create opitimized jump table.
> > 
> > opitimized => optimized
> > 
> > >
> > > Optimize this case by switch case and unconditional branches usage.
> > >
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > >  drivers/net/mlx4/mlx4_rxtx.c | 165
> > > +++++++++++++++++++++++++------------------
> > >  drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
> > >  2 files changed, 128 insertions(+), 70 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
<snip>
> > > +	/*
> > > +	 * Fill the data segments with buffer information.
> > > +	 * First WQE TXBB head segment is always control segment,
> > > +	 * so jump to tail TXBB data segments code for the first
> > > +	 * WQE data segments filling.
> > > +	 */
> > > +	goto txbb_tail_segs;
> > > +txbb_head_seg:
> > 
> > I'm not fundamentally opposed to "goto" unlike a lot of people out there,
> > but this doesn't look good. It's OK to use goto for error cases and to extricate
> > yourself when trapped in an inner loop, also in some optimization scenarios
> > where it sometimes make sense, but not when the same can be achieved
> > through standard loop constructs and keywords.
> > 
> > In this case I'm under the impression you could have managed with a do { ... }
> > while (...) construct. You need to try harder to reorganize these changes or
> > prove it can't be done without negatively impacting performance.
> > 
> > Doing so should make this patch shorter as well.
> > 
> 
> I noticed this could be done with a loop and without unconditional branches, but I checked it and found a nice performance improvement this way.
> When I used the loop, performance degraded.

OK, you can leave it as is in the meantime then, we'll keep that in mind for
the next refactoring iteration.
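
For the record, a stripped-down sketch of the two control-flow shapes being
compared here, reduced to segment filling only (no wraparound, barriers or
lkey handling; the types and helpers are hypothetical):

    #include <stddef.h>

    struct seg { const void *data; size_t len; struct seg *next; };

    /* Stand-in for writing one data segment descriptor. */
    static void fill(const struct seg *s) { (void)s; }

    /* Loop form: one iteration and one check per segment. */
    static void
    fill_loop(struct seg *s)
    {
        for (; s != NULL; s = s->next)
            fill(s);
    }

    /*
     * Unrolled form: the switch fall-through handles up to three "tail"
     * segments per Tx block without per-segment checks, then jumps back
     * for the segment that opens the next block.
     */
    static void
    fill_unrolled(struct seg *s, int nb_segs)
    {
        goto tail;
    head:
        fill(s);
        s = s->next;
        nb_segs--;
    tail:
        switch (nb_segs) {
        default:
            fill(s); s = s->next; nb_segs--;
            /* fallthrough */
        case 2:
            fill(s); s = s->next; nb_segs--;
            /* fallthrough */
        case 1:
            fill(s);
            nb_segs--;
            if (nb_segs) {
                s = s->next;
                goto head;
            }
            /* fallthrough */
        case 0:
            break;
        }
    }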

<snip>
> > > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h index
> > > 463df2b..8207232 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > > @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev
> > *dev, uint16_t idx,
> > >  	return mlx4_txq_add_mr(txq, mp, i);
> > >  }
> > >
> > > +/**
> > > + * Write Tx data segment to the SQ.
> > > + *
> > > + * @param dseg
> > > + *   Pointer to data segment in SQ.
> > > + * @param lkey
> > > + *   Memory region lkey.
> > > + * @param addr
> > > + *   Data address.
> > > + * @param byte_count
> > > + *   Big Endian bytes count of the data to send.
> > 
> > Big Endian => Big endian
> > 
> > How about using the dedicated type to properly document it?
> > See rte_be32_t from rte_byteorder.h.
> > 
> Learned something new, thanks, I will check it.
> 
> > > + */
> > > +static inline void
> > > +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
> > > +		       uint32_t lkey, uintptr_t addr, uint32_t byte_count) {
> > > +	dseg->addr = rte_cpu_to_be_64(addr);
> > > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > > +#if RTE_CACHE_LINE_SIZE < 64
> > > +	/*
> > > +	 * Need a barrier here before writing the byte_count
> > > +	 * fields to make sure that all the data is visible
> > > +	 * before the byte_count field is set.
> > > +	 * Otherwise, if the segment begins a new cacheline,
> > > +	 * the HCA prefetcher could grab the 64-byte chunk and
> > > +	 * get a valid (!= 0xffffffff) byte count but stale
> > > +	 * data, and end up sending the wrong data.
> > > +	 */
> > > +	rte_io_wmb();
> > > +#endif /* RTE_CACHE_LINE_SIZE */
> > > +	dseg->byte_count = byte_count;
> > > +}
> > > +
> > 
> > No need to expose this function in a header file. Note that rte_cpu_*() and
> > rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h
> > respectively.
> > 
> 
> Shouldn't inline functions be in header files?

Not necessarily, like other function types actually. They have to be
declared where needed, in a header file only if several files depend on
them, otherwise they bring namespace pollution for no apparent reason.
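
A small illustration of that point (hypothetical file and function names):

    /* foo.c -- the helper is used only by this file, so it stays here. */
    #include <stdint.h>

    static inline uint32_t
    clamp_u32(uint32_t v, uint32_t max)
    {
        return v > max ? max : v;
    }

    uint32_t
    foo_limit(uint32_t v)
    {
        return clamp_u32(v, 512);
    }

Only once a second .c file needs clamp_u32() is it worth moving it, together
with the headers it depends on, into a shared header.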

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] net/mlx4: merge Tx queue rings management
  2017-12-06 11:43     ` Matan Azrad
@ 2017-12-06 12:09       ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 12:09 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 11:43:25AM +0000, Matan Azrad wrote:
> Hi Adrien
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Wednesday, December 6, 2017 12:59 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH 5/8] net/mlx4: merge Tx queue rings management
> > 
> > On Tue, Nov 28, 2017 at 12:19:27PM +0000, Matan Azrad wrote:
> > > The Tx queue send ring was managed by Tx block head,tail,count and
> > > mask management variables which were used for managing the send
> > queue
> > > remain space and next places of empty or completted work queue entries.
> > 
> > completted => completed
> > 
> > >
> > > This method suffered from an actual addresses recalculation per
> > > packet, an unnecessary Tx block based calculations and an expensive
> > > dual management of Tx rings.
> > >
> > > Move send queue ring calculation to be based on actual addresses while
> > > managing it by descriptors ring indexes.
> > >
> > > Add new work queue entry pointer to the descriptor element to hold the
> > > appropriate entry in the send queue.
> > >
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > >  drivers/net/mlx4/mlx4_prm.h  |  20 ++--  drivers/net/mlx4/mlx4_rxtx.c
> > > | 241 +++++++++++++++++++------------------------
> > >  drivers/net/mlx4/mlx4_rxtx.h |   1 +
> > >  drivers/net/mlx4/mlx4_txq.c  |  23 +++--
> > >  4 files changed, 126 insertions(+), 159 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
> > > index fcc7c12..2ca303a 100644
> > > --- a/drivers/net/mlx4/mlx4_prm.h
> > > +++ b/drivers/net/mlx4/mlx4_prm.h
<snip>
> > > {  struct mlx4_sq {
> > >  	volatile uint8_t *buf; /**< SQ buffer. */
> > >  	volatile uint8_t *eob; /**< End of SQ buffer */
> > > -	uint32_t head; /**< SQ head counter in units of TXBBS. */
> > > -	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
> > > -	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
> > > -	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
> > > -	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept
> > free. */
> > > +	uint32_t size; /**< SQ size includes headroom. */
> > > +	int32_t remain_size; /**< Remain WQE size in SQ. */
> > 
> > Remain => Remaining?
> > 
> OK
> > By "size", do you mean "room" as there could be several WQEs in there?
> > 
> Size in bytes.
> remaining size | remaining space | remaining room | remaining bytes: which do you prefer?

I think this should fully clarify:

 Remaining WQE room in SQ (bytes).

> 
> > Note before reviewing the rest of this patch, the fact it's a signed integer
> > bothers me; it's probably a mistake.
> 
> There are places in the code where this variable can be used for signed calculations.

My point is these signed calculations shouldn't be signed in the first
place. A buffer size cannot be negative.

> 
> > You should standardize on unsigned values everywhere.
> 
> Why?
> Each field should use the most appropriate type.

Because you used the wrong type everywhere else. The root cause seems to be
with:

 #define MLX4_MAX_WQE_SIZE 512

Which must either be cast when used or redefined like:

 #define MLX4_MAX_WQE_SIZE 512u

Or even:

 #define MLX4_MAX_WQE_SIZE UINT32_C(512)
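
A minimal sketch of why the literal's type matters (illustration only; the
macro and function names below are hypothetical, not the mlx4 definitions):

    #include <stdint.h>

    #define MAX_WQE_SIZE_INT 512            /* plain literal: type int */
    #define MAX_WQE_SIZE_U32 UINT32_C(512)  /* unsigned 32-bit literal */

    /*
     * With the int literal, "wqe_size <= MAX_WQE_SIZE_INT" mixes an
     * unsigned variable with a signed constant; it still behaves
     * correctly here, but it is exactly the kind of mix that pushes
     * code toward a signed remain_size.
     */
    static int
    fits_int(uint32_t remain_size, uint32_t wqe_size)
    {
        return remain_size >= wqe_size && wqe_size <= MAX_WQE_SIZE_INT;
    }

    /*
     * With the unsigned literal, the whole check stays unsigned and
     * remain_size can be a plain uint32_t as requested above.
     */
    static int
    fits_u32(uint32_t remain_size, uint32_t wqe_size)
    {
        return remain_size >= wqe_size && wqe_size <= MAX_WQE_SIZE_U32;
    }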

<snip>
> > > a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c index
> > > b9cb2fc..0a8ef93 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.c
<snip>
> > > @@ -421,41 +403,27 @@ struct pv {
> > >  	return buf->pool;
> > >  }
> > >
> > > -static int
> > > +static volatile struct mlx4_wqe_ctrl_seg *
> > >  mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
> > > -		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
> > > +		   volatile struct mlx4_wqe_ctrl_seg *ctrl)
> > 
> > Can you use this opportunity to document this function?
> > 
> Sure, new patch for this?

You can make it part of this one if you prefer, no problem either way.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 0/8]  improve mlx4 Tx performance
  2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
                   ` (7 preceding siblings ...)
  2017-11-28 12:19 ` [PATCH 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
@ 2017-12-06 14:48 ` Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
                     ` (8 more replies)
  8 siblings, 9 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

This series improves mlx4 Tx performance and fixes and cleans up some Tx code. 
1. 10% MPPS improvement for 1 queue, 1 core, 64B packets, txonly mode.
2. 20% MPPS improvement for 1 queue, 1 core, 32B*4(segs) packets, txonly mode.

V2:
Add missing function descriptions.
Make descriptions more accurate.
Change Tx descriptor alignment to be like Rx.
Move mlx4_fill_tx_data_seg to mlx4_rxtx.c and use rte_be32_t for byte count.
Change remain_size type to uint32_t.
Poisoning with memset.

Matan Azrad (8):
  net/mlx4: fix Tx packet drop application report
  net/mlx4: remove unnecessary Tx wraparound checks
  net/mlx4: remove restamping from Tx error path
  net/mlx4: optimize Tx multi-segment case
  net/mlx4: merge Tx queue rings management
  net/mlx4: mitigate Tx send entry size calculations
  net/mlx4: align Tx descriptors number
  net/mlx4: remove Tx completion elements counter

 drivers/net/mlx4/mlx4_prm.h  |  20 +-
 drivers/net/mlx4/mlx4_rxtx.c | 492 +++++++++++++++++++++----------------------
 drivers/net/mlx4/mlx4_rxtx.h |   5 +-
 drivers/net/mlx4/mlx4_txq.c  |  37 ++--
 4 files changed, 279 insertions(+), 275 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 1/8] net/mlx4: fix Tx packet drop application report
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, stable

When an invalid lkey is sent to the HW, the HW sends an error notification
in the completion function.

The previous code would not crash, but did not report anything to the
application in case of a completion error, so the application could not
know that a packet was actually dropped in case of an invalid lkey.

Bring the lkey validation back to the Tx path.

Fixes: 2eee458746bc ("net/mlx4: remove error flows from Tx fast path")
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 2bfa8b1..0d008ed 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -468,7 +468,6 @@ struct pv {
 		/* Memory region key (big endian) for this memory pool. */
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
 		dseg->lkey = rte_cpu_to_be_32(lkey);
-#ifndef NDEBUG
 		/* Calculate the needed work queue entry size for this packet */
 		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
@@ -486,7 +485,6 @@ struct pv {
 					(sq->head & sq->txbb_cnt) ? 0 : 1);
 			return -1;
 		}
-#endif /* NDEBUG */
 		if (likely(sbuf->data_len)) {
 			byte_count = rte_cpu_to_be_32(sbuf->data_len);
 		} else {
@@ -636,7 +634,6 @@ struct pv {
 			/* Memory region key (big endian). */
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
 			dseg->lkey = rte_cpu_to_be_32(lkey);
-#ifndef NDEBUG
 			if (unlikely(dseg->lkey ==
 				rte_cpu_to_be_32((uint32_t)-1))) {
 				/* MR does not exist. */
@@ -655,7 +652,6 @@ struct pv {
 				elt->buf = NULL;
 				break;
 			}
-#endif /* NDEBUG */
 			/* Never be TXBB aligned, no need compiler barrier. */
 			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
 			/* Fill the control parameters for this packet. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 2/8] net/mlx4: remove unnecessary Tx wraparound checks
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

There is no need to check the Tx queue wraparound for segments which are
not at the beginning of a Tx block. This is especially relevant in the
single-segment case.

Remove the aforementioned unnecessary checks from the Tx path.

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 0d008ed..9a32b3f 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -461,15 +461,11 @@ struct pv {
 	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
 		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
 		rte_prefetch0((volatile void *)addr);
-		/* Handle WQE wraparound. */
-		if (dseg >= (volatile struct mlx4_wqe_data_seg *)sq->eob)
-			dseg = (volatile struct mlx4_wqe_data_seg *)sq->buf;
-		dseg->addr = rte_cpu_to_be_64(addr);
 		/* Memory region key (big endian) for this memory pool. */
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
 		dseg->lkey = rte_cpu_to_be_32(lkey);
 		/* Calculate the needed work queue entry size for this packet */
-		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
+		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
 			DEBUG("%p: unable to get MP <-> MR association",
 					(void *)txq);
@@ -501,6 +497,8 @@ struct pv {
 		 * control segment.
 		 */
 		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			dseg->addr = rte_cpu_to_be_64(addr);
+			dseg->lkey = rte_cpu_to_be_32(lkey);
 #if RTE_CACHE_LINE_SIZE < 64
 			/*
 			 * Need a barrier here before writing the byte_count
@@ -520,6 +518,13 @@ struct pv {
 			 * TXBB, so we need to postpone its byte_count writing
 			 * for later.
 			 */
+			/* Handle WQE wraparound. */
+			if (dseg >=
+			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
+				dseg = (volatile struct mlx4_wqe_data_seg *)
+					sq->buf;
+			dseg->addr = rte_cpu_to_be_64(addr);
+			dseg->lkey = rte_cpu_to_be_32(lkey);
 			pv[pv_counter].dseg = dseg;
 			pv[pv_counter++].val = byte_count;
 		}
@@ -625,11 +630,6 @@ struct pv {
 					sizeof(struct mlx4_wqe_ctrl_seg));
 			addr = rte_pktmbuf_mtod(buf, uintptr_t);
 			rte_prefetch0((volatile void *)addr);
-			/* Handle WQE wraparound. */
-			if (dseg >=
-				(volatile struct mlx4_wqe_data_seg *)sq->eob)
-				dseg = (volatile struct mlx4_wqe_data_seg *)
-						sq->buf;
 			dseg->addr = rte_cpu_to_be_64(addr);
 			/* Memory region key (big endian). */
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 3/8] net/mlx4: remove restamping from Tx error path
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

At error time, the first 4 bytes of each WQE Tx block have not been
written yet, so there is no need to stamp them because they are already
stamped.

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 9a32b3f..1d8240a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -468,17 +468,7 @@ struct pv {
 		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
 			DEBUG("%p: unable to get MP <-> MR association",
-					(void *)txq);
-			/*
-			 * Restamp entry in case of failure.
-			 * Make sure that size is written correctly
-			 * Note that we give ownership to the SW, not the HW.
-			 */
-			wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
-				buf->nb_segs * sizeof(struct mlx4_wqe_data_seg);
-			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-			mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					(sq->head & sq->txbb_cnt) ? 0 : 1);
+			      (void *)txq);
 			return -1;
 		}
 		if (likely(sbuf->data_len)) {
@@ -639,16 +629,6 @@ struct pv {
 				/* MR does not exist. */
 				DEBUG("%p: unable to get MP <-> MR association",
 				      (void *)txq);
-				/*
-				 * Restamp entry in case of failure.
-				 * Make sure that size is written correctly
-				 * Note that we give ownership to the SW,
-				 * not the HW.
-				 */
-				ctrl->fence_size =
-					(WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
-				mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					     (sq->head & sq->txbb_cnt) ? 0 : 1);
 				elt->buf = NULL;
 				break;
 			}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 4/8] net/mlx4: optimize Tx multi-segment case
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
                     ` (2 preceding siblings ...)
  2017-12-06 14:48   ` [PATCH v2 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 16:22     ` Adrien Mazarguil
  2017-12-06 14:48   ` [PATCH v2 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

An mlx4 Tx block can handle up to 4 data segments, or a control segment
plus up to 3 data segments. The first data segment in every Tx block but
the first one must validate the Tx queue wraparound and must use an IO
memory barrier before writing the byte count.

The previous multi-segment code used a "for" loop to iterate over all
packet segments and separated the first Tx block data case with "if"
statements.

Using a switch case and unconditional branches instead of a "for" loop
optimizes the case and prevents unnecessary checks for each data segment;
this hints the compiler to create an optimized jump table.

Optimize this case by using a switch case and unconditional branches.

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 198 ++++++++++++++++++++++++++++---------------
 1 file changed, 128 insertions(+), 70 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 1d8240a..adf02c0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -421,6 +421,39 @@ struct pv {
 	return buf->pool;
 }
 
+/**
+ * Write Tx data segment to the SQ.
+ *
+ * @param dseg
+ *   Pointer to data segment in SQ.
+ * @param lkey
+ *   Memory region lkey.
+ * @param addr
+ *   Data address.
+ * @param byte_count
+ *   Big endian bytes count of the data to send.
+ */
+static inline void
+mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
+		       uint32_t lkey, uintptr_t addr, rte_be32_t  byte_count)
+{
+	dseg->addr = rte_cpu_to_be_64(addr);
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+#if RTE_CACHE_LINE_SIZE < 64
+	/*
+	 * Need a barrier here before writing the byte_count
+	 * fields to make sure that all the data is visible
+	 * before the byte_count field is set.
+	 * Otherwise, if the segment begins a new cacheline,
+	 * the HCA prefetcher could grab the 64-byte chunk and
+	 * get a valid (!= 0xffffffff) byte count but stale
+	 * data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+#endif /* RTE_CACHE_LINE_SIZE */
+	dseg->byte_count = byte_count;
+}
+
 static int
 mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
 		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
@@ -432,15 +465,14 @@ struct pv {
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	volatile struct mlx4_wqe_ctrl_seg *ctrl;
 	volatile struct mlx4_wqe_data_seg *dseg;
-	struct rte_mbuf *sbuf;
+	struct rte_mbuf *sbuf = buf;
 	uint32_t lkey;
-	uintptr_t addr;
-	uint32_t byte_count;
 	int pv_counter = 0;
+	int nb_segs = buf->nb_segs;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
-		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
+		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
 	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
 	/*
 	 * Check that there is room for this WQE in the send queue and that
@@ -457,67 +489,99 @@ struct pv {
 	dseg = (volatile struct mlx4_wqe_data_seg *)
 			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
 	*pctrl = ctrl;
-	/* Fill the data segments with buffer information. */
-	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
-		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
-		rte_prefetch0((volatile void *)addr);
-		/* Memory region key (big endian) for this memory pool. */
+	/*
+	 * Fill the data segments with buffer information.
+	 * First WQE TXBB head segment is always control segment,
+	 * so jump to tail TXBB data segments code for the first
+	 * WQE data segments filling.
+	 */
+	goto txbb_tail_segs;
+txbb_head_seg:
+	/* Memory region key (big endian) for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		DEBUG("%p: unable to get MP <-> MR association",
+		      (void *)txq);
+		return -1;
+	}
+	/* Handle WQE wraparound. */
+	if (dseg >=
+		(volatile struct mlx4_wqe_data_seg *)sq->eob)
+		dseg = (volatile struct mlx4_wqe_data_seg *)
+			sq->buf;
+	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf, uintptr_t));
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * This data segment starts at the beginning of a new
+	 * TXBB, so we need to postpone its byte_count writing
+	 * for later.
+	 */
+	pv[pv_counter].dseg = dseg;
+	/*
+	 * Zero length segment is treated as inline segment
+	 * with zero data.
+	 */
+	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
+						sbuf->data_len : 0x80000000);
+	sbuf = sbuf->next;
+	dseg++;
+	nb_segs--;
+txbb_tail_segs:
+	/* Jump to default if there are more than two segments remaining. */
+	switch (nb_segs) {
+	default:
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
-		dseg->lkey = rte_cpu_to_be_32(lkey);
-		/* Calculate the needed work queue entry size for this packet */
-		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
-			/* MR does not exist. */
+		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
 			return -1;
 		}
-		if (likely(sbuf->data_len)) {
-			byte_count = rte_cpu_to_be_32(sbuf->data_len);
-		} else {
-			/*
-			 * Zero length segment is treated as inline segment
-			 * with zero data.
-			 */
-			byte_count = RTE_BE32(0x80000000);
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		sbuf = sbuf->next;
+		dseg++;
+		nb_segs--;
+		/* fallthrough */
+	case 2:
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			return -1;
 		}
-		/*
-		 * If the data segment is not at the beginning of a
-		 * Tx basic block (TXBB) then write the byte count,
-		 * else postpone the writing to just before updating the
-		 * control segment.
-		 */
-		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
-			dseg->addr = rte_cpu_to_be_64(addr);
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-#if RTE_CACHE_LINE_SIZE < 64
-			/*
-			 * Need a barrier here before writing the byte_count
-			 * fields to make sure that all the data is visible
-			 * before the byte_count field is set.
-			 * Otherwise, if the segment begins a new cacheline,
-			 * the HCA prefetcher could grab the 64-byte chunk and
-			 * get a valid (!= 0xffffffff) byte count but stale
-			 * data, and end up sending the wrong data.
-			 */
-			rte_io_wmb();
-#endif /* RTE_CACHE_LINE_SIZE */
-			dseg->byte_count = byte_count;
-		} else {
-			/*
-			 * This data segment starts at the beginning of a new
-			 * TXBB, so we need to postpone its byte_count writing
-			 * for later.
-			 */
-			/* Handle WQE wraparound. */
-			if (dseg >=
-			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
-				dseg = (volatile struct mlx4_wqe_data_seg *)
-					sq->buf;
-			dseg->addr = rte_cpu_to_be_64(addr);
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-			pv[pv_counter].dseg = dseg;
-			pv[pv_counter++].val = byte_count;
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		sbuf = sbuf->next;
+		dseg++;
+		nb_segs--;
+		/* fallthrough */
+	case 1:
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			return -1;
+		}
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		nb_segs--;
+		if (nb_segs) {
+			sbuf = sbuf->next;
+			dseg++;
+			goto txbb_head_seg;
 		}
+		/* fallthrough */
+	case 0:
+		break;
 	}
 	/* Write the first DWORD of each TXBB save earlier. */
 	if (pv_counter) {
@@ -583,7 +647,6 @@ struct pv {
 		} srcrb;
 		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 		uint32_t lkey;
-		uintptr_t addr;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -618,24 +681,19 @@ struct pv {
 			dseg = (volatile struct mlx4_wqe_data_seg *)
 					((uintptr_t)ctrl +
 					sizeof(struct mlx4_wqe_ctrl_seg));
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			rte_prefetch0((volatile void *)addr);
-			dseg->addr = rte_cpu_to_be_64(addr);
-			/* Memory region key (big endian). */
+
+			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-			if (unlikely(dseg->lkey ==
-				rte_cpu_to_be_32((uint32_t)-1))) {
+			if (unlikely(lkey == (uint32_t)-1)) {
 				/* MR does not exist. */
 				DEBUG("%p: unable to get MP <-> MR association",
 				      (void *)txq);
 				elt->buf = NULL;
 				break;
 			}
-			/* Never be TXBB aligned, no need compiler barrier. */
-			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
-			/* Fill the control parameters for this packet. */
-			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
+			mlx4_fill_tx_data_seg(dseg, lkey,
+					      rte_pktmbuf_mtod(buf, uintptr_t),
+					      rte_cpu_to_be_32(buf->data_len));
 			nr_txbbs = 1;
 		} else {
 			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 5/8] net/mlx4: merge Tx queue rings management
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
                     ` (3 preceding siblings ...)
  2017-12-06 14:48   ` [PATCH v2 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 16:22     ` Adrien Mazarguil
  2017-12-06 14:48   ` [PATCH v2 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

The Tx queue send ring was managed by Tx block head, tail, count and
mask management variables which were used to track the remaining send
queue space and the next places of empty or completed work queue
entries.

This method suffered from actual address recalculation per packet,
unnecessary Tx block based calculations and an expensive dual
management of the Tx rings.

Move the send queue ring calculation to be based on actual addresses
while managing it by descriptor ring indexes.

Add a new work queue entry pointer to the descriptor element to hold
the appropriate entry in the send queue.
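
A minimal sketch of the idea (not part of the patch; txq_elt_sketch and
sq_next_wqe are made-up names, the mlx4_sq fields are the ones
introduced below): each element remembers its own WQE address, and the
next WQE is reached by adding the TXBB-aligned WQE size and wrapping by
address, instead of recomputing addresses from TXBB indexes per packet.

struct txq_elt_sketch {
	struct rte_mbuf *buf; /* Transmitted buffer. */
	volatile struct mlx4_wqe_ctrl_seg *wqe; /* WQE used by this element. */
};

/* Address of the WQE following the current one (owner bit flip omitted). */
static inline volatile struct mlx4_wqe_ctrl_seg *
sq_next_wqe(struct mlx4_sq *sq, volatile struct mlx4_wqe_ctrl_seg *ctrl,
	    uint32_t wqe_size)
{
	volatile uint8_t *next = (volatile uint8_t *)ctrl + wqe_size;

	/* Wrap to the start of the SQ buffer when passing its end. */
	if (next >= sq->eob)
		next -= sq->size;
	return (volatile struct mlx4_wqe_ctrl_seg *)next;
}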

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  |  20 ++--
 drivers/net/mlx4/mlx4_rxtx.c | 253 ++++++++++++++++++++-----------------------
 drivers/net/mlx4/mlx4_rxtx.h |   1 +
 drivers/net/mlx4/mlx4_txq.c  |  23 ++--
 4 files changed, 139 insertions(+), 158 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index fcc7c12..217ea50 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -54,22 +54,18 @@
 
 /* Typical TSO descriptor with 16 gather entries is 352 bytes. */
 #define MLX4_MAX_WQE_SIZE 512
-#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+#define MLX4_SEG_SHIFT 4
 
 /* Send queue stamping/invalidating information. */
 #define MLX4_SQ_STAMP_STRIDE 64
 #define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
-#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_OWNER_BIT 31
 #define MLX4_SQ_STAMP_VAL 0x7fffffff
 
 /* Work queue element (WQE) flags. */
-#define MLX4_BIT_WQE_OWN 0x80000000
 #define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
 #define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
-#define MLX4_SIZE_TO_TXBBS(size) \
-	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
-
 /* CQE checksum flags. */
 enum {
 	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
@@ -98,17 +94,15 @@ enum {
 struct mlx4_sq {
 	volatile uint8_t *buf; /**< SQ buffer. */
 	volatile uint8_t *eob; /**< End of SQ buffer */
-	uint32_t head; /**< SQ head counter in units of TXBBS. */
-	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
-	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
-	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
-	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t size; /**< SQ size includes headroom. */
+	uint32_t remain_size; /**< Remaining WQE room in SQ (bytes). */
+	uint32_t owner_opcode;
+	/**< Default owner opcode with HW valid owner bit. */
+	uint32_t stamp; /**< Stamp value with an invalid HW owner bit. */
 	volatile uint32_t *db; /**< Pointer to the doorbell. */
 	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
 };
 
-#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
-
 /* Completion queue events, numbers and masks. */
 #define MLX4_CQ_DB_GEQ_N_MASK 0x3
 #define MLX4_CQ_DOORBELL 0x20
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index adf02c0..2467d1d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -61,9 +61,6 @@
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
-#define WQE_ONE_DATA_SEG_SIZE \
-	(sizeof(struct mlx4_wqe_ctrl_seg) + sizeof(struct mlx4_wqe_data_seg))
-
 /**
  * Pointer-value pair structure used in tx_post_send for saving the first
  * DWORD (32 byte) of a TXBB.
@@ -268,52 +265,48 @@ struct pv {
  *
  * @param sq
  *   Pointer to the SQ structure.
- * @param index
- *   Index of the freed WQE.
- * @param num_txbbs
- *   Number of blocks to stamp.
- *   If < 0 the routine will use the size written in the WQ entry.
- * @param owner
- *   The value of the WQE owner bit to use in the stamp.
+ * @param wqe
+ *   Pointer of WQE address to stamp.
  *
  * @return
- *   The number of Tx basic blocs (TXBB) the WQE contained.
+ *   WQE size and updates WQE address to the next WQE.
  */
-static int
-mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+static uint32_t
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
 {
-	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
-					  (!!owner << MLX4_SQ_STAMP_SHIFT));
-	volatile uint8_t *wqe = mlx4_get_send_wqe(sq,
-						(index & sq->txbb_cnt_mask));
-	volatile uint32_t *ptr = (volatile uint32_t *)wqe;
-	int i;
-	int txbbs_size;
-	int num_txbbs;
-
+	uint32_t stamp = sq->stamp;
+	volatile uint32_t *next_txbb = *wqe;
 	/* Extract the size from the control segment of the WQE. */
-	num_txbbs = MLX4_SIZE_TO_TXBBS((((volatile struct mlx4_wqe_ctrl_seg *)
-					 wqe)->fence_size & 0x3f) << 4);
-	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	uint32_t size = RTE_ALIGN((uint32_t)
+				  ((((volatile struct mlx4_wqe_ctrl_seg *)
+				     next_txbb)->fence_size & 0x3f) << 4),
+				  MLX4_TXBB_SIZE);
+	uint32_t size_cd = size;
+
 	/* Optimize the common case when there is no wrap-around. */
-	if (wqe + txbbs_size <= sq->eob) {
+	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
 		/* Stamp the freed descriptor. */
-		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
-			*ptr = stamp;
-			ptr += MLX4_SQ_STAMP_DWORDS;
-		}
+		do {
+			*next_txbb = stamp;
+			next_txbb += MLX4_SQ_STAMP_DWORDS;
+			size_cd -= MLX4_TXBB_SIZE;
+		} while (size_cd);
 	} else {
 		/* Stamp the freed descriptor. */
-		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
-			*ptr = stamp;
-			ptr += MLX4_SQ_STAMP_DWORDS;
-			if ((volatile uint8_t *)ptr >= sq->eob) {
-				ptr = (volatile uint32_t *)sq->buf;
-				stamp ^= RTE_BE32(0x80000000);
+		do {
+			*next_txbb = stamp;
+			next_txbb += MLX4_SQ_STAMP_DWORDS;
+			if ((volatile uint8_t *)next_txbb >= sq->eob) {
+				next_txbb = (volatile uint32_t *)sq->buf;
+				/* Flip invalid stamping ownership. */
+				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
+				sq->stamp = stamp;
 			}
-		}
+			size_cd -= MLX4_TXBB_SIZE;
+		} while (size_cd);
 	}
-	return num_txbbs;
+	*wqe = next_txbb;
+	return size;
 }
 
 /**
@@ -326,24 +319,22 @@ struct pv {
  *
  * @param txq
  *   Pointer to Tx queue structure.
- *
- * @return
- *   0 on success, -1 on failure.
  */
-static int
+static void
 mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
 				  struct mlx4_sq *sq)
 {
-	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
-	unsigned int sq_tail = sq->tail;
 	struct mlx4_cq *cq = &txq->mcq;
 	volatile struct mlx4_cqe *cqe;
 	uint32_t cons_index = cq->cons_index;
-	uint16_t new_index;
-	uint16_t nr_txbbs = 0;
-	int pkts = 0;
-
+	volatile uint32_t *first_wqe;
+	volatile uint32_t *next_wqe = (volatile uint32_t *)
+			((&(*txq->elts)[elts_tail])->wqe);
+	volatile uint32_t *last_wqe;
+	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
+			 MLX4_TXBB_SHIFT) - 1;
+	uint32_t pkts = 0;
 	/*
 	 * Traverse over all CQ entries reported and handle each WQ entry
 	 * reported by them.
@@ -353,11 +344,11 @@ struct pv {
 		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
 		    !!(cons_index & cq->cqe_cnt)))
 			break;
+#ifndef NDEBUG
 		/*
 		 * Make sure we read the CQE after we read the ownership bit.
 		 */
 		rte_io_rmb();
-#ifndef NDEBUG
 		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
 			     MLX4_CQE_OPCODE_ERROR)) {
 			volatile struct mlx4_err_cqe *cqe_err =
@@ -366,41 +357,32 @@ struct pv {
 			      " syndrome: 0x%x\n",
 			      (void *)txq, cqe_err->vendor_err,
 			      cqe_err->syndrome);
+			break;
 		}
 #endif /* NDEBUG */
-		/* Get WQE index reported in the CQE. */
-		new_index =
-			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		/* Get WQE address buy index from the CQE. */
+		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
+			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
+			 MLX4_TXBB_SHIFT));
 		do {
 			/* Free next descriptor. */
-			sq_tail += nr_txbbs;
-			nr_txbbs =
-				mlx4_txq_stamp_freed_wqe(sq,
-				     sq_tail & sq->txbb_cnt_mask,
-				     !!(sq_tail & sq->txbb_cnt));
+			first_wqe = next_wqe;
+			sq->remain_size +=
+				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
 			pkts++;
-		} while ((sq_tail & sq->txbb_cnt_mask) != new_index);
+		} while (first_wqe != last_wqe);
 		cons_index++;
 	} while (1);
 	if (unlikely(pkts == 0))
-		return 0;
-	/* Update CQ. */
+		return;
+	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
-	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
-	sq->tail = sq_tail + nr_txbbs;
-	/* Update the list of packets posted for transmission. */
-	elts_comp -= pkts;
-	assert(elts_comp <= txq->elts_comp);
-	/*
-	 * Assume completion status is successful as nothing can be done about
-	 * it anyway.
-	 */
+	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
+	txq->elts_comp -= pkts;
 	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
-	txq->elts_comp = elts_comp;
-	return 0;
 }
 
 /**
@@ -454,41 +436,40 @@ struct pv {
 	dseg->byte_count = byte_count;
 }
 
-static int
+/**
+ * Write data segments of multi-segment packet.
+ *
+ * @param buf
+ *   Pointer to the first packet mbuf.
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param ctrl
+ *   Pointer to the WQE control segment.
+ *
+ * @return
+ *   Pointer to the next WQE control segment on success, NULL otherwise.
+ */
+static volatile struct mlx4_wqe_ctrl_seg *
 mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
-		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
+		   volatile struct mlx4_wqe_ctrl_seg *ctrl)
 {
-	int wqe_real_size;
-	int nr_txbbs;
 	struct pv *pv = (struct pv *)txq->bounce_buf;
 	struct mlx4_sq *sq = &txq->msq;
-	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
-	volatile struct mlx4_wqe_ctrl_seg *ctrl;
-	volatile struct mlx4_wqe_data_seg *dseg;
 	struct rte_mbuf *sbuf = buf;
 	uint32_t lkey;
 	int pv_counter = 0;
 	int nb_segs = buf->nb_segs;
+	uint32_t wqe_size;
+	volatile struct mlx4_wqe_data_seg *dseg =
+		(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
 
-	/* Calculate the needed work queue entry size for this packet. */
-	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
-		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
-	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
-	/*
-	 * Check that there is room for this WQE in the send queue and that
-	 * the WQE size is legal.
-	 */
-	if (((sq->head - sq->tail) + nr_txbbs +
-				sq->headroom_txbbs) >= sq->txbb_cnt ||
-			nr_txbbs > MLX4_MAX_WQE_TXBBS) {
-		return -1;
-	}
-	/* Get the control and data entries of the WQE. */
-	ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
-			mlx4_get_send_wqe(sq, head_idx);
-	dseg = (volatile struct mlx4_wqe_data_seg *)
-			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
-	*pctrl = ctrl;
+	ctrl->fence_size = 1 + nb_segs;
+	wqe_size = RTE_ALIGN((uint32_t)(ctrl->fence_size << MLX4_SEG_SHIFT),
+			     MLX4_TXBB_SIZE);
+	/* Validate WQE size and WQE space in the send queue. */
+	if (sq->remain_size < wqe_size ||
+	    wqe_size > MLX4_MAX_WQE_SIZE)
+		return NULL;
 	/*
 	 * Fill the data segments with buffer information.
 	 * First WQE TXBB head segment is always control segment,
@@ -502,7 +483,7 @@ struct pv {
 	if (unlikely(lkey == (uint32_t)-1)) {
 		DEBUG("%p: unable to get MP <-> MR association",
 		      (void *)txq);
-		return -1;
+		return NULL;
 	}
 	/* Handle WQE wraparound. */
 	if (dseg >=
@@ -534,7 +515,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -550,7 +531,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -566,7 +547,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -590,9 +571,10 @@ struct pv {
 		for (--pv_counter; pv_counter  >= 0; pv_counter--)
 			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
 	}
-	/* Fill the control parameters for this packet. */
-	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-	return nr_txbbs;
+	sq->remain_size -= wqe_size;
+	/* Align next WQE address to the next TXBB. */
+	return (volatile struct mlx4_wqe_ctrl_seg *)
+		((volatile uint8_t *)ctrl + wqe_size);
 }
 
 /**
@@ -618,7 +600,8 @@ struct pv {
 	unsigned int i;
 	unsigned int max;
 	struct mlx4_sq *sq = &txq->msq;
-	int nr_txbbs;
+	volatile struct mlx4_wqe_ctrl_seg *ctrl;
+	struct txq_elt *elt;
 
 	assert(txq->elts_comp_cd != 0);
 	if (likely(txq->elts_comp != 0))
@@ -632,20 +615,22 @@ struct pv {
 	--max;
 	if (max > pkts_n)
 		max = pkts_n;
+	elt = &(*txq->elts)[elts_head];
+	/* Each element saves its appropriate work queue. */
+	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
 		unsigned int elts_head_next =
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
-		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		uint32_t owner_opcode = MLX4_OPCODE_SEND;
-		volatile struct mlx4_wqe_ctrl_seg *ctrl;
-		volatile struct mlx4_wqe_data_seg *dseg;
+		uint32_t owner_opcode = sq->owner_opcode;
+		volatile struct mlx4_wqe_data_seg *dseg =
+				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
+		volatile struct mlx4_wqe_ctrl_seg *ctrl_next;
 		union {
 			uint32_t flags;
 			uint16_t flags16[2];
 		} srcrb;
-		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 		uint32_t lkey;
 
 		/* Clean up old buffer. */
@@ -654,7 +639,7 @@ struct pv {
 
 #ifndef NDEBUG
 			/* Poisoning. */
-			memset(elt, 0x66, sizeof(*elt));
+			memset(elt->buf, 0x66, sizeof(struct rte_mbuf));
 #endif
 			/* Faster than rte_pktmbuf_free(). */
 			do {
@@ -666,23 +651,11 @@ struct pv {
 		}
 		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
 		if (buf->nb_segs == 1) {
-			/*
-			 * Check that there is room for this WQE in the send
-			 * queue and that the WQE size is legal
-			 */
-			if (((sq->head - sq->tail) + 1 + sq->headroom_txbbs) >=
-			     sq->txbb_cnt || 1 > MLX4_MAX_WQE_TXBBS) {
+			/* Validate WQE space in the send queue. */
+			if (sq->remain_size < MLX4_TXBB_SIZE) {
 				elt->buf = NULL;
 				break;
 			}
-			/* Get the control and data entries of the WQE. */
-			ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
-					mlx4_get_send_wqe(sq, head_idx);
-			dseg = (volatile struct mlx4_wqe_data_seg *)
-					((uintptr_t)ctrl +
-					sizeof(struct mlx4_wqe_ctrl_seg));
-
-			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
 			if (unlikely(lkey == (uint32_t)-1)) {
 				/* MR does not exist. */
@@ -691,23 +664,33 @@ struct pv {
 				elt->buf = NULL;
 				break;
 			}
-			mlx4_fill_tx_data_seg(dseg, lkey,
+			mlx4_fill_tx_data_seg(dseg++, lkey,
 					      rte_pktmbuf_mtod(buf, uintptr_t),
 					      rte_cpu_to_be_32(buf->data_len));
-			nr_txbbs = 1;
+			/* Set WQE size in 16-byte units. */
+			ctrl->fence_size = 0x2;
+			sq->remain_size -= MLX4_TXBB_SIZE;
+			/* Align next WQE address to the next TXBB. */
+			ctrl_next = ctrl + 0x4;
 		} else {
-			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
-			if (nr_txbbs < 0) {
+			ctrl_next = mlx4_tx_burst_segs(buf, txq, ctrl);
+			if (!ctrl_next) {
 				elt->buf = NULL;
 				break;
 			}
 		}
+		/* Hold SQ ring wrap around. */
+		if ((volatile uint8_t *)ctrl_next >= sq->eob) {
+			ctrl_next = (volatile struct mlx4_wqe_ctrl_seg *)
+				((volatile uint8_t *)ctrl_next - sq->size);
+			/* Flip HW valid ownership. */
+			sq->owner_opcode ^= 0x1 << MLX4_SQ_OWNER_BIT;
+		}
 		/*
 		 * For raw Ethernet, the SOLICIT flag is used to indicate
 		 * that no ICRC should be calculated.
 		 */
-		txq->elts_comp_cd -= nr_txbbs;
-		if (unlikely(txq->elts_comp_cd <= 0)) {
+		if (--txq->elts_comp_cd == 0) {
 			txq->elts_comp_cd = txq->elts_comp_cd_init;
 			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 					       MLX4_WQE_CTRL_CQ_UPDATE);
@@ -753,13 +736,13 @@ struct pv {
 		 * executing as soon as we do).
 		 */
 		rte_io_wmb();
-		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
-					      ((sq->head & sq->txbb_cnt) ?
-						       MLX4_BIT_WQE_OWN : 0));
-		sq->head += nr_txbbs;
+		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		elts_head = elts_head_next;
+		elt_next->wqe = ctrl_next;
+		ctrl = ctrl_next;
+		elt = elt_next;
 	}
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 463df2b..d56e48d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -105,6 +105,7 @@ struct mlx4_rss {
 /** Tx element. */
 struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
+	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
 };
 
 /** Rx queue counters. */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7882a4d..4c7b62a 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -84,6 +84,7 @@
 		assert(elt->buf != NULL);
 		rte_pktmbuf_free(elt->buf);
 		elt->buf = NULL;
+		elt->wqe = NULL;
 		if (++elts_tail == RTE_DIM(*elts))
 			elts_tail = 0;
 	}
@@ -163,20 +164,19 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4_cq *cq = &txq->mcq;
 	struct mlx4dv_qp *dqp = mlxdv->qp.out;
 	struct mlx4dv_cq *dcq = mlxdv->cq.out;
-	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
 
-	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
 	/* Total length, including headroom and spare WQEs. */
-	sq->eob = sq->buf + sq_size;
-	sq->head = 0;
-	sq->tail = 0;
-	sq->txbb_cnt =
-		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
-	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	sq->eob = sq->buf + sq->size;
+	uint32_t headroom_size = 2048 + (1 << dqp->sq.wqe_shift);
+	/* Continuous headroom size bytes must always stay freed. */
+	sq->remain_size = sq->size - headroom_size;
+	sq->owner_opcode = MLX4_OPCODE_SEND | (0 << MLX4_SQ_OWNER_BIT);
+	sq->stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+				     (0 << MLX4_SQ_OWNER_BIT));
 	sq->db = dqp->sdb;
 	sq->doorbell_qpn = dqp->doorbell_qpn;
-	sq->headroom_txbbs =
-		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
 	cq->buf = dcq->buf.buf;
 	cq->cqe_cnt = dcq->cqe_cnt;
 	cq->set_ci_db = dcq->set_ci_db;
@@ -362,6 +362,9 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
+	/* Save first wqe pointer in the first element. */
+	(&(*txq->elts)[0])->wqe =
+		(volatile struct mlx4_wqe_ctrl_seg *)txq->msq.buf;
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
 	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 6/8] net/mlx4: mitigate Tx send entry size calculations
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
                     ` (4 preceding siblings ...)
  2017-12-06 14:48   ` [PATCH v2 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 14:48   ` [PATCH v2 7/8] net/mlx4: align Tx descriptors number Matan Azrad
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

The previous code took the send queue entry size for stamping from the
send queue entry pointed to by the completion queue entry; these two
reads were done per packet in the completion stage.

The number of packets per completion burst is managed by a fixed size
stored in the Tx queue, so we can infer that each valid completion
entry actually frees that fixed number of packets.

The descriptor ring holds the send queue entries, so the size of a
whole completion burst can be inferred by a simple calculation,
avoiding per-packet calculations.

Adjust the completion functions to free full completion bursts of
packets at once and avoid per-packet work queue entry reads and
calculations.

Save only the send queue entry pointers for the start of a completion
burst or Tx burst in the appropriate descriptor element.
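
A hedged sketch of the inference (completed_elts is an illustrative
helper, not a function from the patch): since a completion is requested
at a fixed period, each new CQE accounts for exactly one period of Tx
elements, so no per-packet WQE size needs to be read back.

/* Completed elements inferred from the number of new CQEs. */
static inline uint32_t
completed_elts(uint32_t new_cons_index, uint32_t old_cons_index,
	       uint32_t elts_comp_cd_init)
{
	return (new_cons_index - old_cons_index) * elts_comp_cd_init;
}

/* E.g. 3 new CQEs with a period of 8 packets: completed_elts(103, 100, 8) == 24. */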

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 105 +++++++++++++++++++------------------------
 drivers/net/mlx4/mlx4_rxtx.h |   5 ++-
 2 files changed, 50 insertions(+), 60 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 2467d1d..8b8d95e 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -258,55 +258,48 @@ struct pv {
 };
 
 /**
- * Stamp a WQE so it won't be reused by the HW.
+ * Stamp TXBB burst so it won't be reused by the HW.
  *
  * Routine is used when freeing WQE used by the chip or when failing
  * building an WQ entry has failed leaving partial information on the queue.
  *
  * @param sq
  *   Pointer to the SQ structure.
- * @param wqe
- *   Pointer of WQE address to stamp.
+ * @param start
+ *   Pointer to the first TXBB to stamp.
+ * @param end
+ *   Pointer to the followed end TXBB to stamp.
  *
  * @return
- *   WQE size and updates WQE address to the next WQE.
+ *   Stamping burst size in byte units.
  */
 static uint32_t
-mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t *start,
+			 volatile uint32_t *end)
 {
 	uint32_t stamp = sq->stamp;
-	volatile uint32_t *next_txbb = *wqe;
-	/* Extract the size from the control segment of the WQE. */
-	uint32_t size = RTE_ALIGN((uint32_t)
-				  ((((volatile struct mlx4_wqe_ctrl_seg *)
-				     next_txbb)->fence_size & 0x3f) << 4),
-				  MLX4_TXBB_SIZE);
-	uint32_t size_cd = size;
+	int32_t size = (intptr_t)end - (intptr_t)start;
 
-	/* Optimize the common case when there is no wrap-around. */
-	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
-		/* Stamp the freed descriptor. */
+	assert(start != end);
+	/* Hold SQ ring wrap around. */
+	if (size < 0) {
+		size = (int32_t)sq->size + size;
 		do {
-			*next_txbb = stamp;
-			next_txbb += MLX4_SQ_STAMP_DWORDS;
-			size_cd -= MLX4_TXBB_SIZE;
-		} while (size_cd);
-	} else {
-		/* Stamp the freed descriptor. */
-		do {
-			*next_txbb = stamp;
-			next_txbb += MLX4_SQ_STAMP_DWORDS;
-			if ((volatile uint8_t *)next_txbb >= sq->eob) {
-				next_txbb = (volatile uint32_t *)sq->buf;
-				/* Flip invalid stamping ownership. */
-				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
-				sq->stamp = stamp;
-			}
-			size_cd -= MLX4_TXBB_SIZE;
-		} while (size_cd);
+			*start = stamp;
+			start += MLX4_SQ_STAMP_DWORDS;
+		} while (start != (volatile uint32_t *)sq->eob);
+		start = (volatile uint32_t *)sq->buf;
+		/* Flip invalid stamping ownership. */
+		stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
+		sq->stamp = stamp;
+		if (start == end)
+			return size;
 	}
-	*wqe = next_txbb;
-	return size;
+	do {
+		*start = stamp;
+		start += MLX4_SQ_STAMP_DWORDS;
+	} while (start != end);
+	return (uint32_t)size;
 }
 
 /**
@@ -327,14 +320,10 @@ struct pv {
 	unsigned int elts_tail = txq->elts_tail;
 	struct mlx4_cq *cq = &txq->mcq;
 	volatile struct mlx4_cqe *cqe;
+	uint32_t completed;
 	uint32_t cons_index = cq->cons_index;
-	volatile uint32_t *first_wqe;
-	volatile uint32_t *next_wqe = (volatile uint32_t *)
-			((&(*txq->elts)[elts_tail])->wqe);
-	volatile uint32_t *last_wqe;
-	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
-			 MLX4_TXBB_SHIFT) - 1;
-	uint32_t pkts = 0;
+	volatile uint32_t *first_txbb;
+
 	/*
 	 * Traverse over all CQ entries reported and handle each WQ entry
 	 * reported by them.
@@ -360,28 +349,23 @@ struct pv {
 			break;
 		}
 #endif /* NDEBUG */
-		/* Get WQE address buy index from the CQE. */
-		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
-			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
-			 MLX4_TXBB_SHIFT));
-		do {
-			/* Free next descriptor. */
-			first_wqe = next_wqe;
-			sq->remain_size +=
-				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
-			pkts++;
-		} while (first_wqe != last_wqe);
 		cons_index++;
 	} while (1);
-	if (unlikely(pkts == 0))
+	completed = (cons_index - cq->cons_index) * txq->elts_comp_cd_init;
+	if (unlikely(!completed))
 		return;
+	/* First stamping address is the end of the last one. */
+	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
+	elts_tail += completed;
+	if (elts_tail >= elts_n)
+		elts_tail -= elts_n;
+	/* The new tail element holds the end address. */
+	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
+		(&(*txq->elts)[elts_tail])->eocb);
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
-	txq->elts_comp -= pkts;
-	elts_tail += pkts;
-	if (elts_tail >= elts_n)
-		elts_tail -= elts_n;
+	txq->elts_comp -= completed;
 	txq->elts_tail = elts_tail;
 }
 
@@ -616,7 +600,7 @@ struct pv {
 	if (max > pkts_n)
 		max = pkts_n;
 	elt = &(*txq->elts)[elts_head];
-	/* Each element saves its appropriate work queue. */
+	/* First Tx burst element saves the next WQE control segment. */
 	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
@@ -691,6 +675,8 @@ struct pv {
 		 * that no ICRC should be calculated.
 		 */
 		if (--txq->elts_comp_cd == 0) {
+			/* Save the completion burst end address. */
+			elt_next->eocb = (volatile uint32_t *)ctrl_next;
 			txq->elts_comp_cd = txq->elts_comp_cd_init;
 			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 					       MLX4_WQE_CTRL_CQ_UPDATE);
@@ -740,13 +726,14 @@ struct pv {
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		elts_head = elts_head_next;
-		elt_next->wqe = ctrl_next;
 		ctrl = ctrl_next;
 		elt = elt_next;
 	}
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
+	/* Save WQE address of the next Tx burst element. */
+	elt->wqe = ctrl;
 	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
 	txq->stats.obytes += bytes_sent;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index d56e48d..36ae03a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -105,7 +105,10 @@ struct mlx4_rss {
 /** Tx element. */
 struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
-	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
+	union {
+		volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
+		volatile uint32_t *eocb; /**< End of completion burst. */
+	};
 };
 
 /** Rx queue counters. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 7/8] net/mlx4: align Tx descriptors number
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
                     ` (5 preceding siblings ...)
  2017-12-06 14:48   ` [PATCH v2 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 16:22     ` Adrien Mazarguil
  2017-12-06 14:48   ` [PATCH v2 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Using power of 2 descriptors number makes the ring management easier
and allows to use mask operation instead of wraparound conditions.

Adjust Tx descriptor number to be power of 2 and change calculation to
use mask accordingly.
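
For illustration only (ring_slot is a made-up helper): with a
power-of-two ring size, a free-running index maps to a slot with a
single AND, which is what replaces the conditional wraparound done on
every increment before this patch.

/* Map a free-running index onto a ring whose size is a power of two. */
static inline unsigned int
ring_slot(unsigned int index, unsigned int elts_n)
{
	return index & (elts_n - 1); /* e.g. ring_slot(515, 512) == 3 */
}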

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 28 +++++++++++++---------------
 drivers/net/mlx4/mlx4_txq.c  | 13 +++++++++----
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 8b8d95e..14192fe 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -312,10 +312,14 @@ struct pv {
  *
  * @param txq
  *   Pointer to Tx queue structure.
+ * @param sq
+ *   Pointer to the SQ structure.
+ * @param elts_m
+ *   Tx elements number mask.
  */
 static void
-mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
-				  struct mlx4_sq *sq)
+mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
+		  struct mlx4_sq *sq)
 {
 	unsigned int elts_tail = txq->elts_tail;
 	struct mlx4_cq *cq = &txq->mcq;
@@ -355,13 +359,11 @@ struct pv {
 	if (unlikely(!completed))
 		return;
 	/* First stamping address is the end of the last one. */
-	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
+	first_txbb = (&(*txq->elts)[elts_tail & elts_m])->eocb;
 	elts_tail += completed;
-	if (elts_tail >= elts_n)
-		elts_tail -= elts_n;
 	/* The new tail element holds the end address. */
 	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
-		(&(*txq->elts)[elts_tail])->eocb);
+		(&(*txq->elts)[elts_tail & elts_m])->eocb);
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
@@ -580,6 +582,7 @@ struct pv {
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
+	const unsigned int elts_m = elts_n - 1;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
@@ -589,24 +592,20 @@ struct pv {
 
 	assert(txq->elts_comp_cd != 0);
 	if (likely(txq->elts_comp != 0))
-		mlx4_txq_complete(txq, elts_n, sq);
+		mlx4_txq_complete(txq, elts_m, sq);
 	max = (elts_n - (elts_head - txq->elts_tail));
-	if (max > elts_n)
-		max -= elts_n;
 	assert(max >= 1);
 	assert(max <= elts_n);
 	/* Always leave one free entry in the ring. */
 	--max;
 	if (max > pkts_n)
 		max = pkts_n;
-	elt = &(*txq->elts)[elts_head];
+	elt = &(*txq->elts)[elts_head & elts_m];
 	/* First Tx burst element saves the next WQE control segment. */
 	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
-		unsigned int elts_head_next =
-			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
-		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
+		struct txq_elt *elt_next = &(*txq->elts)[++elts_head & elts_m];
 		uint32_t owner_opcode = sq->owner_opcode;
 		volatile struct mlx4_wqe_data_seg *dseg =
 				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
@@ -725,7 +724,6 @@ struct pv {
 		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
-		elts_head = elts_head_next;
 		ctrl = ctrl_next;
 		elt = elt_next;
 	}
@@ -741,7 +739,7 @@ struct pv {
 	rte_wmb();
 	/* Ring QP doorbell. */
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
-	txq->elts_head = elts_head;
+	txq->elts_head += i;
 	txq->elts_comp += i;
 	return i;
 }
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 4c7b62a..7eb4b04 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -76,17 +76,16 @@
 	unsigned int elts_head = txq->elts_head;
 	unsigned int elts_tail = txq->elts_tail;
 	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
+	unsigned int elts_m = txq->elts_n - 1;
 
 	DEBUG("%p: freeing WRs", (void *)txq);
 	while (elts_tail != elts_head) {
-		struct txq_elt *elt = &(*elts)[elts_tail];
+		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
 
 		assert(elt->buf != NULL);
 		rte_pktmbuf_free(elt->buf);
 		elt->buf = NULL;
 		elt->wqe = NULL;
-		if (++elts_tail == RTE_DIM(*elts))
-			elts_tail = 0;
 	}
 	txq->elts_tail = txq->elts_head;
 }
@@ -208,7 +207,7 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4dv_obj mlxdv;
 	struct mlx4dv_qp dv_qp;
 	struct mlx4dv_cq dv_cq;
-	struct txq_elt (*elts)[desc];
+	struct txq_elt (*elts)[rte_align32pow2(desc)];
 	struct ibv_qp_init_attr qp_init_attr;
 	struct txq *txq;
 	uint8_t *bounce_buf;
@@ -252,6 +251,12 @@ struct txq_mp2mr_mbuf_check_data {
 		ERROR("%p: invalid number of Tx descriptors", (void *)dev);
 		return -rte_errno;
 	}
+	if (desc != RTE_DIM(*elts)) {
+		desc = RTE_DIM(*elts);
+		WARN("%p: increased number of descriptors in Tx queue %u"
+		     " to the next power of two (%u)",
+		     (void *)dev, idx, desc);
+	}
 	/* Allocate and initialize Tx queue. */
 	mlx4_zmallocv_socket("TXQ", vec, RTE_DIM(vec), socket);
 	if (!txq) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 8/8] net/mlx4: remove Tx completion elements counter
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
                     ` (6 preceding siblings ...)
  2017-12-06 14:48   ` [PATCH v2 7/8] net/mlx4: align Tx descriptors number Matan Azrad
@ 2017-12-06 14:48   ` Matan Azrad
  2017-12-06 16:22     ` Adrien Mazarguil
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
  8 siblings, 1 reply; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 14:48 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

This counter saved the descriptor elements which are waiting to be
completted and was used to know if completion function should be
called.

This completion check can be done by other elements management
variables and we can prevent this counter management.

Remove this counter and replace the completion check easily by other
elements management variables.
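
A minimal sketch of the replacement check (tx_completion_due is an
illustrative name, not from the patch): the head/tail difference
already gives the number of elements awaiting completion, so a
completion pass is only worthwhile once at least one completion period
of packets is in flight.

/* Completion pass trigger without a dedicated counter. */
static inline int
tx_completion_due(unsigned int elts_head, unsigned int elts_tail,
		  unsigned int elts_comp_cd_init)
{
	/* Indexes are free running, so the difference stays valid on wrap. */
	unsigned int in_flight = elts_head - elts_tail;

	return in_flight >= elts_comp_cd_init;
}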

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 8 +++-----
 drivers/net/mlx4/mlx4_rxtx.h | 1 -
 drivers/net/mlx4/mlx4_txq.c  | 1 -
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 14192fe..1b598f2 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -367,7 +367,6 @@ struct pv {
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
-	txq->elts_comp -= completed;
 	txq->elts_tail = elts_tail;
 }
 
@@ -585,15 +584,15 @@ struct pv {
 	const unsigned int elts_m = elts_n - 1;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
-	unsigned int max;
+	unsigned int max = elts_head - txq->elts_tail;
 	struct mlx4_sq *sq = &txq->msq;
 	volatile struct mlx4_wqe_ctrl_seg *ctrl;
 	struct txq_elt *elt;
 
 	assert(txq->elts_comp_cd != 0);
-	if (likely(txq->elts_comp != 0))
+	if (likely(max >= txq->elts_comp_cd_init))
 		mlx4_txq_complete(txq, elts_m, sq);
-	max = (elts_n - (elts_head - txq->elts_tail));
+	max = elts_n - max;
 	assert(max >= 1);
 	assert(max <= elts_n);
 	/* Always leave one free entry in the ring. */
@@ -740,7 +739,6 @@ struct pv {
 	/* Ring QP doorbell. */
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head += i;
-	txq->elts_comp += i;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 36ae03a..b93e2bc 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -125,7 +125,6 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of packets awaiting completion. */
 	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7eb4b04..0c35935 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -274,7 +274,6 @@ struct txq_mp2mr_mbuf_check_data {
 		.elts = elts,
 		.elts_head = 0,
 		.elts_tail = 0,
-		.elts_comp = 0,
 		/*
 		 * Request send completion every MLX4_PMD_TX_PER_COMP_REQ
 		 * packets or at least 4 times per ring.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/8] net/mlx4: optimize Tx multi-segment case
  2017-12-06 14:48   ` [PATCH v2 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
@ 2017-12-06 16:22     ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 16:22 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 02:48:09PM +0000, Matan Azrad wrote:
> mlx4 Tx block can handle up to 4 data segments or control segment + up
> to 3 data segments. The first data segment in each not first Tx block
> must validate Tx queue wraparound and must use IO memory barrier before
> writing the byte count.
> 
> The previous multi-segment code used "for" loop to iterate over all
> packet segments and separated first Tx block data case by "if"
> statements.
> 
> Use switch case and unconditional branches instead of "for" loop can
> optimize the case and prevents the unnecessary checks for each data
> segment; This hints to compiler to create optimized jump table.
> 
> Optimize this case by switch case and unconditional branches usage.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 5/8] net/mlx4: merge Tx queue rings management
  2017-12-06 14:48   ` [PATCH v2 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
@ 2017-12-06 16:22     ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 16:22 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 02:48:10PM +0000, Matan Azrad wrote:
> The Tx queue send ring was managed by Tx block head, tail, count and
> mask management variables which were used to track the remaining send
> queue space and the next places of empty or completed work queue
> entries.
> 
> This method suffered from actual address recalculation per packet,
> unnecessary Tx block based calculations and an expensive dual
> management of the Tx rings.
> 
> Move the send queue ring calculation to be based on actual addresses
> while managing it by descriptor ring indexes.
> 
> Add a new work queue entry pointer to the descriptor element to hold
> the appropriate entry in the send queue.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

A few more comments on this version below.

<snip>
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index adf02c0..2467d1d 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -61,9 +61,6 @@
>  #include "mlx4_rxtx.h"
>  #include "mlx4_utils.h"
>  
> -#define WQE_ONE_DATA_SEG_SIZE \
> -	(sizeof(struct mlx4_wqe_ctrl_seg) + sizeof(struct mlx4_wqe_data_seg))
> -
>  /**
>   * Pointer-value pair structure used in tx_post_send for saving the first
>   * DWORD (32 byte) of a TXBB.
> @@ -268,52 +265,48 @@ struct pv {
>   *
>   * @param sq
>   *   Pointer to the SQ structure.
> - * @param index
> - *   Index of the freed WQE.
> - * @param num_txbbs
> - *   Number of blocks to stamp.
> - *   If < 0 the routine will use the size written in the WQ entry.
> - * @param owner
> - *   The value of the WQE owner bit to use in the stamp.
> + * @param wqe
> + *   Pointer of WQE address to stamp.

Clarification on what happens on return is primarily needed here actually:

 @param[in, out] wqe
   Pointer of WQE address to stamp. This value is modified on return to
   store the address of the next WQE.

>   *
>   * @return
> - *   The number of Tx basic blocs (TXBB) the WQE contained.
> + *   WQE size and updates WQE address to the next WQE.

You can leave the previous comment if @param wqe is properly documented.

<snip>
> @@ -654,7 +639,7 @@ struct pv {
>  
>  #ifndef NDEBUG
>  			/* Poisoning. */
> -			memset(elt, 0x66, sizeof(*elt));
> +			memset(elt->buf, 0x66, sizeof(struct rte_mbuf));

This likely causes a crash (did you test in debug mode?); the goal is
to poison the buffer address, not the entire mbuf. This should read:

 memset(&elt->buf, 0x66, sizeof(elt->buf));

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 7/8] net/mlx4: align Tx descriptors number
  2017-12-06 14:48   ` [PATCH v2 7/8] net/mlx4: align Tx descriptors number Matan Azrad
@ 2017-12-06 16:22     ` Adrien Mazarguil
  2017-12-06 17:24       ` Matan Azrad
  0 siblings, 1 reply; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 16:22 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 02:48:12PM +0000, Matan Azrad wrote:
> Using power of 2 descriptors number makes the ring management easier
> and allows to use mask operation instead of wraparound conditions.
> 
> Adjust Tx descriptor number to be power of 2 and change calculation to
> use mask accordingly.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 28 +++++++++++++---------------
>  drivers/net/mlx4/mlx4_txq.c  | 13 +++++++++----
>  2 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index 8b8d95e..14192fe 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -312,10 +312,14 @@ struct pv {
>   *
>   * @param txq
>   *   Pointer to Tx queue structure.
> + * @param sq
> + *   Pointer to the SQ structure.
> + * @param elts_m
> + *   Tx elements number mask.

It's minor, but these parameters should be described in the same order
as they appear in the function prototype; please swap them if you send
an updated series.

>   */
>  static void
> -mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
> -				  struct mlx4_sq *sq)
> +mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
> +		  struct mlx4_sq *sq)
>  {
<snip>
> diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
> index 4c7b62a..7eb4b04 100644
> --- a/drivers/net/mlx4/mlx4_txq.c
> +++ b/drivers/net/mlx4/mlx4_txq.c
> @@ -76,17 +76,16 @@
>  	unsigned int elts_head = txq->elts_head;
>  	unsigned int elts_tail = txq->elts_tail;
>  	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
> +	unsigned int elts_m = txq->elts_n - 1;
>  
>  	DEBUG("%p: freeing WRs", (void *)txq);
>  	while (elts_tail != elts_head) {
> -		struct txq_elt *elt = &(*elts)[elts_tail];
> +		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
>  
>  		assert(elt->buf != NULL);
>  		rte_pktmbuf_free(elt->buf);
>  		elt->buf = NULL;
>  		elt->wqe = NULL;
> -		if (++elts_tail == RTE_DIM(*elts))
> -			elts_tail = 0;
>  	}
>  	txq->elts_tail = txq->elts_head;
>  }
> @@ -208,7 +207,7 @@ struct txq_mp2mr_mbuf_check_data {
>  	struct mlx4dv_obj mlxdv;
>  	struct mlx4dv_qp dv_qp;
>  	struct mlx4dv_cq dv_cq;
> -	struct txq_elt (*elts)[desc];
> +	struct txq_elt (*elts)[rte_align32pow2(desc)];

OK, I'm curious about what happened to the magic 0x1000 though? Was it a
limitation or some leftover debugging code?

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 8/8] net/mlx4: remove Tx completion elements counter
  2017-12-06 14:48   ` [PATCH v2 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
@ 2017-12-06 16:22     ` Adrien Mazarguil
  0 siblings, 0 replies; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-06 16:22 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 02:48:13PM +0000, Matan Azrad wrote:
> This counter saved the descriptor elements which are waiting to be
> completted and was used to know if completion function should be

Looks like you forgot one minor change before adding my ack:

completted => completed

> called.
> 
> This completion check can be done by other elements management
> variables and we can prevent this counter management.
> 
> Remove this counter and replace the completion check easily by other
> elements management variables.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 7/8] net/mlx4: align Tx descriptors number
  2017-12-06 16:22     ` Adrien Mazarguil
@ 2017-12-06 17:24       ` Matan Azrad
  0 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:24 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, December 6, 2017 6:23 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 7/8] net/mlx4: align Tx descriptors number
> 
> On Wed, Dec 06, 2017 at 02:48:12PM +0000, Matan Azrad wrote:
> > Using power of 2 descriptors number makes the ring management easier
> > and allows to use mask operation instead of wraparound conditions.
> >
> > Adjust Tx descriptor number to be power of 2 and change calculation to
> > use mask accordingly.
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 28 +++++++++++++---------------
> > drivers/net/mlx4/mlx4_txq.c  | 13 +++++++++----
> >  2 files changed, 22 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > b/drivers/net/mlx4/mlx4_rxtx.c index 8b8d95e..14192fe 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -312,10 +312,14 @@ struct pv {
> >   *
> >   * @param txq
> >   *   Pointer to Tx queue structure.
> > + * @param sq
> > + *   Pointer to the SQ structure.
> > + * @param elts_m
> > + *   Tx elements number mask.
> 
> It's minor however these parameters should be described in the same order
> as they appear in the function prototype, please swap them if you send an
> updated series.
> 
> >   */
> >  static void
> > -mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
> > -				  struct mlx4_sq *sq)
> > +mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
> > +		  struct mlx4_sq *sq)
> >  {
> <snip>
> > diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
> > index 4c7b62a..7eb4b04 100644
> > --- a/drivers/net/mlx4/mlx4_txq.c
> > +++ b/drivers/net/mlx4/mlx4_txq.c
> > @@ -76,17 +76,16 @@
> >  	unsigned int elts_head = txq->elts_head;
> >  	unsigned int elts_tail = txq->elts_tail;
> >  	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
> > +	unsigned int elts_m = txq->elts_n - 1;
> >
> >  	DEBUG("%p: freeing WRs", (void *)txq);
> >  	while (elts_tail != elts_head) {
> > -		struct txq_elt *elt = &(*elts)[elts_tail];
> > +		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
> >
> >  		assert(elt->buf != NULL);
> >  		rte_pktmbuf_free(elt->buf);
> >  		elt->buf = NULL;
> >  		elt->wqe = NULL;
> > -		if (++elts_tail == RTE_DIM(*elts))
> > -			elts_tail = 0;
> >  	}
> >  	txq->elts_tail = txq->elts_head;
> >  }
> > @@ -208,7 +207,7 @@ struct txq_mp2mr_mbuf_check_data {
> >  	struct mlx4dv_obj mlxdv;
> >  	struct mlx4dv_qp dv_qp;
> >  	struct mlx4dv_cq dv_cq;
> > -	struct txq_elt (*elts)[desc];
> > +	struct txq_elt (*elts)[rte_align32pow2(desc)];
> 
> OK, I'm curious about what happened to the magic 0x1000 though? Was it a
> limitation or some leftover debugging code?
> 
It was a wrong limitation on the maximum number of descriptors.
Thanks again for the second good review. I will address all your comments in v3.
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 0/8] improve mlx4 Tx performance
  2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
                     ` (7 preceding siblings ...)
  2017-12-06 14:48   ` [PATCH v2 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
@ 2017-12-06 17:57   ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
                       ` (8 more replies)
  8 siblings, 9 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

This series improves mlx4 Tx performance and fix and clean some Tx code. 
1. 10% MPPS improvement for 1 queue, 1 core, 64B packets, txonly mode.
2. 20% MPPS improvement for 1 queue, 1 core, 32B*4(segs) packets, txonly mode.

V2:
Add missing function descriptions.
More accurate descriptions.
Change Tx descriptor alignment to be like Rx.
Move mlx4_fill_tx_data_seg to mlx4_rxtx.c and use rte_be32_t for byte count.
Change remain_size type to uint32_t.
Poisoning with memset.

V3:
More accurate descriptions.
Fix poisoning from v2.

Matan Azrad (8):
  net/mlx4: fix Tx packet drop application report
  net/mlx4: remove unnecessary Tx wraparound checks
  net/mlx4: remove restamping from Tx error path
  net/mlx4: optimize Tx multi-segment case
  net/mlx4: merge Tx queue rings management
  net/mlx4: mitigate Tx send entry size calculations
  net/mlx4: align Tx descriptors number
  net/mlx4: remove Tx completion elements counter

 drivers/net/mlx4/mlx4_prm.h  |  20 +-
 drivers/net/mlx4/mlx4_rxtx.c | 492 +++++++++++++++++++++----------------------
 drivers/net/mlx4/mlx4_rxtx.h |   5 +-
 drivers/net/mlx4/mlx4_txq.c  |  37 ++--
 4 files changed, 279 insertions(+), 275 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 1/8] net/mlx4: fix Tx packet drop application report
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, stable

When invalid lkey is sent to HW, HW sends an error notification in
completion function.

The previous code wouldn't crash but doesn't add any application report
in case of completion error, so application cannot know that packet
actually was dropped in case of invalid lkey.

Return back the lkey validation to Tx path.

Fixes: 2eee458746bc ("net/mlx4: remove error flows from Tx fast path")
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 2bfa8b1..0d008ed 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -468,7 +468,6 @@ struct pv {
 		/* Memory region key (big endian) for this memory pool. */
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
 		dseg->lkey = rte_cpu_to_be_32(lkey);
-#ifndef NDEBUG
 		/* Calculate the needed work queue entry size for this packet */
 		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
@@ -486,7 +485,6 @@ struct pv {
 					(sq->head & sq->txbb_cnt) ? 0 : 1);
 			return -1;
 		}
-#endif /* NDEBUG */
 		if (likely(sbuf->data_len)) {
 			byte_count = rte_cpu_to_be_32(sbuf->data_len);
 		} else {
@@ -636,7 +634,6 @@ struct pv {
 			/* Memory region key (big endian). */
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
 			dseg->lkey = rte_cpu_to_be_32(lkey);
-#ifndef NDEBUG
 			if (unlikely(dseg->lkey ==
 				rte_cpu_to_be_32((uint32_t)-1))) {
 				/* MR does not exist. */
@@ -655,7 +652,6 @@ struct pv {
 				elt->buf = NULL;
 				break;
 			}
-#endif /* NDEBUG */
 			/* Never be TXBB aligned, no need compiler barrier. */
 			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
 			/* Fill the control parameters for this packet. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 2/8] net/mlx4: remove unnecessary Tx wraparound checks
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

There is no need to check for Tx queue wraparound for segments which
are not at the beginning of a Tx block. This is especially relevant in
the single-segment case.

Remove unnecessary aforementioned checks from Tx path.
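
For illustration (dseg_starts_txbb is a made-up helper; MLX4_TXBB_SIZE
is the driver's 64-byte Tx basic block size): only a data segment that
starts a new TXBB can reach the end of the SQ buffer, so the wraparound
test is meaningful there alone.

/* A data segment can only cross sq->eob when it begins a new TXBB. */
static inline int
dseg_starts_txbb(volatile struct mlx4_wqe_data_seg *dseg)
{
	return !((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1));
}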

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 0d008ed..9a32b3f 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -461,15 +461,11 @@ struct pv {
 	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
 		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
 		rte_prefetch0((volatile void *)addr);
-		/* Handle WQE wraparound. */
-		if (dseg >= (volatile struct mlx4_wqe_data_seg *)sq->eob)
-			dseg = (volatile struct mlx4_wqe_data_seg *)sq->buf;
-		dseg->addr = rte_cpu_to_be_64(addr);
 		/* Memory region key (big endian) for this memory pool. */
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
 		dseg->lkey = rte_cpu_to_be_32(lkey);
 		/* Calculate the needed work queue entry size for this packet */
-		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
+		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
 			DEBUG("%p: unable to get MP <-> MR association",
 					(void *)txq);
@@ -501,6 +497,8 @@ struct pv {
 		 * control segment.
 		 */
 		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+			dseg->addr = rte_cpu_to_be_64(addr);
+			dseg->lkey = rte_cpu_to_be_32(lkey);
 #if RTE_CACHE_LINE_SIZE < 64
 			/*
 			 * Need a barrier here before writing the byte_count
@@ -520,6 +518,13 @@ struct pv {
 			 * TXBB, so we need to postpone its byte_count writing
 			 * for later.
 			 */
+			/* Handle WQE wraparound. */
+			if (dseg >=
+			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
+				dseg = (volatile struct mlx4_wqe_data_seg *)
+					sq->buf;
+			dseg->addr = rte_cpu_to_be_64(addr);
+			dseg->lkey = rte_cpu_to_be_32(lkey);
 			pv[pv_counter].dseg = dseg;
 			pv[pv_counter++].val = byte_count;
 		}
@@ -625,11 +630,6 @@ struct pv {
 					sizeof(struct mlx4_wqe_ctrl_seg));
 			addr = rte_pktmbuf_mtod(buf, uintptr_t);
 			rte_prefetch0((volatile void *)addr);
-			/* Handle WQE wraparound. */
-			if (dseg >=
-				(volatile struct mlx4_wqe_data_seg *)sq->eob)
-				dseg = (volatile struct mlx4_wqe_data_seg *)
-						sq->buf;
 			dseg->addr = rte_cpu_to_be_64(addr);
 			/* Memory region key (big endian). */
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 3/8] net/mlx4: remove restamping from Tx error path
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

At error time, the first 4 bytes of each WQE Tx block have not been
written yet, so there is no need to stamp them: they still carry their
previous stamp.

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 9a32b3f..1d8240a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -468,17 +468,7 @@ struct pv {
 		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
 			/* MR does not exist. */
 			DEBUG("%p: unable to get MP <-> MR association",
-					(void *)txq);
-			/*
-			 * Restamp entry in case of failure.
-			 * Make sure that size is written correctly
-			 * Note that we give ownership to the SW, not the HW.
-			 */
-			wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
-				buf->nb_segs * sizeof(struct mlx4_wqe_data_seg);
-			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-			mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					(sq->head & sq->txbb_cnt) ? 0 : 1);
+			      (void *)txq);
 			return -1;
 		}
 		if (likely(sbuf->data_len)) {
@@ -639,16 +629,6 @@ struct pv {
 				/* MR does not exist. */
 				DEBUG("%p: unable to get MP <-> MR association",
 				      (void *)txq);
-				/*
-				 * Restamp entry in case of failure.
-				 * Make sure that size is written correctly
-				 * Note that we give ownership to the SW,
-				 * not the HW.
-				 */
-				ctrl->fence_size =
-					(WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
-				mlx4_txq_stamp_freed_wqe(sq, head_idx,
-					     (sq->head & sq->txbb_cnt) ? 0 : 1);
 				elt->buf = NULL;
 				break;
 			}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 4/8] net/mlx4: optimize Tx multi-segment case
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
                       ` (2 preceding siblings ...)
  2017-12-06 17:57     ` [PATCH v3 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

An mlx4 Tx block can hold up to 4 data segments, or a control segment
plus up to 3 data segments. The first data segment of every Tx block
except the first one must validate Tx queue wraparound and must use an
IO memory barrier before writing the byte count.

The previous multi-segment code used a "for" loop to iterate over all
packet segments and separated the first Tx block data case with "if"
statements.

Using a switch statement and unconditional branches instead of the
"for" loop avoids the unnecessary checks for each data segment and
hints the compiler to create an optimized jump table.

Optimize this case by using a switch statement and unconditional
branches.
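
As a rough sketch of the technique only (the struct and function names
below are invented for illustration and the postponed byte_count
handling of TXBB head segments is left out), a switch on the remaining
segment count with explicit fallthrough replaces the per-segment loop
tests:

#include <stdint.h>

/* Hypothetical stand-ins for an mbuf chain and a WQE data segment. */
struct pkt { const void *data; uint32_t len; struct pkt *next; };
struct seg { uint64_t addr; uint32_t len; };

/*
 * Fill data segments for one packet. The switch with explicit
 * fallthrough lets the compiler build a jump table and removes the
 * per-segment condition tests a plain "for" loop would need.
 */
static void
fill_segments(struct seg *dseg, const struct pkt *p, int nb_segs)
{
	while (nb_segs > 0) {
		switch (nb_segs) {
		default: /* more than two left: fill one and fall through */
			dseg->addr = (uintptr_t)p->data;
			dseg->len = p->len;
			dseg++; p = p->next; nb_segs--;
			/* fallthrough */
		case 2:
			dseg->addr = (uintptr_t)p->data;
			dseg->len = p->len;
			dseg++; p = p->next; nb_segs--;
			/* fallthrough */
		case 1:
			dseg->addr = (uintptr_t)p->data;
			dseg->len = p->len;
			nb_segs--;
			if (nb_segs) {
				dseg++;
				p = p->next;
			}
			break;
		}
	}
}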

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 198 ++++++++++++++++++++++++++++---------------
 1 file changed, 128 insertions(+), 70 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 1d8240a..adf02c0 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -421,6 +421,39 @@ struct pv {
 	return buf->pool;
 }
 
+/**
+ * Write Tx data segment to the SQ.
+ *
+ * @param dseg
+ *   Pointer to data segment in SQ.
+ * @param lkey
+ *   Memory region lkey.
+ * @param addr
+ *   Data address.
+ * @param byte_count
+ *   Big endian bytes count of the data to send.
+ */
+static inline void
+mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
+		       uint32_t lkey, uintptr_t addr, rte_be32_t  byte_count)
+{
+	dseg->addr = rte_cpu_to_be_64(addr);
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+#if RTE_CACHE_LINE_SIZE < 64
+	/*
+	 * Need a barrier here before writing the byte_count
+	 * fields to make sure that all the data is visible
+	 * before the byte_count field is set.
+	 * Otherwise, if the segment begins a new cacheline,
+	 * the HCA prefetcher could grab the 64-byte chunk and
+	 * get a valid (!= 0xffffffff) byte count but stale
+	 * data, and end up sending the wrong data.
+	 */
+	rte_io_wmb();
+#endif /* RTE_CACHE_LINE_SIZE */
+	dseg->byte_count = byte_count;
+}
+
 static int
 mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
 		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
@@ -432,15 +465,14 @@ struct pv {
 	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 	volatile struct mlx4_wqe_ctrl_seg *ctrl;
 	volatile struct mlx4_wqe_data_seg *dseg;
-	struct rte_mbuf *sbuf;
+	struct rte_mbuf *sbuf = buf;
 	uint32_t lkey;
-	uintptr_t addr;
-	uint32_t byte_count;
 	int pv_counter = 0;
+	int nb_segs = buf->nb_segs;
 
 	/* Calculate the needed work queue entry size for this packet. */
 	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
-		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
+		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
 	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
 	/*
 	 * Check that there is room for this WQE in the send queue and that
@@ -457,67 +489,99 @@ struct pv {
 	dseg = (volatile struct mlx4_wqe_data_seg *)
 			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
 	*pctrl = ctrl;
-	/* Fill the data segments with buffer information. */
-	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
-		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
-		rte_prefetch0((volatile void *)addr);
-		/* Memory region key (big endian) for this memory pool. */
+	/*
+	 * Fill the data segments with buffer information.
+	 * First WQE TXBB head segment is always control segment,
+	 * so jump to tail TXBB data segments code for the first
+	 * WQE data segments filling.
+	 */
+	goto txbb_tail_segs;
+txbb_head_seg:
+	/* Memory region key (big endian) for this memory pool. */
+	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+	if (unlikely(lkey == (uint32_t)-1)) {
+		DEBUG("%p: unable to get MP <-> MR association",
+		      (void *)txq);
+		return -1;
+	}
+	/* Handle WQE wraparound. */
+	if (dseg >=
+		(volatile struct mlx4_wqe_data_seg *)sq->eob)
+		dseg = (volatile struct mlx4_wqe_data_seg *)
+			sq->buf;
+	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf, uintptr_t));
+	dseg->lkey = rte_cpu_to_be_32(lkey);
+	/*
+	 * This data segment starts at the beginning of a new
+	 * TXBB, so we need to postpone its byte_count writing
+	 * for later.
+	 */
+	pv[pv_counter].dseg = dseg;
+	/*
+	 * Zero length segment is treated as inline segment
+	 * with zero data.
+	 */
+	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
+						sbuf->data_len : 0x80000000);
+	sbuf = sbuf->next;
+	dseg++;
+	nb_segs--;
+txbb_tail_segs:
+	/* Jump to default if there are more than two segments remaining. */
+	switch (nb_segs) {
+	default:
 		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
-		dseg->lkey = rte_cpu_to_be_32(lkey);
-		/* Calculate the needed work queue entry size for this packet */
-		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
-			/* MR does not exist. */
+		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
 			return -1;
 		}
-		if (likely(sbuf->data_len)) {
-			byte_count = rte_cpu_to_be_32(sbuf->data_len);
-		} else {
-			/*
-			 * Zero length segment is treated as inline segment
-			 * with zero data.
-			 */
-			byte_count = RTE_BE32(0x80000000);
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		sbuf = sbuf->next;
+		dseg++;
+		nb_segs--;
+		/* fallthrough */
+	case 2:
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			return -1;
 		}
-		/*
-		 * If the data segment is not at the beginning of a
-		 * Tx basic block (TXBB) then write the byte count,
-		 * else postpone the writing to just before updating the
-		 * control segment.
-		 */
-		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
-			dseg->addr = rte_cpu_to_be_64(addr);
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-#if RTE_CACHE_LINE_SIZE < 64
-			/*
-			 * Need a barrier here before writing the byte_count
-			 * fields to make sure that all the data is visible
-			 * before the byte_count field is set.
-			 * Otherwise, if the segment begins a new cacheline,
-			 * the HCA prefetcher could grab the 64-byte chunk and
-			 * get a valid (!= 0xffffffff) byte count but stale
-			 * data, and end up sending the wrong data.
-			 */
-			rte_io_wmb();
-#endif /* RTE_CACHE_LINE_SIZE */
-			dseg->byte_count = byte_count;
-		} else {
-			/*
-			 * This data segment starts at the beginning of a new
-			 * TXBB, so we need to postpone its byte_count writing
-			 * for later.
-			 */
-			/* Handle WQE wraparound. */
-			if (dseg >=
-			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
-				dseg = (volatile struct mlx4_wqe_data_seg *)
-					sq->buf;
-			dseg->addr = rte_cpu_to_be_64(addr);
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-			pv[pv_counter].dseg = dseg;
-			pv[pv_counter++].val = byte_count;
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		sbuf = sbuf->next;
+		dseg++;
+		nb_segs--;
+		/* fallthrough */
+	case 1:
+		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+		if (unlikely(lkey == (uint32_t)-1)) {
+			DEBUG("%p: unable to get MP <-> MR association",
+			      (void *)txq);
+			return -1;
+		}
+		mlx4_fill_tx_data_seg(dseg, lkey,
+				      rte_pktmbuf_mtod(sbuf, uintptr_t),
+				      rte_cpu_to_be_32(sbuf->data_len ?
+						       sbuf->data_len :
+						       0x80000000));
+		nb_segs--;
+		if (nb_segs) {
+			sbuf = sbuf->next;
+			dseg++;
+			goto txbb_head_seg;
 		}
+		/* fallthrough */
+	case 0:
+		break;
 	}
 	/* Write the first DWORD of each TXBB save earlier. */
 	if (pv_counter) {
@@ -583,7 +647,6 @@ struct pv {
 		} srcrb;
 		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 		uint32_t lkey;
-		uintptr_t addr;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -618,24 +681,19 @@ struct pv {
 			dseg = (volatile struct mlx4_wqe_data_seg *)
 					((uintptr_t)ctrl +
 					sizeof(struct mlx4_wqe_ctrl_seg));
-			addr = rte_pktmbuf_mtod(buf, uintptr_t);
-			rte_prefetch0((volatile void *)addr);
-			dseg->addr = rte_cpu_to_be_64(addr);
-			/* Memory region key (big endian). */
+
+			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-			dseg->lkey = rte_cpu_to_be_32(lkey);
-			if (unlikely(dseg->lkey ==
-				rte_cpu_to_be_32((uint32_t)-1))) {
+			if (unlikely(lkey == (uint32_t)-1)) {
 				/* MR does not exist. */
 				DEBUG("%p: unable to get MP <-> MR association",
 				      (void *)txq);
 				elt->buf = NULL;
 				break;
 			}
-			/* Never be TXBB aligned, no need compiler barrier. */
-			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
-			/* Fill the control parameters for this packet. */
-			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
+			mlx4_fill_tx_data_seg(dseg, lkey,
+					      rte_pktmbuf_mtod(buf, uintptr_t),
+					      rte_cpu_to_be_32(buf->data_len));
 			nr_txbbs = 1;
 		} else {
 			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 5/8] net/mlx4: merge Tx queue rings management
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
                       ` (3 preceding siblings ...)
  2017-12-06 17:57     ` [PATCH v3 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

The Tx queue send ring was managed by Tx block head, tail, count and
mask variables used to track the remaining send queue space and the
locations of the next empty or completed work queue entries.

This method suffered from address recalculation per packet, unnecessary
Tx-block-based calculations and an expensive dual management of the Tx
rings.

Move the send queue ring calculation to be based on actual addresses
while managing it through the descriptor ring indexes.

Add a new work queue entry pointer to the descriptor element to hold
the appropriate entry in the send queue.
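
A simplified sketch of the idea (reduced, hypothetical structures; the
real driver also handles stamping and ownership): the next WQE address
is derived directly from the previous one and wrapped by address, while
free space is tracked in bytes.

#include <stdint.h>

/* Hypothetical reduced view of the send queue. */
struct sq {
	uint8_t *buf;		/* start of the SQ buffer */
	uint8_t *eob;		/* end of the SQ buffer */
	uint32_t size;		/* total SQ size in bytes */
	uint32_t remain_size;	/* room left for new WQEs, in bytes */
};

/*
 * Account for a posted WQE and return the address of the next one.
 * The caller is assumed to have already checked remain_size.
 */
static uint8_t *
sq_next_wqe(struct sq *sq, uint8_t *wqe, uint32_t wqe_size)
{
	sq->remain_size -= wqe_size;
	wqe += wqe_size;		/* WQE sizes are TXBB aligned */
	if (wqe >= sq->eob)		/* wrap by address, no index math */
		wqe -= sq->size;
	return wqe;
}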

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_prm.h  |  20 ++--
 drivers/net/mlx4/mlx4_rxtx.c | 254 ++++++++++++++++++++-----------------------
 drivers/net/mlx4/mlx4_rxtx.h |   1 +
 drivers/net/mlx4/mlx4_txq.c  |  23 ++--
 4 files changed, 140 insertions(+), 158 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_prm.h b/drivers/net/mlx4/mlx4_prm.h
index fcc7c12..217ea50 100644
--- a/drivers/net/mlx4/mlx4_prm.h
+++ b/drivers/net/mlx4/mlx4_prm.h
@@ -54,22 +54,18 @@
 
 /* Typical TSO descriptor with 16 gather entries is 352 bytes. */
 #define MLX4_MAX_WQE_SIZE 512
-#define MLX4_MAX_WQE_TXBBS (MLX4_MAX_WQE_SIZE / MLX4_TXBB_SIZE)
+#define MLX4_SEG_SHIFT 4
 
 /* Send queue stamping/invalidating information. */
 #define MLX4_SQ_STAMP_STRIDE 64
 #define MLX4_SQ_STAMP_DWORDS (MLX4_SQ_STAMP_STRIDE / 4)
-#define MLX4_SQ_STAMP_SHIFT 31
+#define MLX4_SQ_OWNER_BIT 31
 #define MLX4_SQ_STAMP_VAL 0x7fffffff
 
 /* Work queue element (WQE) flags. */
-#define MLX4_BIT_WQE_OWN 0x80000000
 #define MLX4_WQE_CTRL_IIP_HDR_CSUM (1 << 28)
 #define MLX4_WQE_CTRL_IL4_HDR_CSUM (1 << 27)
 
-#define MLX4_SIZE_TO_TXBBS(size) \
-	(RTE_ALIGN((size), (MLX4_TXBB_SIZE)) >> (MLX4_TXBB_SHIFT))
-
 /* CQE checksum flags. */
 enum {
 	MLX4_CQE_L2_TUNNEL_IPV4 = (int)(1u << 25),
@@ -98,17 +94,15 @@ enum {
 struct mlx4_sq {
 	volatile uint8_t *buf; /**< SQ buffer. */
 	volatile uint8_t *eob; /**< End of SQ buffer */
-	uint32_t head; /**< SQ head counter in units of TXBBS. */
-	uint32_t tail; /**< SQ tail counter in units of TXBBS. */
-	uint32_t txbb_cnt; /**< Num of WQEBB in the Q (should be ^2). */
-	uint32_t txbb_cnt_mask; /**< txbbs_cnt mask (txbb_cnt is ^2). */
-	uint32_t headroom_txbbs; /**< Num of txbbs that should be kept free. */
+	uint32_t size; /**< SQ size includes headroom. */
+	uint32_t remain_size; /**< Remaining WQE room in SQ (bytes). */
+	uint32_t owner_opcode;
+	/**< Default owner opcode with HW valid owner bit. */
+	uint32_t stamp; /**< Stamp value with an invalid HW owner bit. */
 	volatile uint32_t *db; /**< Pointer to the doorbell. */
 	uint32_t doorbell_qpn; /**< qp number to write to the doorbell. */
 };
 
-#define mlx4_get_send_wqe(sq, n) ((sq)->buf + ((n) * (MLX4_TXBB_SIZE)))
-
 /* Completion queue events, numbers and masks. */
 #define MLX4_CQ_DB_GEQ_N_MASK 0x3
 #define MLX4_CQ_DOORBELL 0x20
diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index adf02c0..ad84c3c 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -61,9 +61,6 @@
 #include "mlx4_rxtx.h"
 #include "mlx4_utils.h"
 
-#define WQE_ONE_DATA_SEG_SIZE \
-	(sizeof(struct mlx4_wqe_ctrl_seg) + sizeof(struct mlx4_wqe_data_seg))
-
 /**
  * Pointer-value pair structure used in tx_post_send for saving the first
  * DWORD (32 byte) of a TXBB.
@@ -268,52 +265,49 @@ struct pv {
  *
  * @param sq
  *   Pointer to the SQ structure.
- * @param index
- *   Index of the freed WQE.
- * @param num_txbbs
- *   Number of blocks to stamp.
- *   If < 0 the routine will use the size written in the WQ entry.
- * @param owner
- *   The value of the WQE owner bit to use in the stamp.
+ * @param[in, out] wqe
+ *   Pointer of WQE address to stamp. This value is modified on return to
+ *   store the address of the next WQE.
  *
  * @return
- *   The number of Tx basic blocs (TXBB) the WQE contained.
+ *   WQE size.
  */
-static int
-mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, uint16_t index, uint8_t owner)
+static uint32_t
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
 {
-	uint32_t stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
-					  (!!owner << MLX4_SQ_STAMP_SHIFT));
-	volatile uint8_t *wqe = mlx4_get_send_wqe(sq,
-						(index & sq->txbb_cnt_mask));
-	volatile uint32_t *ptr = (volatile uint32_t *)wqe;
-	int i;
-	int txbbs_size;
-	int num_txbbs;
-
+	uint32_t stamp = sq->stamp;
+	volatile uint32_t *next_txbb = *wqe;
 	/* Extract the size from the control segment of the WQE. */
-	num_txbbs = MLX4_SIZE_TO_TXBBS((((volatile struct mlx4_wqe_ctrl_seg *)
-					 wqe)->fence_size & 0x3f) << 4);
-	txbbs_size = num_txbbs * MLX4_TXBB_SIZE;
+	uint32_t size = RTE_ALIGN((uint32_t)
+				  ((((volatile struct mlx4_wqe_ctrl_seg *)
+				     next_txbb)->fence_size & 0x3f) << 4),
+				  MLX4_TXBB_SIZE);
+	uint32_t size_cd = size;
+
 	/* Optimize the common case when there is no wrap-around. */
-	if (wqe + txbbs_size <= sq->eob) {
+	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
 		/* Stamp the freed descriptor. */
-		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
-			*ptr = stamp;
-			ptr += MLX4_SQ_STAMP_DWORDS;
-		}
+		do {
+			*next_txbb = stamp;
+			next_txbb += MLX4_SQ_STAMP_DWORDS;
+			size_cd -= MLX4_TXBB_SIZE;
+		} while (size_cd);
 	} else {
 		/* Stamp the freed descriptor. */
-		for (i = 0; i < txbbs_size; i += MLX4_SQ_STAMP_STRIDE) {
-			*ptr = stamp;
-			ptr += MLX4_SQ_STAMP_DWORDS;
-			if ((volatile uint8_t *)ptr >= sq->eob) {
-				ptr = (volatile uint32_t *)sq->buf;
-				stamp ^= RTE_BE32(0x80000000);
+		do {
+			*next_txbb = stamp;
+			next_txbb += MLX4_SQ_STAMP_DWORDS;
+			if ((volatile uint8_t *)next_txbb >= sq->eob) {
+				next_txbb = (volatile uint32_t *)sq->buf;
+				/* Flip invalid stamping ownership. */
+				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
+				sq->stamp = stamp;
 			}
-		}
+			size_cd -= MLX4_TXBB_SIZE;
+		} while (size_cd);
 	}
-	return num_txbbs;
+	*wqe = next_txbb;
+	return size;
 }
 
 /**
@@ -326,24 +320,22 @@ struct pv {
  *
  * @param txq
  *   Pointer to Tx queue structure.
- *
- * @return
- *   0 on success, -1 on failure.
  */
-static int
+static void
 mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
 				  struct mlx4_sq *sq)
 {
-	unsigned int elts_comp = txq->elts_comp;
 	unsigned int elts_tail = txq->elts_tail;
-	unsigned int sq_tail = sq->tail;
 	struct mlx4_cq *cq = &txq->mcq;
 	volatile struct mlx4_cqe *cqe;
 	uint32_t cons_index = cq->cons_index;
-	uint16_t new_index;
-	uint16_t nr_txbbs = 0;
-	int pkts = 0;
-
+	volatile uint32_t *first_wqe;
+	volatile uint32_t *next_wqe = (volatile uint32_t *)
+			((&(*txq->elts)[elts_tail])->wqe);
+	volatile uint32_t *last_wqe;
+	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
+			 MLX4_TXBB_SHIFT) - 1;
+	uint32_t pkts = 0;
 	/*
 	 * Traverse over all CQ entries reported and handle each WQ entry
 	 * reported by them.
@@ -353,11 +345,11 @@ struct pv {
 		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
 		    !!(cons_index & cq->cqe_cnt)))
 			break;
+#ifndef NDEBUG
 		/*
 		 * Make sure we read the CQE after we read the ownership bit.
 		 */
 		rte_io_rmb();
-#ifndef NDEBUG
 		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
 			     MLX4_CQE_OPCODE_ERROR)) {
 			volatile struct mlx4_err_cqe *cqe_err =
@@ -366,41 +358,32 @@ struct pv {
 			      " syndrome: 0x%x\n",
 			      (void *)txq, cqe_err->vendor_err,
 			      cqe_err->syndrome);
+			break;
 		}
 #endif /* NDEBUG */
-		/* Get WQE index reported in the CQE. */
-		new_index =
-			rte_be_to_cpu_16(cqe->wqe_index) & sq->txbb_cnt_mask;
+		/* Get WQE address buy index from the CQE. */
+		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
+			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
+			 MLX4_TXBB_SHIFT));
 		do {
 			/* Free next descriptor. */
-			sq_tail += nr_txbbs;
-			nr_txbbs =
-				mlx4_txq_stamp_freed_wqe(sq,
-				     sq_tail & sq->txbb_cnt_mask,
-				     !!(sq_tail & sq->txbb_cnt));
+			first_wqe = next_wqe;
+			sq->remain_size +=
+				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
 			pkts++;
-		} while ((sq_tail & sq->txbb_cnt_mask) != new_index);
+		} while (first_wqe != last_wqe);
 		cons_index++;
 	} while (1);
 	if (unlikely(pkts == 0))
-		return 0;
-	/* Update CQ. */
+		return;
+	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
-	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
-	sq->tail = sq_tail + nr_txbbs;
-	/* Update the list of packets posted for transmission. */
-	elts_comp -= pkts;
-	assert(elts_comp <= txq->elts_comp);
-	/*
-	 * Assume completion status is successful as nothing can be done about
-	 * it anyway.
-	 */
+	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
+	txq->elts_comp -= pkts;
 	elts_tail += pkts;
 	if (elts_tail >= elts_n)
 		elts_tail -= elts_n;
 	txq->elts_tail = elts_tail;
-	txq->elts_comp = elts_comp;
-	return 0;
 }
 
 /**
@@ -454,41 +437,40 @@ struct pv {
 	dseg->byte_count = byte_count;
 }
 
-static int
+/**
+ * Write data segments of multi-segment packet.
+ *
+ * @param buf
+ *   Pointer to the first packet mbuf.
+ * @param txq
+ *   Pointer to Tx queue structure.
+ * @param ctrl
+ *   Pointer to the WQE control segment.
+ *
+ * @return
+ *   Pointer to the next WQE control segment on success, NULL otherwise.
+ */
+static volatile struct mlx4_wqe_ctrl_seg *
 mlx4_tx_burst_segs(struct rte_mbuf *buf, struct txq *txq,
-		   volatile struct mlx4_wqe_ctrl_seg **pctrl)
+		   volatile struct mlx4_wqe_ctrl_seg *ctrl)
 {
-	int wqe_real_size;
-	int nr_txbbs;
 	struct pv *pv = (struct pv *)txq->bounce_buf;
 	struct mlx4_sq *sq = &txq->msq;
-	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
-	volatile struct mlx4_wqe_ctrl_seg *ctrl;
-	volatile struct mlx4_wqe_data_seg *dseg;
 	struct rte_mbuf *sbuf = buf;
 	uint32_t lkey;
 	int pv_counter = 0;
 	int nb_segs = buf->nb_segs;
+	uint32_t wqe_size;
+	volatile struct mlx4_wqe_data_seg *dseg =
+		(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
 
-	/* Calculate the needed work queue entry size for this packet. */
-	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
-		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
-	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
-	/*
-	 * Check that there is room for this WQE in the send queue and that
-	 * the WQE size is legal.
-	 */
-	if (((sq->head - sq->tail) + nr_txbbs +
-				sq->headroom_txbbs) >= sq->txbb_cnt ||
-			nr_txbbs > MLX4_MAX_WQE_TXBBS) {
-		return -1;
-	}
-	/* Get the control and data entries of the WQE. */
-	ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
-			mlx4_get_send_wqe(sq, head_idx);
-	dseg = (volatile struct mlx4_wqe_data_seg *)
-			((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
-	*pctrl = ctrl;
+	ctrl->fence_size = 1 + nb_segs;
+	wqe_size = RTE_ALIGN((uint32_t)(ctrl->fence_size << MLX4_SEG_SHIFT),
+			     MLX4_TXBB_SIZE);
+	/* Validate WQE size and WQE space in the send queue. */
+	if (sq->remain_size < wqe_size ||
+	    wqe_size > MLX4_MAX_WQE_SIZE)
+		return NULL;
 	/*
 	 * Fill the data segments with buffer information.
 	 * First WQE TXBB head segment is always control segment,
@@ -502,7 +484,7 @@ struct pv {
 	if (unlikely(lkey == (uint32_t)-1)) {
 		DEBUG("%p: unable to get MP <-> MR association",
 		      (void *)txq);
-		return -1;
+		return NULL;
 	}
 	/* Handle WQE wraparound. */
 	if (dseg >=
@@ -534,7 +516,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -550,7 +532,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -566,7 +548,7 @@ struct pv {
 		if (unlikely(lkey == (uint32_t)-1)) {
 			DEBUG("%p: unable to get MP <-> MR association",
 			      (void *)txq);
-			return -1;
+			return NULL;
 		}
 		mlx4_fill_tx_data_seg(dseg, lkey,
 				      rte_pktmbuf_mtod(sbuf, uintptr_t),
@@ -590,9 +572,10 @@ struct pv {
 		for (--pv_counter; pv_counter  >= 0; pv_counter--)
 			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
 	}
-	/* Fill the control parameters for this packet. */
-	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-	return nr_txbbs;
+	sq->remain_size -= wqe_size;
+	/* Align next WQE address to the next TXBB. */
+	return (volatile struct mlx4_wqe_ctrl_seg *)
+		((volatile uint8_t *)ctrl + wqe_size);
 }
 
 /**
@@ -618,7 +601,8 @@ struct pv {
 	unsigned int i;
 	unsigned int max;
 	struct mlx4_sq *sq = &txq->msq;
-	int nr_txbbs;
+	volatile struct mlx4_wqe_ctrl_seg *ctrl;
+	struct txq_elt *elt;
 
 	assert(txq->elts_comp_cd != 0);
 	if (likely(txq->elts_comp != 0))
@@ -632,20 +616,22 @@ struct pv {
 	--max;
 	if (max > pkts_n)
 		max = pkts_n;
+	elt = &(*txq->elts)[elts_head];
+	/* Each element saves its appropriate work queue. */
+	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
 		unsigned int elts_head_next =
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
-		struct txq_elt *elt = &(*txq->elts)[elts_head];
-		uint32_t owner_opcode = MLX4_OPCODE_SEND;
-		volatile struct mlx4_wqe_ctrl_seg *ctrl;
-		volatile struct mlx4_wqe_data_seg *dseg;
+		uint32_t owner_opcode = sq->owner_opcode;
+		volatile struct mlx4_wqe_data_seg *dseg =
+				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
+		volatile struct mlx4_wqe_ctrl_seg *ctrl_next;
 		union {
 			uint32_t flags;
 			uint16_t flags16[2];
 		} srcrb;
-		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
 		uint32_t lkey;
 
 		/* Clean up old buffer. */
@@ -654,7 +640,7 @@ struct pv {
 
 #ifndef NDEBUG
 			/* Poisoning. */
-			memset(elt, 0x66, sizeof(*elt));
+			memset(&elt->buf, 0x66, sizeof(struct rte_mbuf *));
 #endif
 			/* Faster than rte_pktmbuf_free(). */
 			do {
@@ -666,23 +652,11 @@ struct pv {
 		}
 		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
 		if (buf->nb_segs == 1) {
-			/*
-			 * Check that there is room for this WQE in the send
-			 * queue and that the WQE size is legal
-			 */
-			if (((sq->head - sq->tail) + 1 + sq->headroom_txbbs) >=
-			     sq->txbb_cnt || 1 > MLX4_MAX_WQE_TXBBS) {
+			/* Validate WQE space in the send queue. */
+			if (sq->remain_size < MLX4_TXBB_SIZE) {
 				elt->buf = NULL;
 				break;
 			}
-			/* Get the control and data entries of the WQE. */
-			ctrl = (volatile struct mlx4_wqe_ctrl_seg *)
-					mlx4_get_send_wqe(sq, head_idx);
-			dseg = (volatile struct mlx4_wqe_data_seg *)
-					((uintptr_t)ctrl +
-					sizeof(struct mlx4_wqe_ctrl_seg));
-
-			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
 			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
 			if (unlikely(lkey == (uint32_t)-1)) {
 				/* MR does not exist. */
@@ -691,23 +665,33 @@ struct pv {
 				elt->buf = NULL;
 				break;
 			}
-			mlx4_fill_tx_data_seg(dseg, lkey,
+			mlx4_fill_tx_data_seg(dseg++, lkey,
 					      rte_pktmbuf_mtod(buf, uintptr_t),
 					      rte_cpu_to_be_32(buf->data_len));
-			nr_txbbs = 1;
+			/* Set WQE size in 16-byte units. */
+			ctrl->fence_size = 0x2;
+			sq->remain_size -= MLX4_TXBB_SIZE;
+			/* Align next WQE address to the next TXBB. */
+			ctrl_next = ctrl + 0x4;
 		} else {
-			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
-			if (nr_txbbs < 0) {
+			ctrl_next = mlx4_tx_burst_segs(buf, txq, ctrl);
+			if (!ctrl_next) {
 				elt->buf = NULL;
 				break;
 			}
 		}
+		/* Hold SQ ring wrap around. */
+		if ((volatile uint8_t *)ctrl_next >= sq->eob) {
+			ctrl_next = (volatile struct mlx4_wqe_ctrl_seg *)
+				((volatile uint8_t *)ctrl_next - sq->size);
+			/* Flip HW valid ownership. */
+			sq->owner_opcode ^= 0x1 << MLX4_SQ_OWNER_BIT;
+		}
 		/*
 		 * For raw Ethernet, the SOLICIT flag is used to indicate
 		 * that no ICRC should be calculated.
 		 */
-		txq->elts_comp_cd -= nr_txbbs;
-		if (unlikely(txq->elts_comp_cd <= 0)) {
+		if (--txq->elts_comp_cd == 0) {
 			txq->elts_comp_cd = txq->elts_comp_cd_init;
 			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 					       MLX4_WQE_CTRL_CQ_UPDATE);
@@ -753,13 +737,13 @@ struct pv {
 		 * executing as soon as we do).
 		 */
 		rte_io_wmb();
-		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
-					      ((sq->head & sq->txbb_cnt) ?
-						       MLX4_BIT_WQE_OWN : 0));
-		sq->head += nr_txbbs;
+		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		elts_head = elts_head_next;
+		elt_next->wqe = ctrl_next;
+		ctrl = ctrl_next;
+		elt = elt_next;
 	}
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 463df2b..d56e48d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -105,6 +105,7 @@ struct mlx4_rss {
 /** Tx element. */
 struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
+	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
 };
 
 /** Rx queue counters. */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7882a4d..4c7b62a 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -84,6 +84,7 @@
 		assert(elt->buf != NULL);
 		rte_pktmbuf_free(elt->buf);
 		elt->buf = NULL;
+		elt->wqe = NULL;
 		if (++elts_tail == RTE_DIM(*elts))
 			elts_tail = 0;
 	}
@@ -163,20 +164,19 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4_cq *cq = &txq->mcq;
 	struct mlx4dv_qp *dqp = mlxdv->qp.out;
 	struct mlx4dv_cq *dcq = mlxdv->cq.out;
-	uint32_t sq_size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
 
-	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
 	/* Total length, including headroom and spare WQEs. */
-	sq->eob = sq->buf + sq_size;
-	sq->head = 0;
-	sq->tail = 0;
-	sq->txbb_cnt =
-		(dqp->sq.wqe_cnt << dqp->sq.wqe_shift) >> MLX4_TXBB_SHIFT;
-	sq->txbb_cnt_mask = sq->txbb_cnt - 1;
+	sq->size = (uint32_t)dqp->rq.offset - (uint32_t)dqp->sq.offset;
+	sq->buf = (uint8_t *)dqp->buf.buf + dqp->sq.offset;
+	sq->eob = sq->buf + sq->size;
+	uint32_t headroom_size = 2048 + (1 << dqp->sq.wqe_shift);
+	/* Continuous headroom size bytes must always stay freed. */
+	sq->remain_size = sq->size - headroom_size;
+	sq->owner_opcode = MLX4_OPCODE_SEND | (0 << MLX4_SQ_OWNER_BIT);
+	sq->stamp = rte_cpu_to_be_32(MLX4_SQ_STAMP_VAL |
+				     (0 << MLX4_SQ_OWNER_BIT));
 	sq->db = dqp->sdb;
 	sq->doorbell_qpn = dqp->doorbell_qpn;
-	sq->headroom_txbbs =
-		(2048 + (1 << dqp->sq.wqe_shift)) >> MLX4_TXBB_SHIFT;
 	cq->buf = dcq->buf.buf;
 	cq->cqe_cnt = dcq->cqe_cnt;
 	cq->set_ci_db = dcq->set_ci_db;
@@ -362,6 +362,9 @@ struct txq_mp2mr_mbuf_check_data {
 		goto error;
 	}
 	mlx4_txq_fill_dv_obj_info(txq, &mlxdv);
+	/* Save first wqe pointer in the first element. */
+	(&(*txq->elts)[0])->wqe =
+		(volatile struct mlx4_wqe_ctrl_seg *)txq->msq.buf;
 	/* Pre-register known mempools. */
 	rte_mempool_walk(mlx4_txq_mp2mr_iter, txq);
 	DEBUG("%p: adding Tx queue %p to list", (void *)dev, (void *)txq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 6/8] net/mlx4: mitigate Tx send entry size calculations
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
                       ` (4 preceding siblings ...)
  2017-12-06 17:57     ` [PATCH v3 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 7/8] net/mlx4: align Tx descriptors number Matan Azrad
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

The previous code took the send queue entry size for stamping from the
send queue entry pointed to by the completion queue entry; these two
reads were done per packet in the completion stage.

The completion burst size is a fixed value stored in the Tx queue, so
each valid completion entry is known to free exactly that fixed number
of packets.

The descriptors ring holds the send queue entries, so the size of a
whole completion burst can be inferred with a simple calculation,
avoiding per-packet calculations.

Adjust the completion function to free a full completion burst of
packets at once, avoiding per-packet work queue entry reads and
calculations.

Save only the send queue entry pointers marking the start of a
completion burst or of a Tx burst in the appropriate descriptor
element.
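
In other words, assuming a completion is requested every fixed number
of packets, the number of freed descriptors follows from the CQE count
alone; a hypothetical helper:

#include <stdint.h>

/*
 * Number of descriptors freed by the CQEs processed in one poll,
 * given that every CQE accounts for exactly "comp_period" packets.
 */
static inline unsigned int
freed_descriptors(uint32_t new_cons_index, uint32_t old_cons_index,
		  unsigned int comp_period)
{
	return (new_cons_index - old_cons_index) * comp_period;
}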

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 106 +++++++++++++++++++------------------------
 drivers/net/mlx4/mlx4_rxtx.h |   5 +-
 2 files changed, 50 insertions(+), 61 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index ad84c3c..dbed74d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -258,56 +258,48 @@ struct pv {
 };
 
 /**
- * Stamp a WQE so it won't be reused by the HW.
+ * Stamp TXBB burst so it won't be reused by the HW.
  *
  * Routine is used when freeing WQE used by the chip or when failing
  * building an WQ entry has failed leaving partial information on the queue.
  *
  * @param sq
  *   Pointer to the SQ structure.
- * @param[in, out] wqe
- *   Pointer of WQE address to stamp. This value is modified on return to
- *   store the address of the next WQE.
+ * @param start
+ *   Pointer to the first TXBB to stamp.
+ * @param end
+ *   Pointer to the followed end TXBB to stamp.
  *
  * @return
- *   WQE size.
+ *   Stamping burst size in byte units.
  */
 static uint32_t
-mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t **wqe)
+mlx4_txq_stamp_freed_wqe(struct mlx4_sq *sq, volatile uint32_t *start,
+			 volatile uint32_t *end)
 {
 	uint32_t stamp = sq->stamp;
-	volatile uint32_t *next_txbb = *wqe;
-	/* Extract the size from the control segment of the WQE. */
-	uint32_t size = RTE_ALIGN((uint32_t)
-				  ((((volatile struct mlx4_wqe_ctrl_seg *)
-				     next_txbb)->fence_size & 0x3f) << 4),
-				  MLX4_TXBB_SIZE);
-	uint32_t size_cd = size;
+	int32_t size = (intptr_t)end - (intptr_t)start;
 
-	/* Optimize the common case when there is no wrap-around. */
-	if ((uintptr_t)next_txbb + size < (uintptr_t)sq->eob) {
-		/* Stamp the freed descriptor. */
+	assert(start != end);
+	/* Hold SQ ring wrap around. */
+	if (size < 0) {
+		size = (int32_t)sq->size + size;
 		do {
-			*next_txbb = stamp;
-			next_txbb += MLX4_SQ_STAMP_DWORDS;
-			size_cd -= MLX4_TXBB_SIZE;
-		} while (size_cd);
-	} else {
-		/* Stamp the freed descriptor. */
-		do {
-			*next_txbb = stamp;
-			next_txbb += MLX4_SQ_STAMP_DWORDS;
-			if ((volatile uint8_t *)next_txbb >= sq->eob) {
-				next_txbb = (volatile uint32_t *)sq->buf;
-				/* Flip invalid stamping ownership. */
-				stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
-				sq->stamp = stamp;
-			}
-			size_cd -= MLX4_TXBB_SIZE;
-		} while (size_cd);
+			*start = stamp;
+			start += MLX4_SQ_STAMP_DWORDS;
+		} while (start != (volatile uint32_t *)sq->eob);
+		start = (volatile uint32_t *)sq->buf;
+		/* Flip invalid stamping ownership. */
+		stamp ^= RTE_BE32(0x1 << MLX4_SQ_OWNER_BIT);
+		sq->stamp = stamp;
+		if (start == end)
+			return size;
 	}
-	*wqe = next_txbb;
-	return size;
+	do {
+		*start = stamp;
+		start += MLX4_SQ_STAMP_DWORDS;
+	} while (start != end);
+	return (uint32_t)size;
 }
 
 /**
@@ -328,14 +320,10 @@ struct pv {
 	unsigned int elts_tail = txq->elts_tail;
 	struct mlx4_cq *cq = &txq->mcq;
 	volatile struct mlx4_cqe *cqe;
+	uint32_t completed;
 	uint32_t cons_index = cq->cons_index;
-	volatile uint32_t *first_wqe;
-	volatile uint32_t *next_wqe = (volatile uint32_t *)
-			((&(*txq->elts)[elts_tail])->wqe);
-	volatile uint32_t *last_wqe;
-	uint16_t mask = (((uintptr_t)sq->eob - (uintptr_t)sq->buf) >>
-			 MLX4_TXBB_SHIFT) - 1;
-	uint32_t pkts = 0;
+	volatile uint32_t *first_txbb;
+
 	/*
 	 * Traverse over all CQ entries reported and handle each WQ entry
 	 * reported by them.
@@ -361,28 +349,23 @@ struct pv {
 			break;
 		}
 #endif /* NDEBUG */
-		/* Get WQE address buy index from the CQE. */
-		last_wqe = (volatile uint32_t *)((uintptr_t)sq->buf +
-			((rte_be_to_cpu_16(cqe->wqe_index) & mask) <<
-			 MLX4_TXBB_SHIFT));
-		do {
-			/* Free next descriptor. */
-			first_wqe = next_wqe;
-			sq->remain_size +=
-				mlx4_txq_stamp_freed_wqe(sq, &next_wqe);
-			pkts++;
-		} while (first_wqe != last_wqe);
 		cons_index++;
 	} while (1);
-	if (unlikely(pkts == 0))
+	completed = (cons_index - cq->cons_index) * txq->elts_comp_cd_init;
+	if (unlikely(!completed))
 		return;
+	/* First stamping address is the end of the last one. */
+	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
+	elts_tail += completed;
+	if (elts_tail >= elts_n)
+		elts_tail -= elts_n;
+	/* The new tail element holds the end address. */
+	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
+		(&(*txq->elts)[elts_tail])->eocb);
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
-	txq->elts_comp -= pkts;
-	elts_tail += pkts;
-	if (elts_tail >= elts_n)
-		elts_tail -= elts_n;
+	txq->elts_comp -= completed;
 	txq->elts_tail = elts_tail;
 }
 
@@ -617,7 +600,7 @@ struct pv {
 	if (max > pkts_n)
 		max = pkts_n;
 	elt = &(*txq->elts)[elts_head];
-	/* Each element saves its appropriate work queue. */
+	/* First Tx burst element saves the next WQE control segment. */
 	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
@@ -692,6 +675,8 @@ struct pv {
 		 * that no ICRC should be calculated.
 		 */
 		if (--txq->elts_comp_cd == 0) {
+			/* Save the completion burst end address. */
+			elt_next->eocb = (volatile uint32_t *)ctrl_next;
 			txq->elts_comp_cd = txq->elts_comp_cd_init;
 			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
 					       MLX4_WQE_CTRL_CQ_UPDATE);
@@ -741,13 +726,14 @@ struct pv {
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		elts_head = elts_head_next;
-		elt_next->wqe = ctrl_next;
 		ctrl = ctrl_next;
 		elt = elt_next;
 	}
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;
+	/* Save WQE address of the next Tx burst element. */
+	elt->wqe = ctrl;
 	/* Increment send statistics counters. */
 	txq->stats.opackets += i;
 	txq->stats.obytes += bytes_sent;
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index d56e48d..36ae03a 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -105,7 +105,10 @@ struct mlx4_rss {
 /** Tx element. */
 struct txq_elt {
 	struct rte_mbuf *buf; /**< Buffer. */
-	volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
+	union {
+		volatile struct mlx4_wqe_ctrl_seg *wqe; /**< SQ WQE. */
+		volatile uint32_t *eocb; /**< End of completion burst. */
+	};
 };
 
 /** Rx queue counters. */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 7/8] net/mlx4: align Tx descriptors number
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
                       ` (5 preceding siblings ...)
  2017-12-06 17:57     ` [PATCH v3 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-06 17:57     ` [PATCH v3 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
  2017-12-07 10:56     ` [PATCH v3 0/8] improve mlx4 Tx performance Adrien Mazarguil
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

Using a power-of-two number of descriptors makes the ring management
easier and allows a mask operation to replace wraparound conditions.

Adjust the Tx descriptor number to be a power of two and change the
calculations to use the mask accordingly.
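
For illustration only (a generic helper, not the driver API), masking a
free-running index replaces the conditional wraparound:

#include <assert.h>

/*
 * Map a free-running index onto a ring whose size is a power of two.
 * This replaces the "if (++i == n) i = 0;" style wraparound check.
 */
static inline unsigned int
ring_slot(unsigned int index, unsigned int ring_size)
{
	assert((ring_size & (ring_size - 1)) == 0);
	return index & (ring_size - 1);
}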

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 28 +++++++++++++---------------
 drivers/net/mlx4/mlx4_txq.c  | 13 +++++++++----
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index dbed74d..498e56d 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -312,10 +312,14 @@ struct pv {
  *
  * @param txq
  *   Pointer to Tx queue structure.
+ * @param elts_m
+ *   Tx elements number mask.
+ * @param sq
+ *   Pointer to the SQ structure.
  */
 static void
-mlx4_txq_complete(struct txq *txq, const unsigned int elts_n,
-				  struct mlx4_sq *sq)
+mlx4_txq_complete(struct txq *txq, const unsigned int elts_m,
+		  struct mlx4_sq *sq)
 {
 	unsigned int elts_tail = txq->elts_tail;
 	struct mlx4_cq *cq = &txq->mcq;
@@ -355,13 +359,11 @@ struct pv {
 	if (unlikely(!completed))
 		return;
 	/* First stamping address is the end of the last one. */
-	first_txbb = (&(*txq->elts)[elts_tail])->eocb;
+	first_txbb = (&(*txq->elts)[elts_tail & elts_m])->eocb;
 	elts_tail += completed;
-	if (elts_tail >= elts_n)
-		elts_tail -= elts_n;
 	/* The new tail element holds the end address. */
 	sq->remain_size += mlx4_txq_stamp_freed_wqe(sq, first_txbb,
-		(&(*txq->elts)[elts_tail])->eocb);
+		(&(*txq->elts)[elts_tail & elts_m])->eocb);
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
@@ -580,6 +582,7 @@ struct pv {
 	struct txq *txq = (struct txq *)dpdk_txq;
 	unsigned int elts_head = txq->elts_head;
 	const unsigned int elts_n = txq->elts_n;
+	const unsigned int elts_m = elts_n - 1;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
@@ -589,24 +592,20 @@ struct pv {
 
 	assert(txq->elts_comp_cd != 0);
 	if (likely(txq->elts_comp != 0))
-		mlx4_txq_complete(txq, elts_n, sq);
+		mlx4_txq_complete(txq, elts_m, sq);
 	max = (elts_n - (elts_head - txq->elts_tail));
-	if (max > elts_n)
-		max -= elts_n;
 	assert(max >= 1);
 	assert(max <= elts_n);
 	/* Always leave one free entry in the ring. */
 	--max;
 	if (max > pkts_n)
 		max = pkts_n;
-	elt = &(*txq->elts)[elts_head];
+	elt = &(*txq->elts)[elts_head & elts_m];
 	/* First Tx burst element saves the next WQE control segment. */
 	ctrl = elt->wqe;
 	for (i = 0; (i != max); ++i) {
 		struct rte_mbuf *buf = pkts[i];
-		unsigned int elts_head_next =
-			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
-		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
+		struct txq_elt *elt_next = &(*txq->elts)[++elts_head & elts_m];
 		uint32_t owner_opcode = sq->owner_opcode;
 		volatile struct mlx4_wqe_data_seg *dseg =
 				(volatile struct mlx4_wqe_data_seg *)(ctrl + 1);
@@ -725,7 +724,6 @@ struct pv {
 		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
-		elts_head = elts_head_next;
 		ctrl = ctrl_next;
 		elt = elt_next;
 	}
@@ -741,7 +739,7 @@ struct pv {
 	rte_wmb();
 	/* Ring QP doorbell. */
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
-	txq->elts_head = elts_head;
+	txq->elts_head += i;
 	txq->elts_comp += i;
 	return i;
 }
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 4c7b62a..7eb4b04 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -76,17 +76,16 @@
 	unsigned int elts_head = txq->elts_head;
 	unsigned int elts_tail = txq->elts_tail;
 	struct txq_elt (*elts)[txq->elts_n] = txq->elts;
+	unsigned int elts_m = txq->elts_n - 1;
 
 	DEBUG("%p: freeing WRs", (void *)txq);
 	while (elts_tail != elts_head) {
-		struct txq_elt *elt = &(*elts)[elts_tail];
+		struct txq_elt *elt = &(*elts)[elts_tail++ & elts_m];
 
 		assert(elt->buf != NULL);
 		rte_pktmbuf_free(elt->buf);
 		elt->buf = NULL;
 		elt->wqe = NULL;
-		if (++elts_tail == RTE_DIM(*elts))
-			elts_tail = 0;
 	}
 	txq->elts_tail = txq->elts_head;
 }
@@ -208,7 +207,7 @@ struct txq_mp2mr_mbuf_check_data {
 	struct mlx4dv_obj mlxdv;
 	struct mlx4dv_qp dv_qp;
 	struct mlx4dv_cq dv_cq;
-	struct txq_elt (*elts)[desc];
+	struct txq_elt (*elts)[rte_align32pow2(desc)];
 	struct ibv_qp_init_attr qp_init_attr;
 	struct txq *txq;
 	uint8_t *bounce_buf;
@@ -252,6 +251,12 @@ struct txq_mp2mr_mbuf_check_data {
 		ERROR("%p: invalid number of Tx descriptors", (void *)dev);
 		return -rte_errno;
 	}
+	if (desc != RTE_DIM(*elts)) {
+		desc = RTE_DIM(*elts);
+		WARN("%p: increased number of descriptors in Tx queue %u"
+		     " to the next power of two (%u)",
+		     (void *)dev, idx, desc);
+	}
 	/* Allocate and initialize Tx queue. */
 	mlx4_zmallocv_socket("TXQ", vec, RTE_DIM(vec), socket);
 	if (!txq) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 8/8] net/mlx4: remove Tx completion elements counter
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
                       ` (6 preceding siblings ...)
  2017-12-06 17:57     ` [PATCH v3 7/8] net/mlx4: align Tx descriptors number Matan Azrad
@ 2017-12-06 17:57     ` Matan Azrad
  2017-12-07 10:56     ` [PATCH v3 0/8] improve mlx4 Tx performance Adrien Mazarguil
  8 siblings, 0 replies; 47+ messages in thread
From: Matan Azrad @ 2017-12-06 17:57 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev

This counter tracked the descriptor elements waiting to be completed
and was used to decide whether the completion function should be
called.

The same check can be derived from the other element management
variables, which makes this counter and its maintenance unnecessary.

Remove this counter and derive the completion check from the other
element management variables.
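
The pending-descriptor count needed for the check falls out of the
free-running head and tail indexes; a hypothetical sketch:

/*
 * Descriptors still in flight are simply head - tail (both free
 * running), so polling for completions is needed once at least one
 * full completion period is pending; no dedicated counter required.
 */
static inline int
need_completion(unsigned int elts_head, unsigned int elts_tail,
		unsigned int comp_period)
{
	return (elts_head - elts_tail) >= comp_period;
}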

Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 8 +++-----
 drivers/net/mlx4/mlx4_rxtx.h | 1 -
 drivers/net/mlx4/mlx4_txq.c  | 1 -
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 498e56d..86259cf 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -367,7 +367,6 @@ struct pv {
 	/* Update CQ consumer index. */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);
-	txq->elts_comp -= completed;
 	txq->elts_tail = elts_tail;
 }
 
@@ -585,15 +584,15 @@ struct pv {
 	const unsigned int elts_m = elts_n - 1;
 	unsigned int bytes_sent = 0;
 	unsigned int i;
-	unsigned int max;
+	unsigned int max = elts_head - txq->elts_tail;
 	struct mlx4_sq *sq = &txq->msq;
 	volatile struct mlx4_wqe_ctrl_seg *ctrl;
 	struct txq_elt *elt;
 
 	assert(txq->elts_comp_cd != 0);
-	if (likely(txq->elts_comp != 0))
+	if (likely(max >= txq->elts_comp_cd_init))
 		mlx4_txq_complete(txq, elts_m, sq);
-	max = (elts_n - (elts_head - txq->elts_tail));
+	max = elts_n - max;
 	assert(max >= 1);
 	assert(max <= elts_n);
 	/* Always leave one free entry in the ring. */
@@ -740,7 +739,6 @@ struct pv {
 	/* Ring QP doorbell. */
 	rte_write32(txq->msq.doorbell_qpn, txq->msq.db);
 	txq->elts_head += i;
-	txq->elts_comp += i;
 	return i;
 }
 
diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
index 36ae03a..b93e2bc 100644
--- a/drivers/net/mlx4/mlx4_rxtx.h
+++ b/drivers/net/mlx4/mlx4_rxtx.h
@@ -125,7 +125,6 @@ struct txq {
 	struct mlx4_cq mcq; /**< Info for directly manipulating the CQ. */
 	unsigned int elts_head; /**< Current index in (*elts)[]. */
 	unsigned int elts_tail; /**< First element awaiting completion. */
-	unsigned int elts_comp; /**< Number of packets awaiting completion. */
 	int elts_comp_cd; /**< Countdown for next completion. */
 	unsigned int elts_comp_cd_init; /**< Initial value for countdown. */
 	unsigned int elts_n; /**< (*elts)[] length. */
diff --git a/drivers/net/mlx4/mlx4_txq.c b/drivers/net/mlx4/mlx4_txq.c
index 7eb4b04..0c35935 100644
--- a/drivers/net/mlx4/mlx4_txq.c
+++ b/drivers/net/mlx4/mlx4_txq.c
@@ -274,7 +274,6 @@ struct txq_mp2mr_mbuf_check_data {
 		.elts = elts,
 		.elts_head = 0,
 		.elts_tail = 0,
-		.elts_comp = 0,
 		/*
 		 * Request send completion every MLX4_PMD_TX_PER_COMP_REQ
 		 * packets or at least 4 times per ring.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 0/8] improve mlx4 Tx performance
  2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
                       ` (7 preceding siblings ...)
  2017-12-06 17:57     ` [PATCH v3 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
@ 2017-12-07 10:56     ` Adrien Mazarguil
  2017-12-10 10:22       ` Shahaf Shuler
  8 siblings, 1 reply; 47+ messages in thread
From: Adrien Mazarguil @ 2017-12-07 10:56 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev

On Wed, Dec 06, 2017 at 05:57:48PM +0000, Matan Azrad wrote:
> This series improves mlx4 Tx performance and fix and clean some Tx code. 
> 1. 10% MPPS improvement for 1 queue, 1 core, 64B packets, txonly mode.
> 2. 20% MPPS improvement for 1 queue, 1 core, 32B*4(segs) packets, txonly mode.
> 
> V2:
> Add missed function descriptions.
> Accurate descriptions.
> Change Tx descriptor alignment to be like Rx.
> Move mlx4_fill_tx_data_seg to mlx4_rxtx.c and use rte_be32_t for byte count.
> Change remain_size type to uint32_t.
> Poisoning with memset.
> 
> V3:
> Accurate descriptions.
> Fix poisoning from v2.

For the remaining patches in the series:

Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 0/8] improve mlx4 Tx performance
  2017-12-07 10:56     ` [PATCH v3 0/8] improve mlx4 Tx performance Adrien Mazarguil
@ 2017-12-10 10:22       ` Shahaf Shuler
  0 siblings, 0 replies; 47+ messages in thread
From: Shahaf Shuler @ 2017-12-10 10:22 UTC (permalink / raw)
  To: Adrien Mazarguil, Matan Azrad; +Cc: dev

Thursday, December 7, 2017 12:57 PM, Adrien Mazarguil:
> >
> > V2:
> > Add missed function descriptions.
> > Accurate descriptions.
> > Change Tx descriptor alignment to be like Rx.
> > Move mlx4_fill_tx_data_seg to mlx4_rxtx.c and use rte_be32_t for byte
> count.
> > Change remain_size type to uint32_t.
> > Poisoning with memset.
> >
> > V3:
> > Accurate descriptions.
> > Fix poisoning from v2.
> 
> For the remaining patches in the series:
> 
> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>

Applied to next-net-mlx, thanks.

> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2017-12-10 10:22 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-28 12:19 [PATCH 0/8] improve mlx4 Tx performance Matan Azrad
2017-11-28 12:19 ` [PATCH 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
2017-12-06 10:57   ` Adrien Mazarguil
2017-11-28 12:19 ` [PATCH 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
2017-12-06 10:57   ` Adrien Mazarguil
2017-11-28 12:19 ` [PATCH 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
2017-12-06 10:58   ` Adrien Mazarguil
2017-11-28 12:19 ` [PATCH 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
2017-12-06 10:58   ` Adrien Mazarguil
2017-12-06 11:29     ` Matan Azrad
2017-12-06 11:55       ` Adrien Mazarguil
2017-11-28 12:19 ` [PATCH 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
2017-12-06 10:58   ` Adrien Mazarguil
2017-12-06 11:43     ` Matan Azrad
2017-12-06 12:09       ` Adrien Mazarguil
2017-11-28 12:19 ` [PATCH 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
2017-12-06 10:59   ` Adrien Mazarguil
2017-11-28 12:19 ` [PATCH 7/8] net/mlx4: align Tx descriptors number Matan Azrad
2017-12-06 10:59   ` Adrien Mazarguil
2017-12-06 11:44     ` Matan Azrad
2017-11-28 12:19 ` [PATCH 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
2017-12-06 10:59   ` Adrien Mazarguil
2017-12-06 14:48 ` [PATCH v2 0/8] improve mlx4 Tx performance Matan Azrad
2017-12-06 14:48   ` [PATCH v2 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
2017-12-06 14:48   ` [PATCH v2 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
2017-12-06 14:48   ` [PATCH v2 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
2017-12-06 14:48   ` [PATCH v2 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
2017-12-06 16:22     ` Adrien Mazarguil
2017-12-06 14:48   ` [PATCH v2 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
2017-12-06 16:22     ` Adrien Mazarguil
2017-12-06 14:48   ` [PATCH v2 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
2017-12-06 14:48   ` [PATCH v2 7/8] net/mlx4: align Tx descriptors number Matan Azrad
2017-12-06 16:22     ` Adrien Mazarguil
2017-12-06 17:24       ` Matan Azrad
2017-12-06 14:48   ` [PATCH v2 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
2017-12-06 16:22     ` Adrien Mazarguil
2017-12-06 17:57   ` [PATCH v3 0/8] improve mlx4 Tx performance Matan Azrad
2017-12-06 17:57     ` [PATCH v3 1/8] net/mlx4: fix Tx packet drop application report Matan Azrad
2017-12-06 17:57     ` [PATCH v3 2/8] net/mlx4: remove unnecessary Tx wraparound checks Matan Azrad
2017-12-06 17:57     ` [PATCH v3 3/8] net/mlx4: remove restamping from Tx error path Matan Azrad
2017-12-06 17:57     ` [PATCH v3 4/8] net/mlx4: optimize Tx multi-segment case Matan Azrad
2017-12-06 17:57     ` [PATCH v3 5/8] net/mlx4: merge Tx queue rings management Matan Azrad
2017-12-06 17:57     ` [PATCH v3 6/8] net/mlx4: mitigate Tx send entry size calculations Matan Azrad
2017-12-06 17:57     ` [PATCH v3 7/8] net/mlx4: align Tx descriptors number Matan Azrad
2017-12-06 17:57     ` [PATCH v3 8/8] net/mlx4: remove Tx completion elements counter Matan Azrad
2017-12-07 10:56     ` [PATCH v3 0/8] improve mlx4 Tx performance Adrien Mazarguil
2017-12-10 10:22       ` Shahaf Shuler
