* [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support
@ 2018-03-10  1:25 Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
                   ` (12 more replies)
  0 siblings, 13 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Multi-Packet Rx Queue (MPRQ, a.k.a. Striding RQ) can further save PCIe
bandwidth by posting a single large buffer for multiple packets. Instead of
posting one buffer per packet, a single large buffer is posted to receive
multiple packets. An MPRQ buffer consists of multiple fixed-size strides,
and each stride receives one packet.

An Rx packet is either memcpy'd into a user-provided mbuf if its length is
comparatively small, or referenced by mbuf indirection otherwise. In the
indirection case, the mempool for the direct mbufs is allocated and managed
by the PMD.

To allow mbuf indirection to each packet in the buffer, a buf_off field is
added to the rte_mbuf structure and rte_pktmbuf_attach_at() is introduced.

Yongseok Koh (6):
  mbuf: add buffer offset field for flexible indirection
  net/mlx5: separate filling Rx flags
  net/mlx5: add a function to rdma-core glue
  net/mlx5: add Multi-Packet Rx support
  net/mlx5: release Tx queue resource earlier than Rx
  app/testpmd: conserve mbuf indirection flag

 app/test-pmd/csumonly.c          |   2 +
 app/test-pmd/macfwd.c            |   2 +
 app/test-pmd/macswap.c           |   2 +
 doc/guides/nics/mlx5.rst         |  23 +++
 drivers/net/mlx5/Makefile        |   5 +
 drivers/net/mlx5/mlx5.c          |  79 +++++++-
 drivers/net/mlx5/mlx5.h          |   3 +
 drivers/net/mlx5/mlx5_defs.h     |  20 ++
 drivers/net/mlx5/mlx5_ethdev.c   |   3 +
 drivers/net/mlx5/mlx5_glue.c     |   9 +
 drivers/net/mlx5/mlx5_glue.h     |   4 +
 drivers/net/mlx5/mlx5_prm.h      |  15 ++
 drivers/net/mlx5/mlx5_rxq.c      | 389 +++++++++++++++++++++++++++++++++++----
 drivers/net/mlx5/mlx5_rxtx.c     | 236 ++++++++++++++++++++----
 drivers/net/mlx5/mlx5_rxtx.h     |  16 +-
 drivers/net/mlx5/mlx5_rxtx_vec.c |   4 +
 drivers/net/mlx5/mlx5_rxtx_vec.h |   3 +-
 lib/librte_mbuf/rte_mbuf.h       | 158 +++++++++++++++-
 18 files changed, 895 insertions(+), 78 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v1 1/6] mbuf: add buffer offset field for flexible indirection
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
@ 2018-03-10  1:25 ` Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

When attaching an mbuf, the indirect mbuf has to point to the start of the
direct mbuf's buffer. Adding a buf_off field to rte_mbuf makes this more
flexible: an indirect mbuf can point to any part of a direct mbuf by
calling rte_pktmbuf_attach_at().

Possible use cases include:
- If a packet has multiple layers of encapsulation, multiple indirect
  buffers can reference different layers of the encapsulated packet.
- A large direct mbuf can even contain multiple packets in series and
  each packet can be referenced by multiple mbuf indirections.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_mbuf/rte_mbuf.h | 158 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 157 insertions(+), 1 deletion(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 62740254d..053db32d0 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -559,6 +559,11 @@ struct rte_mbuf {
 		};
 	};
 
+	/** Buffer offset into the direct mbuf if attached. An indirect
+	 * mbuf can point to any part of the direct mbuf's buffer.
+	 */
+	uint16_t buf_off;
+
 	/** Size of the application private data. In case of an indirect
 	 * mbuf, it stores the direct mbuf private data size. */
 	uint16_t priv_size;
@@ -671,7 +676,9 @@ rte_mbuf_data_dma_addr_default(const struct rte_mbuf *mb)
 static inline struct rte_mbuf *
 rte_mbuf_from_indirect(struct rte_mbuf *mi)
 {
-	return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size);
+	return (struct rte_mbuf *)
+		RTE_PTR_SUB(mi->buf_addr,
+				sizeof(*mi) + mi->priv_size + mi->buf_off);
 }
 
 /**
@@ -1281,6 +1288,98 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Adjust the tailroom of an indirect mbuf. If len is positive, enlarge
+ * the tailroom of the mbuf. If negative, shrink the tailroom.
+ *
+ * If length is out of range, then the function will fail and return -1,
+ * without modifying the indirect mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param len
+ *   The amount of length to adjust (in bytes).
+ * @return
+ *   - 0: On success.
+ *   - -1: On error.
+ */
+static inline int rte_pktmbuf_adj_indirect_tail(struct rte_mbuf *mi, int len)
+{
+	struct rte_mbuf *md;
+	uint16_t tailroom;
+	int delta;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(mi));
+
+	md = rte_mbuf_from_indirect(mi);
+	if (unlikely(mi->buf_len + len <= 0 ||
+			mi->buf_off + mi->buf_len + len >= md->buf_len))
+		return -1;
+
+	mi->buf_len += len;
+
+	tailroom = mi->buf_len - mi->data_off - mi->data_len;
+	delta = tailroom + len;
+	if (delta > 0) {
+		/* Adjust tailroom */
+		delta = 0;
+	} else if (delta + mi->data_len < 0) {
+		/* No data */
+		mi->data_off += delta + mi->data_len;
+		delta = mi->data_len;
+	}
+	mi->data_len += delta;
+	mi->pkt_len += delta;
+	return 0;
+}
+
+/**
+ * Shift the buffer reference of an indirect mbuf. If off is positive,
+ * push the reference forward; if negative, pull it back.
+ *
+ * Returns a pointer to the start address of the new data area. If offset
+ * is out of range, then the function will fail and return NULL, without
+ * modifying the indirect mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param off
+ *   The amount of offset to adjust (in bytes).
+ * @return
+ *   A pointer to the new start of the data.
+ */
+static inline char *rte_pktmbuf_adj_indirect_head(struct rte_mbuf *mi, int off)
+{
+	int delta;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(mi));
+
+	if (unlikely(off >= mi->buf_len || mi->buf_off + off < 0))
+		return NULL;
+
+	mi->buf_iova += off;
+	mi->buf_addr = (char *)mi->buf_addr + off;
+	mi->buf_len -= off;
+	mi->buf_off += off;
+
+	delta = off - mi->data_off;
+	if (delta < 0) {
+		/* Adjust headroom */
+		mi->data_off -= off;
+		delta = 0;
+	} else if (delta < mi->data_len) {
+		/* No headroom */
+		mi->data_off = 0;
+	} else {
+		/* No data */
+		mi->data_off = 0;
+		delta = mi->data_len;
+	}
+	mi->data_len -= delta;
+	mi->pkt_len -= delta;
+	return (char *)mi->buf_addr + mi->data_off;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
@@ -1315,6 +1414,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
+	mi->buf_off = 0;
 
 	mi->data_off = m->data_off;
 	mi->data_len = m->data_len;
@@ -1336,6 +1436,62 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
+ * Attach a packet mbuf to another packet mbuf at a given offset.
+ *
+ * After attachment we refer the mbuf we attached as 'indirect',
+ * while mbuf we attached to as 'direct'.
+ *
+ * The indirect mbuf can reference any part of the direct mbuf's
+ * buffer at the given offset, and the indirect mbuf is trimmed to
+ * the given buffer length.
+ *
+ * As a result, if a direct mbuf has multiple layers of encapsulation,
+ * multiple indirect buffers can reference different layers of the packet.
+ * Or, a large direct mbuf can even contain multiple packets in series and
+ * each packet can be referenced by multiple mbuf indirections.
+ *
+ * Returns a pointer to the start address of the new data area. If offset
+ * or buffer length is out of range, then the function will fail and return
+ * NULL, without attaching the mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param m
+ *   The packet mbuf we're attaching to.
+ * @param off
+ *   The amount of offset to push (in bytes).
+ * @param buf_len
+ *   The buffer length of the indirect mbuf (in bytes).
+ * @return
+ *   A pointer to the new start of the data.
+ */
+static inline char *rte_pktmbuf_attach_at(struct rte_mbuf *mi,
+	struct rte_mbuf *m, uint16_t off, uint16_t buf_len)
+{
+	struct rte_mbuf *md;
+	char *ret;
+
+	if (RTE_MBUF_DIRECT(m))
+		md = m;
+	else
+		md = rte_mbuf_from_indirect(m);
+
+	if (off + buf_len > md->buf_len)
+		return NULL;
+
+	rte_pktmbuf_attach(mi, m);
+
+	/* Push reference of indirect mbuf */
+	ret = rte_pktmbuf_adj_indirect_head(mi, off);
+	RTE_ASSERT(ret != NULL);
+
+	/* Trim reference of indirect mbuf */
+	rte_pktmbuf_adj_indirect_tail(mi, off + buf_len - md->buf_len);
+
+	return ret;
+}
+
+/**
  * Detach an indirect packet mbuf.
  *
  *  - restore original mbuf address and length values.
-- 
2.11.0


* [PATCH v1 2/6] net/mlx5: separate filling Rx flags
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
@ 2018-03-10  1:25 ` Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Filling in mbuf fields is moved into a separate inline function so that it
can be reused.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx.c | 84 +++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 33 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 049f7e6c1..36eeefb49 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -43,6 +43,10 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 static __rte_always_inline uint32_t
 rxq_cq_to_ol_flags(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe);
 
+static __rte_always_inline void
+rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
+	       volatile struct mlx5_cqe *cqe, uint32_t rss_hash_res);
+
 uint32_t mlx5_ptype_table[] __rte_cache_aligned = {
 	[0xff] = RTE_PTYPE_ALL_MASK, /* Last entry for errored packet. */
 };
@@ -1721,6 +1725,52 @@ rxq_cq_to_ol_flags(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe)
 }
 
 /**
+ * Fill in mbuf fields from RX completion flags.
+ * Note that pkt->ol_flags should be initialized outside of this function.
+ *
+ * @param rxq
+ *   Pointer to RX queue.
+ * @param pkt
+ *   mbuf to fill.
+ * @param cqe
+ *   CQE to process.
+ * @param rss_hash_res
+ *   Packet RSS Hash result.
+ */
+static inline void
+rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
+	       volatile struct mlx5_cqe *cqe, uint32_t rss_hash_res)
+{
+	/* Update packet information. */
+	pkt->packet_type = rxq_cq_to_pkt_type(cqe);
+	if (rss_hash_res && rxq->rss_hash) {
+		pkt->hash.rss = rss_hash_res;
+		pkt->ol_flags |= PKT_RX_RSS_HASH;
+	}
+	if (rxq->mark && MLX5_FLOW_MARK_IS_VALID(cqe->sop_drop_qpn)) {
+		pkt->ol_flags |= PKT_RX_FDIR;
+		if (cqe->sop_drop_qpn !=
+		    rte_cpu_to_be_32(MLX5_FLOW_MARK_DEFAULT)) {
+			uint32_t mark = cqe->sop_drop_qpn;
+
+			pkt->ol_flags |= PKT_RX_FDIR_ID;
+			pkt->hash.fdir.hi = mlx5_flow_mark_get(mark);
+		}
+	}
+	if (rxq->csum | rxq->csum_l2tun)
+		pkt->ol_flags |= rxq_cq_to_ol_flags(rxq, cqe);
+	if (rxq->vlan_strip &&
+	    (cqe->hdr_type_etc & rte_cpu_to_be_16(MLX5_CQE_VLAN_STRIPPED))) {
+		pkt->ol_flags |= PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED;
+		pkt->vlan_tci = rte_be_to_cpu_16(cqe->vlan_info);
+	}
+	if (rxq->hw_timestamp) {
+		pkt->timestamp = rte_be_to_cpu_64(cqe->timestamp);
+		pkt->ol_flags |= PKT_RX_TIMESTAMP;
+	}
+}
+
+/**
  * DPDK callback for RX.
  *
  * @param dpdk_rxq
@@ -1796,40 +1846,8 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			}
 			pkt = seg;
 			assert(len >= (rxq->crc_present << 2));
-			/* Update packet information. */
-			pkt->packet_type = rxq_cq_to_pkt_type(cqe);
 			pkt->ol_flags = 0;
-			if (rss_hash_res && rxq->rss_hash) {
-				pkt->hash.rss = rss_hash_res;
-				pkt->ol_flags = PKT_RX_RSS_HASH;
-			}
-			if (rxq->mark &&
-			    MLX5_FLOW_MARK_IS_VALID(cqe->sop_drop_qpn)) {
-				pkt->ol_flags |= PKT_RX_FDIR;
-				if (cqe->sop_drop_qpn !=
-				    rte_cpu_to_be_32(MLX5_FLOW_MARK_DEFAULT)) {
-					uint32_t mark = cqe->sop_drop_qpn;
-
-					pkt->ol_flags |= PKT_RX_FDIR_ID;
-					pkt->hash.fdir.hi =
-						mlx5_flow_mark_get(mark);
-				}
-			}
-			if (rxq->csum | rxq->csum_l2tun)
-				pkt->ol_flags |= rxq_cq_to_ol_flags(rxq, cqe);
-			if (rxq->vlan_strip &&
-			    (cqe->hdr_type_etc &
-			     rte_cpu_to_be_16(MLX5_CQE_VLAN_STRIPPED))) {
-				pkt->ol_flags |= PKT_RX_VLAN |
-					PKT_RX_VLAN_STRIPPED;
-				pkt->vlan_tci =
-					rte_be_to_cpu_16(cqe->vlan_info);
-			}
-			if (rxq->hw_timestamp) {
-				pkt->timestamp =
-					rte_be_to_cpu_64(cqe->timestamp);
-				pkt->ol_flags |= PKT_RX_TIMESTAMP;
-			}
+			rxq_cq_to_mbuf(rxq, pkt, cqe, rss_hash_res);
 			if (rxq->crc_present)
 				len -= ETHER_CRC_LEN;
 			PKT_LEN(pkt) = len;
-- 
2.11.0


* [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
@ 2018-03-10  1:25 ` Yongseok Koh
  2018-03-12  9:13   ` Nélio Laranjeiro
  2018-03-10  1:25 ` [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

A glue wrapper for mlx5dv_create_wq() is added to the rdma-core glue layer.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_glue.c | 9 +++++++++
 drivers/net/mlx5/mlx5_glue.h | 4 ++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_glue.c b/drivers/net/mlx5/mlx5_glue.c
index 1c4396ada..e33fc76b5 100644
--- a/drivers/net/mlx5/mlx5_glue.c
+++ b/drivers/net/mlx5/mlx5_glue.c
@@ -287,6 +287,14 @@ mlx5_glue_dv_create_cq(struct ibv_context *context,
 	return mlx5dv_create_cq(context, cq_attr, mlx5_cq_attr);
 }
 
+static struct ibv_wq *
+mlx5_glue_dv_create_wq(struct ibv_context *context,
+		       struct ibv_wq_init_attr *wq_attr,
+		       struct mlx5dv_wq_init_attr *mlx5_wq_attr)
+{
+	return mlx5dv_create_wq(context, wq_attr, mlx5_wq_attr);
+}
+
 static int
 mlx5_glue_dv_query_device(struct ibv_context *ctx,
 			  struct mlx5dv_context *attrs_out)
@@ -347,6 +355,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue){
 	.port_state_str = mlx5_glue_port_state_str,
 	.cq_ex_to_cq = mlx5_glue_cq_ex_to_cq,
 	.dv_create_cq = mlx5_glue_dv_create_cq,
+	.dv_create_wq = mlx5_glue_dv_create_wq,
 	.dv_query_device = mlx5_glue_dv_query_device,
 	.dv_set_context_attr = mlx5_glue_dv_set_context_attr,
 	.dv_init_obj = mlx5_glue_dv_init_obj,
diff --git a/drivers/net/mlx5/mlx5_glue.h b/drivers/net/mlx5/mlx5_glue.h
index b5efee3b6..21a713961 100644
--- a/drivers/net/mlx5/mlx5_glue.h
+++ b/drivers/net/mlx5/mlx5_glue.h
@@ -100,6 +100,10 @@ struct mlx5_glue {
 		(struct ibv_context *context,
 		 struct ibv_cq_init_attr_ex *cq_attr,
 		 struct mlx5dv_cq_init_attr *mlx5_cq_attr);
+	struct ibv_wq *(*dv_create_wq)
+		(struct ibv_context *context,
+		 struct ibv_wq_init_attr *wq_attr,
+		 struct mlx5dv_wq_init_attr *mlx5_wq_attr);
 	int (*dv_query_device)(struct ibv_context *ctx_in,
 			       struct mlx5dv_context *attrs_out);
 	int (*dv_set_context_attr)(struct ibv_context *ibv_ctx,
-- 
2.11.0


* [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (2 preceding siblings ...)
  2018-03-10  1:25 ` [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
@ 2018-03-10  1:25 ` Yongseok Koh
  2018-03-12  9:20   ` Nélio Laranjeiro
  2018-03-10  1:25 ` [PATCH v1 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Multi-Packet Rx Queue (MPRQ, a.k.a. Striding RQ) can further save PCIe
bandwidth by posting a single large buffer for multiple packets. Instead of
posting one buffer per packet, a single large buffer is posted to receive
multiple packets. An MPRQ buffer consists of multiple fixed-size strides,
and each stride receives one packet.

An Rx packet is either memcpy'd into a user-provided mbuf if its length is
comparatively small, or referenced by mbuf indirection otherwise. In the
indirection case, the mempool for the direct mbufs is allocated and managed
by the PMD.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 doc/guides/nics/mlx5.rst         |  23 +++
 drivers/net/mlx5/Makefile        |   5 +
 drivers/net/mlx5/mlx5.c          |  63 +++++++
 drivers/net/mlx5/mlx5.h          |   3 +
 drivers/net/mlx5/mlx5_defs.h     |  20 ++
 drivers/net/mlx5/mlx5_ethdev.c   |   3 +
 drivers/net/mlx5/mlx5_prm.h      |  15 ++
 drivers/net/mlx5/mlx5_rxq.c      | 389 +++++++++++++++++++++++++++++++++++----
 drivers/net/mlx5/mlx5_rxtx.c     | 152 ++++++++++++++-
 drivers/net/mlx5/mlx5_rxtx.h     |  16 +-
 drivers/net/mlx5/mlx5_rxtx_vec.c |   4 +
 drivers/net/mlx5/mlx5_rxtx_vec.h |   3 +-
 12 files changed, 660 insertions(+), 36 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 0e6e525c9..1600bfa7b 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -253,6 +253,29 @@ Run-time configuration
   - x86_64 with ConnectX-4, ConnectX-4 LX and ConnectX-5.
   - POWER8 and ARMv8 with ConnectX-4 LX and ConnectX-5.
 
+- ``mprq_en`` parameter [int]
+
+  A nonzero value enables configuring Multi-Packet Rx queues. An Rx queue is
+  configured as Multi-Packet RQ if the total number of Rx queues is
+  ``rxqs_min_mprq`` or more and Rx scatter isn't configured. Disabled by default.
+
+  Multi-Packet Rx Queue (MPRQ, a.k.a. Striding RQ) can further save PCIe bandwidth
+  by posting a single large buffer for multiple packets. Instead of posting one
+  buffer per packet, a single large buffer is posted to receive multiple
+  packets on the buffer. An MPRQ buffer consists of multiple fixed-size strides
+  and each stride receives one packet.
+
+- ``mprq_max_memcpy_len`` parameter [int]
+  The maximum packet size for memcpy in case of Multi-Packet Rx queue. An Rx
+  packet is memcpy'd to a user-provided mbuf if its size is less than or equal
+  to this parameter. Otherwise, the packet is referenced by mbuf indirection.
+  In case of indirection, the mempool for the direct mbufs is allocated and
+  managed by the PMD. The default value is 128.
+
+- ``rxqs_min_mprq`` parameter [int]
+  Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
+  greater than or equal to this value. The default value is 12.
+
 - ``txq_inline`` parameter [int]
 
   Amount of data to be inlined during TX operations. Improves latency.
diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
index afda4118f..e5e276a71 100644
--- a/drivers/net/mlx5/Makefile
+++ b/drivers/net/mlx5/Makefile
@@ -125,6 +125,11 @@ mlx5_autoconf.h.new: FORCE
 mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 	$Q $(RM) -f -- '$@'
 	$Q sh -- '$<' '$@' \
+		HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT \
+		infiniband/mlx5dv.h \
+		enum MLX5DV_CONTEXT_MASK_STRIDING_RQ \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_IBV_DEVICE_TUNNEL_SUPPORT \
 		infiniband/mlx5dv.h \
 		enum MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS \
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 61cb93101..25c0b5b1f 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -44,6 +44,18 @@
 /* Device parameter to enable RX completion queue compression. */
 #define MLX5_RXQ_CQE_COMP_EN "rxq_cqe_comp_en"
 
+/* Device parameter to enable Multi-Packet Rx queue. */
+#define MLX5_RX_MPRQ_EN "mprq_en"
+
+/* Device parameter to limit the size of memcpy'd packet. */
+#define MLX5_RX_MPRQ_MAX_MEMCPY_LEN "mprq_max_memcpy_len"
+
+/*
+ * Device parameter to set the minimum number of Rx queues to configure
+ * Multi-Packet Rx queue.
+ */
+#define MLX5_RXQS_MIN_MPRQ "rxqs_min_mprq"
+
 /* Device parameter to configure inline send. */
 #define MLX5_TXQ_INLINE "txq_inline"
 
@@ -383,6 +395,12 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 	}
 	if (strcmp(MLX5_RXQ_CQE_COMP_EN, key) == 0) {
 		config->cqe_comp = !!tmp;
+	} else if (strcmp(MLX5_RX_MPRQ_EN, key) == 0) {
+		config->mprq = !!tmp;
+	} else if (strcmp(MLX5_RX_MPRQ_MAX_MEMCPY_LEN, key) == 0) {
+		config->mprq_max_memcpy_len = tmp;
+	} else if (strcmp(MLX5_RXQS_MIN_MPRQ, key) == 0) {
+		config->rxqs_mprq = tmp;
 	} else if (strcmp(MLX5_TXQ_INLINE, key) == 0) {
 		config->txq_inline = tmp;
 	} else if (strcmp(MLX5_TXQS_MIN_INLINE, key) == 0) {
@@ -420,6 +438,9 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 {
 	const char **params = (const char *[]){
 		MLX5_RXQ_CQE_COMP_EN,
+		MLX5_RX_MPRQ_EN,
+		MLX5_RX_MPRQ_MAX_MEMCPY_LEN,
+		MLX5_RXQS_MIN_MPRQ,
 		MLX5_TXQ_INLINE,
 		MLX5_TXQS_MIN_INLINE,
 		MLX5_TXQ_MPW_EN,
@@ -582,6 +603,7 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 	unsigned int mps;
 	unsigned int cqe_comp;
 	unsigned int tunnel_en = 0;
+	unsigned int mprq = 0;
 	int idx;
 	int i;
 	struct mlx5dv_context attrs_out = {0};
@@ -664,6 +686,9 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 #ifdef HAVE_IBV_DEVICE_TUNNEL_SUPPORT
 	attrs_out.comp_mask |= MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS;
 #endif
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	attrs_out.comp_mask |= MLX5DV_CONTEXT_MASK_STRIDING_RQ;
+#endif
 	mlx5_glue->dv_query_device(attr_ctx, &attrs_out);
 	if (attrs_out.flags & MLX5DV_CONTEXT_FLAGS_MPW_ALLOWED) {
 		if (attrs_out.flags & MLX5DV_CONTEXT_FLAGS_ENHANCED_MPW) {
@@ -677,6 +702,37 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 		DEBUG("MPW isn't supported");
 		mps = MLX5_MPW_DISABLED;
 	}
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	if (attrs_out.comp_mask & MLX5DV_CONTEXT_MASK_STRIDING_RQ) {
+		struct mlx5dv_striding_rq_caps mprq_caps =
+			attrs_out.striding_rq_caps;
+
+		DEBUG("\tmin_single_stride_log_num_of_bytes: %d",
+		      mprq_caps.min_single_stride_log_num_of_bytes);
+		DEBUG("\tmax_single_stride_log_num_of_bytes: %d",
+		      mprq_caps.max_single_stride_log_num_of_bytes);
+		DEBUG("\tmin_single_wqe_log_num_of_strides: %d",
+		      mprq_caps.min_single_wqe_log_num_of_strides);
+		DEBUG("\tmax_single_wqe_log_num_of_strides: %d",
+		      mprq_caps.max_single_wqe_log_num_of_strides);
+		DEBUG("\tsupported_qpts: %d",
+		      mprq_caps.supported_qpts);
+		if (mprq_caps.min_single_stride_log_num_of_bytes <=
+		    MLX5_MPRQ_MIN_STRIDE_SZ_N &&
+		    mprq_caps.max_single_stride_log_num_of_bytes >=
+		    MLX5_MPRQ_STRIDE_SZ_N &&
+		    mprq_caps.min_single_wqe_log_num_of_strides <=
+		    MLX5_MPRQ_MIN_STRIDE_NUM_N &&
+		    mprq_caps.max_single_wqe_log_num_of_strides >=
+		    MLX5_MPRQ_STRIDE_NUM_N) {
+			DEBUG("Multi-Packet RQ is supported");
+			mprq = 1;
+		} else {
+			DEBUG("Multi-Packet RQ isn't supported");
+			mprq = 0;
+		}
+	}
+#endif
 	if (RTE_CACHE_LINE_SIZE == 128 &&
 	    !(attrs_out.flags & MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP))
 		cqe_comp = 0;
@@ -721,6 +777,9 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 			.txq_inline = MLX5_ARG_UNSET,
 			.txqs_inline = MLX5_ARG_UNSET,
 			.inline_max_packet_sz = MLX5_ARG_UNSET,
+			.mprq = 0, /* Disable by default. */
+			.mprq_max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN,
+			.rxqs_mprq = MLX5_MPRQ_MIN_RXQS,
 		};
 
 		len = snprintf(name, sizeof(name), PCI_PRI_FMT,
@@ -891,6 +950,10 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 			WARN("Rx CQE compression isn't supported");
 			config.cqe_comp = 0;
 		}
+		if (config.mprq && !mprq) {
+			WARN("Multi-Packet RQ isn't supported");
+			config.mprq = 0;
+		}
 		err = priv_uar_init_primary(priv);
 		if (err)
 			goto port_error;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 9ad0533fc..42632a7e5 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -88,6 +88,9 @@ struct mlx5_dev_config {
 	unsigned int tx_vec_en:1; /* Tx vector is enabled. */
 	unsigned int rx_vec_en:1; /* Rx vector is enabled. */
 	unsigned int mpw_hdr_dseg:1; /* Enable DSEGs in the title WQEBB. */
+	unsigned int mprq:1; /* Whether Multi-Packet RQ is supported. */
+	unsigned int mprq_max_memcpy_len; /* Maximum packet size to memcpy. */
+	unsigned int rxqs_mprq; /* Queue count threshold for Multi-Packet RQ. */
 	unsigned int tso_max_payload_sz; /* Maximum TCP payload for TSO. */
 	unsigned int ind_table_max_size; /* Maximum indirection table size. */
 	int txq_inline; /* Maximum packet size for inlining. */
diff --git a/drivers/net/mlx5/mlx5_defs.h b/drivers/net/mlx5/mlx5_defs.h
index c3334ca30..39cc1344a 100644
--- a/drivers/net/mlx5/mlx5_defs.h
+++ b/drivers/net/mlx5/mlx5_defs.h
@@ -95,4 +95,24 @@
  */
 #define MLX5_UAR_OFFSET (1ULL << 32)
 
+/* Log 2 of the size of a stride for Multi-Packet RQ. */
+#define MLX5_MPRQ_STRIDE_SZ_N 11
+#define MLX5_MPRQ_MIN_STRIDE_SZ_N 6
+
+/* Log 2 of the number of strides per WQE for Multi-Packet RQ. */
+#define MLX5_MPRQ_STRIDE_NUM_N 4
+#define MLX5_MPRQ_MIN_STRIDE_NUM_N 3
+
+/* Two-byte shift is disabled for Multi-Packet RQ. */
+#define MLX5_MPRQ_TWO_BYTE_SHIFT 0
+
+/* Default maximum packet size to memcpy instead of using indirection. */
+#define MLX5_MPRQ_MEMCPY_DEFAULT_LEN 128
+
+/* Minimum number of Rx queues to enable Multi-Packet RQ. */
+#define MLX5_MPRQ_MIN_RXQS 12
+
+/* Cache size of mempool for Multi-Packet RQ. */
+#define MLX5_MPRQ_MP_CACHE_SZ 16
+
 #endif /* RTE_PMD_MLX5_DEFS_H_ */
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index b73cb53df..2729c3b62 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -494,6 +494,7 @@ mlx5_dev_supported_ptypes_get(struct rte_eth_dev *dev)
 	};
 
 	if (dev->rx_pkt_burst == mlx5_rx_burst ||
+	    dev->rx_pkt_burst == mlx5_rx_burst_mprq ||
 	    dev->rx_pkt_burst == mlx5_rx_burst_vec)
 		return ptypes;
 	return NULL;
@@ -1316,6 +1317,8 @@ priv_select_rx_function(struct priv *priv, __rte_unused struct rte_eth_dev *dev)
 	if (priv_check_vec_rx_support(priv) > 0) {
 		rx_pkt_burst = mlx5_rx_burst_vec;
 		DEBUG("selected RX vectorized function");
+	} else if (priv_mprq_enabled(priv)) {
+		rx_pkt_burst = mlx5_rx_burst_mprq;
 	}
 	return rx_pkt_burst;
 }
diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
index 9eb9c15e1..b7ad3454e 100644
--- a/drivers/net/mlx5/mlx5_prm.h
+++ b/drivers/net/mlx5/mlx5_prm.h
@@ -195,6 +195,21 @@ struct mlx5_mpw {
 	} data;
 };
 
+/* WQE for Multi-Packet RQ. */
+struct mlx5_wqe_mprq {
+	struct mlx5_wqe_srq_next_seg next_seg;
+	struct mlx5_wqe_data_seg dseg;
+};
+
+#define MLX5_MPRQ_LEN_MASK 0x0000ffff
+#define MLX5_MPRQ_LEN_SHIFT 0
+#define MLX5_MPRQ_STRIDE_NUM_MASK 0x7fff0000
+#define MLX5_MPRQ_STRIDE_NUM_SHIFT 16
+#define MLX5_MPRQ_FILLER_MASK 0x80000000
+#define MLX5_MPRQ_FILLER_SHIFT 31
+
+#define MLX5_MPRQ_STRIDE_SHIFT_BYTE 2
+
 /* CQ element structure - should be equal to the cache line size */
 struct mlx5_cqe {
 #if (RTE_CACHE_LINE_SIZE == 128)
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 238fa7e56..8fa56a53a 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -55,7 +55,73 @@ uint8_t rss_hash_default_key[] = {
 const size_t rss_hash_default_key_len = sizeof(rss_hash_default_key);
 
 /**
- * Allocate RX queue elements.
+ * Check whether Multi-Packet RQ can be enabled for the device.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ *
+ * @return
+ *   1 if supported, negative errno value if not.
+ */
+inline int
+priv_check_mprq_support(struct priv *priv)
+{
+	if (priv->config.mprq && priv->rxqs_n >= priv->config.rxqs_mprq)
+		return 1;
+	return -ENOTSUP;
+}
+
+/**
+ * Check whether Multi-Packet RQ is enabled for the Rx queue.
+ *
+ * @param rxq
+ *   Pointer to receive queue structure.
+ *
+ * @return
+ *   0 if disabled, otherwise enabled.
+ */
+static inline int
+rxq_mprq_enabled(struct mlx5_rxq_data *rxq)
+{
+	return rxq->mprq_mp != NULL;
+}
+
+/**
+ * Check whether Multi-Packet RQ is enabled for the device.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ *
+ * @return
+ *   0 if disabled, otherwise enabled.
+ */
+inline int
+priv_mprq_enabled(struct priv *priv)
+{
+	uint16_t i;
+	uint16_t n = 0;
+
+	if (priv_check_mprq_support(priv) < 0)
+		return 0;
+	/* All the configured queues should be enabled. */
+	for (i = 0; i < priv->rxqs_n; ++i) {
+		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+
+		if (!rxq)
+			continue;
+		if (rxq_mprq_enabled(rxq))
+			++n;
+	}
+	if (n == priv->rxqs_n)
+		return 1;
+	if (n != 0)
+		ERROR("Multi-Packet RQ can't be partially configured, %u/%u",
+		      n, priv->rxqs_n);
+	return 0;
+}
+
+/**
+ * Allocate RX queue elements for Multi-Packet RQ.
  *
  * @param rxq_ctrl
  *   Pointer to RX queue structure.
@@ -63,8 +129,57 @@ const size_t rss_hash_default_key_len = sizeof(rss_hash_default_key);
  * @return
  *   0 on success, errno value on failure.
  */
-int
-rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
+static int
+rxq_alloc_elts_mprq(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_rxq_data *rxq = &rxq_ctrl->rxq;
+	unsigned int wqe_n = 1 << rxq->elts_n;
+	unsigned int i;
+	int ret = 0;
+
+	/* Iterate on segments. */
+	for (i = 0; i <= wqe_n; ++i) {
+		struct rte_mbuf *buf;
+
+		if (rte_mempool_get(rxq->mprq_mp, (void **)&buf) < 0) {
+			ERROR("%p: empty mbuf pool", (void *)rxq_ctrl);
+			ret = ENOMEM;
+			goto error;
+		}
+		if (i < wqe_n)
+			(*rxq->elts)[i] = buf;
+		else
+			rxq->mprq_repl = buf;
+		PORT(buf) = rxq->port_id;
+	}
+	DEBUG("%p: allocated and configured %u segments",
+	      (void *)rxq_ctrl, wqe_n);
+	assert(ret == 0);
+	return 0;
+error:
+	wqe_n = i;
+	for (i = 0; (i != wqe_n); ++i) {
+		if ((*rxq->elts)[i] != NULL)
+			rte_mempool_put(rxq->mprq_mp,
+					(*rxq->elts)[i]);
+		(*rxq->elts)[i] = NULL;
+	}
+	DEBUG("%p: failed, freed everything", (void *)rxq_ctrl);
+	assert(ret > 0);
+	return ret;
+}
+
+/**
+ * Allocate RX queue elements for Single-Packet RQ.
+ *
+ * @param rxq_ctrl
+ *   Pointer to RX queue structure.
+ *
+ * @return
+ *   0 on success, errno value on failure.
+ */
+static int
+rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 {
 	const unsigned int sges_n = 1 << rxq_ctrl->rxq.sges_n;
 	unsigned int elts_n = 1 << rxq_ctrl->rxq.elts_n;
@@ -135,6 +250,22 @@ rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
 }
 
 /**
+ * Allocate RX queue elements.
+ *
+ * @param rxq_ctrl
+ *   Pointer to RX queue structure.
+ *
+ * @return
+ *   0 on success, errno value on failure.
+ */
+int
+rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	return rxq_mprq_enabled(&rxq_ctrl->rxq) ?
+	       rxq_alloc_elts_mprq(rxq_ctrl) : rxq_alloc_elts_sprq(rxq_ctrl);
+}
+
+/**
  * Free RX queue elements.
  *
  * @param rxq_ctrl
@@ -166,6 +297,10 @@ rxq_free_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
 			rte_pktmbuf_free_seg((*rxq->elts)[i]);
 		(*rxq->elts)[i] = NULL;
 	}
+	if (rxq->mprq_repl != NULL) {
+		rte_pktmbuf_free_seg(rxq->mprq_repl);
+		rxq->mprq_repl = NULL;
+	}
 }
 
 /**
@@ -613,10 +748,16 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 			struct ibv_cq_init_attr_ex ibv;
 			struct mlx5dv_cq_init_attr mlx5;
 		} cq;
-		struct ibv_wq_init_attr wq;
+		struct {
+			struct ibv_wq_init_attr ibv;
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+			struct mlx5dv_wq_init_attr mlx5;
+#endif
+		} wq;
 		struct ibv_cq_ex cq_attr;
 	} attr;
-	unsigned int cqe_n = (1 << rxq_data->elts_n) - 1;
+	unsigned int cqe_n;
+	unsigned int wqe_n = 1 << rxq_data->elts_n;
 	struct mlx5_rxq_ibv *tmpl;
 	struct mlx5dv_cq cq_info;
 	struct mlx5dv_rwq rwq;
@@ -624,6 +765,7 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 	int ret = 0;
 	struct mlx5dv_obj obj;
 	struct mlx5_dev_config *config = &priv->config;
+	const int mprq_en = rxq_mprq_enabled(rxq_data);
 
 	assert(rxq_data);
 	assert(!rxq_ctrl->ibv);
@@ -646,6 +788,17 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 			goto error;
 		}
 	}
+	if (mprq_en) {
+		tmpl->mprq_mr = priv_mr_get(priv, rxq_data->mprq_mp);
+		if (!tmpl->mprq_mr) {
+			tmpl->mprq_mr = priv_mr_new(priv, rxq_data->mprq_mp);
+			if (!tmpl->mprq_mr) {
+				ERROR("%p: cannot create MR for"
+				      " Multi-Packet RQ", (void *)rxq_ctrl);
+				goto error;
+			}
+		}
+	}
 	if (rxq_ctrl->irq) {
 		tmpl->channel = mlx5_glue->create_comp_channel(priv->ctx);
 		if (!tmpl->channel) {
@@ -654,6 +807,10 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 			goto error;
 		}
 	}
+	if (mprq_en)
+		cqe_n = wqe_n * (1 << MLX5_MPRQ_STRIDE_NUM_N) - 1;
+	else
+		cqe_n = wqe_n - 1;
 	attr.cq.ibv = (struct ibv_cq_init_attr_ex){
 		.cqe = cqe_n,
 		.channel = tmpl->channel,
@@ -686,11 +843,11 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 	      priv->device_attr.orig_attr.max_qp_wr);
 	DEBUG("priv->device_attr.max_sge is %d",
 	      priv->device_attr.orig_attr.max_sge);
-	attr.wq = (struct ibv_wq_init_attr){
+	attr.wq.ibv = (struct ibv_wq_init_attr){
 		.wq_context = NULL, /* Could be useful in the future. */
 		.wq_type = IBV_WQT_RQ,
 		/* Max number of outstanding WRs. */
-		.max_wr = (1 << rxq_data->elts_n) >> rxq_data->sges_n,
+		.max_wr = wqe_n >> rxq_data->sges_n,
 		/* Max number of scatter/gather elements in a WR. */
 		.max_sge = 1 << rxq_data->sges_n,
 		.pd = priv->pd,
@@ -704,8 +861,8 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 	};
 	/* By default, FCS (CRC) is stripped by hardware. */
 	if (rxq_data->crc_present) {
-		attr.wq.create_flags |= IBV_WQ_FLAGS_SCATTER_FCS;
-		attr.wq.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
+		attr.wq.ibv.create_flags |= IBV_WQ_FLAGS_SCATTER_FCS;
+		attr.wq.ibv.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
 	}
 #ifdef HAVE_IBV_WQ_FLAG_RX_END_PADDING
 	if (config->hw_padding) {
@@ -713,7 +870,26 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 		attr.wq.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
 	}
 #endif
-	tmpl->wq = mlx5_glue->create_wq(priv->ctx, &attr.wq);
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	attr.wq.mlx5 = (struct mlx5dv_wq_init_attr){
+		.comp_mask = 0,
+	};
+	if (mprq_en) {
+		struct mlx5dv_striding_rq_init_attr *mprq_attr =
+			&attr.wq.mlx5.striding_rq_attrs;
+
+		attr.wq.mlx5.comp_mask |= MLX5DV_WQ_INIT_ATTR_MASK_STRIDING_RQ;
+		*mprq_attr = (struct mlx5dv_striding_rq_init_attr){
+			.single_stride_log_num_of_bytes = MLX5_MPRQ_STRIDE_SZ_N,
+			.single_wqe_log_num_of_strides = MLX5_MPRQ_STRIDE_NUM_N,
+			.two_byte_shift_en = MLX5_MPRQ_TWO_BYTE_SHIFT,
+		};
+	}
+	tmpl->wq = mlx5_glue->dv_create_wq(priv->ctx, &attr.wq.ibv,
+					   &attr.wq.mlx5);
+#else
+	tmpl->wq = mlx5_glue->create_wq(priv->ctx, &attr.wq.ibv);
+#endif
 	if (tmpl->wq == NULL) {
 		ERROR("%p: WQ creation failure", (void *)rxq_ctrl);
 		goto error;
@@ -722,14 +898,13 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 	 * Make sure number of WRs*SGEs match expectations since a queue
 	 * cannot allocate more than "desc" buffers.
 	 */
-	if (((int)attr.wq.max_wr !=
-	     ((1 << rxq_data->elts_n) >> rxq_data->sges_n)) ||
-	    ((int)attr.wq.max_sge != (1 << rxq_data->sges_n))) {
+	if (attr.wq.ibv.max_wr != (wqe_n >> rxq_data->sges_n) ||
+	    (int)attr.wq.ibv.max_sge != (1 << rxq_data->sges_n)) {
 		ERROR("%p: requested %u*%u but got %u*%u WRs*SGEs",
 		      (void *)rxq_ctrl,
-		      ((1 << rxq_data->elts_n) >> rxq_data->sges_n),
+		      wqe_n >> rxq_data->sges_n,
 		      (1 << rxq_data->sges_n),
-		      attr.wq.max_wr, attr.wq.max_sge);
+		      attr.wq.ibv.max_wr, attr.wq.ibv.max_sge);
 		goto error;
 	}
 	/* Change queue state to ready. */
@@ -756,25 +931,38 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 		goto error;
 	}
 	/* Fill the rings. */
-	rxq_data->wqes = (volatile struct mlx5_wqe_data_seg (*)[])
-		(uintptr_t)rwq.buf;
-	for (i = 0; (i != (unsigned int)(1 << rxq_data->elts_n)); ++i) {
+	rxq_data->wqes = rwq.buf;
+	for (i = 0; (i != wqe_n); ++i) {
+		volatile struct mlx5_wqe_data_seg *scat;
 		struct rte_mbuf *buf = (*rxq_data->elts)[i];
-		volatile struct mlx5_wqe_data_seg *scat = &(*rxq_data->wqes)[i];
-
+		uintptr_t addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		uint32_t byte_count;
+		uint32_t lkey;
+
+		if (mprq_en) {
+			scat = &((volatile struct mlx5_wqe_mprq *)
+				 rxq_data->wqes)[i].dseg;
+			byte_count = (1 << MLX5_MPRQ_STRIDE_SZ_N) *
+				     (1 << MLX5_MPRQ_STRIDE_NUM_N);
+			lkey = tmpl->mprq_mr->lkey;
+		} else {
+			scat = &((volatile struct mlx5_wqe_data_seg *)
+				 rxq_data->wqes)[i];
+			byte_count = DATA_LEN(buf);
+			lkey = tmpl->mr->lkey;
+		}
 		/* scat->addr must be able to store a pointer. */
 		assert(sizeof(scat->addr) >= sizeof(uintptr_t));
 		*scat = (struct mlx5_wqe_data_seg){
-			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
-								  uintptr_t)),
-			.byte_count = rte_cpu_to_be_32(DATA_LEN(buf)),
-			.lkey = tmpl->mr->lkey,
+			.addr = rte_cpu_to_be_64(addr),
+			.byte_count = rte_cpu_to_be_32(byte_count),
+			.lkey = lkey
 		};
 	}
 	rxq_data->rq_db = rwq.dbrec;
 	rxq_data->cqe_n = log2above(cq_info.cqe_cnt);
 	rxq_data->cq_ci = 0;
-	rxq_data->rq_ci = 0;
+	rxq_data->strd_ci = 0;
 	rxq_data->rq_pi = 0;
 	rxq_data->zip = (struct rxq_zip){
 		.ai = 0,
@@ -785,7 +973,7 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 	rxq_data->cqn = cq_info.cqn;
 	rxq_data->cq_arm_sn = 0;
 	/* Update doorbell counter. */
-	rxq_data->rq_ci = (1 << rxq_data->elts_n) >> rxq_data->sges_n;
+	rxq_data->rq_ci = wqe_n >> rxq_data->sges_n;
 	rte_wmb();
 	*rxq_data->rq_db = rte_cpu_to_be_32(rxq_data->rq_ci);
 	DEBUG("%p: rxq updated with %p", (void *)rxq_ctrl, (void *)&tmpl);
@@ -802,6 +990,8 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
 		claim_zero(mlx5_glue->destroy_cq(tmpl->cq));
 	if (tmpl->channel)
 		claim_zero(mlx5_glue->destroy_comp_channel(tmpl->channel));
+	if (tmpl->mprq_mr)
+		priv_mr_release(priv, tmpl->mprq_mr);
 	if (tmpl->mr)
 		priv_mr_release(priv, tmpl->mr);
 	priv->verbs_alloc_ctx.type = MLX5_VERBS_ALLOC_TYPE_NONE;
@@ -832,6 +1022,8 @@ mlx5_priv_rxq_ibv_get(struct priv *priv, uint16_t idx)
 	rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
 	if (rxq_ctrl->ibv) {
 		priv_mr_get(priv, rxq_data->mp);
+		if (rxq_mprq_enabled(rxq_data))
+			priv_mr_get(priv, rxq_data->mprq_mp);
 		rte_atomic32_inc(&rxq_ctrl->ibv->refcnt);
 		DEBUG("%p: Verbs Rx queue %p: refcnt %d", (void *)priv,
 		      (void *)rxq_ctrl->ibv,
@@ -863,6 +1055,11 @@ mlx5_priv_rxq_ibv_release(struct priv *priv, struct mlx5_rxq_ibv *rxq_ibv)
 	ret = priv_mr_release(priv, rxq_ibv->mr);
 	if (!ret)
 		rxq_ibv->mr = NULL;
+	if (rxq_mprq_enabled(&rxq_ibv->rxq_ctrl->rxq)) {
+		ret = priv_mr_release(priv, rxq_ibv->mprq_mr);
+		if (!ret)
+			rxq_ibv->mprq_mr = NULL;
+	}
 	DEBUG("%p: Verbs Rx queue %p: refcnt %d", (void *)priv,
 	      (void *)rxq_ibv, rte_atomic32_read(&rxq_ibv->refcnt));
 	if (rte_atomic32_dec_and_test(&rxq_ibv->refcnt)) {
@@ -918,12 +1115,99 @@ mlx5_priv_rxq_ibv_releasable(struct priv *priv, struct mlx5_rxq_ibv *rxq_ibv)
 }
 
 /**
+ * Callback function to initialize mbufs for Multi-Packet RQ.
+ */
+static inline void
+mlx5_mprq_mbuf_init(struct rte_mempool *mp, void *opaque_arg,
+		    void *_m, unsigned int i __rte_unused)
+{
+	struct rte_mbuf *m = _m;
+
+	rte_pktmbuf_init(mp, opaque_arg, _m, i);
+	m->buf_len =
+		(1 << MLX5_MPRQ_STRIDE_SZ_N) * (1 << MLX5_MPRQ_STRIDE_NUM_N);
+	rte_pktmbuf_reset_headroom(m);
+}
+
+/**
+ * Configure Rx queue as Multi-Packet RQ.
+ *
+ * @param rxq_ctrl
+ *   Pointer to RX queue structure.
+ * @param priv
+ *   Pointer to private structure.
+ * @param idx
+ *   RX queue index.
+ * @param desc
+ *   Number of descriptors to configure in queue.
+ *
+ * @return
+ *   0 on success, negative errno value on failure.
+ */
+static int
+rxq_configure_mprq(struct mlx5_rxq_ctrl *rxq_ctrl, uint16_t idx, uint16_t desc)
+{
+	struct priv *priv = rxq_ctrl->priv;
+	struct mlx5_dev_config *config = &priv->config;
+	struct rte_mempool *mp;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	unsigned int buf_len;
+	unsigned int obj_size;
+
+	assert(rxq_ctrl->rxq.sges_n == 0);
+	rxq_ctrl->rxq.strd_sz_n =
+		MLX5_MPRQ_STRIDE_SZ_N - MLX5_MPRQ_MIN_STRIDE_SZ_N;
+	rxq_ctrl->rxq.strd_num_n =
+		MLX5_MPRQ_STRIDE_NUM_N - MLX5_MPRQ_MIN_STRIDE_NUM_N;
+	rxq_ctrl->rxq.strd_shift_en = MLX5_MPRQ_TWO_BYTE_SHIFT;
+	rxq_ctrl->rxq.mprq_max_memcpy_len = config->mprq_max_memcpy_len;
+	buf_len = (1 << MLX5_MPRQ_STRIDE_SZ_N) * (1 << MLX5_MPRQ_STRIDE_NUM_N) +
+		  RTE_PKTMBUF_HEADROOM;
+	obj_size = buf_len + sizeof(struct rte_mbuf);
+	snprintf(name, sizeof(name), "%s-mprq-%u", priv->dev->data->name, idx);
+	/*
+	 * Allocate per-queue Mempool for Multi-Packet RQ.
+	 *
+	 * Received packets can be either memcpy'd or indirectly referenced. In
+	 * case of mbuf indirection, as it isn't possible to predict how the
+	 * buffers will be queued by the application, the exact number of
+	 * buffers needed can't be pre-allocated; instead, enough buffers are
+	 * speculatively prepared.
+	 *
+	 * In the data path, if this Mempool is depleted, PMD will try to memcpy
+	 * received packets to buffers provided by application (rxq->mp) until
+	 * this Mempool gets available again.
+	 */
+	desc *= 4;
+	mp = rte_mempool_create(name, desc + MLX5_MPRQ_MP_CACHE_SZ,
+				obj_size, MLX5_MPRQ_MP_CACHE_SZ,
+				sizeof(struct rte_pktmbuf_pool_private),
+				NULL, NULL, NULL, NULL,
+				priv->dev->device->numa_node,
+				MEMPOOL_F_SC_GET);
+	if (mp == NULL) {
+		ERROR("%p: failed to allocate a mempool for"
+		      " multi-packet Rx queue (%u): %s",
+		      (void *)priv->dev, idx,
+		      rte_strerror(rte_errno));
+		return -ENOMEM;
+	}
+
+	rte_pktmbuf_pool_init(mp, NULL);
+	rte_mempool_obj_iter(mp, mlx5_mprq_mbuf_init, NULL);
+	rxq_ctrl->rxq.mprq_mp = mp;
+	DEBUG("%p: Multi-Packet RQ is enabled for Rx queue %u",
+	      (void *)priv->dev, idx);
+	return 0;
+}
+
+/**
  * Create a DPDK Rx queue.
  *
  * @param priv
  *   Pointer to private structure.
  * @param idx
- *   TX queue index.
+ *   RX queue index.
  * @param desc
  *   Number of descriptors to configure in queue.
  * @param socket
@@ -945,8 +1229,9 @@ mlx5_priv_rxq_new(struct priv *priv, uint16_t idx, uint16_t desc,
 	 * Always allocate extra slots, even if eventually
 	 * the vector Rx will not be used.
 	 */
-	const uint16_t desc_n =
+	uint16_t desc_n =
 		desc + config->rx_vec_en * MLX5_VPMD_DESCS_PER_LOOP;
+	const int mprq_en = priv_check_mprq_support(priv) > 0;
 
 	tmpl = rte_calloc_socket("RXQ", 1,
 				 sizeof(*tmpl) +
@@ -954,13 +1239,35 @@ mlx5_priv_rxq_new(struct priv *priv, uint16_t idx, uint16_t desc,
 				 0, socket);
 	if (!tmpl)
 		return NULL;
+	tmpl->priv = priv;
 	tmpl->socket = socket;
 	if (priv->dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
-	/* Enable scattered packets support for this queue if necessary. */
+	/*
+	 * This Rx queue can be configured as a Multi-Packet RQ if all of the
+	 * following conditions are met:
+	 *  - MPRQ is enabled.
+	 *  - The number of descs is more than the number of strides.
+	 *  - max_rx_pkt_len is no larger than a stride size minus headroom.
+	 *
+	 *  Otherwise, enable Rx scatter if necessary.
+	 */
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
-	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
-	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
+	if (mprq_en &&
+	    desc >= (1U << MLX5_MPRQ_STRIDE_NUM_N) &&
+	    dev->data->dev_conf.rxmode.max_rx_pkt_len <=
+	    (1U << MLX5_MPRQ_STRIDE_SZ_N) - RTE_PKTMBUF_HEADROOM) {
+		int ret;
+
+		/* TODO: Rx scatter isn't supported yet. */
+		tmpl->rxq.sges_n = 0;
+		/* Trim the number of descs needed. */
+		desc >>= MLX5_MPRQ_STRIDE_NUM_N;
+		ret = rxq_configure_mprq(tmpl, idx, desc);
+		if (ret)
+			goto error;
+	} else if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
+		   (mb_len - RTE_PKTMBUF_HEADROOM)) {
 		tmpl->rxq.sges_n = 0;
 	} else if (conf->offloads & DEV_RX_OFFLOAD_SCATTER) {
 		unsigned int size =
@@ -1030,7 +1337,6 @@ mlx5_priv_rxq_new(struct priv *priv, uint16_t idx, uint16_t desc,
 	/* Save port ID. */
 	tmpl->rxq.rss_hash = priv->rxqs_n > 1;
 	tmpl->rxq.port_id = dev->data->port_id;
-	tmpl->priv = priv;
 	tmpl->rxq.mp = mp;
 	tmpl->rxq.stats.idx = idx;
 	tmpl->rxq.elts_n = log2above(desc);
@@ -1105,6 +1411,25 @@ mlx5_priv_rxq_release(struct priv *priv, uint16_t idx)
 	DEBUG("%p: Rx queue %p: refcnt %d", (void *)priv,
 	      (void *)rxq_ctrl, rte_atomic32_read(&rxq_ctrl->refcnt));
 	if (rte_atomic32_dec_and_test(&rxq_ctrl->refcnt)) {
+		if (rxq_ctrl->rxq.mprq_mp != NULL) {
+			/* If an mbuf in the pool has an indirect mbuf attached
+			 * and it is still in use by the application, destroying
+			 * the Rx queue can spoil the packet. It is unlikely to
+			 * happen but can if the application dynamically creates
+			 * and destroys queues while holding Rx packets.
+			 *
+			 * TODO: It is unavoidable for now because the Mempool
+			 * for Multi-Packet RQ isn't provided by application but
+			 * managed by PMD.
+			 */
+			if (!rte_mempool_full(rxq_ctrl->rxq.mprq_mp)) {
+				ERROR("Mempool for Multi-Packet RQ %p"
+				      " is still in use", (void *)rxq_ctrl);
+				return EBUSY;
+			}
+			rte_mempool_free(rxq_ctrl->rxq.mprq_mp);
+			rxq_ctrl->rxq.mprq_mp = NULL;
+		}
 		LIST_REMOVE(rxq_ctrl, next);
 		rte_free(rxq_ctrl);
 		(*priv->rxqs)[idx] = NULL;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 36eeefb49..49254ab59 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1800,7 +1800,8 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 	while (pkts_n) {
 		unsigned int idx = rq_ci & wqe_cnt;
-		volatile struct mlx5_wqe_data_seg *wqe = &(*rxq->wqes)[idx];
+		volatile struct mlx5_wqe_data_seg *wqe =
+			&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[idx];
 		struct rte_mbuf *rep = (*rxq->elts)[idx];
 		uint32_t rss_hash_res = 0;
 
@@ -1901,6 +1902,155 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
+ * DPDK callback for RX with Multi-Packet RQ support.
+ *
+ * @param dpdk_rxq
+ *   Generic pointer to RX queue structure.
+ * @param[out] pkts
+ *   Array to store received packets.
+ * @param pkts_n
+ *   Maximum number of packets in array.
+ *
+ * @return
+ *   Number of packets successfully received (<= pkts_n).
+ */
+uint16_t
+mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
+{
+	struct mlx5_rxq_data *rxq = dpdk_rxq;
+	const unsigned int strd_n =
+		1 << (rxq->strd_num_n + MLX5_MPRQ_MIN_STRIDE_NUM_N);
+	const unsigned int strd_sz =
+		1 << (rxq->strd_sz_n + MLX5_MPRQ_MIN_STRIDE_SZ_N);
+	const unsigned int strd_shift =
+		MLX5_MPRQ_STRIDE_SHIFT_BYTE * rxq->strd_shift_en;
+	const unsigned int cq_mask = (1 << rxq->cqe_n) - 1;
+	const unsigned int wq_mask = (1 << rxq->elts_n) - 1;
+	volatile struct mlx5_cqe *cqe = &(*rxq->cqes)[rxq->cq_ci & cq_mask];
+	unsigned int i = 0;
+	uint16_t rq_ci = rxq->rq_ci;
+	uint16_t strd_idx = rxq->strd_ci;
+	struct rte_mbuf *buf = (*rxq->elts)[rq_ci & wq_mask];
+
+	while (i < pkts_n) {
+		struct rte_mbuf *pkt;
+		int ret;
+		unsigned int len;
+		uint16_t consumed_strd;
+		uint32_t offset;
+		uint32_t byte_cnt;
+		uint32_t rss_hash_res = 0;
+
+		if (strd_idx == strd_n) {
+			/* Replace WQE only if the buffer is still in use. */
+			if (unlikely(rte_mbuf_refcnt_read(buf) > 1)) {
+				struct rte_mbuf *rep = rxq->mprq_repl;
+				volatile struct mlx5_wqe_data_seg *wqe =
+					&((volatile struct mlx5_wqe_mprq *)
+					  rxq->wqes)[rq_ci & wq_mask].dseg;
+				uintptr_t addr;
+
+				/* Replace mbuf. */
+				(*rxq->elts)[rq_ci & wq_mask] = rep;
+				PORT(rep) = PORT(buf);
+				/* Release the old buffer. */
+				if (__rte_mbuf_refcnt_update(buf, -1) == 0) {
+					rte_mbuf_refcnt_set(buf, 1);
+					rte_mbuf_raw_free(buf);
+				}
+				/* Replace WQE. */
+				addr = rte_pktmbuf_mtod(rep, uintptr_t);
+				wqe->addr = rte_cpu_to_be_64(addr);
+				/* Stash a mbuf for next replacement. */
+				if (likely(!rte_mempool_get(rxq->mprq_mp,
+							    (void **)&rep)))
+					rxq->mprq_repl = rep;
+				else
+					rxq->mprq_repl = NULL;
+			}
+			/* Advance to the next WQE. */
+			strd_idx = 0;
+			++rq_ci;
+			buf = (*rxq->elts)[rq_ci & wq_mask];
+		}
+		cqe = &(*rxq->cqes)[rxq->cq_ci & cq_mask];
+		ret = mlx5_rx_poll_len(rxq, cqe, cq_mask, &rss_hash_res);
+		if (!ret)
+			break;
+		if (unlikely(ret == -1)) {
+			/* RX error, packet is likely too large. */
+			++rxq->stats.idropped;
+			continue;
+		}
+		byte_cnt = ret;
+		offset = strd_idx * strd_sz + strd_shift;
+		consumed_strd = (byte_cnt & MLX5_MPRQ_STRIDE_NUM_MASK) >>
+				MLX5_MPRQ_STRIDE_NUM_SHIFT;
+		strd_idx += consumed_strd;
+		if (byte_cnt & MLX5_MPRQ_FILLER_MASK)
+			continue;
+		pkt = rte_pktmbuf_alloc(rxq->mp);
+		if (unlikely(pkt == NULL)) {
+			++rxq->stats.rx_nombuf;
+			break;
+		}
+		len = (byte_cnt & MLX5_MPRQ_LEN_MASK) >> MLX5_MPRQ_LEN_SHIFT;
+		assert((int)len >= (rxq->crc_present << 2));
+		if (rxq->crc_present)
+			len -= ETHER_CRC_LEN;
+		/*
+		 * Memcpy packets to the target mbuf if:
+		 * - The packet size is no larger than mprq_max_memcpy_len.
+		 * - The Mempool for Multi-Packet RQ is depleted.
+		 */
+		if (len <= rxq->mprq_max_memcpy_len || rxq->mprq_repl == NULL) {
+			uintptr_t base = rte_pktmbuf_mtod(buf, uintptr_t);
+
+			rte_memcpy(rte_pktmbuf_mtod(pkt, void *),
+				   (void *)(base + offset), len);
+			/* Initialize the offload flag. */
+			pkt->ol_flags = 0;
+		} else {
+			/*
+			 * IND_ATTACHED_MBUF will be set to pkt->ol_flags when
+			 * attaching the mbuf and more offload flags will be
+			 * added below by calling rxq_cq_to_mbuf(). Other fields
+			 * will be overwritten.
+			 */
+			rte_pktmbuf_attach_at(pkt, buf, offset,
+					      consumed_strd * strd_sz);
+			assert(pkt->ol_flags == IND_ATTACHED_MBUF);
+			rte_pktmbuf_reset_headroom(pkt);
+		}
+		rxq_cq_to_mbuf(rxq, pkt, cqe, rss_hash_res);
+		PKT_LEN(pkt) = len;
+		DATA_LEN(pkt) = len;
+#ifdef MLX5_PMD_SOFT_COUNTERS
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += PKT_LEN(pkt);
+#endif
+		/* Return packet. */
+		*(pkts++) = pkt;
+		++i;
+	}
+	/* Update the consumer index. */
+	rxq->rq_pi += i;
+	rxq->strd_ci = strd_idx;
+	rte_io_wmb();
+	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
+	if (rq_ci != rxq->rq_ci) {
+		rxq->rq_ci = rq_ci;
+		rte_io_wmb();
+		*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
+	}
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+#endif
+	return i;
+}
+
+/**
  * Dummy DPDK callback for TX.
  *
  * This function is used to temporarily replace the real callback during
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index d7e890558..ba8ac32c2 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -86,18 +86,25 @@ struct mlx5_rxq_data {
 	unsigned int elts_n:4; /* Log 2 of Mbufs. */
 	unsigned int rss_hash:1; /* RSS hash result is enabled. */
 	unsigned int mark:1; /* Marked flow available on the queue. */
-	unsigned int :15; /* Remaining bits. */
+	unsigned int strd_sz_n:3; /* Log 2 of stride size. */
+	unsigned int strd_num_n:4; /* Log 2 of the number of strides. */
+	unsigned int strd_shift_en:1; /* Enable 2-byte shift on a stride. */
+	unsigned int :8; /* Remaining bits. */
 	volatile uint32_t *rq_db;
 	volatile uint32_t *cq_db;
 	uint16_t port_id;
 	uint16_t rq_ci;
+	uint16_t strd_ci; /* Stride index in a WQE for Multi-Packet RQ. */
 	uint16_t rq_pi;
 	uint16_t cq_ci;
-	volatile struct mlx5_wqe_data_seg(*wqes)[];
+	uint16_t mprq_max_memcpy_len; /* Maximum size of packet to memcpy. */
+	volatile void *wqes;
 	volatile struct mlx5_cqe(*cqes)[];
 	struct rxq_zip zip; /* Compressed context. */
 	struct rte_mbuf *(*elts)[];
 	struct rte_mempool *mp;
+	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
+	struct rte_mbuf *mprq_repl; /* Stashed mbuf for replenish. */
 	struct mlx5_rxq_stats stats;
 	uint64_t mbuf_initializer; /* Default rearm_data for vectorized Rx. */
 	struct rte_mbuf fake_mbuf; /* elts padding for vectorized Rx. */
@@ -115,6 +122,7 @@ struct mlx5_rxq_ibv {
 	struct ibv_wq *wq; /* Work Queue. */
 	struct ibv_comp_channel *channel;
 	struct mlx5_mr *mr; /* Memory Region (for mp). */
+	struct mlx5_mr *mprq_mr; /* Memory Region (for mprq_mp). */
 };
 
 /* RX queue control descriptor. */
@@ -210,6 +218,8 @@ struct mlx5_txq_ctrl {
 extern uint8_t rss_hash_default_key[];
 extern const size_t rss_hash_default_key_len;
 
+int priv_check_mprq_support(struct priv *);
+int priv_mprq_enabled(struct priv *);
 void mlx5_rxq_cleanup(struct mlx5_rxq_ctrl *);
 int mlx5_rx_queue_setup(struct rte_eth_dev *, uint16_t, uint16_t, unsigned int,
 			const struct rte_eth_rxconf *, struct rte_mempool *);
@@ -232,6 +242,7 @@ int mlx5_priv_rxq_release(struct priv *, uint16_t);
 int mlx5_priv_rxq_releasable(struct priv *, uint16_t);
 int mlx5_priv_rxq_verify(struct priv *);
 int rxq_alloc_elts(struct mlx5_rxq_ctrl *);
+int rxq_alloc_mprq_buf(struct mlx5_rxq_ctrl *);
 struct mlx5_ind_table_ibv *mlx5_priv_ind_table_ibv_new(struct priv *,
 						       uint16_t [],
 						       uint16_t);
@@ -280,6 +291,7 @@ uint16_t mlx5_tx_burst_mpw(void *, struct rte_mbuf **, uint16_t);
 uint16_t mlx5_tx_burst_mpw_inline(void *, struct rte_mbuf **, uint16_t);
 uint16_t mlx5_tx_burst_empw(void *, struct rte_mbuf **, uint16_t);
 uint16_t mlx5_rx_burst(void *, struct rte_mbuf **, uint16_t);
+uint16_t mlx5_rx_burst_mprq(void *, struct rte_mbuf **, uint16_t);
 uint16_t removed_tx_burst(void *, struct rte_mbuf **, uint16_t);
 uint16_t removed_rx_burst(void *, struct rte_mbuf **, uint16_t);
 int mlx5_rx_descriptor_status(void *, uint16_t);
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index b66c2916f..ab4610c84 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -282,6 +282,8 @@ rxq_check_vec_support(struct mlx5_rxq_data *rxq)
 	struct mlx5_rxq_ctrl *ctrl =
 		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
 
+	if (priv_mprq_enabled(ctrl->priv))
+		return -ENOTSUP;
 	if (!ctrl->priv->config.rx_vec_en || rxq->sges_n != 0)
 		return -ENOTSUP;
 	return 1;
@@ -303,6 +305,8 @@ priv_check_vec_rx_support(struct priv *priv)
 
 	if (!priv->config.rx_vec_en)
 		return -ENOTSUP;
+	if (priv_mprq_enabled(priv))
+		return -ENOTSUP;
 	/* All the configured queues should support. */
 	for (i = 0; i < priv->rxqs_n; ++i) {
 		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index 44856bbff..b181d04cf 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -87,7 +87,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
 	const uint16_t q_mask = q_n - 1;
 	uint16_t elts_idx = rxq->rq_ci & q_mask;
 	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
-	volatile struct mlx5_wqe_data_seg *wq = &(*rxq->wqes)[elts_idx];
+	volatile struct mlx5_wqe_data_seg *wq =
+		&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
 	unsigned int i;
 
 	assert(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v1 5/6] net/mlx5: release Tx queue resource earlier than Rx
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (3 preceding siblings ...)
  2018-03-10  1:25 ` [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
@ 2018-03-10  1:25 ` Yongseok Koh
  2018-03-10  1:25 ` [PATCH v1 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Multi-Packet RQ uses mbuf indirection and the direct mbufs come from the
private Mempool (rxq->mprq_mp) of the PMD. To properly release this
Mempool, the Tx completion array (txq->elts) should be emptied first so
that no mbuf referencing it remains, hence Tx queue resources are
released earlier than Rx.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 25c0b5b1f..b2487186a 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -187,14 +187,6 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 	/* Prevent crashes when queues are still in use. */
 	dev->rx_pkt_burst = removed_rx_burst;
 	dev->tx_pkt_burst = removed_tx_burst;
-	if (priv->rxqs != NULL) {
-		/* XXX race condition if mlx5_rx_burst() is still running. */
-		usleep(1000);
-		for (i = 0; (i != priv->rxqs_n); ++i)
-			mlx5_priv_rxq_release(priv, i);
-		priv->rxqs_n = 0;
-		priv->rxqs = NULL;
-	}
 	if (priv->txqs != NULL) {
 		/* XXX race condition if mlx5_tx_burst() is still running. */
 		usleep(1000);
@@ -203,6 +195,14 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 		priv->txqs_n = 0;
 		priv->txqs = NULL;
 	}
+	if (priv->rxqs != NULL) {
+		/* XXX race condition if mlx5_rx_burst() is still running. */
+		usleep(1000);
+		for (i = 0; (i != priv->rxqs_n); ++i)
+			mlx5_priv_rxq_release(priv, i);
+		priv->rxqs_n = 0;
+		priv->rxqs = NULL;
+	}
 	if (priv->pd != NULL) {
 		assert(priv->ctx != NULL);
 		claim_zero(mlx5_glue->dealloc_pd(priv->pd));
-- 
2.11.0


* [PATCH v1 6/6] app/testpmd: conserve mbuf indirection flag
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (4 preceding siblings ...)
  2018-03-10  1:25 ` [PATCH v1 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
@ 2018-03-10  1:25 ` Yongseok Koh
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-03-10  1:25 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

If a PMD delivers Rx packets with mbuf indirection, the indirection flag
in ol_flags should not be overwritten. For the mlx5 PMD, Rx packets can
be indirect mbufs if Multi-Packet RQ is enabled.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 2 ++
 app/test-pmd/macfwd.c   | 2 ++
 app/test-pmd/macswap.c  | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5f5ab64aa..1dd4d7130 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -770,6 +770,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (RTE_MBUF_INDIRECT(m))
+			tx_ol_flags |= IND_ATTACHED_MBUF;
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..7e096ee78 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,8 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (RTE_MBUF_INDIRECT(mb))
+			ol_flags |= IND_ATTACHED_MBUF;
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..39f96c1e0 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,8 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (RTE_MBUF_INDIRECT(mb))
+			ol_flags |= IND_ATTACHED_MBUF;
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0


* Re: [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue
  2018-03-10  1:25 ` [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
@ 2018-03-12  9:13   ` Nélio Laranjeiro
  0 siblings, 0 replies; 86+ messages in thread
From: Nélio Laranjeiro @ 2018-03-12  9:13 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: wenzhuo.lu, jingjing.wu, adrien.mazarguil, olivier.matz, dev

On Fri, Mar 09, 2018 at 05:25:29PM -0800, Yongseok Koh wrote:
> mlx5dv_create_wq() is added.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_glue.c | 9 +++++++++
>  drivers/net/mlx5/mlx5_glue.h | 4 ++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/drivers/net/mlx5/mlx5_glue.c b/drivers/net/mlx5/mlx5_glue.c
> index 1c4396ada..e33fc76b5 100644
> --- a/drivers/net/mlx5/mlx5_glue.c
> +++ b/drivers/net/mlx5/mlx5_glue.c
> @@ -287,6 +287,14 @@ mlx5_glue_dv_create_cq(struct ibv_context *context,
>  	return mlx5dv_create_cq(context, cq_attr, mlx5_cq_attr);
>  }
>  
> +static struct ibv_wq *
> +mlx5_glue_dv_create_wq(struct ibv_context *context,
> +		       struct ibv_wq_init_attr *wq_attr,
> +		       struct mlx5dv_wq_init_attr *mlx5_wq_attr)
> +{
> +	return mlx5dv_create_wq(context, wq_attr, mlx5_wq_attr);
> +}
> +
>  static int
>  mlx5_glue_dv_query_device(struct ibv_context *ctx,
>  			  struct mlx5dv_context *attrs_out)
> @@ -347,6 +355,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue){
>  	.port_state_str = mlx5_glue_port_state_str,
>  	.cq_ex_to_cq = mlx5_glue_cq_ex_to_cq,
>  	.dv_create_cq = mlx5_glue_dv_create_cq,
> +	.dv_create_wq = mlx5_glue_dv_create_wq,
>  	.dv_query_device = mlx5_glue_dv_query_device,
>  	.dv_set_context_attr = mlx5_glue_dv_set_context_attr,
>  	.dv_init_obj = mlx5_glue_dv_init_obj,
> diff --git a/drivers/net/mlx5/mlx5_glue.h b/drivers/net/mlx5/mlx5_glue.h
> index b5efee3b6..21a713961 100644
> --- a/drivers/net/mlx5/mlx5_glue.h
> +++ b/drivers/net/mlx5/mlx5_glue.h
> @@ -100,6 +100,10 @@ struct mlx5_glue {
>  		(struct ibv_context *context,
>  		 struct ibv_cq_init_attr_ex *cq_attr,
>  		 struct mlx5dv_cq_init_attr *mlx5_cq_attr);
> +	struct ibv_wq *(*dv_create_wq)
> +		(struct ibv_context *context,
> +		 struct ibv_wq_init_attr *wq_attr,
> +		 struct mlx5dv_wq_init_attr *mlx5_wq_attr);
>  	int (*dv_query_device)(struct ibv_context *ctx_in,
>  			       struct mlx5dv_context *attrs_out);
>  	int (*dv_set_context_attr)(struct ibv_context *ibv_ctx,
> -- 
> 2.11.0
 
You missed updating the GLUE ABI version; it must be bumped along with this change.

Regards,

-- 
Nélio Laranjeiro
6WIND


* Re: [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support
  2018-03-10  1:25 ` [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
@ 2018-03-12  9:20   ` Nélio Laranjeiro
  0 siblings, 0 replies; 86+ messages in thread
From: Nélio Laranjeiro @ 2018-03-12  9:20 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: wenzhuo.lu, jingjing.wu, adrien.mazarguil, olivier.matz, dev

On Fri, Mar 09, 2018 at 05:25:30PM -0800, Yongseok Koh wrote:
> Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe
> bandwidth by posting a single large buffer for multiple packets. Instead of
> posting a buffer per packet, one large buffer is posted to receive multiple
> packets on the buffer. An MPRQ buffer consists of multiple fixed-size
> strides and each stride receives one packet.
> 
> An Rx packet is either mem-copied to a user-provided mbuf if its length is
> comparatively small, or referenced by mbuf indirection otherwise. In case of
> indirection, the Mempool for the direct mbufs is allocated and managed by
> the PMD.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>  doc/guides/nics/mlx5.rst         |  23 +++
>  drivers/net/mlx5/Makefile        |   5 +
>  drivers/net/mlx5/mlx5.c          |  63 +++++++
>  drivers/net/mlx5/mlx5.h          |   3 +
>  drivers/net/mlx5/mlx5_defs.h     |  20 ++
>  drivers/net/mlx5/mlx5_ethdev.c   |   3 +
>  drivers/net/mlx5/mlx5_prm.h      |  15 ++
>  drivers/net/mlx5/mlx5_rxq.c      | 389 +++++++++++++++++++++++++++++++++++----
>  drivers/net/mlx5/mlx5_rxtx.c     | 152 ++++++++++++++-
>  drivers/net/mlx5/mlx5_rxtx.h     |  16 +-
>  drivers/net/mlx5/mlx5_rxtx_vec.c |   4 +
>  drivers/net/mlx5/mlx5_rxtx_vec.h |   3 +-
>  12 files changed, 660 insertions(+), 36 deletions(-)
> 
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
> index 0e6e525c9..1600bfa7b 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -253,6 +253,29 @@ Run-time configuration
>    - x86_64 with ConnectX-4, ConnectX-4 LX and ConnectX-5.
>    - POWER8 and ARMv8 with ConnectX-4 LX and ConnectX-5.
>  
> +- ``mprq_en`` parameter [int]
> +
> +  A nonzero value enables configuring Multi-Packet Rx queues. An Rx queue is
> +  configured as Multi-Packet RQ if the total number of Rx queues is
> +  ``rxqs_min_mprq`` or more and Rx scatter isn't configured. Disabled by default.
> +
> +  Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
> +  by posting a single large buffer for multiple packets. Instead of posting a
> +  buffer per packet, one large buffer is posted to receive multiple packets on
> +  the buffer. An MPRQ buffer consists of multiple fixed-size strides and each
> +  stride receives one packet.
> +
> +- ``mprq_max_memcpy_len`` parameter [int]
> +  The maximum packet size for memcpy in case of Multi-Packet Rx queue. An Rx
> +  packet is mem-copied to a user-provided mbuf if the size of the Rx packet is
> +  less than or equal to this parameter. Otherwise, the packet is referenced by
> +  mbuf indirection. In case of indirection, the Mempool for the direct mbufs is
> +  allocated and managed by the PMD. The default value is 128.
> +
> +- ``rxqs_min_mprq`` parameter [int]
> +  Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
> +  greater than or equal to this value. The default value is 12.
> +
>  - ``txq_inline`` parameter [int]
>  
>    Amount of data to be inlined during TX operations. Improves latency.
> diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
> index afda4118f..e5e276a71 100644
> --- a/drivers/net/mlx5/Makefile
> +++ b/drivers/net/mlx5/Makefile
> @@ -125,6 +125,11 @@ mlx5_autoconf.h.new: FORCE
>  mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
>  	$Q $(RM) -f -- '$@'
>  	$Q sh -- '$<' '$@' \
> +		HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT \
> +		infiniband/mlx5dv.h \
> +		enum MLX5DV_CONTEXT_MASK_STRIDING_RQ \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
>  		HAVE_IBV_DEVICE_TUNNEL_SUPPORT \
>  		infiniband/mlx5dv.h \
>  		enum MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS \
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index 61cb93101..25c0b5b1f 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -44,6 +44,18 @@
>  /* Device parameter to enable RX completion queue compression. */
>  #define MLX5_RXQ_CQE_COMP_EN "rxq_cqe_comp_en"
>  
> +/* Device parameter to enable Multi-Packet Rx queue. */
> +#define MLX5_RX_MPRQ_EN "mprq_en"
> +
> +/* Device parameter to limit the size of memcpy'd packet. */
> +#define MLX5_RX_MPRQ_MAX_MEMCPY_LEN "mprq_max_memcpy_len"
> +
> +/*
> + * Device parameter to set the minimum number of Rx queues to configure
> + * Multi-Packet Rx queue.
> + */
> +#define MLX5_RXQS_MIN_MPRQ "rxqs_min_mprq"
> +
>  /* Device parameter to configure inline send. */
>  #define MLX5_TXQ_INLINE "txq_inline"
>  
> @@ -383,6 +395,12 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
>  	}
>  	if (strcmp(MLX5_RXQ_CQE_COMP_EN, key) == 0) {
>  		config->cqe_comp = !!tmp;
> +	} else if (strcmp(MLX5_RX_MPRQ_EN, key) == 0) {
> +		config->mprq = !!tmp;
> +	} else if (strcmp(MLX5_RX_MPRQ_MAX_MEMCPY_LEN, key) == 0) {
> +		config->mprq_max_memcpy_len = tmp;
> +	} else if (strcmp(MLX5_RXQS_MIN_MPRQ, key) == 0) {
> +		config->rxqs_mprq = tmp;
>  	} else if (strcmp(MLX5_TXQ_INLINE, key) == 0) {
>  		config->txq_inline = tmp;
>  	} else if (strcmp(MLX5_TXQS_MIN_INLINE, key) == 0) {
> @@ -420,6 +438,9 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
>  {
>  	const char **params = (const char *[]){
>  		MLX5_RXQ_CQE_COMP_EN,
> +		MLX5_RX_MPRQ_EN,
> +		MLX5_RX_MPRQ_MAX_MEMCPY_LEN,
> +		MLX5_RXQS_MIN_MPRQ,
>  		MLX5_TXQ_INLINE,
>  		MLX5_TXQS_MIN_INLINE,
>  		MLX5_TXQ_MPW_EN,
> @@ -582,6 +603,7 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
>  	unsigned int mps;
>  	unsigned int cqe_comp;
>  	unsigned int tunnel_en = 0;
> +	unsigned int mprq = 0;
>  	int idx;
>  	int i;
>  	struct mlx5dv_context attrs_out = {0};
> @@ -664,6 +686,9 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
>  #ifdef HAVE_IBV_DEVICE_TUNNEL_SUPPORT
>  	attrs_out.comp_mask |= MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS;
>  #endif
> +#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
> +	attrs_out.comp_mask |= MLX5DV_CONTEXT_MASK_STRIDING_RQ;
> +#endif
>  	mlx5_glue->dv_query_device(attr_ctx, &attrs_out);
>  	if (attrs_out.flags & MLX5DV_CONTEXT_FLAGS_MPW_ALLOWED) {
>  		if (attrs_out.flags & MLX5DV_CONTEXT_FLAGS_ENHANCED_MPW) {
> @@ -677,6 +702,37 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
>  		DEBUG("MPW isn't supported");
>  		mps = MLX5_MPW_DISABLED;
>  	}
> +#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
> +	if (attrs_out.comp_mask & MLX5DV_CONTEXT_MASK_STRIDING_RQ) {
> +		struct mlx5dv_striding_rq_caps mprq_caps =
> +			attrs_out.striding_rq_caps;
> +
> +		DEBUG("\tmin_single_stride_log_num_of_bytes: %d",
> +		      mprq_caps.min_single_stride_log_num_of_bytes);
> +		DEBUG("\tmax_single_stride_log_num_of_bytes: %d",
> +		      mprq_caps.max_single_stride_log_num_of_bytes);
> +		DEBUG("\tmin_single_wqe_log_num_of_strides: %d",
> +		      mprq_caps.min_single_wqe_log_num_of_strides);
> +		DEBUG("\tmax_single_wqe_log_num_of_strides: %d",
> +		      mprq_caps.max_single_wqe_log_num_of_strides);
> +		DEBUG("\tsupported_qpts: %d",
> +		      mprq_caps.supported_qpts);
> +		if (mprq_caps.min_single_stride_log_num_of_bytes <=
> +		    MLX5_MPRQ_MIN_STRIDE_SZ_N &&
> +		    mprq_caps.max_single_stride_log_num_of_bytes >=
> +		    MLX5_MPRQ_STRIDE_SZ_N &&
> +		    mprq_caps.min_single_wqe_log_num_of_strides <=
> +		    MLX5_MPRQ_MIN_STRIDE_NUM_N &&
> +		    mprq_caps.max_single_wqe_log_num_of_strides >=
> +		    MLX5_MPRQ_STRIDE_NUM_N) {
> +			DEBUG("Multi-Packet RQ is supported");
> +			mprq = 1;
> +		} else {
> +			DEBUG("Multi-Packet RQ isn't supported");
> +			mprq = 0;

DEBUG does not exist anymore; please rebase this series on top of [1].

> +		}
> +	}
> +#endif
>  	if (RTE_CACHE_LINE_SIZE == 128 &&
>  	    !(attrs_out.flags & MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP))
>  		cqe_comp = 0;
> @@ -721,6 +777,9 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
>  			.txq_inline = MLX5_ARG_UNSET,
>  			.txqs_inline = MLX5_ARG_UNSET,
>  			.inline_max_packet_sz = MLX5_ARG_UNSET,
> +			.mprq = 0, /* Disable by default. */
> +			.mprq_max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN,
> +			.rxqs_mprq = MLX5_MPRQ_MIN_RXQS,
>  		};
>  
>  		len = snprintf(name, sizeof(name), PCI_PRI_FMT,
> @@ -891,6 +950,10 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
>  			WARN("Rx CQE compression isn't supported");
>  			config.cqe_comp = 0;
>  		}
> +		if (config.mprq && !mprq) {
> +			WARN("Multi-Packet RQ isn't supported");

Same for WARN macro.

> +			config.mprq = 0;
> +		}
>  		err = priv_uar_init_primary(priv);
>  		if (err)
>  			goto port_error;
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
> index 9ad0533fc..42632a7e5 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -88,6 +88,9 @@ struct mlx5_dev_config {
>  	unsigned int tx_vec_en:1; /* Tx vector is enabled. */
>  	unsigned int rx_vec_en:1; /* Rx vector is enabled. */
>  	unsigned int mpw_hdr_dseg:1; /* Enable DSEGs in the title WQEBB. */
> +	unsigned int mprq:1; /* Whether Multi-Packet RQ is supported. */
> +	unsigned int mprq_max_memcpy_len; /* Maximum packet size to memcpy. */
> +	unsigned int rxqs_mprq; /* Queue count threshold for Multi-Packet RQ. */
>  	unsigned int tso_max_payload_sz; /* Maximum TCP payload for TSO. */
>  	unsigned int ind_table_max_size; /* Maximum indirection table size. */
>  	int txq_inline; /* Maximum packet size for inlining. */
> diff --git a/drivers/net/mlx5/mlx5_defs.h b/drivers/net/mlx5/mlx5_defs.h
> index c3334ca30..39cc1344a 100644
> --- a/drivers/net/mlx5/mlx5_defs.h
> +++ b/drivers/net/mlx5/mlx5_defs.h
> @@ -95,4 +95,24 @@
>   */
>  #define MLX5_UAR_OFFSET (1ULL << 32)
>  
> +/* Log 2 of the size of a stride for Multi-Packet RQ. */
> +#define MLX5_MPRQ_STRIDE_SZ_N 11
> +#define MLX5_MPRQ_MIN_STRIDE_SZ_N 6
> +
> +/* Log 2 of the number of strides per WQE for Multi-Packet RQ. */
> +#define MLX5_MPRQ_STRIDE_NUM_N 4
> +#define MLX5_MPRQ_MIN_STRIDE_NUM_N 3
> +
> +/* Two-byte shift is disabled for Multi-Packet RQ. */
> +#define MLX5_MPRQ_TWO_BYTE_SHIFT 0
> +
> +/* Maximum size of packet to be memcpy'd instead of attached by indirection. */
> +#define MLX5_MPRQ_MEMCPY_DEFAULT_LEN 128
> +
> +/* Minimum number of Rx queues to enable Multi-Packet RQ. */
> +#define MLX5_MPRQ_MIN_RXQS 12
> +
> +/* Cache size of mempool for Multi-Packet RQ. */
> +#define MLX5_MPRQ_MP_CACHE_SZ 16
> +
>  #endif /* RTE_PMD_MLX5_DEFS_H_ */
> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> index b73cb53df..2729c3b62 100644
> --- a/drivers/net/mlx5/mlx5_ethdev.c
> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> @@ -494,6 +494,7 @@ mlx5_dev_supported_ptypes_get(struct rte_eth_dev *dev)
>  	};
>  
>  	if (dev->rx_pkt_burst == mlx5_rx_burst ||
> +	    dev->rx_pkt_burst == mlx5_rx_burst_mprq ||
>  	    dev->rx_pkt_burst == mlx5_rx_burst_vec)
>  		return ptypes;
>  	return NULL;
> @@ -1316,6 +1317,8 @@ priv_select_rx_function(struct priv *priv, __rte_unused struct rte_eth_dev *dev)
>  	if (priv_check_vec_rx_support(priv) > 0) {
>  		rx_pkt_burst = mlx5_rx_burst_vec;
>  		DEBUG("selected RX vectorized function");
> +	} else if (priv_mprq_enabled(priv)) {
> +		rx_pkt_burst = mlx5_rx_burst_mprq;
>  	}
>  	return rx_pkt_burst;
>  }
> diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
> index 9eb9c15e1..b7ad3454e 100644
> --- a/drivers/net/mlx5/mlx5_prm.h
> +++ b/drivers/net/mlx5/mlx5_prm.h
> @@ -195,6 +195,21 @@ struct mlx5_mpw {
>  	} data;
>  };
>  
> +/* WQE for Multi-Packet RQ. */
> +struct mlx5_wqe_mprq {
> +	struct mlx5_wqe_srq_next_seg next_seg;
> +	struct mlx5_wqe_data_seg dseg;
> +};
> +
> +#define MLX5_MPRQ_LEN_MASK 0x000ffff
> +#define MLX5_MPRQ_LEN_SHIFT 0
> +#define MLX5_MPRQ_STRIDE_NUM_MASK 0x7fff0000
> +#define MLX5_MPRQ_STRIDE_NUM_SHIFT 16
> +#define MLX5_MPRQ_FILLER_MASK 0x80000000
> +#define MLX5_MPRQ_FILLER_SHIFT 31
> +
> +#define MLX5_MPRQ_STRIDE_SHIFT_BYTE 2
> +
>  /* CQ element structure - should be equal to the cache line size */
>  struct mlx5_cqe {
>  #if (RTE_CACHE_LINE_SIZE == 128)
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index 238fa7e56..8fa56a53a 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -55,7 +55,73 @@ uint8_t rss_hash_default_key[] = {
>  const size_t rss_hash_default_key_len = sizeof(rss_hash_default_key);
>  
>  /**
> - * Allocate RX queue elements.
> + * Check whether Multi-Packet RQ can be enabled for the device.
> + *
> + * @param priv
> + *   Pointer to private structure.
> + *
> + * @return
> + *   1 if supported, negative errno value if not.
> + */
> +inline int
> +priv_check_mprq_support(struct priv *priv)
> +{
> +	if (priv->config.mprq && priv->rxqs_n >= priv->config.rxqs_mprq)
> +		return 1;
> +	return -ENOTSUP;
> +}

No more priv functions are allowed since [2].
rte_errno should be set.

> +
> +/**
> + * Check whether Multi-Packet RQ is enabled for the Rx queue.
> + *
> + * @param rxq
> + *   Pointer to receive queue structure.
> + *
> + * @return
> + *   0 if disabled, otherwise enabled.
> + */
> +static inline int
> +rxq_mprq_enabled(struct mlx5_rxq_data *rxq)
> +{
> +	return rxq->mprq_mp != NULL;
> +}
> +
> +/**
> + * Check whether Multi-Packet RQ is enabled for the device.
> + *
> + * @param priv
> + *   Pointer to private structure.
> + *
> + * @return
> + *   0 if disabled, otherwise enabled.
> + */
> +inline int
> +priv_mprq_enabled(struct priv *priv)
> +{
> +	uint16_t i;
> +	uint16_t n = 0;
> +
> +	if (priv_check_mprq_support(priv) < 0)
> +		return 0;
> +	/* All the configured queues should be enabled. */
> +	for (i = 0; i < priv->rxqs_n; ++i) {
> +		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
> +
> +		if (!rxq)
> +			continue;
> +		if (rxq_mprq_enabled(rxq))
> +			++n;
> +	}
> +	if (n == priv->rxqs_n)
> +		return 1;
> +	if (n != 0)
> +		ERROR("Multi-Packet RQ can't be partially configured, %u/%u",
> +		      n, priv->rxqs_n);
> +	return 0;
> +}
> +
> +/**
> + * Allocate RX queue elements for Multi-Packet RQ.
>   *
>   * @param rxq_ctrl
>   *   Pointer to RX queue structure.
> @@ -63,8 +129,57 @@ const size_t rss_hash_default_key_len = sizeof(rss_hash_default_key);
>   * @return
>   *   0 on success, errno value on failure.
>   */
> -int
> -rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
> +static int
> +rxq_alloc_elts_mprq(struct mlx5_rxq_ctrl *rxq_ctrl)
> +{
> +	struct mlx5_rxq_data *rxq = &rxq_ctrl->rxq;
> +	unsigned int wqe_n = 1 << rxq->elts_n;
> +	unsigned int i;
> +	int ret = 0;
> +
> +	/* Iterate on segments. */
> +	for (i = 0; i <= wqe_n; ++i) {
> +		struct rte_mbuf *buf;
> +
> +		if (rte_mempool_get(rxq->mprq_mp, (void **)&buf) < 0) {
> +			ERROR("%p: empty mbuf pool", (void *)rxq_ctrl);
> +			ret = ENOMEM;
> +			goto error;
> +		}
> +		if (i < wqe_n)
> +			(*rxq->elts)[i] = buf;
> +		else
> +			rxq->mprq_repl = buf;
> +		PORT(buf) = rxq->port_id;
> +	}
> +	DEBUG("%p: allocated and configured %u segments",
> +	      (void *)rxq_ctrl, wqe_n);
> +	assert(ret == 0);
> +	return 0;
> +error:
> +	wqe_n = i;
> +	for (i = 0; (i != wqe_n); ++i) {
> +		if ((*rxq->elts)[i] != NULL)
> +			rte_mempool_put(rxq->mprq_mp,
> +					(*rxq->elts)[i]);
> +		(*rxq->elts)[i] = NULL;
> +	}
> +	DEBUG("%p: failed, freed everything", (void *)rxq_ctrl);
> +	assert(ret > 0);
> +	return ret;
> +}
> +
> +/**
> + * Allocate RX queue elements for Single-Packet RQ.
> + *
> + * @param rxq_ctrl
> + *   Pointer to RX queue structure.
> + *
> + * @return
> + *   0 on success, errno value on failure.
> + */
> +static int
> +rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
>  {
>  	const unsigned int sges_n = 1 << rxq_ctrl->rxq.sges_n;
>  	unsigned int elts_n = 1 << rxq_ctrl->rxq.elts_n;
> @@ -135,6 +250,22 @@ rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
>  }
>  
>  /**
> + * Allocate RX queue elements.
> + *
> + * @param rxq_ctrl
> + *   Pointer to RX queue structure.
> + *
> + * @return
> + *   0 on success, errno value on failure.
> + */
> +int
> +rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
> +{
> +	return rxq_mprq_enabled(&rxq_ctrl->rxq) ?
> +	       rxq_alloc_elts_mprq(rxq_ctrl) : rxq_alloc_elts_sprq(rxq_ctrl);
> +}
> +
> +/**
>   * Free RX queue elements.
>   *
>   * @param rxq_ctrl
> @@ -166,6 +297,10 @@ rxq_free_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
>  			rte_pktmbuf_free_seg((*rxq->elts)[i]);
>  		(*rxq->elts)[i] = NULL;
>  	}
> +	if (rxq->mprq_repl != NULL) {
> +		rte_pktmbuf_free_seg(rxq->mprq_repl);
> +		rxq->mprq_repl = NULL;
> +	}
>  }
>  
>  /**
> @@ -613,10 +748,16 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  			struct ibv_cq_init_attr_ex ibv;
>  			struct mlx5dv_cq_init_attr mlx5;
>  		} cq;
> -		struct ibv_wq_init_attr wq;
> +		struct {
> +			struct ibv_wq_init_attr ibv;
> +#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
> +			struct mlx5dv_wq_init_attr mlx5;
> +#endif
> +		} wq;
>  		struct ibv_cq_ex cq_attr;
>  	} attr;
> -	unsigned int cqe_n = (1 << rxq_data->elts_n) - 1;
> +	unsigned int cqe_n;
> +	unsigned int wqe_n = 1 << rxq_data->elts_n;
>  	struct mlx5_rxq_ibv *tmpl;
>  	struct mlx5dv_cq cq_info;
>  	struct mlx5dv_rwq rwq;
> @@ -624,6 +765,7 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  	int ret = 0;
>  	struct mlx5dv_obj obj;
>  	struct mlx5_dev_config *config = &priv->config;
> +	const int mprq_en = rxq_mprq_enabled(rxq_data);
>  
>  	assert(rxq_data);
>  	assert(!rxq_ctrl->ibv);
> @@ -646,6 +788,17 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  			goto error;
>  		}
>  	}
> +	if (mprq_en) {
> +		tmpl->mprq_mr = priv_mr_get(priv, rxq_data->mprq_mp);
> +		if (!tmpl->mprq_mr) {
> +			tmpl->mprq_mr = priv_mr_new(priv, rxq_data->mprq_mp);
> +			if (!tmpl->mprq_mr) {
> +				ERROR("%p: cannot create MR for"
> +				      " Multi-Packet RQ", (void *)rxq_ctrl);
> +				goto error;
> +			}
> +		}
> +	}
>  	if (rxq_ctrl->irq) {
>  		tmpl->channel = mlx5_glue->create_comp_channel(priv->ctx);
>  		if (!tmpl->channel) {
> @@ -654,6 +807,10 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  			goto error;
>  		}
>  	}
> +	if (mprq_en)
> +		cqe_n = wqe_n * (1 << MLX5_MPRQ_STRIDE_NUM_N) - 1;
> +	else
> +		cqe_n = wqe_n  - 1;
>  	attr.cq.ibv = (struct ibv_cq_init_attr_ex){
>  		.cqe = cqe_n,
>  		.channel = tmpl->channel,
> @@ -686,11 +843,11 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  	      priv->device_attr.orig_attr.max_qp_wr);
>  	DEBUG("priv->device_attr.max_sge is %d",
>  	      priv->device_attr.orig_attr.max_sge);
> -	attr.wq = (struct ibv_wq_init_attr){
> +	attr.wq.ibv = (struct ibv_wq_init_attr){
>  		.wq_context = NULL, /* Could be useful in the future. */
>  		.wq_type = IBV_WQT_RQ,
>  		/* Max number of outstanding WRs. */
> -		.max_wr = (1 << rxq_data->elts_n) >> rxq_data->sges_n,
> +		.max_wr = wqe_n >> rxq_data->sges_n,
>  		/* Max number of scatter/gather elements in a WR. */
>  		.max_sge = 1 << rxq_data->sges_n,
>  		.pd = priv->pd,
> @@ -704,8 +861,8 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  	};
>  	/* By default, FCS (CRC) is stripped by hardware. */
>  	if (rxq_data->crc_present) {
> -		attr.wq.create_flags |= IBV_WQ_FLAGS_SCATTER_FCS;
> -		attr.wq.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
> +		attr.wq.ibv.create_flags |= IBV_WQ_FLAGS_SCATTER_FCS;
> +		attr.wq.ibv.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
>  	}
>  #ifdef HAVE_IBV_WQ_FLAG_RX_END_PADDING
>  	if (config->hw_padding) {
> @@ -713,7 +870,26 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  		attr.wq.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
>  	}
>  #endif
> -	tmpl->wq = mlx5_glue->create_wq(priv->ctx, &attr.wq);
> +#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
> +	attr.wq.mlx5 = (struct mlx5dv_wq_init_attr){
> +		.comp_mask = 0,
> +	};
> +	if (mprq_en) {
> +		struct mlx5dv_striding_rq_init_attr *mprq_attr =
> +			&attr.wq.mlx5.striding_rq_attrs;
> +
> +		attr.wq.mlx5.comp_mask |= MLX5DV_WQ_INIT_ATTR_MASK_STRIDING_RQ;
> +		*mprq_attr = (struct mlx5dv_striding_rq_init_attr){
> +			.single_stride_log_num_of_bytes = MLX5_MPRQ_STRIDE_SZ_N,
> +			.single_wqe_log_num_of_strides = MLX5_MPRQ_STRIDE_NUM_N,
> +			.two_byte_shift_en = MLX5_MPRQ_TWO_BYTE_SHIFT,
> +		};
> +	}
> +	tmpl->wq = mlx5_glue->dv_create_wq(priv->ctx, &attr.wq.ibv,
> +					   &attr.wq.mlx5);
> +#else
> +	tmpl->wq = mlx5_glue->create_wq(priv->ctx, &attr.wq.ibv);
> +#endif
>  	if (tmpl->wq == NULL) {
>  		ERROR("%p: WQ creation failure", (void *)rxq_ctrl);
>  		goto error;
> @@ -722,14 +898,13 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  	 * Make sure number of WRs*SGEs match expectations since a queue
>  	 * cannot allocate more than "desc" buffers.
>  	 */
> -	if (((int)attr.wq.max_wr !=
> -	     ((1 << rxq_data->elts_n) >> rxq_data->sges_n)) ||
> -	    ((int)attr.wq.max_sge != (1 << rxq_data->sges_n))) {
> +	if (attr.wq.ibv.max_wr != (wqe_n >> rxq_data->sges_n) ||
> +	    (int)attr.wq.ibv.max_sge != (1 << rxq_data->sges_n)) {
>  		ERROR("%p: requested %u*%u but got %u*%u WRs*SGEs",
>  		      (void *)rxq_ctrl,
> -		      ((1 << rxq_data->elts_n) >> rxq_data->sges_n),
> +		      wqe_n >> rxq_data->sges_n,
>  		      (1 << rxq_data->sges_n),
> -		      attr.wq.max_wr, attr.wq.max_sge);
> +		      attr.wq.ibv.max_wr, attr.wq.ibv.max_sge);
>  		goto error;
>  	}
>  	/* Change queue state to ready. */
> @@ -756,25 +931,38 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  		goto error;
>  	}
>  	/* Fill the rings. */
> -	rxq_data->wqes = (volatile struct mlx5_wqe_data_seg (*)[])
> -		(uintptr_t)rwq.buf;
> -	for (i = 0; (i != (unsigned int)(1 << rxq_data->elts_n)); ++i) {
> +	rxq_data->wqes = rwq.buf;
> +	for (i = 0; (i != wqe_n); ++i) {
> +		volatile struct mlx5_wqe_data_seg *scat;
>  		struct rte_mbuf *buf = (*rxq_data->elts)[i];
> -		volatile struct mlx5_wqe_data_seg *scat = &(*rxq_data->wqes)[i];
> -
> +		uintptr_t addr = rte_pktmbuf_mtod(buf, uintptr_t);
> +		uint32_t byte_count;
> +		uint32_t lkey;
> +
> +		if (mprq_en) {
> +			scat = &((volatile struct mlx5_wqe_mprq *)
> +				 rxq_data->wqes)[i].dseg;
> +			byte_count = (1 << MLX5_MPRQ_STRIDE_SZ_N) *
> +				     (1 << MLX5_MPRQ_STRIDE_NUM_N);
> +			lkey = tmpl->mprq_mr->lkey;
> +		} else {
> +			scat = &((volatile struct mlx5_wqe_data_seg *)
> +				 rxq_data->wqes)[i];
> +			byte_count = DATA_LEN(buf);
> +			lkey = tmpl->mr->lkey;
> +		}
>  		/* scat->addr must be able to store a pointer. */
>  		assert(sizeof(scat->addr) >= sizeof(uintptr_t));
>  		*scat = (struct mlx5_wqe_data_seg){
> -			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
> -								  uintptr_t)),
> -			.byte_count = rte_cpu_to_be_32(DATA_LEN(buf)),
> -			.lkey = tmpl->mr->lkey,
> +			.addr = rte_cpu_to_be_64(addr),
> +			.byte_count = rte_cpu_to_be_32(byte_count),
> +			.lkey = lkey
>  		};
>  	}
>  	rxq_data->rq_db = rwq.dbrec;
>  	rxq_data->cqe_n = log2above(cq_info.cqe_cnt);
>  	rxq_data->cq_ci = 0;
> -	rxq_data->rq_ci = 0;
> +	rxq_data->strd_ci = 0;
>  	rxq_data->rq_pi = 0;
>  	rxq_data->zip = (struct rxq_zip){
>  		.ai = 0,
> @@ -785,7 +973,7 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  	rxq_data->cqn = cq_info.cqn;
>  	rxq_data->cq_arm_sn = 0;
>  	/* Update doorbell counter. */
> -	rxq_data->rq_ci = (1 << rxq_data->elts_n) >> rxq_data->sges_n;
> +	rxq_data->rq_ci = wqe_n >> rxq_data->sges_n;
>  	rte_wmb();
>  	*rxq_data->rq_db = rte_cpu_to_be_32(rxq_data->rq_ci);
>  	DEBUG("%p: rxq updated with %p", (void *)rxq_ctrl, (void *)&tmpl);
> @@ -802,6 +990,8 @@ mlx5_priv_rxq_ibv_new(struct priv *priv, uint16_t idx)
>  		claim_zero(mlx5_glue->destroy_cq(tmpl->cq));
>  	if (tmpl->channel)
>  		claim_zero(mlx5_glue->destroy_comp_channel(tmpl->channel));
> +	if (tmpl->mprq_mr)
> +		priv_mr_release(priv, tmpl->mprq_mr);
>  	if (tmpl->mr)
>  		priv_mr_release(priv, tmpl->mr);
>  	priv->verbs_alloc_ctx.type = MLX5_VERBS_ALLOC_TYPE_NONE;
> @@ -832,6 +1022,8 @@ mlx5_priv_rxq_ibv_get(struct priv *priv, uint16_t idx)
>  	rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
>  	if (rxq_ctrl->ibv) {
>  		priv_mr_get(priv, rxq_data->mp);
> +		if (rxq_mprq_enabled(rxq_data))
> +			priv_mr_get(priv, rxq_data->mprq_mp);
>  		rte_atomic32_inc(&rxq_ctrl->ibv->refcnt);
>  		DEBUG("%p: Verbs Rx queue %p: refcnt %d", (void *)priv,
>  		      (void *)rxq_ctrl->ibv,
> @@ -863,6 +1055,11 @@ mlx5_priv_rxq_ibv_release(struct priv *priv, struct mlx5_rxq_ibv *rxq_ibv)
>  	ret = priv_mr_release(priv, rxq_ibv->mr);
>  	if (!ret)
>  		rxq_ibv->mr = NULL;
> +	if (rxq_mprq_enabled(&rxq_ibv->rxq_ctrl->rxq)) {
> +		ret = priv_mr_release(priv, rxq_ibv->mprq_mr);
> +		if (!ret)
> +			rxq_ibv->mprq_mr = NULL;
> +	}
>  	DEBUG("%p: Verbs Rx queue %p: refcnt %d", (void *)priv,
>  	      (void *)rxq_ibv, rte_atomic32_read(&rxq_ibv->refcnt));
>  	if (rte_atomic32_dec_and_test(&rxq_ibv->refcnt)) {
> @@ -918,12 +1115,99 @@ mlx5_priv_rxq_ibv_releasable(struct priv *priv, struct mlx5_rxq_ibv *rxq_ibv)
>  }
>  
>  /**
> + * Callback function to initialize mbufs for Multi-Packet RQ.
> + */
> +static inline void
> +mlx5_mprq_mbuf_init(struct rte_mempool *mp, void *opaque_arg,
> +		    void *_m, unsigned int i __rte_unused)
> +{
> +	struct rte_mbuf *m = _m;
> +
> +	rte_pktmbuf_init(mp, opaque_arg, _m, i);
> +	m->buf_len =
> +		(1 << MLX5_MPRQ_STRIDE_SZ_N) * (1 << MLX5_MPRQ_STRIDE_NUM_N);
> +	rte_pktmbuf_reset_headroom(m);
> +}
> +
> +/**
> + * Configure Rx queue as Multi-Packet RQ.
> + *
> + * @param rxq_ctrl
> + *   Pointer to RX queue structure.
> + * @param priv
> + *   Pointer to private structure.
> + * @param idx
> + *   RX queue index.
> + * @param desc
> + *   Number of descriptors to configure in queue.
> + *
> + * @return
> + *   0 on success, negative errno value on failure.
> + */
> +static int
> +rxq_configure_mprq(struct mlx5_rxq_ctrl *rxq_ctrl, uint16_t idx, uint16_t desc)
> +{
> +	struct priv *priv = rxq_ctrl->priv;
> +	struct mlx5_dev_config *config = &priv->config;
> +	struct rte_mempool *mp;
> +	char name[RTE_MEMPOOL_NAMESIZE];
> +	unsigned int buf_len;
> +	unsigned int obj_size;
> +
> +	assert(rxq_ctrl->rxq.sges_n == 0);
> +	rxq_ctrl->rxq.strd_sz_n =
> +		MLX5_MPRQ_STRIDE_SZ_N - MLX5_MPRQ_MIN_STRIDE_SZ_N;
> +	rxq_ctrl->rxq.strd_num_n =
> +		MLX5_MPRQ_STRIDE_NUM_N - MLX5_MPRQ_MIN_STRIDE_NUM_N;
> +	rxq_ctrl->rxq.strd_shift_en = MLX5_MPRQ_TWO_BYTE_SHIFT;
> +	rxq_ctrl->rxq.mprq_max_memcpy_len = config->mprq_max_memcpy_len;
> +	buf_len = (1 << MLX5_MPRQ_STRIDE_SZ_N) * (1 << MLX5_MPRQ_STRIDE_NUM_N) +
> +		  RTE_PKTMBUF_HEADROOM;
> +	obj_size = buf_len + sizeof(struct rte_mbuf);
> +	snprintf(name, sizeof(name), "%s-mprq-%u", priv->dev->data->name, idx);
> +	/*
> +	 * Allocate a per-queue Mempool for Multi-Packet RQ.
> +	 *
> +	 * Received packets can be either memcpy'd or indirectly referenced. In
> +	 * case of mbuf indirection, as it isn't possible to predict how the
> +	 * buffers will be queued by the application, there's no option to
> +	 * exactly pre-allocate the needed buffers in advance; enough buffers
> +	 * must be speculatively prepared instead.
> +	 *
> +	 * In the data path, if this Mempool is depleted, the PMD will try to
> +	 * memcpy received packets into buffers provided by the application
> +	 * (rxq->mp) until this Mempool becomes available again.
> +	 */
> +	desc *= 4;
> +	mp = rte_mempool_create(name, desc + MLX5_MPRQ_MP_CACHE_SZ,
> +				obj_size, MLX5_MPRQ_MP_CACHE_SZ,
> +				sizeof(struct rte_pktmbuf_pool_private),
> +				NULL, NULL, NULL, NULL,
> +				priv->dev->device->numa_node,
> +				MEMPOOL_F_SC_GET);
> +	if (mp == NULL) {
> +		ERROR("%p: failed to allocate a mempool for"
> +		      " multi-packet Rx queue (%u): %s",
> +		      (void *)priv->dev, idx,
> +		      rte_strerror(rte_errno));
> +		return -ENOMEM;
> +	}
> +
> +	rte_pktmbuf_pool_init(mp, NULL);
> +	rte_mempool_obj_iter(mp, mlx5_mprq_mbuf_init, NULL);
> +	rxq_ctrl->rxq.mprq_mp = mp;
> +	DEBUG("%p: Multi-Packet RQ is enabled for Rx queue %u",
> +	      (void *)priv->dev, idx);
> +	return 0;
> +}
> +
> +/**
>   * Create a DPDK Rx queue.
>   *
>   * @param priv
>   *   Pointer to private structure.
>   * @param idx
> - *   TX queue index.
> + *   RX queue index.
>   * @param desc
>   *   Number of descriptors to configure in queue.
>   * @param socket
> @@ -945,8 +1229,9 @@ mlx5_priv_rxq_new(struct priv *priv, uint16_t idx, uint16_t desc,
>  	 * Always allocate extra slots, even if eventually
>  	 * the vector Rx will not be used.
>  	 */
> -	const uint16_t desc_n =
> +	uint16_t desc_n =
>  		desc + config->rx_vec_en * MLX5_VPMD_DESCS_PER_LOOP;
> +	const int mprq_en = priv_check_mprq_support(priv) > 0;
>  
>  	tmpl = rte_calloc_socket("RXQ", 1,
>  				 sizeof(*tmpl) +
> @@ -954,13 +1239,35 @@ mlx5_priv_rxq_new(struct priv *priv, uint16_t idx, uint16_t desc,
>  				 0, socket);
>  	if (!tmpl)
>  		return NULL;
> +	tmpl->priv = priv;
>  	tmpl->socket = socket;
>  	if (priv->dev->data->dev_conf.intr_conf.rxq)
>  		tmpl->irq = 1;
> -	/* Enable scattered packets support for this queue if necessary. */
> +	/*
> +	 * This Rx queue can be configured as a Multi-Packet RQ if all of the
> +	 * following conditions are met:
> +	 *  - MPRQ is enabled.
> +	 *  - The number of descs is more than the number of strides.
> +	 *  - max_rx_pkt_len is less than the size of a stride sparing headroom.
> +	 *
> +	 *  Otherwise, enable Rx scatter if necessary.
> +	 */
>  	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
> -	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
> -	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
> +	if (mprq_en &&
> +	    desc >= (1U << MLX5_MPRQ_STRIDE_NUM_N) &&
> +	    dev->data->dev_conf.rxmode.max_rx_pkt_len <=
> +	    (1U << MLX5_MPRQ_STRIDE_SZ_N) - RTE_PKTMBUF_HEADROOM) {
> +		int ret;
> +
> +		/* TODO: Rx scatter isn't supported yet. */
> +		tmpl->rxq.sges_n = 0;
> +		/* Trim the number of descs needed. */
> +		desc >>= MLX5_MPRQ_STRIDE_NUM_N;
> +		ret = rxq_configure_mprq(tmpl, idx, desc);
> +		if (ret)
> +			goto error;
> +	} else if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
> +		   (mb_len - RTE_PKTMBUF_HEADROOM)) {
>  		tmpl->rxq.sges_n = 0;
>  	} else if (conf->offloads & DEV_RX_OFFLOAD_SCATTER) {
>  		unsigned int size =
> @@ -1030,7 +1337,6 @@ mlx5_priv_rxq_new(struct priv *priv, uint16_t idx, uint16_t desc,
>  	/* Save port ID. */
>  	tmpl->rxq.rss_hash = priv->rxqs_n > 1;
>  	tmpl->rxq.port_id = dev->data->port_id;
> -	tmpl->priv = priv;
>  	tmpl->rxq.mp = mp;
>  	tmpl->rxq.stats.idx = idx;
>  	tmpl->rxq.elts_n = log2above(desc);
> @@ -1105,6 +1411,25 @@ mlx5_priv_rxq_release(struct priv *priv, uint16_t idx)
>  	DEBUG("%p: Rx queue %p: refcnt %d", (void *)priv,
>  	      (void *)rxq_ctrl, rte_atomic32_read(&rxq_ctrl->refcnt));
>  	if (rte_atomic32_dec_and_test(&rxq_ctrl->refcnt)) {
> +		if (rxq_ctrl->rxq.mprq_mp != NULL) {
> +			/* If an mbuf in the pool has an indirect mbuf attached
> +			 * and it is still in use by the application, destroying
> +			 * the Rx queue can spoil the packet. It is unlikely to
> +			 * happen, but it can if the application dynamically
> +			 * creates and destroys queues while holding Rx packets.
> +			 *
> +			 * TODO: It is unavoidable for now because the Mempool
> +			 * for Multi-Packet RQ isn't provided by the application
> +			 * but managed by the PMD.
> +			 */
> +			if (!rte_mempool_full(rxq_ctrl->rxq.mprq_mp)) {
> +				ERROR("Mempool for Multi-Packet RQ %p"
> +				      " is still in use", (void *)rxq_ctrl);
> +				return EBUSY;
> +			}
> +			rte_mempool_free(rxq_ctrl->rxq.mprq_mp);
> +			rxq_ctrl->rxq.mprq_mp = NULL;
> +		}
>  		LIST_REMOVE(rxq_ctrl, next);
>  		rte_free(rxq_ctrl);
>  		(*priv->rxqs)[idx] = NULL;
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index 36eeefb49..49254ab59 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1800,7 +1800,8 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  
>  	while (pkts_n) {
>  		unsigned int idx = rq_ci & wqe_cnt;
> -		volatile struct mlx5_wqe_data_seg *wqe = &(*rxq->wqes)[idx];
> +		volatile struct mlx5_wqe_data_seg *wqe =
> +			&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[idx];
>  		struct rte_mbuf *rep = (*rxq->elts)[idx];
>  		uint32_t rss_hash_res = 0;
>  
> @@ -1901,6 +1902,155 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  }
>  
>  /**
> + * DPDK callback for RX with Multi-Packet RQ support.
> + *
> + * @param dpdk_rxq
> + *   Generic pointer to RX queue structure.
> + * @param[out] pkts
> + *   Array to store received packets.
> + * @param pkts_n
> + *   Maximum number of packets in array.
> + *
> + * @return
> + *   Number of packets successfully received (<= pkts_n).
> + */
> +uint16_t
> +mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
> +{
> +	struct mlx5_rxq_data *rxq = dpdk_rxq;
> +	const unsigned int strd_n =
> +		1 << (rxq->strd_num_n + MLX5_MPRQ_MIN_STRIDE_NUM_N);
> +	const unsigned int strd_sz =
> +		1 << (rxq->strd_sz_n + MLX5_MPRQ_MIN_STRIDE_SZ_N);
> +	const unsigned int strd_shift =
> +		MLX5_MPRQ_STRIDE_SHIFT_BYTE * rxq->strd_shift_en;
> +	const unsigned int cq_mask = (1 << rxq->cqe_n) - 1;
> +	const unsigned int wq_mask = (1 << rxq->elts_n) - 1;
> +	volatile struct mlx5_cqe *cqe = &(*rxq->cqes)[rxq->cq_ci & cq_mask];
> +	unsigned int i = 0;
> +	uint16_t rq_ci = rxq->rq_ci;
> +	uint16_t strd_idx = rxq->strd_ci;
> +	struct rte_mbuf *buf = (*rxq->elts)[rq_ci & wq_mask];
> +
> +	while (i < pkts_n) {
> +		struct rte_mbuf *pkt;
> +		int ret;
> +		unsigned int len;
> +		uint16_t consumed_strd;
> +		uint32_t offset;
> +		uint32_t byte_cnt;
> +		uint32_t rss_hash_res = 0;
> +
> +		if (strd_idx == strd_n) {
> +			/* Replace WQE only if the buffer is still in use. */
> +			if (unlikely(rte_mbuf_refcnt_read(buf) > 1)) {
> +				struct rte_mbuf *rep = rxq->mprq_repl;
> +				volatile struct mlx5_wqe_data_seg *wqe =
> +					&((volatile struct mlx5_wqe_mprq *)
> +					  rxq->wqes)[rq_ci & wq_mask].dseg;
> +				uintptr_t addr;
> +
> +				/* Replace mbuf. */
> +				(*rxq->elts)[rq_ci & wq_mask] = rep;
> +				PORT(rep) = PORT(buf);
> +				/* Release the old buffer. */
> +				if (__rte_mbuf_refcnt_update(buf, -1) == 0) {
> +					rte_mbuf_refcnt_set(buf, 1);
> +					rte_mbuf_raw_free(buf);
> +				}
> +				/* Replace WQE. */
> +				addr = rte_pktmbuf_mtod(rep, uintptr_t);
> +				wqe->addr = rte_cpu_to_be_64(addr);
> +				/* Stash a mbuf for next replacement. */
> +				if (likely(!rte_mempool_get(rxq->mprq_mp,
> +							    (void **)&rep)))
> +					rxq->mprq_repl = rep;
> +				else
> +					rxq->mprq_repl = NULL;
> +			}
> +			/* Advance to the next WQE. */
> +			strd_idx = 0;
> +			++rq_ci;
> +			buf = (*rxq->elts)[rq_ci & wq_mask];
> +		}
> +		cqe = &(*rxq->cqes)[rxq->cq_ci & cq_mask];
> +		ret = mlx5_rx_poll_len(rxq, cqe, cq_mask, &rss_hash_res);
> +		if (!ret)
> +			break;
> +		if (unlikely(ret == -1)) {
> +			/* RX error, packet is likely too large. */
> +			++rxq->stats.idropped;
> +			continue;
> +		}
> +		byte_cnt = ret;
> +		offset = strd_idx * strd_sz + strd_shift;
> +		consumed_strd = (byte_cnt & MLX5_MPRQ_STRIDE_NUM_MASK) >>
> +				MLX5_MPRQ_STRIDE_NUM_SHIFT;
> +		strd_idx += consumed_strd;
> +		if (byte_cnt & MLX5_MPRQ_FILLER_MASK)
> +			continue;
> +		pkt = rte_pktmbuf_alloc(rxq->mp);
> +		if (unlikely(pkt == NULL)) {
> +			++rxq->stats.rx_nombuf;
> +			break;
> +		}
> +		len = (byte_cnt & MLX5_MPRQ_LEN_MASK) >> MLX5_MPRQ_LEN_SHIFT;
> +		assert((int)len >= (rxq->crc_present << 2));
> +		if (rxq->crc_present)
> +			len -= ETHER_CRC_LEN;
> +		/*
> +		 * Memcpy packets to the target mbuf if:
> +		 * - The size of packet is smaller than MLX5_MPRQ_MEMCPY_LEN.
> +		 * - Out of buffer in the Mempool for Multi-Packet RQ.
> +		 */
> +		if (len <= rxq->mprq_max_memcpy_len || rxq->mprq_repl == NULL) {
> +			uintptr_t base = rte_pktmbuf_mtod(buf, uintptr_t);
> +
> +			rte_memcpy(rte_pktmbuf_mtod(pkt, void *),
> +				   (void *)(base + offset), len);
> +			/* Initialize the offload flag. */
> +			pkt->ol_flags = 0;
> +		} else {
> +			/*
> +			 * IND_ATTACHED_MBUF will be set to pkt->ol_flags when
> +			 * attaching the mbuf and more offload flags will be
> +			 * added below by calling rxq_cq_to_mbuf(). Other fields
> +			 * will be overwritten.
> +			 */
> +			rte_pktmbuf_attach_at(pkt, buf, offset,
> +					      consumed_strd * strd_sz);
> +			assert(pkt->ol_flags == IND_ATTACHED_MBUF);
> +			rte_pktmbuf_reset_headroom(pkt);
> +		}
> +		rxq_cq_to_mbuf(rxq, pkt, cqe, rss_hash_res);
> +		PKT_LEN(pkt) = len;
> +		DATA_LEN(pkt) = len;
> +#ifdef MLX5_PMD_SOFT_COUNTERS
> +		/* Increment bytes counter. */
> +		rxq->stats.ibytes += PKT_LEN(pkt);
> +#endif
> +		/* Return packet. */
> +		*(pkts++) = pkt;
> +		++i;
> +	}
> +	/* Update the consumer index. */
> +	rxq->rq_pi += i;
> +	rxq->strd_ci = strd_idx;
> +	rte_io_wmb();
> +	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
> +	if (rq_ci != rxq->rq_ci) {
> +		rxq->rq_ci = rq_ci;
> +		rte_io_wmb();
> +		*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
> +	}
> +#ifdef MLX5_PMD_SOFT_COUNTERS
> +	/* Increment packets counter. */
> +	rxq->stats.ipackets += i;
> +#endif
> +	return i;
> +}
> +
> +/**
>   * Dummy DPDK callback for TX.
>   *
>   * This function is used to temporarily replace the real callback during
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index d7e890558..ba8ac32c2 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -86,18 +86,25 @@ struct mlx5_rxq_data {
>  	unsigned int elts_n:4; /* Log 2 of Mbufs. */
>  	unsigned int rss_hash:1; /* RSS hash result is enabled. */
>  	unsigned int mark:1; /* Marked flow available on the queue. */
> -	unsigned int :15; /* Remaining bits. */
> +	unsigned int strd_sz_n:3; /* Log 2 of stride size. */
> +	unsigned int strd_num_n:4; /* Log 2 of the number of stride. */
> +	unsigned int strd_shift_en:1; /* Enable 2bytes shift on a stride. */
> +	unsigned int :8; /* Remaining bits. */
>  	volatile uint32_t *rq_db;
>  	volatile uint32_t *cq_db;
>  	uint16_t port_id;
>  	uint16_t rq_ci;
> +	uint16_t strd_ci; /* Stride index in a WQE for Multi-Packet RQ. */
>  	uint16_t rq_pi;
>  	uint16_t cq_ci;
> -	volatile struct mlx5_wqe_data_seg(*wqes)[];
> +	uint16_t mprq_max_memcpy_len; /* Maximum size of packet to memcpy. */
> +	volatile void *wqes;
>  	volatile struct mlx5_cqe(*cqes)[];
>  	struct rxq_zip zip; /* Compressed context. */
>  	struct rte_mbuf *(*elts)[];
>  	struct rte_mempool *mp;
> +	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
> +	struct rte_mbuf *mprq_repl; /* Stashed mbuf for replenish. */
>  	struct mlx5_rxq_stats stats;
>  	uint64_t mbuf_initializer; /* Default rearm_data for vectorized Rx. */
>  	struct rte_mbuf fake_mbuf; /* elts padding for vectorized Rx. */
> @@ -115,6 +122,7 @@ struct mlx5_rxq_ibv {
>  	struct ibv_wq *wq; /* Work Queue. */
>  	struct ibv_comp_channel *channel;
>  	struct mlx5_mr *mr; /* Memory Region (for mp). */
> +	struct mlx5_mr *mprq_mr; /* Memory Region (for mprq_mp). */
>  };
>  
>  /* RX queue control descriptor. */
> @@ -210,6 +218,8 @@ struct mlx5_txq_ctrl {
>  extern uint8_t rss_hash_default_key[];
>  extern const size_t rss_hash_default_key_len;
>  
> +int priv_check_mprq_support(struct priv *);
> +int priv_mprq_enabled(struct priv *);
>  void mlx5_rxq_cleanup(struct mlx5_rxq_ctrl *);
>  int mlx5_rx_queue_setup(struct rte_eth_dev *, uint16_t, uint16_t, unsigned int,
>  			const struct rte_eth_rxconf *, struct rte_mempool *);
> @@ -232,6 +242,7 @@ int mlx5_priv_rxq_release(struct priv *, uint16_t);
>  int mlx5_priv_rxq_releasable(struct priv *, uint16_t);
>  int mlx5_priv_rxq_verify(struct priv *);
>  int rxq_alloc_elts(struct mlx5_rxq_ctrl *);
> +int rxq_alloc_mprq_buf(struct mlx5_rxq_ctrl *);
>  struct mlx5_ind_table_ibv *mlx5_priv_ind_table_ibv_new(struct priv *,
>  						       uint16_t [],
>  						       uint16_t);
> @@ -280,6 +291,7 @@ uint16_t mlx5_tx_burst_mpw(void *, struct rte_mbuf **, uint16_t);
>  uint16_t mlx5_tx_burst_mpw_inline(void *, struct rte_mbuf **, uint16_t);
>  uint16_t mlx5_tx_burst_empw(void *, struct rte_mbuf **, uint16_t);
>  uint16_t mlx5_rx_burst(void *, struct rte_mbuf **, uint16_t);
> +uint16_t mlx5_rx_burst_mprq(void *, struct rte_mbuf **, uint16_t);
>  uint16_t removed_tx_burst(void *, struct rte_mbuf **, uint16_t);
>  uint16_t removed_rx_burst(void *, struct rte_mbuf **, uint16_t);
>  int mlx5_rx_descriptor_status(void *, uint16_t);
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
> index b66c2916f..ab4610c84 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.c
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
> @@ -282,6 +282,8 @@ rxq_check_vec_support(struct mlx5_rxq_data *rxq)
>  	struct mlx5_rxq_ctrl *ctrl =
>  		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
>  
> +	if (priv_mprq_enabled(ctrl->priv))
> +		return -ENOTSUP;
>  	if (!ctrl->priv->config.rx_vec_en || rxq->sges_n != 0)
>  		return -ENOTSUP;
>  	return 1;
> @@ -303,6 +305,8 @@ priv_check_vec_rx_support(struct priv *priv)
>  
>  	if (!priv->config.rx_vec_en)
>  		return -ENOTSUP;
> +	if (priv_mprq_enabled(priv))
> +		return -ENOTSUP;
>  	/* All the configured queues should support. */
>  	for (i = 0; i < priv->rxqs_n; ++i) {
>  		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
> index 44856bbff..b181d04cf 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> @@ -87,7 +87,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
>  	const uint16_t q_mask = q_n - 1;
>  	uint16_t elts_idx = rxq->rq_ci & q_mask;
>  	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
> -	volatile struct mlx5_wqe_data_seg *wq = &(*rxq->wqes)[elts_idx];
> +	volatile struct mlx5_wqe_data_seg *wq =
> +		&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
>  	unsigned int i;
>  
>  	assert(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH);
> -- 
> 2.11.0
> 

Please rebase on top of [1] and [2].

Thanks,

[1] https://dpdk.org/dev/patchwork/patch/35650/
[2] https://dpdk.org/dev/patchwork/patch/35653/

-- 
Nélio Laranjeiro
6WIND

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (5 preceding siblings ...)
  2018-03-10  1:25 ` [PATCH v1 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
@ 2018-04-02 18:50 ` Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
                     ` (5 more replies)
  2018-04-19  1:11 ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
                   ` (5 subsequent siblings)
  12 siblings, 6 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe
bandwidth by posting a single large buffer for multiple packets. Instead of
posting one buffer per packet, one large buffer is posted in order to
receive multiple packets on the buffer. An MPRQ buffer consists of multiple
fixed-size strides and each stride receives one packet.

An Rx packet is either mem-copied to a user-provided mbuf if its length is
comparatively small, or referenced by mbuf indirection otherwise. In case of
indirection, the Mempool for the direct mbufs is allocated and managed by
the PMD.

In order to make mbuf indirections to each packet in the buffer, a buf_off
field is added to the rte_mbuf structure and rte_pktmbuf_attach_at() is
added as well.

v2:
* Change LIB_GLUE_VERSION to 18.05.0
* Make the new glue API consistent between rdma-core or MLNX_OFED
* Enable Multi-Packet RQ by default
* Rebased on top of dpdk-next-net-mlx to accommodate Nelio's cleanup patches

Yongseok Koh (6):
  mbuf: add buffer offset field for flexible indirection
  net/mlx5: separate filling Rx flags
  net/mlx5: add a function to rdma-core glue
  net/mlx5: add Multi-Packet Rx support
  net/mlx5: release Tx queue resource earlier than Rx
  app/testpmd: conserve mbuf indirection flag

 app/test-pmd/csumonly.c          |   2 +
 app/test-pmd/macfwd.c            |   2 +
 app/test-pmd/macswap.c           |   2 +
 doc/guides/nics/mlx5.rst         |  23 +++
 drivers/net/mlx5/Makefile        |   7 +-
 drivers/net/mlx5/mlx5.c          |  81 +++++++-
 drivers/net/mlx5/mlx5.h          |   3 +
 drivers/net/mlx5/mlx5_defs.h     |  20 ++
 drivers/net/mlx5/mlx5_ethdev.c   |   3 +
 drivers/net/mlx5/mlx5_glue.c     |  16 ++
 drivers/net/mlx5/mlx5_glue.h     |   8 +
 drivers/net/mlx5/mlx5_prm.h      |  15 ++
 drivers/net/mlx5/mlx5_rxq.c      | 401 +++++++++++++++++++++++++++++++++++----
 drivers/net/mlx5/mlx5_rxtx.c     | 236 +++++++++++++++++++----
 drivers/net/mlx5/mlx5_rxtx.h     |  17 +-
 drivers/net/mlx5/mlx5_rxtx_vec.c |   4 +
 drivers/net/mlx5/mlx5_rxtx_vec.h |   3 +-
 lib/librte_mbuf/rte_mbuf.h       | 158 ++++++++++++++-
 18 files changed, 921 insertions(+), 80 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
@ 2018-04-02 18:50   ` Yongseok Koh
  2018-04-03  8:26     ` Olivier Matz
  2018-04-02 18:50   ` [PATCH v2 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

When attaching an mbuf, the indirect mbuf has to point to the start of the
buffer of the direct mbuf. Adding a buf_off field to rte_mbuf makes this
more flexible: an indirect mbuf can point to any part of the direct mbuf
by calling rte_pktmbuf_attach_at().

Possible use-cases could be:
- If a packet has multiple layers of encapsulation, multiple indirect
  buffers can reference different layers of the encapsulated packet.
- A large direct mbuf can even contain multiple packets in series and
  each packet can be referenced by multiple mbuf indirections.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_mbuf/rte_mbuf.h | 158 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 157 insertions(+), 1 deletion(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 62740254d..053db32d0 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -559,6 +559,11 @@ struct rte_mbuf {
 		};
 	};
 
+	/** Buffer offset of direct mbuf if attached. Indirect mbuf can point to
+	 * any part of direct mbuf.
+	 */
+	uint16_t buf_off;
+
 	/** Size of the application private data. In case of an indirect
 	 * mbuf, it stores the direct mbuf private data size. */
 	uint16_t priv_size;
@@ -671,7 +676,9 @@ rte_mbuf_data_dma_addr_default(const struct rte_mbuf *mb)
 static inline struct rte_mbuf *
 rte_mbuf_from_indirect(struct rte_mbuf *mi)
 {
-	return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size);
+	return (struct rte_mbuf *)
+		RTE_PTR_SUB(mi->buf_addr,
+				sizeof(*mi) + mi->priv_size + mi->buf_off);
 }
 
 /**
@@ -1281,6 +1288,98 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Adjust the tailroom of an indirect mbuf. If len is positive, enlarge the
+ * tailroom of the mbuf. If negative, shrink the tailroom.
+ *
+ * If len is out of range, the function fails and returns -1 without
+ * modifying the indirect mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param len
+ *   The amount of length to adjust (in bytes).
+ * @return
+ *   - 0: On success.
+ *   - -1: On error.
+ */
+static inline int rte_pktmbuf_adj_indirect_tail(struct rte_mbuf *mi, int len)
+{
+	struct rte_mbuf *md;
+	uint16_t tailroom;
+	int delta;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(mi));
+
+	md = rte_mbuf_from_indirect(mi);
+	if (unlikely(mi->buf_len + len <= 0 ||
+			mi->buf_off + mi->buf_len + len >= md->buf_len))
+		return -1;
+
+	mi->buf_len += len;
+
+	tailroom = mi->buf_len - mi->data_off - mi->data_len;
+	delta = tailroom + len;
+	if (delta > 0) {
+		/* Adjust tailroom */
+		delta = 0;
+	} else if (delta + mi->data_len < 0) {
+		/* No data */
+		mi->data_off += delta + mi->data_len;
+		delta = mi->data_len;
+	}
+	mi->data_len += delta;
+	mi->pkt_len += delta;
+	return 0;
+}
+
+/**
+ * Shift the buffer reference of an indirect mbuf. If off is positive, push
+ * the start of the referenced buffer forward. If negative, pull it back.
+ *
+ * Returns a pointer to the start address of the new data area. If offset
+ * is out of range, then the function will fail and return NULL, without
+ * modifying the indirect mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param off
+ *   The amount of offset to adjust (in bytes).
+ * @return
+ *   A pointer to the new start of the data.
+ */
+static inline char *rte_pktmbuf_adj_indirect_head(struct rte_mbuf *mi, int off)
+{
+	int delta;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(mi));
+
+	if (unlikely(off >= mi->buf_len || mi->buf_off + off < 0))
+		return NULL;
+
+	mi->buf_iova += off;
+	mi->buf_addr = (char *)mi->buf_addr + off;
+	mi->buf_len -= off;
+	mi->buf_off += off;
+
+	delta = off - mi->data_off;
+	if (delta < 0) {
+		/* Adjust headroom */
+		mi->data_off -= off;
+		delta = 0;
+	} else if (delta < mi->data_len) {
+		/* No headroom */
+		mi->data_off = 0;
+	} else {
+		/* No data */
+		mi->data_off = 0;
+		delta = mi->data_len;
+	}
+	mi->data_len -= delta;
+	mi->pkt_len -= delta;
+	return (char *)mi->buf_addr + mi->data_off;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
@@ -1315,6 +1414,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
+	mi->buf_off = 0;
 
 	mi->data_off = m->data_off;
 	mi->data_len = m->data_len;
@@ -1336,6 +1436,62 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
+ * Attach a packet mbuf to another packet mbuf at a given offset.
+ *
+ * After attachment we refer the mbuf we attached as 'indirect',
+ * while mbuf we attached to as 'direct'.
+ *
+ * The indirect mbuf can reference anywhere in the buffer of the direct
+ * mbuf by the given offset, and the indirect mbuf is also trimmed to
+ * the given buffer length.
+ *
+ * As a result, if a direct mbuf has multiple layers of encapsulation,
+ * multiple indirect buffers can reference different layers of the packet.
+ * Or, a large direct mbuf can even contain multiple packets in series and
+ * each packet can be referenced by multiple mbuf indirections.
+ *
+ * Returns a pointer to the start address of the new data area. If offset
+ * or buffer length is out of range, then the function will fail and return
+ * NULL, without attaching the mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param m
+ *   The packet mbuf we're attaching to.
+ * @param off
+ *   The amount of offset to push (in bytes).
+ * @param buf_len
+ *   The buffer length of the indirect mbuf (in bytes).
+ * @return
+ *   A pointer to the new start of the data.
+ */
+static inline char *rte_pktmbuf_attach_at(struct rte_mbuf *mi,
+	struct rte_mbuf *m, uint16_t off, uint16_t buf_len)
+{
+	struct rte_mbuf *md;
+	char *ret;
+
+	if (RTE_MBUF_DIRECT(m))
+		md = m;
+	else
+		md = rte_mbuf_from_indirect(m);
+
+	if (off + buf_len > md->buf_len)
+		return NULL;
+
+	rte_pktmbuf_attach(mi, m);
+
+	/* Push reference of indirect mbuf */
+	ret = rte_pktmbuf_adj_indirect_head(mi, off);
+	RTE_ASSERT(ret != NULL);
+
+	/* Trim reference of indirect mbuf */
+	rte_pktmbuf_adj_indirect_tail(mi, off + buf_len - md->buf_len);
+
+	return ret;
+}
+
+/**
  * Detach an indirect packet mbuf.
  *
  *  - restore original mbuf address and length values.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v2 2/6] net/mlx5: separate filling Rx flags
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
@ 2018-04-02 18:50   ` Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Filling in mbuf fields from Rx completion flags is moved into a separate
inline function so that it can be reused.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx.c | 84 +++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 33 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index d1292aa27..461d7bdf6 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -43,6 +43,10 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 static __rte_always_inline uint32_t
 rxq_cq_to_ol_flags(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe);
 
+static __rte_always_inline void
+rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
+	       volatile struct mlx5_cqe *cqe, uint32_t rss_hash_res);
+
 uint32_t mlx5_ptype_table[] __rte_cache_aligned = {
 	[0xff] = RTE_PTYPE_ALL_MASK, /* Last entry for errored packet. */
 };
@@ -1761,6 +1765,52 @@ rxq_cq_to_ol_flags(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe)
 }
 
 /**
+ * Fill in mbuf fields from RX completion flags.
+ * Note that pkt->ol_flags should be initialized outside of this function.
+ *
+ * @param rxq
+ *   Pointer to RX queue.
+ * @param pkt
+ *   mbuf to fill.
+ * @param cqe
+ *   CQE to process.
+ * @param rss_hash_res
+ *   Packet RSS Hash result.
+ */
+static inline void
+rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
+	       volatile struct mlx5_cqe *cqe, uint32_t rss_hash_res)
+{
+	/* Update packet information. */
+	pkt->packet_type = rxq_cq_to_pkt_type(cqe);
+	if (rss_hash_res && rxq->rss_hash) {
+		pkt->hash.rss = rss_hash_res;
+		pkt->ol_flags |= PKT_RX_RSS_HASH;
+	}
+	if (rxq->mark && MLX5_FLOW_MARK_IS_VALID(cqe->sop_drop_qpn)) {
+		pkt->ol_flags |= PKT_RX_FDIR;
+		if (cqe->sop_drop_qpn !=
+		    rte_cpu_to_be_32(MLX5_FLOW_MARK_DEFAULT)) {
+			uint32_t mark = cqe->sop_drop_qpn;
+
+			pkt->ol_flags |= PKT_RX_FDIR_ID;
+			pkt->hash.fdir.hi = mlx5_flow_mark_get(mark);
+		}
+	}
+	if (rxq->csum | rxq->csum_l2tun)
+		pkt->ol_flags |= rxq_cq_to_ol_flags(rxq, cqe);
+	if (rxq->vlan_strip &&
+	    (cqe->hdr_type_etc & rte_cpu_to_be_16(MLX5_CQE_VLAN_STRIPPED))) {
+		pkt->ol_flags |= PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED;
+		pkt->vlan_tci = rte_be_to_cpu_16(cqe->vlan_info);
+	}
+	if (rxq->hw_timestamp) {
+		pkt->timestamp = rte_be_to_cpu_64(cqe->timestamp);
+		pkt->ol_flags |= PKT_RX_TIMESTAMP;
+	}
+}
+
+/**
  * DPDK callback for RX.
  *
  * @param dpdk_rxq
@@ -1836,40 +1886,8 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			}
 			pkt = seg;
 			assert(len >= (rxq->crc_present << 2));
-			/* Update packet information. */
-			pkt->packet_type = rxq_cq_to_pkt_type(cqe);
 			pkt->ol_flags = 0;
-			if (rss_hash_res && rxq->rss_hash) {
-				pkt->hash.rss = rss_hash_res;
-				pkt->ol_flags = PKT_RX_RSS_HASH;
-			}
-			if (rxq->mark &&
-			    MLX5_FLOW_MARK_IS_VALID(cqe->sop_drop_qpn)) {
-				pkt->ol_flags |= PKT_RX_FDIR;
-				if (cqe->sop_drop_qpn !=
-				    rte_cpu_to_be_32(MLX5_FLOW_MARK_DEFAULT)) {
-					uint32_t mark = cqe->sop_drop_qpn;
-
-					pkt->ol_flags |= PKT_RX_FDIR_ID;
-					pkt->hash.fdir.hi =
-						mlx5_flow_mark_get(mark);
-				}
-			}
-			if (rxq->csum | rxq->csum_l2tun)
-				pkt->ol_flags |= rxq_cq_to_ol_flags(rxq, cqe);
-			if (rxq->vlan_strip &&
-			    (cqe->hdr_type_etc &
-			     rte_cpu_to_be_16(MLX5_CQE_VLAN_STRIPPED))) {
-				pkt->ol_flags |= PKT_RX_VLAN |
-					PKT_RX_VLAN_STRIPPED;
-				pkt->vlan_tci =
-					rte_be_to_cpu_16(cqe->vlan_info);
-			}
-			if (rxq->hw_timestamp) {
-				pkt->timestamp =
-					rte_be_to_cpu_64(cqe->timestamp);
-				pkt->ol_flags |= PKT_RX_TIMESTAMP;
-			}
+			rxq_cq_to_mbuf(rxq, pkt, cqe, rss_hash_res);
 			if (rxq->crc_present)
 				len -= ETHER_CRC_LEN;
 			PKT_LEN(pkt) = len;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v2 3/6] net/mlx5: add a function to rdma-core glue
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
@ 2018-04-02 18:50   ` Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

mlx5dv_create_wq() is added for the Multi-Packet RQ (a.k.a Striding RQ).

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/Makefile    |  7 ++++++-
 drivers/net/mlx5/mlx5_glue.c | 16 ++++++++++++++++
 drivers/net/mlx5/mlx5_glue.h |  8 ++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
index 201f6f06a..e51debec8 100644
--- a/drivers/net/mlx5/Makefile
+++ b/drivers/net/mlx5/Makefile
@@ -35,7 +35,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_mlx5.a
 LIB_GLUE = $(LIB_GLUE_BASE).$(LIB_GLUE_VERSION)
 LIB_GLUE_BASE = librte_pmd_mlx5_glue.so
-LIB_GLUE_VERSION = 18.02.0
+LIB_GLUE_VERSION = 18.05.0
 
 # Sources.
 SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5.c
@@ -125,6 +125,11 @@ mlx5_autoconf.h.new: FORCE
 mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 	$Q $(RM) -f -- '$@'
 	$Q sh -- '$<' '$@' \
+		HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT \
+		infiniband/mlx5dv.h \
+		enum MLX5DV_CONTEXT_MASK_STRIDING_RQ \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_IBV_DEVICE_TUNNEL_SUPPORT \
 		infiniband/mlx5dv.h \
 		enum MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS \
diff --git a/drivers/net/mlx5/mlx5_glue.c b/drivers/net/mlx5/mlx5_glue.c
index be684d378..3551dcac2 100644
--- a/drivers/net/mlx5/mlx5_glue.c
+++ b/drivers/net/mlx5/mlx5_glue.c
@@ -293,6 +293,21 @@ mlx5_glue_dv_create_cq(struct ibv_context *context,
 	return mlx5dv_create_cq(context, cq_attr, mlx5_cq_attr);
 }
 
+static struct ibv_wq *
+mlx5_glue_dv_create_wq(struct ibv_context *context,
+		       struct ibv_wq_init_attr *wq_attr,
+		       struct mlx5dv_wq_init_attr *mlx5_wq_attr)
+{
+#ifndef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	(void)context;
+	(void)wq_attr;
+	(void)mlx5_wq_attr;
+	return NULL;
+#else
+	return mlx5dv_create_wq(context, wq_attr, mlx5_wq_attr);
+#endif
+}
+
 static int
 mlx5_glue_dv_query_device(struct ibv_context *ctx,
 			  struct mlx5dv_context *attrs_out)
@@ -353,6 +368,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue){
 	.port_state_str = mlx5_glue_port_state_str,
 	.cq_ex_to_cq = mlx5_glue_cq_ex_to_cq,
 	.dv_create_cq = mlx5_glue_dv_create_cq,
+	.dv_create_wq = mlx5_glue_dv_create_wq,
 	.dv_query_device = mlx5_glue_dv_query_device,
 	.dv_set_context_attr = mlx5_glue_dv_set_context_attr,
 	.dv_init_obj = mlx5_glue_dv_init_obj,
diff --git a/drivers/net/mlx5/mlx5_glue.h b/drivers/net/mlx5/mlx5_glue.h
index b5efee3b6..2789201bc 100644
--- a/drivers/net/mlx5/mlx5_glue.h
+++ b/drivers/net/mlx5/mlx5_glue.h
@@ -31,6 +31,10 @@ struct ibv_counter_set_init_attr;
 struct ibv_query_counter_set_attr;
 #endif
 
+#ifndef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+struct mlx5dv_wq_init_attr;
+#endif
+
 /* LIB_GLUE_VERSION must be updated every time this structure is modified. */
 struct mlx5_glue {
 	const char *version;
@@ -100,6 +104,10 @@ struct mlx5_glue {
 		(struct ibv_context *context,
 		 struct ibv_cq_init_attr_ex *cq_attr,
 		 struct mlx5dv_cq_init_attr *mlx5_cq_attr);
+	struct ibv_wq *(*dv_create_wq)
+		(struct ibv_context *context,
+		 struct ibv_wq_init_attr *wq_attr,
+		 struct mlx5dv_wq_init_attr *mlx5_wq_attr);
 	int (*dv_query_device)(struct ibv_context *ctx_in,
 			       struct mlx5dv_context *attrs_out);
 	int (*dv_set_context_attr)(struct ibv_context *ibv_ctx,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v2 4/6] net/mlx5: add Multi-Packet Rx support
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                     ` (2 preceding siblings ...)
  2018-04-02 18:50   ` [PATCH v2 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
@ 2018-04-02 18:50   ` Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
  5 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe
bandwidth by posting a single large buffer for multiple packets. Instead of
posting one buffer per packet, one large buffer is posted in order to
receive multiple packets on the buffer. An MPRQ buffer consists of multiple
fixed-size strides and each stride receives one packet.

An Rx packet is either mem-copied to a user-provided mbuf if its length is
comparatively small, or referenced by mbuf indirection otherwise. In case of
indirection, the Mempool for the direct mbufs is allocated and managed by
the PMD.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 doc/guides/nics/mlx5.rst         |  23 +++
 drivers/net/mlx5/mlx5.c          |  65 +++++++
 drivers/net/mlx5/mlx5.h          |   3 +
 drivers/net/mlx5/mlx5_defs.h     |  20 ++
 drivers/net/mlx5/mlx5_ethdev.c   |   3 +
 drivers/net/mlx5/mlx5_prm.h      |  15 ++
 drivers/net/mlx5/mlx5_rxq.c      | 401 +++++++++++++++++++++++++++++++++++----
 drivers/net/mlx5/mlx5_rxtx.c     | 152 ++++++++++++++-
 drivers/net/mlx5/mlx5_rxtx.h     |  17 +-
 drivers/net/mlx5/mlx5_rxtx_vec.c |   4 +
 drivers/net/mlx5/mlx5_rxtx_vec.h |   3 +-
 11 files changed, 669 insertions(+), 37 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 46d26e4c8..f4502f637 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -254,6 +254,29 @@ Run-time configuration
   - x86_64 with ConnectX-4, ConnectX-4 LX and ConnectX-5.
   - POWER8 and ARMv8 with ConnectX-4 LX and ConnectX-5.
 
+- ``mprq_en`` parameter [int]
+
+  A nonzero value enables configuring Multi-Packet Rx queues. An Rx queue is
+  configured as Multi-Packet RQ if the total number of Rx queues is
+  ``rxqs_min_mprq`` or more and Rx scatter isn't configured. Enabled by default.
+
+  Multi-Packet Rx Queue (MPRQ, a.k.a. Striding RQ) can further save PCIe
+  bandwidth by posting a single large buffer for multiple packets. Instead of
+  posting one buffer per packet, one large buffer is posted to receive multiple
+  packets on it. An MPRQ buffer consists of multiple fixed-size strides and
+  each stride receives one packet.
+
+- ``mprq_max_memcpy_len`` parameter [int]
+  The maximum packet size for memcpy with Multi-Packet Rx queue. An Rx packet
+  is memcpy'd to a user-provided mbuf if its size is less than or equal to
+  this parameter; otherwise, it is referenced by mbuf indirection. In that
+  case, the Mempool for the direct mbufs is allocated and managed by the PMD.
+  The default value is 128.
+
+- ``rxqs_min_mprq`` parameter [int]
+  Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
+  greater than or equal to this value. The default value is 12.
+
 - ``txq_inline`` parameter [int]
 
   Amount of data to be inlined during TX operations. Improves latency.
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 7d58d66bb..aba44746f 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -44,6 +44,18 @@
 /* Device parameter to enable RX completion queue compression. */
 #define MLX5_RXQ_CQE_COMP_EN "rxq_cqe_comp_en"
 
+/* Device parameter to enable Multi-Packet Rx queue. */
+#define MLX5_RX_MPRQ_EN "mprq_en"
+
+/* Device parameter to limit the size of memcpy'd packet. */
+#define MLX5_RX_MPRQ_MAX_MEMCPY_LEN "mprq_max_memcpy_len"
+
+/*
+ * Device parameter to set the minimum number of Rx queues to configure
+ * Multi-Packet Rx queue.
+ */
+#define MLX5_RXQS_MIN_MPRQ "rxqs_min_mprq"
+
 /* Device parameter to configure inline send. */
 #define MLX5_TXQ_INLINE "txq_inline"
 
@@ -393,6 +405,12 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 	}
 	if (strcmp(MLX5_RXQ_CQE_COMP_EN, key) == 0) {
 		config->cqe_comp = !!tmp;
+	} else if (strcmp(MLX5_RX_MPRQ_EN, key) == 0) {
+		config->mprq = !!tmp;
+	} else if (strcmp(MLX5_RX_MPRQ_MAX_MEMCPY_LEN, key) == 0) {
+		config->mprq_max_memcpy_len = tmp;
+	} else if (strcmp(MLX5_RXQS_MIN_MPRQ, key) == 0) {
+		config->rxqs_mprq = tmp;
 	} else if (strcmp(MLX5_TXQ_INLINE, key) == 0) {
 		config->txq_inline = tmp;
 	} else if (strcmp(MLX5_TXQS_MIN_INLINE, key) == 0) {
@@ -431,6 +449,9 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 {
 	const char **params = (const char *[]){
 		MLX5_RXQ_CQE_COMP_EN,
+		MLX5_RX_MPRQ_EN,
+		MLX5_RX_MPRQ_MAX_MEMCPY_LEN,
+		MLX5_RXQS_MIN_MPRQ,
 		MLX5_TXQ_INLINE,
 		MLX5_TXQS_MIN_INLINE,
 		MLX5_TXQ_MPW_EN,
@@ -600,6 +621,7 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	unsigned int mps;
 	unsigned int cqe_comp;
 	unsigned int tunnel_en = 0;
+	unsigned int mprq = 0;
 	int idx;
 	int i;
 	struct mlx5dv_context attrs_out = {0};
@@ -674,6 +696,9 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 #ifdef HAVE_IBV_DEVICE_TUNNEL_SUPPORT
 	attrs_out.comp_mask |= MLX5DV_CONTEXT_MASK_TUNNEL_OFFLOADS;
 #endif
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	attrs_out.comp_mask |= MLX5DV_CONTEXT_MASK_STRIDING_RQ;
+#endif
 	mlx5_glue->dv_query_device(attr_ctx, &attrs_out);
 	if (attrs_out.flags & MLX5DV_CONTEXT_FLAGS_MPW_ALLOWED) {
 		if (attrs_out.flags & MLX5DV_CONTEXT_FLAGS_ENHANCED_MPW) {
@@ -687,6 +712,37 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 		DRV_LOG(DEBUG, "MPW isn't supported");
 		mps = MLX5_MPW_DISABLED;
 	}
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	if (attrs_out.comp_mask & MLX5DV_CONTEXT_MASK_STRIDING_RQ) {
+		struct mlx5dv_striding_rq_caps mprq_caps =
+			attrs_out.striding_rq_caps;
+
+		DRV_LOG(DEBUG, "\tmin_single_stride_log_num_of_bytes: %d",
+			mprq_caps.min_single_stride_log_num_of_bytes);
+		DRV_LOG(DEBUG, "\tmax_single_stride_log_num_of_bytes: %d",
+			mprq_caps.max_single_stride_log_num_of_bytes);
+		DRV_LOG(DEBUG, "\tmin_single_wqe_log_num_of_strides: %d",
+			mprq_caps.min_single_wqe_log_num_of_strides);
+		DRV_LOG(DEBUG, "\tmax_single_wqe_log_num_of_strides: %d",
+			mprq_caps.max_single_wqe_log_num_of_strides);
+		DRV_LOG(DEBUG, "\tsupported_qpts: %d",
+			mprq_caps.supported_qpts);
+		if (mprq_caps.min_single_stride_log_num_of_bytes <=
+		    MLX5_MPRQ_MIN_STRIDE_SZ_N &&
+		    mprq_caps.max_single_stride_log_num_of_bytes >=
+		    MLX5_MPRQ_STRIDE_SZ_N &&
+		    mprq_caps.min_single_wqe_log_num_of_strides <=
+		    MLX5_MPRQ_MIN_STRIDE_NUM_N &&
+		    mprq_caps.max_single_wqe_log_num_of_strides >=
+		    MLX5_MPRQ_STRIDE_NUM_N) {
+			DRV_LOG(DEBUG, "Multi-Packet RQ is supported");
+			mprq = 1;
+		} else {
+			DRV_LOG(DEBUG, "Multi-Packet RQ isn't supported");
+			mprq = 0;
+		}
+	}
+#endif
 	if (RTE_CACHE_LINE_SIZE == 128 &&
 	    !(attrs_out.flags & MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP))
 		cqe_comp = 0;
@@ -733,6 +789,9 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 			.txq_inline = MLX5_ARG_UNSET,
 			.txqs_inline = MLX5_ARG_UNSET,
 			.inline_max_packet_sz = MLX5_ARG_UNSET,
+			.mprq = 1, /* Enabled by default. */
+			.mprq_max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN,
+			.rxqs_mprq = MLX5_MPRQ_MIN_RXQS,
 		};
 
 		len = snprintf(name, sizeof(name), PCI_PRI_FMT,
@@ -890,6 +949,12 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 			DRV_LOG(WARNING, "Rx CQE compression isn't supported");
 			config.cqe_comp = 0;
 		}
+		if (config.mprq && !mprq) {
+			DRV_LOG(WARNING, "Multi-Packet RQ isn't supported");
+			config.mprq = 0;
+		}
+		DRV_LOG(INFO, "Multi-Packet RQ is %s",
+			config.mprq ? "enabled" : "disabled");
 		eth_dev = rte_eth_dev_allocate(name);
 		if (eth_dev == NULL) {
 			DRV_LOG(ERR, "can not allocate rte ethdev");
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index faacfd9d6..d2e789990 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -87,6 +87,9 @@ struct mlx5_dev_config {
 	unsigned int tx_vec_en:1; /* Tx vector is enabled. */
 	unsigned int rx_vec_en:1; /* Rx vector is enabled. */
 	unsigned int mpw_hdr_dseg:1; /* Enable DSEGs in the title WQEBB. */
+	unsigned int mprq:1; /* Whether Multi-Packet RQ is supported. */
+	unsigned int mprq_max_memcpy_len; /* Maximum packet size to memcpy. */
+	unsigned int rxqs_mprq; /* Queue count threshold for Multi-Packet RQ. */
 	unsigned int tso_max_payload_sz; /* Maximum TCP payload for TSO. */
 	unsigned int ind_table_max_size; /* Maximum indirection table size. */
 	int txq_inline; /* Maximum packet size for inlining. */
diff --git a/drivers/net/mlx5/mlx5_defs.h b/drivers/net/mlx5/mlx5_defs.h
index 6401588ee..d9fa3142d 100644
--- a/drivers/net/mlx5/mlx5_defs.h
+++ b/drivers/net/mlx5/mlx5_defs.h
@@ -95,4 +95,24 @@
  */
 #define MLX5_UAR_OFFSET (1ULL << 32)
 
+/* Log 2 of the size of a stride for Multi-Packet RQ. */
+#define MLX5_MPRQ_STRIDE_SZ_N 11
+#define MLX5_MPRQ_MIN_STRIDE_SZ_N 6
+
+/* Log 2 of the number of strides per WQE for Multi-Packet RQ. */
+#define MLX5_MPRQ_STRIDE_NUM_N 4
+#define MLX5_MPRQ_MIN_STRIDE_NUM_N 3
+
+/* Two-byte shift is disabled for Multi-Packet RQ. */
+#define MLX5_MPRQ_TWO_BYTE_SHIFT 0
+
+/* Default maximum size of packet to be memcpy'd instead of indirection. */
+#define MLX5_MPRQ_MEMCPY_DEFAULT_LEN 128
+
+/* Minimum number of Rx queues to enable Multi-Packet RQ. */
+#define MLX5_MPRQ_MIN_RXQS 12
+
+/* Cache size of mempool for Multi-Packet RQ. */
+#define MLX5_MPRQ_MP_CACHE_SZ 16
+
 #endif /* RTE_PMD_MLX5_DEFS_H_ */
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index b6f5101cf..a2fed5c69 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -464,6 +464,7 @@ mlx5_dev_supported_ptypes_get(struct rte_eth_dev *dev)
 	};
 
 	if (dev->rx_pkt_burst == mlx5_rx_burst ||
+	    dev->rx_pkt_burst == mlx5_rx_burst_mprq ||
 	    dev->rx_pkt_burst == mlx5_rx_burst_vec)
 		return ptypes;
 	return NULL;
@@ -1116,6 +1117,8 @@ mlx5_select_rx_function(struct rte_eth_dev *dev)
 		rx_pkt_burst = mlx5_rx_burst_vec;
 		DRV_LOG(DEBUG, "port %u selected Rx vectorized function",
 			dev->data->port_id);
+	} else if (mlx5_mprq_enabled(dev)) {
+		rx_pkt_burst = mlx5_rx_burst_mprq;
 	}
 	return rx_pkt_burst;
 }
diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
index 9eb9c15e1..b7ad3454e 100644
--- a/drivers/net/mlx5/mlx5_prm.h
+++ b/drivers/net/mlx5/mlx5_prm.h
@@ -195,6 +195,21 @@ struct mlx5_mpw {
 	} data;
 };
 
+/* WQE for Multi-Packet RQ. */
+struct mlx5_wqe_mprq {
+	struct mlx5_wqe_srq_next_seg next_seg;
+	struct mlx5_wqe_data_seg dseg;
+};
+
+#define MLX5_MPRQ_LEN_MASK 0x000ffff
+#define MLX5_MPRQ_LEN_SHIFT 0
+#define MLX5_MPRQ_STRIDE_NUM_MASK 0x7fff0000
+#define MLX5_MPRQ_STRIDE_NUM_SHIFT 16
+#define MLX5_MPRQ_FILLER_MASK 0x80000000
+#define MLX5_MPRQ_FILLER_SHIFT 31
+
+#define MLX5_MPRQ_STRIDE_SHIFT_BYTE 2
+
 /* CQ element structure - should be equal to the cache line size */
 struct mlx5_cqe {
 #if (RTE_CACHE_LINE_SIZE == 128)
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 1b4570586..d2018d929 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -55,7 +55,75 @@ uint8_t rss_hash_default_key[] = {
 const size_t rss_hash_default_key_len = sizeof(rss_hash_default_key);
 
 /**
- * Allocate RX queue elements.
+ * Check whether Multi-Packet RQ can be enabled for the device.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ *
+ * @return
+ *   1 if supported, negative errno value if not.
+ */
+inline int
+mlx5_check_mprq_support(struct rte_eth_dev *dev)
+{
+	struct priv *priv = dev->data->dev_private;
+
+	if (priv->config.mprq && priv->rxqs_n >= priv->config.rxqs_mprq)
+		return 1;
+	return -ENOTSUP;
+}
+
+/**
+ * Check whether Multi-Packet RQ is enabled for the Rx queue.
+ *
+ * @param rxq
+ *   Pointer to receive queue structure.
+ *
+ * @return
+ *   0 if disabled, otherwise enabled.
+ */
+static inline int
+rxq_mprq_enabled(struct mlx5_rxq_data *rxq)
+{
+	return rxq->mprq_mp != NULL;
+}
+
+/**
+ * Check whether Multi-Packet RQ is enabled for the device.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ *
+ * @return
+ *   0 if disabled, otherwise enabled.
+ */
+inline int
+mlx5_mprq_enabled(struct rte_eth_dev *dev)
+{
+	struct priv *priv = dev->data->dev_private;
+	uint16_t i;
+	uint16_t n = 0;
+
+	if (mlx5_check_mprq_support(dev) < 0)
+		return 0;
+	/* All the configured queues should be enabled. */
+	for (i = 0; i < priv->rxqs_n; ++i) {
+		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+
+		if (!rxq)
+			continue;
+		if (rxq_mprq_enabled(rxq))
+			++n;
+	}
+	/* Multi-Packet RQ can't be partially configured. */
+	assert(n != 0);
+	if (n == priv->rxqs_n)
+		return 1;
+	return 0;
+}
+
+/**
+ * Allocate RX queue elements for Multi-Packet RQ.
  *
  * @param rxq_ctrl
  *   Pointer to RX queue structure.
@@ -63,8 +131,60 @@ const size_t rss_hash_default_key_len = sizeof(rss_hash_default_key);
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-int
-rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
+static int
+rxq_alloc_elts_mprq(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_rxq_data *rxq = &rxq_ctrl->rxq;
+	unsigned int wqe_n = 1 << rxq->elts_n;
+	unsigned int i;
+	int err;
+
+	/* Iterate on segments. */
+	for (i = 0; i <= wqe_n; ++i) {
+		struct rte_mbuf *buf;
+
+		if (rte_mempool_get(rxq->mprq_mp, (void **)&buf) < 0) {
+			DRV_LOG(ERR, "port %u empty mbuf pool",
+				rxq_ctrl->priv->dev->data->port_id);
+			rte_errno = ENOMEM;
+			goto error;
+		}
+		if (i < wqe_n)
+			(*rxq->elts)[i] = buf;
+		else
+			rxq->mprq_repl = buf;
+		PORT(buf) = rxq->port_id;
+	}
+	DRV_LOG(DEBUG,
+		"port %u Rx queue %u allocated and configured %u segments",
+		rxq_ctrl->priv->dev->data->port_id, rxq_ctrl->idx, wqe_n);
+	return 0;
+error:
+	err = rte_errno; /* Save rte_errno before cleanup. */
+	wqe_n = i;
+	for (i = 0; (i != wqe_n); ++i) {
+		if ((*rxq->elts)[i] != NULL)
+			rte_mempool_put(rxq->mprq_mp,
+					(*rxq->elts)[i]);
+		(*rxq->elts)[i] = NULL;
+	}
+	DRV_LOG(DEBUG, "port %u Rx queue %u failed, freed everything",
+		rxq_ctrl->priv->dev->data->port_id, rxq_ctrl->idx);
+	rte_errno = err; /* Restore rte_errno. */
+	return -rte_errno;
+}
+
+/**
+ * Allocate RX queue elements for Single-Packet RQ.
+ *
+ * @param rxq_ctrl
+ *   Pointer to RX queue structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 {
 	const unsigned int sges_n = 1 << rxq_ctrl->rxq.sges_n;
 	unsigned int elts_n = 1 << rxq_ctrl->rxq.elts_n;
@@ -140,6 +260,22 @@ rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
 }
 
 /**
+ * Allocate RX queue elements.
+ *
+ * @param rxq_ctrl
+ *   Pointer to RX queue structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	return rxq_mprq_enabled(&rxq_ctrl->rxq) ?
+	       rxq_alloc_elts_mprq(rxq_ctrl) : rxq_alloc_elts_sprq(rxq_ctrl);
+}
+
+/**
  * Free RX queue elements.
  *
  * @param rxq_ctrl
@@ -172,6 +308,10 @@ rxq_free_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
 			rte_pktmbuf_free_seg((*rxq->elts)[i]);
 		(*rxq->elts)[i] = NULL;
 	}
+	if (rxq->mprq_repl != NULL) {
+		rte_pktmbuf_free_seg(rxq->mprq_repl);
+		rxq->mprq_repl = NULL;
+	}
 }
 
 /**
@@ -623,10 +763,16 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 			struct ibv_cq_init_attr_ex ibv;
 			struct mlx5dv_cq_init_attr mlx5;
 		} cq;
-		struct ibv_wq_init_attr wq;
+		struct {
+			struct ibv_wq_init_attr ibv;
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+			struct mlx5dv_wq_init_attr mlx5;
+#endif
+		} wq;
 		struct ibv_cq_ex cq_attr;
 	} attr;
-	unsigned int cqe_n = (1 << rxq_data->elts_n) - 1;
+	unsigned int cqe_n;
+	unsigned int wqe_n = 1 << rxq_data->elts_n;
 	struct mlx5_rxq_ibv *tmpl;
 	struct mlx5dv_cq cq_info;
 	struct mlx5dv_rwq rwq;
@@ -634,6 +780,7 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 	int ret = 0;
 	struct mlx5dv_obj obj;
 	struct mlx5_dev_config *config = &priv->config;
+	const int mprq_en = rxq_mprq_enabled(rxq_data);
 
 	assert(rxq_data);
 	assert(!rxq_ctrl->ibv);
@@ -659,6 +806,19 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 			goto error;
 		}
 	}
+	if (mprq_en) {
+		tmpl->mprq_mr = mlx5_mr_get(dev, rxq_data->mprq_mp);
+		if (!tmpl->mprq_mr) {
+			tmpl->mprq_mr = mlx5_mr_new(dev, rxq_data->mprq_mp);
+			if (!tmpl->mprq_mr) {
+				DRV_LOG(ERR,
+					"port %u Rx queue %u: "
+					"MR creation failure for Multi-Packet RQ",
+					dev->data->port_id, rxq_ctrl->idx);
+				goto error;
+			}
+		}
+	}
 	if (rxq_ctrl->irq) {
 		tmpl->channel = mlx5_glue->create_comp_channel(priv->ctx);
 		if (!tmpl->channel) {
@@ -668,6 +828,10 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 			goto error;
 		}
 	}
+	if (mprq_en)
+		cqe_n = wqe_n * (1 << MLX5_MPRQ_STRIDE_NUM_N) - 1;
+	else
+		cqe_n = wqe_n - 1;
 	attr.cq.ibv = (struct ibv_cq_init_attr_ex){
 		.cqe = cqe_n,
 		.channel = tmpl->channel,
@@ -705,11 +869,11 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 		dev->data->port_id, priv->device_attr.orig_attr.max_qp_wr);
 	DRV_LOG(DEBUG, "port %u priv->device_attr.max_sge is %d",
 		dev->data->port_id, priv->device_attr.orig_attr.max_sge);
-	attr.wq = (struct ibv_wq_init_attr){
+	attr.wq.ibv = (struct ibv_wq_init_attr){
 		.wq_context = NULL, /* Could be useful in the future. */
 		.wq_type = IBV_WQT_RQ,
 		/* Max number of outstanding WRs. */
-		.max_wr = (1 << rxq_data->elts_n) >> rxq_data->sges_n,
+		.max_wr = wqe_n >> rxq_data->sges_n,
 		/* Max number of scatter/gather elements in a WR. */
 		.max_sge = 1 << rxq_data->sges_n,
 		.pd = priv->pd,
@@ -723,8 +887,8 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 	};
 	/* By default, FCS (CRC) is stripped by hardware. */
 	if (rxq_data->crc_present) {
-		attr.wq.create_flags |= IBV_WQ_FLAGS_SCATTER_FCS;
-		attr.wq.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
+		attr.wq.ibv.create_flags |= IBV_WQ_FLAGS_SCATTER_FCS;
+		attr.wq.ibv.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
 	}
 #ifdef HAVE_IBV_WQ_FLAG_RX_END_PADDING
 	if (config->hw_padding) {
@@ -732,7 +896,26 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 		attr.wq.comp_mask |= IBV_WQ_INIT_ATTR_FLAGS;
 	}
 #endif
-	tmpl->wq = mlx5_glue->create_wq(priv->ctx, &attr.wq);
+#ifdef HAVE_IBV_DEVICE_STRIDING_RQ_SUPPORT
+	attr.wq.mlx5 = (struct mlx5dv_wq_init_attr){
+		.comp_mask = 0,
+	};
+	if (mprq_en) {
+		struct mlx5dv_striding_rq_init_attr *mprq_attr =
+			&attr.wq.mlx5.striding_rq_attrs;
+
+		attr.wq.mlx5.comp_mask |= MLX5DV_WQ_INIT_ATTR_MASK_STRIDING_RQ;
+		*mprq_attr = (struct mlx5dv_striding_rq_init_attr){
+			.single_stride_log_num_of_bytes = MLX5_MPRQ_STRIDE_SZ_N,
+			.single_wqe_log_num_of_strides = MLX5_MPRQ_STRIDE_NUM_N,
+			.two_byte_shift_en = MLX5_MPRQ_TWO_BYTE_SHIFT,
+		};
+	}
+	tmpl->wq = mlx5_glue->dv_create_wq(priv->ctx, &attr.wq.ibv,
+					   &attr.wq.mlx5);
+#else
+	tmpl->wq = mlx5_glue->create_wq(priv->ctx, &attr.wq.ibv);
+#endif
 	if (tmpl->wq == NULL) {
 		DRV_LOG(ERR, "port %u Rx queue %u WQ creation failure",
 			dev->data->port_id, idx);
@@ -743,16 +926,14 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 	 * Make sure number of WRs*SGEs match expectations since a queue
 	 * cannot allocate more than "desc" buffers.
 	 */
-	if (((int)attr.wq.max_wr !=
-	     ((1 << rxq_data->elts_n) >> rxq_data->sges_n)) ||
-	    ((int)attr.wq.max_sge != (1 << rxq_data->sges_n))) {
+	if (attr.wq.ibv.max_wr != (wqe_n >> rxq_data->sges_n) ||
+	    attr.wq.ibv.max_sge != (1u << rxq_data->sges_n)) {
 		DRV_LOG(ERR,
 			"port %u Rx queue %u requested %u*%u but got %u*%u"
 			" WRs*SGEs",
 			dev->data->port_id, idx,
-			((1 << rxq_data->elts_n) >> rxq_data->sges_n),
-			(1 << rxq_data->sges_n),
-			attr.wq.max_wr, attr.wq.max_sge);
+			wqe_n >> rxq_data->sges_n, (1 << rxq_data->sges_n),
+			attr.wq.ibv.max_wr, attr.wq.ibv.max_sge);
 		rte_errno = EINVAL;
 		goto error;
 	}
@@ -787,25 +968,38 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 		goto error;
 	}
 	/* Fill the rings. */
-	rxq_data->wqes = (volatile struct mlx5_wqe_data_seg (*)[])
-		(uintptr_t)rwq.buf;
-	for (i = 0; (i != (unsigned int)(1 << rxq_data->elts_n)); ++i) {
+	rxq_data->wqes = rwq.buf;
+	for (i = 0; (i != wqe_n); ++i) {
+		volatile struct mlx5_wqe_data_seg *scat;
 		struct rte_mbuf *buf = (*rxq_data->elts)[i];
-		volatile struct mlx5_wqe_data_seg *scat = &(*rxq_data->wqes)[i];
-
+		uintptr_t addr = rte_pktmbuf_mtod(buf, uintptr_t);
+		uint32_t byte_count;
+		uint32_t lkey;
+
+		if (mprq_en) {
+			scat = &((volatile struct mlx5_wqe_mprq *)
+				 rxq_data->wqes)[i].dseg;
+			byte_count = (1 << MLX5_MPRQ_STRIDE_SZ_N) *
+				     (1 << MLX5_MPRQ_STRIDE_NUM_N);
+			lkey = tmpl->mprq_mr->lkey;
+		} else {
+			scat = &((volatile struct mlx5_wqe_data_seg *)
+				 rxq_data->wqes)[i];
+			byte_count = DATA_LEN(buf);
+			lkey = tmpl->mr->lkey;
+		}
 		/* scat->addr must be able to store a pointer. */
 		assert(sizeof(scat->addr) >= sizeof(uintptr_t));
 		*scat = (struct mlx5_wqe_data_seg){
-			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
-								  uintptr_t)),
-			.byte_count = rte_cpu_to_be_32(DATA_LEN(buf)),
-			.lkey = tmpl->mr->lkey,
+			.addr = rte_cpu_to_be_64(addr),
+			.byte_count = rte_cpu_to_be_32(byte_count),
+			.lkey = lkey
 		};
 	}
 	rxq_data->rq_db = rwq.dbrec;
 	rxq_data->cqe_n = log2above(cq_info.cqe_cnt);
 	rxq_data->cq_ci = 0;
-	rxq_data->rq_ci = 0;
+	rxq_data->strd_ci = 0;
 	rxq_data->rq_pi = 0;
 	rxq_data->zip = (struct rxq_zip){
 		.ai = 0,
@@ -816,7 +1010,7 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 	rxq_data->cqn = cq_info.cqn;
 	rxq_data->cq_arm_sn = 0;
 	/* Update doorbell counter. */
-	rxq_data->rq_ci = (1 << rxq_data->elts_n) >> rxq_data->sges_n;
+	rxq_data->rq_ci = wqe_n >> rxq_data->sges_n;
 	rte_wmb();
 	*rxq_data->rq_db = rte_cpu_to_be_32(rxq_data->rq_ci);
 	DRV_LOG(DEBUG, "port %u rxq %u updated with %p", dev->data->port_id,
@@ -835,6 +1029,8 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
 		claim_zero(mlx5_glue->destroy_cq(tmpl->cq));
 	if (tmpl->channel)
 		claim_zero(mlx5_glue->destroy_comp_channel(tmpl->channel));
+	if (tmpl->mprq_mr)
+		mlx5_mr_release(tmpl->mprq_mr);
 	if (tmpl->mr)
 		mlx5_mr_release(tmpl->mr);
 	priv->verbs_alloc_ctx.type = MLX5_VERBS_ALLOC_TYPE_NONE;
@@ -867,6 +1063,8 @@ mlx5_rxq_ibv_get(struct rte_eth_dev *dev, uint16_t idx)
 	rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
 	if (rxq_ctrl->ibv) {
 		mlx5_mr_get(dev, rxq_data->mp);
+		if (rxq_mprq_enabled(rxq_data))
+			mlx5_mr_get(dev, rxq_data->mprq_mp);
 		rte_atomic32_inc(&rxq_ctrl->ibv->refcnt);
 		DRV_LOG(DEBUG, "port %u Verbs Rx queue %u: refcnt %d",
 			dev->data->port_id, rxq_ctrl->idx,
@@ -896,6 +1094,11 @@ mlx5_rxq_ibv_release(struct mlx5_rxq_ibv *rxq_ibv)
 	ret = mlx5_mr_release(rxq_ibv->mr);
 	if (!ret)
 		rxq_ibv->mr = NULL;
+	if (rxq_mprq_enabled(&rxq_ibv->rxq_ctrl->rxq)) {
+		ret = mlx5_mr_release(rxq_ibv->mprq_mr);
+		if (!ret)
+			rxq_ibv->mprq_mr = NULL;
+	}
 	DRV_LOG(DEBUG, "port %u Verbs Rx queue %u: refcnt %d",
 		rxq_ibv->rxq_ctrl->priv->dev->data->port_id,
 		rxq_ibv->rxq_ctrl->idx, rte_atomic32_read(&rxq_ibv->refcnt));
@@ -951,12 +1154,101 @@ mlx5_rxq_ibv_releasable(struct mlx5_rxq_ibv *rxq_ibv)
 }
 
 /**
+ * Callback function to initialize mbufs for Multi-Packet RQ.
+ */
+static inline void
+mlx5_mprq_mbuf_init(struct rte_mempool *mp, void *opaque_arg,
+		    void *_m, unsigned int i __rte_unused)
+{
+	struct rte_mbuf *m = _m;
+
+	rte_pktmbuf_init(mp, opaque_arg, _m, i);
+	m->buf_len =
+		(1 << MLX5_MPRQ_STRIDE_SZ_N) * (1 << MLX5_MPRQ_STRIDE_NUM_N);
+	rte_pktmbuf_reset_headroom(m);
+}
+
+/**
+ * Configure Rx queue as Multi-Packet RQ.
+ *
+ * @param rxq_ctrl
+ *   Pointer to RX queue structure.
+ * @param priv
+ *   Pointer to private structure.
+ * @param idx
+ *   RX queue index.
+ * @param desc
+ *   Number of descriptors to configure in queue.
+ *
+ * @return
+ *   0 on success, negative errno value on failure.
+ */
+static int
+rxq_configure_mprq(struct mlx5_rxq_ctrl *rxq_ctrl, uint16_t idx, uint16_t desc)
+{
+	struct priv *priv = rxq_ctrl->priv;
+	struct rte_eth_dev *dev = priv->dev;
+	struct mlx5_dev_config *config = &priv->config;
+	struct rte_mempool *mp;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	unsigned int buf_len;
+	unsigned int obj_size;
+
+	assert(rxq_ctrl->rxq.sges_n == 0);
+	rxq_ctrl->rxq.strd_sz_n =
+		MLX5_MPRQ_STRIDE_SZ_N - MLX5_MPRQ_MIN_STRIDE_SZ_N;
+	rxq_ctrl->rxq.strd_num_n =
+		MLX5_MPRQ_STRIDE_NUM_N - MLX5_MPRQ_MIN_STRIDE_NUM_N;
+	rxq_ctrl->rxq.strd_shift_en = MLX5_MPRQ_TWO_BYTE_SHIFT;
+	rxq_ctrl->rxq.mprq_max_memcpy_len = config->mprq_max_memcpy_len;
+	buf_len = (1 << MLX5_MPRQ_STRIDE_SZ_N) * (1 << MLX5_MPRQ_STRIDE_NUM_N) +
+		  RTE_PKTMBUF_HEADROOM;
+	obj_size = buf_len + sizeof(struct rte_mbuf);
+	snprintf(name, sizeof(name), "%s-mprq-%u", dev->data->name, idx);
+	/*
+	 * Allocate per-queue Mempool for Multi-Packet RQ.
+	 *
+	 * Received packets can be either memcpy'd or indirectly referenced.
+	 * In case of mbuf indirection, as it isn't possible to predict how
+	 * the buffers will be queued by the application, the exact number of
+	 * buffers needed can't be pre-allocated; instead, enough buffers are
+	 * speculatively prepared.
+	 *
+	 * In the data path, if this Mempool is depleted, the PMD memcpy's
+	 * received packets to buffers provided by the application (rxq->mp)
+	 * until this Mempool has free objects again.
+	 */
+	desc *= 4;
+	mp = rte_mempool_create(name, desc + MLX5_MPRQ_MP_CACHE_SZ,
+				obj_size, MLX5_MPRQ_MP_CACHE_SZ,
+				sizeof(struct rte_pktmbuf_pool_private),
+				NULL, NULL, NULL, NULL,
+				dev->device->numa_node,
+				MEMPOOL_F_SC_GET);
+	if (mp == NULL) {
+		DRV_LOG(ERR,
+			"port %u Rx queue %u: failed to allocate a mempool for"
+			" Multi-Packet RQ",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+
+	rte_pktmbuf_pool_init(mp, NULL);
+	rte_mempool_obj_iter(mp, mlx5_mprq_mbuf_init, NULL);
+	rxq_ctrl->rxq.mprq_mp = mp;
+	DRV_LOG(DEBUG, "port %u Rx queue %u: Multi-Packet RQ is enabled",
+		dev->data->port_id, idx);
+	return 0;
+}
+
+/**
  * Create a DPDK Rx queue.
  *
  * @param dev
  *   Pointer to Ethernet device.
  * @param idx
- *   TX queue index.
+ *   RX queue index.
  * @param desc
  *   Number of descriptors to configure in queue.
  * @param socket
@@ -978,8 +1270,9 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	 * Always allocate extra slots, even if eventually
 	 * the vector Rx will not be used.
 	 */
-	const uint16_t desc_n =
+	uint16_t desc_n =
 		desc + config->rx_vec_en * MLX5_VPMD_DESCS_PER_LOOP;
+	const int mprq_en = mlx5_check_mprq_support(dev) > 0;
 
 	tmpl = rte_calloc_socket("RXQ", 1,
 				 sizeof(*tmpl) +
@@ -989,13 +1282,35 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		rte_errno = ENOMEM;
 		return NULL;
 	}
+	tmpl->priv = priv;
 	tmpl->socket = socket;
 	if (priv->dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
-	/* Enable scattered packets support for this queue if necessary. */
+	/*
+	 * This Rx queue can be configured as a Multi-Packet RQ if all of the
+	 * following conditions are met:
+	 *  - MPRQ is enabled.
+	 *  - The number of descs is more than the number of strides.
+	 *  - max_rx_pkt_len is less than the size of a stride sparing headroom.
+	 *
+	 *  Otherwise, enable Rx scatter if necessary.
+	 */
 	assert(mb_len >= RTE_PKTMBUF_HEADROOM);
-	if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
-	    (mb_len - RTE_PKTMBUF_HEADROOM)) {
+	if (mprq_en &&
+	    desc >= (1U << MLX5_MPRQ_STRIDE_NUM_N) &&
+	    dev->data->dev_conf.rxmode.max_rx_pkt_len <=
+	    (1U << MLX5_MPRQ_STRIDE_SZ_N) - RTE_PKTMBUF_HEADROOM) {
+		int ret;
+
+		/* TODO: Rx scatter isn't supported yet. */
+		tmpl->rxq.sges_n = 0;
+		/* Trim the number of descs needed. */
+		desc >>= MLX5_MPRQ_STRIDE_NUM_N;
+		ret = rxq_configure_mprq(tmpl, idx, desc);
+		if (ret)
+			goto error;
+	} else if (dev->data->dev_conf.rxmode.max_rx_pkt_len <=
+		   (mb_len - RTE_PKTMBUF_HEADROOM)) {
 		tmpl->rxq.sges_n = 0;
 	} else if (conf->offloads & DEV_RX_OFFLOAD_SCATTER) {
 		unsigned int size =
@@ -1073,7 +1388,6 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	tmpl->rxq.rss_hash = !!priv->rss_conf.rss_hf &&
 		(!!(dev->data->dev_conf.rxmode.mq_mode & ETH_MQ_RX_RSS));
 	tmpl->rxq.port_id = dev->data->port_id;
-	tmpl->priv = priv;
 	tmpl->rxq.mp = mp;
 	tmpl->rxq.stats.idx = idx;
 	tmpl->rxq.elts_n = log2above(desc);
@@ -1146,6 +1460,27 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 	DRV_LOG(DEBUG, "port %u Rx queue %u: refcnt %d", dev->data->port_id,
 		rxq_ctrl->idx, rte_atomic32_read(&rxq_ctrl->refcnt));
 	if (rte_atomic32_dec_and_test(&rxq_ctrl->refcnt)) {
+		if (rxq_ctrl->rxq.mprq_mp != NULL) {
+			/* If an mbuf in the pool has an indirect mbuf attached
+			 * to it and it is still in use by the application,
+			 * destroying the Rx queue can spoil the packet. It is
+			 * unlikely but can happen if the application dynamically
+			 * creates and destroys queues while holding Rx packets.
+			 *
+			 * TODO: It is unavoidable for now because the Mempool
+			 * for Multi-Packet RQ isn't provided by the application
+			 * but managed by the PMD.
+			 */
+			if (!rte_mempool_full(rxq_ctrl->rxq.mprq_mp)) {
+				DRV_LOG(DEBUG,
+					"port %u Rx queue %u: "
+					"Mempool for Multi-Packet RQ is still in use",
+					dev->data->port_id, rxq_ctrl->idx);
+				return 1;
+			}
+			rte_mempool_free(rxq_ctrl->rxq.mprq_mp);
+			rxq_ctrl->rxq.mprq_mp = NULL;
+		}
 		LIST_REMOVE(rxq_ctrl, next);
 		rte_free(rxq_ctrl);
 		(*priv->rxqs)[idx] = NULL;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 461d7bdf6..d71cf405b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1840,7 +1840,8 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 	while (pkts_n) {
 		unsigned int idx = rq_ci & wqe_cnt;
-		volatile struct mlx5_wqe_data_seg *wqe = &(*rxq->wqes)[idx];
+		volatile struct mlx5_wqe_data_seg *wqe =
+			&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[idx];
 		struct rte_mbuf *rep = (*rxq->elts)[idx];
 		uint32_t rss_hash_res = 0;
 
@@ -1941,6 +1942,155 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 }
 
 /**
+ * DPDK callback for RX with Multi-Packet RQ support.
+ *
+ * @param dpdk_rxq
+ *   Generic pointer to RX queue structure.
+ * @param[out] pkts
+ *   Array to store received packets.
+ * @param pkts_n
+ *   Maximum number of packets in array.
+ *
+ * @return
+ *   Number of packets successfully received (<= pkts_n).
+ */
+uint16_t
+mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
+{
+	struct mlx5_rxq_data *rxq = dpdk_rxq;
+	const unsigned int strd_n =
+		1 << (rxq->strd_num_n + MLX5_MPRQ_MIN_STRIDE_NUM_N);
+	const unsigned int strd_sz =
+		1 << (rxq->strd_sz_n + MLX5_MPRQ_MIN_STRIDE_SZ_N);
+	const unsigned int strd_shift =
+		MLX5_MPRQ_STRIDE_SHIFT_BYTE * rxq->strd_shift_en;
+	const unsigned int cq_mask = (1 << rxq->cqe_n) - 1;
+	const unsigned int wq_mask = (1 << rxq->elts_n) - 1;
+	volatile struct mlx5_cqe *cqe = &(*rxq->cqes)[rxq->cq_ci & cq_mask];
+	unsigned int i = 0;
+	uint16_t rq_ci = rxq->rq_ci;
+	uint16_t strd_idx = rxq->strd_ci;
+	struct rte_mbuf *buf = (*rxq->elts)[rq_ci & wq_mask];
+
+	while (i < pkts_n) {
+		struct rte_mbuf *pkt;
+		int ret;
+		unsigned int len;
+		uint16_t consumed_strd;
+		uint32_t offset;
+		uint32_t byte_cnt;
+		uint32_t rss_hash_res = 0;
+
+		if (strd_idx == strd_n) {
+			/* Replace WQE only if the buffer is still in use. */
+			if (unlikely(rte_mbuf_refcnt_read(buf) > 1)) {
+				struct rte_mbuf *rep = rxq->mprq_repl;
+				volatile struct mlx5_wqe_data_seg *wqe =
+					&((volatile struct mlx5_wqe_mprq *)
+					  rxq->wqes)[rq_ci & wq_mask].dseg;
+				uintptr_t addr;
+
+				/* Replace mbuf. */
+				(*rxq->elts)[rq_ci & wq_mask] = rep;
+				PORT(rep) = PORT(buf);
+				/* Release the old buffer. */
+				if (__rte_mbuf_refcnt_update(buf, -1) == 0) {
+					rte_mbuf_refcnt_set(buf, 1);
+					rte_mbuf_raw_free(buf);
+				}
+				/* Replace WQE. */
+				addr = rte_pktmbuf_mtod(rep, uintptr_t);
+				wqe->addr = rte_cpu_to_be_64(addr);
+				/* Stash a mbuf for next replacement. */
+				if (likely(!rte_mempool_get(rxq->mprq_mp,
+							    (void **)&rep)))
+					rxq->mprq_repl = rep;
+				else
+					rxq->mprq_repl = NULL;
+			}
+			/* Advance to the next WQE. */
+			strd_idx = 0;
+			++rq_ci;
+			buf = (*rxq->elts)[rq_ci & wq_mask];
+		}
+		cqe = &(*rxq->cqes)[rxq->cq_ci & cq_mask];
+		ret = mlx5_rx_poll_len(rxq, cqe, cq_mask, &rss_hash_res);
+		if (!ret)
+			break;
+		if (unlikely(ret == -1)) {
+			/* RX error, packet is likely too large. */
+			++rxq->stats.idropped;
+			continue;
+		}
+		byte_cnt = ret;
+		offset = strd_idx * strd_sz + strd_shift;
+		consumed_strd = (byte_cnt & MLX5_MPRQ_STRIDE_NUM_MASK) >>
+				MLX5_MPRQ_STRIDE_NUM_SHIFT;
+		strd_idx += consumed_strd;
+		if (byte_cnt & MLX5_MPRQ_FILLER_MASK)
+			continue;
+		pkt = rte_pktmbuf_alloc(rxq->mp);
+		if (unlikely(pkt == NULL)) {
+			++rxq->stats.rx_nombuf;
+			break;
+		}
+		len = (byte_cnt & MLX5_MPRQ_LEN_MASK) >> MLX5_MPRQ_LEN_SHIFT;
+		assert((int)len >= (rxq->crc_present << 2));
+		if (rxq->crc_present)
+			len -= ETHER_CRC_LEN;
+		/*
+		 * Memcpy packets to the target mbuf if:
+		 * - The size of packet is smaller than MLX5_MPRQ_MEMCPY_LEN.
+		 * - Out of buffer in the Mempool for Multi-Packet RQ.
+		 */
+		if (len <= rxq->mprq_max_memcpy_len || rxq->mprq_repl == NULL) {
+			uintptr_t base = rte_pktmbuf_mtod(buf, uintptr_t);
+
+			rte_memcpy(rte_pktmbuf_mtod(pkt, void *),
+				   (void *)(base + offset), len);
+			/* Initialize the offload flag. */
+			pkt->ol_flags = 0;
+		} else {
+			/*
+			 * IND_ATTACHED_MBUF will be set to pkt->ol_flags when
+			 * attaching the mbuf and more offload flags will be
+			 * added below by calling rxq_cq_to_mbuf(). Other fields
+			 * will be overwritten.
+			 */
+			rte_pktmbuf_attach_at(pkt, buf, offset,
+					      consumed_strd * strd_sz);
+			assert(pkt->ol_flags == IND_ATTACHED_MBUF);
+			rte_pktmbuf_reset_headroom(pkt);
+		}
+		rxq_cq_to_mbuf(rxq, pkt, cqe, rss_hash_res);
+		PKT_LEN(pkt) = len;
+		DATA_LEN(pkt) = len;
+#ifdef MLX5_PMD_SOFT_COUNTERS
+		/* Increment bytes counter. */
+		rxq->stats.ibytes += PKT_LEN(pkt);
+#endif
+		/* Return packet. */
+		*(pkts++) = pkt;
+		++i;
+	}
+	/* Update the consumer index. */
+	rxq->rq_pi += i;
+	rxq->strd_ci = strd_idx;
+	rte_io_wmb();
+	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
+	if (rq_ci != rxq->rq_ci) {
+		rxq->rq_ci = rq_ci;
+		rte_io_wmb();
+		*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
+	}
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	/* Increment packets counter. */
+	rxq->stats.ipackets += i;
+#endif
+	return i;
+}
+
+/**
  * Dummy DPDK callback for TX.
  *
  * This function is used to temporarily replace the real callback during
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index f5af43735..f642fb29f 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -86,18 +86,25 @@ struct mlx5_rxq_data {
 	unsigned int elts_n:4; /* Log 2 of Mbufs. */
 	unsigned int rss_hash:1; /* RSS hash result is enabled. */
 	unsigned int mark:1; /* Marked flow available on the queue. */
-	unsigned int :15; /* Remaining bits. */
+	unsigned int strd_sz_n:3; /* Log 2 of stride size. */
+	unsigned int strd_num_n:4; /* Log 2 of the number of strides. */
+	unsigned int strd_shift_en:1; /* Enable a 2-byte shift on a stride. */
+	unsigned int :8; /* Remaining bits. */
 	volatile uint32_t *rq_db;
 	volatile uint32_t *cq_db;
 	uint16_t port_id;
 	uint16_t rq_ci;
+	uint16_t strd_ci; /* Stride index in a WQE for Multi-Packet RQ. */
 	uint16_t rq_pi;
 	uint16_t cq_ci;
-	volatile struct mlx5_wqe_data_seg(*wqes)[];
+	uint16_t mprq_max_memcpy_len; /* Maximum size of packet to memcpy. */
+	volatile void *wqes;
 	volatile struct mlx5_cqe(*cqes)[];
 	struct rxq_zip zip; /* Compressed context. */
 	struct rte_mbuf *(*elts)[];
 	struct rte_mempool *mp;
+	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
+	struct rte_mbuf *mprq_repl; /* Stashed mbuf for replenish. */
 	struct mlx5_rxq_stats stats;
 	uint64_t mbuf_initializer; /* Default rearm_data for vectorized Rx. */
 	struct rte_mbuf fake_mbuf; /* elts padding for vectorized Rx. */
@@ -115,6 +122,7 @@ struct mlx5_rxq_ibv {
 	struct ibv_wq *wq; /* Work Queue. */
 	struct ibv_comp_channel *channel;
 	struct mlx5_mr *mr; /* Memory Region (for mp). */
+	struct mlx5_mr *mprq_mr; /* Memory Region (for mprq_mp). */
 };
 
 /* RX queue control descriptor. */
@@ -213,6 +221,8 @@ struct mlx5_txq_ctrl {
 extern uint8_t rss_hash_default_key[];
 extern const size_t rss_hash_default_key_len;
 
+int mlx5_check_mprq_support(struct rte_eth_dev *dev);
+int mlx5_mprq_enabled(struct rte_eth_dev *dev);
 void mlx5_rxq_cleanup(struct mlx5_rxq_ctrl *rxq_ctrl);
 int mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 			unsigned int socket, const struct rte_eth_rxconf *conf,
@@ -236,6 +246,7 @@ int mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_releasable(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_verify(struct rte_eth_dev *dev);
 int rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl);
+int rxq_alloc_mprq_buf(struct mlx5_rxq_ctrl *rxq_ctrl);
 struct mlx5_ind_table_ibv *mlx5_ind_table_ibv_new(struct rte_eth_dev *dev,
 						  uint16_t queues[],
 						  uint16_t queues_n);
@@ -291,6 +302,8 @@ uint16_t mlx5_tx_burst_mpw_inline(void *dpdk_txq, struct rte_mbuf **pkts,
 uint16_t mlx5_tx_burst_empw(void *dpdk_txq, struct rte_mbuf **pkts,
 			    uint16_t pkts_n);
 uint16_t mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n);
+uint16_t mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts,
+			    uint16_t pkts_n);
 uint16_t removed_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts,
 			  uint16_t pkts_n);
 uint16_t removed_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts,
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 257d7b11c..b4d738147 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -278,6 +278,8 @@ mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq)
 	struct mlx5_rxq_ctrl *ctrl =
 		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
 
+	if (mlx5_mprq_enabled(ctrl->priv->dev))
+		return -ENOTSUP;
 	if (!ctrl->priv->config.rx_vec_en || rxq->sges_n != 0)
 		return -ENOTSUP;
 	return 1;
@@ -300,6 +302,8 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev)
 
 	if (!priv->config.rx_vec_en)
 		return -ENOTSUP;
+	if (mlx5_mprq_enabled(dev))
+		return -ENOTSUP;
 	/* All the configured queues should support. */
 	for (i = 0; i < priv->rxqs_n; ++i) {
 		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index 44856bbff..b181d04cf 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -87,7 +87,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
 	const uint16_t q_mask = q_n - 1;
 	uint16_t elts_idx = rxq->rq_ci & q_mask;
 	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
-	volatile struct mlx5_wqe_data_seg *wq = &(*rxq->wqes)[elts_idx];
+	volatile struct mlx5_wqe_data_seg *wq =
+		&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
 	unsigned int i;
 
 	assert(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v2 5/6] net/mlx5: release Tx queue resource earlier than Rx
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                     ` (3 preceding siblings ...)
  2018-04-02 18:50   ` [PATCH v2 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
@ 2018-04-02 18:50   ` Yongseok Koh
  2018-04-02 18:50   ` [PATCH v2 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
  5 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

Multi-Packet RQ uses mbuf indirection, and the direct mbufs come from the
PMD's private Mempool (rxq->mprq_mp). To release this Mempool properly, the
Tx completion array (txq->elts) must be emptied beforehand, so Tx queues
are now released ahead of Rx queues.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index aba44746f..51169e6ac 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -189,14 +189,6 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 	/* Prevent crashes when queues are still in use. */
 	dev->rx_pkt_burst = removed_rx_burst;
 	dev->tx_pkt_burst = removed_tx_burst;
-	if (priv->rxqs != NULL) {
-		/* XXX race condition if mlx5_rx_burst() is still running. */
-		usleep(1000);
-		for (i = 0; (i != priv->rxqs_n); ++i)
-			mlx5_rxq_release(dev, i);
-		priv->rxqs_n = 0;
-		priv->rxqs = NULL;
-	}
 	if (priv->txqs != NULL) {
 		/* XXX race condition if mlx5_tx_burst() is still running. */
 		usleep(1000);
@@ -205,6 +197,14 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 		priv->txqs_n = 0;
 		priv->txqs = NULL;
 	}
+	if (priv->rxqs != NULL) {
+		/* XXX race condition if mlx5_rx_burst() is still running. */
+		usleep(1000);
+		for (i = 0; (i != priv->rxqs_n); ++i)
+			mlx5_rxq_release(dev, i);
+		priv->rxqs_n = 0;
+		priv->rxqs = NULL;
+	}
 	if (priv->pd != NULL) {
 		assert(priv->ctx != NULL);
 		claim_zero(mlx5_glue->dealloc_pd(priv->pd));
-- 
2.11.0


* [PATCH v2 6/6] app/testpmd: conserve mbuf indirection flag
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                     ` (4 preceding siblings ...)
  2018-04-02 18:50   ` [PATCH v2 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
@ 2018-04-02 18:50   ` Yongseok Koh
  5 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-02 18:50 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro,
	olivier.matz
  Cc: dev, Yongseok Koh

If a PMD delivers Rx packets with mbuf indirection, the indirection flag in
ol_flags must not be overwritten. For the mlx5 PMD, Rx packets can be
indirect mbufs when Multi-Packet RQ is enabled.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 2 ++
 app/test-pmd/macfwd.c   | 2 ++
 app/test-pmd/macswap.c  | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5f5ab64aa..1dd4d7130 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -770,6 +770,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (RTE_MBUF_INDIRECT(m))
+			tx_ol_flags |= IND_ATTACHED_MBUF;
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..7e096ee78 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,8 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (RTE_MBUF_INDIRECT(mb))
+			ol_flags |= IND_ATTACHED_MBUF;
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..39f96c1e0 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,8 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (RTE_MBUF_INDIRECT(mb))
+			ol_flags |= IND_ATTACHED_MBUF;
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-02 18:50   ` [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
@ 2018-04-03  8:26     ` Olivier Matz
  2018-04-04  0:12       ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-03  8:26 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro, dev

Hi,

On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> When attaching a mbuf, indirect mbuf has to point to start of buffer of
> direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> flexible. Indirect mbuf can point to any part of direct mbuf by calling
> rte_pktmbuf_attach_at().
> 
> Possible use-cases could be:
> - If a packet has multiple layers of encapsulation, multiple indirect
>   buffers can reference different layers of the encapsulated packet.
> - A large direct mbuf can even contain multiple packets in series and
>   each packet can be referenced by multiple mbuf indirections.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

I think the current API is already able to do what you want.

1/ Here is a mbuf m with its data

               off
               <-->
                      len
          +----+   <---------->
          |    |
        +-|----v----------------------+
        | |    -----------------------|
m       | buf  |    XXXXXXXXXXX      ||
        |      -----------------------|
        +-----------------------------+


2/ clone m:

  c = rte_pktmbuf_alloc(pool);
  rte_pktmbuf_attach(c, m);

  Note that c has its own offset and length fields.


               off
               <-->
                      len
          +----+   <---------->
          |    |
        +-|----v----------------------+
        | |    -----------------------|
m       | buf  |    XXXXXXXXXXX      ||
        |      -----------------------|
        +------^----------------------+
               |
          +----+
indirect  |
        +-|---------------------------+
        | |    -----------------------|
c       | buf  |                     ||
        |      -----------------------|
        +-----------------------------+

                off    len
                <--><---------->


3/ remove some data from c without changing m

   rte_pktmbuf_adj(c, 10)   // at head
   rte_pktmbuf_trim(c, 10)  // at tail


Please let me know if it fits your needs.

Regards,
Olivier


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-03  8:26     ` Olivier Matz
@ 2018-04-04  0:12       ` Yongseok Koh
  2018-04-09 16:04         ` Olivier Matz
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-04  0:12 UTC (permalink / raw)
  To: Olivier Matz
  Cc: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro, dev

On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> Hi,
> 
> On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > rte_pktmbuf_attach_at().
> > 
> > Possible use-cases could be:
> > - If a packet has multiple layers of encapsulation, multiple indirect
> >   buffers can reference different layers of the encapsulated packet.
> > - A large direct mbuf can even contain multiple packets in series and
> >   each packet can be referenced by multiple mbuf indirections.
> > 
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> 
> I think the current API is already able to do what you want.
> 
> 1/ Here is a mbuf m with its data
> 
>                off
>                <-->
>                       len
>           +----+   <---------->
>           |    |
>         +-|----v----------------------+
>         | |    -----------------------|
> m       | buf  |    XXXXXXXXXXX      ||
>         |      -----------------------|
>         +-----------------------------+
> 
> 
> 2/ clone m:
> 
>   c = rte_pktmbuf_alloc(pool);
>   rte_pktmbuf_attach(c, m);
> 
>   Note that c has its own offset and length fields.
> 
> 
>                off
>                <-->
>                       len
>           +----+   <---------->
>           |    |
>         +-|----v----------------------+
>         | |    -----------------------|
> m       | buf  |    XXXXXXXXXXX      ||
>         |      -----------------------|
>         +------^----------------------+
>                |
>           +----+
> indirect  |
>         +-|---------------------------+
>         | |    -----------------------|
> c       | buf  |                     ||
>         |      -----------------------|
>         +-----------------------------+
> 
>                 off    len
>                 <--><---------->
> 
> 
> 3/ remove some data from c without changing m
> 
>    rte_pktmbuf_adj(c, 10)   // at head
>    rte_pktmbuf_trim(c, 10)  // at tail
> 
> 
> Please let me know if it fits your needs.

No, it doesn't.

Trimming head and tail with the current APIs removes data and makes the
space available. Adjusting the packet head means giving more headroom, not
shifting the buffer itself. If m has two indirect mbufs (c1 and c2) that
point to different offsets in m,

rte_pktmbuf_adj(c1, 10);
rte_pktmbuf_adj(c2, 20);

then the owner of c2 regards the first (off+20)B as available headroom. If
it wants to attach an outer header, it will overwrite the headroom even
though the owner of c1 is still accessing it. Instead, another mbuf (h1)
for the outer header should be linked by h1->next = c2.

If c1 and c2 are attached by shifting the buffer address via buf_off,
which actually shrinks the headroom, this case can be handled properly.

And another use-case (my actual use-case) is to make a large mbuf carry
multiple packets in series. AFAIK, this will also be helpful for some FPGA
NICs, because they transfer multiple packets into a single large buffer to
reduce PCIe overhead for small-packet traffic, like the Multi-Packet Rx of
mlx5 does. Otherwise, packets have to be memcpy'd to regular mbufs one by
one instead of being referenced by indirection.

Does this make sense?


Thanks,
Yongseok


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-04  0:12       ` Yongseok Koh
@ 2018-04-09 16:04         ` Olivier Matz
  2018-04-10  1:59           ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-09 16:04 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro, dev

Hi Yongseok,

On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > Hi,
> > 
> > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > rte_pktmbuf_attach_at().
> > > 
> > > Possible use-cases could be:
> > > - If a packet has multiple layers of encapsulation, multiple indirect
> > >   buffers can reference different layers of the encapsulated packet.
> > > - A large direct mbuf can even contain multiple packets in series and
> > >   each packet can be referenced by multiple mbuf indirections.
> > > 
> > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > 
> > I think the current API is already able to do what you want.
> > 
> > 1/ Here is a mbuf m with its data
> > 
> >                off
> >                <-->
> >                       len
> >           +----+   <---------->
> >           |    |
> >         +-|----v----------------------+
> >         | |    -----------------------|
> > m       | buf  |    XXXXXXXXXXX      ||
> >         |      -----------------------|
> >         +-----------------------------+
> > 
> > 
> > 2/ clone m:
> > 
> >   c = rte_pktmbuf_alloc(pool);
> >   rte_pktmbuf_attach(c, m);
> > 
> >   Note that c has its own offset and length fields.
> > 
> > 
> >                off
> >                <-->
> >                       len
> >           +----+   <---------->
> >           |    |
> >         +-|----v----------------------+
> >         | |    -----------------------|
> > m       | buf  |    XXXXXXXXXXX      ||
> >         |      -----------------------|
> >         +------^----------------------+
> >                |
> >           +----+
> > indirect  |
> >         +-|---------------------------+
> >         | |    -----------------------|
> > c       | buf  |                     ||
> >         |      -----------------------|
> >         +-----------------------------+
> > 
> >                 off    len
> >                 <--><---------->
> > 
> > 
> > 3/ remove some data from c without changing m
> > 
> >    rte_pktmbuf_adj(c, 10)   // at head
> >    rte_pktmbuf_trim(c, 10)  // at tail
> > 
> > 
> > Please let me know if it fits your needs.
> 
> No, it doesn't.
> 
> Trimming head and tail with the current APIs removes data and make the space
> available. Adjusting packet head means giving more headroom, not shifting the
> buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> difference offsets in m,
> 
> rte_pktmbuf_adj(c1, 10);
> rte_pktmbuf_adj(c2, 20);
> 
> then the owner of c2 regard the first (off+20)B as available headroom. If it
> wants to attach outer header, it will overwrite the headroom even though the
> owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> header should be linked by h1->next = c2.

Yes, after these operations c1, c2 and m should become read-only. So, to
prepend headers, another mbuf has to be inserted before as you suggest. It
is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
length) that will:
  - alloc and attach indirect mbuf for each segment of m that is
    in the range [offset : length+offset].
  - prepend an empty and writable mbuf for the headers

> If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> which actually shrink the headroom, this case can be properly handled.

What do you mean by properly handled?

Yes, prepending data or adding data in the indirect mbuf won't override
the direct mbuf. But prepending data or adding data in the direct mbuf m
won't be protected.

From an application point of view, indirect mbufs, or direct mbufs that
have refcnt != 1, should be both considered as read-only because they
may share their data. How an application can know if the data is shared
or not?

Maybe we need a flag to differentiate mbufs that are read-only
(something like SHARED_DATA, or simply READONLY). In your case, if my
understanding is correct, you want to have indirect mbufs with RW data.


> And another use-case (this is my actual use-case) is to make a large mbuf have
> multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> because it transfers multiple packets to a single large buffer to reduce PCIe
> overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> indirect referencing.
> 
> Does this make sense?

I understand the need.

Another option would be to make the mbuf->buffer point to an external
buffer (not inside the direct mbuf). This would require to add a
mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
a quick overview.

[1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf

The advantage is that it does not require the large data to be inside a
mbuf (requiring a mbuf structure before the buffer, and requiring to be
allocated from a mempool). On the other hand, it is maybe more complex
to implement compared to your solution.


Regards,
Olivier


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-09 16:04         ` Olivier Matz
@ 2018-04-10  1:59           ` Yongseok Koh
  2018-04-11  0:25             ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-10  1:59 UTC (permalink / raw)
  To: Olivier Matz
  Cc: wenzhuo.lu, jingjing.wu, adrien.mazarguil, nelio.laranjeiro, dev

On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> Hi Yongseok,
> 
> On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > Hi,
> > > 
> > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > rte_pktmbuf_attach_at().
> > > > 
> > > > Possible use-cases could be:
> > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > >   buffers can reference different layers of the encapsulated packet.
> > > > - A large direct mbuf can even contain multiple packets in series and
> > > >   each packet can be referenced by multiple mbuf indirections.
> > > > 
> > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > 
> > > I think the current API is already able to do what you want.
> > > 
> > > 1/ Here is a mbuf m with its data
> > > 
> > >                off
> > >                <-->
> > >                       len
> > >           +----+   <---------->
> > >           |    |
> > >         +-|----v----------------------+
> > >         | |    -----------------------|
> > > m       | buf  |    XXXXXXXXXXX      ||
> > >         |      -----------------------|
> > >         +-----------------------------+
> > > 
> > > 
> > > 2/ clone m:
> > > 
> > >   c = rte_pktmbuf_alloc(pool);
> > >   rte_pktmbuf_attach(c, m);
> > > 
> > >   Note that c has its own offset and length fields.
> > > 
> > > 
> > >                off
> > >                <-->
> > >                       len
> > >           +----+   <---------->
> > >           |    |
> > >         +-|----v----------------------+
> > >         | |    -----------------------|
> > > m       | buf  |    XXXXXXXXXXX      ||
> > >         |      -----------------------|
> > >         +------^----------------------+
> > >                |
> > >           +----+
> > > indirect  |
> > >         +-|---------------------------+
> > >         | |    -----------------------|
> > > c       | buf  |                     ||
> > >         |      -----------------------|
> > >         +-----------------------------+
> > > 
> > >                 off    len
> > >                 <--><---------->
> > > 
> > > 
> > > 3/ remove some data from c without changing m
> > > 
> > >    rte_pktmbuf_adj(c, 10)   // at head
> > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > 
> > > 
> > > Please let me know if it fits your needs.
> > 
> > No, it doesn't.
> > 
> > Trimming head and tail with the current APIs removes data and make the space
> > available. Adjusting packet head means giving more headroom, not shifting the
> > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > difference offsets in m,
> > 
> > rte_pktmbuf_adj(c1, 10);
> > rte_pktmbuf_adj(c2, 20);
> > 
> > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > wants to attach outer header, it will overwrite the headroom even though the
> > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > header should be linked by h1->next = c2.
> 
> Yes, after these operations c1, c2 and m should become read-only. So, to
> prepend headers, another mbuf has to be inserted before as you suggest. It
> is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> length) that will:
>   - alloc and attach indirect mbuf for each segment of m that is
>     in the range [offset : length+offset].
>   - prepend an empty and writable mbuf for the headers
> 
> > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > which actually shrink the headroom, this case can be properly handled.
> 
> What do you mean by properly handled?
> 
> Yes, prepending data or adding data in the indirect mbuf won't override
> the direct mbuf. But prepending data or adding data in the direct mbuf m
> won't be protected.
> 
> From an application point of view, indirect mbufs, or direct mbufs that
> have refcnt != 1, should be both considered as read-only because they
> may share their data. How an application can know if the data is shared
> or not?
> 
> Maybe we need a flag to differentiate mbufs that are read-only
> (something like SHARED_DATA, or simply READONLY). In your case, if my
> understanding is correct, you want to have indirect mbufs with RW data.

Agreed that an indirect mbuf must be treated as read-only; then the
current code is enough to handle that use-case.

> > And another use-case (this is my actual use-case) is to make a large mbuf have
> > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > because it transfers multiple packets to a single large buffer to reduce PCIe
> > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > indirect referencing.
> > 
> > Does this make sense?
> 
> I understand the need.
> 
> Another option would be to make the mbuf->buffer point to an external
> buffer (not inside the direct mbuf). This would require to add a
> mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> a quick overview.
> 
> [1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> 
> The advantage is that it does not require the large data to be inside a
> mbuf (requiring a mbuf structure before the buffer, and requiring to be
> allocated from a mempool). On the other hand, it is maybe more complex
> to implement compared to your solution.

I knew about the slides you presented and, frankly, I had considered that
option at first. But even with that option, metadata to store the refcnt
would have to be allocated and managed anyway; the kernel likewise keeps
skb_shared_info at the end of the data segment. Even though it could use a
smaller metadata structure, I wanted to make full use of the existing
framework because it is less complex, as you mentioned. Given that you
presented the idea of an external data buffer in 2016 and there haven't
been many follow-up discussions/activities since, I thought the demand
wasn't big yet, so I wanted to keep this patch simpler. I personally think
we can take up the idea of an external data segment when more demand comes
from users, as it would be a huge change and may break the current
ABI/API. When the day comes, I'll gladly participate in the discussions
and write code for it if I can be helpful.

Do you think this patch is okay for now?


Thanks for your comments,
Yongseok


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-10  1:59           ` Yongseok Koh
@ 2018-04-11  0:25             ` Ananyev, Konstantin
  2018-04-11  5:33               ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-11  0:25 UTC (permalink / raw)
  To: Yongseok Koh, Olivier Matz
  Cc: Lu, Wenzhuo, Wu, Jingjing, adrien.mazarguil, nelio.laranjeiro, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yongseok Koh
> Sent: Tuesday, April 10, 2018 2:59 AM
> To: Olivier Matz <olivier.matz@6wind.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; adrien.mazarguil@6wind.com;
> nelio.laranjeiro@6wind.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
> 
> On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > Hi Yongseok,
> >
> > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > Hi,
> > > >
> > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > rte_pktmbuf_attach_at().
> > > > >
> > > > > Possible use-cases could be:
> > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > >
> > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > >
> > > > I think the current API is already able to do what you want.
> > > >
> > > > 1/ Here is a mbuf m with its data
> > > >
> > > >                off
> > > >                <-->
> > > >                       len
> > > >           +----+   <---------->
> > > >           |    |
> > > >         +-|----v----------------------+
> > > >         | |    -----------------------|
> > > > m       | buf  |    XXXXXXXXXXX      ||
> > > >         |      -----------------------|
> > > >         +-----------------------------+
> > > >
> > > >
> > > > 2/ clone m:
> > > >
> > > >   c = rte_pktmbuf_alloc(pool);
> > > >   rte_pktmbuf_attach(c, m);
> > > >
> > > >   Note that c has its own offset and length fields.
> > > >
> > > >
> > > >                off
> > > >                <-->
> > > >                       len
> > > >           +----+   <---------->
> > > >           |    |
> > > >         +-|----v----------------------+
> > > >         | |    -----------------------|
> > > > m       | buf  |    XXXXXXXXXXX      ||
> > > >         |      -----------------------|
> > > >         +------^----------------------+
> > > >                |
> > > >           +----+
> > > > indirect  |
> > > >         +-|---------------------------+
> > > >         | |    -----------------------|
> > > > c       | buf  |                     ||
> > > >         |      -----------------------|
> > > >         +-----------------------------+
> > > >
> > > >                 off    len
> > > >                 <--><---------->
> > > >
> > > >
> > > > 3/ remove some data from c without changing m
> > > >
> > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > >
> > > >
> > > > Please let me know if it fits your needs.
> > >
> > > No, it doesn't.
> > >
> > > Trimming head and tail with the current APIs removes data and make the space
> > > available. Adjusting packet head means giving more headroom, not shifting the
> > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > difference offsets in m,
> > >
> > > rte_pktmbuf_adj(c1, 10);
> > > rte_pktmbuf_adj(c2, 20);
> > >
> > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > wants to attach outer header, it will overwrite the headroom even though the
> > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > header should be linked by h1->next = c2.
> >
> > Yes, after these operations c1, c2 and m should become read-only. So, to
> > prepend headers, another mbuf has to be inserted before as you suggest. It
> > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > length) that will:
> >   - alloc and attach indirect mbuf for each segment of m that is
> >     in the range [offset : length+offset].
> >   - prepend an empty and writable mbuf for the headers
> >
> > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > which actually shrink the headroom, this case can be properly handled.
> >
> > What do you mean by properly handled?
> >
> > Yes, prepending data or adding data in the indirect mbuf won't override
> > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > won't be protected.
> >
> > From an application point of view, indirect mbufs, or direct mbufs that
> > have refcnt != 1, should be both considered as read-only because they
> > may share their data. How an application can know if the data is shared
> > or not?
> >
> > Maybe we need a flag to differentiate mbufs that are read-only
> > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > understanding is correct, you want to have indirect mbufs with RW data.
> 
> Agree that indirect mbuf must be treated as read-only, Then the current code is
> enough to handle that use-case.
> 
> > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > indirect referencing.

But just to make the HW receive multiple packets into one mbuf,
data_off inside the indirect mbuf should be enough, correct?
As I understand, what you'd like to achieve with this new field is the
ability to manipulate packet boundaries after Rx, probably at an upper layer.
As Olivier pointed out above, that doesn't sound like a safe approach, as you have
multiple indirect mbufs trying to modify the same direct buffer.
Though if you really need to do that, why can't it be achieved by updating the buf_len and priv_size
fields of the indirect mbufs, straight after attach()?
Konstantin
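[Editor's sketch of the alternative suggested above: narrowing the clone's view
right after attach() by updating its own offset/length fields, instead of adding
a new buf_off field. This uses a minimal stand-in struct, not the real rte_mbuf
layout, so field names beyond buf_addr/buf_len/data_off/data_len are simplified.]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the relevant rte_mbuf fields (not the real layout). */
struct mbuf {
	char    *buf_addr;   /* start of the data buffer */
	uint16_t buf_len;    /* length of the data buffer */
	uint16_t data_off;   /* start of valid data within the buffer */
	uint16_t data_len;   /* length of valid data */
};

/* After a plain attach(), the clone shares the whole direct buffer... */
static void attach(struct mbuf *clone, const struct mbuf *direct)
{
	clone->buf_addr = direct->buf_addr;
	clone->buf_len  = direct->buf_len;
	clone->data_off = direct->data_off;
	clone->data_len = direct->data_len;
}

/* ...then narrow the clone's view to one packet's slice by updating
 * data_off and capping buf_len, as suggested, so the clone cannot
 * reach outside its packet even when headroom is "available". */
static void narrow(struct mbuf *clone, uint16_t pkt_off, uint16_t pkt_len)
{
	clone->data_off = pkt_off;
	clone->data_len = pkt_len;
	clone->buf_len  = (uint16_t)(pkt_off + pkt_len);
}
```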

> > >
> > > Does this make sense?
> >
> > I understand the need.
> >
> > Another option would be to make the mbuf->buffer point to an external
> > buffer (not inside the direct mbuf). This would require to add a
> > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > a quick overview.
> >
> > [1]
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> -Session05-OlivierMatz-
> Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> >
> > The advantage is that it does not require the large data to be inside a
> > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > allocated from a mempool). On the other hand, it is maybe more complex
> > to implement compared to your solution.
> 
> I knew that you presented the slides and frankly, I had considered that option
> at first. But even with that option, metadata to store refcnt should also be
> allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> end of the data segment. Even though it could have smaller metadata structure,
> I just wanted to make full use of the existing framework because it is less
> complex as you mentioned. Given that you presented the idea of external data
> buffer in 2016 and there hasn't been many follow-up discussions/activities so
> far, I thought the demand isn't so big yet thus I wanted to make this patch
> simpler.  I personally think that we can take the idea of external data seg when
> more demands come from users in the future as it would be a huge change and may
> break current ABI/API. When the day comes, I'll gladly participate in the
> discussions and write codes for it if I can be helpful.
> 
> Do you think this patch is okay for now?
> 
> 
> Thanks for your comments,
> Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-11  0:25             ` Ananyev, Konstantin
@ 2018-04-11  5:33               ` Yongseok Koh
  2018-04-11 11:39                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-11  5:33 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Olivier Matz, Lu, Wenzhuo, Wu, Jingjing, Adrien Mazarguil,
	Nélio Laranjeiro, dev

On Tue, Apr 10, 2018 at 05:25:31PM -0700, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yongseok Koh
> > Sent: Tuesday, April 10, 2018 2:59 AM
> > To: Olivier Matz <olivier.matz@6wind.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; adrien.mazarguil@6wind.com;
> > nelio.laranjeiro@6wind.com; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
> > 
> > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > Hi Yongseok,
> > >
> > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > Hi,
> > > > >
> > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > rte_pktmbuf_attach_at().
> > > > > >
> > > > > > Possible use-cases could be:
> > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > >
> > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > >
> > > > > I think the current API is already able to do what you want.
> > > > >
> > > > > 1/ Here is a mbuf m with its data
> > > > >
> > > > >                off
> > > > >                <-->
> > > > >                       len
> > > > >           +----+   <---------->
> > > > >           |    |
> > > > >         +-|----v----------------------+
> > > > >         | |    -----------------------|
> > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > >         |      -----------------------|
> > > > >         +-----------------------------+
> > > > >
> > > > >
> > > > > 2/ clone m:
> > > > >
> > > > >   c = rte_pktmbuf_alloc(pool);
> > > > >   rte_pktmbuf_attach(c, m);
> > > > >
> > > > >   Note that c has its own offset and length fields.
> > > > >
> > > > >
> > > > >                off
> > > > >                <-->
> > > > >                       len
> > > > >           +----+   <---------->
> > > > >           |    |
> > > > >         +-|----v----------------------+
> > > > >         | |    -----------------------|
> > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > >         |      -----------------------|
> > > > >         +------^----------------------+
> > > > >                |
> > > > >           +----+
> > > > > indirect  |
> > > > >         +-|---------------------------+
> > > > >         | |    -----------------------|
> > > > > c       | buf  |                     ||
> > > > >         |      -----------------------|
> > > > >         +-----------------------------+
> > > > >
> > > > >                 off    len
> > > > >                 <--><---------->
> > > > >
> > > > >
> > > > > 3/ remove some data from c without changing m
> > > > >
> > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > >
> > > > >
> > > > > Please let me know if it fits your needs.
> > > >
> > > > No, it doesn't.
> > > >
> > > > Trimming head and tail with the current APIs removes data and make the space
> > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > difference offsets in m,
> > > >
> > > > rte_pktmbuf_adj(c1, 10);
> > > > rte_pktmbuf_adj(c2, 20);
> > > >
> > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > header should be linked by h1->next = c2.
> > >
> > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > length) that will:
> > >   - alloc and attach indirect mbuf for each segment of m that is
> > >     in the range [offset : length+offset].
> > >   - prepend an empty and writable mbuf for the headers
> > >
> > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > which actually shrink the headroom, this case can be properly handled.
> > >
> > > What do you mean by properly handled?
> > >
> > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > won't be protected.
> > >
> > > From an application point of view, indirect mbufs, or direct mbufs that
> > > have refcnt != 1, should be both considered as read-only because they
> > > may share their data. How an application can know if the data is shared
> > > or not?
> > >
> > > Maybe we need a flag to differentiate mbufs that are read-only
> > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > understanding is correct, you want to have indirect mbufs with RW data.
> > 
> > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > enough to handle that use-case.
> > 
> > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > indirect referencing.
> 
> But just to make HW to RX multiple packets into one mbuf,
> data_off inside indirect mbuf should be enough, correct?
Right. The current max buffer length of an mbuf is 64kB (16 bits), but it is enough
for mlx5 to reach 100Gbps with 64B traffic (149Mpps). I made the mlx5 HW put 16
packets in a buffer, so it needs a ~32kB buffer. Having more bits in the length
fields would be better, but 16 bits is good enough to overcome the PCIe Gen3
bottleneck and saturate the network link.
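[Editor's note: the ~149Mpps and ~32kB figures above can be sanity-checked with
a quick calculation. The per-frame wire overhead (8B preamble/SFD + 12B
inter-frame gap) and the 2KiB per-packet stride size are assumptions here, not
stated in the thread.]

```c
#include <assert.h>
#include <stdint.h>

/* Line-rate packet count in Mpps for a given link speed and frame size.
 * Each frame carries 20B of wire overhead: 8B preamble/SFD + 12B IFG. */
static double mpps(double gbps, unsigned frame_bytes)
{
	return gbps * 1e9 / ((frame_bytes + 20) * 8) / 1e6;
}

/* Buffer needed for 16 packets at an assumed 2KiB stride each. */
static uint32_t mprq_buf_len(uint32_t strides, uint32_t stride_sz)
{
	return strides * stride_sz;
}
```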

> As I understand, what you'd like to achieve with this new field -
> ability to manipulate packet boundaries after RX, probably at upper layer.
> As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> indirect mbufs trying to modify same direct buffer.

I agree that there's an implication that an indirect mbuf, or an mbuf having refcnt > 1,
is read-only. What that means is that all the entities which own such mbufs have to be
aware of it and keep the principle, as DPDK can't enforce the rule and there
can't be such a sanity check. In this sense, HW doesn't violate it because the
direct mbuf is injected to HW before indirection. When packets are written by
HW, the PMD attaches indirect mbufs to the direct mbuf and delivers those to the
application layer while freeing the original direct mbuf (decrementing refcnt by 1).
So, HW doesn't touch the direct buffer once it reaches the upper layer. The
direct buffer will be freed and become available for reuse when all the attached
indirect mbufs are freed.
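[Editor's sketch of the lifecycle described above, as a toy refcount model
rather than PMD code: the direct buffer starts with the PMD's own reference,
gains one reference per attached indirect mbuf, loses the PMD's reference on
delivery, and becomes reusable only when the last indirect mbuf is freed.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the MPRQ direct-buffer lifecycle (not real PMD code). */
struct direct_buf { uint16_t refcnt; };

/* Rx completion: attach n_pkts indirect mbufs, then the PMD frees its
 * own reference to the direct mbuf before handing packets up. */
static void rx_deliver(struct direct_buf *d, unsigned n_pkts)
{
	d->refcnt += n_pkts;  /* one reference per attached indirect mbuf */
	d->refcnt -= 1;       /* PMD drops its own reference */
}

/* Application frees one indirect mbuf; returns 1 when the direct
 * buffer becomes free for reuse. */
static int free_indirect(struct direct_buf *d)
{
	d->refcnt -= 1;
	return d->refcnt == 0;
}
```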

> Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> Fields for indirect mbufs, straight after attach()?

Good point.
Actually, that was my draft (Mellanox internal) version of this patch :-) But I
had to consider a case where priv_size is really given by the user. Even though it
is less likely, if the original priv_size is quite big, it can't cover the entire
buf_len. For this, I would have had to increase priv_size to 32 bits, but adding
another 16-bit field (buf_off) looked more plausible.
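[Editor's illustration of the overflow concern above: if the per-packet offset
were folded into the clone's priv_size rather than kept in a separate buf_off
field, the sum of a user-supplied priv_size and an offset anywhere within a
~32kB-64kB buffer could exceed a 16-bit field. The threshold check below is a
simplified model, not the actual rte_mbuf address arithmetic.]

```c
#include <assert.h>
#include <stdint.h>

/* Would (original priv_size + attach offset) still fit in the
 * existing 16-bit priv_size field? */
static int fits_u16(uint32_t priv_size, uint32_t offset)
{
	return priv_size + offset <= UINT16_MAX;
}
```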

Thanks for good comments,
Yongseok

> > > >
> > > > Does this make sense?
> > >
> > > I understand the need.
> > >
> > > Another option would be to make the mbuf->buffer point to an external
> > > buffer (not inside the direct mbuf). This would require to add a
> > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > a quick overview.
> > >
> > > [1]
> > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > -Session05-OlivierMatz-
> > Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > >
> > > The advantage is that it does not require the large data to be inside a
> > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > allocated from a mempool). On the other hand, it is maybe more complex
> > > to implement compared to your solution.
> > 
> > I knew that you presented the slides and frankly, I had considered that option
> > at first. But even with that option, metadata to store refcnt should also be
> > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > end of the data segment. Even though it could have smaller metadata structure,
> > I just wanted to make full use of the existing framework because it is less
> > complex as you mentioned. Given that you presented the idea of external data
> > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > simpler.  I personally think that we can take the idea of external data seg when
> > more demands come from users in the future as it would be a huge change and may
> > break current ABI/API. When the day comes, I'll gladly participate in the
> > discussions and write codes for it if I can be helpful.
> > 
> > Do you think this patch is okay for now?
> > 
> > 
> > Thanks for your comments,
> > Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-11  5:33               ` Yongseok Koh
@ 2018-04-11 11:39                 ` Ananyev, Konstantin
  2018-04-11 14:02                   ` Andrew Rybchenko
  2018-04-11 17:08                   ` Yongseok Koh
  0 siblings, 2 replies; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-11 11:39 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Olivier Matz, Lu, Wenzhuo, Wu, Jingjing, Adrien Mazarguil,
	Nélio Laranjeiro, dev


Hi Yongseok,

> > >
> > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > Hi Yongseok,
> > > >
> > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > rte_pktmbuf_attach_at().
> > > > > > >
> > > > > > > Possible use-cases could be:
> > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > >
> > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > >
> > > > > > I think the current API is already able to do what you want.
> > > > > >
> > > > > > 1/ Here is a mbuf m with its data
> > > > > >
> > > > > >                off
> > > > > >                <-->
> > > > > >                       len
> > > > > >           +----+   <---------->
> > > > > >           |    |
> > > > > >         +-|----v----------------------+
> > > > > >         | |    -----------------------|
> > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > >         |      -----------------------|
> > > > > >         +-----------------------------+
> > > > > >
> > > > > >
> > > > > > 2/ clone m:
> > > > > >
> > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > >   rte_pktmbuf_attach(c, m);
> > > > > >
> > > > > >   Note that c has its own offset and length fields.
> > > > > >
> > > > > >
> > > > > >                off
> > > > > >                <-->
> > > > > >                       len
> > > > > >           +----+   <---------->
> > > > > >           |    |
> > > > > >         +-|----v----------------------+
> > > > > >         | |    -----------------------|
> > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > >         |      -----------------------|
> > > > > >         +------^----------------------+
> > > > > >                |
> > > > > >           +----+
> > > > > > indirect  |
> > > > > >         +-|---------------------------+
> > > > > >         | |    -----------------------|
> > > > > > c       | buf  |                     ||
> > > > > >         |      -----------------------|
> > > > > >         +-----------------------------+
> > > > > >
> > > > > >                 off    len
> > > > > >                 <--><---------->
> > > > > >
> > > > > >
> > > > > > 3/ remove some data from c without changing m
> > > > > >
> > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > >
> > > > > >
> > > > > > Please let me know if it fits your needs.
> > > > >
> > > > > No, it doesn't.
> > > > >
> > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > difference offsets in m,
> > > > >
> > > > > rte_pktmbuf_adj(c1, 10);
> > > > > rte_pktmbuf_adj(c2, 20);
> > > > >
> > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > header should be linked by h1->next = c2.
> > > >
> > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > length) that will:
> > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > >     in the range [offset : length+offset].
> > > >   - prepend an empty and writable mbuf for the headers
> > > >
> > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > which actually shrink the headroom, this case can be properly handled.
> > > >
> > > > What do you mean by properly handled?
> > > >
> > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > won't be protected.
> > > >
> > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > have refcnt != 1, should be both considered as read-only because they
> > > > may share their data. How an application can know if the data is shared
> > > > or not?
> > > >
> > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > understanding is correct, you want to have indirect mbufs with RW data.
> > >
> > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > enough to handle that use-case.
> > >
> > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > indirect referencing.
> >
> > But just to make HW to RX multiple packets into one mbuf,
> > data_off inside indirect mbuf should be enough, correct?
> Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> to saturate the network link.

There were a few complaints that the 64KB max is a limitation for some use-cases.
I am not against increasing it, but I don't think we have free space on the first cache line for that
without another big rework of the mbuf layout,
considering that we would need to increase the size of buf_len, data_off, data_len, and probably priv_size too.

> 
> > As I understand, what you'd like to achieve with this new field -
> > ability to manipulate packet boundaries after RX, probably at upper layer.
> > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > indirect mbufs trying to modify same direct buffer.
> 
> I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> is read-only. What that means, all the entities which own such mbufs have to be
> aware of that and keep the principle as DPDK can't enforce the rule and there
> can't be such sanity check. In this sense, HW doesn't violate it because the
> direct mbuf is injected to HW before indirection. When packets are written by
> HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> application layer with freeing the original direct mbuf (decrement refcnt by 1).
> So, HW doesn't touch the direct buffer once it reaches to upper layer.

Yes, I understand that. But as I can see, you introduced functions to adjust head and tail,
which implies that it should be possible for some entity (the upper layer?) to manipulate these
indirect mbufs.
And we don't know how exactly that will be done.

> The direct buffer will be freed and get available for reuse when all the attached
> indirect mbufs are freed.
> 
> > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > Fields for indirect mbufs, straight after attach()?
> 
> Good point.
> Actually that was my draft (Mellanox internal) version of this patch :-) But I
> had to consider a case where priv_size is really given by user. Even though it
> is less likely, but if original priv_size is quite big, it can't cover entire
> buf_len. For this, I had to increase priv_size to 32-bit but adding another
> 16bit field (buf_off) looked more plausible.

As I remember, we can't have mbufs bigger than 64K,
so priv_size + buf_len should always be less than 64K, correct?
Konstantin  

> 
> Thanks for good comments,
> Yongseok
> 
> > > > >
> > > > > Does this make sense?
> > > >
> > > > I understand the need.
> > > >
> > > > Another option would be to make the mbuf->buffer point to an external
> > > > buffer (not inside the direct mbuf). This would require to add a
> > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > a quick overview.
> > > >
> > > > [1]
> > >
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > > -Session05-OlivierMatz-
> > >
> Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > > >
> > > > The advantage is that it does not require the large data to be inside a
> > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > to implement compared to your solution.
> > >
> > > I knew that you presented the slides and frankly, I had considered that option
> > > at first. But even with that option, metadata to store refcnt should also be
> > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > end of the data segment. Even though it could have smaller metadata structure,
> > > I just wanted to make full use of the existing framework because it is less
> > > complex as you mentioned. Given that you presented the idea of external data
> > > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > simpler.  I personally think that we can take the idea of external data seg when
> > > more demands come from users in the future as it would be a huge change and may
> > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > discussions and write codes for it if I can be helpful.
> > >
> > > Do you think this patch is okay for now?
> > >
> > >
> > > Thanks for your comments,
> > > Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-11 11:39                 ` Ananyev, Konstantin
@ 2018-04-11 14:02                   ` Andrew Rybchenko
  2018-04-11 17:18                     ` Yongseok Koh
  2018-04-11 17:08                   ` Yongseok Koh
  1 sibling, 1 reply; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-11 14:02 UTC (permalink / raw)
  To: Ananyev, Konstantin, Yongseok Koh
  Cc: Olivier Matz, Lu, Wenzhuo, Wu, Jingjing, Adrien Mazarguil,
	Nélio Laranjeiro, dev

On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote:
> Hi Yongseok,
>
>>>> On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
>>>>> Hi Yongseok,
>>>>>
>>>>> On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
>>>>>> On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
>>>>>>>> When attaching a mbuf, indirect mbuf has to point to start of buffer of
>>>>>>>> direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
>>>>>>>> flexible. Indirect mbuf can point to any part of direct mbuf by calling
>>>>>>>> rte_pktmbuf_attach_at().
>>>>>>>>
>>>>>>>> Possible use-cases could be:
>>>>>>>> - If a packet has multiple layers of encapsulation, multiple indirect
>>>>>>>>    buffers can reference different layers of the encapsulated packet.
>>>>>>>> - A large direct mbuf can even contain multiple packets in series and
>>>>>>>>    each packet can be referenced by multiple mbuf indirections.
>>>>>>>>
>>>>>>>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
>>>>>>> I think the current API is already able to do what you want.
>>>>>>>
>>>>>>> 1/ Here is a mbuf m with its data
>>>>>>>
>>>>>>>                 off
>>>>>>>                 <-->
>>>>>>>                        len
>>>>>>>            +----+   <---------->
>>>>>>>            |    |
>>>>>>>          +-|----v----------------------+
>>>>>>>          | |    -----------------------|
>>>>>>> m       | buf  |    XXXXXXXXXXX      ||
>>>>>>>          |      -----------------------|
>>>>>>>          +-----------------------------+
>>>>>>>
>>>>>>>
>>>>>>> 2/ clone m:
>>>>>>>
>>>>>>>    c = rte_pktmbuf_alloc(pool);
>>>>>>>    rte_pktmbuf_attach(c, m);
>>>>>>>
>>>>>>>    Note that c has its own offset and length fields.
>>>>>>>
>>>>>>>
>>>>>>>                 off
>>>>>>>                 <-->
>>>>>>>                        len
>>>>>>>            +----+   <---------->
>>>>>>>            |    |
>>>>>>>          +-|----v----------------------+
>>>>>>>          | |    -----------------------|
>>>>>>> m       | buf  |    XXXXXXXXXXX      ||
>>>>>>>          |      -----------------------|
>>>>>>>          +------^----------------------+
>>>>>>>                 |
>>>>>>>            +----+
>>>>>>> indirect  |
>>>>>>>          +-|---------------------------+
>>>>>>>          | |    -----------------------|
>>>>>>> c       | buf  |                     ||
>>>>>>>          |      -----------------------|
>>>>>>>          +-----------------------------+
>>>>>>>
>>>>>>>                  off    len
>>>>>>>                  <--><---------->
>>>>>>>
>>>>>>>
>>>>>>> 3/ remove some data from c without changing m
>>>>>>>
>>>>>>>     rte_pktmbuf_adj(c, 10)   // at head
>>>>>>>     rte_pktmbuf_trim(c, 10)  // at tail
>>>>>>>
>>>>>>>
>>>>>>> Please let me know if it fits your needs.
>>>>>> No, it doesn't.
>>>>>>
>>>>>> Trimming head and tail with the current APIs removes data and make the space
>>>>>> available. Adjusting packet head means giving more headroom, not shifting the
>>>>>> buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
>>>>>> difference offsets in m,
>>>>>>
>>>>>> rte_pktmbuf_adj(c1, 10);
>>>>>> rte_pktmbuf_adj(c2, 20);
>>>>>>
>>>>>> then the owner of c2 regard the first (off+20)B as available headroom. If it
>>>>>> wants to attach outer header, it will overwrite the headroom even though the
>>>>>> owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
>>>>>> header should be linked by h1->next = c2.
>>>>> Yes, after these operations c1, c2 and m should become read-only. So, to
>>>>> prepend headers, another mbuf has to be inserted before as you suggest. It
>>>>> is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
>>>>> length) that will:
>>>>>    - alloc and attach indirect mbuf for each segment of m that is
>>>>>      in the range [offset : length+offset].
>>>>>    - prepend an empty and writable mbuf for the headers
>>>>>
>>>>>> If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
>>>>>> which actually shrink the headroom, this case can be properly handled.
>>>>> What do you mean by properly handled?
>>>>>
>>>>> Yes, prepending data or adding data in the indirect mbuf won't override
>>>>> the direct mbuf. But prepending data or adding data in the direct mbuf m
>>>>> won't be protected.
>>>>>
>>>>>  From an application point of view, indirect mbufs, or direct mbufs that
>>>>> have refcnt != 1, should be both considered as read-only because they
>>>>> may share their data. How an application can know if the data is shared
>>>>> or not?
>>>>>
>>>>> Maybe we need a flag to differentiate mbufs that are read-only
>>>>> (something like SHARED_DATA, or simply READONLY). In your case, if my
>>>>> understanding is correct, you want to have indirect mbufs with RW data.
>>>> Agree that indirect mbuf must be treated as read-only, Then the current code is
>>>> enough to handle that use-case.
>>>>
>>>>>> And another use-case (this is my actual use-case) is to make a large mbuf have
>>>>>> multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
>>>>>> because it transfers multiple packets to a single large buffer to reduce PCIe
>>>>>> overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
>>>>>> Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
>>>>>> indirect referencing.
>>> But just to make HW to RX multiple packets into one mbuf,
>>> data_off inside indirect mbuf should be enough, correct?
>> Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
>> to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
>> a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
>> better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
>> to saturate the network link.
> There were few complains that 64KB max is a limitation for some use-cases.
> I am not against increasing it, but I don't think we have free space on first cache-line for that
> without another big rework of mbuf layout.
> Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
>
>>> As I understand, what you'd like to achieve with this new field -
>>> ability to manipulate packet boundaries after RX, probably at upper layer.
>>> As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
>>> indirect mbufs trying to modify same direct buffer.
>> I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
>> is read-only. What that means, all the entities which own such mbufs have to be
>> aware of that and keep the principle as DPDK can't enforce the rule and there
>> can't be such sanity check. In this sense, HW doesn't violate it because the
>> direct mbuf is injected to HW before indirection. When packets are written by
>> HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
>> application layer with freeing the original direct mbuf (decrement refcnt by 1).
>> So, HW doesn't touch the direct buffer once it reaches to upper layer.
> Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> which implies that it should be possible by some entity (upper layer?) to manipulate these
> indirect mbufs.
> And we don't know how exactly it will be done.
>
>> The direct buffer will be freed and get available for reuse when all the attached
>> indirect mbufs are freed.
>>
>>> Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
>>> Fields for indirect mbufs, straight after attach()?
>> Good point.
>> Actually that was my draft (Mellanox internal) version of this patch :-) But I
>> had to consider a case where priv_size is really given by user. Even though it
>> is less likely, but if original priv_size is quite big, it can't cover entire
>> buf_len. For this, I had to increase priv_size to 32-bit but adding another
>> 16bit field (buf_off) looked more plausible.
> As I remember, we can't have mbufs bigger then 64K,
> so priv_size + buf_len should be always less than 64K, correct?

It sounds like the suggestion is to use/customize priv_size to limit the
indirect mbuf's range in the direct one. That does not work out of the box,
since priv_size is used to find the direct mbuf from an indirect one (see
rte_mbuf_from_indirect()).

Andrew.
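[Editorial note: Andrew's point can be illustrated with a small self-contained sketch. The struct below is a simplified stand-in for struct rte_mbuf (only buf_addr and priv_size), and mock_mbuf_from_indirect() mirrors the arithmetic of the real rte_mbuf_from_indirect(): the direct mbuf is recovered by subtracting sizeof(mbuf) + priv_size from buf_addr. If priv_size were overloaded to encode an extra offset into the buffer, that back-pointer would land in the wrong place.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the struct rte_mbuf fields relevant here. */
struct mock_mbuf {
    void    *buf_addr;   /* start of the data buffer */
    uint16_t priv_size;  /* private area between mbuf header and buffer */
};

/* Mirrors the logic of rte_mbuf_from_indirect(): the direct mbuf sits
 * sizeof(mbuf) + priv_size bytes before buf_addr, so priv_size must
 * describe the real layout, not an arbitrary window into the buffer. */
static struct mock_mbuf *
mock_mbuf_from_indirect(const struct mock_mbuf *mi)
{
    return (struct mock_mbuf *)((char *)mi->buf_addr
                                - sizeof(struct mock_mbuf) - mi->priv_size);
}

/* One mempool-like element: 128B pad (so a deliberately wrong back-pointer
 * below still lands in valid memory), mbuf header, 64B priv area, buffer. */
static char pool[128 + sizeof(struct mock_mbuf) + 64 + 1024];

/* Build a direct mbuf and an indirect mbuf attached to it; a non-zero
 * 'shift' inflates the indirect mbuf's priv_size by that amount (the
 * rejected trick for narrowing the indirect range). */
static struct mock_mbuf *
setup(struct mock_mbuf *mi, uint16_t shift)
{
    struct mock_mbuf *m = (struct mock_mbuf *)(pool + 128);
    m->priv_size = 64;
    m->buf_addr = (char *)m + sizeof(*m) + m->priv_size;
    mi->buf_addr = m->buf_addr;
    mi->priv_size = (uint16_t)(m->priv_size + shift);
    return m;
}
```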

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-11 11:39                 ` Ananyev, Konstantin
  2018-04-11 14:02                   ` Andrew Rybchenko
@ 2018-04-11 17:08                   ` Yongseok Koh
  2018-04-12 16:34                     ` Ananyev, Konstantin
  1 sibling, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-11 17:08 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Olivier Matz, Lu, Wenzhuo, Wu, Jingjing, Adrien Mazarguil,
	Nélio Laranjeiro, dev

On Wed, Apr 11, 2018 at 11:39:47AM +0000, Ananyev, Konstantin wrote:
> 
> Hi Yongseok,
> 
> > > >
> > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > Hi Yongseok,
> > > > >
> > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > >
> > > > > > > > Possible use-cases could be:
> > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > >
> > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > >
> > > > > > > I think the current API is already able to do what you want.
> > > > > > >
> > > > > > > 1/ Here is a mbuf m with its data
> > > > > > >
> > > > > > >                off
> > > > > > >                <-->
> > > > > > >                       len
> > > > > > >           +----+   <---------->
> > > > > > >           |    |
> > > > > > >         +-|----v----------------------+
> > > > > > >         | |    -----------------------|
> > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > >         |      -----------------------|
> > > > > > >         +-----------------------------+
> > > > > > >
> > > > > > >
> > > > > > > 2/ clone m:
> > > > > > >
> > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > >
> > > > > > >   Note that c has its own offset and length fields.
> > > > > > >
> > > > > > >
> > > > > > >                off
> > > > > > >                <-->
> > > > > > >                       len
> > > > > > >           +----+   <---------->
> > > > > > >           |    |
> > > > > > >         +-|----v----------------------+
> > > > > > >         | |    -----------------------|
> > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > >         |      -----------------------|
> > > > > > >         +------^----------------------+
> > > > > > >                |
> > > > > > >           +----+
> > > > > > > indirect  |
> > > > > > >         +-|---------------------------+
> > > > > > >         | |    -----------------------|
> > > > > > > c       | buf  |                     ||
> > > > > > >         |      -----------------------|
> > > > > > >         +-----------------------------+
> > > > > > >
> > > > > > >                 off    len
> > > > > > >                 <--><---------->
> > > > > > >
> > > > > > >
> > > > > > > 3/ remove some data from c without changing m
> > > > > > >
> > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > >
> > > > > > >
> > > > > > > Please let me know if it fits your needs.
> > > > > >
> > > > > > No, it doesn't.
> > > > > >
> > > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > difference offsets in m,
> > > > > >
> > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > >
> > > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > header should be linked by h1->next = c2.
> > > > >
> > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > length) that will:
> > > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > > >     in the range [offset : length+offset].
> > > > >   - prepend an empty and writable mbuf for the headers
> > > > >
> > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > >
> > > > > What do you mean by properly handled?
> > > > >
> > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > won't be protected.
> > > > >
> > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > may share their data. How an application can know if the data is shared
> > > > > or not?
> > > > >
> > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > >
> > > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > > enough to handle that use-case.
> > > >
> > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > indirect referencing.
> > >
> > > But just to make HW to RX multiple packets into one mbuf,
> > > data_off inside indirect mbuf should be enough, correct?
> > Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> > to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > to saturate the network link.
> 
> There were few complains that 64KB max is a limitation for some use-cases.
> I am not against increasing it, but I don't think we have free space on first cache-line for that
> without another big rework of mbuf layout. 
> Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too. 
> 
> > 
> > > As I understand, what you'd like to achieve with this new field -
> > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > > indirect mbufs trying to modify same direct buffer.
> > 
> > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > is read-only. What that means, all the entities which own such mbufs have to be
> > aware of that and keep the principle as DPDK can't enforce the rule and there
> > can't be such sanity check. In this sense, HW doesn't violate it because the
> > direct mbuf is injected to HW before indirection. When packets are written by
> > HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> > application layer with freeing the original direct mbuf (decrement refcnt by 1).
> > So, HW doesn't touch the direct buffer once it reaches to upper layer.
> 
> Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> which implies that it should be possible by some entity (upper layer?) to manipulate these
> indirect mbufs.
> And we don't know how exactly it will be done.

That's a valid concern. I can make it private by merging it into the
_attach_to() function, or I can simply add a comment to the API documentation.
However, if users are aware that an mbuf is read-only, we expect them to keep
it intact by their own judgement, and they would/should not use those APIs. We
can't stop them from modifying the content or the buffer itself anyway. I will
add more comments from this discussion regarding read-only mode.

> > The direct buffer will be freed and get available for reuse when all the attached
> > indirect mbufs are freed.
> > 
> > > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > > Fields for indirect mbufs, straight after attach()?
> > 
> > Good point.
> > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > had to consider a case where priv_size is really given by user. Even though it
> > is less likely, but if original priv_size is quite big, it can't cover entire
> > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > 16bit field (buf_off) looked more plausible.
> 
> As I remember, we can't have mbufs bigger then 64K,
> so priv_size + buf_len should be always less than 64K, correct?

Can you let me know where I can find that constraint? I checked
rte_pktmbuf_pool_create() and rte_pktmbuf_init() again to make sure I'm not
mistaken, but there's no such limitation.

	elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
		(unsigned)data_room_size;

The max of data_room_size is 64kB, and so is priv_size's. m->buf_addr starts
at 'm + sizeof(*m) + priv_size' and m->buf_len can't be larger than
UINT16_MAX, so priv_size couldn't be used for this purpose.
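[Editorial note: a small sketch of the arithmetic above. The only assumption carried over from DPDK is that priv_size and data_room_size/buf_len are each uint16_t, so each is individually capped at UINT16_MAX, while nothing caps their sum; the element can therefore exceed 64kB, which is why there is no "priv_size + buf_len < 64K" rule to exploit.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the quoted computation:
 * elt_size = sizeof(struct rte_mbuf) + priv_size + data_room_size.
 * The uint16_t parameter types model the 16-bit fields: each argument
 * is capped at UINT16_MAX, but the sum is computed in size_t. */
static size_t
elt_size(size_t mbuf_hdr_size, uint16_t priv_size, uint16_t data_room_size)
{
    return mbuf_hdr_size + (size_t)priv_size + (size_t)data_room_size;
}
```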

Yongseok

> > > > > >
> > > > > > Does this make sense?
> > > > >
> > > > > I understand the need.
> > > > >
> > > > > Another option would be to make the mbuf->buffer point to an external
> > > > > buffer (not inside the direct mbuf). This would require to add a
> > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > > a quick overview.
> > > > >
> > > > > [1]
> > > > > https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> > > > >
> > > > > The advantage is that it does not require the large data to be inside a
> > > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > > to implement compared to your solution.
> > > >
> > > > I knew that you presented the slides and frankly, I had considered that option
> > > > at first. But even with that option, metadata to store refcnt should also be
> > > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > > end of the data segment. Even though it could have smaller metadata structure,
> > > > I just wanted to make full use of the existing framework because it is less
> > > > complex as you mentioned. Given that you presented the idea of external data
> > > > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > > simpler.  I personally think that we can take the idea of external data seg when
> > > > more demands come from users in the future as it would be a huge change and may
> > > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > > discussions and write codes for it if I can be helpful.
> > > >
> > > > Do you think this patch is okay for now?
> > > >
> > > >
> > > > Thanks for your comments,
> > > > Yongseok


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-11 14:02                   ` Andrew Rybchenko
@ 2018-04-11 17:18                     ` Yongseok Koh
  0 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-11 17:18 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Ananyev, Konstantin, Olivier Matz, Lu, Wenzhuo, Wu, Jingjing,
	Adrien Mazarguil, Nélio Laranjeiro, dev

On Wed, Apr 11, 2018 at 05:02:50PM +0300, Andrew Rybchenko wrote:
> On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote:
> > Hi Yongseok,
> > 
> > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > Hi Yongseok,
> > > > > > 
> > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > > > 
> > > > > > > > > Possible use-cases could be:
> > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > >    buffers can reference different layers of the encapsulated packet.
> > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > >    each packet can be referenced by multiple mbuf indirections.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > > > I think the current API is already able to do what you want.
> > > > > > > > 
> > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > > 
> > > > > > > >                 off
> > > > > > > >                 <-->
> > > > > > > >                        len
> > > > > > > >            +----+   <---------->
> > > > > > > >            |    |
> > > > > > > >          +-|----v----------------------+
> > > > > > > >          | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >          |      -----------------------|
> > > > > > > >          +-----------------------------+
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 2/ clone m:
> > > > > > > > 
> > > > > > > >    c = rte_pktmbuf_alloc(pool);
> > > > > > > >    rte_pktmbuf_attach(c, m);
> > > > > > > > 
> > > > > > > >    Note that c has its own offset and length fields.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >                 off
> > > > > > > >                 <-->
> > > > > > > >                        len
> > > > > > > >            +----+   <---------->
> > > > > > > >            |    |
> > > > > > > >          +-|----v----------------------+
> > > > > > > >          | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >          |      -----------------------|
> > > > > > > >          +------^----------------------+
> > > > > > > >                 |
> > > > > > > >            +----+
> > > > > > > > indirect  |
> > > > > > > >          +-|---------------------------+
> > > > > > > >          | |    -----------------------|
> > > > > > > > c       | buf  |                     ||
> > > > > > > >          |      -----------------------|
> > > > > > > >          +-----------------------------+
> > > > > > > > 
> > > > > > > >                  off    len
> > > > > > > >                  <--><---------->
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 3/ remove some data from c without changing m
> > > > > > > > 
> > > > > > > >     rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > >     rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Please let me know if it fits your needs.
> > > > > > > No, it doesn't.
> > > > > > > 
> > > > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > difference offsets in m,
> > > > > > > 
> > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > > 
> > > > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > > header should be linked by h1->next = c2.
> > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > > length) that will:
> > > > > >    - alloc and attach indirect mbuf for each segment of m that is
> > > > > >      in the range [offset : length+offset].
> > > > > >    - prepend an empty and writable mbuf for the headers
> > > > > > 
> > > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > > > What do you mean by properly handled?
> > > > > > 
> > > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > won't be protected.
> > > > > > 
> > > > > >  From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > > may share their data. How an application can know if the data is shared
> > > > > > or not?
> > > > > > 
> > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > > > enough to handle that use-case.
> > > > > 
> > > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > > indirect referencing.
> > > > But just to make HW to RX multiple packets into one mbuf,
> > > > data_off inside indirect mbuf should be enough, correct?
> > > Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> > > to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > > to saturate the network link.
> > There were few complains that 64KB max is a limitation for some use-cases.
> > I am not against increasing it, but I don't think we have free space on first cache-line for that
> > without another big rework of mbuf layout.
> > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
> > 
> > > > As I understand, what you'd like to achieve with this new field -
> > > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > > > indirect mbufs trying to modify same direct buffer.
> > > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > > is read-only. What that means, all the entities which own such mbufs have to be
> > > aware of that and keep the principle as DPDK can't enforce the rule and there
> > > can't be such sanity check. In this sense, HW doesn't violate it because the
> > > direct mbuf is injected to HW before indirection. When packets are written by
> > > HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> > > application layer with freeing the original direct mbuf (decrement refcnt by 1).
> > > So, HW doesn't touch the direct buffer once it reaches to upper layer.
> > Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> > which implies that it should be possible by some entity (upper layer?) to manipulate these
> > indirect mbufs.
> > And we don't know how exactly it will be done.
> > 
> > > The direct buffer will be freed and get available for reuse when all the attached
> > > indirect mbufs are freed.
> > > 
> > > > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > > > Fields for indirect mbufs, straight after attach()?
> > > Good point.
> > > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > > had to consider a case where priv_size is really given by user. Even though it
> > > is less likely, but if original priv_size is quite big, it can't cover entire
> > > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > > 16bit field (buf_off) looked more plausible.
> > As I remember, we can't have mbufs bigger then 64K,
> > so priv_size + buf_len should be always less than 64K, correct?
> 
> It sounds like it is suggested to use/customize priv_size to limit indirect
> mbuf range in the direct one. It does not work from the box since priv_size is
> used to find out direct mbuf by indirect (see rte_mbuf_from_indirect()).

?? That's exactly why he suggested using priv_size...

Thanks,
Yongseok


* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-11 17:08                   ` Yongseok Koh
@ 2018-04-12 16:34                     ` Ananyev, Konstantin
  2018-04-12 18:58                       ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-12 16:34 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Olivier Matz, Lu, Wenzhuo, Wu, Jingjing, Adrien Mazarguil,
	Nélio Laranjeiro, dev

> >
> > > > >
> > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > Hi Yongseok,
> > > > > >
> > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > > >
> > > > > > > > > Possible use-cases could be:
> > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > > >
> > > > > > > > I think the current API is already able to do what you want.
> > > > > > > >
> > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > >
> > > > > > > >                off
> > > > > > > >                <-->
> > > > > > > >                       len
> > > > > > > >           +----+   <---------->
> > > > > > > >           |    |
> > > > > > > >         +-|----v----------------------+
> > > > > > > >         | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >         |      -----------------------|
> > > > > > > >         +-----------------------------+
> > > > > > > >
> > > > > > > >
> > > > > > > > 2/ clone m:
> > > > > > > >
> > > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > > >
> > > > > > > >   Note that c has its own offset and length fields.
> > > > > > > >
> > > > > > > >
> > > > > > > >                off
> > > > > > > >                <-->
> > > > > > > >                       len
> > > > > > > >           +----+   <---------->
> > > > > > > >           |    |
> > > > > > > >         +-|----v----------------------+
> > > > > > > >         | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >         |      -----------------------|
> > > > > > > >         +------^----------------------+
> > > > > > > >                |
> > > > > > > >           +----+
> > > > > > > > indirect  |
> > > > > > > >         +-|---------------------------+
> > > > > > > >         | |    -----------------------|
> > > > > > > > c       | buf  |                     ||
> > > > > > > >         |      -----------------------|
> > > > > > > >         +-----------------------------+
> > > > > > > >
> > > > > > > >                 off    len
> > > > > > > >                 <--><---------->
> > > > > > > >
> > > > > > > >
> > > > > > > > 3/ remove some data from c without changing m
> > > > > > > >
> > > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > >
> > > > > > > >
> > > > > > > > Please let me know if it fits your needs.
> > > > > > >
> > > > > > > No, it doesn't.
> > > > > > >
> > > > > > > Trimming head and tail with the current APIs removes data and makes the space
> > > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > different offsets in m,
> > > > > > >
> > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > >
> > > > > > > then the owner of c2 regards the first (off+20)B as available headroom. If it
> > > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > > header should be linked by h1->next = c2.
> > > > > >
> > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > > length) that will:
> > > > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > > > >     in the range [offset : length+offset].
> > > > > >   - prepend an empty and writable mbuf for the headers
> > > > > >
> > > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > > >
> > > > > > What do you mean by properly handled?
> > > > > >
> > > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > won't be protected.
> > > > > >
> > > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > > may share their data. How can an application know if the data is shared
> > > > > > or not?
> > > > > >
> > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > >
> > > > > Agree that indirect mbuf must be treated as read-only. Then the current code is
> > > > > enough to handle that use-case.
> > > > >
> > > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > > indirect referencing.
> > > >
> > > > But just to make HW RX multiple packets into one mbuf,
> > > > data_off inside the indirect mbuf should be enough, correct?
> > > Right. The current max buffer len of an mbuf is 64kB (16 bits) but it is enough for mlx5
> > > to reach 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > > to saturate the network link.
> >
> > There were a few complaints that the 64KB max is a limitation for some use-cases.
> > I am not against increasing it, but I don't think we have free space on first cache-line for that
> > without another big rework of mbuf layout.
> > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
> >
> > >
> > > > As I understand, what you'd like to achieve with this new field -
> > > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > > As Olivier pointed out above, that doesn't sound like a safe approach - as you have multiple
> > > > indirect mbufs trying to modify same direct buffer.
> > >
> > > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > > is read-only. What that means is that all the entities which own such mbufs have to
> > > be aware of that and keep to the principle, as DPDK can't enforce the rule and there
> > > can't be such a sanity check. In this sense, HW doesn't violate it because the
> > > direct mbuf is injected to HW before indirection. When packets are written by
> > > HW, the PMD attaches indirect mbufs to the direct mbuf and delivers those to the
> > > application layer, freeing the original direct mbuf (decrementing refcnt by 1).
> > > So, HW doesn't touch the direct buffer once it reaches the upper layer.
> >
> > Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> > which implies that it should be possible by some entity (upper layer?) to manipulate these
> > indirect mbufs.
> > And we don't know how exactly it will be done.
> 
> That's a valid concern. I can make it private by merging it into the _attach_to()
> func, or I can just add a comment in the API doc. However, if users are aware
> that a mbuf is read-only and we expect them to keep it intact by their own
> judgement, they would/should not use those APIs. We can't stop them from modifying
> the content or the buffer itself anyway. I will add more comments from this
> discussion regarding the read-only mode.

Ok, so these functions are intended to be used only at the PMD level?
But in that case do you need them at all?
Isn't it possible to implement the same thing with just data_off?
I mean your PMD knows in advance what the buf_len of the mbuf is, and at startup
time it can decide how to slice it into multiple packets.
So each offset is known in advance and you don't need to worry that you'll overwrite
a neighbor packet's data.

> 
> > > The direct buffer will be freed and get available for reuse when all the attached
> > > indirect mbufs are freed.
> > >
> > > > > Though if you really need to do that, can't it be achieved by updating the buf_len and priv_size
> > > > > fields for indirect mbufs, straight after attach()?
> > >
> > > Good point.
> > > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > > had to consider a case where priv_size is really given by user. Even though it
> > > is less likely, but if original priv_size is quite big, it can't cover entire
> > > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > > 16bit field (buf_off) looked more plausible.
> >
> > As I remember, we can't have mbufs bigger then 64K,
> > so priv_size + buf_len should be always less than 64K, correct?
> 
> Can you let me know where I can find the constraint? I checked
> rte_pktmbuf_pool_create() and rte_pktmbuf_init() again to not make any mistake
> but there's no such limitation.
> 
> 	elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
> 		(unsigned)data_room_size;


Ok I scanned through librte_mbuf and didn't find any limitations.
Seems like a false impression from my side.
Anyway, that seems like a corner case to have priv_size + buf_len > 64KB.
Do you really need to support it?

Konstantin

> 
> The max of data_room_size is 64kB, so is priv_size. m->buf_addr starts from 'm +
> sizeof(*m) + priv_size' and m->buf_len can't be larger than UINT16_MAX. So,
> priv_size couldn't be used for this purpose.
> 
> Yongseok
> 
> > > > > > >
> > > > > > > Does this make sense?
> > > > > >
> > > > > > I understand the need.
> > > > > >
> > > > > > Another option would be to make the mbuf->buffer point to an external
> > > > > > buffer (not inside the direct mbuf). This would require to add a
> > > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > > > a quick overview.
> > > > > >
> > > > > > [1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> > > > > >
> > > > > > The advantage is that it does not require the large data to be inside a
> > > > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > > > to implement compared to your solution.
> > > > >
> > > > > I knew that you presented the slides and frankly, I had considered that option
> > > > > at first. But even with that option, metadata to store refcnt should also be
> > > > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > > > end of the data segment. Even though it could have smaller metadata structure,
> > > > > I just wanted to make full use of the existing framework because it is less
> > > > > complex as you mentioned. Given that you presented the idea of external data
> > > > > buffer in 2016 and there haven't been many follow-up discussions/activities so
> > > > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > > > simpler.  I personally think that we can take the idea of external data seg when
> > > > > more demands come from users in the future as it would be a huge change and may
> > > > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > > > discussions and write codes for it if I can be helpful.
> > > > >
> > > > > Do you think this patch is okay for now?
> > > > >
> > > > >
> > > > > Thanks for your comments,
> > > > > Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
  2018-04-12 16:34                     ` Ananyev, Konstantin
@ 2018-04-12 18:58                       ` Yongseok Koh
  0 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-12 18:58 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Olivier Matz, Lu, Wenzhuo, Wu, Jingjing, Adrien Mazarguil,
	Nélio Laranjeiro, dev

On Thu, Apr 12, 2018 at 04:34:56PM +0000, Ananyev, Konstantin wrote:
> > >
> > > > > >
> > > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > > Hi Yongseok,
> > > > > > >
> > > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > > > >
> > > > > > > > > > Possible use-cases could be:
> > > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > > > >
> > > > > > > > > I think the current API is already able to do what you want.
> > > > > > > > >
> > > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > > >
> > > > > > > > >                off
> > > > > > > > >                <-->
> > > > > > > > >                       len
> > > > > > > > >           +----+   <---------->
> > > > > > > > >           |    |
> > > > > > > > >         +-|----v----------------------+
> > > > > > > > >         | |    -----------------------|
> > > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > > >         |      -----------------------|
> > > > > > > > >         +-----------------------------+
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2/ clone m:
> > > > > > > > >
> > > > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > > > >
> > > > > > > > >   Note that c has its own offset and length fields.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >                off
> > > > > > > > >                <-->
> > > > > > > > >                       len
> > > > > > > > >           +----+   <---------->
> > > > > > > > >           |    |
> > > > > > > > >         +-|----v----------------------+
> > > > > > > > >         | |    -----------------------|
> > > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > > >         |      -----------------------|
> > > > > > > > >         +------^----------------------+
> > > > > > > > >                |
> > > > > > > > >           +----+
> > > > > > > > > indirect  |
> > > > > > > > >         +-|---------------------------+
> > > > > > > > >         | |    -----------------------|
> > > > > > > > > c       | buf  |                     ||
> > > > > > > > >         |      -----------------------|
> > > > > > > > >         +-----------------------------+
> > > > > > > > >
> > > > > > > > >                 off    len
> > > > > > > > >                 <--><---------->
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 3/ remove some data from c without changing m
> > > > > > > > >
> > > > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please let me know if it fits your needs.
> > > > > > > >
> > > > > > > > No, it doesn't.
> > > > > > > >
> > > > > > > > Trimming head and tail with the current APIs removes data and makes the space
> > > > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > > different offsets in m,
> > > > > > > >
> > > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > > >
> > > > > > > > then the owner of c2 regards the first (off+20)B as available headroom. If it
> > > > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > > > header should be linked by h1->next = c2.
> > > > > > >
> > > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > > > length) that will:
> > > > > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > > > > >     in the range [offset : length+offset].
> > > > > > >   - prepend an empty and writable mbuf for the headers
> > > > > > >
> > > > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > > > >
> > > > > > > What do you mean by properly handled?
> > > > > > >
> > > > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > > won't be protected.
> > > > > > >
> > > > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > > > may share their data. How can an application know if the data is shared
> > > > > > > or not?
> > > > > > >
> > > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > > >
> > > > > > Agree that indirect mbuf must be treated as read-only. Then the current code is
> > > > > > enough to handle that use-case.
> > > > > >
> > > > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > > > indirect referencing.
> > > > >
> > > > > But just to make HW RX multiple packets into one mbuf,
> > > > > data_off inside the indirect mbuf should be enough, correct?
> > > > Right. The current max buffer len of an mbuf is 64kB (16 bits) but it is enough for mlx5
> > > > to reach 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > > > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > > > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > > > to saturate the network link.
> > >
> > > There were a few complaints that the 64KB max is a limitation for some use-cases.
> > > I am not against increasing it, but I don't think we have free space on first cache-line for that
> > > without another big rework of mbuf layout.
> > > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
> > >
> > > >
> > > > > As I understand, what you'd like to achieve with this new field -
> > > > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > > > As Olivier pointed out above, that doesn't sound like a safe approach - as you have multiple
> > > > > indirect mbufs trying to modify same direct buffer.
> > > >
> > > > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > > > is read-only. What that means is that all the entities which own such mbufs have to
> > > > be aware of that and keep to the principle, as DPDK can't enforce the rule and there
> > > > can't be such a sanity check. In this sense, HW doesn't violate it because the
> > > > direct mbuf is injected to HW before indirection. When packets are written by
> > > > HW, the PMD attaches indirect mbufs to the direct mbuf and delivers those to the
> > > > application layer, freeing the original direct mbuf (decrementing refcnt by 1).
> > > > So, HW doesn't touch the direct buffer once it reaches the upper layer.
> > >
> > > Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> > > which implies that it should be possible by some entity (upper layer?) to manipulate these
> > > indirect mbufs.
> > > And we don't know how exactly it will be done.
> > 
> > That's a valid concern. I can make it private by merging it into the _attach_to()
> > func, or I can just add a comment in the API doc. However, if users are aware
> > that a mbuf is read-only and we expect them to keep it intact by their own
> > judgement, they would/should not use those APIs. We can't stop them from modifying
> > the content or the buffer itself anyway. I will add more comments from this
> > discussion regarding the read-only mode.
> 
> Ok, so these functions are intended to be used only at the PMD level?
> But in that case do you need them at all?
> Isn't it possible to implement the same thing with just data_off?
> I mean your PMD knows in advance what the buf_len of the mbuf is, and at startup
> time it can decide how to slice it into multiple packets.
> So each offset is known in advance and you don't need to worry that you'll overwrite
> a neighbor packet's data.

Since Olivier's last comment, I've been thinking about the approach all over
again. It looks like I'm trapped in self-contradiction. The reason why I didn't
want to use data_off was to provide valid headroom for each Rx packet and let
users freely write the headroom. But, given that indirect mbuf should be
considered read-only, this isn't a right approach. Instead of slicing a buffer
with mbuf indirection and manipulating boundaries, the idea of external data (as
Olivier suggested) would fit better. Even though it is more complex, it is
doable. I summarized ideas yesterday and will come up with a new patch soon.

Briefly, I think reserved bit 61 of ol_flags can be used to indicate externally
attached mbuf. The following is my initial thought.

#define EXT_ATTACHED_MBUF    (1ULL << 61)

struct rte_pktmbuf_ext_shared_info {
	uint16_t refcnt;
	void (*free_cb)(void *addr, void *opaque);
	void *opaque; /* arg for free_cb() */
};

rte_pktmbuf_get_ext_shinfo() {
	/* Put shared info at the end of external buffer */
	return (struct rte_pktmbuf_ext_shared_info *)(m->buf_addr + m->buf_len);
}

rte_pktmbuf_attach_ext_buf(m, buf_addr, buf_len, free_cb, opaque) {
	struct rte_pktmbuf_ext_shared_info *shinfo;

	m->buf_addr = buf_addr;
	m->buf_iova = rte_mempool_virt2iova(buf_addr);
	/* Have to add some calculation for alignment */
	m->buf_len = buf_len - sizeof (*shinfo);
	shinfo = m->buf_addr + m->buf_len;
	...
	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
	m->ol_flags |= EXT_ATTACHED_MBUF;
	atomic set shinfo->refcnt = 1;

	shinfo->free_cb = free_cb;
	shinfo->opaque = opaque;

	...
}
rte_pktmbuf_detach_ext_buf(m)

#define RTE_MBUF_EXT(mb)   ((mb)->ol_flags & EXT_ATTACHED_MBUF)

In rte_pktmbuf_prefree_seg(),

		if (RTE_MBUF_INDIRECT(m))
			rte_pktmbuf_detach(m);
		else if (RTE_MBUF_EXT(m))
			rte_pktmbuf_detach_ext_buf(m);

And in rte_pktmbuf_attach(), if the mbuf attaching to is externally attached,
then just increase refcnt in shinfo so that multiple mbufs can refer to the same
external buffer.

Please feel free to share any concern/idea.

> > > > The direct buffer will be freed and get available for reuse when all the attached
> > > > indirect mbufs are freed.
> > > >
> > > > > Though if you really need to do that, can't it be achieved by updating the buf_len and priv_size
> > > > > fields for indirect mbufs, straight after attach()?
> > > >
> > > > Good point.
> > > > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > > > had to consider a case where priv_size is really given by user. Even though it
> > > > is less likely, but if original priv_size is quite big, it can't cover entire
> > > > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > > > 16bit field (buf_off) looked more plausible.
> > >
> > > As I remember, we can't have mbufs bigger then 64K,
> > > so priv_size + buf_len should be always less than 64K, correct?
> > 
> > Can you let me know where I can find the constraint? I checked
> > rte_pktmbuf_pool_create() and rte_pktmbuf_init() again to not make any mistake
> > but there's no such limitation.
> > 
> > 	elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
> > 		(unsigned)data_room_size;
> 
> 
> Ok I scanned through librte_mbuf and didn't find any limitations.
> Seems like a false impression from my side.
> Anyway, that seems like a corner case to have priv_size + buf_len > 64KB.
> Do you really need to support it?

If a user must have a 64kB buffer (it's valid, no violation) and the priv_size is
just a few bytes, does the library have to force the user to sacrifice a few
bytes for priv_size? Do you think it's a corner case? Still, using priv_size
doesn't seem to be a good idea.

Yongseok

> > The max of data_room_size is 64kB, so is priv_size. m->buf_addr starts from 'm +
> > sizeof(*m) + priv_size' and m->buf_len can't be larger than UINT16_MAX. So,
> > priv_size couldn't be used for this purpose.
> > 
> > Yongseok
> > 
> > > > > > > >
> > > > > > > > Does this make sense?
> > > > > > >
> > > > > > > I understand the need.
> > > > > > >
> > > > > > > Another option would be to make the mbuf->buffer point to an external
> > > > > > > buffer (not inside the direct mbuf). This would require to add a
> > > > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > > > > a quick overview.
> > > > > > >
> > > > > > > [1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> > > > > > >
> > > > > > > The advantage is that it does not require the large data to be inside a
> > > > > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > > > > to implement compared to your solution.
> > > > > >
> > > > > > I knew that you presented the slides and frankly, I had considered that option
> > > > > > at first. But even with that option, metadata to store refcnt should also be
> > > > > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > > > > end of the data segment. Even though it could have smaller metadata structure,
> > > > > > I just wanted to make full use of the existing framework because it is less
> > > > > > complex as you mentioned. Given that you presented the idea of external data
> > > > > > buffer in 2016 and there haven't been many follow-up discussions/activities so
> > > > > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > > > > simpler.  I personally think that we can take the idea of external data seg when
> > > > > > more demands come from users in the future as it would be a huge change and may
> > > > > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > > > > discussions and write codes for it if I can be helpful.
> > > > > >
> > > > > > Do you think this patch is okay for now?
> > > > > >
> > > > > >
> > > > > > Thanks for your comments,
> > > > > > Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (6 preceding siblings ...)
  2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
@ 2018-04-19  1:11 ` Yongseok Koh
  2018-04-19  1:11   ` [PATCH v3 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
                     ` (2 more replies)
  2018-04-24  1:38 ` [PATCH v4 " Yongseok Koh
                   ` (4 subsequent siblings)
  12 siblings, 3 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-19  1:11 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro,
	Yongseok Koh

This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but with a few differences:
  - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
    mbuf must be read-only. But external buffer has its own refcnt and it
    starts from 1. Unless multiple mbufs are attached to a mbuf having an
    external buffer, the external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---

Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
will be submitted separately rebased on a different patchset which
accommodates new memory hotplug design to mlx PMDs.

v3:
* implement external buffer attachment instead of introducing buf_off for
  mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 276 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 248 insertions(+), 28 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 06eceba37..e64160c81 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -326,7 +326,7 @@ extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -568,6 +568,24 @@ struct rte_mbuf {
 
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	RTE_STD_C11
+	union {
+		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
+		uint16_t refcnt;          /**< Non-atomically accessed refcnt */
+	};
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
 #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
 
 /**
+ * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -821,6 +844,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_extbuf_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_extbuf_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_extbuf_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_extbuf_refcnt_read(shinfo) == 1)) {
+		rte_extbuf_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
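The fast path above skips the atomic read-modify-write entirely when the caller is the sole owner. The semantics can be sketched with C11 atomics standing in for `rte_atomic16_t`; the names `ext_shinfo` and `ext_refcnt_update` are illustrative, not DPDK API:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative stand-in for struct rte_mbuf_ext_shared_info's refcnt,
 * using a C11 atomic instead of rte_atomic16_t. */
struct ext_shinfo {
	_Atomic uint16_t refcnt;
};

/* Mirrors rte_extbuf_refcnt_update(): when the caller is the sole
 * owner (refcnt == 1), a plain store suffices; otherwise fall back
 * to an atomic add and return the updated value. */
static uint16_t
ext_refcnt_update(struct ext_shinfo *s, int16_t value)
{
	if (atomic_load_explicit(&s->refcnt, memory_order_relaxed) == 1) {
		atomic_store_explicit(&s->refcnt, (uint16_t)(1 + value),
			memory_order_relaxed);
		return (uint16_t)(1 + value);
	}
	return (uint16_t)(atomic_fetch_add(&s->refcnt, (uint16_t)value)
		+ value);
}
```

The non-atomic branch is safe only because a refcnt of 1 means no other thread can concurrently hold a reference.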
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1195,11 +1270,120 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Return shared data of external buffer of a mbuf.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @return
+ *   The address of the shared data.
+ */
+static inline struct rte_mbuf_ext_shared_info *
+rte_mbuf_ext_shinfo(struct rte_mbuf *m)
+{
+	return (struct rte_mbuf_ext_shared_info *)
+		RTE_PTR_ADD(m->buf_addr, m->buf_len);
+}
+
+/**
+ * Attach an external buffer to a mbuf.
+ *
+ * A user-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, the corresponding free callback function and its argument should be
+ * provided. This callback function will be called once all the mbufs are
+ * detached from the buffer.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * A few bytes in the trailer of the provided buffer will be dedicated for
+ * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt,
+ * callback function and so on. The shared data can be referenced by
+ * ``rte_mbuf_ext_shinfo()``.
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing the buffer address and length of a mbuf, but with a few differences:
+ * - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
+ *   mbuf must be read-only. But external buffer has its own refcnt and it
+ *   starts from 1. Unless multiple mbufs are attached to a mbuf having an
+ *   external buffer, the external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect nor
+ * has an external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer we're attaching to.
+ * @param buf_len
+ *   The size of the external buffer we're attaching to. This must be larger
+ *   than the size of ``struct rte_mbuf_ext_shared_info`` plus any padding for
+ *   alignment. If it is not large enough, this function returns NULL.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ * @return
+ *   A pointer to the new start of the data on success, or NULL otherwise.
+ */
+static inline char * __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
+	void *fcb_opaque)
+{
+	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
+	struct rte_mbuf_ext_shared_info *shinfo;
+
+	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
+			sizeof(uintptr_t));
+
+	if ((void *)shinfo <= buf_addr)
+		return NULL;
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = rte_mempool_virt2iova(buf_addr);
+	m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+	m->data_len = 0;
+
+	rte_pktmbuf_reset_headroom(m);
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+
+	rte_extbuf_refcnt_set(shinfo, 1);
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+
+	return (char *)m->buf_addr + m->data_off;
+}
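The trailer arithmetic above (`RTE_PTR_ALIGN_FLOOR` of `buf_end - sizeof(*shinfo)`) can be sketched in plain pointer math; `shinfo_size` stands in for `sizeof(struct rte_mbuf_ext_shared_info)` and the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Locate the shared-info trailer inside a user buffer: subtract its
 * size from the buffer end and align down to uintptr_t, as
 * rte_pktmbuf_attach_extbuf() does. Returns NULL when the buffer is
 * too small to host the trailer. */
static void *
shinfo_addr(void *buf_addr, uint16_t buf_len, size_t shinfo_size)
{
	uintptr_t buf_end = (uintptr_t)buf_addr + buf_len;
	uintptr_t addr = (buf_end - shinfo_size) &
		~(uintptr_t)(sizeof(uintptr_t) - 1);

	if (addr <= (uintptr_t)buf_addr)
		return NULL;
	/* The usable data room shrinks to addr - buf_addr bytes,
	 * which becomes the mbuf's buf_len. */
	return (void *)addr;
}
```
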
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``.
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
+ * If the mbuf we are attaching to is not direct but has an external
+ * buffer attached, the new mbuf is attached to that external buffer
+ * instead of using mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer to the attached mbuf as 'indirect', and to the mbuf it was
+ * attached to as 'direct'. The direct mbuf's reference counter is
+ * incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1213,19 +1397,18 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_extbuf_refcnt_update(rte_mbuf_ext_shinfo(m), 1);
+		mi->ol_flags = m->ol_flags;
+	} else {
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1241,7 +1424,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1250,12 +1432,53 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	struct rte_mbuf_ext_shared_info *shinfo;
+
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+
+	shinfo = rte_mbuf_ext_shinfo(m);
+
+	if (rte_extbuf_refcnt_update(shinfo, -1) == 0)
+		shinfo->free_cb(m->buf_addr, shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
  *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
+ *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1264,10 +1487,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1279,13 +1506,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
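After dropping the reference, detach re-points the mbuf at its own embedded data room, which sits directly after the mbuf header and its private area. A minimal model of that restore step (the `toy_mbuf` layout is an assumption for this sketch, not DPDK's real struct):

```c
#include <stdint.h>
#include <stddef.h>

/* Toy layout: header fields, then priv_size private bytes, then the
 * default data room, mirroring how a direct rte_mbuf embeds its buffer. */
struct toy_mbuf {
	void *buf_addr;
	uint16_t buf_len;
	char priv_and_data[];
};

/* Mirrors the tail of rte_pktmbuf_detach(): recompute the embedded
 * buffer address and room size from the pool parameters. */
static void
toy_detach_restore(struct toy_mbuf *m, uint32_t priv_size,
	uint16_t data_room)
{
	uint32_t mbuf_size = sizeof(struct toy_mbuf) + priv_size;

	m->buf_addr = (char *)m + mbuf_size;
	m->buf_len = data_room;
}
```

This is why detach needs only the mempool's priv_size and data room size, never the address of the buffer that was attached.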
 
 /**
@@ -1309,7 +1529,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1321,7 +1541,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v3 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-19  1:11 ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
@ 2018-04-19  1:11   ` Yongseok Koh
  2018-04-23 11:53   ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
  2018-04-23 16:18   ` Olivier Matz
  2 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-19  1:11 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro,
	Yongseok Koh

If a PMD delivers Rx packets with non-direct mbufs, ol_flags should not be
overwritten. For the mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets
could be externally attached mbufs.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 3 +++
 app/test-pmd/macfwd.c   | 3 +++
 app/test-pmd/macswap.c  | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5f5ab64aa..bb0b675a8 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -770,6 +770,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (!RTE_MBUF_DIRECT(m))
+			tx_ol_flags |= m->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..ba0021194 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..b8d15f6ba 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,9 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread
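The three hunks above apply the same rule and could be condensed into one helper. A sketch of that rule (the function name and standalone flag definitions are illustrative; the flag values match the bit positions used in the patch):

```c
#include <stdint.h>

#define EXT_ATTACHED_MBUF (1ULL << 61) /* mbuf with external buffer */
#define IND_ATTACHED_MBUF (1ULL << 62) /* indirect attached mbuf */

/* When testpmd rewrites ol_flags of a forwarded packet, a non-direct
 * mbuf must keep its attachment bits; otherwise freeing it later would
 * wrongly treat the buffer as owned by the mbuf itself. */
static uint64_t
fwd_ol_flags(uint64_t rx_ol_flags, uint64_t tx_ol_flags)
{
	const uint64_t attach_mask = IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF;

	if (rx_ol_flags & attach_mask)	/* i.e. !RTE_MBUF_DIRECT(m) */
		tx_ol_flags |= rx_ol_flags & attach_mask;
	return tx_ol_flags;
}
```
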

* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-19  1:11 ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
  2018-04-19  1:11   ` [PATCH v3 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-23 11:53   ` Ananyev, Konstantin
  2018-04-24  2:04     ` Yongseok Koh
  2018-04-23 16:18   ` Olivier Matz
  2 siblings, 1 reply; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-23 11:53 UTC (permalink / raw)
  To: Yongseok Koh, Lu, Wenzhuo, Wu, Jingjing, olivier.matz
  Cc: dev, adrien.mazarguil, nelio.laranjeiro



> 
> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>   - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
>     mbuf must be read-only. But external buffer has its own refcnt and it
>     starts from 1. Unless multiple mbufs are attached to a mbuf having an
>     external buffer, the external buffer is writable.
>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
> 
> Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> will be submitted separately rebased on a differnet patchset which
> accommodates new memory hotplug design to mlx PMDs.
> 
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>   mbuf indirection.
> 
>  lib/librte_mbuf/rte_mbuf.h | 276 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 248 insertions(+), 28 deletions(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 06eceba37..e64160c81 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -326,7 +326,7 @@ extern "C" {
>  		PKT_TX_MACSEC |		 \
>  		PKT_TX_SEC_OFFLOAD)
> 
> -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> +#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */
> 
>  #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
> 
> @@ -568,6 +568,24 @@ struct rte_mbuf {
> 
>  } __rte_cache_aligned;
> 
> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	RTE_STD_C11
> +	union {
> +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> +		uint16_t refcnt;          /**< Non-atomically accessed refcnt */
> +	};
> +};
> +
>  /**< Maximum number of nb_segs allowed. */
>  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
> 
> @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> 
>  /**
> + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> +
> +/**
>   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
>   */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))

As a nit:
RTE_MBUF_DIRECT(mb)  (((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)) == 0)

> 
>  /**
>   * Private data in case of pktmbuf pool.
> @@ -821,6 +844,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
> 
>  #endif /* RTE_MBUF_REFCNT_ATOMIC */
> 
> +/**
> + * Reads the refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @return
> + *   Reference count number.
> + */
> +static inline uint16_t
> +rte_extbuf_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
> +{
> +	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
> +}
> +
> +/**
> + * Set refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @param new_value
> + *   Value set
> + */
> +static inline void
> +rte_extbuf_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
> +	uint16_t new_value)
> +{
> +	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
> +}
> +
> +/**
> + * Add given value to refcnt of an external buffer and return its new
> + * value.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @param value
> + *   Value to add/subtract
> + * @return
> + *   Updated value
> + */
> +static inline uint16_t
> +rte_extbuf_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
> +	int16_t value)
> +{
> +	if (likely(rte_extbuf_refcnt_read(shinfo) == 1)) {
> +		rte_extbuf_refcnt_set(shinfo, 1 + value);
> +		return 1 + value;
> +	}
> +
> +	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
> +}
> +
>  /** Mbuf prefetch */
>  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
>  	if ((m) != NULL)                        \
> @@ -1195,11 +1270,120 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>  }
> 
>  /**
> + * Return shared data of external buffer of a mbuf.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @return
> + *   The address of the shared data.
> + */
> +static inline struct rte_mbuf_ext_shared_info *
> +rte_mbuf_ext_shinfo(struct rte_mbuf *m)
> +{
> +	return (struct rte_mbuf_ext_shared_info *)
> +		RTE_PTR_ADD(m->buf_addr, m->buf_len);
> +}
> +
> +/**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * A few bytes in the trailer of the provided buffer will be dedicated for
> + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt,
> + * callback function and so on. The shared data can be referenced by
> + * ``rte_mbuf_ext_shinfo()``
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> + *   mbuf must be read-only. But external buffer has its own refcnt and it
> + *   starts from 1. Unless multiple mbufs are attached to a mbuf having an
> + *   external buffer, the external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be consiered direct if it is neither indirect nor
> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. This must be larger
> + *   than the size of ``struct rte_mbuf_ext_shared_info`` and padding for
> + *   alignment. If not enough, this function will return NULL.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + * @return
> + *   A pointer to the new start of the data on success, return NULL otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
> +	void *fcb_opaque)
> +{
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +
> +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
> +			sizeof(uintptr_t));
> +
> +	if ((void *)shinfo <= buf_addr)
> +		return NULL;
> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = rte_mempool_virt2iova(buf_addr);


That wouldn't work for arbitrary extern buffer.
Only for the one that is an element in some other mempool.
For arbitrary external buffer - callee has to provide PA for it plus guarantee that
it's VA would be locked down.
From other side - if your intention is just to use only elements of other mempools -
No need to have free_cb(). mempool_put should do.

> +	m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> +	m->data_len = 0;
> +
> +	rte_pktmbuf_reset_headroom(m);
> +	m->ol_flags |= EXT_ATTACHED_MBUF;
> +
> +	rte_extbuf_refcnt_set(shinfo, 1);
> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;
> +
> +	return (char *)m->buf_addr + m->data_off;
> +}
> +
> +/**
> + * Detach the external buffer attached to a mbuf, same as
> + * ``rte_pktmbuf_detach()``
> + *
> + * @param m
> + *   The mbuf having external buffer.
> + */
> +#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
> +
> +/**
>   * Attach packet mbuf to another packet mbuf.
>   *
> - * After attachment we refer the mbuf we attached as 'indirect',
> - * while mbuf we attached to as 'direct'.
> - * The direct mbuf's reference counter is incremented.
> + * If the mbuf we are attaching to isn't a direct buffer and is attached to
> + * an external buffer, the mbuf being attached will be attached to the
> + * external buffer instead of mbuf indirection.
> + *
> + * Otherwise, the mbuf will be indirectly attached. After attachment we
> + * refer the mbuf we attached as 'indirect', while mbuf we attached to as
> + * 'direct'.  The direct mbuf's reference counter is incremented.
>   *
>   * Right now, not supported:
>   *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
> @@ -1213,19 +1397,18 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   */
>  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  {
> -	struct rte_mbuf *md;
> -
>  	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
>  	    rte_mbuf_refcnt_read(mi) == 1);
> 
> -	/* if m is not direct, get the mbuf that embeds the data */
> -	if (RTE_MBUF_DIRECT(m))
> -		md = m;
> -	else
> -		md = rte_mbuf_from_indirect(m);
> +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> +		rte_extbuf_refcnt_update(rte_mbuf_ext_shinfo(m), 1);
> +		mi->ol_flags = m->ol_flags;
> +	} else {
> +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
> +		mi->priv_size = m->priv_size;
> +		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
> +	}
> 
> -	rte_mbuf_refcnt_update(md, 1);
> -	mi->priv_size = m->priv_size;
>  	mi->buf_iova = m->buf_iova;
>  	mi->buf_addr = m->buf_addr;
>  	mi->buf_len = m->buf_len;
> @@ -1241,7 +1424,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  	mi->next = NULL;
>  	mi->pkt_len = mi->data_len;
>  	mi->nb_segs = 1;
> -	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
>  	mi->packet_type = m->packet_type;
>  	mi->timestamp = m->timestamp;
> 
> @@ -1250,12 +1432,53 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  }
> 
>  /**
> - * Detach an indirect packet mbuf.
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the reference counter of the external buffer. When the
> + * reference counter becomes 0, the buffer is freed by pre-registered
> + * callback.
> + */
> +static inline void
> +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +
> +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> +
> +	shinfo = rte_mbuf_ext_shinfo(m);
> +
> +	if (rte_extbuf_refcnt_update(shinfo, -1) == 0)
> +		shinfo->free_cb(m->buf_addr, shinfo->fcb_opaque);


I understand the reason but extra function call for each external mbuf - seems quite expensive.
Wonder is it possible to group them somehow and amortize the cost?

> +}
> +
> +/**
> + * @internal used by rte_pktmbuf_detach().
>   *
> + * Decrement the direct mbuf's reference counter. When the reference
> + * counter becomes 0, the direct mbuf is freed.
> + */
> +static inline void
> +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
> +
> +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> +
> +	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> +		md->next = NULL;
> +		md->nb_segs = 1;
> +		rte_mbuf_refcnt_set(md, 1);
> +		rte_mbuf_raw_free(md);
> +	}
> +}
> +
> +/**
> + * Detach a packet mbuf from external buffer or direct buffer.
> + *
> + *  - decrement refcnt and free the external/direct buffer if refcnt
> + *    becomes zero.
>   *  - restore original mbuf address and length values.
>   *  - reset pktmbuf data and data_len to their default values.
> - *  - decrement the direct mbuf's reference counter. When the
> - *  reference counter becomes 0, the direct mbuf is freed.
>   *
>   * All other fields of the given packet mbuf will be left intact.
>   *
> @@ -1264,10 +1487,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   */
>  static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  {
> -	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
>  	struct rte_mempool *mp = m->pool;
>  	uint32_t mbuf_size, buf_len, priv_size;
> 
> +	if (RTE_MBUF_HAS_EXTBUF(m))
> +		__rte_pktmbuf_free_extbuf(m);
> +	else
> +		__rte_pktmbuf_free_direct(m);
> +
>  	priv_size = rte_pktmbuf_priv_size(mp);
>  	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
>  	buf_len = rte_pktmbuf_data_room_size(mp);
> @@ -1279,13 +1506,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  	rte_pktmbuf_reset_headroom(m);
>  	m->data_len = 0;
>  	m->ol_flags = 0;
> -
> -	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> -		md->next = NULL;
> -		md->nb_segs = 1;
> -		rte_mbuf_refcnt_set(md, 1);
> -		rte_mbuf_raw_free(md);
> -	}
>  }
> 
>  /**
> @@ -1309,7 +1529,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> 
>  	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> 
> -		if (RTE_MBUF_INDIRECT(m))
> +		if (!RTE_MBUF_DIRECT(m))
>  			rte_pktmbuf_detach(m);
> 
>  		if (m->next != NULL) {
> @@ -1321,7 +1541,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> 
>  	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> 
> -		if (RTE_MBUF_INDIRECT(m))
> +		if (!RTE_MBUF_DIRECT(m))
>  			rte_pktmbuf_detach(m);
> 
>  		if (m->next != NULL) {
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-19  1:11 ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
  2018-04-19  1:11   ` [PATCH v3 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
  2018-04-23 11:53   ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
@ 2018-04-23 16:18   ` Olivier Matz
  2018-04-24  1:29     ` Yongseok Koh
  2 siblings, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-23 16:18 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

Hi Yongseok,

Please see some comments below.

On Wed, Apr 18, 2018 at 06:11:04PM -0700, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>   - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
>     mbuf must be read-only. But external buffer has its own refcnt and it
>     starts from 1. Unless multiple mbufs are attached to a mbuf having an
>     external buffer, the external buffer is writable.

I'm wondering if "As refcnt of a direct mbuf is at least 2" should be
clarified. I guess we are talking about a direct mbuf that has another one
attached too.

I'm also not sure if I understand properly: to me, it is possible to have
an indirect mbuf that references a direct mbuf with a refcount of 1:
  m = rte_pktmbuf_alloc()
  mi = rte_pktmbuf_alloc()
  rte_pktmbuf_attach(mi, m)
  rte_pktmbuf_free(m)

>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

[...]

> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	RTE_STD_C11
> +	union {
> +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> +		uint16_t refcnt;          /**< Non-atomically accessed refcnt */

It looks that only refcnt_atomic is used.
I don't know if we really need the non-atomic one yet.


> +	};
> +};
> +
>  /**< Maximum number of nb_segs allowed. */
>  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
>  
> @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>  
>  /**
> + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> +
> +/**
>   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
>   */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))

I'm a bit reticent to have RTE_MBUF_DIRECT(m) different of
!RTE_MBUF_INDIRECT(m), I feel it's not very natural.

What about:
- direct = embeds its own data
- clone (or another name) = data is another mbuf
- extbuf = data is in an external buffer


>  /**
>   * Private data in case of pktmbuf pool.
> @@ -821,6 +844,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
>  
>  #endif /* RTE_MBUF_REFCNT_ATOMIC */
>  
> +/**
> + * Reads the refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @return
> + *   Reference count number.
> + */
> +static inline uint16_t
> +rte_extbuf_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)

What do you think about rte_mbuf_ext_refcnt_read() to keep name consistency?
(same for other functions below)

[...]

> @@ -1195,11 +1270,120 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>  }
>  
>  /**
> + * Return shared data of external buffer of a mbuf.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @return
> + *   The address of the shared data.
> + */
> +static inline struct rte_mbuf_ext_shared_info *
> +rte_mbuf_ext_shinfo(struct rte_mbuf *m)
> +{
> +	return (struct rte_mbuf_ext_shared_info *)
> +		RTE_PTR_ADD(m->buf_addr, m->buf_len);
> +}

This forces to have the shared data at the end of the buffer. Is it
always possible? I think there are use-cases where the user may want to
specify another location for it.

For instance, an application mmaps a big file (locked in memory), and
wants to send mbufs pointing to this data without doing any copy.

Maybe adding a m->shinfo field would be a better choice, what do you
think?

This would certainly break the ABI, but I wonder if that patch does
not already break it. I mean, how would react an application compiled
for 18.02 if an EXTBUF is passed to it, knowing that many functions
are inline?


> +/**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * A few bytes in the trailer of the provided buffer will be dedicated for
> + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt,
> + * callback function and so on. The shared data can be referenced by
> + * ``rte_mbuf_ext_shinfo()``
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> + *   mbuf must be read-only. But external buffer has its own refcnt and it
> + *   starts from 1. Unless multiple mbufs are attached to a mbuf having an
> + *   external buffer, the external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be consiered direct if it is neither indirect nor

small typo:
consiered -> considered

> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. This must be larger
> + *   than the size of ``struct rte_mbuf_ext_shared_info`` and padding for
> + *   alignment. If not enough, this function will return NULL.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + * @return
> + *   A pointer to the new start of the data on success, return NULL otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
> +	void *fcb_opaque)
> +{
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +
> +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
> +			sizeof(uintptr_t));
> +
> +	if ((void *)shinfo <= buf_addr)
> +		return NULL;
> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = rte_mempool_virt2iova(buf_addr);

Agree with Konstantin's comment. I think buf_iova should be an argument
of the function.


> +	m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);

Related to what I said above: I think m->buf_len should be set to
the buf_len argument, so a user can point to existing read-only
data.

[...]

Few more comments:

I think we still need to find a good way to advertise to the users
if a mbuf is writable or readable. Today, the rules are quite implicit.
There are surely some use cases where the mbuf is indirect but with
only one active user, meaning it could be READ-WRITE. We could target
18.08 for this.

One side question about your implementation in mlx. I guess the
hardware will write the mbuf data in a big contiguous buffer like
this:

+-------+--------------+--------+--------------+--------+- - -
|       |mbuf1 data    |        |mbuf2 data    |        |
|       |              |        |              |        |
+-------+--------------+--------+--------------+--------+- - -

Which will be transformed in:

+--+----+--------------+---+----+--------------+---+---+- - -
|  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
|  |room|              |inf|room|              |inf|   |
+--+----+--------------+---+----+--------------+---+---+- - -

So, there is one shinfo (i.e. one refcount) for each mbuf.
How do you know when the big buffer is not used anymore?


To summarize, I like the idea of your patchset, this is close to
what I had in mind... which does not necessarily mean it is the right
way to do it ;)

I'm a bit afraid about ABI breakage, we need to check that a
18.02-compiled application still works well with this change.

About testing, I don't know if you checked the mbuf autotests,
but it could also help to check that basic stuff still work.


Thanks,
Olivier

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-23 16:18   ` Olivier Matz
@ 2018-04-24  1:29     ` Yongseok Koh
  2018-04-24 15:36       ` Olivier Matz
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24  1:29 UTC (permalink / raw)
  To: Olivier Matz
  Cc: wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

On Mon, Apr 23, 2018 at 06:18:43PM +0200, Olivier Matz wrote:
> Hi Yongseok,
> 
> Please see some comments below.
> 
> On Wed, Apr 18, 2018 at 06:11:04PM -0700, Yongseok Koh wrote:
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >   - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> >     mbuf must be read-only. But external buffer has its own refcnt and it
> >     starts from 1. Unless multiple mbufs are attached to a mbuf having an
> >     external buffer, the external buffer is writable.
> 
> I'm wondering if "As refcnt of a direct mbuf is at least 2" should be
> clarified. I guess we are talking about a direct mbuf that has another one
> attached too.
> 
> I'm also not sure if I understand properly: to me, it is possible to have
> an indirect mbuf that references a direct mbuf with a refcount of 1:
>   m = rte_pktmbuf_alloc()
>   mi = rte_pktmbuf_alloc()
>   rte_pktmbuf_attach(mi, m)
>   rte_pktmbuf_free(m)

Totally agree. Will change the comment.

[...]
> > +struct rte_mbuf_ext_shared_info {
> > +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> > +	void *fcb_opaque;                        /**< Free callback argument */
> > +	RTE_STD_C11
> > +	union {
> > +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> > +		uint16_t refcnt;          /**< Non-atomically accessed refcnt */
> 
> It looks that only refcnt_atomic is used.
> I don't know if we really need the non-atomic one yet.

Will remove.

> > +	};
> > +};
> > +
> >  /**< Maximum number of nb_segs allowed. */
> >  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
> >  
> > @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> >  
> >  /**
> > + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> > + */
> > +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > +
> > +/**
> >   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> >   */
> > -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
> 
> I'm a bit reticent to have RTE_MBUF_DIRECT(m) different of
> !RTE_MBUF_INDIRECT(m), I feel it's not very natural.
> 
> What about:
> - direct = embeds its own data
> - clone (or another name) = data is another mbuf
> - extbuf = data is in an external buffer

Good point. I'll clarify it in a new version by adding RTE_MBUF_CLONED().

> >  /**
> >   * Private data in case of pktmbuf pool.
> > @@ -821,6 +844,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
> >  
> >  #endif /* RTE_MBUF_REFCNT_ATOMIC */
> >  
> > +/**
> > + * Reads the refcnt of an external buffer.
> > + *
> > + * @param shinfo
> > + *   Shared data of the external buffer.
> > + * @return
> > + *   Reference count number.
> > + */
> > +static inline uint16_t
> > +rte_extbuf_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
> 
> What do you think about rte_mbuf_ext_refcnt_read() to keep name consistency?
> (same for other functions below)

No problem.

> [...]
> 
> > @@ -1195,11 +1270,120 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> >  }
> >  
> >  /**
> > + * Return shared data of external buffer of a mbuf.
> > + *
> > + * @param m
> > + *   The pointer to the mbuf.
> > + * @return
> > + *   The address of the shared data.
> > + */
> > +static inline struct rte_mbuf_ext_shared_info *
> > +rte_mbuf_ext_shinfo(struct rte_mbuf *m)
> > +{
> > +	return (struct rte_mbuf_ext_shared_info *)
> > +		RTE_PTR_ADD(m->buf_addr, m->buf_len);
> > +}
> 
> This forces to have the shared data at the end of the buffer. Is it
> always possible? I think there are use-cases where the user may want to
> specify another location for it.
> 
> For instance, an application mmaps a big file (locked in memory), and
> wants to send mbufs pointing to this data without doing any copy.

Very good point. Will make rte_pktmbuf_attach_extbuf() take *shinfo as an
argument.

> Maybe adding a m->shinfo field would be a better choice, what do you
> think?

I like the idea of storing it in the mbuf too.

> This would certainly break the ABI, but I wonder if that patch does
> not already break it. I mean, how would react an application compiled
> for 18.02 if an EXTBUF is passed to it, knowing that many functions
> are inline?

Even if I add a shinfo field in rte_mbuf, I think it won't break the ABI. The
second cacheline is just 40B and this would simply make it 48B. Some code
(e.g. vPMD) might check the order/size of some fields in the struct, but if
the field is added at the end of the struct, it should be okay. And there's no
need to make a change in a C file for this.

> > +/**
> > + * Attach an external buffer to a mbuf.
> > + *
> > + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> > + * it, corresponding free callback function and its argument should be
> > + * provided. This callback function will be called once all the mbufs are
> > + * detached from the buffer.
> > + *
> > + * More mbufs can be attached to the same external buffer by
> > + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> > + * this API.
> > + *
> > + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> > + * ``rte_pktmbuf_detach()``.
> > + *
> > + * A few bytes in the trailer of the provided buffer will be dedicated for
> > + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt,
> > + * callback function and so on. The shared data can be referenced by
> > + * ``rte_mbuf_ext_shinfo()``
> > + *
> > + * Attaching an external buffer is quite similar to mbuf indirection in
> > + * replacing buffer addresses and length of a mbuf, but a few differences:
> > + * - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> > + *   mbuf must be read-only. But external buffer has its own refcnt and it
> > + *   starts from 1. Unless multiple mbufs are attached to a mbuf having an
> > + *   external buffer, the external buffer is writable.
> > + * - There's no need to allocate buffer from a mempool. Any buffer can be
> > + *   attached with appropriate free callback.
> > + * - Smaller metadata is required to maintain shared data such as refcnt.
> > + *
> > + * @warning
> > + * @b EXPERIMENTAL: This API may change without prior notice.
> > + * Once external buffer is enabled by allowing experimental API,
> > + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> > + * exclusive. A mbuf can be consiered direct if it is neither indirect nor
> 
> small typo:
> consiered -> considered

Will fix. Thanks.

> > + * having external buffer.
> > + *
> > + * @param m
> > + *   The pointer to the mbuf.
> > + * @param buf_addr
> > + *   The pointer to the external buffer we're attaching to.
> > + * @param buf_len
> > + *   The size of the external buffer we're attaching to. This must be larger
> > + *   than the size of ``struct rte_mbuf_ext_shared_info`` and padding for
> > + *   alignment. If not enough, this function will return NULL.
> > + * @param free_cb
> > + *   Free callback function to call when the external buffer needs to be freed.
> > + * @param fcb_opaque
> > + *   Argument for the free callback function.
> > + * @return
> > + *   A pointer to the new start of the data on success, return NULL otherwise.
> > + */
> > +static inline char * __rte_experimental
> > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > +	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
> > +	void *fcb_opaque)
> > +{
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> > +
> > +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
> > +			sizeof(uintptr_t));
> > +
> > +	if ((void *)shinfo <= buf_addr)
> > +		return NULL;
> > +
> > +	m->buf_addr = buf_addr;
> > +	m->buf_iova = rte_mempool_virt2iova(buf_addr);
> 
> Agree with Konstantin's comment. I think buf_iova should be an argument
> of the function.

Oops, that was my silly mistake. I just copied this block from
rte_pktmbuf_init(). Then I wanted to change it to rte_malloc_virt2iova() but I
forgot. I didn't realize it during my tests because mlx devices use virtual
addresses rather than iova.

If it takes iova as an argument instead, it can be faster, and it can use 'real'
external memory for packet DMA, e.g. the storage application you mentioned. I
mean, even if a buffer isn't allocated inside DPDK (doesn't belong to one of the
memseg lists), this should still work. Good suggestion!

> > +	m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> 
> Related to what I said above: I think m->buf_len should be set to
> the buf_len argument, so a user can point to existing read-only
> data.

I will make a change so the caller can pass in memory for the shared data. And
if shinfo is passed as NULL, it will still spare bytes at the end of the
buffer, just for convenience.

> [...]
> 
> Few more comments:
> 
> I think we still need to find a good way to advertise to the users
> if a mbuf is writable or readable. Today, the rules are quite implicit.
> There are surely some use cases where the mbuf is indirect but with
> only one active user, meaning it could be READ-WRITE. We could target
> 18.08 for this.

Right. That'll be very good to have.

> One side question about your implementation in mlx. I guess the
> hardware will write the mbuf data in a big contiguous buffer like
> this:
> 
> +-------+--------------+--------+--------------+--------+- - -
> |       |mbuf1 data    |        |mbuf2 data    |        |
> |       |              |        |              |        |
> +-------+--------------+--------+--------------+--------+- - -
> 
> Which will be transformed in:
> 
> +--+----+--------------+---+----+--------------+---+---+- - -
> |  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
> |  |room|              |inf|room|              |inf|   |
> +--+----+--------------+---+----+--------------+---+---+- - -
> 
> So, there is one shinfo (i.e one refcount) for each mbuf.
> How do you know when the big buffer is not used anymore?
 
 +--+----+--------------+---+----+--------------+---+---+- - -
 |  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
 |  |room|              |inf|room|              |inf|   |
 +--+----+--------------+---+----+--------------+---+---+- - -
  ^
  |
  Metadata for the whole chunk, having another refcnt managed by PMD.
  fcb_opaque will have this pointer so that the callback func knows it.

> To summarize, I like the idea of your patchset, this is close to
> what I had in mind... which does not necessarly mean it is the good
> way to do ;)
> 
> I'm a bit afraid about ABI breakage, we need to check that a
> 18.02-compiled application still works well with this change.

I had the same concern, so I made rte_pktmbuf_attach_extbuf() __rte_experimental.
Although this new ol_flag is introduced, it can only be set by the new API, and
the rest of the changes won't take effect unless this flag is set.
RTE_MBUF_HAS_EXTBUF() will always be false if -DALLOW_EXPERIMENTAL_API isn't
specified or rte_pktmbuf_attach_extbuf() isn't called. And there's no change
needed in a C file. For this reason, I don't think there's an ABI breakage.

Does that sound correct?

> About testing, I don't know if you checked the mbuf autotests,
> but it could also help to check that basic stuff still work.

I'll make sure all the tests pass before I submit a new version.


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (7 preceding siblings ...)
  2018-04-19  1:11 ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
@ 2018-04-24  1:38 ` Yongseok Koh
  2018-04-24  1:38   ` [PATCH v4 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
                     ` (2 more replies)
  2018-04-25  2:53 ` [PATCH v5 " Yongseok Koh
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24  1:38 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro,
	Yongseok Koh

This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but a few differences:
  - When an indirect mbuf is attached, refcnt of the direct mbuf would be
    2 as long as the direct mbuf itself isn't freed after the attachment.
    In such cases, the buffer area of a direct mbuf must be read-only. But
    external buffer has its own refcnt and it starts from 1. Unless
    multiple mbufs are attached to a mbuf having an external buffer, the
    external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---

** This patch can pass the mbuf_autotest. **

Submitting only non-mlx5 patches to meet the deadline for RC1. mlx5 patches
will be submitted separately, rebased on a different patchset which
accommodates the new memory hotplug design in the mlx PMDs.

v4:
* rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
  user can pass memory for shared data via shinfo argument.
* minor changes from review.

v3:
* implement external buffer attachment instead of introducing buf_off for
 mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 289 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 260 insertions(+), 29 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 06eceba37..7f6507a66 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -326,7 +326,7 @@ extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -566,8 +566,24 @@ struct rte_mbuf {
 	/** Sequence number. See also rte_reorder_insert(). */
 	uint32_t seqn;
 
+	struct rte_mbuf_ext_shared_info *shinfo;
+
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
 }
 
 /**
+ * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
+ * otherwise.
+ *
+ * If a mbuf has its data in another mbuf and references it by mbuf
+ * indirection, this mbuf can be defined as a cloned mbuf.
+ */
+#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
  */
-#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
+#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
+
+/**
+ * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
+ *
+ * External buffer is a user-provided anonymous buffer.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
 
 /**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
+ *
+ * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
+ * can be defined as a direct mbuf.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -821,6 +856,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
+		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1195,11 +1282,122 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Attach an external buffer to a mbuf.
+ *
+ * User-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, corresponding free callback function and its argument should be
+ * provided. This callback function will be called once all the mbufs are
+ * detached from the buffer.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * Memory for shared data can be provided by shinfo argument. If shinfo is NULL,
+ * a few bytes in the trailer of the provided buffer will be dedicated for
+ * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt, callback
+ * function and so on. The pointer of shared data will be stored in m->shinfo.
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing buffer addresses and length of a mbuf, but a few differences:
+ * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
+ *   2 as long as the direct mbuf itself isn't freed after the attachment.
+ *   In such cases, the buffer area of a direct mbuf must be read-only. But
+ *   external buffer has its own refcnt and it starts from 1. Unless
+ *   multiple mbufs are attached to a mbuf having an external buffer, the
+ *   external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback and its IO address.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect nor
+ * having external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer we're attaching to.
+ * @param buf_iova
+ *   IO address of the external buffer we're attaching to.
+ * @param buf_len
+ *   The size of the external buffer we're attaching to. If memory for
+ *   shared data is not provided, buf_len must be larger than the size of
+ *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
+ *   enough, this function will return NULL.
+ * @param shinfo
+ *   User-provided memory for shared data. If NULL, a few bytes in the
+ *   trailer of the provided buffer will be dedicated for shared data.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be
+ *   freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ *
+ * @return
+ *   A pointer to the new start of the data on success, return NULL
+ *   otherwise.
+ */
+static inline char * __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	rte_iova_t buf_iova, uint16_t buf_len,
+	struct rte_mbuf_ext_shared_info *shinfo,
+	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
+{
+	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = buf_iova;
+
+	if (shinfo == NULL) {
+		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
+					sizeof(*shinfo)), sizeof(uintptr_t));
+		if ((void *)shinfo <= buf_addr)
+			return NULL;
+
+		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+	} else {
+		m->buf_len = buf_len;
+	}
+
+	m->data_len = 0;
+
+	rte_pktmbuf_reset_headroom(m);
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+	m->shinfo = shinfo;
+
+	rte_mbuf_ext_refcnt_set(shinfo, 1);
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+
+	return (char *)m->buf_addr + m->data_off;
+}
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
+ * If the mbuf we are attaching to isn't a direct buffer and is attached to
+ * an external buffer, the mbuf being attached will be attached to the
+ * external buffer instead of mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer the mbuf we attached as 'indirect', while mbuf we attached to as
+ * 'direct'.  The direct mbuf's reference counter is incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1213,19 +1411,18 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
+		mi->ol_flags = m->ol_flags;
+	} else {
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1241,7 +1438,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1250,12 +1446,50 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+	RTE_ASSERT(m->shinfo != NULL);
+
+	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
+		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
  *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1264,10 +1498,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1279,13 +1517,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
 
 /**
@@ -1309,7 +1540,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1321,7 +1552,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-24  1:38 ` [PATCH v4 " Yongseok Koh
@ 2018-04-24  1:38   ` Yongseok Koh
  2018-04-24  5:01   ` [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf Stephen Hemminger
  2018-04-24 12:28   ` Andrew Rybchenko
  2 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24  1:38 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro,
	Yongseok Koh

If a PMD delivers Rx packets in non-direct mbufs, ol_flags should not be
overwritten. For the mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets
could be externally attached mbufs.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 3 +++
 app/test-pmd/macfwd.c   | 3 +++
 app/test-pmd/macswap.c  | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5f5ab64aa..bb0b675a8 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -770,6 +770,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (!RTE_MBUF_DIRECT(m))
+			tx_ol_flags |= m->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..ba0021194 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..b8d15f6ba 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,9 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-23 11:53   ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
@ 2018-04-24  2:04     ` Yongseok Koh
  2018-04-25 13:16       ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24  2:04 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, adrien.mazarguil,
	nelio.laranjeiro

On Mon, Apr 23, 2018 at 11:53:04AM +0000, Ananyev, Konstantin wrote:
[...]
> > @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > 
> >  /**
> > + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> > + */
> > +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > +
> > +/**
> >   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> >   */
> > -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
> 
> As a nit:
> RTE_MBUF_DIRECT(mb)  (((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)) == 0)

It was for better readability and I expected the compiler to generate the
same code. But if you still prefer it that way, I can change it.

[...]
> > +static inline char * __rte_experimental
> > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > +	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
> > +	void *fcb_opaque)
> > +{
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> > +
> > +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
> > +			sizeof(uintptr_t));
> > +
> > +	if ((void *)shinfo <= buf_addr)
> > +		return NULL;
> > +
> > +	m->buf_addr = buf_addr;
> > +	m->buf_iova = rte_mempool_virt2iova(buf_addr);
> 
> 
> That wouldn't work for arbitrary extern buffer.
> Only for the one that is an element in some other mempool.
> For arbitrary external buffer - callee has to provide PA for it plus guarantee that
> it's VA would be locked down.
> From other side - if your intention is just to use only elements of other mempools -
> No need to have free_cb(). mempool_put should do.

Of course, I didn't mean that. That was a mistake. Please refer to my reply to
Olivier.

[...]
> >  /**
> > - * Detach an indirect packet mbuf.
> > + * @internal used by rte_pktmbuf_detach().
> > + *
> > + * Decrement the reference counter of the external buffer. When the
> > + * reference counter becomes 0, the buffer is freed by pre-registered
> > + * callback.
> > + */
> > +static inline void
> > +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> > +{
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> > +
> > +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> > +
> > +	shinfo = rte_mbuf_ext_shinfo(m);
> > +
> > +	if (rte_extbuf_refcnt_update(shinfo, -1) == 0)
> > +		shinfo->free_cb(m->buf_addr, shinfo->fcb_opaque);
> 
> 
> I understand the reason, but an extra function call for each external mbuf
> seems quite expensive. I wonder whether it is possible to group them somehow
> and amortize the cost?

Good point. I thought about it today.

Compared to the regular mbuf, there are maybe three differences: a) the free
function isn't inlined but is a real branch; b) there's no help from a
core-local cache like the mempool's; c) there's no bulk-free function like
rte_mempool_put_bulk(). But these look quite costly and complicated just for
external buffer attachment.

For example, to free them in bulk, external buffers would have to be grouped,
as the buffers could have different callback functions. To do that, I would
have to add an API to pre-register an external buffer group and prepare
resources for the bulk free. Buffers then couldn't be anonymous anymore but
would have to be registered in advance. In that case, it would be better to
use the existing APIs, especially when a user wants high throughput...

Let me know if you have a better idea of how to implement it; I'll gladly
take it. Or we can push improvement patches in the next releases.


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24  1:38 ` [PATCH v4 " Yongseok Koh
  2018-04-24  1:38   ` [PATCH v4 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-24  5:01   ` Stephen Hemminger
  2018-04-24 11:47     ` Yongseok Koh
  2018-04-24 12:28   ` Andrew Rybchenko
  2 siblings, 1 reply; 86+ messages in thread
From: Stephen Hemminger @ 2018-04-24  5:01 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, olivier.matz, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

On Mon, 23 Apr 2018 18:38:53 -0700
Yongseok Koh <yskoh@mellanox.com> wrote:

> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>     2 as long as the direct mbuf itself isn't freed after the attachment.
>     In such cases, the buffer area of a direct mbuf must be read-only. But
>     external buffer has its own refcnt and it starts from 1. Unless
>     multiple mbufs are attached to a mbuf having an external buffer, the
>     external buffer is writable.
>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

I think this is a good idea. It looks more useful than indirect mbufs for
the use case where received data needs to come from a non-mempool area.

Does it have any performance impact? I would hope it doesn't impact
applications not using external buffers.

Is it possible to start with a refcnt > 1 for the mbuf?  I am thinking
of the case in netvsc where data is received into an area returned
from the host. The area is an RNDIS buffer and may contain several
packets.  A useful optimization would be for the driver return
mbufs which point to that buffer where starting refcnt value
is the number of packets in the buffer.  When refcnt goes to
0 the buffer would be returned to the host.

One other problem with this is that it adds an additional buffer
management constraint on the application. If for example the
mbuf's are going into a TCP stack and TCP can have very slow
readers; then the receive buffer might have a long lifetime.
Since the receive buffers are limited, eventually the receive
area runs out and no more packets are received. Much finger-pointing and
angry users ensue.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24  5:01   ` [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf Stephen Hemminger
@ 2018-04-24 11:47     ` Yongseok Koh
  0 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24 11:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: wenzhuo.lu, jingjing.wu, olivier.matz, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

On Mon, Apr 23, 2018 at 10:01:07PM -0700, Stephen Hemminger wrote:
> On Mon, 23 Apr 2018 18:38:53 -0700
> Yongseok Koh <yskoh@mellanox.com> wrote:
> 
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> >     2 as long as the direct mbuf itself isn't freed after the attachment.
> >     In such cases, the buffer area of a direct mbuf must be read-only. But
> >     external buffer has its own refcnt and it starts from 1. Unless
> >     multiple mbufs are attached to a mbuf having an external buffer, the
> >     external buffer is writable.
> >   - There's no need to allocate buffer from a mempool. Any buffer can be
> >     attached with appropriate free callback.
> >   - Smaller metadata is required to maintain shared data such as refcnt.
> > 
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> 
> I think this is a good idea. It looks more useful than indirect mbuf's for
> the use case where received data needs to come from a non mempool area.

Actually it was Olivier's idea and I just implemented it for my needs. :-)

> Does it have any performance impact? I would hope it doesn't impact
> applications not using external buffers.

It should have little. The only change that can impact regular cases is in
rte_pktmbuf_prefree_seg(). This critical path inlines rte_pktmbuf_detach(),
which becomes a little longer: a few more instructions to update the refcnt
and branch to the user-provided callback. In io fwd of testpmd with a single
core, I'm not seeing any noticeable drop.

> Is it possible to start with a refcnt > 1 for the mbuf?  I am thinking
> of the case in netvsc where data is received into an area returned
> from the host. The area is an RNDIS buffer and may contain several
> packets.  A useful optimization would be for the driver return
> mbufs which point to that buffer where starting refcnt value
> is the number of packets in the buffer.  When refcnt goes to
> 0 the buffer would be returned to the host.

That's actually my use-case for the mlx5 PMD. The mlx5 device supports
"Multi-Packet Rx Queue": it can pack multiple packets into a single Rx buffer
to reduce the PCIe overhead of control transactions. This is also quite
common for FPGA-based NICs. What I've done is allocate a big buffer (from a
PMD-private mempool) and reserve space at its head to store metadata managing
another refcnt, which gets decremented by the registered callback function.
The callback function frees the whole chunk once the refcnt reaches zero.

+--+----+--------------+---+----+--------------+---+---+- - -
|  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
|  |room|              |inf|room|              |inf|   |
+--+----+--------------+---+----+--------------+---+---+- - -
 ^
 |
 Metadata for the whole chunk, having another refcnt managed by PMD.
 fcb_opaque will have this pointer so that the callback func knows it.
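
As a standalone sketch of that chunk-level refcnt scheme (all names here are
illustrative, not real DPDK API; it models only the metadata bookkeeping that
the per-mbuf free callback would do, not the actual rte_mbuf attachment):

```c
#include <assert.h>
#include <stddef.h>

/* Metadata the PMD would place at the head of the big chunk. */
struct chunk_meta {
	int refcnt;	/* one count per packet carved from the chunk */
	int freed;	/* set once the whole chunk is returned */
};

/* Initialize the chunk for npkts packets: refcnt starts at npkts,
 * one count per mbuf that will reference the chunk. */
static void chunk_init(struct chunk_meta *meta, int npkts)
{
	meta->refcnt = npkts;
	meta->freed = 0;
}

/* Free callback a PMD could register for each attached mbuf;
 * fcb_opaque points at the chunk-level metadata. */
static void chunk_free_cb(void *buf_addr, void *fcb_opaque)
{
	struct chunk_meta *meta = fcb_opaque;

	(void)buf_addr;
	if (--meta->refcnt == 0)
		meta->freed = 1;	/* real code would return the chunk here */
}
```

The chunk is returned only when the last mbuf pointing into it is freed,
regardless of the order in which the packets are consumed.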

> One other problem with this is that it adds an additional buffer
> management constraint on the application. If for example the
> mbuf's are going into a TCP stack and TCP can have very slow
> readers; then the receive buffer might have a long lifetime.
> Since the receive buffers are limited, eventually the receive
> area runs out and no more packets are received. Much fingerpointing
> and angry users ensue..

In such a case (buffer depletion), I memcpy the Rx packet into the mbuf
instead of attaching it, until buffers become available again. It seems an
unavoidable penalty, but it's better than dropping packets.
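
A minimal sketch of that fallback decision (hypothetical names; it models
only the attach-vs-copy choice, not the real Rx burst path):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum rx_mode { RX_ATTACHED, RX_COPIED };

/* Deliver one Rx packet: attach by reference while chunk buffers are
 * available, otherwise copy the payload into the mbuf's own data room
 * so the packet is never dropped. */
static enum rx_mode deliver_pkt(int chunks_available,
				const char *pkt, size_t len,
				char *mbuf_room, const char **data_out)
{
	if (chunks_available) {
		*data_out = pkt;		/* zero-copy: point into the chunk */
		return RX_ATTACHED;
	}
	memcpy(mbuf_room, pkt, len);		/* depletion: fall back to copy */
	*data_out = mbuf_room;
	return RX_COPIED;
}
```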


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24  1:38 ` [PATCH v4 " Yongseok Koh
  2018-04-24  1:38   ` [PATCH v4 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
  2018-04-24  5:01   ` [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf Stephen Hemminger
@ 2018-04-24 12:28   ` Andrew Rybchenko
  2018-04-24 16:02     ` Olivier Matz
  2018-04-24 22:30     ` Yongseok Koh
  2 siblings, 2 replies; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-24 12:28 UTC (permalink / raw)
  To: Yongseok Koh, wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro

On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
>
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>      2 as long as the direct mbuf itself isn't freed after the attachment.
>      In such cases, the buffer area of a direct mbuf must be read-only. But
>      external buffer has its own refcnt and it starts from 1. Unless
>      multiple mbufs are attached to a mbuf having an external buffer, the
>      external buffer is writable.
>    - There's no need to allocate buffer from a mempool. Any buffer can be
>      attached with appropriate free callback.
>    - Smaller metadata is required to maintain shared data such as refcnt.

Really useful. Many thanks. See my notes below.

It worries me that detach is more expensive than really required, since it
has to restore the mbuf as direct. If the mbuf mempool is used only for mbufs
serving as headers for external buffers, all these actions are useless.

> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>
> ** This patch can pass the mbuf_autotest. **
>
> Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> will be submitted separately rebased on a different patchset which
> accommodates new memory hotplug design to mlx PMDs.
>
> v4:
> * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
>    user can pass memory for shared data via shinfo argument.
> * minor changes from review.
>
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>   mbuf indirection.
>
>   lib/librte_mbuf/rte_mbuf.h | 289 ++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 260 insertions(+), 29 deletions(-)
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 06eceba37..7f6507a66 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -326,7 +326,7 @@ extern "C" {
>   		PKT_TX_MACSEC |		 \
>   		PKT_TX_SEC_OFFLOAD)
>   
> -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> +#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */

Maybe it should mention that shinfo is filled in.

>   
>   #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
>   
> @@ -566,8 +566,24 @@ struct rte_mbuf {
>   	/** Sequence number. See also rte_reorder_insert(). */
>   	uint32_t seqn;
>   
> +	struct rte_mbuf_ext_shared_info *shinfo;

I think it would be useful to add a comment that it is used only in the
RTE_MBUF_HAS_EXTBUF() case.

> +
>   } __rte_cache_aligned;
>   
> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
> +};
> +
>   /**< Maximum number of nb_segs allowed. */
>   #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
>   
> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>   }
>   
>   /**
> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> + * otherwise.
> + *
> + * If a mbuf has its data in another mbuf and references it by mbuf
> + * indirection, this mbuf can be defined as a cloned mbuf.
> + */
> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +
> +/**
>    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>    */
> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)

It is still confusing that INDIRECT != !DIRECT.
Maybe we have no good options right now, but I'd suggest at least deprecating
RTE_MBUF_INDIRECT() and completely removing it in the next release.

> +
> +/**
> + * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
> + *
> + * External buffer is a user-provided anonymous buffer.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
>   
>   /**
>    * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> + *
> + * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
> + * can be defined as a direct mbuf.
>    */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))

[...]

> @@ -1195,11 +1282,122 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   }
>   
>   /**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * Memory for shared data can be provided by shinfo argument. If shinfo is NULL,
> + * a few bytes in the trailer of the provided buffer will be dedicated for
> + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt, callback
> + * function and so on. The pointer of shared data will be stored in m->shinfo.
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> + *   external buffer has its own refcnt and it starts from 1. Unless
> + *   multiple mbufs are attached to a mbuf having an external buffer, the
> + *   external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback and its IO address.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_iova
> + *   IO address of the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. If memory for
> + *   shared data is not provided, buf_len must be larger than the size of
> + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> + *   enough, this function will return NULL.
> + * @param shinfo
> + *   User-provided memory for shared data. If NULL, a few bytes in the
> + *   trailer of the provided buffer will be dedicated for shared data.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be
> + *   freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + *
> + * @return
> + *   A pointer to the new start of the data on success, return NULL
> + *   otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	rte_iova_t buf_iova, uint16_t buf_len,
> +	struct rte_mbuf_ext_shared_info *shinfo,
> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> +{
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);

May I suggest moving it inside the if (shinfo == NULL) branch to make it
clear that it is not used when a shinfo pointer is provided.

> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = buf_iova;
> +
> +	if (shinfo == NULL) {
> +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> +					sizeof(*shinfo)), sizeof(uintptr_t));
> +		if ((void *)shinfo <= buf_addr)
> +			return NULL;
> +
> +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> +	} else {
> +		m->buf_len = buf_len;
> +	}
> +
> +	m->data_len = 0;
> +
> +	rte_pktmbuf_reset_headroom(m);

I would suggest making data_off one more parameter.
If I have a buffer with data that I'd like to attach to an mbuf, I'd like to
control data_off.

> +	m->ol_flags |= EXT_ATTACHED_MBUF;
> +	m->shinfo = shinfo;
> +
> +	rte_mbuf_ext_refcnt_set(shinfo, 1);

Why is assignment used here? Can't we attach an extbuf that is already
attached to another mbuf?
Maybe shinfo should be initialized only if it is not provided (shinfo == NULL
on input)?

> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;
> +
> +	return (char *)m->buf_addr + m->data_off;
> +}
> +
> +/**
> + * Detach the external buffer attached to a mbuf, same as
> + * ``rte_pktmbuf_detach()``
> + *
> + * @param m
> + *   The mbuf having external buffer.
> + */
> +#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
> +
> +/**
>    * Attach packet mbuf to another packet mbuf.
>    *
> - * After attachment we refer the mbuf we attached as 'indirect',
> - * while mbuf we attached to as 'direct'.
> - * The direct mbuf's reference counter is incremented.
> + * If the mbuf we are attaching to isn't a direct buffer and is attached to
> + * an external buffer, the mbuf being attached will be attached to the
> + * external buffer instead of mbuf indirection.
> + *
> + * Otherwise, the mbuf will be indirectly attached. After attachment we
> + * refer the mbuf we attached as 'indirect', while mbuf we attached to as
> + * 'direct'.  The direct mbuf's reference counter is incremented.
>    *
>    * Right now, not supported:
>    *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
> @@ -1213,19 +1411,18 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>    */
>   static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   {
> -	struct rte_mbuf *md;
> -
>   	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
>   	    rte_mbuf_refcnt_read(mi) == 1);
>   
> -	/* if m is not direct, get the mbuf that embeds the data */
> -	if (RTE_MBUF_DIRECT(m))
> -		md = m;
> -	else
> -		md = rte_mbuf_from_indirect(m);
> +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> +		mi->ol_flags = m->ol_flags;
> +	} else {
> +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);

It looks like handling of the direct mbuf is lost here. Maybe it is
intentional, to avoid branching since the result will be the same for a
direct mbuf as well, but it looks confusing. It deserves at least a comment
explaining why. Ideally it should be backed by measurements.

> +		mi->priv_size = m->priv_size;
> +		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
> +	}
>   
> -	rte_mbuf_refcnt_update(md, 1);
> -	mi->priv_size = m->priv_size;
>   	mi->buf_iova = m->buf_iova;
>   	mi->buf_addr = m->buf_addr;
>   	mi->buf_len = m->buf_len;
> @@ -1241,7 +1438,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   	mi->next = NULL;
>   	mi->pkt_len = mi->data_len;
>   	mi->nb_segs = 1;
> -	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
>   	mi->packet_type = m->packet_type;
>   	mi->timestamp = m->timestamp;
>   
> @@ -1250,12 +1446,50 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   }
>   
>   /**
> - * Detach an indirect packet mbuf.
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the reference counter of the external buffer. When the
> + * reference counter becomes 0, the buffer is freed by pre-registered
> + * callback.
> + */
> +static inline void
> +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> +{
> +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> +	RTE_ASSERT(m->shinfo != NULL);
> +
> +	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
> +		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
> +}
> +
> +/**
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the direct mbuf's reference counter. When the reference
> + * counter becomes 0, the direct mbuf is freed.
> + */
> +static inline void
> +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf *md = rte_mbuf_from_indirect(m);

Shouldn't it be done after the assertion below, just to be less confusing?

> +
> +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> +
> +	if (rte_mbuf_refcnt_update(md, -1) == 0) {

It is not directly related to the changeset, but rte_pktmbuf_prefree_seg()
has many optimizations which could be useful here:
  - do not update the refcnt if it is 1
  - do not set next/nb_segs if next is already NULL

> +		md->next = NULL;
> +		md->nb_segs = 1;
> +		rte_mbuf_refcnt_set(md, 1);
> +		rte_mbuf_raw_free(md);
> +	}
> +}

[...]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24  1:29     ` Yongseok Koh
@ 2018-04-24 15:36       ` Olivier Matz
  0 siblings, 0 replies; 86+ messages in thread
From: Olivier Matz @ 2018-04-24 15:36 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

Hi,

On Mon, Apr 23, 2018 at 06:29:57PM -0700, Yongseok Koh wrote:
> On Mon, Apr 23, 2018 at 06:18:43PM +0200, Olivier Matz wrote:
> > I'm a bit afraid about ABI breakage, we need to check that a
> > 18.02-compiled application still works well with this change.
> 
> I had the same concern so I made rte_pktmbuf_attach_extbuf() __rte_experimental.
> Although this new ol_flag is introduced, it can only be set by the new API and
> the rest of changes won't be effective unless this flag is set.
> RTE_MBUF_HAS_EXTBUF() will always be false if -DALLOW_EXPERIMENTAL_API isn't
> specified or rte_pktmbuf_attach_extbuf() isn't called. And there's no change
> needed in a C file. For this reason, I don't think there's ABI breakage.
> 
> Sounds correct?

Hmm, imagine you compile an application on top of 18.02.
Then, you update your dpdk libraries to 18.05.

The mlx driver may send mbufs pointing to an external buffer to the
application. When the application will call the mbuf free function, it
will probably not do the expected work, because most of the functions
involved are inline. So, to me this is an ABI breakage.

This is not a technical issue, since the ABI of mbuf will already be
broken this release (control mbuf removed). This is more a process
question, because an ABI breakage and its area should be announced.

Olivier

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 12:28   ` Andrew Rybchenko
@ 2018-04-24 16:02     ` Olivier Matz
  2018-04-24 18:21       ` Andrew Rybchenko
  2018-04-25  8:28       ` Olivier Matz
  2018-04-24 22:30     ` Yongseok Koh
  1 sibling, 2 replies; 86+ messages in thread
From: Olivier Matz @ 2018-04-24 16:02 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Yongseok Koh, wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

Hi Andrew, Yongseok,

On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> >      2 as long as the direct mbuf itself isn't freed after the attachment.
> >      In such cases, the buffer area of a direct mbuf must be read-only. But
> >      external buffer has its own refcnt and it starts from 1. Unless
> >      multiple mbufs are attached to a mbuf having an external buffer, the
> >      external buffer is writable.
> >    - There's no need to allocate buffer from a mempool. Any buffer can be
> >      attached with appropriate free callback.
> >    - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Really useful. Many thanks. See my notes below.
> 
> It worries me that detach is more expensive than really required, since it
> has to restore the mbuf as direct. If the mbuf mempool is used only for
> mbufs serving as headers for external buffers, all these actions are
> useless.

I agree on the principle. And we have the same issue with indirect mbuf.
Currently, the assumption is that a free mbuf (inside a mempool) is
initialized as a direct mbuf. We can think about optimizations here,
but I'm not sure it should be in this patchset.

[...]

> > @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >   }
> >   /**
> > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > + * otherwise.
> > + *
> > + * If a mbuf has its data in another mbuf and references it by mbuf
> > + * indirection, this mbuf can be defined as a cloned mbuf.
> > + */
> > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > +
> > +/**
> >    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> >    */
> > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> 
> It is still confusing that INDIRECT != !DIRECT.
> May be we have no good options right now, but I'd suggest to at least
> deprecate
> RTE_MBUF_INDIRECT() and completely remove it in the next release.

Agree. I may have missed something, but is my previous suggestion
not doable?

- direct = embeds its own data      (and indirect = !direct)
- clone (or another name) = data is another mbuf
- extbuf = data is in an external buffer

Deprecating the macro is a good idea.

> > +	m->buf_addr = buf_addr;
> > +	m->buf_iova = buf_iova;
> > +
> > +	if (shinfo == NULL) {
> > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > +		if ((void *)shinfo <= buf_addr)
> > +			return NULL;
> > +
> > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > +	} else {
> > +		m->buf_len = buf_len;
> > +	}
> > +
> > +	m->data_len = 0;
> > +
> > +	rte_pktmbuf_reset_headroom(m);
> 
> I would suggest to make data_off one more parameter.
> If I have a buffer with data which I'd like to attach to an mbuf, I'd like
> to control data_off.

Another option is to set the headroom to 0.
Because after attaching the mbuf to an external buffer, we will
still need to set the length.

A user can do something like this:

	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
		free_cb, free_cb_arg);
	rte_pktmbuf_append(m, data_len + headroom);
	rte_pktmbuf_adj(m, headroom);
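
To make the arithmetic concrete, here is a minimal, self-contained model of
that sequence in plain C. The struct and function names are illustrative
stand-ins, not the real rte_mbuf API; the point is that append() followed by
adj() leaves data_off fully under the caller's control.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the mbuf fields involved; this is an
 * illustrative model, not the real struct rte_mbuf. */
struct mini_mbuf {
	uint16_t data_off;  /* start of data within the buffer */
	uint16_t data_len;  /* length of data in this segment */
	uint16_t buf_len;   /* total buffer length */
};

/* Model of attaching with the headroom forced to 0, as proposed. */
static void attach_extbuf_model(struct mini_mbuf *m, uint16_t buf_len)
{
	m->buf_len = buf_len;
	m->data_off = 0;
	m->data_len = 0;
}

/* Model of rte_pktmbuf_append(): grows data_len at the tail. */
static int append_model(struct mini_mbuf *m, uint16_t len)
{
	if ((uint32_t)m->data_off + m->data_len + len > m->buf_len)
		return -1;
	m->data_len += len;
	return 0;
}

/* Model of rte_pktmbuf_adj(): strips bytes at the head, which
 * advances data_off and shrinks data_len. */
static int adj_model(struct mini_mbuf *m, uint16_t len)
{
	if (len > m->data_len)
		return -1;
	m->data_off += len;
	m->data_len -= len;
	return 0;
}
```

With a headroom of 128 and 1000 bytes of data, append(1128) followed by
adj(128) yields data_off == 128 and data_len == 1000.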

> 
> > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > +	m->shinfo = shinfo;
> > +
> > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> 
> Why is assignment used here? Cannot we attach extbuf already attached to
> other mbuf?

In rte_pktmbuf_attach(), this is true. That's not illogical to
keep the same approach here. Maybe an assert could be added?

> May be shinfo should be initialized only if it is not provided (shinfo ==
> NULL on input)?

I don't get why, can you explain please?


Thanks,
Olivier

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 16:02     ` Olivier Matz
@ 2018-04-24 18:21       ` Andrew Rybchenko
  2018-04-24 19:15         ` Olivier Matz
  2018-04-25  8:28       ` Olivier Matz
  1 sibling, 1 reply; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-24 18:21 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Yongseok Koh, wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

On 04/24/2018 07:02 PM, Olivier Matz wrote:
> Hi Andrew, Yongseok,
>
> On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
>> On 04/24/2018 04:38 AM, Yongseok Koh wrote:
>>> This patch introduces a new way of attaching an external buffer to a mbuf.
>>>
>>> Attaching an external buffer is quite similar to mbuf indirection in
>>> replacing buffer addresses and length of a mbuf, but a few differences:
>>>     - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>>>       2 as long as the direct mbuf itself isn't freed after the attachment.
>>>       In such cases, the buffer area of a direct mbuf must be read-only. But
>>>       external buffer has its own refcnt and it starts from 1. Unless
>>>       multiple mbufs are attached to a mbuf having an external buffer, the
>>>       external buffer is writable.
>>>     - There's no need to allocate buffer from a mempool. Any buffer can be
>>>       attached with appropriate free callback.
>>>     - Smaller metadata is required to maintain shared data such as refcnt.
>> Really useful. Many thanks. See my notes below.
>>
>> It worries me that detach is more expensive than really required, since it
>> requires restoring the mbuf as direct. If the mbuf mempool is used for mbufs
>> as headers for external buffers only, all these actions are absolutely
>> useless.
> I agree on the principle. And we have the same issue with indirect mbuf.
> Currently, the assumption is that a free mbuf (inside a mempool) is
> initialized as a direct mbuf. We can think about optimizations here,
> but I'm not sure it should be in this patchset.

I agree that it should be addressed separately.

> [...]
>
>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>>>    }
>>>    /**
>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
>>> + * otherwise.
>>> + *
>>> + * If a mbuf has its data in another mbuf and references it by mbuf
>>> + * indirection, this mbuf can be defined as a cloned mbuf.
>>> + */
>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>> +
>>> +/**
>>>     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>>>     */
>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
>> It is still confusing that INDIRECT != !DIRECT.
>> May be we have no good options right now, but I'd suggest to at least
>> deprecate
>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
> Agree. I may have missed something, but is my previous suggestion
> not doable?
>
> - direct = embeds its own data      (and indirect = !direct)
> - clone (or another name) = data is another mbuf
> - extbuf = data is in an external buffer

I guess the problem is that it changes the INDIRECT semantics since EXTBUF
is added as well. I think strictly speaking it is an API change.
Is it OK to make it without an announcement?

> Deprecating the macro is a good idea.
>
>>> +	m->buf_addr = buf_addr;
>>> +	m->buf_iova = buf_iova;
>>> +
>>> +	if (shinfo == NULL) {
>>> +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
>>> +					sizeof(*shinfo)), sizeof(uintptr_t));
>>> +		if ((void *)shinfo <= buf_addr)
>>> +			return NULL;
>>> +
>>> +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
>>> +	} else {
>>> +		m->buf_len = buf_len;
>>> +	}
>>> +
>>> +	m->data_len = 0;
>>> +
>>> +	rte_pktmbuf_reset_headroom(m);
>> I would suggest to make data_off one more parameter.
>> If I have a buffer with data which I'd like to attach to an mbuf, I'd like
>> to control data_off.
> Another option is to set the headroom to 0.
> Because after attaching the mbuf to an external buffer, we will
> still need to set the length.
>
> A user can do something like this:
>
> 	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
> 		free_cb, free_cb_arg);
> 	rte_pktmbuf_append(m, data_len + headroom);
> 	rte_pktmbuf_adj(m, headroom);
>
>>> +	m->ol_flags |= EXT_ATTACHED_MBUF;
>>> +	m->shinfo = shinfo;
>>> +
>>> +	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> Why is assignment used here? Cannot we attach extbuf already attached to
>> other mbuf?
> In rte_pktmbuf_attach(), this is true. That's not illogical to
> keep the same approach here. Maybe an assert could be added?
>
>> May be shinfo should be initialized only if it is not provided (shinfo ==
>> NULL on input)?
> I don't get why, can you explain please?

Maybe I misunderstand how it should look when one huge buffer
is partitioned. I thought there should be only one shinfo per huge buffer
to control when it is no longer used by any mbufs with extbuf.

The other option is to have a shinfo per small buf plus a reference counter
per huge buf (which is decremented when the small buf reference counter
becomes zero and the free callback is executed). I guess that is assumed above.
My fear is that there are too many reference counters:
  1. mbuf reference counter
  2. small buf reference counter
  3. huge buf reference counter
Maybe it is possible to use (1) for (2) as well?
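
The second option can be sketched as follows; all names are hypothetical
stand-ins, not DPDK API. The small buf's free callback drops one huge-buf
reference, so the huge buffer is released only when its last slice is.

```c
#include <assert.h>

/* Rough sketch of the two-level scheme described above. Each small
 * buf carries its own shinfo refcnt; its free callback decrements
 * the huge-buf refcnt, and the huge buf is released when that
 * counter reaches zero. */
struct huge_buf {
	int refcnt;   /* one count per live small buf */
	int released; /* set once the whole buffer is freed */
};

struct small_buf_shinfo {
	int refcnt;             /* mbufs attached to this small buf */
	struct huge_buf *huge;
};

/* Free callback of a small buf: drop one huge-buf reference. */
static void small_buf_free_cb(struct small_buf_shinfo *si)
{
	if (--si->huge->refcnt == 0)
		si->huge->released = 1;
}

/* One attached mbuf detaches from the small buf. */
static void small_buf_detach(struct small_buf_shinfo *si)
{
	if (--si->refcnt == 0)
		small_buf_free_cb(si);
}
```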

Andrew.


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 18:21       ` Andrew Rybchenko
@ 2018-04-24 19:15         ` Olivier Matz
  2018-04-24 20:22           ` Thomas Monjalon
  2018-04-24 23:34           ` Yongseok Koh
  0 siblings, 2 replies; 86+ messages in thread
From: Olivier Matz @ 2018-04-24 19:15 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Yongseok Koh, wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro, Thomas Monjalon

On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > Hi Andrew, Yongseok,
> > 
> > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > This patch introduces a new way of attaching an external buffer to a mbuf.
> > > > 
> > > > Attaching an external buffer is quite similar to mbuf indirection in
> > > > replacing buffer addresses and length of a mbuf, but a few differences:
> > > >     - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> > > >       2 as long as the direct mbuf itself isn't freed after the attachment.
> > > >       In such cases, the buffer area of a direct mbuf must be read-only. But
> > > >       external buffer has its own refcnt and it starts from 1. Unless
> > > >       multiple mbufs are attached to a mbuf having an external buffer, the
> > > >       external buffer is writable.
> > > >     - There's no need to allocate buffer from a mempool. Any buffer can be
> > > >       attached with appropriate free callback.
> > > >     - Smaller metadata is required to maintain shared data such as refcnt.
> > > Really useful. Many thanks. See my notes below.
> > > 
> > > It worries me that detach is more expensive than really required, since it
> > > requires restoring the mbuf as direct. If the mbuf mempool is used for mbufs
> > > as headers for external buffers only, all these actions are absolutely
> > > useless.
> > I agree on the principle. And we have the same issue with indirect mbuf.
> > Currently, the assumption is that a free mbuf (inside a mempool) is
> > initialized as a direct mbuf. We can think about optimizations here,
> > but I'm not sure it should be in this patchset.
> 
> I agree that it should be addressed separately.
> 
> > [...]
> > 
> > > > @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > > >    }
> > > >    /**
> > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > + * otherwise.
> > > > + *
> > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > + */
> > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > +
> > > > +/**
> > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > >     */
> > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > It is still confusing that INDIRECT != !DIRECT.
> > > May be we have no good options right now, but I'd suggest to at least
> > > deprecate
> > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > Agree. I may have missed something, but is my previous suggestion
> > not doable?
> > 
> > - direct = embeds its own data      (and indirect = !direct)
> > - clone (or another name) = data is another mbuf
> > - extbuf = data is in an external buffer
> 
> I guess the problem is that it changes the INDIRECT semantics since EXTBUF
> is added as well. I think strictly speaking it is an API change.
> Is it OK to make it without an announcement?

In any case, there will be an ABI change, because an application
compiled for 18.02 will not be able to handle this new kind of
mbuf.

So unfortunately yes, I think this kind of change should first be
announced.

Thomas, what do you think?


> > Deprecating the macro is a good idea.
> > 
> > > > +	m->buf_addr = buf_addr;
> > > > +	m->buf_iova = buf_iova;
> > > > +
> > > > +	if (shinfo == NULL) {
> > > > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > > > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > > > +		if ((void *)shinfo <= buf_addr)
> > > > +			return NULL;
> > > > +
> > > > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > > > +	} else {
> > > > +		m->buf_len = buf_len;
> > > > +	}
> > > > +
> > > > +	m->data_len = 0;
> > > > +
> > > > +	rte_pktmbuf_reset_headroom(m);
> > > I would suggest to make data_off one more parameter.
> > > If I have a buffer with data which I'd like to attach to an mbuf, I'd like
> > > to control data_off.
> > Another option is to set the headroom to 0.
> > Because after attaching the mbuf to an external buffer, we will
> > still need to set the length.
> > 
> > A user can do something like this:
> > 
> > 	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
> > 		free_cb, free_cb_arg);
> > 	rte_pktmbuf_append(m, data_len + headroom);
> > 	rte_pktmbuf_adj(m, headroom);
> > 
> > > > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > > > +	m->shinfo = shinfo;
> > > > +
> > > > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > Why is assignment used here? Cannot we attach extbuf already attached to
> > > other mbuf?
> > In rte_pktmbuf_attach(), this is true. That's not illogical to
> > keep the same approach here. Maybe an assert could be added?
> > 
> > > May be shinfo should be initialized only if it is not provided (shinfo ==
> > > NULL on input)?
> > I don't get why, can you explain please?
> 
> Maybe I misunderstand how it should look when one huge buffer
> is partitioned. I thought there should be only one shinfo per huge buffer
> to control when it is no longer used by any mbufs with extbuf.

OK I got it.

I think both approach could make sense:
- one shinfo per huge buffer
- or one shinfo per mbuf, and use the callback to manage another refcnt
  (like what Yongseok described)

So I agree with your proposal, shinfo should be initialized by
the caller if it is != NULL, else it can be initialized by
rte_pktmbuf_attach_extbuf().
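
For reference, the shinfo == NULL branch quoted above carves the shared info
from the tail of the user's buffer. A self-contained model of that arithmetic
(with `shinfo_model` standing in for struct rte_mbuf_ext_shared_info) looks
like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in for struct rte_mbuf_ext_shared_info. */
struct shinfo_model {
	void *free_cb;
	void *fcb_opaque;
	uint16_t refcnt;
};

/* Model of the shinfo == NULL branch in rte_pktmbuf_attach_extbuf():
 * place the shared info at the tail of the user buffer, aligned down
 * to a pointer boundary, and shrink buf_len to exclude it. Returns
 * NULL when the buffer is too small to embed the shared info. */
static struct shinfo_model *
place_shinfo(void *buf_addr, size_t buf_len, size_t *new_buf_len)
{
	uintptr_t buf_end = (uintptr_t)buf_addr + buf_len;
	uintptr_t si = (buf_end - sizeof(struct shinfo_model)) &
		       ~((uintptr_t)sizeof(uintptr_t) - 1);

	if (si <= (uintptr_t)buf_addr)
		return NULL;
	*new_buf_len = si - (uintptr_t)buf_addr;
	return (struct shinfo_model *)si;
}
```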


> The other option is to have a shinfo per small buf plus a reference counter
> per huge buf (which is decremented when the small buf reference counter
> becomes zero and the free callback is executed). I guess that is assumed above.
> My fear is that there are too many reference counters:
>  1. mbuf reference counter
>  2. small buf reference counter
>  3. huge buf reference counter
> Maybe it is possible to use (1) for (2) as well?

I would prefer to have only 2 reference counters, one in the mbuf
and one in the shinfo.


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 19:15         ` Olivier Matz
@ 2018-04-24 20:22           ` Thomas Monjalon
  2018-04-24 21:53             ` Yongseok Koh
  2018-04-25 15:06             ` Stephen Hemminger
  2018-04-24 23:34           ` Yongseok Koh
  1 sibling, 2 replies; 86+ messages in thread
From: Thomas Monjalon @ 2018-04-24 20:22 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko, Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

24/04/2018 21:15, Olivier Matz:
> On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > + * otherwise.
> > > > > + *
> > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > + */
> > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > +
> > > > > +/**
> > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > >     */
> > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > It is still confusing that INDIRECT != !DIRECT.
> > > > May be we have no good options right now, but I'd suggest to at least
> > > > deprecate
> > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > Agree. I may have missed something, but is my previous suggestion
> > > not doable?
> > > 
> > > - direct = embeds its own data      (and indirect = !direct)
> > > - clone (or another name) = data is another mbuf
> > > - extbuf = data is in an external buffer
> > 
> > I guess the problem is that it changes the INDIRECT semantics since EXTBUF
> > is added as well. I think strictly speaking it is an API change.
> > Is it OK to make it without an announcement?
> 
> In any case, there will be an ABI change, because an application
> compiled for 18.02 will not be able to handle these new kind of
> mbuf.
> 
> So unfortunately yes, I think this kind of change should first be
> announced.
> 
> Thomas, what do you think?

What is the impact for the application developer?
Is there something to change in the application after this patch?


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 20:22           ` Thomas Monjalon
@ 2018-04-24 21:53             ` Yongseok Koh
  2018-04-24 22:15               ` Thomas Monjalon
  2018-04-25  8:21               ` Olivier Matz
  2018-04-25 15:06             ` Stephen Hemminger
  1 sibling, 2 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24 21:53 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Olivier Matz, Andrew Rybchenko, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro

On Tue, Apr 24, 2018 at 10:22:45PM +0200, Thomas Monjalon wrote:
> 24/04/2018 21:15, Olivier Matz:
> > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > > + * otherwise.
> > > > > > + *
> > > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > > + */
> > > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > +
> > > > > > +/**
> > > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > > >     */
> > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > It is still confusing that INDIRECT != !DIRECT.
> > > > > May be we have no good options right now, but I'd suggest to at least
> > > > > deprecate
> > > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > > Agree. I may have missed something, but is my previous suggestion
> > > > not doable?
> > > > 
> > > > - direct = embeds its own data      (and indirect = !direct)
> > > > - clone (or another name) = data is another mbuf
> > > > - extbuf = data is in an external buffer
> > > 
> > > I guess the problem is that it changes the INDIRECT semantics since EXTBUF
> > > is added as well. I think strictly speaking it is an API change.
> > > Is it OK to make it without an announcement?
> > 
> > In any case, there will be an ABI change, because an application
> > compiled for 18.02 will not be able to handle these new kind of
> > mbuf.
> > 
> > So unfortunately yes, I think this kind of change should first be
> > announced.
> > 
> > Thomas, what do you think?
> 
> What is the impact for the application developer?
> Is there something to change in the application after this patch?

Let me address two concerns discussed here.

1) API breakage of RTE_MBUF_DIRECT()
Previously, direct == !indirect, but now direct == !indirect && !extbuf. To
set the new flag (EXT_ATTACHED_MBUF), the new API, rte_pktmbuf_attach_extbuf(),
has to be used, and it is experimental. If the application isn't compiled with
experimental APIs allowed, or doesn't use the new API, it is always
true that direct == !indirect. It looks logically okay to me. And FYI, it passed
the mbuf_autotest.
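
The resulting flag semantics can be sketched with the bit values from the
patch; these helpers operate on a bare ol_flags word for simplicity, whereas
the real macros take a struct rte_mbuf pointer:

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits as defined in the patch. */
#define EXT_ATTACHED_MBUF (1ULL << 61) /* mbuf data is in an external buffer */
#define IND_ATTACHED_MBUF (1ULL << 62) /* mbuf data is in another mbuf */

/* Predicates on a bare ol_flags word. After the patch, a mbuf with
 * an external buffer is neither direct nor indirect. */
static inline int flags_cloned(uint64_t f)
{
	return (f & IND_ATTACHED_MBUF) != 0;
}

static inline int flags_extbuf(uint64_t f)
{
	return (f & EXT_ATTACHED_MBUF) != 0;
}

static inline int flags_direct(uint64_t f)
{
	return !(f & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF));
}
```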

2) ABI breakage of mlx5's new Multi-Packet RQ (a.k.a. MPRQ) feature
It's right that it could break the ABI if the PMD delivers packets with an
external buffer attached. But the MPRQ feature is disabled by default and can
be enabled only by the newly introduced PMD parameter (mprq_en). So, there's no
possibility that an 18.02-based application receives a mbuf having an external
buffer. And, like Olivier mentioned, there's another ABI breakage from removing
the control mbuf anyway.

So, I don't think there's any need for developers to change their applications
after this patch unless they want to use the new feature.


Thanks,
Yongseok


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 21:53             ` Yongseok Koh
@ 2018-04-24 22:15               ` Thomas Monjalon
  2018-04-25  8:21               ` Olivier Matz
  1 sibling, 0 replies; 86+ messages in thread
From: Thomas Monjalon @ 2018-04-24 22:15 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko
  Cc: Yongseok Koh, wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

24/04/2018 23:53, Yongseok Koh:
> On Tue, Apr 24, 2018 at 10:22:45PM +0200, Thomas Monjalon wrote:
> > 24/04/2018 21:15, Olivier Matz:
> > > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > > > + * otherwise.
> > > > > > > + *
> > > > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > > > + */
> > > > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +
> > > > > > > +/**
> > > > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > > > >     */
> > > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > > It is still confusing that INDIRECT != !DIRECT.
> > > > > > May be we have no good options right now, but I'd suggest to at least
> > > > > > deprecate
> > > > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > > > Agree. I may have missed something, but is my previous suggestion
> > > > > not doable?
> > > > > 
> > > > > - direct = embeds its own data      (and indirect = !direct)
> > > > > - clone (or another name) = data is another mbuf
> > > > > - extbuf = data is in an external buffer
> > > > 
> > > > I guess the problem is that it changes the INDIRECT semantics since EXTBUF
> > > > is added as well. I think strictly speaking it is an API change.
> > > > Is it OK to make it without an announcement?
> > > 
> > > In any case, there will be an ABI change, because an application
> > > compiled for 18.02 will not be able to handle these new kind of
> > > mbuf.
> > > 
> > > So unfortunately yes, I think this kind of change should first be
> > > announced.
> > > 
> > > Thomas, what do you think?
> > 
> > What is the impact for the application developer?
> > Is there something to change in the application after this patch?
> 
> Let me address two concerns discussed here.
> 
> 1) API breakage of RTE_MBUF_DIRECT()
> Previously, direct == !indirect, but now direct == !indirect && !extbuf. To
> set the new flag (EXT_ATTACHED_MBUF), the new API, rte_pktmbuf_attach_extbuf(),
> has to be used, and it is experimental. If the application isn't compiled with
> experimental APIs allowed, or doesn't use the new API, it is always
> true that direct == !indirect. It looks logically okay to me. And FYI, it passed
> the mbuf_autotest.
> 
> 2) ABI breakage of mlx5's new Multi-Packet RQ (a.k.a. MPRQ) feature
> It's right that it could break the ABI if the PMD delivers packets with an
> external buffer attached. But the MPRQ feature is disabled by default and can
> be enabled only by the newly introduced PMD parameter (mprq_en). So, there's no
> possibility that an 18.02-based application receives a mbuf having an external
> buffer. And, like Olivier mentioned, there's another ABI breakage from removing
> the control mbuf anyway.
> 
> So, I don't think there's any need for developers to change their applications
> after this patch unless they want to use the new feature.

To summarize, this is a feature addition, and there is no breakage.
So I don't see what should be announced.

I think it could be integrated as experimental with a first
PMD implementation in 18.05. It will allow to test the feature
in the field, and have more feedbacks about how to improve the API.


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 12:28   ` Andrew Rybchenko
  2018-04-24 16:02     ` Olivier Matz
@ 2018-04-24 22:30     ` Yongseok Koh
  1 sibling, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24 22:30 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: wenzhuo.lu, jingjing.wu, olivier.matz, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> On 04/24/2018 04:38 AM, Yongseok Koh wrote:
[...]
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 06eceba37..7f6507a66 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -326,7 +326,7 @@ extern "C" {
> >   		PKT_TX_MACSEC |		 \
> >   		PKT_TX_SEC_OFFLOAD)
> > -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> > +#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */
> 
> May be it should mention that shinfo is filled in.

Okay.

> >   #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
> > @@ -566,8 +566,24 @@ struct rte_mbuf {
> >   	/** Sequence number. See also rte_reorder_insert(). */
> >   	uint32_t seqn;
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> 
> I think it would be useful to add comment that it is used in the case of
> RTE_MBUF_HAS_EXTBUF() only.

Oops, I missed that. Thanks.

[...]
> > +static inline char * __rte_experimental
> > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > +	rte_iova_t buf_iova, uint16_t buf_len,
> > +	struct rte_mbuf_ext_shared_info *shinfo,
> > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > +{
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> 
> May I suggest to move it inside if (shinfo == NULL) to make it clear that it
> is not used if shinfo pointer is provided.

Done.

[...]
> >   static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
> >   {
> > -	struct rte_mbuf *md;
> > -
> >   	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
> >   	    rte_mbuf_refcnt_read(mi) == 1);
> > -	/* if m is not direct, get the mbuf that embeds the data */
> > -	if (RTE_MBUF_DIRECT(m))
> > -		md = m;
> > -	else
> > -		md = rte_mbuf_from_indirect(m);
> > +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> > +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> > +		mi->ol_flags = m->ol_flags;
> > +	} else {
> > +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
> 
> It looks like handling of the direct mbuf is lost here. Maybe it is
> intentional to avoid branching, since the result will be the same for a
> direct mbuf as well, but it looks confusing. It deserves at least a comment
> which explains why. Ideally it should be proven by measurements.

Right, that was intentional, to avoid the branch. Sometimes a branch is more
expensive than arithmetic ops in the core's pipeline. Will add a comment.

[...]
> > +static inline void
> > +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> > +{
> > +	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
> 
> Shouldn't it be done after below assertion? Just to be less confusing.

Right. Done.

> > +
> > +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> > +
> > +	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> 
> It is not directly related to the changeset, but rte_pktmbuf_prefree_seg()
> has many optimizations which could be useful here:
>  - do not update if refcnt is 1
>  - do not set next/nb_segs if next is already NULL

Would be better to have a separate patch later.

Thanks,
Yongseok

> > +		md->next = NULL;
> > +		md->nb_segs = 1;
> > +		rte_mbuf_refcnt_set(md, 1);
> > +		rte_mbuf_raw_free(md);
> > +	}
> > +}
> 
> [...]


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 19:15         ` Olivier Matz
  2018-04-24 20:22           ` Thomas Monjalon
@ 2018-04-24 23:34           ` Yongseok Koh
  2018-04-25 14:45             ` Andrew Rybchenko
  1 sibling, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-24 23:34 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Andrew Rybchenko, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro,
	Thomas Monjalon

On Tue, Apr 24, 2018 at 09:15:38PM +0200, Olivier Matz wrote:
> On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > Hi Andrew, Yongseok,
> > > 
> > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
[...]
> > > > > +	m->buf_addr = buf_addr;
> > > > > +	m->buf_iova = buf_iova;
> > > > > +
> > > > > +	if (shinfo == NULL) {
> > > > > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > > > > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > > > > +		if ((void *)shinfo <= buf_addr)
> > > > > +			return NULL;
> > > > > +
> > > > > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > > > > +	} else {
> > > > > +		m->buf_len = buf_len;
> > > > > +	}
> > > > > +
> > > > > +	m->data_len = 0;
> > > > > +
> > > > > +	rte_pktmbuf_reset_headroom(m);
> > > > I would suggest to make data_off one more parameter.
> > > > If I have a buffer with data which I'd like to attach to an mbuf, I'd like
> > > > to control data_off.
> > > Another option is to set the headroom to 0.
> > > Because after attaching the mbuf to an external buffer, we will
> > > still need to set the length.
> > > 
> > > A user can do something like this:
> > > 
> > > 	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
> > > 		free_cb, free_cb_arg);
> > > 	rte_pktmbuf_append(m, data_len + headroom);
> > > 	rte_pktmbuf_adj(m, headroom);

I'd take this option. Will make the change and document it.

> > > > > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > > > > +	m->shinfo = shinfo;
> > > > > +
> > > > > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > Why is assignment used here? Cannot we attach extbuf already attached to
> > > > other mbuf?
> > > In rte_pktmbuf_attach(), this is true. That's not illogical to
> > > keep the same approach here. Maybe an assert could be added?

Like I described in the doc, the intention is to attach an external buffer with
_attach_extbuf() the first time; _attach() is just for attaching additional
mbufs. Will add an assert.

> > > > May be shinfo should be initialized only if it is not provided (shinfo ==
> > > > NULL on input)?
> > > I don't get why, can you explain please?
> > 
> > Maybe I misunderstand how it should look when one huge buffer
> > is partitioned. I thought there should be only one shinfo per huge buffer
> > to control when it is no longer used by any mbufs with extbuf.
> 
> OK I got it.
> 
> I think both approach could make sense:
> - one shinfo per huge buffer
> - or one shinfo per mbuf, and use the callback to manage another refcnt
>   (like what Yongseok described)
> 
> So I agree with your proposal, shinfo should be initialized by
> the caller if it is != NULL, else it can be initialized by
> rte_pktmbuf_attach_extbuf().

Also agreed. Will change.

> > Another option is to have a shinfo per small buf plus a reference counter
> > per huge buf (which is decremented when the small buf reference counter
> > becomes zero and the free callback is executed). I guess that is assumed above.
> > My fear is that it is too many reference counters:
> >  1. mbuf reference counter
> >  2. small buf reference counter
> >  3. huge buf reference counter
> > Maybe it is possible to use (1) for (2) as well?
> 
> I would prefer to have only 2 reference counters, one in the mbuf
> and one in the shinfo.

Good discussion. It should be a design decision by the user.

In my use-case, it would be a good idea to make all the mbufs in the same chunk
point to the same shared info at the head of the chunk and set the refcnt of
the shinfo to the total number of slices in the chunk.

+-------+----+--------------+----+--------------+---+- - -
|global |head|mbuf1 data    |head|mbuf2 data    |   |
| shinfo|room|              |room|              |   |
+-------+----+--------------+----+--------------+---+- - -
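The layout above can be modeled without any DPDK types. A minimal
self-contained sketch of the one-shinfo-per-chunk scheme (plain C,
non-atomic for brevity; all names here are illustrative, not part of
the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the shared info at the head of the chunk (illustrative; the
 * real rte_mbuf_ext_shared_info uses an atomic refcnt). */
struct chunk_shinfo {
	unsigned int refcnt; /* number of slices still attached */
	void (*free_cb)(void *chunk, void *opaque);
	void *fcb_opaque;
};

/* Carve a chunk: shinfo sits at the head, refcnt preset to the number
 * of slices so no per-attach update is needed. */
static struct chunk_shinfo *
chunk_init(void *chunk, unsigned int nb_slices,
	   void (*free_cb)(void *, void *), void *opaque)
{
	struct chunk_shinfo *shinfo = chunk;

	shinfo->refcnt = nb_slices;
	shinfo->free_cb = free_cb;
	shinfo->fcb_opaque = opaque;
	return shinfo;
}

/* Detach one slice; the free callback fires only when the last slice
 * of the chunk is detached. */
static void
chunk_detach_slice(void *chunk, struct chunk_shinfo *shinfo)
{
	if (--shinfo->refcnt == 0)
		shinfo->free_cb(chunk, shinfo->fcb_opaque);
}
```

In the real PMD the free callback would return the whole chunk to its
mempool once every mbuf pointing into it has been freed.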


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (8 preceding siblings ...)
  2018-04-24  1:38 ` [PATCH v4 " Yongseok Koh
@ 2018-04-25  2:53 ` Yongseok Koh
  2018-04-25  2:53   ` [PATCH v5 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
  2018-04-25 13:31   ` [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
  2018-04-26  1:10 ` [PATCH v6 " Yongseok Koh
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25  2:53 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but with a few differences:
  - When an indirect mbuf is attached, refcnt of the direct mbuf would be
    2 as long as the direct mbuf itself isn't freed after the attachment.
    In such cases, the buffer area of a direct mbuf must be read-only. But
    external buffer has its own refcnt and it starts from 1. Unless
    multiple mbufs are attached to a mbuf having an external buffer, the
    external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---

** This patch can pass the mbuf_autotest. **

Submitting only non-mlx5 patches to meet the deadline for RC1. mlx5 patches
will be submitted separately, rebased on a different patchset which
accommodates the new memory hotplug design for mlx PMDs.

v5:
* rte_pktmbuf_attach_extbuf() sets headroom to 0.
* if shinfo is provided when attaching, user should initialize it.
* minor changes from review.

v4:
* rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
  user can pass memory for shared data via shinfo argument.
* minor changes from review.

v3:
* implement external buffer attachment instead of introducing buf_off for
  mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 303 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 274 insertions(+), 29 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 43aaa9c5f..e2c12874a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -344,7 +344,10 @@ extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+/**
+ * Mbuf having an external buffer attached. shinfo in mbuf must be filled.
+ */
+#define EXT_ATTACHED_MBUF    (1ULL << 61)
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -584,8 +587,27 @@ struct rte_mbuf {
 	/** Sequence number. See also rte_reorder_insert(). */
 	uint32_t seqn;
 
+	/** Shared data for external buffer attached to mbuf. See
+	 * rte_pktmbuf_attach_extbuf().
+	 */
+	struct rte_mbuf_ext_shared_info *shinfo;
+
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -706,14 +728,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
 }
 
 /**
+ * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
+ * otherwise.
+ *
+ * If a mbuf has its data in another mbuf and references it by mbuf
+ * indirection, this mbuf can be defined as a cloned mbuf.
+ */
+#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
  */
-#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
+#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
+
+/**
+ * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
+ *
+ * External buffer is a user-provided anonymous buffer.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
 
 /**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
+ *
+ * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
+ * can be defined as a direct mbuf.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -839,6 +880,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
+		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Attach an external buffer to a mbuf.
+ *
+ * User-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, corresponding free callback function and its argument should be
+ * provided. This callback function will be called once all the mbufs are
+ * detached from the buffer.
+ *
+ * The headroom for the attaching mbuf will be set to zero and this can be
+ * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
+ * or ``rte_pktmbuf_reset_headroom()`` can be used.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing buffer addresses and length of a mbuf, but a few differences:
+ * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
+ *   2 as long as the direct mbuf itself isn't freed after the attachment.
+ *   In such cases, the buffer area of a direct mbuf must be read-only. But
+ *   external buffer has its own refcnt and it starts from 1. Unless
+ *   multiple mbufs are attached to a mbuf having an external buffer, the
+ *   external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback and its IO address.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect nor
+ * having external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer we're attaching to.
+ * @param buf_iova
+ *   IO address of the external buffer we're attaching to.
+ * @param buf_len
+ *   The size of the external buffer we're attaching to. If memory for
+ *   shared data is not provided, buf_len must be larger than the size of
+ *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
+ *   enough, this function will return NULL.
+ * @param shinfo
+ *   User-provided memory for shared data. If NULL, a few bytes in the
+ *   trailer of the provided buffer will be dedicated for shared data and
+ *   the shared data will be properly initialized. Otherwise, user must
+ *   initialize the content except for free callback and its argument. The
+ *   pointer of shared data will be stored in m->shinfo.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be
+ *   freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ *
+ * @return
+ *   A pointer to the new start of the data on success, return NULL
+ *   otherwise.
+ */
+static inline char * __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	rte_iova_t buf_iova, uint16_t buf_len,
+	struct rte_mbuf_ext_shared_info *shinfo,
+	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
+{
+	/* Additional attachment should be done by rte_pktmbuf_attach() */
+	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = buf_iova;
+
+	if (shinfo == NULL) {
+		void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
+
+		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
+					sizeof(*shinfo)), sizeof(uintptr_t));
+		if ((void *)shinfo <= buf_addr)
+			return NULL;
+
+		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+		rte_mbuf_ext_refcnt_set(shinfo, 1);
+	} else {
+		m->buf_len = buf_len;
+	}
+
+	m->data_len = 0;
+	m->data_off = 0;
+
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+	m->shinfo = shinfo;
+
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+
+	return (char *)m->buf_addr + m->data_off;
+}
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
+ * If the mbuf we are attaching to isn't a direct buffer and is attached to
+ * an external buffer, the mbuf being attached will be attached to the
+ * external buffer instead of mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer the mbuf we attached as 'indirect', while mbuf we attached to as
+ * 'direct'.  The direct mbuf's reference counter is incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1231,19 +1440,19 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
+		mi->ol_flags = m->ol_flags;
+	} else {
+		/* if m is not direct, get the mbuf that embeds the data */
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1259,7 +1468,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1268,12 +1476,52 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
  *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+	RTE_ASSERT(m->shinfo != NULL);
+
+	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
+		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	md = rte_mbuf_from_indirect(m);
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
+ *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1282,10 +1530,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1297,13 +1549,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
 
 /**
@@ -1327,7 +1572,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1339,7 +1584,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
-- 
2.11.0
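Two pieces of the patch above can be modeled standalone: the trailer
placement done by rte_pktmbuf_attach_extbuf() when shinfo == NULL, and
the single-owner fast path of rte_mbuf_ext_refcnt_update(). A plain-C
sketch (no DPDK atomics; struct shape and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Same shape as struct rte_mbuf_ext_shared_info in the patch above. */
struct shinfo_model {
	void (*free_cb)(void *addr, void *opaque);
	void *fcb_opaque;
	int16_t refcnt; /* stands in for rte_atomic16_t refcnt_atomic */
};

/* Model of the shinfo == NULL branch of rte_pktmbuf_attach_extbuf():
 * carve an aligned trailer for shared data and shrink buf_len to match. */
static struct shinfo_model *
place_shinfo(void *buf_addr, uint16_t buf_len, uint16_t *new_buf_len)
{
	uintptr_t end = (uintptr_t)buf_addr + buf_len;
	uintptr_t shinfo_addr = (end - sizeof(struct shinfo_model)) &
				~(uintptr_t)(sizeof(uintptr_t) - 1);

	if (shinfo_addr <= (uintptr_t)buf_addr)
		return NULL; /* buffer too small to host the shared data */
	*new_buf_len = (uint16_t)(shinfo_addr - (uintptr_t)buf_addr);
	return (struct shinfo_model *)shinfo_addr;
}

/* Model of rte_mbuf_ext_refcnt_update(): with refcnt == 1 there is a
 * single owner, so a plain store replaces the atomic add. */
static uint16_t
ext_refcnt_update(struct shinfo_model *s, int16_t value)
{
	if (s->refcnt == 1) {
		s->refcnt = (int16_t)(1 + value);
		return (uint16_t)(1 + value);
	}
	s->refcnt = (int16_t)(s->refcnt + value); /* atomic in the real code */
	return (uint16_t)s->refcnt;
}
```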


* [PATCH v5 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-25  2:53 ` [PATCH v5 " Yongseok Koh
@ 2018-04-25  2:53   ` Yongseok Koh
  2018-04-25 13:31   ` [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
  1 sibling, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25  2:53 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch is to accommodate an experimental feature of mbuf - external
buffer attachment. If mbuf is attached to an external buffer, its ol_flags
will have EXT_ATTACHED_MBUF set. Without enabling/using the feature,
everything remains the same.

If a PMD delivers Rx packets with non-direct mbufs, ol_flags should not be
overwritten. For the mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets could
be carried with externally attached mbufs.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 3 +++
 app/test-pmd/macfwd.c   | 3 +++
 app/test-pmd/macswap.c  | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5f5ab64aa..bb0b675a8 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -770,6 +770,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (!RTE_MBUF_DIRECT(m))
+			tx_ol_flags |= m->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..ba0021194 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..b8d15f6ba 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,9 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0
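All three hunks above apply the same guard; its effect can be modeled in
isolation. A minimal sketch (flag values taken from the mbuf patch; the
helper name is illustrative, not testpmd code):

```c
#include <assert.h>
#include <stdint.h>

#define EXT_ATTACHED_MBUF  (1ULL << 61)
#define IND_ATTACHED_MBUF  (1ULL << 62)
#define MBUF_DIRECT(flags) \
	(((flags) & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)) == 0)

/* Model of the testpmd fix: when the mbuf is not direct, carry its
 * attachment bits over into the freshly computed Tx ol_flags so the
 * assignment does not clobber them. */
static uint64_t
conserve_attach_flags(uint64_t old_flags, uint64_t tx_flags)
{
	if (!MBUF_DIRECT(old_flags))
		tx_flags |= old_flags &
			(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
	return tx_flags;
}
```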


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 21:53             ` Yongseok Koh
  2018-04-24 22:15               ` Thomas Monjalon
@ 2018-04-25  8:21               ` Olivier Matz
  1 sibling, 0 replies; 86+ messages in thread
From: Olivier Matz @ 2018-04-25  8:21 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Thomas Monjalon, Andrew Rybchenko, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, adrien.mazarguil, nelio.laranjeiro

On Tue, Apr 24, 2018 at 02:53:41PM -0700, Yongseok Koh wrote:
> On Tue, Apr 24, 2018 at 10:22:45PM +0200, Thomas Monjalon wrote:
> > 24/04/2018 21:15, Olivier Matz:
> > > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > > > + * otherwise.
> > > > > > > + *
> > > > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > > > + */
> > > > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +
> > > > > > > +/**
> > > > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > > > >     */
> > > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > > It is still confusing that INDIRECT != !DIRECT.
> > > > > > May be we have no good options right now, but I'd suggest to at least
> > > > > > deprecate
> > > > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > > > Agree. I may have missed something, but is my previous suggestion
> > > > > not doable?
> > > > > 
> > > > > - direct = embeds its own data      (and indirect = !direct)
> > > > > - clone (or another name) = data is another mbuf
> > > > > - extbuf = data is in an external buffer
> > > > 
> > > > I guess the problem is that it changes INDIRECT semantics since EXTBUF
> > > > is added as well. I think strictly speaking it is an API change.
> > > > Is it OK to make it without announcement?
> > > 
> > > In any case, there will be an ABI change, because an application
> > > compiled for 18.02 will not be able to handle these new kind of
> > > mbuf.
> > > 
> > > So unfortunatly yes, I think this kind of changes should first be
> > > announced.
> > > 
> > > Thomas, what do you think?
> > 
> > What is the impact for the application developer?
> > Is there something to change in the application after this patch?
> 
> Let me address two concerns discussed here.
> 
> 1) API breakage of RTE_MBUF_DIRECT()
> Previously, direct == !indirect but now direct == !indirect && !extbuf. But to
> set the new flag (EXT_ATTACHED_MBUF), the new API, rte_pktmbuf_attach_extbuf()
> should be used and it is experimental. If the application is compiled without
> allowing experimental API or doesn't use the new API, it is always
> true that direct == !indirect. It looks logically okay to me. And FYI, it passed
> the mbuf_autotest.
> 
> 2) ABI breakage of mlx5's new Multi-Packet RQ (a.k.a MPRQ) feature
> It's right that it could breadk ABI if the PMD delivers packets with external
> buffer attached. But, the MPRQ feature is disabled by default and it can be
> enabled only by the newly introduced PMD parameter (mprq_en). So, there's no
> possibility that 18.02-based application receives a mbuf having an external
> buffer. And, like Olivier mentioned, there's another ABI breakage by removing
> control mbuf anyway.

Strictly speaking, it is possible that a user passes this parameter through
the application (which just forwards it) to the new DPDK. So there is an ABI
change. In short, if a user wants to enable an optimization of 18.05 on an
application compiled for 18.02, it will fail.

But I agree the impact is very limited.

We are a bit lucky, because:
- the mbuf size is aligned to 128, so it stays the same, and the priv area
  is after the 2nd cache line (note: we are at 112 bytes over 128 on x86_64).
- previously, the area where shinfo is added was filled with garbage. It has
  no impact because it is only accessed when the EXT flag is set.
- the unused flags are 0 by default.

Knowing there is an ABI breakage this release, it could also make sense to
try to limit it and avoid breaking the ABI again in 18.08.

So in my opinion, regarding API/ABI, this patchset could go in for 18.05.

Olivier


> 
> So, I don't think there's need for developers to change their application after
> this patch unless they want to use the new feature.
> 
> 
> Thanks,
> Yongseok


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 16:02     ` Olivier Matz
  2018-04-24 18:21       ` " Andrew Rybchenko
@ 2018-04-25  8:28       ` Olivier Matz
  2018-04-25  9:08         ` Yongseok Koh
  1 sibling, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-25  8:28 UTC (permalink / raw)
  To: Andrew Rybchenko, Yongseok Koh
  Cc: wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro

Hi Yongseok,

On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
> > > @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > >   }
> > >   /**
> > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > + * otherwise.
> > > + *
> > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > + */
> > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > +
> > > +/**
> > >    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > >    */
> > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > 
> > It is still confusing that INDIRECT != !DIRECT.
> > Maybe we have no good options right now, but I'd suggest to at least
> > deprecate
> > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> 
> Agree. I may have missed something, but is my previous suggestion
> not doable?
> 
> - direct = embeds its own data      (and indirect = !direct)
> - clone (or another name) = data is another mbuf
> - extbuf = data is in an external buffer

Any comment about this option?


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25  8:28       ` Olivier Matz
@ 2018-04-25  9:08         ` Yongseok Koh
  2018-04-25  9:19           ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25  9:08 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Andrew Rybchenko, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, Adrien Mazarguil, Nélio Laranjeiro



> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> 
> Hi Yongseok,
> 
> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>>>>  }
>>>>  /**
>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
>>>> + * otherwise.
>>>> + *
>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
>>>> + */
>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>> +
>>>> +/**
>>>>   * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>>>>   */
>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
>>> 
>>> It is still confusing that INDIRECT != !DIRECT.
>>> Maybe we have no good options right now, but I'd suggest to at least
>>> deprecate
>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
>> 
>> Agree. I may have missed something, but is my previous suggestion
>> not doable?
>> 
>> - direct = embeds its own data      (and indirect = !direct)
>> - clone (or another name) = data is another mbuf
>> - extbuf = data is in an external buffer
> 
> Any comment about this option?

I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
(!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
I'm not sure I understand you correctly.

Can you please give me more guidelines so that I can take your idea?

Thanks,
Yongseok


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25  9:08         ` Yongseok Koh
@ 2018-04-25  9:19           ` Yongseok Koh
  2018-04-25 20:00             ` Olivier Matz
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25  9:19 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Andrew Rybchenko, Wenzhuo Lu, Jingjing Wu, dev,
	Konstantin Ananyev, Adrien Mazarguil, Nélio Laranjeiro

> On Apr 25, 2018, at 2:08 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
>> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
>> 
>> Hi Yongseok,
>> 
>> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
>>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>>>>> }
>>>>> /**
>>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
>>>>> + * otherwise.
>>>>> + *
>>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
>>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
>>>>> + */
>>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>>> +
>>>>> +/**
>>>>>  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>>>>>  */
>>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
>>>> 
>>>> It is still confusing that INDIRECT != !DIRECT.
>>>> Maybe we have no good options right now, but I'd suggest to at least
>>>> deprecate
>>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
>>> 
>>> Agree. I may have missed something, but is my previous suggestion
>>> not doable?
>>> 
>>> - direct = embeds its own data      (and indirect = !direct)
>>> - clone (or another name) = data is another mbuf
>>> - extbuf = data is in an external buffer
>> 
>> Any comment about this option?
> 
> I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
> RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
> (!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
> I'm not sure I understand you correctly.
> 
> Can you please give me more guidelines so that I can take your idea?

Maybe you meant the following? It looks doable, but RTE_MBUF_DIRECT() can't
logically mean 'mbuf embeds its own data', right?

#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))

#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)

[...]

@@ -1327,7 +1572,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
        if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-               if (RTE_MBUF_INDIRECT(m))
+               if (RTE_MBUF_INDIRECT(m) || RTE_MBUF_HAS_EXTBUF(m))
                        rte_pktmbuf_detach(m);
 
                if (m->next != NULL) {
@@ -1339,7 +1584,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
        } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-               if (RTE_MBUF_INDIRECT(m))
+               if (RTE_MBUF_INDIRECT(m) || RTE_MBUF_HAS_EXTBUF(m))
                        rte_pktmbuf_detach(m);
 
                if (m->next != NULL) {


Thanks,
Yongseok


* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24  2:04     ` Yongseok Koh
@ 2018-04-25 13:16       ` Ananyev, Konstantin
  2018-04-25 16:44         ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-25 13:16 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, adrien.mazarguil,
	nelio.laranjeiro



> 
> On Mon, Apr 23, 2018 at 11:53:04AM +0000, Ananyev, Konstantin wrote:
> [...]
> > > @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > >  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > >
> > >  /**
> > > + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> > > + */
> > > +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > > +
> > > +/**
> > >   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> > >   */
> > > -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > > +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
> >
> > As a nit:
> > RTE_MBUF_DIRECT(mb)  (((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)) == 0)
> 
> It was for better readability and I expected the compiler to generate the
> same code. But, if you still want it this way, I can change it.

I know compilers are quite smart these days, but you never know for sure,
so yes, I think it's better to do that explicitly.

> 
> [...]
> > >  /**
> > > - * Detach an indirect packet mbuf.
> > > + * @internal used by rte_pktmbuf_detach().
> > > + *
> > > + * Decrement the reference counter of the external buffer. When the
> > > + * reference counter becomes 0, the buffer is freed by pre-registered
> > > + * callback.
> > > + */
> > > +static inline void
> > > +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> > > +{
> > > +	struct rte_mbuf_ext_shared_info *shinfo;
> > > +
> > > +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> > > +
> > > +	shinfo = rte_mbuf_ext_shinfo(m);
> > > +
> > > +	if (rte_extbuf_refcnt_update(shinfo, -1) == 0)
> > > +		shinfo->free_cb(m->buf_addr, shinfo->fcb_opaque);
> >
> >
> > I understand the reason but extra function call for each external mbuf - seems quite expensive.
> > Wonder is it possible to group them somehow and amortize the cost?
> 
> Good point. I thought about it today.
> 
> Comparing to the regular mbuf, maybe three differences. a) free function isn't
> inlined but a real branch. b) no help from core local cache like mempool's c) no
> free_bulk func like rte_mempool_put_bulk(). But these look quite costly and
> complicated for the external buffer attachment.
> 
> For example, to free it in bulk, external buffers should be grouped as the
> buffers would have different callback functions. To do that, I have to make an
> API to pre-register an external buffer group to prepare resources for the bulk
> free. Then, buffers can't be anonymous anymore but have to be registered in
> advance. If so, it would be better to use existing APIs, especially when a user
> wants high throughput...
> 
> Let me know if you have better idea to implement it. Then, I'll gladly take
> that. Or, we can push any improvement patch in the next releases.

I don't have any extra-smart thoughts here.
One option I thought about was to introduce a group of external buffers with
a common free routine (I think I mentioned it already).
Second - hide all that external buffer management inside the mempool,
i.e. if the user wants to use external buffers he creates a mempool
(with rte_mbuf_ext_shared_info as elements?), then attaches an external
buffer to shinfo and calls mbuf_attach_external(mbuf, shinfo).
Though for free we can just call mempool_put(shinfo) and let the particular
implementation decide when/how to call free_cb(), etc.

Konstantin


* Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25  2:53 ` [PATCH v5 " Yongseok Koh
  2018-04-25  2:53   ` [PATCH v5 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-25 13:31   ` Ananyev, Konstantin
  2018-04-25 17:06     ` Yongseok Koh
  1 sibling, 1 reply; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-25 13:31 UTC (permalink / raw)
  To: Yongseok Koh, Lu, Wenzhuo, Wu, Jingjing, olivier.matz
  Cc: dev, arybchenko, stephen, thomas, adrien.mazarguil, nelio.laranjeiro


> 
> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>     2 as long as the direct mbuf itself isn't freed after the attachment.
>     In such cases, the buffer area of a direct mbuf must be read-only. But
>     external buffer has its own refcnt and it starts from 1. Unless
>     multiple mbufs are attached to a mbuf having an external buffer, the
>     external buffer is writable.
>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
> 
> ** This patch can pass the mbuf_autotest. **
> 
> Submitting only non-mlx5 patches to meet the deadline for RC1. mlx5 patches
> will be submitted separately, rebased on a different patchset which
> accommodates the new memory hotplug design for mlx PMDs.
> 
> v5:
> * rte_pktmbuf_attach_extbuf() sets headroom to 0.
> * if shinfo is provided when attaching, user should initialize it.
> * minor changes from review.
> 
> v4:
> * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
>   user can pass memory for shared data via shinfo argument.
> * minor changes from review.
> 
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>   mbuf indirection.
> 
>  lib/librte_mbuf/rte_mbuf.h | 303 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 274 insertions(+), 29 deletions(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 43aaa9c5f..e2c12874a 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -344,7 +344,10 @@ extern "C" {
>  		PKT_TX_MACSEC |		 \
>  		PKT_TX_SEC_OFFLOAD)
> 
> -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> +/**
> + * Mbuf having an external buffer attached. shinfo in mbuf must be filled.
> + */
> +#define EXT_ATTACHED_MBUF    (1ULL << 61)
> 
>  #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
> 
> @@ -584,8 +587,27 @@ struct rte_mbuf {
>  	/** Sequence number. See also rte_reorder_insert(). */
>  	uint32_t seqn;
> 
> +	/** Shared data for external buffer attached to mbuf. See
> +	 * rte_pktmbuf_attach_extbuf().
> +	 */
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +
>  } __rte_cache_aligned;
> 
> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
> +};
> +
>  /**< Maximum number of nb_segs allowed. */
>  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
> 
> @@ -706,14 +728,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>  }
> 
>  /**
> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> + * otherwise.
> + *
> + * If a mbuf has its data in another mbuf and references it by mbuf
> + * indirection, this mbuf can be defined as a cloned mbuf.
> + */
> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +
> +/**
>   * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>   */
> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> +
> +/**
> + * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
> + *
> + * External buffer is a user-provided anonymous buffer.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> 
>  /**
>   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> + *
> + * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
> + * can be defined as a direct mbuf.
>   */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))
> 
>  /**
>   * Private data in case of pktmbuf pool.
> @@ -839,6 +880,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
> 
>  #endif /* RTE_MBUF_REFCNT_ATOMIC */
> 
> +/**
> + * Reads the refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @return
> + *   Reference count number.
> + */
> +static inline uint16_t
> +rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
> +{
> +	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
> +}
> +
> +/**
> + * Set refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @param new_value
> + *   Value set
> + */
> +static inline void
> +rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
> +	uint16_t new_value)
> +{
> +	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
> +}
> +
> +/**
> + * Add given value to refcnt of an external buffer and return its new
> + * value.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @param value
> + *   Value to add/subtract
> + * @return
> + *   Updated value
> + */
> +static inline uint16_t
> +rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
> +	int16_t value)
> +{
> +	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
> +		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
> +		return 1 + value;
> +	}
> +
> +	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
> +}
> +
>  /** Mbuf prefetch */
>  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
>  	if ((m) != NULL)                        \
> @@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>  }
> 
>  /**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * The headroom for the attaching mbuf will be set to zero and this can be
> + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
> + * or ``rte_pktmbuf_reset_headroom()`` can be used.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> + *   external buffer has its own refcnt and it starts from 1. Unless
> + *   multiple mbufs are attached to a mbuf having an external buffer, the
> + *   external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback and its IO address.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_iova
> + *   IO address of the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. If memory for
> + *   shared data is not provided, buf_len must be larger than the size of
> + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> + *   enough, this function will return NULL.
> + * @param shinfo
> + *   User-provided memory for shared data. If NULL, a few bytes in the
> + *   trailer of the provided buffer will be dedicated for shared data and
> + *   the shared data will be properly initialized. Otherwise, user must
> + *   initialize the content except for free callback and its argument. The
> + *   pointer of shared data will be stored in m->shinfo.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be
> + *   freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + *
> + * @return
> + *   A pointer to the new start of the data on success, return NULL
> + *   otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	rte_iova_t buf_iova, uint16_t buf_len,
> +	struct rte_mbuf_ext_shared_info *shinfo,
> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> +{
> +	/* Additional attachment should be done by rte_pktmbuf_attach() */
> +	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));

Shouldn't we have here something like:
RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
?

> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = buf_iova;
> +
> +	if (shinfo == NULL) {

Instead of allocating shinfo ourselves - wouldn't it be better to rely on the
caller always allocating and filling it for us? (He can do that at the
end/start of the buffer, or wherever he likes.)
Again, in that case the caller can provide one shinfo to several mbufs (with
different buf_addrs) and would know for sure that free_cb wouldn't be
overwritten by mistake.
I.e. the mbuf code will only update the refcnt inside shinfo.

> +		void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> +
> +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> +					sizeof(*shinfo)), sizeof(uintptr_t));
> +		if ((void *)shinfo <= buf_addr)
> +			return NULL;
> +
> +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> +		rte_mbuf_ext_refcnt_set(shinfo, 1);
> +	} else {
> +		m->buf_len = buf_len;

I think you need to update shinfo->refcnt here too.

> +	}
> +
> +	m->data_len = 0;
> +	m->data_off = 0;
> +
> +	m->ol_flags |= EXT_ATTACHED_MBUF;
> +	m->shinfo = shinfo;
> +
> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;
> +
> +	return (char *)m->buf_addr + m->data_off;
> +}
> +
> +/**
> + * Detach the external buffer attached to a mbuf, same as
> + * ``rte_pktmbuf_detach()``
> + *
> + * @param m
> + *   The mbuf having external buffer.
> + */
> +#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
> +
> +/**
>   * Attach packet mbuf to another packet mbuf.
>   *
> - * After attachment we refer the mbuf we attached as 'indirect',
> - * while mbuf we attached to as 'direct'.
> - * The direct mbuf's reference counter is incremented.
> + * If the mbuf we are attaching to isn't a direct buffer and is attached to
> + * an external buffer, the mbuf being attached will be attached to the
> + * external buffer instead of mbuf indirection.
> + *
> + * Otherwise, the mbuf will be indirectly attached. After attachment we
> + * refer the mbuf we attached as 'indirect', while mbuf we attached to as
> + * 'direct'.  The direct mbuf's reference counter is incremented.
>   *
>   * Right now, not supported:
>   *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
> @@ -1231,19 +1440,19 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   */
>  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  {
> -	struct rte_mbuf *md;
> -
>  	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
>  	    rte_mbuf_refcnt_read(mi) == 1);
> 
> -	/* if m is not direct, get the mbuf that embeds the data */
> -	if (RTE_MBUF_DIRECT(m))
> -		md = m;
> -	else
> -		md = rte_mbuf_from_indirect(m);
> +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> +		mi->ol_flags = m->ol_flags;
> +	} else {
> +		/* if m is not direct, get the mbuf that embeds the data */
> +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
> +		mi->priv_size = m->priv_size;
> +		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
> +	}
> 
> -	rte_mbuf_refcnt_update(md, 1);
> -	mi->priv_size = m->priv_size;
>  	mi->buf_iova = m->buf_iova;
>  	mi->buf_addr = m->buf_addr;
>  	mi->buf_len = m->buf_len;
> @@ -1259,7 +1468,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  	mi->next = NULL;
>  	mi->pkt_len = mi->data_len;
>  	mi->nb_segs = 1;
> -	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
>  	mi->packet_type = m->packet_type;
>  	mi->timestamp = m->timestamp;
> 
> @@ -1268,12 +1476,52 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  }
> 
>  /**
> - * Detach an indirect packet mbuf.
> + * @internal used by rte_pktmbuf_detach().
>   *
> + * Decrement the reference counter of the external buffer. When the
> + * reference counter becomes 0, the buffer is freed by pre-registered
> + * callback.
> + */
> +static inline void
> +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> +{
> +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> +	RTE_ASSERT(m->shinfo != NULL);
> +
> +	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
> +		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
> +}
> +
> +/**
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the direct mbuf's reference counter. When the reference
> + * counter becomes 0, the direct mbuf is freed.
> + */
> +static inline void
> +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf *md;
> +
> +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> +
> +	md = rte_mbuf_from_indirect(m);
> +
> +	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> +		md->next = NULL;
> +		md->nb_segs = 1;
> +		rte_mbuf_refcnt_set(md, 1);
> +		rte_mbuf_raw_free(md);
> +	}
> +}
> +
> +/**
> + * Detach a packet mbuf from external buffer or direct buffer.
> + *
> + *  - decrement refcnt and free the external/direct buffer if refcnt
> + *    becomes zero.
>   *  - restore original mbuf address and length values.
>   *  - reset pktmbuf data and data_len to their default values.
> - *  - decrement the direct mbuf's reference counter. When the
> - *  reference counter becomes 0, the direct mbuf is freed.
>   *
>   * All other fields of the given packet mbuf will be left intact.
>   *
> @@ -1282,10 +1530,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   */
>  static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  {
> -	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
>  	struct rte_mempool *mp = m->pool;
>  	uint32_t mbuf_size, buf_len, priv_size;
> 
> +	if (RTE_MBUF_HAS_EXTBUF(m))
> +		__rte_pktmbuf_free_extbuf(m);
> +	else
> +		__rte_pktmbuf_free_direct(m);
> +
>  	priv_size = rte_pktmbuf_priv_size(mp);
>  	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
>  	buf_len = rte_pktmbuf_data_room_size(mp);
> @@ -1297,13 +1549,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  	rte_pktmbuf_reset_headroom(m);
>  	m->data_len = 0;
>  	m->ol_flags = 0;
> -
> -	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> -		md->next = NULL;
> -		md->nb_segs = 1;
> -		rte_mbuf_refcnt_set(md, 1);
> -		rte_mbuf_raw_free(md);
> -	}
>  }
> 
>  /**
> @@ -1327,7 +1572,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> 
>  	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> 
> -		if (RTE_MBUF_INDIRECT(m))
> +		if (!RTE_MBUF_DIRECT(m))
>  			rte_pktmbuf_detach(m);
> 
>  		if (m->next != NULL) {
> @@ -1339,7 +1584,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> 
>  	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> 
> -		if (RTE_MBUF_INDIRECT(m))
> +		if (!RTE_MBUF_DIRECT(m))
>  			rte_pktmbuf_detach(m);
> 
>  		if (m->next != NULL) {
> --
> 2.11.0


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 23:34           ` Yongseok Koh
@ 2018-04-25 14:45             ` Andrew Rybchenko
  2018-04-25 17:40               ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-25 14:45 UTC (permalink / raw)
  To: Yongseok Koh, Olivier Matz
  Cc: wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro, Thomas Monjalon

On 04/25/2018 02:34 AM, Yongseok Koh wrote:
> On Tue, Apr 24, 2018 at 09:15:38PM +0200, Olivier Matz wrote:
>> On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
>>> On 04/24/2018 07:02 PM, Olivier Matz wrote:
>>>>>> +	m->ol_flags |= EXT_ATTACHED_MBUF;
>>>>>> +	m->shinfo = shinfo;
>>>>>> +
>>>>>> +	rte_mbuf_ext_refcnt_set(shinfo, 1);
>>>>> Why is assignment used here? Cannot we attach extbuf already attached to
>>>>> other mbuf?
>>>> In rte_pktmbuf_attach(), this is true. That's not illogical to
>>>> keep the same approach here. Maybe an assert could be added?
> Like I described in the doc, intention is to attach external buffer by
> _attach_extbuf() for the first time and _attach() is just for additional mbuf
> attachment. Will add an assert.

Not sure that I understand. How should the second chunk with shared shinfo
of the huge buffer be attached to a new mbuf?

>>> Other option is to have shinfo per small buf plus reference counter
>>> per huge buf (which is decremented when small buf reference counter
>>> becomes zero and free callback is executed). I guess it is assumed above.
>>> My fear is that it is too much reference counters:
>>>   1. mbuf reference counter
>>>   2. small buf reference counter
>>>   3. huge buf reference counter
>>> May be it is possible use (1) for (2) as well?
>> I would prefer to have only 2 reference counters, one in the mbuf
>> and one in the shinfo.
> Good discussion. It should be a design decision by user.
>
> In my use-case, it would be a good idea to make all the mbufs in a same chunk
> point to the same shared info in the head of the chunk and reset the refcnt of
> shinfo to the total number of slices in the chunk.
>
> +--+----+----+--------------+----+--------------+---+- - -
> |global |head|mbuf1 data    |head|mbuf2 data    |   |
> | shinfo|room|              |room|              |   |
> +--+----+----+--------------+----+--------------+---+- - -

I don't understand how it can be achieved using the proposed API.

Andrew.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-24 20:22           ` Thomas Monjalon
  2018-04-24 21:53             ` Yongseok Koh
@ 2018-04-25 15:06             ` Stephen Hemminger
  1 sibling, 0 replies; 86+ messages in thread
From: Stephen Hemminger @ 2018-04-25 15:06 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Olivier Matz, Andrew Rybchenko, Yongseok Koh, wenzhuo.lu,
	jingjing.wu, dev, konstantin.ananyev, adrien.mazarguil,
	nelio.laranjeiro

On Tue, 24 Apr 2018 22:22:45 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> > > 
> > > I guess the problem that it changes INDIRECT semantics since EXTBUF
> > > is added as well. I think strictly speaking it is an API change.
> > > Is it OK to make it without announcement?  
> > 
> > In any case, there will be an ABI change, because an application
> > compiled for 18.02 will not be able to handle these new kind of
> > mbuf.
> > 
> > So unfortunatly yes, I think this kind of changes should first be
> > announced.
> > 
> > Thomas, what do you think?  
> 
> What is the impact for the application developer?
> Is there something to change in the application after this patch?

Maybe the use of external buffers should be negotiated as a receiver
flag (per queue) in the device driver.  If the device wants external buffers
it sets that capability flag, and only uses external buffers if the
application requests it. This allows old applications to work with no
semantic surprises.


* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 13:16       ` Ananyev, Konstantin
@ 2018-04-25 16:44         ` Yongseok Koh
  2018-04-25 18:05           ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 16:44 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, adrien.mazarguil,
	nelio.laranjeiro

On Wed, Apr 25, 2018 at 01:16:38PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > 
> > On Mon, Apr 23, 2018 at 11:53:04AM +0000, Ananyev, Konstantin wrote:
> > [...]
> > > > @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > > >  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > >
> > > >  /**
> > > > + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> > > > + */
> > > > +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > > > +
> > > > +/**
> > > >   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> > > >   */
> > > > -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > > > +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
> > >
> > > As a nit:
> > > RTE_MBUF_DIRECT(mb)  (((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)) == 0)
> > 
> > It was for better readability and I expected the compiler to generate the
> > same code. But, if you still want it this way, I can change it.
> 
> I know compilers are quite smart these days, but you never know for sure,
> so yes, I think it's better to do that explicitly.

Okay.

> > [...]
> > > >  /**
> > > > - * Detach an indirect packet mbuf.
> > > > + * @internal used by rte_pktmbuf_detach().
> > > > + *
> > > > + * Decrement the reference counter of the external buffer. When the
> > > > + * reference counter becomes 0, the buffer is freed by pre-registered
> > > > + * callback.
> > > > + */
> > > > +static inline void
> > > > +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> > > > +{
> > > > +	struct rte_mbuf_ext_shared_info *shinfo;
> > > > +
> > > > +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> > > > +
> > > > +	shinfo = rte_mbuf_ext_shinfo(m);
> > > > +
> > > > +	if (rte_extbuf_refcnt_update(shinfo, -1) == 0)
> > > > +		shinfo->free_cb(m->buf_addr, shinfo->fcb_opaque);
> > >
> > >
> > > I understand the reason but extra function call for each external mbuf - seems quite expensive.
> > > Wonder is it possible to group them somehow and amortize the cost?
> > 
> > Good point. I thought about it today.
> > 
> > Comparing to the regular mbuf, maybe three differences. a) free function isn't
> > inlined but a real branch. b) no help from core local cache like mempool's c) no
> > free_bulk func like rte_mempool_put_bulk(). But these look quite costly and
> > complicated for the external buffer attachment.
> > 
> > For example, to free it in bulk, external buffers should be grouped as the
> > buffers would have different callback functions. To do that, I have to make an
> > API to pre-register an external buffer group to prepare resources for the bulk
> > free. Then, buffers can't be anonymous anymore but have to be registered in
> > advance. If so, it would be better to use existing APIs, especially when a user
> > wants high throughput...
> > 
> > Let me know if you have better idea to implement it. Then, I'll gladly take
> > that. Or, we can push any improvement patch in the next releases.
> 
> I don't have any extra-smart thoughts here.
> One option I thought about was to introduce a group of external buffers with
> a common free routine (I think I mentioned it already).
> Second - hide all that external buffer management inside the mempool,
> i.e. if the user wants to use external buffers he creates a mempool
> (with rte_mbuf_ext_shared_info as elements?), then attaches an external
> buffer to shinfo and calls mbuf_attach_external(mbuf, shinfo).
> Though for free we can just call mempool_put(shinfo) and let the particular
> implementation decide when/how to call free_cb(), etc.
I don't want to restrict external buffers to mempool objects. Especially
storage users want to use **any** buffer, even one coming from outside of DPDK.

However, will open a follow-up discussion for this in the next release window
probably with more measurement data.
Thank you for suggestions.

Yongseok


* Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 13:31   ` [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
@ 2018-04-25 17:06     ` Yongseok Koh
  2018-04-25 17:23       ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 17:06 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, arybchenko,
	stephen, thomas, adrien.mazarguil, nelio.laranjeiro

On Wed, Apr 25, 2018 at 01:31:42PM +0000, Ananyev, Konstantin wrote:
[...]
> >  /** Mbuf prefetch */
> >  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> >  	if ((m) != NULL)                        \
> > @@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> >  }
> > 
> >  /**
> > + * Attach an external buffer to a mbuf.
> > + *
> > + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> > + * it, corresponding free callback function and its argument should be
> > + * provided. This callback function will be called once all the mbufs are
> > + * detached from the buffer.
> > + *
> > + * The headroom for the attaching mbuf will be set to zero and this can be
> > + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
> > + * or ``rte_pktmbuf_reset_headroom()`` can be used.
> > + *
> > + * More mbufs can be attached to the same external buffer by
> > + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> > + * this API.
> > + *
> > + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> > + * ``rte_pktmbuf_detach()``.
> > + *
> > + * Attaching an external buffer is quite similar to mbuf indirection in
> > + * replacing buffer addresses and length of a mbuf, but a few differences:
> > + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> > + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> > + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> > + *   external buffer has its own refcnt and it starts from 1. Unless
> > + *   multiple mbufs are attached to a mbuf having an external buffer, the
> > + *   external buffer is writable.
> > + * - There's no need to allocate buffer from a mempool. Any buffer can be
> > + *   attached with appropriate free callback and its IO address.
> > + * - Smaller metadata is required to maintain shared data such as refcnt.
> > + *
> > + * @warning
> > + * @b EXPERIMENTAL: This API may change without prior notice.
> > + * Once external buffer is enabled by allowing experimental API,
> > + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> > + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> > + * having external buffer.
> > + *
> > + * @param m
> > + *   The pointer to the mbuf.
> > + * @param buf_addr
> > + *   The pointer to the external buffer we're attaching to.
> > + * @param buf_iova
> > + *   IO address of the external buffer we're attaching to.
> > + * @param buf_len
> > + *   The size of the external buffer we're attaching to. If memory for
> > + *   shared data is not provided, buf_len must be larger than the size of
> > + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> > + *   enough, this function will return NULL.
> > + * @param shinfo
> > + *   User-provided memory for shared data. If NULL, a few bytes in the
> > + *   trailer of the provided buffer will be dedicated for shared data and
> > + *   the shared data will be properly initialized. Otherwise, user must
> > + *   initialize the content except for free callback and its argument. The
> > + *   pointer of shared data will be stored in m->shinfo.
> > + * @param free_cb
> > + *   Free callback function to call when the external buffer needs to be
> > + *   freed.
> > + * @param fcb_opaque
> > + *   Argument for the free callback function.
> > + *
> > + * @return
> > + *   A pointer to the new start of the data on success, return NULL
> > + *   otherwise.
> > + */
> > +static inline char * __rte_experimental
> > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > +	rte_iova_t buf_iova, uint16_t buf_len,
> > +	struct rte_mbuf_ext_shared_info *shinfo,
> > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > +{
> > +	/* Additional attachment should be done by rte_pktmbuf_attach() */
> > +	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));
> 
> Shouldn't we have here something like:
> RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> ?

Right, that's better. The mbuf being attached should be direct and writable.

> > +
> > +	m->buf_addr = buf_addr;
> > +	m->buf_iova = buf_iova;
> > +
> > +	if (shinfo == NULL) {
> 
> Instead of allocating shinfo ourselves - wouldn't it be better to rely
> on the caller always allocating and filling it for us? (He can do that at
> the end/start of the buffer, or wherever he likes.)

It is just for convenience. For some users, external attachment could be
occasional and casual, e.g. punt control traffic from kernel/hv. For such
non-serious cases, it is good to provide this small utility.

> Again in that case - caller can provide one shinfo to several mbufs (with different buf_addrs)
> and would know for sure that free_cb wouldn't be overwritten by mistake.
> I.E. mbuf code will only update refcnt inside shinfo.

I think you missed the discussion with other people yesterday. This change is
exactly for that purpose. Like I documented above, if this API is called with
shinfo being provided, it will use the user-provided shinfo instead of sparing a
> > few bytes in the trailer and won't touch the shinfo. This code block happens only
if user doesn't provide memory for shared data (shinfo is NULL).

> > +		void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > +
> > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > +		if ((void *)shinfo <= buf_addr)
> > +			return NULL;
> > +
> > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > +		rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +	} else {
> > +		m->buf_len = buf_len;
> 
> I think you need to update shinfo->refcnt here too.

As explained above, if shinfo is provided, it doesn't alter anything except
the callback and its argument.


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 17:06     ` Yongseok Koh
@ 2018-04-25 17:23       ` Ananyev, Konstantin
  2018-04-25 18:02         ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-25 17:23 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, arybchenko,
	stephen, thomas, adrien.mazarguil, nelio.laranjeiro



> -----Original Message-----
> From: Yongseok Koh [mailto:yskoh@mellanox.com]
> Sent: Wednesday, April 25, 2018 6:07 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; olivier.matz@6wind.com; dev@dpdk.org;
> arybchenko@solarflare.com; stephen@networkplumber.org; thomas@monjalon.net; adrien.mazarguil@6wind.com;
> nelio.laranjeiro@6wind.com
> Subject: Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
> 
> On Wed, Apr 25, 2018 at 01:31:42PM +0000, Ananyev, Konstantin wrote:
> [...]
> > >  /** Mbuf prefetch */
> > >  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> > >  	if ((m) != NULL)                        \
> > > @@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > >  }
> > >
> > >  /**
> > > + * Attach an external buffer to a mbuf.
> > > + *
> > > + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> > > + * it, corresponding free callback function and its argument should be
> > > + * provided. This callback function will be called once all the mbufs are
> > > + * detached from the buffer.
> > > + *
> > > + * The headroom for the attaching mbuf will be set to zero and this can be
> > > + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
> > > + * or ``rte_pktmbuf_reset_headroom()`` can be used.
> > > + *
> > > + * More mbufs can be attached to the same external buffer by
> > > + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> > > + * this API.
> > > + *
> > > + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> > > + * ``rte_pktmbuf_detach()``.
> > > + *
> > > + * Attaching an external buffer is quite similar to mbuf indirection in
> > > + * replacing buffer addresses and length of a mbuf, but a few differences:
> > > + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> > > + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> > > + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> > > + *   external buffer has its own refcnt and it starts from 1. Unless
> > > + *   multiple mbufs are attached to a mbuf having an external buffer, the
> > > + *   external buffer is writable.
> > > + * - There's no need to allocate buffer from a mempool. Any buffer can be
> > > + *   attached with appropriate free callback and its IO address.
> > > + * - Smaller metadata is required to maintain shared data such as refcnt.
> > > + *
> > > + * @warning
> > > + * @b EXPERIMENTAL: This API may change without prior notice.
> > > + * Once external buffer is enabled by allowing experimental API,
> > > + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> > > + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> > > + * having external buffer.
> > > + *
> > > + * @param m
> > > + *   The pointer to the mbuf.
> > > + * @param buf_addr
> > > + *   The pointer to the external buffer we're attaching to.
> > > + * @param buf_iova
> > > + *   IO address of the external buffer we're attaching to.
> > > + * @param buf_len
> > > + *   The size of the external buffer we're attaching to. If memory for
> > > + *   shared data is not provided, buf_len must be larger than the size of
> > > + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> > > + *   enough, this function will return NULL.
> > > + * @param shinfo
> > > + *   User-provided memory for shared data. If NULL, a few bytes in the
> > > + *   trailer of the provided buffer will be dedicated for shared data and
> > > + *   the shared data will be properly initialized. Otherwise, user must
> > > + *   initialize the content except for free callback and its argument. The
> > > + *   pointer of shared data will be stored in m->shinfo.
> > > + * @param free_cb
> > > + *   Free callback function to call when the external buffer needs to be
> > > + *   freed.
> > > + * @param fcb_opaque
> > > + *   Argument for the free callback function.
> > > + *
> > > + * @return
> > > + *   A pointer to the new start of the data on success, return NULL
> > > + *   otherwise.
> > > + */
> > > +static inline char * __rte_experimental
> > > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > > +	rte_iova_t buf_iova, uint16_t buf_len,
> > > +	struct rte_mbuf_ext_shared_info *shinfo,
> > > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > > +{
> > > +	/* Additional attachment should be done by rte_pktmbuf_attach() */
> > > +	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));
> >
> > Shouldn't we have here something like:
> > RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> > ?
> 
> Right, that's better. The mbuf being attached should be direct and writable.
> 
> > > +
> > > +	m->buf_addr = buf_addr;
> > > +	m->buf_iova = buf_iova;
> > > +
> > > +	if (shinfo == NULL) {
> >
> > Instead of allocating shinfo ourselves - wouldn't it be better to rely
> > on the caller always allocating and filling it for us? (He can do that at
> > the end/start of the buffer, or wherever he likes.)
> 
> It is just for convenience. For some users, external attachment could be
> occasional and casual, e.g. punt control traffic from kernel/hv. For such
> non-serious cases, it is good to provide this small utility.

For such users, that small utility could then be a separate function:
shinfo_inside_buf() or so.

> 
> > Again in that case - caller can provide one shinfo to several mbufs (with different buf_addrs)
> > and would know for sure that free_cb wouldn't be overwritten by mistake.
> > I.E. mbuf code will only update refcnt inside shinfo.
> 
> I think you missed the discussion with other people yesterday. This change is
> exactly for that purpose. Like I documented above, if this API is called with
> shinfo being provided, it will use the user-provided shinfo instead of sparing a
> few bytes in the trailer and won't touch the shinfo.

As I can see, your current code always updates free_cb and fcb_opaque,
which is kind of strange as these fields should be the same for all instances of the shinfo.

> This code block happens only
> if user doesn't provide memory for shared data (shinfo is NULL).
> 
> > > +		void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > > +
> > > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > > +		if ((void *)shinfo <= buf_addr)
> > > +			return NULL;
> > > +
> > > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > > +		rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > +	} else {
> > > +		m->buf_len = buf_len;
> >
> > I think you need to update shinfo->refcnt here too.
> 
> As explained above, if shinfo is provided, it doesn't alter anything except
> the callback and its argument.

Hm, but if I have 2 mbufs attached to the same external buffer via the same
shinfo, shouldn't shinfo.refcnt == 2?


> 
> 
> Thanks,
> Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 14:45             ` Andrew Rybchenko
@ 2018-04-25 17:40               ` Yongseok Koh
  0 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 17:40 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Olivier Matz, wenzhuo.lu, jingjing.wu, dev, konstantin.ananyev,
	adrien.mazarguil, nelio.laranjeiro, Thomas Monjalon

On Wed, Apr 25, 2018 at 05:45:34PM +0300, Andrew Rybchenko wrote:
> On 04/25/2018 02:34 AM, Yongseok Koh wrote:
> > On Tue, Apr 24, 2018 at 09:15:38PM +0200, Olivier Matz wrote:
> > > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > > > > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > > > > > > +	m->shinfo = shinfo;
> > > > > > > +
> > > > > > > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > > > Why is assignment used here? Cannot we attach extbuf already attached to
> > > > > > other mbuf?
> > > > > In rte_pktmbuf_attach(), this is true. That's not illogical to
> > > > > keep the same approach here. Maybe an assert could be added?
> > Like I described in the doc, the intention is to attach an external buffer by
> > _attach_extbuf() the first time; _attach() is just for additional mbuf
> > attachments. Will add an assert.
> 
> Not sure that I understand. How should the second chunk with shared shinfo
> of the huge buffer be attached to a new mbuf?

Okay, I think I know what you misunderstood. This patch itself has no notion
of a huge buffer; it simply attaches an external buffer to an mbuf. Slicing a
huge buffer and attaching multiple mbufs is one use-case of this feature. Let
me take a few examples.

rte_pktmbuf_attach_extbuf(m, buf, buf_iova, buf_len, NULL, fcb, fcb_arg);

                |<------ buf_len ------>|
 +----+         +----------------+------+
 | m  |-------->| ext buf        |shinfo|
 |    |         |                |refc=1|
 +----+         +----------------+------+
                |<- m->buf_len ->|

rte_pktmbuf_attach(m1, m);

 +----+         +----------------+------+
 | m  |-------->| ext buf        |shinfo|
 |    |         |                |refc=2|
 +----+         +----------------+------+
                ^
 +----+         |
 | m1 |---------+
 |    |
 +----+

rte_pktmbuf_attach_extbuf(m, buf, buf_iova, buf_len, shinfo, fcb, fcb_arg);

                                 |<------ buf_len ------>|
 +------+         +----+         +-----------------------+
 |shinfo|<--------| m  |-------->| ext buf               |
 |refc=1|         |    |         |                       |
 +------+         +----+         +-----------------------+
                                 |<----- m->buf_len ---->|

rte_pktmbuf_attach(m1, m);

 +------+         +----+         +-----------------------+
 |shinfo|<--------| m  |-------->| ext buf               |
 |refc=2|         |    |         |                       |
 +------+         +----+         +-----------------------+
     ^                           ^
     |            +----+         |
     +------------| m1 |---------+
                  |    |
                  +----+

> > > > Other option is to have shinfo per small buf plus reference counter
> > > > per huge buf (which is decremented when small buf reference counter
> > > > becomes zero and free callback is executed). I guess it is assumed above.
> > > > My fear is that it is too much reference counters:
> > > >   1. mbuf reference counter
> > > >   2. small buf reference counter
> > > >   3. huge buf reference counter
> > > > May be it is possible use (1) for (2) as well?
> > > I would prefer to have only 2 reference counters, one in the mbuf
> > > and one in the shinfo.
> > Good discussion. It should be a design decision by user.
> > 
> > In my use-case, it would be a good idea to make all the mbufs in the same chunk
> > point to the same shared info in the head of the chunk and reset the refcnt of
> > shinfo to the total number of slices in the chunk.
> > 
> > +--+----+----+--------------+----+--------------+---+- - -
> > |global |head|mbuf1 data    |head|mbuf2 data    |   |
> > | shinfo|room|              |room|              |   |
> > +--+----+----+--------------+----+--------------+---+- - -
> 
> I don't understand how it can be achieved using proposed API.
 
For the following use-case,
 +--+----+----+--------------+----+--------------+---+- - -
 |global |head|mbuf1 data    |head|mbuf2 data    |   |
 | shinfo|room|              |room|              |   |
 +--+----+----+--------------+----+--------------+---+- - -
 ^       |<---- buf_len ---->|
 |
 mem

The user can do:

g_shinfo = (struct rte_mbuf_ext_shared_info *)mem;
buf = mem + sizeof (*g_shinfo);
buf_iova = get_iova(buf);
rte_pktmbuf_attach_extbuf(m1, buf, buf_iova, buf_len, g_shinfo, fcb, fcb_arg);
rte_mbuf_ext_refcnt_update(g_shinfo, 1);
rte_pktmbuf_reset_headroom(m1);
buf += buf_len;
buf_iova = get_iova(buf);
rte_pktmbuf_attach_extbuf(m2, buf, buf_iova, buf_len, g_shinfo, fcb, fcb_arg);
rte_mbuf_ext_refcnt_update(g_shinfo, 1);
rte_pktmbuf_reset_headroom(m2);


Does it make sense?

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 17:23       ` Ananyev, Konstantin
@ 2018-04-25 18:02         ` Yongseok Koh
  2018-04-25 18:22           ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 18:02 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, arybchenko,
	stephen, thomas, adrien.mazarguil, nelio.laranjeiro

On Wed, Apr 25, 2018 at 05:23:20PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Yongseok Koh [mailto:yskoh@mellanox.com]
> > Sent: Wednesday, April 25, 2018 6:07 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; olivier.matz@6wind.com; dev@dpdk.org;
> > arybchenko@solarflare.com; stephen@networkplumber.org; thomas@monjalon.net; adrien.mazarguil@6wind.com;
> > nelio.laranjeiro@6wind.com
> > Subject: Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
> > 
> > On Wed, Apr 25, 2018 at 01:31:42PM +0000, Ananyev, Konstantin wrote:
> > [...]
> > > >  /** Mbuf prefetch */
> > > >  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> > > >  	if ((m) != NULL)                        \
> > > > @@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > > >  }
> > > >
> > > >  /**
> > > > + * Attach an external buffer to a mbuf.
> > > > + *
> > > > + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> > > > + * it, corresponding free callback function and its argument should be
> > > > + * provided. This callback function will be called once all the mbufs are
> > > > + * detached from the buffer.
> > > > + *
> > > > + * The headroom for the attaching mbuf will be set to zero and this can be
> > > > + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
> > > > + * or ``rte_pktmbuf_reset_headroom()`` can be used.
> > > > + *
> > > > + * More mbufs can be attached to the same external buffer by
> > > > + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> > > > + * this API.
> > > > + *
> > > > + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> > > > + * ``rte_pktmbuf_detach()``.
> > > > + *
> > > > + * Attaching an external buffer is quite similar to mbuf indirection in
> > > > + * replacing buffer addresses and length of a mbuf, but a few differences:
> > > > + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> > > > + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> > > > + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> > > > + *   external buffer has its own refcnt and it starts from 1. Unless
> > > > + *   multiple mbufs are attached to a mbuf having an external buffer, the
> > > > + *   external buffer is writable.
> > > > + * - There's no need to allocate buffer from a mempool. Any buffer can be
> > > > + *   attached with appropriate free callback and its IO address.
> > > > + * - Smaller metadata is required to maintain shared data such as refcnt.
> > > > + *
> > > > + * @warning
> > > > + * @b EXPERIMENTAL: This API may change without prior notice.
> > > > + * Once external buffer is enabled by allowing experimental API,
> > > > + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> > > > + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> > > > + * having external buffer.
> > > > + *
> > > > + * @param m
> > > > + *   The pointer to the mbuf.
> > > > + * @param buf_addr
> > > > + *   The pointer to the external buffer we're attaching to.
> > > > + * @param buf_iova
> > > > + *   IO address of the external buffer we're attaching to.
> > > > + * @param buf_len
> > > > + *   The size of the external buffer we're attaching to. If memory for
> > > > + *   shared data is not provided, buf_len must be larger than the size of
> > > > + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> > > > + *   enough, this function will return NULL.
> > > > + * @param shinfo
> > > > + *   User-provided memory for shared data. If NULL, a few bytes in the
> > > > + *   trailer of the provided buffer will be dedicated for shared data and
> > > > + *   the shared data will be properly initialized. Otherwise, user must
> > > > + *   initialize the content except for free callback and its argument. The
> > > > + *   pointer of shared data will be stored in m->shinfo.
> > > > + * @param free_cb
> > > > + *   Free callback function to call when the external buffer needs to be
> > > > + *   freed.
> > > > + * @param fcb_opaque
> > > > + *   Argument for the free callback function.
> > > > + *
> > > > + * @return
> > > > + *   A pointer to the new start of the data on success, return NULL
> > > > + *   otherwise.
> > > > + */
> > > > +static inline char * __rte_experimental
> > > > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > > > +	rte_iova_t buf_iova, uint16_t buf_len,
> > > > +	struct rte_mbuf_ext_shared_info *shinfo,
> > > > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > > > +{
> > > > +	/* Additional attachment should be done by rte_pktmbuf_attach() */
> > > > +	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));
> > >
> > > Shouldn't we have here something like:
> > > RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> > > ?
> > 
> > Right, that's better. The mbuf being attached should be direct and writable.
> > 
> > > > +
> > > > +	m->buf_addr = buf_addr;
> > > > +	m->buf_iova = buf_iova;
> > > > +
> > > > +	if (shinfo == NULL) {
> > >
> > > Instead of allocating shinfo ourselves - wouldn't it be better to rely
> > > on the caller always allocating and filling it for us? (He can do that at
> > > the end/start of the buffer, or wherever he likes.)
> > 
> > It is just for convenience. For some users, external attachment could be
> > occasional and casual, e.g. punt control traffic from kernel/hv. For such
> > non-serious cases, it is good to provide this small utility.
> 
> For such users, that small utility could then be a separate function:
> shinfo_inside_buf() or so.

I like this idea! As this is an inline function and can be called in a datapath,
shorter code is better if it isn't expected to be used frequently.

Will take this idea for the new version. Thanks.

> > > Again in that case - caller can provide one shinfo to several mbufs (with different buf_addrs)
> > > and would know for sure that free_cb wouldn't be overwritten by mistake.
> > > I.E. mbuf code will only update refcnt inside shinfo.
> > 
> > I think you missed the discussion with other people yesterday. This change is
> > exactly for that purpose. Like I documented above, if this API is called with
> > shinfo being provided, it will use the user-provided shinfo instead of sparing a
> > few bytes in the trailer and won't touch the shinfo.
> 
> As I can see, your current code always updates free_cb and fcb_opaque,
> which is kind of strange as these fields should be the same for all instances of the shinfo.
> 
> > This code block happens only
> > if user doesn't provide memory for shared data (shinfo is NULL).
> > 
> > > > +		void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > > > +
> > > > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > > > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > > > +		if ((void *)shinfo <= buf_addr)
> > > > +			return NULL;
> > > > +
> > > > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > > > +		rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > +	} else {
> > > > +		m->buf_len = buf_len;
> > >
> > > I think you need to update shinfo->refcnt here too.
> > 
> > As explained above, if shinfo is provided, it doesn't alter anything except
> > the callback and its argument.
> 
> Hm, but if I have 2 mbufs attached to the same external buffer via the same
> shinfo, shouldn't shinfo.refcnt == 2?

If the shinfo is provided by the user, the user is responsible for managing its
content. Please refer to my reply to Andrew; I drew a few diagrams there.


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 16:44         ` Yongseok Koh
@ 2018-04-25 18:05           ` Ananyev, Konstantin
  0 siblings, 0 replies; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-25 18:05 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, adrien.mazarguil,
	nelio.laranjeiro



> -----Original Message-----
> From: Yongseok Koh [mailto:yskoh@mellanox.com]
> Sent: Wednesday, April 25, 2018 5:44 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; olivier.matz@6wind.com; dev@dpdk.org;
> adrien.mazarguil@6wind.com; nelio.laranjeiro@6wind.com
> Subject: Re: [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf
> 
> On Wed, Apr 25, 2018 at 01:16:38PM +0000, Ananyev, Konstantin wrote:
> >
> >
> > >
> > > On Mon, Apr 23, 2018 at 11:53:04AM +0000, Ananyev, Konstantin wrote:
> > > [...]
> > > > > @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > > > >  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > >
> > > > >  /**
> > > > > + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> > > > > + */
> > > > > +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > > > > +
> > > > > +/**
> > > > >   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> > > > >   */
> > > > > -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > > > > +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
> > > >
> > > > As a nit:
> > > > RTE_MBUF_DIRECT(mb)  (((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)) == 0)
> > >
> > > It was for better readability and I expected the compiler would do the same.
> > > But, if you still want this way, I can change it.
> >
> > I know compilers are quite smart these days, but you never know for sure,
> > so yes, I think it's better to do that explicitly.
> 
> Okay.
> 
> > > [...]
> > > > >  /**
> > > > > - * Detach an indirect packet mbuf.
> > > > > + * @internal used by rte_pktmbuf_detach().
> > > > > + *
> > > > > + * Decrement the reference counter of the external buffer. When the
> > > > > + * reference counter becomes 0, the buffer is freed by pre-registered
> > > > > + * callback.
> > > > > + */
> > > > > +static inline void
> > > > > +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> > > > > +{
> > > > > +	struct rte_mbuf_ext_shared_info *shinfo;
> > > > > +
> > > > > +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> > > > > +
> > > > > +	shinfo = rte_mbuf_ext_shinfo(m);
> > > > > +
> > > > > +	if (rte_extbuf_refcnt_update(shinfo, -1) == 0)
> > > > > +		shinfo->free_cb(m->buf_addr, shinfo->fcb_opaque);
> > > >
> > > >
> > > > I understand the reason but extra function call for each external mbuf - seems quite expensive.
> > > > Wonder is it possible to group them somehow and amortize the cost?
> > >
> > > Good point. I thought about it today.
> > >
> > > Compared to the regular mbuf, there are maybe three differences. a) free function isn't
> > > inlined but a real branch. b) no help from core local cache like mempool's c) no
> > > free_bulk func like rte_mempool_put_bulk(). But these look quite costly and
> > > complicated for the external buffer attachment.
> > >
> > > For example, to free it in bulk, external buffers should be grouped as the
> > > buffers would have different callback functions. To do that, I have to make an
> > > API to pre-register an external buffer group to prepare resources for the bulk
> > > free. Then, buffers can't be anonymous anymore but have to be registered in
> > > advance. If so, it would be better to use existing APIs, especially when a user
> > > wants high throughput...
> > >
> > > Let me know if you have better idea to implement it. Then, I'll gladly take
> > > that. Or, we can push any improvement patch in the next releases.
> >
> > I don't have any extra-smart thoughts here.
> > One option I thought about was to introduce a group of external buffers with a
> > common free routine (I think I mentioned it already).
> > Second - hide all that external buffer management inside mempool,
> > i.e. if user wants to use external buffers he create a mempool
> > (with rte_mbuf_ext_shared_info as elements?), then attach external buffer to shinfo
> > and call mbuf_attach_external(mbuf, shinfo).
> > Though for free we can just call mempool_put(shinfo) and let particular implementation
> > decide when/how call free_cb(), etc.
> I don't want to restrict external buffers to mempool objects. Storage users
> especially want to use **any** buffer, even one coming from outside of DPDK.

I am not talking about the case where the external buffer is allocated from a mempool.
I am talking about an implementation where shinfo is a mempool element.
So to bring an external buffer into DPDK, users get a shinfo (from the mempool) and
attach it to the external buffer.
When no one needs that external buffer any more (shinfo.refcnt == 0),
mempool_put() is invoked for the shinfo.
Inside put() we can either call free_cb() or keep the external buffer for further usage.
Anyway, just a thought.
Konstantin

> 
> However, will open a follow-up discussion for this in the next release window
> probably with more measurement data.
> Thank you for suggestions.
> 
> Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 18:02         ` Yongseok Koh
@ 2018-04-25 18:22           ` Yongseok Koh
  2018-04-25 18:30             ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 18:22 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, arybchenko,
	stephen, thomas, adrien.mazarguil, nelio.laranjeiro

On Wed, Apr 25, 2018 at 11:02:36AM -0700, Yongseok Koh wrote:
> On Wed, Apr 25, 2018 at 05:23:20PM +0000, Ananyev, Konstantin wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Yongseok Koh [mailto:yskoh@mellanox.com]
> > > Sent: Wednesday, April 25, 2018 6:07 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; olivier.matz@6wind.com; dev@dpdk.org;
> > > arybchenko@solarflare.com; stephen@networkplumber.org; thomas@monjalon.net; adrien.mazarguil@6wind.com;
> > > nelio.laranjeiro@6wind.com
> > > Subject: Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
> > > 
> > > On Wed, Apr 25, 2018 at 01:31:42PM +0000, Ananyev, Konstantin wrote:
> > > [...]
> > > > >  /** Mbuf prefetch */
> > > > >  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> > > > >  	if ((m) != NULL)                        \
> > > > > @@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > > > >  }
> > > > >
> > > > >  /**
> > > > > + * Attach an external buffer to a mbuf.
> > > > > + *
> > > > > + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> > > > > + * it, corresponding free callback function and its argument should be
> > > > > + * provided. This callback function will be called once all the mbufs are
> > > > > + * detached from the buffer.
> > > > > + *
> > > > > + * The headroom for the attaching mbuf will be set to zero and this can be
> > > > > + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
> > > > > + * or ``rte_pktmbuf_reset_headroom()`` can be used.
> > > > > + *
> > > > > + * More mbufs can be attached to the same external buffer by
> > > > > + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> > > > > + * this API.
> > > > > + *
> > > > > + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> > > > > + * ``rte_pktmbuf_detach()``.
> > > > > + *
> > > > > + * Attaching an external buffer is quite similar to mbuf indirection in
> > > > > + * replacing buffer addresses and length of a mbuf, but a few differences:
> > > > > + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> > > > > + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> > > > > + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> > > > > + *   external buffer has its own refcnt and it starts from 1. Unless
> > > > > + *   multiple mbufs are attached to a mbuf having an external buffer, the
> > > > > + *   external buffer is writable.
> > > > > + * - There's no need to allocate buffer from a mempool. Any buffer can be
> > > > > + *   attached with appropriate free callback and its IO address.
> > > > > + * - Smaller metadata is required to maintain shared data such as refcnt.
> > > > > + *
> > > > > + * @warning
> > > > > + * @b EXPERIMENTAL: This API may change without prior notice.
> > > > > + * Once external buffer is enabled by allowing experimental API,
> > > > > + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> > > > > + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> > > > > + * having external buffer.
> > > > > + *
> > > > > + * @param m
> > > > > + *   The pointer to the mbuf.
> > > > > + * @param buf_addr
> > > > > + *   The pointer to the external buffer we're attaching to.
> > > > > + * @param buf_iova
> > > > > + *   IO address of the external buffer we're attaching to.
> > > > > + * @param buf_len
> > > > > + *   The size of the external buffer we're attaching to. If memory for
> > > > > + *   shared data is not provided, buf_len must be larger than the size of
> > > > > + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> > > > > + *   enough, this function will return NULL.
> > > > > + * @param shinfo
> > > > > + *   User-provided memory for shared data. If NULL, a few bytes in the
> > > > > + *   trailer of the provided buffer will be dedicated for shared data and
> > > > > + *   the shared data will be properly initialized. Otherwise, user must
> > > > > + *   initialize the content except for free callback and its argument. The
> > > > > + *   pointer of shared data will be stored in m->shinfo.
> > > > > + * @param free_cb
> > > > > + *   Free callback function to call when the external buffer needs to be
> > > > > + *   freed.
> > > > > + * @param fcb_opaque
> > > > > + *   Argument for the free callback function.
> > > > > + *
> > > > > + * @return
> > > > > + *   A pointer to the new start of the data on success, return NULL
> > > > > + *   otherwise.
> > > > > + */
> > > > > +static inline char * __rte_experimental
> > > > > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > > > > +	rte_iova_t buf_iova, uint16_t buf_len,
> > > > > +	struct rte_mbuf_ext_shared_info *shinfo,
> > > > > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > > > > +{
> > > > > +	/* Additional attachment should be done by rte_pktmbuf_attach() */
> > > > > +	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));
> > > >
> > > > Shouldn't we have here something like:
> > > > RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> > > > ?
> > > 
> > > Right. That's better. Attaching mbuf should be direct and writable.
> > > 
> > > > > +
> > > > > +	m->buf_addr = buf_addr;
> > > > > +	m->buf_iova = buf_iova;
> > > > > +
> > > > > +	if (shinfo == NULL) {
> > > >
> > > > Instead of allocating shinfo ourselves - wouldn't it be better to rely
> > > > on the caller always allocating and filling it for us (he can do that at
> > > > the end/start of the buffer, or wherever he likes to)?
> > > 
> > > It is just for convenience. For some users, external attachment could be
> > > occasional and casual, e.g. punt control traffic from kernel/hv. For such
> > > non-serious cases, it is good to provide this small utility.
> > 
> > For such users that small utility could be a separate function then:
> > shinfo_inside_buf() or so.
> 
> I like this idea! As this is an inline function and can be called in a datapath,
> shorter code is better if it isn't expected to be used frequently.
> 
> Will take this idea for the new version. Thanks.

However, if this API is called with shinfo=NULL (a builtin constant), this code
block won't be compiled in because it is an inline function.

What is the disadvantage of keeping this block here? Is it more intuitive?

The advantage of keeping it here could be simplicity: no need to call the
utility in advance.

Or, separating this code into another inline function could make the API
prototype simpler because free_cb and its arg would be passed via shinfo.

static inline char * __rte_experimental
rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
		rte_iova_t buf_iova, uint16_t buf_len,
		struct rte_mbuf_ext_shared_info *shinfo)

I'm still inclined to write the utility function like you suggested.
Thoughts?

Thanks,
Yongseok

> > > > Again in that case - caller can provide one shinfo to several mbufs (with different buf_addrs)
> > > > and would know for sure that free_cb wouldn't be overwritten by mistake.
> > > > I.E. mbuf code will only update refcnt inside shinfo.
> > > 
> > > I think you missed the discussion with other people yesterday. This change is
> > > exactly for that purpose. Like I documented above, if this API is called with
> > > shinfo being provided, it will use the user-provided shinfo instead of sparing a
> > > few bytes in the trailer and won't touch the shinfo.
> > 
> > As I can see, your current code always updates free_cb and fcb_opaque,
> > which is kind of strange; these fields should be the same for all instances of the shinfo.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 18:22           ` Yongseok Koh
@ 2018-04-25 18:30             ` Yongseok Koh
  0 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 18:30 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Wu, Jingjing, olivier.matz, dev, arybchenko,
	stephen, Thomas Monjalon, Adrien Mazarguil,
	Nélio Laranjeiro


> On Apr 25, 2018, at 11:22 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
> 
> On Wed, Apr 25, 2018 at 11:02:36AM -0700, Yongseok Koh wrote:
>> On Wed, Apr 25, 2018 at 05:23:20PM +0000, Ananyev, Konstantin wrote:
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Yongseok Koh [mailto:yskoh@mellanox.com]
>>>> Sent: Wednesday, April 25, 2018 6:07 PM
>>>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
>>>> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; olivier.matz@6wind.com; dev@dpdk.org;
>>>> arybchenko@solarflare.com; stephen@networkplumber.org; thomas@monjalon.net; adrien.mazarguil@6wind.com;
>>>> nelio.laranjeiro@6wind.com
>>>> Subject: Re: [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf
>>>> 
>>>> On Wed, Apr 25, 2018 at 01:31:42PM +0000, Ananyev, Konstantin wrote:
>>>> [...]
>>>>>> /** Mbuf prefetch */
>>>>>> #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
>>>>>> 	if ((m) != NULL)                        \
>>>>>> @@ -1213,11 +1306,127 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>>>>>> }
>>>>>> 
>>>>>> /**
>>>>>> + * Attach an external buffer to a mbuf.
>>>>>> + *
>>>>>> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
>>>>>> + * it, corresponding free callback function and its argument should be
>>>>>> + * provided. This callback function will be called once all the mbufs are
>>>>>> + * detached from the buffer.
>>>>>> + *
>>>>>> + * The headroom for the attaching mbuf will be set to zero and this can be
>>>>>> + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
>>>>>> + * or ``rte_pktmbuf_reset_headroom()`` can be used.
>>>>>> + *
>>>>>> + * More mbufs can be attached to the same external buffer by
>>>>>> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
>>>>>> + * this API.
>>>>>> + *
>>>>>> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
>>>>>> + * ``rte_pktmbuf_detach()``.
>>>>>> + *
>>>>>> + * Attaching an external buffer is quite similar to mbuf indirection in
>>>>>> + * replacing buffer addresses and length of a mbuf, but a few differences:
>>>>>> + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>>>>>> + *   2 as long as the direct mbuf itself isn't freed after the attachment.
>>>>>> + *   In such cases, the buffer area of a direct mbuf must be read-only. But
>>>>>> + *   external buffer has its own refcnt and it starts from 1. Unless
>>>>>> + *   multiple mbufs are attached to a mbuf having an external buffer, the
>>>>>> + *   external buffer is writable.
>>>>>> + * - There's no need to allocate buffer from a mempool. Any buffer can be
>>>>>> + *   attached with appropriate free callback and its IO address.
>>>>>> + * - Smaller metadata is required to maintain shared data such as refcnt.
>>>>>> + *
>>>>>> + * @warning
>>>>>> + * @b EXPERIMENTAL: This API may change without prior notice.
>>>>>> + * Once external buffer is enabled by allowing experimental API,
>>>>>> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
>>>>>> + * exclusive. A mbuf can be considered direct if it is neither indirect nor
>>>>>> + * having external buffer.
>>>>>> + *
>>>>>> + * @param m
>>>>>> + *   The pointer to the mbuf.
>>>>>> + * @param buf_addr
>>>>>> + *   The pointer to the external buffer we're attaching to.
>>>>>> + * @param buf_iova
>>>>>> + *   IO address of the external buffer we're attaching to.
>>>>>> + * @param buf_len
>>>>>> + *   The size of the external buffer we're attaching to. If memory for
>>>>>> + *   shared data is not provided, buf_len must be larger than the size of
>>>>>> + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
>>>>>> + *   enough, this function will return NULL.
>>>>>> + * @param shinfo
>>>>>> + *   User-provided memory for shared data. If NULL, a few bytes in the
>>>>>> + *   trailer of the provided buffer will be dedicated for shared data and
>>>>>> + *   the shared data will be properly initialized. Otherwise, user must
>>>>>> + *   initialize the content except for free callback and its argument. The
>>>>>> + *   pointer of shared data will be stored in m->shinfo.
>>>>>> + * @param free_cb
>>>>>> + *   Free callback function to call when the external buffer needs to be
>>>>>> + *   freed.
>>>>>> + * @param fcb_opaque
>>>>>> + *   Argument for the free callback function.
>>>>>> + *
>>>>>> + * @return
>>>>>> + *   A pointer to the new start of the data on success, return NULL
>>>>>> + *   otherwise.
>>>>>> + */
>>>>>> +static inline char * __rte_experimental
>>>>>> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
>>>>>> +	rte_iova_t buf_iova, uint16_t buf_len,
>>>>>> +	struct rte_mbuf_ext_shared_info *shinfo,
>>>>>> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
>>>>>> +{
>>>>>> +	/* Additional attachment should be done by rte_pktmbuf_attach() */
>>>>>> +	RTE_ASSERT(!RTE_MBUF_HAS_EXTBUF(m));
>>>>> 
>>>>> Shouldn't we have here something like:
>>>>> RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
>>>>> ?
>>>> 
>>>> Right. That's better. Attaching mbuf should be direct and writable.
>>>> 
>>>>>> +
>>>>>> +	m->buf_addr = buf_addr;
>>>>>> +	m->buf_iova = buf_iova;
>>>>>> +
>>>>>> +	if (shinfo == NULL) {
>>>>> 
>>>>> Instead of allocating shinfo ourselves - wouldn't it be better to rely
>>>>> on the caller always allocating and filling it for us (he can do that at
>>>>> the end/start of the buffer, or wherever he likes to)?
>>>> 
>>>> It is just for convenience. For some users, external attachment could be
>>>> occasional and casual, e.g. punt control traffic from kernel/hv. For such
>>>> non-serious cases, it is good to provide this small utility.
>>> 
>>> For such users that small utility could be a separate function then:
>>> shinfo_inside_buf() or so.
>> 
>> I like this idea! As this is an inline function and can be called in a datapath,
>> shorter code is better if it isn't expected to be used frequently.
>> 
>> Will take this idea for the new version. Thanks.
> 
> However, if this API is called with shinfo=NULL (a builtin constant), this code
> block won't be compiled in because it is an inline function.

Sorry, it was wrong. I said the exact opposite. Not enough sleep these days. :-(
If shinfo is passed, the code block will be included anyway.

Please disregard the email.

Yongseok

> 
> What is the disadvantage of keeping this block here? Is it more intuitive?
> 
> The advantage of keeping it here could be simplicity: no need to call the
> utility in advance.
> 
> Or, separating this code into another inline function could make the API
> prototype simpler because free_cb and its arg would be passed via shinfo.
> 
> static inline char * __rte_experimental
> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> 		rte_iova_t buf_iova, uint16_t buf_len,
> 		struct rte_mbuf_ext_shared_info *shinfo)
> 
> I'm still inclined to write the utility function like you suggested.
> Thoughts?
> 
> Thanks,
> Yongseok
> 
>>>>> Again in that case - caller can provide one shinfo to several mbufs (with different buf_addrs)
>>>>> and would know for sure that free_cb wouldn't be overwritten by mistake.
>>>>> I.E. mbuf code will only update refcnt inside shinfo.
>>>> 
>>>> I think you missed the discussion with other people yesterday. This change is
>>>> exactly for that purpose. Like I documented above, if this API is called with
>>>> shinfo being provided, it will use the user-provided shinfo instead of sparing a
>>> few bytes in the trailer and won't touch the shinfo.
>>> 
>>> As I can see, your current code always updates free_cb and fcb_opaque,
>>> which is kind of strange; these fields should be the same for all instances of the shinfo.


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25  9:19           ` Yongseok Koh
@ 2018-04-25 20:00             ` Olivier Matz
  2018-04-25 22:54               ` Yongseok Koh
  0 siblings, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-25 20:00 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Andrew Rybchenko, Wenzhuo Lu, Jingjing Wu, dev,
	Konstantin Ananyev, Adrien Mazarguil, Nélio Laranjeiro

On Wed, Apr 25, 2018 at 09:19:32AM +0000, Yongseok Koh wrote:
> > On Apr 25, 2018, at 2:08 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
> >> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> >> 
> >> Hi Yongseok,
> >> 
> >> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
> >>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >>>>> }
> >>>>> /**
> >>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> >>>>> + * otherwise.
> >>>>> + *
> >>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
> >>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
> >>>>> + */
> >>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> >>>>> +
> >>>>> +/**
> >>>>>  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> >>>>>  */
> >>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> >>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> >>>> 
> >>>> It is still confusing that INDIRECT != !DIRECT.
> >>>> May be we have no good options right now, but I'd suggest to at least
> >>>> deprecate
> >>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
> >>> 
> >>> Agree. I may have missed something, but is my previous suggestion
> >>> not doable?
> >>> 
> >>> - direct = embeds its own data      (and indirect = !direct)
> >>> - clone (or another name) = data is another mbuf
> >>> - extbuf = data is in an external buffer
> >> 
> >> Any comment about this option?
> > 
> > I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
> > RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
> > (!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
> > I'm not sure I understand you correctly.
> > 
> Can you please give me more guidelines so that I can take your idea?
> 
> Maybe, did you mean the following? It looks doable, but RTE_MBUF_DIRECT()
> can't logically mean 'mbuf embeds its own data', right?
> 
> #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> #define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> 
> #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> #define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)

I was thinking about something like this:

#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)

#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_HAS_EXTBUF(mb) && !RTE_MBUF_CLONED(mb))
#define RTE_MBUF_INDIRECT(mb)   (!RTE_MBUF_DIRECT(mb))


* Re: [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-25 20:00             ` Olivier Matz
@ 2018-04-25 22:54               ` Yongseok Koh
  0 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-25 22:54 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Andrew Rybchenko, Wenzhuo Lu, Jingjing Wu, dev,
	Konstantin Ananyev, Adrien Mazarguil, Nélio Laranjeiro

On Wed, Apr 25, 2018 at 10:00:10PM +0200, Olivier Matz wrote:
> On Wed, Apr 25, 2018 at 09:19:32AM +0000, Yongseok Koh wrote:
> > > On Apr 25, 2018, at 2:08 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
> > >> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> > >> 
> > >> Hi Yongseok,
> > >> 
> > >> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
> > >>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > >>>>> }
> > >>>>> /**
> > >>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > >>>>> + * otherwise.
> > >>>>> + *
> > >>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
> > >>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
> > >>>>> + */
> > >>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > >>>>> +
> > >>>>> +/**
> > >>>>>  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > >>>>>  */
> > >>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > >>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > >>>> 
> > >>>> It is still confusing that INDIRECT != !DIRECT.
> > >>>> May be we have no good options right now, but I'd suggest to at least
> > >>>> deprecate
> > >>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > >>> 
> > >>> Agree. I may have missed something, but is my previous suggestion
> > >>> not doable?
> > >>> 
> > >>> - direct = embeds its own data      (and indirect = !direct)
> > >>> - clone (or another name) = data is another mbuf
> > >>> - extbuf = data is in an external buffer
> > >> 
> > >> Any comment about this option?
> > > 
> > > I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
> > > RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
> > > (!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
> > > I'm not sure I understand you correctly.
> > > 
> > > Can you please give me more guidelines so that I can take your idea?
> > 
> > Maybe, did you mean the following? It looks doable, but RTE_MBUF_DIRECT()
> > can't logically mean 'mbuf embeds its own data', right?
> > 
> > #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > #define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > 
> > #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > #define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> 
> I was thinking about something like this:
> 
> #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> #define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> 
> #define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_HAS_EXTBUF(mb) && !RTE_MBUF_CLONED(mb))
> #define RTE_MBUF_INDIRECT(mb)   (!RTE_MBUF_DIRECT(mb))

Then, indirect would mean having either IND_ATTACHED_MBUF or EXT_ATTACHED_MBUF,
which sounds weird. In a situation where EXT_ATTACHED_MBUF (experimental) is
never set, your definition of indirect will be the same as now, so there is no
breakage either. Although your idea is logically the same as the current patch,
there seems to be a semantic conflict.

Thanks,
Yongseok


* [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (9 preceding siblings ...)
  2018-04-25  2:53 ` [PATCH v5 " Yongseok Koh
@ 2018-04-26  1:10 ` Yongseok Koh
  2018-04-26  1:10   ` [PATCH v6 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
                     ` (2 more replies)
  2018-04-27  0:01 ` [PATCH v7 " Yongseok Koh
  2018-04-27 17:22 ` [PATCH v8 " Yongseok Koh
  12 siblings, 3 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-26  1:10 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but with a few differences:
  - When an indirect mbuf is attached, refcnt of the direct mbuf would be
    2 as long as the direct mbuf itself isn't freed after the attachment.
    In such cases, the buffer area of a direct mbuf must be read-only. But
    external buffer has its own refcnt and it starts from 1. Unless
    multiple mbufs are attached to a mbuf having an external buffer, the
    external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---

** This patch can pass the mbuf_autotest. **

Submitting only non-mlx5 patches to meet the deadline for RC1. mlx5 patches
will be submitted separately, rebased on a different patchset which
accommodates the new memory hotplug design for mlx PMDs.

v6:
* rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
  rte_pktmbuf_ext_shinfo_init_helper() is added.
* bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
* minor changes from review.

v5:
* rte_pktmbuf_attach_extbuf() sets headroom to 0.
* if shinfo is provided when attaching, user should initialize it.
* minor changes from review.

v4:
* rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
  user can pass memory for shared data via shinfo argument.
* minor changes from review.

v3:
* implement external buffer attachment instead of introducing buf_off for
  mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 335 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 306 insertions(+), 29 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 43aaa9c5f..0a6885281 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -344,7 +344,10 @@ extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+/**
+ * Mbuf having an external buffer attached. shinfo in mbuf must be filled.
+ */
+#define EXT_ATTACHED_MBUF    (1ULL << 61)
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -584,8 +587,27 @@ struct rte_mbuf {
 	/** Sequence number. See also rte_reorder_insert(). */
 	uint32_t seqn;
 
+	/** Shared data for external buffer attached to mbuf. See
+	 * rte_pktmbuf_attach_extbuf().
+	 */
+	struct rte_mbuf_ext_shared_info *shinfo;
+
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -706,14 +728,34 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
 }
 
 /**
+ * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
+ * otherwise.
+ *
+ * If a mbuf has its data in another mbuf and references it by mbuf
+ * indirection, this mbuf can be defined as a cloned mbuf.
+ */
+#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
  */
-#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
+#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
+
+/**
+ * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
+ *
+ * External buffer is a user-provided anonymous buffer.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
 
 /**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
+ *
+ * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
+ * can be defined as a direct mbuf.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) \
+	(!((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -839,6 +881,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
+		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Initialize shared data at the end of an external buffer before attaching
+ * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
+ * initialization but a helper function to simply spare a few bytes at the
+ * end of the buffer for shared data. If shared data is allocated
+ * separately, this should not be called but application has to properly
+ * initialize the shared data according to its need.
+ *
+ * Free callback and its argument is saved and the refcnt is set to 1.
+ *
+ * @warning
+ * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this
+ * initialization. For example,
+ *
+ *   struct rte_mbuf_ext_shared_info *shinfo =
+ *          rte_pktmbuf_ext_shinfo_init_helper(buf_addr, buf_len,
+ *                                              free_cb, fcb_arg);
+ *   buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+ *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
+ *
+ * @param buf_addr
+ *   The pointer to the external buffer.
+ * @param buf_len
+ *   The size of the external buffer. buf_len must be larger than the size
+ *   of ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If
+ *   not enough, this function will return NULL.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be
+ *   freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ *
+ * @return
+ *   A pointer to the initialized shared data on success, return NULL
+ *   otherwise.
+ */
+static inline struct rte_mbuf_ext_shared_info *
+rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t buf_len,
+	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
+{
+	struct rte_mbuf_ext_shared_info *shinfo;
+	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
+
+	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
+				sizeof(*shinfo)), sizeof(uintptr_t));
+	if ((void *)shinfo <= buf_addr)
+		return NULL;
+
+	rte_mbuf_ext_refcnt_set(shinfo, 1);
+
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+
+	/* buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) */
+	return shinfo;
+}
+
+/**
+ * Attach an external buffer to a mbuf.
+ *
+ * User-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, corresponding free callback function and its argument should be
+ * provided via shinfo. This callback function will be called once all the
+ * mbufs are detached from the buffer (refcnt becomes zero).
+ *
+ * The headroom for the attaching mbuf will be set to zero and this can be
+ * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
+ * or ``rte_pktmbuf_reset_headroom()`` might be used.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * Memory for shared data must be provided and user must initialize all of
+ * the content properly, especially the free callback and refcnt. The pointer
+ * of shared data will be stored in m->shinfo.
+ * ``rte_pktmbuf_ext_shinfo_init_helper`` can help to simply spare a few
+ * bytes at the end of buffer for the shared data, store free callback and
+ * its argument and set the refcnt to 1.
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing buffer addresses and length of a mbuf, but with a few differences:
+ * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
+ *   2 as long as the direct mbuf itself isn't freed after the attachment.
+ *   In such cases, the buffer area of a direct mbuf must be read-only. But
+ *   external buffer has its own refcnt and it starts from 1. Unless
+ *   multiple mbufs are attached to a mbuf having an external buffer, the
+ *   external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback and its IO address.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect nor
+ * attached to an external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer.
+ * @param buf_iova
+ *   IO address of the external buffer.
+ * @param buf_len
+ *   The size of the external buffer.
+ * @param shinfo
+ *   User-provided memory for shared data of the external buffer.
+ */
+static inline void __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	rte_iova_t buf_iova, uint16_t buf_len,
+	struct rte_mbuf_ext_shared_info *shinfo)
+{
+	/* mbuf should not be read-only */
+	RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
+	RTE_ASSERT(shinfo->free_cb != NULL);
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = buf_iova;
+	m->buf_len = buf_len;
+
+	m->data_len = 0;
+	m->data_off = 0;
+
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+	m->shinfo = shinfo;
+}
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
 + * If the mbuf we are attaching to isn't a direct mbuf but has an
 + * external buffer attached, the mbuf being attached will be attached to
 + * the external buffer instead of using mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer the mbuf we attached as 'indirect', while mbuf we attached to as
+ * 'direct'.  The direct mbuf's reference counter is incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1231,19 +1471,20 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
+		mi->ol_flags = m->ol_flags;
+		mi->shinfo = m->shinfo;
+	} else {
+		/* if m is not direct, get the mbuf that embeds the data */
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1259,7 +1500,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1268,12 +1508,52 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
  *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+	RTE_ASSERT(m->shinfo != NULL);
+
+	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
+		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	md = rte_mbuf_from_indirect(m);
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
+ *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1282,10 +1562,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1297,13 +1581,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
 
 /**
@@ -1327,7 +1604,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1339,7 +1616,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread
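The patch above places ``rte_mbuf_ext_shared_info`` at the tail of the user's buffer via ``rte_pktmbuf_ext_shinfo_init_helper()``. The pointer math can be sketched standalone in plain C, with no DPDK headers; ``ext_shared_info``, ``ptr_align_floor`` and ``shinfo_init`` are illustrative stand-ins for ``rte_mbuf_ext_shared_info``, ``RTE_PTR_ALIGN_FLOOR()`` and the helper, not the real implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for struct rte_mbuf_ext_shared_info. */
struct ext_shared_info {
	void (*free_cb)(void *addr, void *opaque); /* free callback */
	void *fcb_opaque;                          /* callback argument */
	uint16_t refcnt;                           /* shared refcnt */
};

/* Align a pointer down to 'align' bytes (a power of two), like
 * RTE_PTR_ALIGN_FLOOR(). */
static void *
ptr_align_floor(void *p, uintptr_t align)
{
	return (void *)((uintptr_t)p & ~(align - 1));
}

/* Place shared info at the aligned tail of the buffer, as the helper
 * in the patch does. Returns NULL if the buffer is too small. The
 * caller must afterwards shrink buf_len to (shinfo - buf_addr) so the
 * data area and the shared area do not overlap. */
static struct ext_shared_info *
shinfo_init(void *buf_addr, uint16_t buf_len,
	    void (*free_cb)(void *, void *), void *fcb_opaque)
{
	char *buf_end = (char *)buf_addr + buf_len;
	struct ext_shared_info *shinfo;

	shinfo = ptr_align_floor(buf_end - sizeof(*shinfo),
				 sizeof(uintptr_t));
	if ((void *)shinfo <= buf_addr)
		return NULL;

	shinfo->free_cb = free_cb;
	shinfo->fcb_opaque = fcb_opaque;
	shinfo->refcnt = 1;
	return shinfo;
}
```

With a 2048-byte buffer this leaves the shared info aligned at the tail, and the usable data length becomes the distance from buf_addr to shinfo, exactly as the doc comment in the patch warns.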

* [PATCH v6 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-26  1:10 ` [PATCH v6 " Yongseok Koh
@ 2018-04-26  1:10   ` Yongseok Koh
  2018-04-26 11:39   ` [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
  2018-04-26 16:05   ` Andrew Rybchenko
  2 siblings, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-26  1:10 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch is to accommodate an experimental feature of mbuf - external
buffer attachment. If mbuf is attached to an external buffer, its ol_flags
will have EXT_ATTACHED_MBUF set. Without enabling/using the feature,
everything remains the same.

If a PMD delivers Rx packets with non-direct mbufs, ol_flags should not be
overwritten. For the mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets can
be carried by externally attached mbufs.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 3 +++
 app/test-pmd/macfwd.c   | 3 +++
 app/test-pmd/macswap.c  | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5f5ab64aa..bb0b675a8 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -770,6 +770,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (!RTE_MBUF_DIRECT(m))
+			tx_ol_flags |= m->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..ba0021194 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..b8d15f6ba 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,9 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread
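The three hunks above apply the same one-line rule in each forwarding engine. As a standalone sketch (flag values copied from the mbuf patch in this series; ``merge_tx_flags`` is an illustrative name, not a testpmd function):

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined by the mbuf patch of this series. */
#define EXT_ATTACHED_MBUF (1ULL << 61) /* attached to external buffer */
#define IND_ATTACHED_MBUF (1ULL << 62) /* indirect attached mbuf */

/* A mbuf is direct iff it is neither indirect nor has an extbuf. */
#define MBUF_DIRECT(ol_flags) \
	(!((ol_flags) & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)))

/* What each forwarding engine now does before overwriting ol_flags:
 * carry over the attachment bits of a non-direct mbuf so it is still
 * detached/freed correctly on transmit. */
static uint64_t
merge_tx_flags(uint64_t rx_ol_flags, uint64_t tx_ol_flags)
{
	if (!MBUF_DIRECT(rx_ol_flags))
		tx_ol_flags |= rx_ol_flags &
			(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
	return tx_ol_flags;
}
```

Without this merge, blindly assigning the engine's own flags would clear EXT_ATTACHED_MBUF/IND_ATTACHED_MBUF and the mbuf would later be freed as if it were direct.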

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26  1:10 ` [PATCH v6 " Yongseok Koh
  2018-04-26  1:10   ` [PATCH v6 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-26 11:39   ` Ananyev, Konstantin
  2018-04-26 16:05   ` Andrew Rybchenko
  2 siblings, 0 replies; 86+ messages in thread
From: Ananyev, Konstantin @ 2018-04-26 11:39 UTC (permalink / raw)
  To: Yongseok Koh, Lu, Wenzhuo, Wu, Jingjing, olivier.matz
  Cc: dev, arybchenko, stephen, thomas, adrien.mazarguil, nelio.laranjeiro



> -----Original Message-----
> From: Yongseok Koh [mailto:yskoh@mellanox.com]
> Sent: Thursday, April 26, 2018 2:10 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; olivier.matz@6wind.com
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>; arybchenko@solarflare.com; stephen@networkplumber.org;
> thomas@monjalon.net; adrien.mazarguil@6wind.com; nelio.laranjeiro@6wind.com; Yongseok Koh <yskoh@mellanox.com>
> Subject: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
> 
> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but with a few differences:
>   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>     2 as long as the direct mbuf itself isn't freed after the attachment.
>     In such cases, the buffer area of a direct mbuf must be read-only. But
>     external buffer has its own refcnt and it starts from 1. Unless
>     multiple mbufs are attached to a mbuf having an external buffer, the
>     external buffer is writable.
>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>


> 
> ** This patch can pass the mbuf_autotest. **
> 
> Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> will be submitted separately rebased on a different patchset which
> accommodates new memory hotplug design to mlx PMDs.
> 
> v6:
> * rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
>   rte_pktmbuf_ext_shinfo_init_helper() is added.
> * bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
> * minor changes from review.
> 
> v5:
> * rte_pktmbuf_attach_extbuf() sets headroom to 0.
> * if shinfo is provided when attaching, user should initialize it.
> * minor changes from review.
> 
> v4:
> * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
>   user can pass memory for shared data via shinfo argument.
> * minor changes from review.
> 
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>   mbuf indirection.
> 
>  lib/librte_mbuf/rte_mbuf.h | 335 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 306 insertions(+), 29 deletions(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 43aaa9c5f..0a6885281 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -344,7 +344,10 @@ extern "C" {
>  		PKT_TX_MACSEC |		 \
>  		PKT_TX_SEC_OFFLOAD)
> 
> -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> +/**
> + * Mbuf having an external buffer attached. shinfo in mbuf must be filled.
> + */
> +#define EXT_ATTACHED_MBUF    (1ULL << 61)
> 
>  #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
> 
> @@ -584,8 +587,27 @@ struct rte_mbuf {
>  	/** Sequence number. See also rte_reorder_insert(). */
>  	uint32_t seqn;
> 
> +	/** Shared data for external buffer attached to mbuf. See
> +	 * rte_pktmbuf_attach_extbuf().
> +	 */
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +
>  } __rte_cache_aligned;
> 
> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
> +};
> +
>  /**< Maximum number of nb_segs allowed. */
>  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
> 
> @@ -706,14 +728,34 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>  }
> 
>  /**
> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> + * otherwise.
> + *
> + * If a mbuf has its data in another mbuf and references it by mbuf
> + * indirection, this mbuf can be defined as a cloned mbuf.
> + */
> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +
> +/**
>   * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>   */
> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> +
> +/**
> + * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
> + *
> + * External buffer is a user-provided anonymous buffer.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> 
>  /**
>   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> + *
> + * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
> + * can be defined as a direct mbuf.
>   */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) \
> +	(!((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)))
> 
>  /**
>   * Private data in case of pktmbuf pool.
> @@ -839,6 +881,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
> 
>  #endif /* RTE_MBUF_REFCNT_ATOMIC */
> 
> +/**
> + * Reads the refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @return
> + *   Reference count number.
> + */
> +static inline uint16_t
> +rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
> +{
> +	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
> +}
> +
> +/**
> + * Set refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @param new_value
> + *   Value set
> + */
> +static inline void
> +rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
> +	uint16_t new_value)
> +{
> +	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
> +}
> +
> +/**
> + * Add given value to refcnt of an external buffer and return its new
> + * value.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @param value
> + *   Value to add/subtract
> + * @return
> + *   Updated value
> + */
> +static inline uint16_t
> +rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
> +	int16_t value)
> +{
> +	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
> +		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
> +		return 1 + value;
> +	}
> +
> +	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
> +}
> +
>  /** Mbuf prefetch */
>  #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
>  	if ((m) != NULL)                        \
> @@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>  }
> 
>  /**
> + * Initialize shared data at the end of an external buffer before attaching
> + * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
> + * initialization but a helper function to simply spare a few bytes at the
> + * end of the buffer for shared data. If shared data is allocated
> + * separately, this should not be called but application has to properly
> + * initialize the shared data according to its need.
> + *
> + * Free callback and its argument are saved and the refcnt is set to 1.
> + *
> + * @warning
> + * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this
> + * initialization. For example,
> + *
> + *   struct rte_mbuf_ext_shared_info *shinfo =
> + *          rte_pktmbuf_ext_shinfo_init_helper(buf_addr, buf_len,
> + *                                              free_cb, fcb_arg);
> + *   buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> + *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
> + *
> + * @param buf_addr
> + *   The pointer to the external buffer.
> + * @param buf_len
> + *   The size of the external buffer. buf_len must be larger than the size
> + *   of ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If
> + *   not enough, this function will return NULL.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be
> + *   freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + *
> + * @return
> + *   A pointer to the initialized shared data on success, return NULL
> + *   otherwise.
> + */
> +static inline struct rte_mbuf_ext_shared_info *
> +rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t buf_len,
> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> +{
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> +
> +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> +				sizeof(*shinfo)), sizeof(uintptr_t));
> +	if ((void *)shinfo <= buf_addr)
> +		return NULL;
> +
> +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> +
> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;
> +
> +	/* buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) */
> +	return shinfo;
> +}
> +
> +/**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided via shinfo. This callback function will be called once all the
> + * mbufs are detached from the buffer (refcnt becomes zero).
> + *
> + * The headroom for the attaching mbuf will be set to zero and this can be
> + * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
> + * or ``rte_pktmbuf_reset_headroom()`` might be used.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * Memory for shared data must be provided and user must initialize all of
> + * the content properly, especially the free callback and refcnt. The
> + * pointer to the shared data will be stored in m->shinfo.
> + * ``rte_pktmbuf_ext_shinfo_init_helper`` can help to simply spare a few
> + * bytes at the end of buffer for the shared data, store free callback and
> + * its argument and set the refcnt to 1.
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but with a few differences:
> + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> + *   external buffer has its own refcnt and it starts from 1. Unless
> + *   multiple mbufs are attached to a mbuf having an external buffer, the
> + *   external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback and its IO address.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer.
> + * @param buf_iova
> + *   IO address of the external buffer.
> + * @param buf_len
> + *   The size of the external buffer.
> + * @param shinfo
> + *   User-provided memory for shared data of the external buffer.
> + */
> +static inline void __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	rte_iova_t buf_iova, uint16_t buf_len,
> +	struct rte_mbuf_ext_shared_info *shinfo)
> +{
> +	/* mbuf should not be read-only */
> +	RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> +	RTE_ASSERT(shinfo->free_cb != NULL);
> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = buf_iova;
> +	m->buf_len = buf_len;
> +
> +	m->data_len = 0;
> +	m->data_off = 0;
> +
> +	m->ol_flags |= EXT_ATTACHED_MBUF;
> +	m->shinfo = shinfo;
> +}
> +
> +/**
> + * Detach the external buffer attached to a mbuf, same as
> + * ``rte_pktmbuf_detach()``
> + *
> + * @param m
> + *   The mbuf having external buffer.
> + */
> +#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
> +
> +/**
>   * Attach packet mbuf to another packet mbuf.
>   *
> - * After attachment we refer the mbuf we attached as 'indirect',
> - * while mbuf we attached to as 'direct'.
> - * The direct mbuf's reference counter is incremented.
> + * If the mbuf we are attaching to isn't a direct mbuf but has an
> + * external buffer attached, the mbuf being attached will be attached to
> + * the external buffer instead of using mbuf indirection.
> + *
> + * Otherwise, the mbuf will be indirectly attached. After attachment we
> + * refer the mbuf we attached as 'indirect', while mbuf we attached to as
> + * 'direct'.  The direct mbuf's reference counter is incremented.
>   *
>   * Right now, not supported:
>   *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
> @@ -1231,19 +1471,20 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   */
>  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  {
> -	struct rte_mbuf *md;
> -
>  	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
>  	    rte_mbuf_refcnt_read(mi) == 1);
> 
> -	/* if m is not direct, get the mbuf that embeds the data */
> -	if (RTE_MBUF_DIRECT(m))
> -		md = m;
> -	else
> -		md = rte_mbuf_from_indirect(m);
> +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> +		mi->ol_flags = m->ol_flags;
> +		mi->shinfo = m->shinfo;
> +	} else {
> +		/* if m is not direct, get the mbuf that embeds the data */
> +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
> +		mi->priv_size = m->priv_size;
> +		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
> +	}
> 
> -	rte_mbuf_refcnt_update(md, 1);
> -	mi->priv_size = m->priv_size;
>  	mi->buf_iova = m->buf_iova;
>  	mi->buf_addr = m->buf_addr;
>  	mi->buf_len = m->buf_len;
> @@ -1259,7 +1500,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  	mi->next = NULL;
>  	mi->pkt_len = mi->data_len;
>  	mi->nb_segs = 1;
> -	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
>  	mi->packet_type = m->packet_type;
>  	mi->timestamp = m->timestamp;
> 
> @@ -1268,12 +1508,52 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  }
> 
>  /**
> - * Detach an indirect packet mbuf.
> + * @internal used by rte_pktmbuf_detach().
>   *
> + * Decrement the reference counter of the external buffer. When the
> + * reference counter becomes 0, the buffer is freed by pre-registered
> + * callback.
> + */
> +static inline void
> +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> +{
> +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> +	RTE_ASSERT(m->shinfo != NULL);
> +
> +	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
> +		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
> +}
> +
> +/**
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the direct mbuf's reference counter. When the reference
> + * counter becomes 0, the direct mbuf is freed.
> + */
> +static inline void
> +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf *md;
> +
> +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> +
> +	md = rte_mbuf_from_indirect(m);
> +
> +	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> +		md->next = NULL;
> +		md->nb_segs = 1;
> +		rte_mbuf_refcnt_set(md, 1);
> +		rte_mbuf_raw_free(md);
> +	}
> +}
> +
> +/**
> + * Detach a packet mbuf from external buffer or direct buffer.
> + *
> + *  - decrement refcnt and free the external/direct buffer if refcnt
> + *    becomes zero.
>   *  - restore original mbuf address and length values.
>   *  - reset pktmbuf data and data_len to their default values.
> - *  - decrement the direct mbuf's reference counter. When the
> - *  reference counter becomes 0, the direct mbuf is freed.
>   *
>   * All other fields of the given packet mbuf will be left intact.
>   *
> @@ -1282,10 +1562,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   */
>  static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  {
> -	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
>  	struct rte_mempool *mp = m->pool;
>  	uint32_t mbuf_size, buf_len, priv_size;
> 
> +	if (RTE_MBUF_HAS_EXTBUF(m))
> +		__rte_pktmbuf_free_extbuf(m);
> +	else
> +		__rte_pktmbuf_free_direct(m);
> +
>  	priv_size = rte_pktmbuf_priv_size(mp);
>  	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
>  	buf_len = rte_pktmbuf_data_room_size(mp);
> @@ -1297,13 +1581,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  	rte_pktmbuf_reset_headroom(m);
>  	m->data_len = 0;
>  	m->ol_flags = 0;
> -
> -	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> -		md->next = NULL;
> -		md->nb_segs = 1;
> -		rte_mbuf_refcnt_set(md, 1);
> -		rte_mbuf_raw_free(md);
> -	}
>  }
> 
>  /**
> @@ -1327,7 +1604,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> 
>  	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> 
> -		if (RTE_MBUF_INDIRECT(m))
> +		if (!RTE_MBUF_DIRECT(m))
>  			rte_pktmbuf_detach(m);
> 
>  		if (m->next != NULL) {
> @@ -1339,7 +1616,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> 
>  	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> 
> -		if (RTE_MBUF_INDIRECT(m))
> +		if (!RTE_MBUF_DIRECT(m))
>  			rte_pktmbuf_detach(m);
> 
>  		if (m->next != NULL) {
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 86+ messages in thread
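One detail of the acked patch worth highlighting is ``rte_mbuf_ext_refcnt_update()``: when the counter reads 1 the buffer has a single owner, so the update can skip the atomic read-modify-write. A standalone model of that fast path (C11 atomics standing in for ``rte_atomic16_t``; ``ext_refcnt_update`` is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>
#include <stdatomic.h>

/* Model of rte_mbuf_ext_refcnt_update(): if refcnt == 1 the external
 * buffer is not shared, so no other thread can race on the counter and
 * a plain store suffices; otherwise fall back to an atomic add.
 * Returns the updated value; 0 means "free the buffer". */
static uint16_t
ext_refcnt_update(_Atomic uint16_t *refcnt, int16_t value)
{
	if (atomic_load_explicit(refcnt, memory_order_relaxed) == 1) {
		atomic_store_explicit(refcnt, (uint16_t)(1 + value),
				      memory_order_relaxed);
		return (uint16_t)(1 + value);
	}
	return (uint16_t)(atomic_fetch_add(refcnt, (uint16_t)value) + value);
}
```

The fast path covers the common single-owner case on both attach (1 -> 2) and the final detach (1 -> 0), leaving the atomic instruction only for genuinely shared buffers.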

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26  1:10 ` [PATCH v6 " Yongseok Koh
  2018-04-26  1:10   ` [PATCH v6 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
  2018-04-26 11:39   ` [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
@ 2018-04-26 16:05   ` Andrew Rybchenko
  2018-04-26 16:10     ` Thomas Monjalon
  2018-04-26 17:18     ` Yongseok Koh
  2 siblings, 2 replies; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-26 16:05 UTC (permalink / raw)
  To: Yongseok Koh, wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, stephen, thomas, adrien.mazarguil,
	nelio.laranjeiro

On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
>
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but with a few differences:
>    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>      2 as long as the direct mbuf itself isn't freed after the attachment.
>      In such cases, the buffer area of a direct mbuf must be read-only. But
>      external buffer has its own refcnt and it starts from 1. Unless
>      multiple mbufs are attached to a mbuf having an external buffer, the
>      external buffer is writable.
>    - There's no need to allocate buffer from a mempool. Any buffer can be
>      attached with appropriate free callback.
>    - Smaller metadata is required to maintain shared data such as refcnt.

I'm still unsure how reference counters for external buffer work.

Let's consider the following example:

                |<--mbuf1 buf_len-->|<--mbuf2 buf_len-->|
+--+-----------+----+--------------+----+--------------+---+- - -
|  global      |head|mbuf1 data    |head|mbuf2 data    |   |
|  shinfo=2    |room|              |room|              |   |
+--+-----------+----+--------------+----+--------------+---+- - -
                ^                   ^
+----------+    |      +----------+ |
| mbuf1    +----+      | mbuf2    +-+
| refcnt=1 |           | refcnt=1 |
+----------+           +----------+

I.e. we have a big buffer which is sliced into many small data
buffers referenced from mbufs.

shinfo reference counter is used to control when big buffer
may be freed. But what controls sharing of each block?

The headroom and following mbuf data (buf_len) are owned by the
corresponding mbuf, and the mbuf owner can do everything
with the space (prepend data, append data, modify it, etc.).
I.e. it is read-write in the above terminology.

What should happen if mbuf1 is cloned? Right now it will result
in a new mbuf1a with reference counter 1 and incremented shinfo
reference counter. And what indicates that the corresponding area
is now read-only? It looks like nothing does.

As I understand it, this should be solved using a per-data-area shinfo
whose free callback decrements the big buffer's reference counter.

So, we have two reference counters per each mbuf with external
buffer (plus reference counter per big buffer).
Two reference counters sound like too much, and it looks like the
mbuf-with-extbuf reference counter is not really used
(since clone/attach updates the shinfo refcnt).
It is still two counters to check on free.

Have you considered an alternative approach that uses the mbuf refcnt
as the sharing indicator for extbuf data? However, in this case
indirect referencing extbuf would logically look like:

+----------+    +--------+     +--------+
| indirect +--->| extbuf +---->|  data  |
|  mbuf    |    |  mbuf  |     |        |
+----------+    +--------+     +--------+

It looks like it would allow avoiding the two reference counters
per data block as above. Personally I'm not sure which approach
is better and would like to hear what you and other reviewers
think about it.
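(Editorially, the two-counter question can be made concrete with a minimal single-threaded sketch of the v6 semantics — illustrative C, not DPDK code: each attached mbuf keeps its own refcnt at 1 while sharing is tracked only in shinfo, and the free callback fires when the last mbuf detaches.)

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal single-threaded model of the v6 extbuf semantics. */
struct shinfo {
	uint16_t refcnt; /* how many mbufs reference the buffer */
	int *freed;      /* stands in for free_cb/fcb_opaque */
};

struct mbuf {
	uint16_t refcnt;   /* per-mbuf refcnt, stays 1 here */
	struct shinfo *si; /* NULL for a plain direct mbuf */
};

/* rte_pktmbuf_attach() on an extbuf mbuf: only shinfo->refcnt grows;
 * both mbufs keep refcnt == 1. */
static void
attach(struct mbuf *mi, const struct mbuf *m)
{
	m->si->refcnt++;
	mi->refcnt = 1;
	mi->si = m->si;
}

/* rte_pktmbuf_detach(): drop the shared count; the free callback runs
 * only when the last referencing mbuf goes away. */
static void
detach(struct mbuf *m)
{
	if (--m->si->refcnt == 0)
		*m->si->freed = 1; /* free_cb(buf_addr, fcb_opaque) */
	m->si = NULL;
}
```

Under this model the per-mbuf refcnt indeed never reflects sharing; sharing is visible only through shinfo, which is the second counter questioned above.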

Some minor notes below as well.

> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>
> ** This patch can pass the mbuf_autotest. **
>
> Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> will be submitted separately rebased on a different patchset which
> accommodates new memory hotplug design to mlx PMDs.
>
> v6:
> * rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
>    rte_pktmbuf_ext_shinfo_init_helper() is added.
> * bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
> * minor changes from review.
>
> v5:
> * rte_pktmbuf_attach_extbuf() sets headroom to 0.
> * if shinfo is provided when attaching, user should initialize it.
> * minor changes from review.
>
> v4:
> * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
>    user can pass memory for shared data via shinfo argument.
> * minor changes from review.
>
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>    mbuf indirection.
>
>   lib/librte_mbuf/rte_mbuf.h | 335 +++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 306 insertions(+), 29 deletions(-)
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 43aaa9c5f..0a6885281 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h

[...]

> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
> +};
> +
>   /**< Maximum number of nb_segs allowed. */
>   #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
>   
> @@ -706,14 +728,34 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>   }
>   
>   /**
> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> + * otherwise.
> + *
> + * If a mbuf has its data in another mbuf and references it by mbuf
> + * indirection, this mbuf can be defined as a cloned mbuf.
> + */
> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +
> +/**
>    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>    */
> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)

We have discussed that it would be good to deprecate RTE_MBUF_INDIRECT()
since it is not !RTE_MBUF_DIRECT(). Is it lost here or intentional
(maybe I've lost it in the thread)?

>   /** Mbuf prefetch */
>   #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
>   	if ((m) != NULL)                        \
> @@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   }
>   
>   /**
> + * Initialize shared data at the end of an external buffer before attaching
> + * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
> + * initialization but a helper function to simply spare a few bytes at the
> + * end of the buffer for shared data. If shared data is allocated
> + * separately, this should not be called but application has to properly
> + * initialize the shared data according to its need.
> + *
> + * Free callback and its argument is saved and the refcnt is set to 1.
> + *
> + * @warning
> + * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this
> + * initialization. For example,

May be buf_len should be inout and it should be done by the function?
Just a question since current approach looks fragile.

> + *
> + *   struct rte_mbuf_ext_shared_info *shinfo =
> + *          rte_pktmbuf_ext_shinfo_init_helper(buf_addr, buf_len,
> + *                                              free_cb, fcb_arg);
> + *   buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> + *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
> + *
> + * @param buf_addr
> + *   The pointer to the external buffer.
> + * @param buf_len
> + *   The size of the external buffer. buf_len must be larger than the size
> + *   of ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If
> + *   not enough, this function will return NULL.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be
> + *   freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + *
> + * @return
> + *   A pointer to the initialized shared data on success, return NULL
> + *   otherwise.
> + */
> +static inline struct rte_mbuf_ext_shared_info *
> +rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t buf_len,
> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> +{
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> +
> +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> +				sizeof(*shinfo)), sizeof(uintptr_t));
> +	if ((void *)shinfo <= buf_addr)
> +		return NULL;
> +
> +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> +
> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;

Just a nit, but I'd suggest to initialize in the same order as in the 
struct.
(if there is no reasons why reference counter should be initialized first)

> +
> +	/* buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) */
> +	return shinfo;
> +}

[...]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 16:05   ` Andrew Rybchenko
@ 2018-04-26 16:10     ` Thomas Monjalon
  2018-04-26 19:42       ` Olivier Matz
  2018-04-26 17:18     ` Yongseok Koh
  1 sibling, 1 reply; 86+ messages in thread
From: Thomas Monjalon @ 2018-04-26 16:10 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Yongseok Koh, wenzhuo.lu, jingjing.wu, olivier.matz, dev,
	konstantin.ananyev, stephen, adrien.mazarguil, nelio.laranjeiro

26/04/2018 18:05, Andrew Rybchenko:
> On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> 
> We have discussed that it would be good to deprecate RTE_MBUF_INDIRECT()
> since it is not !RTE_MBUF_DIRECT(). Is it lost here or intentional
> (maybe I've lost it in the thread)?

I think it should be a separate deprecation notice.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 16:05   ` Andrew Rybchenko
  2018-04-26 16:10     ` Thomas Monjalon
@ 2018-04-26 17:18     ` Yongseok Koh
  2018-04-26 19:45       ` Olivier Matz
  1 sibling, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-26 17:18 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: wenzhuo.lu, jingjing.wu, olivier.matz, dev, konstantin.ananyev,
	stephen, thomas, adrien.mazarguil, nelio.laranjeiro

On Thu, Apr 26, 2018 at 07:05:01PM +0300, Andrew Rybchenko wrote:
> On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but with a few differences:
> >    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> >      2 as long as the direct mbuf itself isn't freed after the attachment.
> >      In such cases, the buffer area of a direct mbuf must be read-only. But
> >      external buffer has its own refcnt and it starts from 1. Unless
> >      multiple mbufs are attached to a mbuf having an external buffer, the
> >      external buffer is writable.
> >    - There's no need to allocate buffer from a mempool. Any buffer can be
> >      attached with appropriate free callback.
> >    - Smaller metadata is required to maintain shared data such as refcnt.
> 
> I'm still unsure how reference counters for external buffer work.
> 
> Let's consider the following example:
> 
>                 |<--mbuf1 buf_len-->|<--mbuf2 buf_len-->|
> +--+-----------+----+--------------+----+--------------+---+- - -
> |  global      |head|mbuf1 data    |head|mbuf2 data    |   |
> |  shinfo=2    |room|              |room|              |   |
> +--+-----------+----+--------------+----+--------------+---+- - -
>                ^                   ^
> +----------+   |      +----------+ |
> | mbuf1    +---+      | mbuf2    +-+
> | refcnt=1 |          | refcnt=1 |
> +----------+          +----------+
> 
> I.e. we have big buffer which is sliced into many small data
> buffers referenced from mbufs.
> 
> shinfo reference counter is used to control when big buffer
> may be freed. But what controls sharing of each block?
> 
> headroom and following mbuf data (buf_len) is owned by
> corresponding mbuf and the mbuf owner can do everything
> with the space (prepend data, append data, modify etc).
> I.e. it is read-write in above terminology.
> 
> What should happen if mbuf1 is cloned? Right now it will result
> in a new mbuf1a with reference counter 1 and incremented shinfo
> reference counter. And what says that the corresponding area
> is read-only now? It looks like nothing does.
> 
> As I understand, it should be solved using a per-data-area shinfo
> whose free callback decrements the big buffer reference counter.

I have to admit that I was confused at the moment and I mixed two different
use-cases.

1) Transmitting a large storage block.

                |<--mbuf1 buf_len-->|<--mbuf2 buf_len-->|
 +--+-----------+----+--------------+----+--------------+---+- - -
 |  global      |head|mbuf1 data    |head|mbuf2 data    |   |
 |  shinfo=2    |room|              |room|              |   |
 +--+-----------+----+--------------+----+--------------+---+- - -
                ^                   ^
 +----------+   |      +----------+ |
 | mbuf1    +---+      | mbuf2    +-+
 | refcnt=1 |          | refcnt=1 |
 +----------+          +----------+
       ^                     ^
       |next                 |next
 +-----+----+          +----------+ 
 | mbuf1_hdr|          | mbuf2_hdr|
 | refcnt=1 |          | refcnt=1 |
 +----------+          +----------+

Yes, in this case, the large external buffer should always be read-only. And
necessary network headers should be linked via m->next. Owners of m1 or m2
can't alter any bit in the external buffer because shinfo->refcnt > 1.

2) Slicing a large buffer and provide r-w buffers.

                |<--mbuf1 buf_len-->|      |<--mbuf2 buf_len-->|
 +--+-----------+----+--------------+------+----+--------------+------+----+- - -
 |  user data   |head|mbuf1 data    |shinfo|head|mbuf2 data    |shinfo|    |
 |  refc=2      |room|              |refc=1|room|              |refc=1|    |
 +--+-----------+----+--------------+------+----+--------------+------+----+- - -
                ^                          ^
 +----------+   |             +----------+ |
 | mbuf1    +---+             | mbuf2    +-+
 | refcnt=1 |                 | refcnt=1 |
 +----------+                 +----------+
 
Here, the user data for the large chunk isn't rte_mbuf_ext_shared_info but a
custom structure managed by user in order to free the whole chunk. free_cb would
decrement a custom refcnt in custom way. But librte_mbuf doesn't need to be
aware of it.  It is user's responsibility. The library is just responsible for
calling free_cb when shinfo->refcnt gets to zero.

> So, we have two reference counters per each mbuf with external
> buffer (plus reference counter per big buffer).
> Two reference counters sound like too much, and it looks like the
> mbuf-with-extbuf reference counter is not really used
> (since clone/attach updates the shinfo refcnt).
> It is still two counters to check on free.

Each refcnt indicates whether the object is read-only or read-write. Even for a
direct mbuf, if two users are accessing it, refcnt is 2 and it is read-only.
This means both the mbuf metadata and its data area are read-only; users can
alter neither the various length fields nor the packet data, for example. For
non-direct mbufs, the mbuf's own refcnt is still used, but it only represents
that the metadata is shared and read-only if it is more than 1. So, the refcnt
of a mbuf-with-extbuf is still used. Incrementing refcnt means an entity has
acquired access to the object, including cases of attachment (indirect/extbuf).

> Have you considered an alternative approach that uses the mbuf refcnt
> as the sharing indicator for extbuf data? However, in this case
> indirectly referencing an extbuf would logically look like:
> 
> +----------+    +--------+     +--------+
> | indirect +--->| extbuf +---->|  data  |
> |  mbuf    |    |  mbuf  |     |        |
> +----------+    +--------+     +--------+
> 
> It looks like it would allow avoiding two reference counters
> per data block as above. Personally I'm not sure which approach
> is better and would like to hear what you and other reviewers
> think about it.

So, I still think this patch is okay.

> Some minor notes below as well.
> 
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > ---
> > 
> > ** This patch can pass the mbuf_autotest. **
> > 
> > Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> > will be submitted separately rebased on a different patchset which
> > accommodates new memory hotplug design to mlx PMDs.
> > 
> > v6:
> > * rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
> >    rte_pktmbuf_ext_shinfo_init_helper() is added.
> > * bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
> > * minor changes from review.
> > 
> > v5:
> > * rte_pktmbuf_attach_extbuf() sets headroom to 0.
> > * if shinfo is provided when attaching, user should initialize it.
> > * minor changes from review.
> > 
> > v4:
> > * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
> >    user can pass memory for shared data via shinfo argument.
> > * minor changes from review.
> > 
> > v3:
> > * implement external buffer attachment instead of introducing buf_off for
> >    mbuf indirection.
> > 
> >   lib/librte_mbuf/rte_mbuf.h | 335 +++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 306 insertions(+), 29 deletions(-)
> > 
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 43aaa9c5f..0a6885281 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
[...]
> >   /** Mbuf prefetch */
> >   #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> >   	if ((m) != NULL)                        \
> > @@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> >   }
> >   /**
> > + * Initialize shared data at the end of an external buffer before attaching
> > + * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
> > + * initialization but a helper function to simply spare a few bytes at the
> > + * end of the buffer for shared data. If shared data is allocated
> > + * separately, this should not be called but application has to properly
> > + * initialize the shared data according to its need.
> > + *
> > + * Free callback and its argument is saved and the refcnt is set to 1.
> > + *
> > + * @warning
> > + * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this
> > + * initialization. For example,
> 
> May be buf_len should be inout and it should be done by the function?
> Just a question since current approach looks fragile.

Yeah, I thought about that but I didn't want to alter the user's variable; I
thought it could be error-prone. Anyway, either way is okay with me. Will wait
for a day to get input because I will send out a new version (hopefully the
last :-) to fix the nit you mentioned below.

> > + *
> > + *   struct rte_mbuf_ext_shared_info *shinfo =
> > + *          rte_pktmbuf_ext_shinfo_init_helper(buf_addr, buf_len,
> > + *                                              free_cb, fcb_arg);
> > + *   buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > + *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
> > + *
> > + * @param buf_addr
> > + *   The pointer to the external buffer.
> > + * @param buf_len
> > + *   The size of the external buffer. buf_len must be larger than the size
> > + *   of ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If
> > + *   not enough, this function will return NULL.
> > + * @param free_cb
> > + *   Free callback function to call when the external buffer needs to be
> > + *   freed.
> > + * @param fcb_opaque
> > + *   Argument for the free callback function.
> > + *
> > + * @return
> > + *   A pointer to the initialized shared data on success, return NULL
> > + *   otherwise.
> > + */
> > +static inline struct rte_mbuf_ext_shared_info *
> > +rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t buf_len,
> > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > +{
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > +
> > +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > +				sizeof(*shinfo)), sizeof(uintptr_t));
> > +	if ((void *)shinfo <= buf_addr)
> > +		return NULL;
> > +
> > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +
> > +	shinfo->free_cb = free_cb;
> > +	shinfo->fcb_opaque = fcb_opaque;
> 
> Just a nit, but I'd suggest to initialize in the same order as in the
> struct.
> (if there is no reasons why reference counter should be initialized first)

Will do.

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 16:10     ` Thomas Monjalon
@ 2018-04-26 19:42       ` Olivier Matz
  2018-04-26 19:58         ` Thomas Monjalon
  0 siblings, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-26 19:42 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Andrew Rybchenko, Yongseok Koh, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, stephen, adrien.mazarguil, nelio.laranjeiro

On Thu, Apr 26, 2018 at 06:10:36PM +0200, Thomas Monjalon wrote:
> 26/04/2018 18:05, Andrew Rybchenko:
> > On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > 
> > We have discussed that it would be good to deprecate RTE_MBUF_INDIRECT()
> > since it is not !RTE_MBUF_DIRECT(). Is it lost here or intentional
> > (maybe I've lost it in the thread)?
> 
> I think it should be a separate deprecation notice.

Agree with Andrew that RTE_MBUF_INDIRECT should be deprecated
to avoid confusion with !DIRECT.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 17:18     ` Yongseok Koh
@ 2018-04-26 19:45       ` Olivier Matz
  0 siblings, 0 replies; 86+ messages in thread
From: Olivier Matz @ 2018-04-26 19:45 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Andrew Rybchenko, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, stephen, thomas, adrien.mazarguil,
	nelio.laranjeiro

On Thu, Apr 26, 2018 at 10:18:14AM -0700, Yongseok Koh wrote:
>

[...]

> > >   /** Mbuf prefetch */
> > >   #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
> > >   	if ((m) != NULL)                        \
> > > @@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > >   }
> > >   /**
> > > + * Initialize shared data at the end of an external buffer before attaching
> > > + * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
> > > + * initialization but a helper function to simply spare a few bytes at the
> > > + * end of the buffer for shared data. If shared data is allocated
> > > + * separately, this should not be called but application has to properly
> > > + * initialize the shared data according to its need.
> > > + *
> > > + * Free callback and its argument is saved and the refcnt is set to 1.
> > > + *
> > > + * @warning
> > > + * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this
> > > + * initialization. For example,
> > 
> > May be buf_len should be inout and it should be done by the function?
> > Just a question since current approach looks fragile.
> 
> Yeah, I thought about that but I didn't want to alter the user's variable; I
> thought it could be error-prone. Anyway, either way is okay with me. Will wait
> for a day to get input because I will send out a new version (hopefully the
> last :-) to fix the nit you mentioned below.

+1, I had exactly the same comment as Andrew in mind.
To me, it looks better to have buf_len as in/out.

I don't think it's a problem to have this change for rc2.

So,
Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thank you Yongseok for this nice improvement.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 19:42       ` Olivier Matz
@ 2018-04-26 19:58         ` Thomas Monjalon
  2018-04-26 20:07           ` Olivier Matz
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Monjalon @ 2018-04-26 19:58 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Andrew Rybchenko, Yongseok Koh, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, stephen, adrien.mazarguil, nelio.laranjeiro

26/04/2018 21:42, Olivier Matz:
> On Thu, Apr 26, 2018 at 06:10:36PM +0200, Thomas Monjalon wrote:
> > 26/04/2018 18:05, Andrew Rybchenko:
> > > On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > 
> > > We have discussed that it would be good to deprecate RTE_MBUF_INDIRECT()
> > > since it is not !RTE_MBUF_DIRECT(). Is it lost here or intentional
> > > (maybe I've lost it in the thread)?
> > 
> > I think it should be a separate deprecation notice.
> 
> Agree with Andrew that RTE_MBUF_INDIRECT should be deprecated
> to avoid confusion with !DIRECT.

What do you mean?
We should add a comment? Or poisoning the macro? Or something else?
Should it be removed? In which release?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 19:58         ` Thomas Monjalon
@ 2018-04-26 20:07           ` Olivier Matz
  2018-04-26 20:24             ` Thomas Monjalon
  0 siblings, 1 reply; 86+ messages in thread
From: Olivier Matz @ 2018-04-26 20:07 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Andrew Rybchenko, Yongseok Koh, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, stephen, adrien.mazarguil, nelio.laranjeiro

On Thu, Apr 26, 2018 at 09:58:00PM +0200, Thomas Monjalon wrote:
> 26/04/2018 21:42, Olivier Matz:
> > On Thu, Apr 26, 2018 at 06:10:36PM +0200, Thomas Monjalon wrote:
> > > 26/04/2018 18:05, Andrew Rybchenko:
> > > > On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > 
> > > > We have discussed that it would be good to deprecate RTE_MBUF_INDIRECT()
> > > > since it is not !RTE_MBUF_DIRECT(). Is it lost here or intentional
> > > > (maybe I've lost it in the thread)?
> > > 
> > > I think it should be a separate deprecation notice.
> > 
> > Agree with Andrew that RTE_MBUF_INDIRECT should be deprecated
> > to avoid confusion with !DIRECT.
> 
> What do you mean?
> We should add a comment? Or poisoning the macro? Or something else?
> Should it be removed? In which release?

Sorry if I was not clear.

Not necessarily remove the macro for this release. But I think we
should announce it and remove it, following the process.

I suggest:
- for 18.05: send the deprecation notice + add a comment in the .h
  saying that the macro will be deprecated in 18.08 (or 18.11, there
  is no hurry if there is the comment)
- for 18.08 (or 18.11): remove the macro (I don't think poisoning
  is useful in this case).

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-26 20:07           ` Olivier Matz
@ 2018-04-26 20:24             ` Thomas Monjalon
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Monjalon @ 2018-04-26 20:24 UTC (permalink / raw)
  To: Olivier Matz, Yongseok Koh
  Cc: Andrew Rybchenko, wenzhuo.lu, jingjing.wu, dev,
	konstantin.ananyev, stephen, adrien.mazarguil, nelio.laranjeiro

26/04/2018 22:07, Olivier Matz:
> On Thu, Apr 26, 2018 at 09:58:00PM +0200, Thomas Monjalon wrote:
> > 26/04/2018 21:42, Olivier Matz:
> > > On Thu, Apr 26, 2018 at 06:10:36PM +0200, Thomas Monjalon wrote:
> > > > 26/04/2018 18:05, Andrew Rybchenko:
> > > > > On 04/26/2018 04:10 AM, Yongseok Koh wrote:
> > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > 
> > > > > We have discussed that it would be good to deprecate RTE_MBUF_INDIRECT()
> > > > > since it is not !RTE_MBUF_DIRECT(). Is it lost here or intentional
> > > > > (maybe I've lost it in the thread)?
> > > > 
> > > > I think it should be a separate deprecation notice.
> > > 
> > > Agree with Andrew that RTE_MBUF_INDIRECT should be deprecated
> > > to avoid confusion with !DIRECT.
> > 
> > What do you mean?
> > We should add a comment? Or poisoning the macro? Or something else?
> > Should it be removed? In which release?
> 
> Sorry if I was not clear.
> 
> Not necessarily remove the macro for this release. But I think we
> should announce it and remove it, following the process.
> 
> I suggest:
> - for 18.05: send the deprecation notice + add a comment in the .h
>   saying that the macro will be deprecated in 18.08 (or 18.11, there
>   is no hurry if there is the comment)
> - for 18.08 (or 18.11): remove the macro (I don't think poisoning
>   is useful in this case).

OK it works for me.
I think we can wait 18.11 for a complete removal, except if the mbuf API
is broken in 18.08 and not 18.11. But we probably need to do a soft 18.08
release without any breakage at all.

So, Yongseok, please prepare a patch including a deprecation notice
with a fuzzy removal deadline, and a doxygen comment.
As it requires a special process (3 acks, etc), it is better to have it
as a separate patch.

Thanks

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v7 1/2] mbuf: support attaching external buffer to mbuf
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (10 preceding siblings ...)
  2018-04-26  1:10 ` [PATCH v6 " Yongseok Koh
@ 2018-04-27  0:01 ` Yongseok Koh
  2018-04-27  0:01   ` [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
  2018-04-27  7:22   ` [PATCH v7 1/2] mbuf: support attaching external buffer to mbuf Andrew Rybchenko
  2018-04-27 17:22 ` [PATCH v8 " Yongseok Koh
  12 siblings, 2 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-27  0:01 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but with a few differences:
  - When an indirect mbuf is attached, refcnt of the direct mbuf would be
    2 as long as the direct mbuf itself isn't freed after the attachment.
    In such cases, the buffer area of a direct mbuf must be read-only. But
    external buffer has its own refcnt and it starts from 1. Unless
    multiple mbufs are attached to a mbuf having an external buffer, the
    external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
---

Deprecation of RTE_MBUF_INDIRECT() will follow after integration of this
patch.

** This patch can pass the mbuf_autotest. **

Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
will be submitted separately rebased on a different patchset which
accommodates new memory hotplug design to mlx PMDs.

v7:
* make buf_len param [in,out] in rte_pktmbuf_ext_shinfo_init_helper().
* a minor change from review.

v6:
* rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
 rte_pktmbuf_ext_shinfo_init_helper() is added.
* bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
* minor changes from review.

v5:
* rte_pktmbuf_attach_extbuf() sets headroom to 0.
* if shinfo is provided when attaching, user should initialize it.
* minor changes from review.

v4:
* rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
 user can pass memory for shared data via shinfo argument.
* minor changes from review.

v3:
* implement external buffer attachment instead of introducing buf_off for
 mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 337 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 308 insertions(+), 29 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 0cd6a1c6b..4fd9a0d9e 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -345,7 +345,10 @@ extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+/**
+ * Mbuf having an external buffer attached. shinfo in mbuf must be filled.
+ */
+#define EXT_ATTACHED_MBUF    (1ULL << 61)
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -585,8 +588,27 @@ struct rte_mbuf {
 	/** Sequence number. See also rte_reorder_insert(). */
 	uint32_t seqn;
 
+	/** Shared data for external buffer attached to mbuf. See
+	 * rte_pktmbuf_attach_extbuf().
+	 */
+	struct rte_mbuf_ext_shared_info *shinfo;
+
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -707,14 +729,34 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
 }
 
 /**
+ * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
+ * otherwise.
+ *
+ * If a mbuf has its data in another mbuf and references it by mbuf
+ * indirection, this mbuf can be defined as a cloned mbuf.
+ */
+#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
  */
-#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
+#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
+
+/**
+ * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
+ *
+ * External buffer is a user-provided anonymous buffer.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
 
 /**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
+ *
+ * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
+ * can be defined as a direct mbuf.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) \
+	(!((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -840,6 +882,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
+		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1214,11 +1308,159 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Initialize shared data at the end of an external buffer before attaching
+ * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
+ * initialization but a helper function to simply spare a few bytes at the
+ * end of the buffer for shared data. If shared data is allocated
+ * separately, this should not be called but the application has to properly
+ * initialize the shared data according to its need.
+ *
+ * The free callback and its argument are saved and the refcnt is set to 1.
+ *
+ * @warning
+ * The value of buf_len will be reduced to RTE_PTR_DIFF(shinfo, buf_addr)
+ * after this initialization. This shall be used for
+ * ``rte_pktmbuf_attach_extbuf()``
+ *
+ * @param buf_addr
+ *   The pointer to the external buffer.
+ * @param [in,out] buf_len
+ *   The pointer to length of the external buffer. Input value must be
+ *   larger than the size of ``struct rte_mbuf_ext_shared_info`` and
+ *   padding for alignment. If not enough, this function will return NULL.
+ *   Adjusted buffer length will be returned through this pointer.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be
+ *   freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ *
+ * @return
+ *   A pointer to the initialized shared data on success, return NULL
+ *   otherwise.
+ */
+static inline struct rte_mbuf_ext_shared_info *
+rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t *buf_len,
+	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
+{
+	struct rte_mbuf_ext_shared_info *shinfo;
+	void *buf_end = RTE_PTR_ADD(buf_addr, *buf_len);
+
+	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
+				sizeof(*shinfo)), sizeof(uintptr_t));
+	if ((void *)shinfo <= buf_addr)
+		return NULL;
+
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+	rte_mbuf_ext_refcnt_set(shinfo, 1);
+
+	*buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+	return shinfo;
+}
+
+/**
+ * Attach an external buffer to a mbuf.
+ *
+ * A user-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, corresponding free callback function and its argument should be
+ * provided via shinfo. This callback function will be called once all the
+ * mbufs are detached from the buffer (refcnt becomes zero).
+ *
+ * The headroom for the attaching mbuf will be set to zero and this can be
+ * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
+ * or ``rte_pktmbuf_reset_headroom()`` might be used.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * Memory for shared data must be provided and user must initialize all of
+ * the content properly, especially free callback and refcnt. The pointer
+ * of shared data will be stored in m->shinfo.
+ * ``rte_pktmbuf_ext_shinfo_init_helper`` can help to simply spare a few
+ * bytes at the end of buffer for the shared data, store free callback and
+ * its argument and set the refcnt to 1. The following is an example:
+ *
+ *   struct rte_mbuf_ext_shared_info *shinfo =
+ *          rte_pktmbuf_ext_shinfo_init_helper(buf_addr, &buf_len,
+ *                                             free_cb, fcb_arg);
+ *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
+ *   rte_pktmbuf_reset_headroom(m);
+ *   rte_pktmbuf_adj(m, data_len);
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing buffer addresses and length of a mbuf, but with a few differences:
+ * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
+ *   2 as long as the direct mbuf itself isn't freed after the attachment.
+ *   In such cases, the buffer area of a direct mbuf must be read-only. But
+ *   external buffer has its own refcnt and it starts from 1. Unless
+ *   multiple mbufs are attached to a mbuf having an external buffer, the
+ *   external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback and its IO address.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect
+ * nor attached to an external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer.
+ * @param buf_iova
+ *   IO address of the external buffer.
+ * @param buf_len
+ *   The size of the external buffer.
+ * @param shinfo
+ *   User-provided memory for shared data of the external buffer.
+ */
+static inline void __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	rte_iova_t buf_iova, uint16_t buf_len,
+	struct rte_mbuf_ext_shared_info *shinfo)
+{
+	/* mbuf should not be read-only */
+	RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
+	RTE_ASSERT(shinfo->free_cb != NULL);
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = buf_iova;
+	m->buf_len = buf_len;
+
+	m->data_len = 0;
+	m->data_off = 0;
+
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+	m->shinfo = shinfo;
+}
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
+ * If the mbuf we are attaching to isn't a direct buffer and is attached to
+ * an external buffer, the mbuf being attached will be attached to the
+ * external buffer instead of mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer to the attached mbuf as 'indirect', and the mbuf it attaches to as
+ * 'direct'. The direct mbuf's reference counter is incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1232,19 +1474,20 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
+		mi->ol_flags = m->ol_flags;
+		mi->shinfo = m->shinfo;
+	} else {
+		/* if m is not direct, get the mbuf that embeds the data */
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1260,7 +1503,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1269,12 +1511,52 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
  *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+	RTE_ASSERT(m->shinfo != NULL);
+
+	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
+		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	md = rte_mbuf_from_indirect(m);
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
+ *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1283,10 +1565,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1298,13 +1584,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
 
 /**
@@ -1328,7 +1607,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1340,7 +1619,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-27  0:01 ` [PATCH v7 " Yongseok Koh
@ 2018-04-27  0:01   ` Yongseok Koh
  2018-04-27  8:00     ` Andrew Rybchenko
  2018-04-27  7:22   ` [PATCH v7 1/2] mbuf: support attaching external buffer to mbuf Andrew Rybchenko
  1 sibling, 1 reply; 86+ messages in thread
From: Yongseok Koh @ 2018-04-27  0:01 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch is to accommodate an experimental feature of mbuf - external
buffer attachment. If mbuf is attached to an external buffer, its ol_flags
will have EXT_ATTACHED_MBUF set. Without enabling/using the feature,
everything remains the same.

If a PMD delivers Rx packets with non-direct mbufs, ol_flags should not be
overwritten. For mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets could
be carried with externally attached mbufs.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 app/test-pmd/csumonly.c | 3 +++
 app/test-pmd/macfwd.c   | 3 +++
 app/test-pmd/macswap.c  | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 53b98412a..4a82bbc92 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -851,6 +851,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			m->l4_len = info.l4_len;
 			m->tso_segsz = info.tso_segsz;
 		}
+		if (!RTE_MBUF_DIRECT(m))
+			tx_ol_flags |= m->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		m->ol_flags = tx_ol_flags;
 
 		/* Do split & copy for the packet. */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..ba0021194 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,6 +96,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..b8d15f6ba 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,6 +127,9 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
+		if (!RTE_MBUF_DIRECT(mb))
+			ol_flags |= mb->ol_flags &
+				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		mb->ol_flags = ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-- 
2.11.0


* Re: [PATCH v7 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-27  0:01 ` [PATCH v7 " Yongseok Koh
  2018-04-27  0:01   ` [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-27  7:22   ` Andrew Rybchenko
  1 sibling, 0 replies; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-27  7:22 UTC (permalink / raw)
  To: Yongseok Koh, wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, stephen, thomas, adrien.mazarguil,
	nelio.laranjeiro

On 04/27/2018 03:01 AM, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
>
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but with a few differences:
>    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>      2 as long as the direct mbuf itself isn't freed after the attachment.
>      In such cases, the buffer area of a direct mbuf must be read-only. But
>      external buffer has its own refcnt and it starts from 1. Unless
>      multiple mbufs are attached to a mbuf having an external buffer, the
>      external buffer is writable.
>    - There's no need to allocate buffer from a mempool. Any buffer can be
>      attached with appropriate free callback.
>    - Smaller metadata is required to maintain shared data such as refcnt.
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Many thanks,
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>


* Re: [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-27  0:01   ` [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-27  8:00     ` Andrew Rybchenko
  0 siblings, 0 replies; 86+ messages in thread
From: Andrew Rybchenko @ 2018-04-27  8:00 UTC (permalink / raw)
  To: Yongseok Koh, wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, stephen, thomas, adrien.mazarguil,
	nelio.laranjeiro

On 04/27/2018 03:01 AM, Yongseok Koh wrote:
> This patch is to accommodate an experimental feature of mbuf - external
> buffer attachment. If mbuf is attached to an external buffer, its ol_flags
> will have EXT_ATTACHED_MBUF set. Without enabling/using the feature,
> everything remains the same.
>
> If a PMD delivers Rx packets with non-direct mbufs, ol_flags should not be
> overwritten. For mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets could
> be carried with externally attached mbufs.
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>   app/test-pmd/csumonly.c | 3 +++
>   app/test-pmd/macfwd.c   | 3 +++
>   app/test-pmd/macswap.c  | 3 +++
>   3 files changed, 9 insertions(+)
>
> diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
> index 53b98412a..4a82bbc92 100644
> --- a/app/test-pmd/csumonly.c
> +++ b/app/test-pmd/csumonly.c
> @@ -851,6 +851,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
>   			m->l4_len = info.l4_len;
>   			m->tso_segsz = info.tso_segsz;
>   		}
> +		if (!RTE_MBUF_DIRECT(m))
> +			tx_ol_flags |= m->ol_flags &
> +				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);

1. I see no point to check !RTE_MBUF_DIRECT(m). Just inherit flags.
2. Consider doing it where tx_ol_flags is initialized above to 0, i.e.
    tx_ol_flags = m->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);

>   		m->ol_flags = tx_ol_flags;
>   
>   		/* Do split & copy for the packet. */
> diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
> index 2adce7019..ba0021194 100644
> --- a/app/test-pmd/macfwd.c
> +++ b/app/test-pmd/macfwd.c
> @@ -96,6 +96,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
>   				&eth_hdr->d_addr);
>   		ether_addr_copy(&ports[fs->tx_port].eth_addr,
>   				&eth_hdr->s_addr);
> +		if (!RTE_MBUF_DIRECT(mb))
> +			ol_flags |= mb->ol_flags &
> +				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);

1. Do not update ol_flags, which is global and applied to all mbufs in
the burst.
2. There is no point to check !RTE_MBUF_DIRECT(mb). Just inherit
these flags.

>   		mb->ol_flags = ol_flags;
>   		mb->l2_len = sizeof(struct ether_hdr);
>   		mb->l3_len = sizeof(struct ipv4_hdr);
> diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
> index e2cc4812c..b8d15f6ba 100644
> --- a/app/test-pmd/macswap.c
> +++ b/app/test-pmd/macswap.c
> @@ -127,6 +127,9 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
>   		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
>   		ether_addr_copy(&addr, &eth_hdr->s_addr);
>   
> +		if (!RTE_MBUF_DIRECT(mb))
> +			ol_flags |= mb->ol_flags &
> +				(IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);

Same as above.

>   		mb->ol_flags = ol_flags;
>   		mb->l2_len = sizeof(struct ether_hdr);
>   		mb->l3_len = sizeof(struct ipv4_hdr);

With above fixed:
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>


* [PATCH v8 1/2] mbuf: support attaching external buffer to mbuf
  2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
                   ` (11 preceding siblings ...)
  2018-04-27  0:01 ` [PATCH v7 " Yongseok Koh
@ 2018-04-27 17:22 ` Yongseok Koh
  2018-04-27 17:22   ` [PATCH v8 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
  2018-04-27 18:09   ` [PATCH v8 1/2] mbuf: support attaching external buffer to mbuf Thomas Monjalon
  12 siblings, 2 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-27 17:22 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but with a few differences:
  - When an indirect mbuf is attached, refcnt of the direct mbuf would be
    2 as long as the direct mbuf itself isn't freed after the attachment.
    In such cases, the buffer area of a direct mbuf must be read-only. But
    external buffer has its own refcnt and it starts from 1. Unless
    multiple mbufs are attached to a mbuf having an external buffer, the
    external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---

Deprecation of RTE_MBUF_INDIRECT() will follow after integration of this
patch.

** This patch can pass the mbuf_autotest. **

Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
will be submitted separately rebased on a differnet patchset which
accommodates new memory hotplug design to mlx PMDs.

v8:
* NO CHANGE.

v7:
* make buf_len param [in,out] in rte_pktmbuf_ext_shinfo_init_helper().
* a minor change from review.

v6:
* rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead,
  rte_pktmbuf_ext_shinfo_init_helper() is added.
* bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi.
* minor changes from review.

v5:
* rte_pktmbuf_attach_extbuf() sets headroom to 0.
* if shinfo is provided when attaching, user should initialize it.
* minor changes from review.

v4:
* rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
  user can pass memory for shared data via shinfo argument.
* minor changes from review.

v3:
* implement external buffer attachment instead of introducing buf_off for
  mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 337 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 308 insertions(+), 29 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 0cd6a1c6b..4fd9a0d9e 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -345,7 +345,10 @@ extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+/**
+ * Mbuf having an external buffer attached. shinfo in mbuf must be filled.
+ */
+#define EXT_ATTACHED_MBUF    (1ULL << 61)
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -585,8 +588,27 @@ struct rte_mbuf {
 	/** Sequence number. See also rte_reorder_insert(). */
 	uint32_t seqn;
 
+	/** Shared data for external buffer attached to mbuf. See
+	 * rte_pktmbuf_attach_extbuf().
+	 */
+	struct rte_mbuf_ext_shared_info *shinfo;
+
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -707,14 +729,34 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
 }
 
 /**
+ * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
+ * otherwise.
+ *
+ * If a mbuf has its data in another mbuf and references it by mbuf
+ * indirection, this mbuf can be defined as a cloned mbuf.
+ */
+#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
  */
-#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
+#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
+
+/**
+ * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
+ *
+ * External buffer is a user-provided anonymous buffer.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
 
 /**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
+ *
+ * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
+ * can be defined as a direct mbuf.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) \
+	(!((mb)->ol_flags & (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF)))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -840,6 +882,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
+		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1214,11 +1308,159 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Initialize shared data at the end of an external buffer before attaching
+ * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory
+ * initialization but a helper function to simply spare a few bytes at the
+ * end of the buffer for shared data. If shared data is allocated
+ * separately, this should not be called but the application has to properly
+ * initialize the shared data according to its need.
+ *
+ * The free callback and its argument are saved and the refcnt is set to 1.
+ *
+ * @warning
+ * The value of buf_len will be reduced to RTE_PTR_DIFF(shinfo, buf_addr)
+ * after this initialization. This shall be used for
+ * ``rte_pktmbuf_attach_extbuf()``
+ *
+ * @param buf_addr
+ *   The pointer to the external buffer.
+ * @param [in,out] buf_len
+ *   The pointer to length of the external buffer. Input value must be
+ *   larger than the size of ``struct rte_mbuf_ext_shared_info`` and
+ *   padding for alignment. If not enough, this function will return NULL.
+ *   Adjusted buffer length will be returned through this pointer.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be
+ *   freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ *
+ * @return
+ *   A pointer to the initialized shared data on success, return NULL
+ *   otherwise.
+ */
+static inline struct rte_mbuf_ext_shared_info *
+rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t *buf_len,
+	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
+{
+	struct rte_mbuf_ext_shared_info *shinfo;
+	void *buf_end = RTE_PTR_ADD(buf_addr, *buf_len);
+
+	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
+				sizeof(*shinfo)), sizeof(uintptr_t));
+	if ((void *)shinfo <= buf_addr)
+		return NULL;
+
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+	rte_mbuf_ext_refcnt_set(shinfo, 1);
+
+	*buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+	return shinfo;
+}
+
+/**
+ * Attach an external buffer to a mbuf.
+ *
+ * A user-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, corresponding free callback function and its argument should be
+ * provided via shinfo. This callback function will be called once all the
+ * mbufs are detached from the buffer (refcnt becomes zero).
+ *
+ * The headroom for the attaching mbuf will be set to zero and this can be
+ * properly adjusted after attachment. For example, ``rte_pktmbuf_adj()``
+ * or ``rte_pktmbuf_reset_headroom()`` might be used.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * Memory for shared data must be provided and user must initialize all of
+ * the content properly, especially free callback and refcnt. The pointer
+ * of shared data will be stored in m->shinfo.
+ * ``rte_pktmbuf_ext_shinfo_init_helper`` can help to simply spare a few
+ * bytes at the end of buffer for the shared data, store free callback and
+ * its argument and set the refcnt to 1. The following is an example:
+ *
+ *   struct rte_mbuf_ext_shared_info *shinfo =
+ *          rte_pktmbuf_ext_shinfo_init_helper(buf_addr, &buf_len,
+ *                                             free_cb, fcb_arg);
+ *   rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo);
+ *   rte_pktmbuf_reset_headroom(m);
+ *   rte_pktmbuf_adj(m, data_len);
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing buffer addresses and length of a mbuf, but with a few differences:
+ * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
+ *   2 as long as the direct mbuf itself isn't freed after the attachment.
+ *   In such cases, the buffer area of a direct mbuf must be read-only. But
+ *   external buffer has its own refcnt and it starts from 1. Unless
+ *   multiple mbufs are attached to a mbuf having an external buffer, the
+ *   external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback and its IO address.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect
+ * nor attached to an external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer.
+ * @param buf_iova
+ *   IO address of the external buffer.
+ * @param buf_len
+ *   The size of the external buffer.
+ * @param shinfo
+ *   User-provided memory for shared data of the external buffer.
+ */
+static inline void __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	rte_iova_t buf_iova, uint16_t buf_len,
+	struct rte_mbuf_ext_shared_info *shinfo)
+{
+	/* mbuf should not be read-only */
+	RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
+	RTE_ASSERT(shinfo->free_cb != NULL);
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = buf_iova;
+	m->buf_len = buf_len;
+
+	m->data_len = 0;
+	m->data_off = 0;
+
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+	m->shinfo = shinfo;
+}
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
+ * If the mbuf we are attaching to isn't a direct buffer and is attached to
+ * an external buffer, the mbuf being attached will be attached to the
+ * external buffer instead of mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer to the attached mbuf as 'indirect', and the mbuf it attaches to as
+ * 'direct'. The direct mbuf's reference counter is incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1232,19 +1474,20 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
+		mi->ol_flags = m->ol_flags;
+		mi->shinfo = m->shinfo;
+	} else {
+		/* if m is not direct, get the mbuf that embeds the data */
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1260,7 +1503,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1269,12 +1511,52 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
  *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+	RTE_ASSERT(m->shinfo != NULL);
+
+	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
+		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	md = rte_mbuf_from_indirect(m);
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
+ *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1283,10 +1565,14 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1298,13 +1584,6 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
 
 /**
@@ -1328,7 +1607,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1340,7 +1619,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
-- 
2.11.0


* [PATCH v8 2/2] app/testpmd: conserve offload flags of mbuf
  2018-04-27 17:22 ` [PATCH v8 " Yongseok Koh
@ 2018-04-27 17:22   ` Yongseok Koh
  2018-04-27 18:09   ` [PATCH v8 1/2] mbuf: support attaching external buffer to mbuf Thomas Monjalon
  1 sibling, 0 replies; 86+ messages in thread
From: Yongseok Koh @ 2018-04-27 17:22 UTC (permalink / raw)
  To: wenzhuo.lu, jingjing.wu, olivier.matz
  Cc: dev, konstantin.ananyev, arybchenko, stephen, thomas,
	adrien.mazarguil, nelio.laranjeiro, Yongseok Koh

This patch accommodates an experimental mbuf feature - external buffer
attachment. If an mbuf is attached to an external buffer, its ol_flags
will have EXT_ATTACHED_MBUF set. Without enabling/using the feature,
everything remains the same.

If a PMD delivers Rx packets with non-direct mbufs, their ol_flags must not
be overwritten. For the mlx5 PMD, if Multi-Packet RQ is enabled, Rx packets
can be carried by externally attached mbufs.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>

v8:
* inherit flags from mbuf instead of checking !RTE_MBUF_DIRECT().
* fix a bug - update flags on per-packet basis.

---
 app/test-pmd/csumonly.c | 3 ++-
 app/test-pmd/macfwd.c   | 3 ++-
 app/test-pmd/macswap.c  | 3 ++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 53b98412a..0bb88cf7d 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -737,7 +737,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		m = pkts_burst[i];
 		info.is_tunnel = 0;
 		info.pkt_len = rte_pktmbuf_pkt_len(m);
-		tx_ol_flags = 0;
+		tx_ol_flags = m->ol_flags &
+			      (IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF);
 		rx_ol_flags = m->ol_flags;
 
 		/* Update the L3/L4 checksum error packet statistics */
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 2adce7019..7cac757a0 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -96,7 +96,8 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
-		mb->ol_flags = ol_flags;
+		mb->ol_flags &= IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF;
+		mb->ol_flags |= ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
 		mb->vlan_tci = txp->tx_vlan_id;
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index e2cc4812c..a8384d5b8 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -127,7 +127,8 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		ether_addr_copy(&eth_hdr->s_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&addr, &eth_hdr->s_addr);
 
-		mb->ol_flags = ol_flags;
+		mb->ol_flags &= IND_ATTACHED_MBUF | EXT_ATTACHED_MBUF;
+		mb->ol_flags |= ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
 		mb->vlan_tci = txp->tx_vlan_id;
-- 
2.11.0


* Re: [PATCH v8 1/2] mbuf: support attaching external buffer to mbuf
  2018-04-27 17:22 ` [PATCH v8 " Yongseok Koh
  2018-04-27 17:22   ` [PATCH v8 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
@ 2018-04-27 18:09   ` Thomas Monjalon
  1 sibling, 0 replies; 86+ messages in thread
From: Thomas Monjalon @ 2018-04-27 18:09 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: dev, wenzhuo.lu, jingjing.wu, olivier.matz, konstantin.ananyev,
	arybchenko, stephen, adrien.mazarguil, nelio.laranjeiro

27/04/2018 19:22, Yongseok Koh:
> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> that it replaces the buffer address and length of a mbuf, with a few
> differences:
>   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>     2 as long as the direct mbuf itself isn't freed after the attachment.
>     In such cases, the buffer area of a direct mbuf must be read-only. But
>     external buffer has its own refcnt and it starts from 1. Unless
>     multiple mbufs are attached to a mbuf having an external buffer, the
>     external buffer is writable.
>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>

Series applied, thanks


end of thread, other threads:[~2018-04-27 18:09 UTC | newest]

Thread overview: 86+ messages
2018-03-10  1:25 [PATCH v1 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-03-10  1:25 ` [PATCH v1 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
2018-03-10  1:25 ` [PATCH v1 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
2018-03-10  1:25 ` [PATCH v1 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
2018-03-12  9:13   ` Nélio Laranjeiro
2018-03-10  1:25 ` [PATCH v1 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-03-12  9:20   ` Nélio Laranjeiro
2018-03-10  1:25 ` [PATCH v1 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
2018-03-10  1:25 ` [PATCH v1 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
2018-04-02 18:50 ` [PATCH v2 0/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-04-02 18:50   ` [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection Yongseok Koh
2018-04-03  8:26     ` Olivier Matz
2018-04-04  0:12       ` Yongseok Koh
2018-04-09 16:04         ` Olivier Matz
2018-04-10  1:59           ` Yongseok Koh
2018-04-11  0:25             ` Ananyev, Konstantin
2018-04-11  5:33               ` Yongseok Koh
2018-04-11 11:39                 ` Ananyev, Konstantin
2018-04-11 14:02                   ` Andrew Rybchenko
2018-04-11 17:18                     ` Yongseok Koh
2018-04-11 17:08                   ` Yongseok Koh
2018-04-12 16:34                     ` Ananyev, Konstantin
2018-04-12 18:58                       ` Yongseok Koh
2018-04-02 18:50   ` [PATCH v2 2/6] net/mlx5: separate filling Rx flags Yongseok Koh
2018-04-02 18:50   ` [PATCH v2 3/6] net/mlx5: add a function to rdma-core glue Yongseok Koh
2018-04-02 18:50   ` [PATCH v2 4/6] net/mlx5: add Multi-Packet Rx support Yongseok Koh
2018-04-02 18:50   ` [PATCH v2 5/6] net/mlx5: release Tx queue resource earlier than Rx Yongseok Koh
2018-04-02 18:50   ` [PATCH v2 6/6] app/testpmd: conserve mbuf indirection flag Yongseok Koh
2018-04-19  1:11 ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Yongseok Koh
2018-04-19  1:11   ` [PATCH v3 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-23 11:53   ` [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
2018-04-24  2:04     ` Yongseok Koh
2018-04-25 13:16       ` Ananyev, Konstantin
2018-04-25 16:44         ` Yongseok Koh
2018-04-25 18:05           ` Ananyev, Konstantin
2018-04-23 16:18   ` Olivier Matz
2018-04-24  1:29     ` Yongseok Koh
2018-04-24 15:36       ` Olivier Matz
2018-04-24  1:38 ` [PATCH v4 " Yongseok Koh
2018-04-24  1:38   ` [PATCH v4 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-24  5:01   ` [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf Stephen Hemminger
2018-04-24 11:47     ` Yongseok Koh
2018-04-24 12:28   ` Andrew Rybchenko
2018-04-24 16:02     ` Olivier Matz
2018-04-24 18:21       ` ***Spam*** " Andrew Rybchenko
2018-04-24 19:15         ` Olivier Matz
2018-04-24 20:22           ` Thomas Monjalon
2018-04-24 21:53             ` Yongseok Koh
2018-04-24 22:15               ` Thomas Monjalon
2018-04-25  8:21               ` Olivier Matz
2018-04-25 15:06             ` Stephen Hemminger
2018-04-24 23:34           ` Yongseok Koh
2018-04-25 14:45             ` Andrew Rybchenko
2018-04-25 17:40               ` Yongseok Koh
2018-04-25  8:28       ` Olivier Matz
2018-04-25  9:08         ` Yongseok Koh
2018-04-25  9:19           ` Yongseok Koh
2018-04-25 20:00             ` Olivier Matz
2018-04-25 22:54               ` Yongseok Koh
2018-04-24 22:30     ` Yongseok Koh
2018-04-25  2:53 ` [PATCH v5 " Yongseok Koh
2018-04-25  2:53   ` [PATCH v5 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-25 13:31   ` [PATCH v5 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
2018-04-25 17:06     ` Yongseok Koh
2018-04-25 17:23       ` Ananyev, Konstantin
2018-04-25 18:02         ` Yongseok Koh
2018-04-25 18:22           ` Yongseok Koh
2018-04-25 18:30             ` Yongseok Koh
2018-04-26  1:10 ` [PATCH v6 " Yongseok Koh
2018-04-26  1:10   ` [PATCH v6 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-26 11:39   ` [PATCH v6 1/2] mbuf: support attaching external buffer to mbuf Ananyev, Konstantin
2018-04-26 16:05   ` Andrew Rybchenko
2018-04-26 16:10     ` Thomas Monjalon
2018-04-26 19:42       ` Olivier Matz
2018-04-26 19:58         ` Thomas Monjalon
2018-04-26 20:07           ` Olivier Matz
2018-04-26 20:24             ` Thomas Monjalon
2018-04-26 17:18     ` Yongseok Koh
2018-04-26 19:45       ` Olivier Matz
2018-04-27  0:01 ` [PATCH v7 " Yongseok Koh
2018-04-27  0:01   ` [PATCH v7 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-27  8:00     ` Andrew Rybchenko
2018-04-27  7:22   ` [PATCH v7 1/2] mbuf: support attaching external buffer to mbuf Andrew Rybchenko
2018-04-27 17:22 ` [PATCH v8 " Yongseok Koh
2018-04-27 17:22   ` [PATCH v8 2/2] app/testpmd: conserve offload flags of mbuf Yongseok Koh
2018-04-27 18:09   ` [PATCH v8 1/2] mbuf: support attaching external buffer to mbuf Thomas Monjalon
