All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue
@ 2015-12-13 23:35 Huawei Xie
  2015-12-13 23:35 ` [PATCH 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
                   ` (6 more replies)
  0 siblings, 7 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-13 23:35 UTC (permalink / raw)
  To: dev

For symmetric rte_pktmbuf_free_bulk, if the app knows in its scenarios
their mbufs are all simple mbufs, i.e meet the following requirements:
 * no multiple segments
 * not indirect mbuf
 * refcnt is 1
 * belong to the same mbuf memory pool,
it could directly call rte_mempool_put to free the bulk of mbufs,
otherwise rte_pktmbuf_free_bulk has to call rte_pktmbuf_free to free
the mbuf one by one.
This patchset will not provide this symmetric implementation.

Huawei Xie (2):
  mbuf: provide rte_pktmbuf_alloc_bulk API
  vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue

 lib/librte_mbuf/rte_mbuf.h    | 31 +++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 2 files changed, 53 insertions(+), 13 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
@ 2015-12-13 23:35 ` Huawei Xie
  2015-12-13 23:35 ` [PATCH 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-13 23:35 UTC (permalink / raw)
  To: dev

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..c0bc622 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,37 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	for (idx = 0; idx < count; idx++) {
+		RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+		rte_mbuf_refcnt_set(mbufs[idx], 1);
+		rte_pktmbuf_reset(mbufs[idx]);
+	}
+	return rc;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
  2015-12-13 23:35 ` [PATCH 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-13 23:35 ` Huawei Xie
  2015-12-14  1:14 ` [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-13 23:35 UTC (permalink / raw)
  To: dev

pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..0faae58 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -576,6 +576,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t i;
 	uint16_t free_entries, entry_success = 0;
 	uint16_t avail_idx;
+	uint8_t alloc_err = 0;
+	uint8_t seg_num;
 
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
 		RTE_LOG(ERR, VHOST_DATA,
@@ -609,6 +611,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
 			dev->device_fh, free_entries);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(mbuf_pool,
+		pkts, free_entries)) < 0) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to bulk allocating %d mbufs\n", free_entries);
+		return 0;
+	}
+
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < free_entries; i++)
 		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
@@ -621,9 +631,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t vb_avail, vb_offset;
 		uint32_t seg_avail, seg_offset;
 		uint32_t cpy_len;
-		uint32_t seg_num = 0;
+		seg_num = 0;
 		struct rte_mbuf *cur;
-		uint8_t alloc_err = 0;
+
 
 		desc = &vq->desc[head[entry_success]];
 
@@ -654,13 +664,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
 
-		/* Allocate an mbuf and populate the structure. */
-		m = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(m == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
-			break;
-		}
+		prev = cur = m = pkts[entry_success];
 		seg_offset = 0;
 		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
 		cpy_len = RTE_MIN(vb_avail, seg_avail);
@@ -668,8 +672,6 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
 
 		seg_num++;
-		cur = m;
-		prev = m;
 		while (cpy_len != 0) {
 			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 				(void *)((uintptr_t)(vb_addr + vb_offset)),
@@ -761,16 +763,23 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 			cpy_len = RTE_MIN(vb_avail, seg_avail);
 		}
 
-		if (unlikely(alloc_err == 1))
+		if (unlikely(alloc_err))
 			break;
 
 		m->nb_segs = seg_num;
 
-		pkts[entry_success] = m;
 		vq->last_used_idx++;
 		entry_success++;
 	}
 
+	if (unlikely(alloc_err)) {
+		uint16_t i = entry_success;
+
+		m->nb_segs = seg_num;
+		for (; i < free_entries; i++)
+			rte_pktmbuf_free(pkts[entry_success]);
+	}
+
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
 	/* Kick guest if required. */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
  2015-12-13 23:35 ` [PATCH 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2015-12-13 23:35 ` [PATCH 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
@ 2015-12-14  1:14 ` Huawei Xie
  2015-12-14  1:14   ` [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
                     ` (2 more replies)
  2015-12-22 23:05 ` [PATCH v4 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-14  1:14 UTC (permalink / raw)
  To: dev

v2 changes:
 unroll the loop in rte_pktmbuf_alloc_bulk to help the performance

For symmetric rte_pktmbuf_free_bulk, if the app knows in its scenarios
their mbufs are all simple mbufs, i.e meet the following requirements:
 * no multiple segments
 * not indirect mbuf
 * refcnt is 1
 * belong to the same mbuf memory pool,
it could directly call rte_mempool_put to free the bulk of mbufs,
otherwise rte_pktmbuf_free_bulk has to call rte_pktmbuf_free to free
the mbuf one by one.
This patchset will not provide this symmetric implementation.

Huawei Xie (2):
  mbuf: provide rte_pktmbuf_alloc_bulk API
  vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue

 lib/librte_mbuf/rte_mbuf.h    | 50 +++++++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost_rxtx.c | 35 +++++++++++++++++++-----------
 2 files changed, 72 insertions(+), 13 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-14  1:14 ` [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2015-12-14  1:14   ` Huawei Xie
  2015-12-17  6:41     ` Yuanhan Liu
  2015-12-18  5:01     ` Stephen Hemminger
  2015-12-14  1:14   ` [PATCH v2 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  2015-12-22 16:17   ` [PATCH v3 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2 siblings, 2 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-14  1:14 UTC (permalink / raw)
  To: dev

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..4e209e0 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx = 0;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	switch (count % 4) {
+	while (idx != count) {
+		case 0:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+		case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+		case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+		case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	}
+	}
+	return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-14  1:14 ` [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2015-12-14  1:14   ` [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-14  1:14   ` Huawei Xie
  2015-12-17  6:41     ` Yuanhan Liu
  2015-12-22 16:17   ` [PATCH v3 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2 siblings, 1 reply; 54+ messages in thread
From: Huawei Xie @ 2015-12-14  1:14 UTC (permalink / raw)
  To: dev

pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..0faae58 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -576,6 +576,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t i;
 	uint16_t free_entries, entry_success = 0;
 	uint16_t avail_idx;
+	uint8_t alloc_err = 0;
+	uint8_t seg_num;
 
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
 		RTE_LOG(ERR, VHOST_DATA,
@@ -609,6 +611,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
 			dev->device_fh, free_entries);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(mbuf_pool,
+		pkts, free_entries)) < 0) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to bulk allocating %d mbufs\n", free_entries);
+		return 0;
+	}
+
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < free_entries; i++)
 		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
@@ -621,9 +631,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t vb_avail, vb_offset;
 		uint32_t seg_avail, seg_offset;
 		uint32_t cpy_len;
-		uint32_t seg_num = 0;
+		seg_num = 0;
 		struct rte_mbuf *cur;
-		uint8_t alloc_err = 0;
+
 
 		desc = &vq->desc[head[entry_success]];
 
@@ -654,13 +664,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
 
-		/* Allocate an mbuf and populate the structure. */
-		m = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(m == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
-			break;
-		}
+		prev = cur = m = pkts[entry_success];
 		seg_offset = 0;
 		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
 		cpy_len = RTE_MIN(vb_avail, seg_avail);
@@ -668,8 +672,6 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
 
 		seg_num++;
-		cur = m;
-		prev = m;
 		while (cpy_len != 0) {
 			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 				(void *)((uintptr_t)(vb_addr + vb_offset)),
@@ -761,16 +763,23 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 			cpy_len = RTE_MIN(vb_avail, seg_avail);
 		}
 
-		if (unlikely(alloc_err == 1))
+		if (unlikely(alloc_err))
 			break;
 
 		m->nb_segs = seg_num;
 
-		pkts[entry_success] = m;
 		vq->last_used_idx++;
 		entry_success++;
 	}
 
+	if (unlikely(alloc_err)) {
+		uint16_t i = entry_success;
+
+		m->nb_segs = seg_num;
+		for (; i < free_entries; i++)
+			rte_pktmbuf_free(pkts[entry_success]);
+	}
+
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
 	/* Kick guest if required. */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-14  1:14   ` [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-17  6:41     ` Yuanhan Liu
  2015-12-17 15:42       ` Ananyev, Konstantin
  2015-12-18  5:01     ` Stephen Hemminger
  1 sibling, 1 reply; 54+ messages in thread
From: Yuanhan Liu @ 2015-12-17  6:41 UTC (permalink / raw)
  To: Huawei Xie; +Cc: dev

On Mon, Dec 14, 2015 at 09:14:41AM +0800, Huawei Xie wrote:
> v2 changes:
>  unroll the loop a bit to help the performance
> 
> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
> 
> There is related thread about this bulk API.
> http://dpdk.org/dev/patchwork/patch/4718/
> Thanks to Konstantin's loop unrolling.
> 
> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index f234ac9..4e209e0 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>  }
>  
>  /**
> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> + * values.
> + *
> + *  @param pool
> + *    The mempool from which mbufs are allocated.
> + *  @param mbufs
> + *    Array of pointers to mbufs
> + *  @param count
> + *    Array size
> + *  @return
> + *   - 0: Success
> + */
> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> +	 struct rte_mbuf **mbufs, unsigned count)

It violates the coding style a bit.

> +{
> +	unsigned idx = 0;
> +	int rc;
> +
> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> +	if (unlikely(rc))
> +		return rc;
> +
> +	switch (count % 4) {
> +	while (idx != count) {

Well, that's an awkward trick, putting while between switch and case.

How about moving the whole switch block ahead, and use goto?

	switch (count % 4) {
	case 3:
		goto __3;
		break;
	case 2:
		goto __2;
		break;
	...

	}

It basically generates same instructions, yet it improves the
readability a bit.

	--yliu

> +		case 0:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		case 3:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		case 2:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		case 1:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	}
> +	}
> +	return 0;
> +}
> +
> +/**
>   * Attach packet mbuf to another packet mbuf.
>   *
>   * After attachment we refer the mbuf we attached as 'indirect',
> -- 
> 1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-14  1:14   ` [PATCH v2 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
@ 2015-12-17  6:41     ` Yuanhan Liu
  0 siblings, 0 replies; 54+ messages in thread
From: Yuanhan Liu @ 2015-12-17  6:41 UTC (permalink / raw)
  To: Huawei Xie; +Cc: dev

On Mon, Dec 14, 2015 at 09:14:42AM +0800, Huawei Xie wrote:
> pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand
> 
> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

Thanks.

	--yliu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-17  6:41     ` Yuanhan Liu
@ 2015-12-17 15:42       ` Ananyev, Konstantin
  2015-12-18  2:17         ` Yuanhan Liu
  0 siblings, 1 reply; 54+ messages in thread
From: Ananyev, Konstantin @ 2015-12-17 15:42 UTC (permalink / raw)
  To: Yuanhan Liu, Xie, Huawei; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Thursday, December 17, 2015 6:41 AM
> To: Xie, Huawei
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
> 
> On Mon, Dec 14, 2015 at 09:14:41AM +0800, Huawei Xie wrote:
> > v2 changes:
> >  unroll the loop a bit to help the performance
> >
> > rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
> >
> > There is related thread about this bulk API.
> > http://dpdk.org/dev/patchwork/patch/4718/
> > Thanks to Konstantin's loop unrolling.
> >
> > Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> > Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 50 insertions(+)
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index f234ac9..4e209e0 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
> >  }
> >
> >  /**
> > + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> > + * values.
> > + *
> > + *  @param pool
> > + *    The mempool from which mbufs are allocated.
> > + *  @param mbufs
> > + *    Array of pointers to mbufs
> > + *  @param count
> > + *    Array size
> > + *  @return
> > + *   - 0: Success
> > + */
> > +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > +	 struct rte_mbuf **mbufs, unsigned count)
> 
> It violates the coding style a bit.
> 
> > +{
> > +	unsigned idx = 0;
> > +	int rc;
> > +
> > +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> > +	if (unlikely(rc))
> > +		return rc;
> > +
> > +	switch (count % 4) {
> > +	while (idx != count) {
> 
> Well, that's an awkward trick, putting while between switch and case.
> 
> How about moving the whole switch block ahead, and use goto?
> 
> 	switch (count % 4) {
> 	case 3:
> 		goto __3;
> 		break;
> 	case 2:
> 		goto __2;
> 		break;
> 	...
> 
> 	}
> 
> It basically generates same instructions, yet it improves the
> readability a bit.

I am personally not a big fun of gotos, unless it is totally unavoidable.
I think switch/while construction is pretty obvious these days.
For me the original variant looks cleaner, so my vote would be to stick with it.
Konstantin

> 
> 	--yliu
> 
> > +		case 0:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 3:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 2:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 1:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	}
> > +	}
> > +	return 0;
> > +}
> > +
> > +/**
> >   * Attach packet mbuf to another packet mbuf.
> >   *
> >   * After attachment we refer the mbuf we attached as 'indirect',
> > --
> > 1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-17 15:42       ` Ananyev, Konstantin
@ 2015-12-18  2:17         ` Yuanhan Liu
  0 siblings, 0 replies; 54+ messages in thread
From: Yuanhan Liu @ 2015-12-18  2:17 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Thu, Dec 17, 2015 at 03:42:19PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
> > Sent: Thursday, December 17, 2015 6:41 AM
> > To: Xie, Huawei
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
> > 
> > > +{
> > > +	unsigned idx = 0;
> > > +	int rc;
> > > +
> > > +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> > > +	if (unlikely(rc))
> > > +		return rc;
> > > +
> > > +	switch (count % 4) {
> > > +	while (idx != count) {
> > 
> > Well, that's an awkward trick, putting while between switch and case.
> > 
> > How about moving the whole switch block ahead, and use goto?
> > 
> > 	switch (count % 4) {
> > 	case 3:
> > 		goto __3;
> > 		break;
> > 	case 2:
> > 		goto __2;
> > 		break;
> > 	...
> > 
> > 	}
> > 
> > It basically generates same instructions, yet it improves the
> > readability a bit.
> 
> I am personally not a big fun of gotos, unless it is totally unavoidable.
> I think switch/while construction is pretty obvious these days.

To me, it's not. (well, maybe I have been out for a while :(

> For me the original variant looks cleaner,

I agree with you on that. But it sacrifices code readability a bit.
If two pieces of code generates same instructions, but one is cleaner
(shorter), another one is more readable, I'd prefer the later.

> so my vote would be to stick with it.

Okay. And anyway, above is just a suggestion, and I'm open to other
suggestions.

	--yliu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-14  1:14   ` [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2015-12-17  6:41     ` Yuanhan Liu
@ 2015-12-18  5:01     ` Stephen Hemminger
  2015-12-18  5:21       ` Yuanhan Liu
                         ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Stephen Hemminger @ 2015-12-18  5:01 UTC (permalink / raw)
  To: Huawei Xie; +Cc: dev

On Mon, 14 Dec 2015 09:14:41 +0800
Huawei Xie <huawei.xie@intel.com> wrote:

> v2 changes:
>  unroll the loop a bit to help the performance
> 
> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
> 
> There is related thread about this bulk API.
> http://dpdk.org/dev/patchwork/patch/4718/
> Thanks to Konstantin's loop unrolling.
> 
> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index f234ac9..4e209e0 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>  }
>  
>  /**
> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> + * values.
> + *
> + *  @param pool
> + *    The mempool from which mbufs are allocated.
> + *  @param mbufs
> + *    Array of pointers to mbufs
> + *  @param count
> + *    Array size
> + *  @return
> + *   - 0: Success
> + */
> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> +	 struct rte_mbuf **mbufs, unsigned count)
> +{
> +	unsigned idx = 0;
> +	int rc;
> +
> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> +	if (unlikely(rc))
> +		return rc;
> +
> +	switch (count % 4) {
> +	while (idx != count) {
> +		case 0:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		case 3:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		case 2:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		case 1:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	}
> +	}
> +	return 0;
> +}

This is weird. Why not just use Duff's device in a more normal manner.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18  5:01     ` Stephen Hemminger
@ 2015-12-18  5:21       ` Yuanhan Liu
  2015-12-18  7:10       ` Xie, Huawei
  2015-12-18 10:44       ` Ananyev, Konstantin
  2 siblings, 0 replies; 54+ messages in thread
From: Yuanhan Liu @ 2015-12-18  5:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Thu, Dec 17, 2015 at 09:01:14PM -0800, Stephen Hemminger wrote:
...
> > +
> > +	switch (count % 4) {
> > +	while (idx != count) {
> > +		case 0:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 3:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 2:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 1:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	}
> > +	}
> > +	return 0;
> > +}
> 
> This is weird. Why not just use Duff's device in a more normal manner.

Duff's device; interesting and good to know. Thanks.

	--yliu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18  5:01     ` Stephen Hemminger
  2015-12-18  5:21       ` Yuanhan Liu
@ 2015-12-18  7:10       ` Xie, Huawei
  2015-12-18 10:44       ` Ananyev, Konstantin
  2 siblings, 0 replies; 54+ messages in thread
From: Xie, Huawei @ 2015-12-18  7:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On 12/18/2015 1:03 PM, Stephen Hemminger wrote:
> On Mon, 14 Dec 2015 09:14:41 +0800
> Huawei Xie <huawei.xie@intel.com> wrote:
>
>> v2 changes:
>>  unroll the loop a bit to help the performance
>>
>> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>>
>> There is related thread about this bulk API.
>> http://dpdk.org/dev/patchwork/patch/4718/
>> Thanks to Konstantin's loop unrolling.
>>
>> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> ---
>>  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>> index f234ac9..4e209e0 100644
>> --- a/lib/librte_mbuf/rte_mbuf.h
>> +++ b/lib/librte_mbuf/rte_mbuf.h
>> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>>  }
>>  
>>  /**
>> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
>> + * values.
>> + *
>> + *  @param pool
>> + *    The mempool from which mbufs are allocated.
>> + *  @param mbufs
>> + *    Array of pointers to mbufs
>> + *  @param count
>> + *    Array size
>> + *  @return
>> + *   - 0: Success
>> + */
>> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>> +	 struct rte_mbuf **mbufs, unsigned count)
>> +{
>> +	unsigned idx = 0;
>> +	int rc;
>> +
>> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>> +	if (unlikely(rc))
>> +		return rc;
>> +
>> +	switch (count % 4) {
>> +	while (idx != count) {
>> +		case 0:
>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +			rte_pktmbuf_reset(mbufs[idx]);
>> +			idx++;
>> +		case 3:
>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +			rte_pktmbuf_reset(mbufs[idx]);
>> +			idx++;
>> +		case 2:
>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +			rte_pktmbuf_reset(mbufs[idx]);
>> +			idx++;
>> +		case 1:
>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +			rte_pktmbuf_reset(mbufs[idx]);
>> +			idx++;
>> +	}
>> +	}
>> +	return 0;
>> +}
> This is weird. Why not just use Duff's device in a more normal manner.
Hi Stephen: I just compared with duff's unrolled version. It is slightly
different, but where is weird?
>
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18  5:01     ` Stephen Hemminger
  2015-12-18  5:21       ` Yuanhan Liu
  2015-12-18  7:10       ` Xie, Huawei
@ 2015-12-18 10:44       ` Ananyev, Konstantin
  2015-12-18 17:32         ` Stephen Hemminger
  2 siblings, 1 reply; 54+ messages in thread
From: Ananyev, Konstantin @ 2015-12-18 10:44 UTC (permalink / raw)
  To: Stephen Hemminger, Xie, Huawei; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Friday, December 18, 2015 5:01 AM
> To: Xie, Huawei
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
> 
> On Mon, 14 Dec 2015 09:14:41 +0800
> Huawei Xie <huawei.xie@intel.com> wrote:
> 
> > v2 changes:
> >  unroll the loop a bit to help the performance
> >
> > rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
> >
> > There is related thread about this bulk API.
> > http://dpdk.org/dev/patchwork/patch/4718/
> > Thanks to Konstantin's loop unrolling.
> >
> > Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> > Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 50 insertions(+)
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index f234ac9..4e209e0 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
> >  }
> >
> >  /**
> > + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> > + * values.
> > + *
> > + *  @param pool
> > + *    The mempool from which mbufs are allocated.
> > + *  @param mbufs
> > + *    Array of pointers to mbufs
> > + *  @param count
> > + *    Array size
> > + *  @return
> > + *   - 0: Success
> > + */
> > +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > +	 struct rte_mbuf **mbufs, unsigned count)
> > +{
> > +	unsigned idx = 0;
> > +	int rc;
> > +
> > +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> > +	if (unlikely(rc))
> > +		return rc;
> > +
> > +	switch (count % 4) {
> > +	while (idx != count) {
> > +		case 0:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 3:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 2:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +		case 1:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	}
> > +	}
> > +	return 0;
> > +}
> 
> This is weird. Why not just use Duff's device in a more normal manner.

But it is a sort of Duff's method.
Not sure what looks weird to you here?
while () {} instead of do {} while();?
Konstantin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18 10:44       ` Ananyev, Konstantin
@ 2015-12-18 17:32         ` Stephen Hemminger
  2015-12-18 19:27           ` Wiles, Keith
  2015-12-21 12:25           ` Xie, Huawei
  0 siblings, 2 replies; 54+ messages in thread
From: Stephen Hemminger @ 2015-12-18 17:32 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Fri, 18 Dec 2015 10:44:02 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> > Sent: Friday, December 18, 2015 5:01 AM
> > To: Xie, Huawei
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
> > 
> > On Mon, 14 Dec 2015 09:14:41 +0800
> > Huawei Xie <huawei.xie@intel.com> wrote:
> > 
> > > v2 changes:
> > >  unroll the loop a bit to help the performance
> > >
> > > rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
> > >
> > > There is related thread about this bulk API.
> > > http://dpdk.org/dev/patchwork/patch/4718/
> > > Thanks to Konstantin's loop unrolling.
> > >
> > > Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> > > Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > >  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 50 insertions(+)
> > >
> > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > > index f234ac9..4e209e0 100644
> > > --- a/lib/librte_mbuf/rte_mbuf.h
> > > +++ b/lib/librte_mbuf/rte_mbuf.h
> > > @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
> > >  }
> > >
> > >  /**
> > > + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> > > + * values.
> > > + *
> > > + *  @param pool
> > > + *    The mempool from which mbufs are allocated.
> > > + *  @param mbufs
> > > + *    Array of pointers to mbufs
> > > + *  @param count
> > > + *    Array size
> > > + *  @return
> > > + *   - 0: Success
> > > + */
> > > +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> > > +	 struct rte_mbuf **mbufs, unsigned count)
> > > +{
> > > +	unsigned idx = 0;
> > > +	int rc;
> > > +
> > > +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> > > +	if (unlikely(rc))
> > > +		return rc;
> > > +
> > > +	switch (count % 4) {
> > > +	while (idx != count) {
> > > +		case 0:
> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > > +			rte_pktmbuf_reset(mbufs[idx]);
> > > +			idx++;
> > > +		case 3:
> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > > +			rte_pktmbuf_reset(mbufs[idx]);
> > > +			idx++;
> > > +		case 2:
> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > > +			rte_pktmbuf_reset(mbufs[idx]);
> > > +			idx++;
> > > +		case 1:
> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > > +			rte_pktmbuf_reset(mbufs[idx]);
> > > +			idx++;
> > > +	}
> > > +	}
> > > +	return 0;
> > > +}
> > 
> > This is weird. Why not just use Duff's device in a more normal manner.
> 
> But it is a sort of Duff's method.
> Not sure what looks weird to you here?
> while () {} instead of do {} while();?
> Konstantin
> 
> 
> 

It is unusual to have cases not associated with block of the switch.
Unusual to me means, "not used commonly in most code".

Since you are jumping into the loop, might make more sense as a do { } while()

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18 17:32         ` Stephen Hemminger
@ 2015-12-18 19:27           ` Wiles, Keith
  2015-12-21 15:21             ` Xie, Huawei
  2015-12-21 12:25           ` Xie, Huawei
  1 sibling, 1 reply; 54+ messages in thread
From: Wiles, Keith @ 2015-12-18 19:27 UTC (permalink / raw)
  To: Stephen Hemminger, Ananyev, Konstantin; +Cc: dev

On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces@dpdk.org on behalf of stephen@networkplumber.org> wrote:

>On Fri, 18 Dec 2015 10:44:02 +0000
>"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
>
>> 
>> 
>> > -----Original Message-----
>> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
>> > Sent: Friday, December 18, 2015 5:01 AM
>> > To: Xie, Huawei
>> > Cc: dev@dpdk.org
>> > Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>> > 
>> > On Mon, 14 Dec 2015 09:14:41 +0800
>> > Huawei Xie <huawei.xie@intel.com> wrote:
>> > 
>> > > v2 changes:
>> > >  unroll the loop a bit to help the performance
>> > >
>> > > rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>> > >
>> > > There is related thread about this bulk API.
>> > > http://dpdk.org/dev/patchwork/patch/4718/
>> > > Thanks to Konstantin's loop unrolling.
>> > >
>> > > Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
>> > > Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> > > ---
>> > >  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>> > >  1 file changed, 50 insertions(+)
>> > >
>> > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>> > > index f234ac9..4e209e0 100644
>> > > --- a/lib/librte_mbuf/rte_mbuf.h
>> > > +++ b/lib/librte_mbuf/rte_mbuf.h
>> > > @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>> > >  }
>> > >
>> > >  /**
>> > > + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
>> > > + * values.
>> > > + *
>> > > + *  @param pool
>> > > + *    The mempool from which mbufs are allocated.
>> > > + *  @param mbufs
>> > > + *    Array of pointers to mbufs
>> > > + *  @param count
>> > > + *    Array size
>> > > + *  @return
>> > > + *   - 0: Success
>> > > + */
>> > > +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>> > > +	 struct rte_mbuf **mbufs, unsigned count)
>> > > +{
>> > > +	unsigned idx = 0;
>> > > +	int rc;
>> > > +
>> > > +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>> > > +	if (unlikely(rc))
>> > > +		return rc;
>> > > +
>> > > +	switch (count % 4) {
>> > > +	while (idx != count) {
>> > > +		case 0:
>> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> > > +			rte_pktmbuf_reset(mbufs[idx]);
>> > > +			idx++;
>> > > +		case 3:
>> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> > > +			rte_pktmbuf_reset(mbufs[idx]);
>> > > +			idx++;
>> > > +		case 2:
>> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> > > +			rte_pktmbuf_reset(mbufs[idx]);
>> > > +			idx++;
>> > > +		case 1:
>> > > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> > > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>> > > +			rte_pktmbuf_reset(mbufs[idx]);
>> > > +			idx++;
>> > > +	}
>> > > +	}
>> > > +	return 0;
>> > > +}
>> > 
>> > This is weird. Why not just use Duff's device in a more normal manner.
>> 
>> But it is a sort of Duff's method.
>> Not sure what looks weird to you here?
>> while () {} instead of do {} while();?
>> Konstantin
>> 
>> 
>> 
>
>It is unusual to have cases not associated with block of the switch.
>Unusual to me means, "not used commonly in most code".
>
>Since you are jumping into the loop, might make more sense as a do { } while()

I find this a very odd coding practice and I would suggest we not do this, unless it gives us some great performance gain.

Keith
>
>


Regards,
Keith





^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18 17:32         ` Stephen Hemminger
  2015-12-18 19:27           ` Wiles, Keith
@ 2015-12-21 12:25           ` Xie, Huawei
  1 sibling, 0 replies; 54+ messages in thread
From: Xie, Huawei @ 2015-12-21 12:25 UTC (permalink / raw)
  To: Stephen Hemminger, Ananyev, Konstantin; +Cc: dev

On 12/19/2015 1:32 AM, Stephen Hemminger wrote:
> On Fri, 18 Dec 2015 10:44:02 +0000
> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
>
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
>>> Sent: Friday, December 18, 2015 5:01 AM
>>> To: Xie, Huawei
>>> Cc: dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>>>
>>> On Mon, 14 Dec 2015 09:14:41 +0800
>>> Huawei Xie <huawei.xie@intel.com> wrote:
>>>
>>>> v2 changes:
>>>>  unroll the loop a bit to help the performance
>>>>
>>>> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>>>>
>>>> There is related thread about this bulk API.
>>>> http://dpdk.org/dev/patchwork/patch/4718/
>>>> Thanks to Konstantin's loop unrolling.
>>>>
>>>> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
>>>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>>> ---
>>>>  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 50 insertions(+)
>>>>
>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>>>> index f234ac9..4e209e0 100644
>>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>>> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>>>>  }
>>>>
>>>>  /**
>>>> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
>>>> + * values.
>>>> + *
>>>> + *  @param pool
>>>> + *    The mempool from which mbufs are allocated.
>>>> + *  @param mbufs
>>>> + *    Array of pointers to mbufs
>>>> + *  @param count
>>>> + *    Array size
>>>> + *  @return
>>>> + *   - 0: Success
>>>> + */
>>>> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>>>> +	 struct rte_mbuf **mbufs, unsigned count)
>>>> +{
>>>> +	unsigned idx = 0;
>>>> +	int rc;
>>>> +
>>>> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>>>> +	if (unlikely(rc))
>>>> +		return rc;
>>>> +
>>>> +	switch (count % 4) {
>>>> +	while (idx != count) {
>>>> +		case 0:
>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>> +			idx++;
>>>> +		case 3:
>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>> +			idx++;
>>>> +		case 2:
>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>> +			idx++;
>>>> +		case 1:
>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>> +			idx++;
>>>> +	}
>>>> +	}
>>>> +	return 0;
>>>> +}
>>> This is weird. Why not just use Duff's device in a more normal manner.
>> But it is a sort of Duff's method.
>> Not sure what looks weird to you here?
>> while () {} instead of do {} while();?
>> Konstantin
>>
>>
>>
> It is unusual to have cases not associated with block of the switch.
> Unusual to me means, "not used commonly in most code".
>
> Since you are jumping into the loop, might make more sense as a do { } while()
>
Stephen:
How about we move while a bit:
    switch(count % 4) {
    case 0: while (idx != count) {
            ... reset ...
    case 3:
            ... reset ...
    case 2:
            ... reset ...
    case 1:
            ... reset ...
     }
     }

With do {} while, we probably need one more extra check on if count is
zero. Duff's initial implementation assumes that count isn't zero. With
while loop, we save one line of code.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-18 19:27           ` Wiles, Keith
@ 2015-12-21 15:21             ` Xie, Huawei
  2015-12-21 17:20               ` Wiles, Keith
  2015-12-21 22:34               ` Don Provan
  0 siblings, 2 replies; 54+ messages in thread
From: Xie, Huawei @ 2015-12-21 15:21 UTC (permalink / raw)
  To: Wiles, Keith, Stephen Hemminger, Ananyev, Konstantin; +Cc: dev

On 12/19/2015 3:27 AM, Wiles, Keith wrote:
> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces@dpdk.org on behalf of stephen@networkplumber.org> wrote:
>
>> On Fri, 18 Dec 2015 10:44:02 +0000
>> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
>>
>>>
>>>> -----Original Message-----
>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
>>>> Sent: Friday, December 18, 2015 5:01 AM
>>>> To: Xie, Huawei
>>>> Cc: dev@dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>>>>
>>>> On Mon, 14 Dec 2015 09:14:41 +0800
>>>> Huawei Xie <huawei.xie@intel.com> wrote:
>>>>
>>>>> v2 changes:
>>>>>  unroll the loop a bit to help the performance
>>>>>
>>>>> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>>>>>
>>>>> There is related thread about this bulk API.
>>>>> http://dpdk.org/dev/patchwork/patch/4718/
>>>>> Thanks to Konstantin's loop unrolling.
>>>>>
>>>>> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
>>>>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>>>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>>>> ---
>>>>>  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 50 insertions(+)
>>>>>
>>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>>>>> index f234ac9..4e209e0 100644
>>>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>>>> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>>>>>  }
>>>>>
>>>>>  /**
>>>>> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
>>>>> + * values.
>>>>> + *
>>>>> + *  @param pool
>>>>> + *    The mempool from which mbufs are allocated.
>>>>> + *  @param mbufs
>>>>> + *    Array of pointers to mbufs
>>>>> + *  @param count
>>>>> + *    Array size
>>>>> + *  @return
>>>>> + *   - 0: Success
>>>>> + */
>>>>> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>>>>> +	 struct rte_mbuf **mbufs, unsigned count)
>>>>> +{
>>>>> +	unsigned idx = 0;
>>>>> +	int rc;
>>>>> +
>>>>> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>>>>> +	if (unlikely(rc))
>>>>> +		return rc;
>>>>> +
>>>>> +	switch (count % 4) {
>>>>> +	while (idx != count) {
>>>>> +		case 0:
>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>> +			idx++;
>>>>> +		case 3:
>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>> +			idx++;
>>>>> +		case 2:
>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>> +			idx++;
>>>>> +		case 1:
>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>> +			idx++;
>>>>> +	}
>>>>> +	}
>>>>> +	return 0;
>>>>> +}
>>>> This is weird. Why not just use Duff's device in a more normal manner.
>>> But it is a sort of Duff's method.
>>> Not sure what looks weird to you here?
>>> while () {} instead of do {} while();?
>>> Konstantin
>>>
>>>
>>>
>> It is unusual to have cases not associated with block of the switch.
>> Unusual to me means, "not used commonly in most code".
>>
>> Since you are jumping into the loop, might make more sense as a do { } while()
> I find this a very odd coding practice and I would suggest we not do this, unless it gives us some great performance gain.
>
> Keith
The loop unwinding could give performance gain. The only problem is the
switch/loop combination makes people feel weird at the first glance but
soon they will grasp this style. Since this is inherited from old famous
duff's device, i prefer to keep this style which saves lines of code.
>>
>
> Regards,
> Keith
>
>
>
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-21 15:21             ` Xie, Huawei
@ 2015-12-21 17:20               ` Wiles, Keith
  2015-12-21 21:30                 ` Thomas Monjalon
  2015-12-21 22:34               ` Don Provan
  1 sibling, 1 reply; 54+ messages in thread
From: Wiles, Keith @ 2015-12-21 17:20 UTC (permalink / raw)
  To: Xie, Huawei, Stephen Hemminger, Ananyev, Konstantin; +Cc: dev

On 12/21/15, 9:21 AM, "Xie, Huawei" <huawei.xie@intel.com> wrote:

>On 12/19/2015 3:27 AM, Wiles, Keith wrote:
>> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces@dpdk.org on behalf of stephen@networkplumber.org> wrote:
>>
>>> On Fri, 18 Dec 2015 10:44:02 +0000
>>> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
>>>>> Sent: Friday, December 18, 2015 5:01 AM
>>>>> To: Xie, Huawei
>>>>> Cc: dev@dpdk.org
>>>>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>>>>>
>>>>> On Mon, 14 Dec 2015 09:14:41 +0800
>>>>> Huawei Xie <huawei.xie@intel.com> wrote:
>>>>>
>>>>>> v2 changes:
>>>>>>  unroll the loop a bit to help the performance
>>>>>>
>>>>>> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>>>>>>
>>>>>> There is related thread about this bulk API.
>>>>>> http://dpdk.org/dev/patchwork/patch/4718/
>>>>>> Thanks to Konstantin's loop unrolling.
>>>>>>
>>>>>> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
>>>>>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>>>>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>>>>> ---
>>>>>>  lib/librte_mbuf/rte_mbuf.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 50 insertions(+)
>>>>>>
>>>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>>>>>> index f234ac9..4e209e0 100644
>>>>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>>>>> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>>>>>>  }
>>>>>>
>>>>>>  /**
>>>>>> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
>>>>>> + * values.
>>>>>> + *
>>>>>> + *  @param pool
>>>>>> + *    The mempool from which mbufs are allocated.
>>>>>> + *  @param mbufs
>>>>>> + *    Array of pointers to mbufs
>>>>>> + *  @param count
>>>>>> + *    Array size
>>>>>> + *  @return
>>>>>> + *   - 0: Success
>>>>>> + */
>>>>>> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>>>>>> +	 struct rte_mbuf **mbufs, unsigned count)
>>>>>> +{
>>>>>> +	unsigned idx = 0;
>>>>>> +	int rc;
>>>>>> +
>>>>>> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>>>>>> +	if (unlikely(rc))
>>>>>> +		return rc;
>>>>>> +
>>>>>> +	switch (count % 4) {
>>>>>> +	while (idx != count) {
>>>>>> +		case 0:
>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>> +			idx++;
>>>>>> +		case 3:
>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>> +			idx++;
>>>>>> +		case 2:
>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>> +			idx++;
>>>>>> +		case 1:
>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>> +			idx++;
>>>>>> +	}
>>>>>> +	}
>>>>>> +	return 0;
>>>>>> +}
>>>>> This is weird. Why not just use Duff's device in a more normal manner.
>>>> But it is a sort of Duff's method.
>>>> Not sure what looks weird to you here?
>>>> while () {} instead of do {} while();?
>>>> Konstantin
>>>>
>>>>
>>>>
>>> It is unusual to have cases not associated with block of the switch.
>>> Unusual to me means, "not used commonly in most code".
>>>
>>> Since you are jumping into the loop, might make more sense as a do { } while()
>> I find this a very odd coding practice and I would suggest we not do this, unless it gives us some great performance gain.
>>
>> Keith
>The loop unwinding could give performance gain. The only problem is the
>switch/loop combination makes people feel weird at the first glance but
>soon they will grasp this style. Since this is inherited from old famous
>duff's device, i prefer to keep this style which saves lines of code.

Please add a comment to the code to reflex where this style came from and why you are using it, would be very handy here.

>>>
>>
>> Regards,
>> Keith
>>
>>
>>
>>
>
>


Regards,
Keith





^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-21 17:20               ` Wiles, Keith
@ 2015-12-21 21:30                 ` Thomas Monjalon
  2015-12-22  1:58                   ` Xie, Huawei
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Monjalon @ 2015-12-21 21:30 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev

2015-12-21 17:20, Wiles, Keith:
> On 12/21/15, 9:21 AM, "Xie, Huawei" <huawei.xie@intel.com> wrote:
> >On 12/19/2015 3:27 AM, Wiles, Keith wrote:
> >> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces@dpdk.org on behalf of stephen@networkplumber.org> wrote:
> >>> On Fri, 18 Dec 2015 10:44:02 +0000
> >>> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> >>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> >>>>> On Mon, 14 Dec 2015 09:14:41 +0800
> >>>>> Huawei Xie <huawei.xie@intel.com> wrote:
> >>>>>> +	switch (count % 4) {
> >>>>>> +	while (idx != count) {
> >>>>>> +		case 0:
> >>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> >>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
> >>>>>> +			idx++;
> >>>>>> +		case 3:
> >>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> >>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
> >>>>>> +			idx++;
> >>>>>> +		case 2:
> >>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> >>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
> >>>>>> +			idx++;
> >>>>>> +		case 1:
> >>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> >>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
> >>>>>> +			idx++;
> >>>>>> +	}
> >>>>>> +	}
> >>>>>> +	return 0;
> >>>>>> +}
> >>>>> This is weird. Why not just use Duff's device in a more normal manner.
> >>>> But it is a sort of Duff's method.
> >>>> Not sure what looks weird to you here?
> >>>> while () {} instead of do {} while();?
> >>>> Konstantin
> >>>>
> >>>>
> >>>>
> >>> It is unusual to have cases not associated with block of the switch.
> >>> Unusual to me means, "not used commonly in most code".
> >>>
> >>> Since you are jumping into the loop, might make more sense as a do { } while()
> >> I find this a very odd coding practice and I would suggest we not do this, unless it gives us some great performance gain.
> >>
> >> Keith
> >The loop unwinding could give performance gain. The only problem is the
> >switch/loop combination makes people feel weird at the first glance but
> >soon they will grasp this style. Since this is inherited from old famous
> >duff's device, i prefer to keep this style which saves lines of code.
> 
> Please add a comment to the code to reflex where this style came from and why you are using it, would be very handy here.

+1
At least the words "loop" and "unwinding" may be helpful to some readers.
Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-21 15:21             ` Xie, Huawei
  2015-12-21 17:20               ` Wiles, Keith
@ 2015-12-21 22:34               ` Don Provan
  1 sibling, 0 replies; 54+ messages in thread
From: Don Provan @ 2015-12-21 22:34 UTC (permalink / raw)
  To: Xie, Huawei, Wiles, Keith, Stephen Hemminger, Ananyev, Konstantin; +Cc: dev

>From: Xie, Huawei [mailto:huawei.xie@intel.com] 
>Sent: Monday, December 21, 2015 7:22 AM
>Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>
>The loop unwinding could give performance gain. The only problem is the switch/loop
>combination makes people feel weird at the first glance but soon they will grasp this style.
>Since this is inherited from old famous duff's device, i prefer to keep this style which saves
>lines of code.

You don't really mean "lines of code", of course, since it increases the lines of code.
It reduces the number of branches.

Is Duff's Device used in other "bulk" routines? If not, what justifies making this a special case?

-don provan
dprovan@bivio.net

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-21 21:30                 ` Thomas Monjalon
@ 2015-12-22  1:58                   ` Xie, Huawei
  0 siblings, 0 replies; 54+ messages in thread
From: Xie, Huawei @ 2015-12-22  1:58 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On 12/22/2015 5:32 AM, Thomas Monjalon wrote:
> 2015-12-21 17:20, Wiles, Keith:
>> On 12/21/15, 9:21 AM, "Xie, Huawei" <huawei.xie@intel.com> wrote:
>>> On 12/19/2015 3:27 AM, Wiles, Keith wrote:
>>>> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces@dpdk.org on behalf of stephen@networkplumber.org> wrote:
>>>>> On Fri, 18 Dec 2015 10:44:02 +0000
>>>>> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
>>>>>>> On Mon, 14 Dec 2015 09:14:41 +0800
>>>>>>> Huawei Xie <huawei.xie@intel.com> wrote:
>>>>>>>> +	switch (count % 4) {
>>>>>>>> +	while (idx != count) {
>>>>>>>> +		case 0:
>>>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>>>> +			idx++;
>>>>>>>> +		case 3:
>>>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>>>> +			idx++;
>>>>>>>> +		case 2:
>>>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>>>> +			idx++;
>>>>>>>> +		case 1:
>>>>>>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>>>>>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>>>>>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>>>>>>> +			idx++;
>>>>>>>> +	}
>>>>>>>> +	}
>>>>>>>> +	return 0;
>>>>>>>> +}
>>>>>>> This is weird. Why not just use Duff's device in a more normal manner.
>>>>>> But it is a sort of Duff's method.
>>>>>> Not sure what looks weird to you here?
>>>>>> while () {} instead of do {} while();?
>>>>>> Konstantin
>>>>>>
>>>>>>
>>>>>>
>>>>> It is unusual to have cases not associated with block of the switch.
>>>>> Unusual to me means, "not used commonly in most code".
>>>>>
>>>>> Since you are jumping into the loop, might make more sense as a do { } while()
>>>> I find this a very odd coding practice and I would suggest we not do this, unless it gives us some great performance gain.
>>>>
>>>> Keith
>>> The loop unwinding could give performance gain. The only problem is the
>>> switch/loop combination makes people feel weird at the first glance but
>>> soon they will grasp this style. Since this is inherited from old famous
>>> duff's device, i prefer to keep this style which saves lines of code.
>> Please add a comment to the code to reflex where this style came from and why you are using it, would be very handy here.
> +1
> At least the words "loop" and "unwinding" may be helpful to some readers.
OK. Will add more context. Probably the wiki page for duff's device
should be updated on how to handle the case count is zero, using while()
or add one line to check.

> Thanks
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue
  2015-12-14  1:14 ` [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2015-12-14  1:14   ` [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2015-12-14  1:14   ` [PATCH v2 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
@ 2015-12-22 16:17   ` Huawei Xie
  2015-12-22 16:17     ` [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2015-12-22 16:17     ` [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  2 siblings, 2 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-22 16:17 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop in rte_pktmbuf_alloc_bulk to help the performance

For symmetric rte_pktmbuf_free_bulk, if the app knows in its scenarios
their mbufs are all simple mbufs, i.e meet the following requirements:
 * no multiple segments
 * not indirect mbuf
 * refcnt is 1
 * belong to the same mbuf memory pool,
it could directly call rte_mempool_put to free the bulk of mbufs,
otherwise rte_pktmbuf_free_bulk has to call rte_pktmbuf_free to free
the mbuf one by one.
This patchset will not provide this symmetric implementation.



Huawei Xie (2):
  mbuf: provide rte_pktmbuf_alloc_bulk API
  vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue

 lib/librte_mbuf/rte_mbuf.h    | 49 +++++++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost_rxtx.c | 35 +++++++++++++++++++------------
 2 files changed, 71 insertions(+), 13 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-22 16:17   ` [PATCH v3 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2015-12-22 16:17     ` Huawei Xie
  2015-12-23 18:37       ` Stephen Hemminger
  2015-12-22 16:17     ` [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  1 sibling, 1 reply; 54+ messages in thread
From: Huawei Xie @ 2015-12-22 16:17 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Attached the wiki page about duff's device. It explains the performance
optimization through loop unwinding, and also the most dramatic use of
case label fall-through.
https://en.wikipedia.org/wiki/Duff%27s_device

In our implementation, we use while() loop rather than do{} while() loop
because we could not assume count is strictly positive. Using while()
loop saves one line of check if count is zero.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 49 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..3381c28 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,55 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx = 0;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	switch (count % 4) {
+	case 0: while (idx != count) {
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	}
+	}
+	return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-22 16:17   ` [PATCH v3 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2015-12-22 16:17     ` [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-22 16:17     ` Huawei Xie
  2015-12-23 11:22       ` linhaifeng
  1 sibling, 1 reply; 54+ messages in thread
From: Huawei Xie @ 2015-12-22 16:17 UTC (permalink / raw)
  To: dev; +Cc: dprovan

pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..0faae58 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -576,6 +576,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t i;
 	uint16_t free_entries, entry_success = 0;
 	uint16_t avail_idx;
+	uint8_t alloc_err = 0;
+	uint8_t seg_num;
 
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
 		RTE_LOG(ERR, VHOST_DATA,
@@ -609,6 +611,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
 			dev->device_fh, free_entries);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(mbuf_pool,
+		pkts, free_entries)) < 0) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to bulk allocating %d mbufs\n", free_entries);
+		return 0;
+	}
+
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < free_entries; i++)
 		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
@@ -621,9 +631,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t vb_avail, vb_offset;
 		uint32_t seg_avail, seg_offset;
 		uint32_t cpy_len;
-		uint32_t seg_num = 0;
+		seg_num = 0;
 		struct rte_mbuf *cur;
-		uint8_t alloc_err = 0;
+
 
 		desc = &vq->desc[head[entry_success]];
 
@@ -654,13 +664,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
 
-		/* Allocate an mbuf and populate the structure. */
-		m = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(m == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
-			break;
-		}
+		prev = cur = m = pkts[entry_success];
 		seg_offset = 0;
 		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
 		cpy_len = RTE_MIN(vb_avail, seg_avail);
@@ -668,8 +672,6 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
 
 		seg_num++;
-		cur = m;
-		prev = m;
 		while (cpy_len != 0) {
 			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 				(void *)((uintptr_t)(vb_addr + vb_offset)),
@@ -761,16 +763,23 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 			cpy_len = RTE_MIN(vb_avail, seg_avail);
 		}
 
-		if (unlikely(alloc_err == 1))
+		if (unlikely(alloc_err))
 			break;
 
 		m->nb_segs = seg_num;
 
-		pkts[entry_success] = m;
 		vq->last_used_idx++;
 		entry_success++;
 	}
 
+	if (unlikely(alloc_err)) {
+		uint16_t i = entry_success;
+
+		m->nb_segs = seg_num;
+		for (; i < free_entries; i++)
+			rte_pktmbuf_free(pkts[entry_success]);
+	}
+
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
 	/* Kick guest if required. */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
                   ` (2 preceding siblings ...)
  2015-12-14  1:14 ` [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2015-12-22 23:05 ` Huawei Xie
  2015-12-22 23:05   ` [PATCH v4 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2015-12-22 23:05   ` [PATCH v4 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  2015-12-27 16:38 ` [PATCH v5 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-22 23:05 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v4 changes:
 fix a silly typo in error handling when rte_pktmbuf_alloc fails

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop in rte_pktmbuf_alloc_bulk to help the performance

For symmetric rte_pktmbuf_free_bulk, if the app knows in its scenarios
their mbufs are all simple mbufs, i.e meet the following requirements:
 * no multiple segments
 * not indirect mbuf
 * refcnt is 1
 * belong to the same mbuf memory pool,
it could directly call rte_mempool_put to free the bulk of mbufs,
otherwise rte_pktmbuf_free_bulk has to call rte_pktmbuf_free to free
the mbuf one by one.
This patchset will not provide this symmetric implementation.

Huawei Xie (2):
  mbuf: provide rte_pktmbuf_alloc_bulk API
  vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue

 lib/librte_mbuf/rte_mbuf.h    | 49 +++++++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost_rxtx.c | 35 +++++++++++++++++++------------
 2 files changed, 71 insertions(+), 13 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v4 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-22 23:05 ` [PATCH v4 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2015-12-22 23:05   ` Huawei Xie
  2015-12-22 23:05   ` [PATCH v4 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-22 23:05 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Attached the wiki page about duff's device. It explains the performance
optimization through loop unwinding, and also the most dramatic use of
case label fall-through.
https://en.wikipedia.org/wiki/Duff%27s_device

In our implementation, we use while() loop rather than do{} while() loop
because we could not assume count is strictly positive. Using while()
loop saves one line of check if count is zero.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 49 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..3381c28 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,55 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx = 0;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	switch (count % 4) {
+	case 0: while (idx != count) {
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	}
+	}
+	return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-22 23:05 ` [PATCH v4 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2015-12-22 23:05   ` [PATCH v4 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-22 23:05   ` Huawei Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-22 23:05 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v4 changes:
 fix a silly typo in error handling when rte_pktmbuf_alloc fails
reported by haifeng

pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..f10d534 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -576,6 +576,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t i;
 	uint16_t free_entries, entry_success = 0;
 	uint16_t avail_idx;
+	uint8_t alloc_err = 0;
+	uint8_t seg_num;
 
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
 		RTE_LOG(ERR, VHOST_DATA,
@@ -609,6 +611,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
 			dev->device_fh, free_entries);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(mbuf_pool,
+		pkts, free_entries)) < 0) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to bulk allocating %d mbufs\n", free_entries);
+		return 0;
+	}
+
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < free_entries; i++)
 		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
@@ -621,9 +631,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t vb_avail, vb_offset;
 		uint32_t seg_avail, seg_offset;
 		uint32_t cpy_len;
-		uint32_t seg_num = 0;
+		seg_num = 0;
 		struct rte_mbuf *cur;
-		uint8_t alloc_err = 0;
+
 
 		desc = &vq->desc[head[entry_success]];
 
@@ -654,13 +664,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
 
-		/* Allocate an mbuf and populate the structure. */
-		m = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(m == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
-			break;
-		}
+		prev = cur = m = pkts[entry_success];
 		seg_offset = 0;
 		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
 		cpy_len = RTE_MIN(vb_avail, seg_avail);
@@ -668,8 +672,6 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
 
 		seg_num++;
-		cur = m;
-		prev = m;
 		while (cpy_len != 0) {
 			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 				(void *)((uintptr_t)(vb_addr + vb_offset)),
@@ -761,16 +763,23 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 			cpy_len = RTE_MIN(vb_avail, seg_avail);
 		}
 
-		if (unlikely(alloc_err == 1))
+		if (unlikely(alloc_err))
 			break;
 
 		m->nb_segs = seg_num;
 
-		pkts[entry_success] = m;
 		vq->last_used_idx++;
 		entry_success++;
 	}
 
+	if (unlikely(alloc_err)) {
+		uint16_t i = entry_success;
+
+		m->nb_segs = seg_num;
+		for (; i < free_entries; i++)
+			rte_pktmbuf_free(pkts[i]);
+	}
+
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
 	/* Kick guest if required. */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-22 16:17     ` [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
@ 2015-12-23 11:22       ` linhaifeng
  2015-12-23 11:39         ` Xie, Huawei
  0 siblings, 1 reply; 54+ messages in thread
From: linhaifeng @ 2015-12-23 11:22 UTC (permalink / raw)
  To: dev



>  
> +	if (unlikely(alloc_err)) {
> +		uint16_t i = entry_success;
> +
> +		m->nb_segs = seg_num;
> +		for (; i < free_entries; i++)
> +			rte_pktmbuf_free(pkts[entry_success]); -> rte_pktmbuf_free(pkts[i]);
> +	}
> +
>  	rte_compiler_barrier();
>  	vq->used->idx += entry_success;
>  	/* Kick guest if required. */
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-23 11:22       ` linhaifeng
@ 2015-12-23 11:39         ` Xie, Huawei
  0 siblings, 0 replies; 54+ messages in thread
From: Xie, Huawei @ 2015-12-23 11:39 UTC (permalink / raw)
  To: linhaifeng, dev

On 12/23/2015 7:25 PM, linhaifeng wrote:
>
>>  
>> +	if (unlikely(alloc_err)) {
>> +		uint16_t i = entry_success;
>> +
>> +		m->nb_segs = seg_num;
>> +		for (; i < free_entries; i++)
>> +			rte_pktmbuf_free(pkts[entry_success]); -> rte_pktmbuf_free(pkts[i]);
>> +	}
>> +
>>  	rte_compiler_barrier();
>>  	vq->used->idx += entry_success;
>>  	/* Kick guest if required. */
Very sorry for silly typo. Thanks!
>>
>
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-22 16:17     ` [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-23 18:37       ` Stephen Hemminger
  2015-12-23 18:49         ` Ananyev, Konstantin
  0 siblings, 1 reply; 54+ messages in thread
From: Stephen Hemminger @ 2015-12-23 18:37 UTC (permalink / raw)
  To: Huawei Xie; +Cc: dev, dprovan

On Wed, 23 Dec 2015 00:17:53 +0800
Huawei Xie <huawei.xie@intel.com> wrote:

> +
> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> +	if (unlikely(rc))
> +		return rc;
> +
> +	switch (count % 4) {
> +	case 0: while (idx != count) {
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	case 3:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	case 2:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	case 1:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	}
> +	}
> +	return 0;
> +}

Since function will not work if count can not be 0 (otherwise rte_mempool_get_bulk will fail),
why not:
	1. Document that assumption
	2. Use that assumption to speed up code.



	switch(count % 4) {
		do {
			case 0:
			...
			case 1:
			...
		} while (idx != count);
	}

Also you really need to add a big block comment about this loop, to explain
what it does and why.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-23 18:37       ` Stephen Hemminger
@ 2015-12-23 18:49         ` Ananyev, Konstantin
  2015-12-24  1:33           ` Xie, Huawei
  0 siblings, 1 reply; 54+ messages in thread
From: Ananyev, Konstantin @ 2015-12-23 18:49 UTC (permalink / raw)
  To: Stephen Hemminger, Xie, Huawei; +Cc: dev, dprovan



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Wednesday, December 23, 2015 6:38 PM
> To: Xie, Huawei
> Cc: dev@dpdk.org; dprovan@bivio.net
> Subject: Re: [dpdk-dev] [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
> 
> On Wed, 23 Dec 2015 00:17:53 +0800
> Huawei Xie <huawei.xie@intel.com> wrote:
> 
> > +
> > +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> > +	if (unlikely(rc))
> > +		return rc;
> > +
> > +	switch (count % 4) {
> > +	case 0: while (idx != count) {
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	case 3:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	case 2:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	case 1:
> > +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> > +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> > +			rte_pktmbuf_reset(mbufs[idx]);
> > +			idx++;
> > +	}
> > +	}
> > +	return 0;
> > +}
> 
> Since function will not work if count can not be 0 (otherwise rte_mempool_get_bulk will fail),

As I understand, rte_mempool_get_bulk() will work correctly and return 0, if count==0.
That's why Huawei prefers while() {}, instead of do {} while() - to avoid extra check for
(count != 0) at the start. 
Konstantin


> why not:
> 	1. Document that assumption
> 	2. Use that assumption to speed up code.
> 
> 
> 
> 	switch(count % 4) {
> 		do {
> 			case 0:
> 			...
> 			case 1:
> 			...
> 		} while (idx != count);
> 	}
> 
> Also you really need to add a big block comment about this loop, to explain
> what it does and why.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-23 18:49         ` Ananyev, Konstantin
@ 2015-12-24  1:33           ` Xie, Huawei
  0 siblings, 0 replies; 54+ messages in thread
From: Xie, Huawei @ 2015-12-24  1:33 UTC (permalink / raw)
  To: Ananyev, Konstantin, Stephen Hemminger; +Cc: dev, dprovan

On 12/24/2015 2:49 AM, Ananyev, Konstantin wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
>> Sent: Wednesday, December 23, 2015 6:38 PM
>> To: Xie, Huawei
>> Cc: dev@dpdk.org; dprovan@bivio.net
>> Subject: Re: [dpdk-dev] [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>>
>> On Wed, 23 Dec 2015 00:17:53 +0800
>> Huawei Xie <huawei.xie@intel.com> wrote:
>>
>>> +
>>> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>>> +	if (unlikely(rc))
>>> +		return rc;
>>> +
>>> +	switch (count % 4) {
>>> +	case 0: while (idx != count) {
>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>> +			idx++;
>>> +	case 3:
>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>> +			idx++;
>>> +	case 2:
>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>> +			idx++;
>>> +	case 1:
>>> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>>> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
>>> +			rte_pktmbuf_reset(mbufs[idx]);
>>> +			idx++;
>>> +	}
>>> +	}
>>> +	return 0;
>>> +}
>> Since function will not work if count can not be 0 (otherwise rte_mempool_get_bulk will fail),
> As I understand, rte_mempool_get_bulk() will work correctly and return 0, if count==0.
> That's why Huawei prefers while() {}, instead of do {} while() - to avoid extra check for
> (count != 0) at the start. 
> Konstantin

Yes.

>
>
>> why not:
>> 	1. Document that assumption
>> 	2. Use that assumption to speed up code.
>>
>>
>>
>> 	switch(count % 4) {
>> 		do {
>> 			case 0:
>> 			...
>> 			case 1:
>> 			...
>> 		} while (idx != count);
>> 	}
>>
>> Also you really need to add a big block comment about this loop, to explain
>> what it does and why.

Since we change duff's implementation a bit, and for people who don't
know duff's device, we could add comment.
Is comment like this enough?
Use Duff's device to unroll the loop a bit to gain more performance
Use while() rather than do {} while() as count could be zero.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v5 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
                   ` (3 preceding siblings ...)
  2015-12-22 23:05 ` [PATCH v4 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2015-12-27 16:38 ` Huawei Xie
  2015-12-27 16:38   ` [PATCH v5 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2015-12-27 16:38   ` [PATCH v5 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  2016-01-26 17:03 ` [PATCH v6 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2016-02-28 12:44 ` [PATCH v7] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  6 siblings, 2 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-27 16:38 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v5 changes:
 add comment about duff's device and our variant implementation

v4 changes:
 fix a silly typo in error handling when rte_pktmbuf_alloc fails

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop in rte_pktmbuf_alloc_bulk to help the performance

For symmetric rte_pktmbuf_free_bulk, if the app knows in its scenarios
their mbufs are all simple mbufs, i.e meet the following requirements:
 * no multiple segments
 * not indirect mbuf
 * refcnt is 1
 * belong to the same mbuf memory pool,
it could directly call rte_mempool_put to free the bulk of mbufs,
otherwise rte_pktmbuf_free_bulk has to call rte_pktmbuf_free to free
the mbuf one by one.
This patchset will not provide this symmetric implementation.

Huawei Xie (2):
  mbuf: provide rte_pktmbuf_alloc_bulk API
  vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue

 lib/librte_mbuf/rte_mbuf.h    | 55 +++++++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost_rxtx.c | 35 +++++++++++++++++----------
 2 files changed, 77 insertions(+), 13 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v5 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-27 16:38 ` [PATCH v5 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2015-12-27 16:38   ` Huawei Xie
  2015-12-27 16:38   ` [PATCH v5 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-27 16:38 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v5 changes:
 add comment about duff's device and our variant implementation
 revise the code style a bit

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Attached the wiki page about duff's device. It explains the performance
optimization through loop unwinding, and also the most dramatic use of
case label fall-through.
https://en.wikipedia.org/wiki/Duff%27s_device

In our implementation, we use while() loop rather than do{} while() loop
because we could not assume count is strictly positive. Using while()
loop saves one line of check if count is zero.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..b2ed479 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,61 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx = 0;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	/* To understand duff's device on loop unwinding optimization, see
+	 * https://en.wikipedia.org/wiki/Duff's_device.
+	 * Here while() loop is used rather than do() while{} to avoid extra
+	 * check if count is zero.
+	 */
+	switch (count % 4) {
+	case 0:
+		while (idx != count) {
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+		}
+	}
+	return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v5 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2015-12-27 16:38 ` [PATCH v5 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2015-12-27 16:38   ` [PATCH v5 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2015-12-27 16:38   ` Huawei Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2015-12-27 16:38 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v4 changes:
 fix a silly typo in error handling when rte_pktmbuf_alloc fails
reported by haifeng

pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..f10d534 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -576,6 +576,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t i;
 	uint16_t free_entries, entry_success = 0;
 	uint16_t avail_idx;
+	uint8_t alloc_err = 0;
+	uint8_t seg_num;
 
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
 		RTE_LOG(ERR, VHOST_DATA,
@@ -609,6 +611,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
 			dev->device_fh, free_entries);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(mbuf_pool,
+		pkts, free_entries)) < 0) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to bulk allocating %d mbufs\n", free_entries);
+		return 0;
+	}
+
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < free_entries; i++)
 		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
@@ -621,9 +631,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t vb_avail, vb_offset;
 		uint32_t seg_avail, seg_offset;
 		uint32_t cpy_len;
-		uint32_t seg_num = 0;
+		seg_num = 0;
 		struct rte_mbuf *cur;
-		uint8_t alloc_err = 0;
+
 
 		desc = &vq->desc[head[entry_success]];
 
@@ -654,13 +664,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
 
-		/* Allocate an mbuf and populate the structure. */
-		m = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(m == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
-			break;
-		}
+		prev = cur = m = pkts[entry_success];
 		seg_offset = 0;
 		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
 		cpy_len = RTE_MIN(vb_avail, seg_avail);
@@ -668,8 +672,6 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
 
 		seg_num++;
-		cur = m;
-		prev = m;
 		while (cpy_len != 0) {
 			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 				(void *)((uintptr_t)(vb_addr + vb_offset)),
@@ -761,16 +763,23 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 			cpy_len = RTE_MIN(vb_avail, seg_avail);
 		}
 
-		if (unlikely(alloc_err == 1))
+		if (unlikely(alloc_err))
 			break;
 
 		m->nb_segs = seg_num;
 
-		pkts[entry_success] = m;
 		vq->last_used_idx++;
 		entry_success++;
 	}
 
+	if (unlikely(alloc_err)) {
+		uint16_t i = entry_success;
+
+		m->nb_segs = seg_num;
+		for (; i < free_entries; i++)
+			rte_pktmbuf_free(pkts[i]);
+	}
+
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
 	/* Kick guest if required. */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v6 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
                   ` (4 preceding siblings ...)
  2015-12-27 16:38 ` [PATCH v5 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2016-01-26 17:03 ` Huawei Xie
  2016-01-26 17:03   ` [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  2016-01-26 17:03   ` [PATCH v6 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  2016-02-28 12:44 ` [PATCH v7] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
  6 siblings, 2 replies; 54+ messages in thread
From: Huawei Xie @ 2016-01-26 17:03 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v6 changes:
 reflect the changes in release notes and library version map file
 revise our duff's code style a bit to make it more readable

v5 changes:
 add comment about duff's device and our variant implementation

v4 changes:
 fix a silly typo in error handling when rte_pktmbuf_alloc fails

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop in rte_pktmbuf_alloc_bulk to help the performance

For symmetric rte_pktmbuf_free_bulk, if the app knows in its scenarios
their mbufs are all simple mbufs, i.e meet the following requirements:
 * no multiple segments
 * not indirect mbuf
 * refcnt is 1
 * belong to the same mbuf memory pool,
it could directly call rte_mempool_put to free the bulk of mbufs,
otherwise rte_pktmbuf_free_bulk has to call rte_pktmbuf_free to free
the mbuf one by one.
This patchset will not provide this symmetric implementation.

Huawei Xie (2):
  mbuf: provide rte_pktmbuf_alloc_bulk API
  vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue

 doc/guides/rel_notes/release_2_3.rst |  3 ++
 lib/librte_mbuf/rte_mbuf.h           | 55 ++++++++++++++++++++++++++++++++++++
 lib/librte_mbuf/rte_mbuf_version.map |  7 +++++
 lib/librte_vhost/vhost_rxtx.c        | 35 ++++++++++++++---------
 4 files changed, 87 insertions(+), 13 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-01-26 17:03 ` [PATCH v6 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2016-01-26 17:03   ` Huawei Xie
  2016-01-27 13:56     ` Panu Matilainen
  2016-01-26 17:03   ` [PATCH v6 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
  1 sibling, 1 reply; 54+ messages in thread
From: Huawei Xie @ 2016-01-26 17:03 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v6 changes:
 reflect the changes in release notes and library version map file
 revise our duff's code style a bit to make it more readable

v5 changes:
 add comment about duff's device and our variant implementation

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Attached the wiki page about duff's device. It explains the performance
optimization through loop unwinding, and also the most dramatic use of
case label fall-through.
https://en.wikipedia.org/wiki/Duff%27s_device

In our implementation, we use while() loop rather than do{} while() loop
because we could not assume count is strictly positive. Using while()
loop saves one line of check if count is zero.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/rel_notes/release_2_3.rst |  3 ++
 lib/librte_mbuf/rte_mbuf.h           | 55 ++++++++++++++++++++++++++++++++++++
 lib/librte_mbuf/rte_mbuf_version.map |  7 +++++
 3 files changed, 65 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
index 99de186..a52cba3 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -4,6 +4,9 @@ DPDK Release 2.3
 New Features
 ------------
 
+* **Enable bulk allocation of mbufs. **
+  A new function ``rte_pktmbuf_alloc_bulk()`` has been added to allow the user
+  to allocate a bulk of mbufs.
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..b2ed479 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,61 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx = 0;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	/* To understand duff's device on loop unwinding optimization, see
+	 * https://en.wikipedia.org/wiki/Duff's_device.
+	 * Here while() loop is used rather than do() while{} to avoid extra
+	 * check if count is zero.
+	 */
+	switch (count % 4) {
+	case 0:
+		while (idx != count) {
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+		}
+	}
+	return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
diff --git a/lib/librte_mbuf/rte_mbuf_version.map b/lib/librte_mbuf/rte_mbuf_version.map
index e10f6bd..257c65a 100644
--- a/lib/librte_mbuf/rte_mbuf_version.map
+++ b/lib/librte_mbuf/rte_mbuf_version.map
@@ -18,3 +18,10 @@ DPDK_2.1 {
 	rte_pktmbuf_pool_create;
 
 } DPDK_2.0;
+
+DPDK_2.3 {
+	global:
+
+	rte_pktmbuf_alloc_bulk;
+
+} DPDK_2.1;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v6 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue
  2016-01-26 17:03 ` [PATCH v6 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
  2016-01-26 17:03   ` [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2016-01-26 17:03   ` Huawei Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Huawei Xie @ 2016-01-26 17:03 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v4 changes:
 fix a silly typo in error handling when rte_pktmbuf_alloc fails
reported by haifeng

pre-allocate a bulk of mbufs instead of allocating one mbuf a time on demand

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..f10d534 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -576,6 +576,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t i;
 	uint16_t free_entries, entry_success = 0;
 	uint16_t avail_idx;
+	uint8_t alloc_err = 0;
+	uint8_t seg_num;
 
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
 		RTE_LOG(ERR, VHOST_DATA,
@@ -609,6 +611,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
 			dev->device_fh, free_entries);
+
+	if (unlikely(rte_pktmbuf_alloc_bulk(mbuf_pool,
+		pkts, free_entries)) < 0) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to bulk allocating %d mbufs\n", free_entries);
+		return 0;
+	}
+
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < free_entries; i++)
 		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
@@ -621,9 +631,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t vb_avail, vb_offset;
 		uint32_t seg_avail, seg_offset;
 		uint32_t cpy_len;
-		uint32_t seg_num = 0;
+		seg_num = 0;
 		struct rte_mbuf *cur;
-		uint8_t alloc_err = 0;
+
 
 		desc = &vq->desc[head[entry_success]];
 
@@ -654,13 +664,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
 
-		/* Allocate an mbuf and populate the structure. */
-		m = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(m == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
-			break;
-		}
+		prev = cur = m = pkts[entry_success];
 		seg_offset = 0;
 		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
 		cpy_len = RTE_MIN(vb_avail, seg_avail);
@@ -668,8 +672,6 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
 
 		seg_num++;
-		cur = m;
-		prev = m;
 		while (cpy_len != 0) {
 			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
 				(void *)((uintptr_t)(vb_addr + vb_offset)),
@@ -761,16 +763,23 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 			cpy_len = RTE_MIN(vb_avail, seg_avail);
 		}
 
-		if (unlikely(alloc_err == 1))
+		if (unlikely(alloc_err))
 			break;
 
 		m->nb_segs = seg_num;
 
-		pkts[entry_success] = m;
 		vq->last_used_idx++;
 		entry_success++;
 	}
 
+	if (unlikely(alloc_err)) {
+		uint16_t i = entry_success;
+
+		m->nb_segs = seg_num;
+		for (; i < free_entries; i++)
+			rte_pktmbuf_free(pkts[i]);
+	}
+
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
 	/* Kick guest if required. */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-01-26 17:03   ` [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2016-01-27 13:56     ` Panu Matilainen
  2016-02-03 17:23       ` Olivier MATZ
  0 siblings, 1 reply; 54+ messages in thread
From: Panu Matilainen @ 2016-01-27 13:56 UTC (permalink / raw)
  To: Huawei Xie, dev; +Cc: dprovan

On 01/26/2016 07:03 PM, Huawei Xie wrote:
> v6 changes:
>   reflect the changes in release notes and library version map file
>   revise our duff's code style a bit to make it more readable
>
> v5 changes:
>   add comment about duff's device and our variant implementation
>
> v3 changes:
>   move while after case 0
>   add context about duff's device and why we use while loop in the commit
> message
>
> v2 changes:
>   unroll the loop a bit to help the performance
>
> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>
> There is related thread about this bulk API.
> http://dpdk.org/dev/patchwork/patch/4718/
> Thanks to Konstantin's loop unrolling.
>
> Attached the wiki page about duff's device. It explains the performance
> optimization through loop unwinding, and also the most dramatic use of
> case label fall-through.
> https://en.wikipedia.org/wiki/Duff%27s_device
>
> In our implementation, we use while() loop rather than do{} while() loop
> because we could not assume count is strictly positive. Using while()
> loop saves one line of check if count is zero.
>
> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>   doc/guides/rel_notes/release_2_3.rst |  3 ++
>   lib/librte_mbuf/rte_mbuf.h           | 55 ++++++++++++++++++++++++++++++++++++
>   lib/librte_mbuf/rte_mbuf_version.map |  7 +++++
>   3 files changed, 65 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
> index 99de186..a52cba3 100644
> --- a/doc/guides/rel_notes/release_2_3.rst
> +++ b/doc/guides/rel_notes/release_2_3.rst
> @@ -4,6 +4,9 @@ DPDK Release 2.3
>   New Features
>   ------------
>
> +* **Enable bulk allocation of mbufs. **
> +  A new function ``rte_pktmbuf_alloc_bulk()`` has been added to allow the user
> +  to allocate a bulk of mbufs.
>
>   Resolved Issues
>   ---------------
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index f234ac9..b2ed479 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1336,6 +1336,61 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>   }
>
>   /**
> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> + * values.
> + *
> + *  @param pool
> + *    The mempool from which mbufs are allocated.
> + *  @param mbufs
> + *    Array of pointers to mbufs
> + *  @param count
> + *    Array size
> + *  @return
> + *   - 0: Success
> + */
> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> +	 struct rte_mbuf **mbufs, unsigned count)
> +{
> +	unsigned idx = 0;
> +	int rc;
> +
> +	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> +	if (unlikely(rc))
> +		return rc;
> +
> +	/* To understand duff's device on loop unwinding optimization, see
> +	 * https://en.wikipedia.org/wiki/Duff's_device.
> +	 * Here while() loop is used rather than do() while{} to avoid extra
> +	 * check if count is zero.
> +	 */
> +	switch (count % 4) {
> +	case 0:
> +		while (idx != count) {
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	case 3:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	case 2:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +	case 1:
> +			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> +			rte_mbuf_refcnt_set(mbufs[idx], 1);
> +			rte_pktmbuf_reset(mbufs[idx]);
> +			idx++;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/**
>    * Attach packet mbuf to another packet mbuf.
>    *
>    * After attachment we refer the mbuf we attached as 'indirect',
> diff --git a/lib/librte_mbuf/rte_mbuf_version.map b/lib/librte_mbuf/rte_mbuf_version.map
> index e10f6bd..257c65a 100644
> --- a/lib/librte_mbuf/rte_mbuf_version.map
> +++ b/lib/librte_mbuf/rte_mbuf_version.map
> @@ -18,3 +18,10 @@ DPDK_2.1 {
>   	rte_pktmbuf_pool_create;
>
>   } DPDK_2.0;
> +
> +DPDK_2.3 {
> +	global:
> +
> +	rte_pktmbuf_alloc_bulk;
> +
> +} DPDK_2.1;
>

Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of 
the library ABI and should not be listed in the version map.

I assume its inline for performance reasons, but then you lose the 
benefits of dynamic linking such as ability to fix bugs and/or improve 
itby just updating the library. Since the point of having a bulk API is 
to improve performance by reducing the number of calls required, does it 
really have to be inline? As in, have you actually measured the 
difference between inline and non-inline and decided its worth all the 
downsides?

	- Panu -

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-01-27 13:56     ` Panu Matilainen
@ 2016-02-03 17:23       ` Olivier MATZ
  2016-02-22 14:49         ` Xie, Huawei
  0 siblings, 1 reply; 54+ messages in thread
From: Olivier MATZ @ 2016-02-03 17:23 UTC (permalink / raw)
  To: Panu Matilainen, Huawei Xie, dev; +Cc: dprovan

Hi,

On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>
> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
> the library ABI and should not be listed in the version map.
>
> I assume its inline for performance reasons, but then you lose the
> benefits of dynamic linking such as ability to fix bugs and/or improve
> itby just updating the library. Since the point of having a bulk API is
> to improve performance by reducing the number of calls required, does it
> really have to be inline? As in, have you actually measured the
> difference between inline and non-inline and decided its worth all the
> downsides?

Agree with Panu. It would be interesting to compare the performance
between inline and non inline to decide whether inlining it or not.

Also, it would be nice to have a simple test function in
app/test/test_mbuf.c. For instance, you could update
test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
the mbuf allocation from the function. Then it could be called with
a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
all the mbufs of rte_pktmbuf_alloc_bulk().

Regards,
Olivier

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-03 17:23       ` Olivier MATZ
@ 2016-02-22 14:49         ` Xie, Huawei
  2016-02-23  5:35           ` Xie, Huawei
  0 siblings, 1 reply; 54+ messages in thread
From: Xie, Huawei @ 2016-02-22 14:49 UTC (permalink / raw)
  To: Olivier MATZ, Panu Matilainen, dev; +Cc: dprovan

On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> Hi,
>
> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>
>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>> the library ABI and should not be listed in the version map.
>>
>> I assume its inline for performance reasons, but then you lose the
>> benefits of dynamic linking such as ability to fix bugs and/or improve
>> itby just updating the library. Since the point of having a bulk API is
>> to improve performance by reducing the number of calls required, does it
>> really have to be inline? As in, have you actually measured the
>> difference between inline and non-inline and decided its worth all the
>> downsides?
>
> Agree with Panu. It would be interesting to compare the performance
> between inline and non inline to decide whether inlining it or not.

Will update after i gathered more data. inline could show obvious
performance difference in some cases.

>
> Also, it would be nice to have a simple test function in
> app/test/test_mbuf.c. For instance, you could update
> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
> the mbuf allocation from the function. Then it could be called with
> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
> all the mbufs of rte_pktmbuf_alloc_bulk().
>
> Regards,
> Olivier
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-22 14:49         ` Xie, Huawei
@ 2016-02-23  5:35           ` Xie, Huawei
  2016-02-24 12:11             ` Panu Matilainen
  2016-02-26  8:55             ` Olivier MATZ
  0 siblings, 2 replies; 54+ messages in thread
From: Xie, Huawei @ 2016-02-23  5:35 UTC (permalink / raw)
  To: Olivier MATZ, Panu Matilainen, dev; +Cc: dprovan

On 2/22/2016 10:52 PM, Xie, Huawei wrote:
> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>> Hi,
>>
>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>> the library ABI and should not be listed in the version map.
>>>
>>> I assume its inline for performance reasons, but then you lose the
>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>> itby just updating the library. Since the point of having a bulk API is
>>> to improve performance by reducing the number of calls required, does it
>>> really have to be inline? As in, have you actually measured the
>>> difference between inline and non-inline and decided its worth all the
>>> downsides?
>> Agree with Panu. It would be interesting to compare the performance
>> between inline and non inline to decide whether inlining it or not.
> Will update after i gathered more data. inline could show obvious
> performance difference in some cases.

Panu and Oliver:
I write a simple benchmark. This benchmark run 10M rounds, in each round
8 mbufs are allocated through bulk API, and then freed.
These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
Btw, i have removed some exceptional data, the frequency of which is
like 1/10. Sometimes observed user usage suddenly disappeared, no clue
what happened.

With 8 mbufs allocated, there is about 6% performance increase using inline.
inline            non-inline
2780738888        2950309416
2834853696        2951378072
2823015320        2954500888
2825060032        2958939912
2824499804        2898938284
2810859720        2944892796
2852229420        3014273296
2787308500        2956809852
2793337260        2958674900
2822223476        2954346352
2785455184        2925719136
2821528624        2937380416
2822922136        2974978604
2776645920        2947666548
2815952572        2952316900
2801048740        2947366984
2851462672        2946469004

With 16 mbufs allocated, we could still observe obvious performance
difference, though only 1%-2%

inline            non-inline
5519987084        5669902680
5538416096        5737646840
5578934064        5590165532
5548131972        5767926840
5625585696        5831345628
5558282876        5662223764
5445587768        5641003924
5559096320        5775258444
5656437988        5743969272
5440939404        5664882412
5498875968        5785138532
5561652808        5737123940
5515211716        5627775604
5550567140        5630790628
5665964280        5589568164
5591295900        5702697308

With 32/64 mbufs allocated, the deviation of the data itself would hide
the performance difference.

So we prefer using inline for performance.
>> Also, it would be nice to have a simple test function in
>> app/test/test_mbuf.c. For instance, you could update
>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>> the mbuf allocation from the function. Then it could be called with
>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>> all the mbufs of rte_pktmbuf_alloc_bulk().

Don't quite get you. Is it that we write two cases, one case allocate
mbuf through rte_pktmbuf_alloc_bulk and one use rte_pktmbuf_alloc? It is
good to have. I could do this after this patch.
>>
>> Regards,
>> Olivier
>>
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-23  5:35           ` Xie, Huawei
@ 2016-02-24 12:11             ` Panu Matilainen
  2016-02-24 13:23               ` Ananyev, Konstantin
  2016-02-26  8:55             ` Olivier MATZ
  1 sibling, 1 reply; 54+ messages in thread
From: Panu Matilainen @ 2016-02-24 12:11 UTC (permalink / raw)
  To: Xie, Huawei, Olivier MATZ, dev; +Cc: dprovan

On 02/23/2016 07:35 AM, Xie, Huawei wrote:
> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>> Hi,
>>>
>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>>> the library ABI and should not be listed in the version map.
>>>>
>>>> I assume its inline for performance reasons, but then you lose the
>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>>> itby just updating the library. Since the point of having a bulk API is
>>>> to improve performance by reducing the number of calls required, does it
>>>> really have to be inline? As in, have you actually measured the
>>>> difference between inline and non-inline and decided its worth all the
>>>> downsides?
>>> Agree with Panu. It would be interesting to compare the performance
>>> between inline and non inline to decide whether inlining it or not.
>> Will update after i gathered more data. inline could show obvious
>> performance difference in some cases.
>
> Panu and Oliver:
> I write a simple benchmark. This benchmark run 10M rounds, in each round
> 8 mbufs are allocated through bulk API, and then freed.
> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
> Btw, i have removed some exceptional data, the frequency of which is
> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
> what happened.
>
> With 8 mbufs allocated, there is about 6% performance increase using inline.
[...]
>
> With 16 mbufs allocated, we could still observe obvious performance
> difference, though only 1%-2%
>
[...]
>
> With 32/64 mbufs allocated, the deviation of the data itself would hide
> the performance difference.
> So we prefer using inline for performance.

At least I was more after real-world performance in a real-world 
use-case rather than CPU cycles in a microbenchmark, we know function 
calls have a cost but the benefits tend to outweight the cons.

Inline functions have their place and they're far less evil in project 
internal use, but in library public API they are BAD and should be ... 
well, not banned because there are exceptions to every rule, but highly 
discouraged.

	- Panu -

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-24 12:11             ` Panu Matilainen
@ 2016-02-24 13:23               ` Ananyev, Konstantin
  2016-02-26  7:39                 ` Xie, Huawei
  2016-02-29 10:51                 ` Panu Matilainen
  0 siblings, 2 replies; 54+ messages in thread
From: Ananyev, Konstantin @ 2016-02-24 13:23 UTC (permalink / raw)
  To: Panu Matilainen, Xie, Huawei, Olivier MATZ, dev; +Cc: dprovan

Hi Panu,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Panu Matilainen
> Sent: Wednesday, February 24, 2016 12:12 PM
> To: Xie, Huawei; Olivier MATZ; dev@dpdk.org
> Cc: dprovan@bivio.net
> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
> 
> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
> > On 2/22/2016 10:52 PM, Xie, Huawei wrote:
> >> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> >>> Hi,
> >>>
> >>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
> >>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
> >>>> the library ABI and should not be listed in the version map.
> >>>>
> >>>> I assume its inline for performance reasons, but then you lose the
> >>>> benefits of dynamic linking such as ability to fix bugs and/or improve
> >>>> itby just updating the library. Since the point of having a bulk API is
> >>>> to improve performance by reducing the number of calls required, does it
> >>>> really have to be inline? As in, have you actually measured the
> >>>> difference between inline and non-inline and decided its worth all the
> >>>> downsides?
> >>> Agree with Panu. It would be interesting to compare the performance
> >>> between inline and non inline to decide whether inlining it or not.
> >> Will update after i gathered more data. inline could show obvious
> >> performance difference in some cases.
> >
> > Panu and Oliver:
> > I write a simple benchmark. This benchmark run 10M rounds, in each round
> > 8 mbufs are allocated through bulk API, and then freed.
> > These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
> > 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
> > Btw, i have removed some exceptional data, the frequency of which is
> > like 1/10. Sometimes observed user usage suddenly disappeared, no clue
> > what happened.
> >
> > With 8 mbufs allocated, there is about 6% performance increase using inline.
> [...]
> >
> > With 16 mbufs allocated, we could still observe obvious performance
> > difference, though only 1%-2%
> >
> [...]
> >
> > With 32/64 mbufs allocated, the deviation of the data itself would hide
> > the performance difference.
> > So we prefer using inline for performance.
> 
> At least I was more after real-world performance in a real-world
> use-case rather than CPU cycles in a microbenchmark, we know function
> calls have a cost but the benefits tend to outweight the cons.
> 
> Inline functions have their place and they're far less evil in project
> internal use, but in library public API they are BAD and should be ...
> well, not banned because there are exceptions to every rule, but highly
> discouraged.

Why is that?
As you can see right now we have all mbuf alloc/free routines as static inline.
And I think we would like to keep it like that.
So why that particular function should be different?
After all that function is nothing more than a wrapper 
around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
So unless mempool get/put API would change, I can hardly see there could be any ABI
breakages in future. 
About 'real world' performance gain - it was a 'real world' performance problem,
that we tried to solve by introducing that function:
http://dpdk.org/ml/archives/dev/2015-May/017633.html

And according to the user feedback, it does help:  
http://dpdk.org/ml/archives/dev/2016-February/033203.html

Konstantin

> 
> 	- Panu -
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-24 13:23               ` Ananyev, Konstantin
@ 2016-02-26  7:39                 ` Xie, Huawei
  2016-02-26  8:45                   ` Olivier MATZ
  2016-02-29 10:51                 ` Panu Matilainen
  1 sibling, 1 reply; 54+ messages in thread
From: Xie, Huawei @ 2016-02-26  7:39 UTC (permalink / raw)
  To: Ananyev, Konstantin, Panu Matilainen, Olivier MATZ, dev; +Cc: dprovan

On 2/24/2016 9:23 PM, Ananyev, Konstantin wrote:
> Hi Panu,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Panu Matilainen
>> Sent: Wednesday, February 24, 2016 12:12 PM
>> To: Xie, Huawei; Olivier MATZ; dev@dpdk.org
>> Cc: dprovan@bivio.net
>> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>>
>> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
>>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>>>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>>>> Hi,
>>>>>
>>>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>>>>> the library ABI and should not be listed in the version map.
>>>>>>
>>>>>> I assume its inline for performance reasons, but then you lose the
>>>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>>>>> itby just updating the library. Since the point of having a bulk API is
>>>>>> to improve performance by reducing the number of calls required, does it
>>>>>> really have to be inline? As in, have you actually measured the
>>>>>> difference between inline and non-inline and decided its worth all the
>>>>>> downsides?
>>>>> Agree with Panu. It would be interesting to compare the performance
>>>>> between inline and non inline to decide whether inlining it or not.
>>>> Will update after i gathered more data. inline could show obvious
>>>> performance difference in some cases.
>>> Panu and Oliver:
>>> I write a simple benchmark. This benchmark run 10M rounds, in each round
>>> 8 mbufs are allocated through bulk API, and then freed.
>>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
>>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
>>> Btw, i have removed some exceptional data, the frequency of which is
>>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
>>> what happened.
>>>
>>> With 8 mbufs allocated, there is about 6% performance increase using inline.
>> [...]
>>> With 16 mbufs allocated, we could still observe obvious performance
>>> difference, though only 1%-2%
>>>
>> [...]
>>> With 32/64 mbufs allocated, the deviation of the data itself would hide
>>> the performance difference.
>>> So we prefer using inline for performance.
>> At least I was more after real-world performance in a real-world
>> use-case rather than CPU cycles in a microbenchmark, we know function
>> calls have a cost but the benefits tend to outweight the cons.

It depends on what could be called the real world case. It could be
argued. I think the case Konstantin mentioned could be called a real
world one.
If your opinion on whether use benchmark or real-world use case is not
specific to this bulk API, then i have different opinion. For example,
for kernel virtio optimization, people use vring bench. We couldn't
guarantee each small optimization could bring obvious performance gain
in some big workload. The gain could be hided if bottleneck is
elsewhere, so i also plan to build such kind of virtio bench in DPDK.

Finally, i am open to inline or not, but currently priority better goes
with performance. If we make it an API now, we couldn't easily step back
in future; But we could change otherwise, after we have more confidence.
We could even check every inline "API", whether it should be inline or
be in the lib.

>>
>> Inline functions have their place and they're far less evil in project
>> internal use, but in library public API they are BAD and should be ...
>> well, not banned because there are exceptions to every rule, but highly
>> discouraged.
> Why is that?
> As you can see right now we have all mbuf alloc/free routines as static inline.
> And I think we would like to keep it like that.
> So why that particular function should be different?
> After all that function is nothing more than a wrapper 
> around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
> So unless mempool get/put API would change, I can hardly see there could be any ABI
> breakages in future. 
> About 'real world' performance gain - it was a 'real world' performance problem,
> that we tried to solve by introducing that function:
> http://dpdk.org/ml/archives/dev/2015-May/017633.html
>
> And according to the user feedback, it does help:  
> http://dpdk.org/ml/archives/dev/2016-February/033203.html
>
> Konstantin
>
>> 	- Panu -
>>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-26  7:39                 ` Xie, Huawei
@ 2016-02-26  8:45                   ` Olivier MATZ
  0 siblings, 0 replies; 54+ messages in thread
From: Olivier MATZ @ 2016-02-26  8:45 UTC (permalink / raw)
  To: Xie, Huawei, Ananyev, Konstantin, Panu Matilainen, dev; +Cc: dprovan



On 02/26/2016 08:39 AM, Xie, Huawei wrote:
>>>> With 8 mbufs allocated, there is about 6% performance increase using inline.
>>>> With 16 mbufs allocated, we could still observe obvious performance
>>>> difference, though only 1%-2%
> 

> On 2/24/2016 9:23 PM, Ananyev, Konstantin wrote:
>> As you can see right now we have all mbuf alloc/free routines as static inline.
>> And I think we would like to keep it like that.
>> So why that particular function should be different?
>> After all that function is nothing more than a wrapper 
>> around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
>> So unless mempool get/put API would change, I can hardly see there could be any ABI
>> breakages in future. 
>> About 'real world' performance gain - it was a 'real world' performance problem,
>> that we tried to solve by introducing that function:
>> http://dpdk.org/ml/archives/dev/2015-May/017633.html
>>
>> And according to the user feedback, it does help:  
>> http://dpdk.org/ml/archives/dev/2016-February/033203.html

For me, there's no doubt this function will help in real world use
cases. That's also true that today most (oh no, all) datapath mbuf
functions are inline. Although I understand Panu's point of view
about the use of inline functions, trying to de-inline some functions
of the mbuf API (and others APIs like mempool or ring) would require
a deep analysis first to check the performance impact. And I think there
would be an impact for most of them.

In this particular case, as the function does bulk allocations, it
probably tempers the cost of the function call, and that's why I
was curious of any comparison with/without inlining. But I'm not
sure having this only function as non-inline makes a lot of sense.

So:
Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-23  5:35           ` Xie, Huawei
  2016-02-24 12:11             ` Panu Matilainen
@ 2016-02-26  8:55             ` Olivier MATZ
  2016-02-26  9:07               ` Xie, Huawei
  1 sibling, 1 reply; 54+ messages in thread
From: Olivier MATZ @ 2016-02-26  8:55 UTC (permalink / raw)
  To: Xie, Huawei, Panu Matilainen, dev; +Cc: dprovan



On 02/23/2016 06:35 AM, Xie, Huawei wrote:
>>> Also, it would be nice to have a simple test function in
>>> app/test/test_mbuf.c. For instance, you could update
>>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>>> the mbuf allocation from the function. Then it could be called with
>>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>>> all the mbufs of rte_pktmbuf_alloc_bulk().
> 
> Don't quite get you. Is it that we write two cases, one case allocate
> mbuf through rte_pktmbuf_alloc_bulk and one use rte_pktmbuf_alloc? It is
> good to have. 

Yes, something like:

test_one_pktmbuf(struct rte_mbuf *m)
{
	/* same as before without the allocation/free */
}

test_pkt_mbuf(void)
{
	m = rte_pktmbuf_alloc(pool);
	test_one_pktmbuf(m);
	rte_pktmbuf_free(m);

	ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
	for (i = 0; i < BULK_CNT; i++) {
		m = mtab[i];
		test_one_pktmbuf(m);
		rte_pktmbuf_free(m);
	}
}

> I could do this after this patch.

Yes, please.


Thanks,
Olivier

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-26  8:55             ` Olivier MATZ
@ 2016-02-26  9:07               ` Xie, Huawei
  2016-02-26  9:18                 ` Olivier MATZ
  0 siblings, 1 reply; 54+ messages in thread
From: Xie, Huawei @ 2016-02-26  9:07 UTC (permalink / raw)
  To: Olivier MATZ, Panu Matilainen, dev; +Cc: dprovan

On 2/26/2016 4:56 PM, Olivier MATZ wrote:
>
> On 02/23/2016 06:35 AM, Xie, Huawei wrote:
>>>> Also, it would be nice to have a simple test function in
>>>> app/test/test_mbuf.c. For instance, you could update
>>>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>>>> the mbuf allocation from the function. Then it could be called with
>>>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>>>> all the mbufs of rte_pktmbuf_alloc_bulk().
>> Don't quite get you. Is it that we write two cases, one case allocate
>> mbuf through rte_pktmbuf_alloc_bulk and one use rte_pktmbuf_alloc? It is
>> good to have. 
> Yes, something like:
>
> test_one_pktmbuf(struct rte_mbuf *m)
> {
> 	/* same as before without the allocation/free */
> }
>
> test_pkt_mbuf(void)
> {
> 	m = rte_pktmbuf_alloc(pool);
> 	test_one_pktmbuf(m);
> 	rte_pktmbuf_free(m);
>
> 	ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
> 	for (i = 0; i < BULK_CNT; i++) {
> 		m = mtab[i];
> 		test_one_pktmbuf(m);
> 		rte_pktmbuf_free(m);
> 	}
> }

This is to test the functionality.
Let us also have the case like the following?
        cycles_start = rte_get_timer_cycles();
        while(rounds--) {

		ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
		for (i = 0; i < BULK_CNT; i++) {
			m = mtab[i];
			/* some work if needed */
			rte_pktmbuf_free(m);
		}
        }
	cycles_end = rte_get_timer_cycles();

to compare with
       cycles_start = rte_get_timer_cycles();
       while(rounds--) {
                for (i = 0; i < BULK_CNT; i++)
                    mtab[i] = rte_pktmbuf_alloc(...);

		ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
		for (i = 0; i < BULK_CNT; i++) {
			m = mtab[i];
			/* some work if needed */
			rte_pktmbuf_free(m);
		}
        }
	cycles_end = rte_get_timer_cycles();


>> I could do this after this patch.
> Yes, please.
>
>
> Thanks,
> Olivier
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-26  9:07               ` Xie, Huawei
@ 2016-02-26  9:18                 ` Olivier MATZ
  0 siblings, 0 replies; 54+ messages in thread
From: Olivier MATZ @ 2016-02-26  9:18 UTC (permalink / raw)
  To: Xie, Huawei, Panu Matilainen, dev; +Cc: dprovan

Hi Huawei,

On 02/26/2016 10:07 AM, Xie, Huawei wrote:
> On 2/26/2016 4:56 PM, Olivier MATZ wrote:
>> test_one_pktmbuf(struct rte_mbuf *m)
>> {
>> 	/* same as before without the allocation/free */
>> }
>>
>> test_pkt_mbuf(void)
>> {
>> 	m = rte_pktmbuf_alloc(pool);
>> 	test_one_pktmbuf(m);
>> 	rte_pktmbuf_free(m);
>>
>> 	ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
>> 	for (i = 0; i < BULK_CNT; i++) {
>> 		m = mtab[i];
>> 		test_one_pktmbuf(m);
>> 		rte_pktmbuf_free(m);
>> 	}
>> }
> 
> This is to test the functionality.
> Let us also have the case like the following?
>         cycles_start = rte_get_timer_cycles();
>         while(rounds--) {
> 
> 		ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
> 		for (i = 0; i < BULK_CNT; i++) {
> 			m = mtab[i];
> 			/* some work if needed */
> 			rte_pktmbuf_free(m);
> 		}
>         }
> 	cycles_end = rte_get_timer_cycles();
> 
> to compare with
>        cycles_start = rte_get_timer_cycles();
>        while(rounds--) {
>                 for (i = 0; i < BULK_CNT; i++)
>                     mtab[i] = rte_pktmbuf_alloc(...);
> 
> 		ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
> 		for (i = 0; i < BULK_CNT; i++) {
> 			m = mtab[i];
> 			/* some work if needed */
> 			rte_pktmbuf_free(m);
> 		}
>         }
> 	cycles_end = rte_get_timer_cycles();

In my opinion, it's already quite obvious that the bulk allocation
will be faster than the non-bulk (and we already have some mempool
benchmarks showing it). So I would say that functional testing is
enough.

On the other hand, it would be good to see if some examples
applications could be updated to take advantage of the new API (as
you did for the librte_vhost).

What do you think?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v7] mbuf: provide rte_pktmbuf_alloc_bulk API
  2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
                   ` (5 preceding siblings ...)
  2016-01-26 17:03 ` [PATCH v6 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
@ 2016-02-28 12:44 ` Huawei Xie
  2016-02-29 16:27   ` Thomas Monjalon
  6 siblings, 1 reply; 54+ messages in thread
From: Huawei Xie @ 2016-02-28 12:44 UTC (permalink / raw)
  To: dev; +Cc: dprovan

v7 changes:
 rte_pktmbuf_alloc_bulk isn't exported as API, so shouldn't be listed in
version map

v6 changes:
 reflect the changes in release notes and library version map file
 revise our duff's code style a bit to make it more readable

v5 changes:
 add comment about duff's device and our variant implementation

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Attached the wiki page about duff's device. It explains the performance
optimization through loop unwinding, and also the most dramatic use of
case label fall-through.
https://en.wikipedia.org/wiki/Duff%27s_device

In this implementation, while() loop is used because we could not assume
count is strictly positive. Using while() loop saves one line of check.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
---
 doc/guides/rel_notes/release_16_04.rst |  3 ++
 lib/librte_mbuf/rte_mbuf.h             | 55 ++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/doc/guides/rel_notes/release_16_04.rst b/doc/guides/rel_notes/release_16_04.rst
index e2219d0..b10a11b 100644
--- a/doc/guides/rel_notes/release_16_04.rst
+++ b/doc/guides/rel_notes/release_16_04.rst
@@ -46,6 +46,9 @@ This section should contain new features added in this release. Sample format:
 
 * **Added vhost-user live migration support.**
 
+* **Enable bulk allocation of mbufs.**
+  A new function ``rte_pktmbuf_alloc_bulk()`` has been added to allow the user
+  to allocate a bulk of mbufs.
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index c973e9b..c1f6bc4 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,61 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 }
 
 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *    The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *    Array of pointers to mbufs
+ *  @param count
+ *    Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+	 struct rte_mbuf **mbufs, unsigned count)
+{
+	unsigned idx = 0;
+	int rc;
+
+	rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+	if (unlikely(rc))
+		return rc;
+
+	/* To understand duff's device on loop unwinding optimization, see
+	 * https://en.wikipedia.org/wiki/Duff's_device.
+	 * Here while() loop is used rather than do() while{} to avoid extra
+	 * check if count is zero.
+	 */
+	switch (count % 4) {
+	case 0:
+		while (idx != count) {
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+	case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+			rte_mbuf_refcnt_set(mbufs[idx], 1);
+			rte_pktmbuf_reset(mbufs[idx]);
+			idx++;
+		}
+	}
+	return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-24 13:23               ` Ananyev, Konstantin
  2016-02-26  7:39                 ` Xie, Huawei
@ 2016-02-29 10:51                 ` Panu Matilainen
  2016-02-29 16:14                   ` Thomas Monjalon
  1 sibling, 1 reply; 54+ messages in thread
From: Panu Matilainen @ 2016-02-29 10:51 UTC (permalink / raw)
  To: Ananyev, Konstantin, Xie, Huawei, Olivier MATZ, dev; +Cc: dprovan

On 02/24/2016 03:23 PM, Ananyev, Konstantin wrote:
> Hi Panu,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Panu Matilainen
>> Sent: Wednesday, February 24, 2016 12:12 PM
>> To: Xie, Huawei; Olivier MATZ; dev@dpdk.org
>> Cc: dprovan@bivio.net
>> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>>
>> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
>>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>>>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>>>> Hi,
>>>>>
>>>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>>>>> the library ABI and should not be listed in the version map.
>>>>>>
>>>>>> I assume its inline for performance reasons, but then you lose the
>>>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>>>>> itby just updating the library. Since the point of having a bulk API is
>>>>>> to improve performance by reducing the number of calls required, does it
>>>>>> really have to be inline? As in, have you actually measured the
>>>>>> difference between inline and non-inline and decided its worth all the
>>>>>> downsides?
>>>>> Agree with Panu. It would be interesting to compare the performance
>>>>> between inline and non inline to decide whether inlining it or not.
>>>> Will update after i gathered more data. inline could show obvious
>>>> performance difference in some cases.
>>>
>>> Panu and Oliver:
>>> I write a simple benchmark. This benchmark run 10M rounds, in each round
>>> 8 mbufs are allocated through bulk API, and then freed.
>>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
>>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
>>> Btw, i have removed some exceptional data, the frequency of which is
>>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
>>> what happened.
>>>
>>> With 8 mbufs allocated, there is about 6% performance increase using inline.
>> [...]
>>>
>>> With 16 mbufs allocated, we could still observe obvious performance
>>> difference, though only 1%-2%
>>>
>> [...]
>>>
>>> With 32/64 mbufs allocated, the deviation of the data itself would hide
>>> the performance difference.
>>> So we prefer using inline for performance.
>>
>> At least I was more after real-world performance in a real-world
>> use-case rather than CPU cycles in a microbenchmark, we know function
>> calls have a cost but the benefits tend to outweight the cons.
>>
>> Inline functions have their place and they're far less evil in project
>> internal use, but in library public API they are BAD and should be ...
>> well, not banned because there are exceptions to every rule, but highly
>> discouraged.
>
> Why is that?

For all the reasons static linking is bad, and what's worse it forces 
the static linking badness into dynamically linked builds.

If there's a bug (security or otherwise) in a library, a distro wants to 
supply an updated package which fixes that bug and be done with it. But 
if that bug is in an inlined code, supplying an update is not enough, 
you also need to recompile everything using that code, and somehow 
inform customers possibly using that code that they need to not only 
update the library but to recompile their apps as well. That is 
precisely the reason distros go to great lenghts to avoid *any* 
statically linked apps and libs in the distro, completely regardless of 
the performance overhead.

In addition, inlined code complicates ABI compatibility issues because 
some of the code is one the "wrong" side, and worse, it bypasses all the 
other ABI compatibility safeguards like soname and symbol versioning.

Like said, inlined code is fine for internal consumption, but incredibly 
bad for public interfaces. And of course, the more complicated a 
function is, greater the potential of needing bugfixes.

Mind you, none of this is magically specific to this particular 
function. Except in the sense that bulk operations offer a better way of 
performance improvements than just inlining everything.

> As you can see right now we have all mbuf alloc/free routines as static inline.
> And I think we would like to keep it like that.
> So why that particular function should be different?

Because there's much less need to have it inlined since the function 
call overhead is "amortized" by the fact its doing bulk operations. "We 
always did it that way" is not a very good reason :)

> After all that function is nothing more than a wrapper
> around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
> So unless mempool get/put API would change, I can hardly see there could be any ABI
> breakages in future.
> About 'real world' performance gain - it was a 'real world' performance problem,
> that we tried to solve by introducing that function:
> http://dpdk.org/ml/archives/dev/2015-May/017633.html
>
> And according to the user feedback, it does help:
> http://dpdk.org/ml/archives/dev/2016-February/033203.html

The question is not whether the function is useful, not at all. The 
question is whether the real-world case sees any measurable difference 
in performance if the function is made non-inline.

	- Panu -

> Konstantin
>
>>
>> 	- Panu -
>>
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-29 10:51                 ` Panu Matilainen
@ 2016-02-29 16:14                   ` Thomas Monjalon
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Monjalon @ 2016-02-29 16:14 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, dprovan

2016-02-29 12:51, Panu Matilainen:
> On 02/24/2016 03:23 PM, Ananyev, Konstantin wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Panu Matilainen
> >> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
> >>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
> >>>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> >>>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
> >>>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
> >>>>>> the library ABI and should not be listed in the version map.
> >>>>>>
> >>>>>> I assume its inline for performance reasons, but then you lose the
> >>>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
> >>>>>> itby just updating the library. Since the point of having a bulk API is
> >>>>>> to improve performance by reducing the number of calls required, does it
> >>>>>> really have to be inline? As in, have you actually measured the
> >>>>>> difference between inline and non-inline and decided its worth all the
> >>>>>> downsides?
> >>>>> Agree with Panu. It would be interesting to compare the performance
> >>>>> between inline and non inline to decide whether inlining it or not.
> >>>> Will update after i gathered more data. inline could show obvious
> >>>> performance difference in some cases.
> >>>
> >>> Panu and Oliver:
> >>> I write a simple benchmark. This benchmark run 10M rounds, in each round
> >>> 8 mbufs are allocated through bulk API, and then freed.
> >>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
> >>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
> >>> Btw, i have removed some exceptional data, the frequency of which is
> >>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
> >>> what happened.
> >>>
> >>> With 8 mbufs allocated, there is about 6% performance increase using inline.
> >> [...]
> >>>
> >>> With 16 mbufs allocated, we could still observe obvious performance
> >>> difference, though only 1%-2%
> >>>
> >> [...]
> >>>
> >>> With 32/64 mbufs allocated, the deviation of the data itself would hide
> >>> the performance difference.
> >>> So we prefer using inline for performance.
> >>
> >> At least I was more after real-world performance in a real-world
> >> use-case rather than CPU cycles in a microbenchmark, we know function
> >> calls have a cost but the benefits tend to outweight the cons.
> >>
> >> Inline functions have their place and they're far less evil in project
> >> internal use, but in library public API they are BAD and should be ...
> >> well, not banned because there are exceptions to every rule, but highly
> >> discouraged.
> >
> > Why is that?
> 
> For all the reasons static linking is bad, and what's worse it forces 
> the static linking badness into dynamically linked builds.
> 
> If there's a bug (security or otherwise) in a library, a distro wants to 
> supply an updated package which fixes that bug and be done with it. But 
> if that bug is in an inlined code, supplying an update is not enough, 
> you also need to recompile everything using that code, and somehow 
> inform customers possibly using that code that they need to not only 
> update the library but to recompile their apps as well. That is 
> precisely the reason distros go to great lenghts to avoid *any* 
> statically linked apps and libs in the distro, completely regardless of 
> the performance overhead.
> 
> In addition, inlined code complicates ABI compatibility issues because 
> some of the code is one the "wrong" side, and worse, it bypasses all the 
> other ABI compatibility safeguards like soname and symbol versioning.
> 
> Like said, inlined code is fine for internal consumption, but incredibly 
> bad for public interfaces. And of course, the more complicated a 
> function is, greater the potential of needing bugfixes.
> 
> Mind you, none of this is magically specific to this particular 
> function. Except in the sense that bulk operations offer a better way of 
> performance improvements than just inlining everything.
> 
> > As you can see right now we have all mbuf alloc/free routines as static inline.
> > And I think we would like to keep it like that.
> > So why that particular function should be different?
> 
> Because there's much less need to have it inlined since the function 
> call overhead is "amortized" by the fact its doing bulk operations. "We 
> always did it that way" is not a very good reason :)
> 
> > After all that function is nothing more than a wrapper
> > around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
> > So unless mempool get/put API would change, I can hardly see there could be any ABI
> > breakages in future.
> > About 'real world' performance gain - it was a 'real world' performance problem,
> > that we tried to solve by introducing that function:
> > http://dpdk.org/ml/archives/dev/2015-May/017633.html
> >
> > And according to the user feedback, it does help:
> > http://dpdk.org/ml/archives/dev/2016-February/033203.html
> 
> The question is not whether the function is useful, not at all. The 
> question is whether the real-world case sees any measurable difference 
> in performance if the function is made non-inline.

This is a valid question, and it applies to a large part of DPDK.
But it's something to measure and change more globally than just
a new function.
Generally speaking, any effort to reduce the size of the exported headers
will be welcome.

That said, this patch won't be blocked.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7] mbuf: provide rte_pktmbuf_alloc_bulk API
  2016-02-28 12:44 ` [PATCH v7] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
@ 2016-02-29 16:27   ` Thomas Monjalon
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Monjalon @ 2016-02-29 16:27 UTC (permalink / raw)
  To: Huawei Xie; +Cc: dev, dprovan

2016-02-28 20:44, Huawei Xie:
> v7 changes:
>  rte_pktmbuf_alloc_bulk isn't exported as API, so shouldn't be listed in
> version map
> 
> v6 changes:
>  reflect the changes in release notes and library version map file
>  revise our duff's code style a bit to make it more readable
> 
> v5 changes:
>  add comment about duff's device and our variant implementation
> 
> v3 changes:
>  move while after case 0
>  add context about duff's device and why we use while loop in the commit
> message
> 
> v2 changes:
>  unroll the loop a bit to help the performance
> 
> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
> 
> There is related thread about this bulk API.
> http://dpdk.org/dev/patchwork/patch/4718/
> Thanks to Konstantin's loop unrolling.
> 
> Attached the wiki page about duff's device. It explains the performance
> optimization through loop unwinding, and also the most dramatic use of
> case label fall-through.
> https://en.wikipedia.org/wiki/Duff%27s_device
> 
> In this implementation, while() loop is used because we could not assume
> count is strictly positive. Using while() loop saves one line of check.
> 
> Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Applied, thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2016-02-29 16:29 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-13 23:35 [PATCH 0/2] provide rte_pktmbuf_alloc_bulk API and call it in vhost dequeue Huawei Xie
2015-12-13 23:35 ` [PATCH 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2015-12-13 23:35 ` [PATCH 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
2015-12-14  1:14 ` [PATCH v2 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
2015-12-14  1:14   ` [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2015-12-17  6:41     ` Yuanhan Liu
2015-12-17 15:42       ` Ananyev, Konstantin
2015-12-18  2:17         ` Yuanhan Liu
2015-12-18  5:01     ` Stephen Hemminger
2015-12-18  5:21       ` Yuanhan Liu
2015-12-18  7:10       ` Xie, Huawei
2015-12-18 10:44       ` Ananyev, Konstantin
2015-12-18 17:32         ` Stephen Hemminger
2015-12-18 19:27           ` Wiles, Keith
2015-12-21 15:21             ` Xie, Huawei
2015-12-21 17:20               ` Wiles, Keith
2015-12-21 21:30                 ` Thomas Monjalon
2015-12-22  1:58                   ` Xie, Huawei
2015-12-21 22:34               ` Don Provan
2015-12-21 12:25           ` Xie, Huawei
2015-12-14  1:14   ` [PATCH v2 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
2015-12-17  6:41     ` Yuanhan Liu
2015-12-22 16:17   ` [PATCH v3 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
2015-12-22 16:17     ` [PATCH v3 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2015-12-23 18:37       ` Stephen Hemminger
2015-12-23 18:49         ` Ananyev, Konstantin
2015-12-24  1:33           ` Xie, Huawei
2015-12-22 16:17     ` [PATCH v3 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
2015-12-23 11:22       ` linhaifeng
2015-12-23 11:39         ` Xie, Huawei
2015-12-22 23:05 ` [PATCH v4 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
2015-12-22 23:05   ` [PATCH v4 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2015-12-22 23:05   ` [PATCH v4 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
2015-12-27 16:38 ` [PATCH v5 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
2015-12-27 16:38   ` [PATCH v5 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2015-12-27 16:38   ` [PATCH v5 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
2016-01-26 17:03 ` [PATCH v6 0/2] provide rte_pktmbuf_alloc_bulk API and call it " Huawei Xie
2016-01-26 17:03   ` [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2016-01-27 13:56     ` Panu Matilainen
2016-02-03 17:23       ` Olivier MATZ
2016-02-22 14:49         ` Xie, Huawei
2016-02-23  5:35           ` Xie, Huawei
2016-02-24 12:11             ` Panu Matilainen
2016-02-24 13:23               ` Ananyev, Konstantin
2016-02-26  7:39                 ` Xie, Huawei
2016-02-26  8:45                   ` Olivier MATZ
2016-02-29 10:51                 ` Panu Matilainen
2016-02-29 16:14                   ` Thomas Monjalon
2016-02-26  8:55             ` Olivier MATZ
2016-02-26  9:07               ` Xie, Huawei
2016-02-26  9:18                 ` Olivier MATZ
2016-01-26 17:03   ` [PATCH v6 2/2] vhost: call rte_pktmbuf_alloc_bulk in vhost dequeue Huawei Xie
2016-02-28 12:44 ` [PATCH v7] mbuf: provide rte_pktmbuf_alloc_bulk API Huawei Xie
2016-02-29 16:27   ` Thomas Monjalon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.