* [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations
@ 2019-05-17 12:22 Maxime Coquelin
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:22 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev
  Cc: Maxime Coquelin

Some OVS-DPDK PVP benchmarks show a performance drop
when switching from DPDK v17.11 to v18.11.

With the addition of packed ring layout support,
rte_vhost_enqueue_burst and rte_vhost_dequeue_burst
became very large, while only part of their instructions
are executed (a ring is either packed or split).

This series aims at reducing the I-cache pressure,
first by un-inlining the split and packed ring functions,
and by moving parts considered cold into dedicated
functions (dirty page logging, and the fragmented descriptor
buffer handling added for CVE-2018-1059).

With the series applied, the sizes of the enqueue and
dequeue split paths are reduced significantly:

+---------+--------------------+---------------------+
| Version | Enqueue split path |  Dequeue split path |
+---------+--------------------+---------------------+
| v19.05  | 16461B             | 25521B              |
| +series | 7286B              | 11285B              |
+---------+--------------------+---------------------+

Using the perf tool to monitor the iTLB-load-misses event
while running a PVP benchmark with testpmd as the vswitch,
we can see the number of iTLB misses is reduced:

- v19.05:
# perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

 Performance counter stats for 'CPU(s) 2,3' (10 runs):

             2,438      iTLB-load-miss                                                ( +- 13.43% )

       10.00058928 +- 0.00000336 seconds time elapsed  ( +-  0.00% )

- +series:
# perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

 Performance counter stats for 'CPU(s) 2,3' (10 runs):

                55      iTLB-load-miss                                                ( +- 10.08% )

       10.00059466 +- 0.00000283 seconds time elapsed  ( +-  0.00% )

The series also forces the inlining of some rte_memcpy
helpers: with the addition of packed ring support, some of
them were no longer inlined but emitted as functions in
the virtio_net object file, which was not expected.

Finally, the series simplifies the prefetching of the
descriptors' buffers, by doing it in the recently introduced
descriptor buffer mapping function.

Maxime Coquelin (4):
  vhost: un-inline dirty pages logging functions
  vhost: do not inline packed and split functions
  vhost: do not inline unlikely fragmented buffers code
  vhost: simplify descriptor's buffer prefetching

root (1):
  eal/x86: force inlining of all memcpy and mov helpers

 .../common/include/arch/x86/rte_memcpy.h      |  18 +-
 lib/librte_vhost/vhost.c                      | 165 ++++++++++++++++++
 lib/librte_vhost/vhost.h                      | 164 ++---------------
 lib/librte_vhost/virtio_net.c                 | 142 +++++++--------
 4 files changed, 250 insertions(+), 239 deletions(-)

-- 
2.21.0



* [dpdk-dev] [PATCH 1/5] vhost: un-inline dirty pages logging functions
  2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
@ 2019-05-17 12:22 ` Maxime Coquelin
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions Maxime Coquelin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:22 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev
  Cc: Maxime Coquelin

In order to reduce the I-cache pressure, this patch removes
the inlining of the dirty page logging functions, which can
be considered a cold path.

Indeed, these functions are only called during live
migration, so they are not called most of the time.
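
The resulting pattern keeps only the feature-bit check inline
and moves the logging body out of line; distilled from the diff
below (a sketch of the change, not additional code):

static __rte_always_inline void
vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
{
	/* Hot path: a single predictable test when logging is off. */
	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
		__vhost_log_write(dev, addr, len); /* cold, now in vhost.c */
}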

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/vhost.c | 132 +++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost.h | 129 ++++----------------------------------
 2 files changed, 144 insertions(+), 117 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 163f4595ef..4a54ad6bd1 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -69,6 +69,138 @@ __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+#define VHOST_LOG_PAGE	4096
+
+/*
+ * Atomically set a bit in memory.
+ */
+static __rte_always_inline void
+vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
+{
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
+	/*
+	 * __sync_ built-ins are deprecated, but __atomic_ ones
+	 * are sub-optimized in older GCC versions.
+	 */
+	__sync_fetch_and_or_1(addr, (1U << nr));
+#else
+	__atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
+#endif
+}
+
+static __rte_always_inline void
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	vhost_set_bit(page % 8, &log_base[page / 8]);
+}
+
+void
+__vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (unlikely(!dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	/* To make sure guest memory updates are committed before logging */
+	rte_smp_wmb();
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += 1;
+	}
+}
+
+void
+__vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+	unsigned long *log_base;
+	int i;
+
+	if (unlikely(!dev->log_base))
+		return;
+
+	rte_smp_wmb();
+
+	log_base = (unsigned long *)(uintptr_t)dev->log_base;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
+		/*
+		 * '__sync' builtins are deprecated, but '__atomic' ones
+		 * are sub-optimized in older GCC versions.
+		 */
+		__sync_fetch_and_or(log_base + elem->offset, elem->val);
+#else
+		__atomic_fetch_or(log_base + elem->offset, elem->val,
+				__ATOMIC_RELAXED);
+#endif
+	}
+
+	rte_smp_wmb();
+
+	vq->log_cache_nb_elem = 0;
+}
+
+static __rte_always_inline void
+vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t page)
+{
+	uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
+	uint32_t offset = page / (sizeof(unsigned long) << 3);
+	int i;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+		if (elem->offset == offset) {
+			elem->val |= (1UL << bit_nr);
+			return;
+		}
+	}
+
+	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
+		/*
+		 * No more room for a new log cache entry,
+		 * so write the dirty log map directly.
+		 */
+		rte_smp_wmb();
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+
+		return;
+	}
+
+	vq->log_cache[i].offset = offset;
+	vq->log_cache[i].val = (1UL << bit_nr);
+	vq->log_cache_nb_elem++;
+}
+
+void
+__vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (unlikely(!dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_cache_page(dev, vq, page);
+		page += 1;
+	}
+}
+
+
 void
 cleanup_vq(struct vhost_virtqueue *vq, int destroy)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index e9138dfab4..3ab7b4950f 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,138 +350,33 @@ desc_is_avail(struct vring_packed_desc *desc, bool wrap_counter)
 		wrap_counter != !!(flags & VRING_DESC_F_USED);
 }
 
-#define VHOST_LOG_PAGE	4096
-
-/*
- * Atomically set a bit in memory.
- */
-static __rte_always_inline void
-vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
-{
-#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
-	/*
-	 * __sync_ built-ins are deprecated, but __atomic_ ones
-	 * are sub-optimized in older GCC versions.
-	 */
-	__sync_fetch_and_or_1(addr, (1U << nr));
-#else
-	__atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
-#endif
-}
-
-static __rte_always_inline void
-vhost_log_page(uint8_t *log_base, uint64_t page)
-{
-	vhost_set_bit(page % 8, &log_base[page / 8]);
-}
+void __vhost_log_cache_write(struct virtio_net *dev,
+		struct vhost_virtqueue *vq,
+		uint64_t addr, uint64_t len);
+void __vhost_log_cache_sync(struct virtio_net *dev,
+		struct vhost_virtqueue *vq);
+void __vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len);
 
 static __rte_always_inline void
 vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
 {
-	uint64_t page;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base || !len))
-		return;
-
-	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
-		return;
-
-	/* To make sure guest memory updates are committed before logging */
-	rte_smp_wmb();
-
-	page = addr / VHOST_LOG_PAGE;
-	while (page * VHOST_LOG_PAGE < addr + len) {
-		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
-		page += 1;
-	}
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_write(dev, addr, len);
 }
 
 static __rte_always_inline void
 vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
 {
-	unsigned long *log_base;
-	int i;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base))
-		return;
-
-	rte_smp_wmb();
-
-	log_base = (unsigned long *)(uintptr_t)dev->log_base;
-
-	for (i = 0; i < vq->log_cache_nb_elem; i++) {
-		struct log_cache_entry *elem = vq->log_cache + i;
-
-#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
-		/*
-		 * '__sync' builtins are deprecated, but '__atomic' ones
-		 * are sub-optimized in older GCC versions.
-		 */
-		__sync_fetch_and_or(log_base + elem->offset, elem->val);
-#else
-		__atomic_fetch_or(log_base + elem->offset, elem->val,
-				__ATOMIC_RELAXED);
-#endif
-	}
-
-	rte_smp_wmb();
-
-	vq->log_cache_nb_elem = 0;
-}
-
-static __rte_always_inline void
-vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
-			uint64_t page)
-{
-	uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
-	uint32_t offset = page / (sizeof(unsigned long) << 3);
-	int i;
-
-	for (i = 0; i < vq->log_cache_nb_elem; i++) {
-		struct log_cache_entry *elem = vq->log_cache + i;
-
-		if (elem->offset == offset) {
-			elem->val |= (1UL << bit_nr);
-			return;
-		}
-	}
-
-	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
-		/*
-		 * No more room for a new log cache entry,
-		 * so write the dirty log map directly.
-		 */
-		rte_smp_wmb();
-		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
-
-		return;
-	}
-
-	vq->log_cache[i].offset = offset;
-	vq->log_cache[i].val = (1UL << bit_nr);
-	vq->log_cache_nb_elem++;
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_cache_sync(dev, vq);
 }
 
 static __rte_always_inline void
 vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			uint64_t addr, uint64_t len)
 {
-	uint64_t page;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base || !len))
-		return;
-
-	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
-		return;
-
-	page = addr / VHOST_LOG_PAGE;
-	while (page * VHOST_LOG_PAGE < addr + len) {
-		vhost_log_cache_page(dev, vq, page);
-		page += 1;
-	}
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_cache_write(dev, vq, addr, len);
 }
 
 static __rte_always_inline void
-- 
2.21.0



* [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions
  2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
@ 2019-05-17 12:22 ` Maxime Coquelin
  2019-05-17 13:00   ` David Marchand
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:22 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev
  Cc: Maxime Coquelin

At runtime, either the packed Tx/Rx functions are always
called, or the split Tx/Rx functions are always called.

This patch removes their forced inlining in order to reduce
the I-cache pressure.
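
For context, these functions sit behind a runtime ring-type
check; roughly (a simplified sketch of the existing dispatch,
assuming the vq_is_packed() helper used by the Rx/Tx entry
points):

	if (vq_is_packed(dev))
		nb_tx = virtio_dev_rx_packed(dev, vq, pkts, count);
	else
		nb_tx = virtio_dev_rx_split(dev, vq, pkts, count);

so, for a given device, only one of the two bodies ever runs.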

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/virtio_net.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index a6a33a1013..35ae4992c2 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -771,7 +771,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
-static __rte_always_inline uint32_t
+static uint32_t
 virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mbuf **pkts, uint32_t count)
 {
@@ -830,7 +830,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return pkt_idx;
 }
 
-static __rte_always_inline uint32_t
+static uint32_t
 virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mbuf **pkts, uint32_t count)
 {
@@ -1300,7 +1300,7 @@ get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
-static __rte_always_inline uint16_t
+static uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
@@ -1422,7 +1422,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return i;
 }
 
-static __rte_always_inline uint16_t
+static uint16_t
 virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
-- 
2.21.0



* [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions Maxime Coquelin
@ 2019-05-17 12:22 ` Maxime Coquelin
  2019-05-17 12:57   ` Maxime Coquelin
  2019-05-21 19:43   ` Mattias Rönnblom
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:22 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev
  Cc: Maxime Coquelin

Handling of fragmented virtio-net headers and indirect
descriptor tables was implemented to fix CVE-2018-1059. It
should never happen with healthy guests, and is already
considered an unlikely code path.

This patch moves these bits into non-inline dedicated functions
to reduce the I-cache pressure.
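
The hot functions then shrink to a single conditional call at
each former copy site, e.g. (excerpted from the diff below):

	if (unlikely(hdr == &tmp_hdr))
		copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);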

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/vhost.c      |  33 +++++++++++
 lib/librte_vhost/vhost.h      |  35 +-----------
 lib/librte_vhost/virtio_net.c | 102 +++++++++++++++++++---------------
 3 files changed, 91 insertions(+), 79 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 4a54ad6bd1..8a4379bc13 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
 }
 
 
+void *
+alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint64_t desc_addr, uint64_t desc_len)
+{
+	void *idesc;
+	uint64_t src, dst;
+	uint64_t len, remain = desc_len;
+
+	idesc = rte_malloc(__func__, desc_len, 0);
+	if (unlikely(!idesc))
+		return NULL;
+
+	dst = (uint64_t)(uintptr_t)idesc;
+
+	while (remain) {
+		len = remain;
+		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
+				VHOST_ACCESS_RO);
+		if (unlikely(!src || !len)) {
+			rte_free(idesc);
+			return NULL;
+		}
+
+		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
+
+		remain -= len;
+		dst += len;
+		desc_addr += len;
+	}
+
+	return idesc;
+}
+
 void
 cleanup_vq(struct vhost_virtqueue *vq, int destroy)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 3ab7b4950f..ab26454e1c 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -488,6 +488,8 @@ void vhost_backend_cleanup(struct virtio_net *dev);
 
 uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			uint64_t iova, uint64_t *len, uint8_t perm);
+void *alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t desc_addr, uint64_t desc_len);
 int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
 void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);
 
@@ -601,39 +603,6 @@ vhost_vring_call_packed(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		eventfd_write(vq->callfd, (eventfd_t)1);
 }
 
-static __rte_always_inline void *
-alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
-		uint64_t desc_addr, uint64_t desc_len)
-{
-	void *idesc;
-	uint64_t src, dst;
-	uint64_t len, remain = desc_len;
-
-	idesc = rte_malloc(__func__, desc_len, 0);
-	if (unlikely(!idesc))
-		return 0;
-
-	dst = (uint64_t)(uintptr_t)idesc;
-
-	while (remain) {
-		len = remain;
-		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
-				VHOST_ACCESS_RO);
-		if (unlikely(!src || !len)) {
-			rte_free(idesc);
-			return 0;
-		}
-
-		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
-
-		remain -= len;
-		dst += len;
-		desc_addr += len;
-	}
-
-	return idesc;
-}
-
 static __rte_always_inline void
 free_ind_table(void *idesc)
 {
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 35ae4992c2..494dd9957e 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -610,6 +610,35 @@ reserve_avail_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+static void
+copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		struct buf_vector *buf_vec,
+		struct virtio_net_hdr_mrg_rxbuf *hdr){
+	uint64_t len;
+	uint64_t remain = dev->vhost_hlen;
+	uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
+	uint64_t iova = buf_vec->buf_iova;
+
+	while (remain) {
+		len = RTE_MIN(remain,
+				buf_vec->buf_len);
+		dst = buf_vec->buf_addr;
+		rte_memcpy((void *)(uintptr_t)dst,
+				(void *)(uintptr_t)src,
+				len);
+
+		PRINT_PACKET(dev, (uintptr_t)dst,
+				(uint32_t)len, 0);
+		vhost_log_cache_write(dev, vq,
+				iova, len);
+
+		remain -= len;
+		iova += len;
+		src += len;
+		buf_vec++;
+	}
+}
+
 static __rte_always_inline int
 copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			    struct rte_mbuf *m, struct buf_vector *buf_vec,
@@ -703,30 +732,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 						num_buffers);
 
 			if (unlikely(hdr == &tmp_hdr)) {
-				uint64_t len;
-				uint64_t remain = dev->vhost_hlen;
-				uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
-				uint64_t iova = buf_vec[0].buf_iova;
-				uint16_t hdr_vec_idx = 0;
-
-				while (remain) {
-					len = RTE_MIN(remain,
-						buf_vec[hdr_vec_idx].buf_len);
-					dst = buf_vec[hdr_vec_idx].buf_addr;
-					rte_memcpy((void *)(uintptr_t)dst,
-							(void *)(uintptr_t)src,
-							len);
-
-					PRINT_PACKET(dev, (uintptr_t)dst,
-							(uint32_t)len, 0);
-					vhost_log_cache_write(dev, vq,
-							iova, len);
-
-					remain -= len;
-					iova += len;
-					src += len;
-					hdr_vec_idx++;
-				}
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
 			} else {
 				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
 						dev->vhost_hlen, 0);
@@ -1063,6 +1069,31 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
 	}
 }
 
+static void
+copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
+		struct buf_vector *buf_vec)
+{
+	uint64_t len;
+	uint64_t remain = sizeof(struct virtio_net_hdr);
+	uint64_t src;
+	uint64_t dst = (uint64_t)(uintptr_t)hdr;
+
+	/*
+	 * No luck, the virtio-net header doesn't fit
+	 * in a contiguous virtual area.
+	 */
+	while (remain) {
+		len = RTE_MIN(remain, buf_vec->buf_len);
+		src = buf_vec->buf_addr;
+		rte_memcpy((void *)(uintptr_t)dst,
+				(void *)(uintptr_t)src, len);
+
+		remain -= len;
+		dst += len;
+		buf_vec++;
+	}
+}
+
 static __rte_always_inline int
 copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct buf_vector *buf_vec, uint16_t nr_vec,
@@ -1094,28 +1125,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 	if (virtio_net_with_host_offload(dev)) {
 		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
-			uint64_t len;
-			uint64_t remain = sizeof(struct virtio_net_hdr);
-			uint64_t src;
-			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
-			uint16_t hdr_vec_idx = 0;
-
-			/*
-			 * No luck, the virtio-net header doesn't fit
-			 * in a contiguous virtual area.
-			 */
-			while (remain) {
-				len = RTE_MIN(remain,
-					buf_vec[hdr_vec_idx].buf_len);
-				src = buf_vec[hdr_vec_idx].buf_addr;
-				rte_memcpy((void *)(uintptr_t)dst,
-						   (void *)(uintptr_t)src, len);
-
-				remain -= len;
-				dst += len;
-				hdr_vec_idx++;
-			}
-
+			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
 			hdr = &tmp_hdr;
 		} else {
 			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
-- 
2.21.0



* [dpdk-dev] [PATCH 4/5] vhost: simplify descriptor's buffer prefetching
  2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (2 preceding siblings ...)
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
@ 2019-05-17 12:22 ` Maxime Coquelin
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
  2019-05-17 13:04 ` [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations David Marchand
  5 siblings, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:22 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev
  Cc: Maxime Coquelin

Now that we have a single function to map the descriptors'
buffers, let's prefetch them there, as it is the earliest
place we can do it.
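
In a nutshell (simplified from the diff below), a single
prefetch in map_one_desc() replaces the rte_prefetch0() calls
that were scattered across the enqueue and dequeue paths:

	desc_addr = vhost_iova_to_vva(dev, vq, desc_iova, &len, perm);
	if (unlikely(!desc_addr))
		return -1;

	rte_prefetch0((void *)(uintptr_t)desc_addr);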

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/virtio_net.c | 32 ++------------------------------
 1 file changed, 2 insertions(+), 30 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 494dd9957e..d9031fe55c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -286,6 +286,8 @@ map_one_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		if (unlikely(!desc_addr))
 			return -1;
 
+		rte_prefetch0((void *)(uintptr_t)desc_addr);
+
 		buf_vec[vec_id].buf_iova = desc_iova;
 		buf_vec[vec_id].buf_addr = desc_addr;
 		buf_vec[vec_id].buf_len  = desc_chunck_len;
@@ -664,9 +666,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	buf_iova = buf_vec[vec_idx].buf_iova;
 	buf_len = buf_vec[vec_idx].buf_len;
 
-	if (nr_vec > 1)
-		rte_prefetch0((void *)(uintptr_t)buf_vec[1].buf_addr);
-
 	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
 		error = -1;
 		goto out;
@@ -709,10 +708,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
-			/* Prefetch next buffer address. */
-			if (vec_idx + 1 < nr_vec)
-				rte_prefetch0((void *)(uintptr_t)
-						buf_vec[vec_idx + 1].buf_addr);
 			buf_offset = 0;
 			buf_avail  = buf_len;
 		}
@@ -810,8 +805,6 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			break;
 		}
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		VHOST_LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
 			dev->vid, vq->last_avail_idx,
 			vq->last_avail_idx + num_buffers);
@@ -859,8 +852,6 @@ virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			break;
 		}
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		VHOST_LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
 			dev->vid, vq->last_avail_idx,
 			vq->last_avail_idx + num_buffers);
@@ -1120,16 +1111,12 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		goto out;
 	}
 
-	if (likely(nr_vec > 1))
-		rte_prefetch0((void *)(uintptr_t)buf_vec[1].buf_addr);
-
 	if (virtio_net_with_host_offload(dev)) {
 		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
 			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
 			hdr = &tmp_hdr;
 		} else {
 			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
-			rte_prefetch0(hdr);
 		}
 	}
 
@@ -1159,9 +1146,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		buf_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
 	}
 
-	rte_prefetch0((void *)(uintptr_t)
-			(buf_addr + buf_offset));
-
 	PRINT_PACKET(dev,
 			(uintptr_t)(buf_addr + buf_offset),
 			(uint32_t)buf_avail, 0);
@@ -1227,14 +1211,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
-			/*
-			 * Prefecth desc n + 1 buffer while
-			 * desc n buffer is processed.
-			 */
-			if (vec_idx + 1 < nr_vec)
-				rte_prefetch0((void *)(uintptr_t)
-						buf_vec[vec_idx + 1].buf_addr);
-
 			buf_offset = 0;
 			buf_avail  = buf_len;
 
@@ -1378,8 +1354,6 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		if (likely(dev->dequeue_zero_copy == 0))
 			update_shadow_used_ring_split(vq, head_idx, 0);
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
@@ -1489,8 +1463,6 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			update_shadow_used_ring_packed(vq, buf_id, 0,
 					desc_count);
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
-- 
2.21.0



* [dpdk-dev] [PATCH 5/5] eal/x86: force inlining of all memcpy and mov helpers
  2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (3 preceding siblings ...)
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
@ 2019-05-17 12:22 ` Maxime Coquelin
  2019-05-17 13:04 ` [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations David Marchand
  5 siblings, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:22 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev
  Cc: root

From: root <root@virtlab510.virt.lab.eng.bos.redhat.com>

Some helpers in the header file are force-inlined while
others are only inlined; this patch forces inlining for all
of them.

This avoids them being emitted as functions when called
multiple times in the same object file. For example, when
packed ring support was added to the vhost-user library,
rte_memcpy_generic was no longer inlined.
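
For reference, plain 'static inline' is only a hint that the
compiler is free to ignore, whereas __rte_always_inline carries
GCC's always_inline attribute; it is defined in rte_common.h
(as of this DPDK version) as:

#define __rte_always_inline inline __attribute__((always_inline))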

Signed-off-by: root <root@virtlab510.virt.lab.eng.bos.redhat.com>
---
 .../common/include/arch/x86/rte_memcpy.h       | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 7b758094df..ba44c4a328 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -115,7 +115,7 @@ rte_mov256(uint8_t *dst, const uint8_t *src)
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
 	__m512i zmm0, zmm1;
@@ -163,7 +163,7 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
@@ -330,7 +330,7 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
 	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
@@ -343,7 +343,7 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
 	__m256i ymm0, ymm1, ymm2, ymm3;
@@ -363,7 +363,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
@@ -523,7 +523,7 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
 	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
@@ -655,7 +655,7 @@ __extension__ ({                                                      \
     }                                                                 \
 })
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
@@ -800,7 +800,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 
 #endif /* RTE_MACHINE_CPUFLAG */
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_aligned(void *dst, const void *src, size_t n)
 {
 	void *ret = dst;
@@ -860,7 +860,7 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
 	return ret;
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
 	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-- 
2.21.0



* Re: [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
@ 2019-05-17 12:57   ` Maxime Coquelin
  2019-05-21 19:43   ` Mattias Rönnblom
  1 sibling, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 12:57 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev



On 5/17/19 2:22 PM, Maxime Coquelin wrote:
> Handling of fragmented virtio-net headers and indirect
> descriptor tables was implemented to fix CVE-2018-1059. It
> should never happen with healthy guests, and is already
> considered an unlikely code path.
> 
> This patch moves these bits into non-inline dedicated functions
> to reduce the I-cache pressure.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>   lib/librte_vhost/vhost.c      |  33 +++++++++++
>   lib/librte_vhost/vhost.h      |  35 +-----------
>   lib/librte_vhost/virtio_net.c | 102 +++++++++++++++++++---------------
>   3 files changed, 91 insertions(+), 79 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 4a54ad6bd1..8a4379bc13 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   }
>   
>   
> +void *
> +alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		uint64_t desc_addr, uint64_t desc_len)
> +{
> +	void *idesc;
> +	uint64_t src, dst;
> +	uint64_t len, remain = desc_len;
> +
> +	idesc = rte_malloc(__func__, desc_len, 0);
> +	if (unlikely(!idesc))
> +		return NULL;
> +
> +	dst = (uint64_t)(uintptr_t)idesc;
> +
> +	while (remain) {
> +		len = remain;
> +		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
> +				VHOST_ACCESS_RO);
> +		if (unlikely(!src || !len)) {
> +			rte_free(idesc);
> +			return NULL;
> +		}
> +
> +		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
> +
> +		remain -= len;
> +		dst += len;
> +		desc_addr += len;
> +	}
> +
> +	return idesc;
> +}
> +
>   void
>   cleanup_vq(struct vhost_virtqueue *vq, int destroy)
>   {
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 3ab7b4950f..ab26454e1c 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -488,6 +488,8 @@ void vhost_backend_cleanup(struct virtio_net *dev);
>   
>   uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   			uint64_t iova, uint64_t *len, uint8_t perm);
> +void *alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +			uint64_t desc_addr, uint64_t desc_len);
>   int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
>   void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);
>   
> @@ -601,39 +603,6 @@ vhost_vring_call_packed(struct virtio_net *dev, struct vhost_virtqueue *vq)
>   		eventfd_write(vq->callfd, (eventfd_t)1);
>   }
>   
> -static __rte_always_inline void *
> -alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -		uint64_t desc_addr, uint64_t desc_len)
> -{
> -	void *idesc;
> -	uint64_t src, dst;
> -	uint64_t len, remain = desc_len;
> -
> -	idesc = rte_malloc(__func__, desc_len, 0);
> -	if (unlikely(!idesc))
> -		return 0;
> -
> -	dst = (uint64_t)(uintptr_t)idesc;
> -
> -	while (remain) {
> -		len = remain;
> -		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
> -				VHOST_ACCESS_RO);
> -		if (unlikely(!src || !len)) {
> -			rte_free(idesc);
> -			return 0;
> -		}
> -
> -		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
> -
> -		remain -= len;
> -		dst += len;
> -		desc_addr += len;
> -	}
> -
> -	return idesc;
> -}
> -
>   static __rte_always_inline void
>   free_ind_table(void *idesc)
>   {
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 35ae4992c2..494dd9957e 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -610,6 +610,35 @@ reserve_avail_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   	return 0;
>   }
>   
> +static void
> +copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		struct buf_vector *buf_vec,
> +		struct virtio_net_hdr_mrg_rxbuf *hdr){

I seem to have missed the open bracket coding style issue above
while running checkpatch.

Will fix in next revision.

> +	uint64_t len;
> +	uint64_t remain = dev->vhost_hlen;
> +	uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
> +	uint64_t iova = buf_vec->buf_iova;
> +
> +	while (remain) {
> +		len = RTE_MIN(remain,
> +				buf_vec->buf_len);
> +		dst = buf_vec->buf_addr;
> +		rte_memcpy((void *)(uintptr_t)dst,
> +				(void *)(uintptr_t)src,
> +				len);
> +
> +		PRINT_PACKET(dev, (uintptr_t)dst,
> +				(uint32_t)len, 0);
> +		vhost_log_cache_write(dev, vq,
> +				iova, len);
> +
> +		remain -= len;
> +		iova += len;
> +		src += len;
> +		buf_vec++;
> +	}
> +}
> +
>   static __rte_always_inline int
>   copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   			    struct rte_mbuf *m, struct buf_vector *buf_vec,
> @@ -703,30 +732,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   						num_buffers);
>   
>   			if (unlikely(hdr == &tmp_hdr)) {
> -				uint64_t len;
> -				uint64_t remain = dev->vhost_hlen;
> -				uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
> -				uint64_t iova = buf_vec[0].buf_iova;
> -				uint16_t hdr_vec_idx = 0;
> -
> -				while (remain) {
> -					len = RTE_MIN(remain,
> -						buf_vec[hdr_vec_idx].buf_len);
> -					dst = buf_vec[hdr_vec_idx].buf_addr;
> -					rte_memcpy((void *)(uintptr_t)dst,
> -							(void *)(uintptr_t)src,
> -							len);
> -
> -					PRINT_PACKET(dev, (uintptr_t)dst,
> -							(uint32_t)len, 0);
> -					vhost_log_cache_write(dev, vq,
> -							iova, len);
> -
> -					remain -= len;
> -					iova += len;
> -					src += len;
> -					hdr_vec_idx++;
> -				}
> +				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
>   			} else {
>   				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
>   						dev->vhost_hlen, 0);
> @@ -1063,6 +1069,31 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
>   	}
>   }
>   
> +static void
> +copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
> +		struct buf_vector *buf_vec)
> +{
> +	uint64_t len;
> +	uint64_t remain = sizeof(struct virtio_net_hdr);
> +	uint64_t src;
> +	uint64_t dst = (uint64_t)(uintptr_t)hdr;
> +
> +	/*
> +	 * No luck, the virtio-net header doesn't fit
> +	 * in a contiguous virtual area.
> +	 */
> +	while (remain) {
> +		len = RTE_MIN(remain, buf_vec->buf_len);
> +		src = buf_vec->buf_addr;
> +		rte_memcpy((void *)(uintptr_t)dst,
> +				(void *)(uintptr_t)src, len);
> +
> +		remain -= len;
> +		dst += len;
> +		buf_vec++;
> +	}
> +}
> +
>   static __rte_always_inline int
>   copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   		  struct buf_vector *buf_vec, uint16_t nr_vec,
> @@ -1094,28 +1125,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   
>   	if (virtio_net_with_host_offload(dev)) {
>   		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
> -			uint64_t len;
> -			uint64_t remain = sizeof(struct virtio_net_hdr);
> -			uint64_t src;
> -			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
> -			uint16_t hdr_vec_idx = 0;
> -
> -			/*
> -			 * No luck, the virtio-net header doesn't fit
> -			 * in a contiguous virtual area.
> -			 */
> -			while (remain) {
> -				len = RTE_MIN(remain,
> -					buf_vec[hdr_vec_idx].buf_len);
> -				src = buf_vec[hdr_vec_idx].buf_addr;
> -				rte_memcpy((void *)(uintptr_t)dst,
> -						   (void *)(uintptr_t)src, len);
> -
> -				remain -= len;
> -				dst += len;
> -				hdr_vec_idx++;
> -			}
> -
> +			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
>   			hdr = &tmp_hdr;
>   		} else {
>   			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
> 


* Re: [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions Maxime Coquelin
@ 2019-05-17 13:00   ` David Marchand
  2019-05-17 14:42     ` Maxime Coquelin
  0 siblings, 1 reply; 15+ messages in thread
From: David Marchand @ 2019-05-17 13:00 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, Tiwei Bie, Jens Freimann, Zhihong Wang, Bruce Richardson,
	Ananyev, Konstantin

On Fri, May 17, 2019 at 2:23 PM Maxime Coquelin <maxime.coquelin@redhat.com>
wrote:

> At runtime either packed Tx/Rx functions will always be called,
> or split Tx/Rx functions will always be called.
>
> This patch removes the forced inlining in order to reduce
> the I-cache pressure.
>

I just wonder if the compiler can't decide on its own to inline those
static functions.
We have __rte_noinline for this.


> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  lib/librte_vhost/virtio_net.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index a6a33a1013..35ae4992c2 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -771,7 +771,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>         return error;
>  }
>
> -static __rte_always_inline uint32_t
> +static uint32_t
>  virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>         struct rte_mbuf **pkts, uint32_t count)
>  {
> @@ -830,7 +830,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>         return pkt_idx;
>  }
>
> -static __rte_always_inline uint32_t
> +static uint32_t
>  virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>         struct rte_mbuf **pkts, uint32_t count)
>  {
> @@ -1300,7 +1300,7 @@ get_zmbuf(struct vhost_virtqueue *vq)
>         return NULL;
>  }
>
> -static __rte_always_inline uint16_t
> +static uint16_t
>  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>         struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t
> count)
>  {
> @@ -1422,7 +1422,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>         return i;
>  }
>
> -static __rte_always_inline uint16_t
> +static uint16_t
>  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>         struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t
> count)
>  {
> --
> 2.21.0
>
>

-- 
David Marchand


* Re: [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations
  2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (4 preceding siblings ...)
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
@ 2019-05-17 13:04 ` David Marchand
  2019-05-17 14:42   ` Maxime Coquelin
  5 siblings, 1 reply; 15+ messages in thread
From: David Marchand @ 2019-05-17 13:04 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, Tiwei Bie, Jens Freimann, Zhihong Wang, Bruce Richardson,
	Ananyev, Konstantin

On Fri, May 17, 2019 at 2:23 PM Maxime Coquelin <maxime.coquelin@redhat.com>
wrote:

> Some OVS-DPDK PVP benchmarks show a performance drop
> when switching from DPDK v17.11 to v18.11.
>
> With the addition of packed ring layout support,
> rte_vhost_enqueue_burst and rte_vhost_dequeue_burst
> became very large, while only part of their instructions
> are executed (a ring is either packed or split).
>
> This series aims at reducing the I-cache pressure,
> first by un-inlining the split and packed ring functions,
> and by moving parts considered cold into dedicated
> functions (dirty page logging, and the fragmented descriptor
> buffer handling added for CVE-2018-1059).
>
> With the series applied, the sizes of the enqueue and
> dequeue split paths are reduced significantly:
>
> +---------+--------------------+---------------------+
> | Version | Enqueue split path |  Dequeue split path |
> +---------+--------------------+---------------------+
> | v19.05  | 16461B             | 25521B              |
> | +series | 7286B              | 11285B              |
> +---------+--------------------+---------------------+
>
> Using the perf tool to monitor the iTLB-load-misses event
> while running a PVP benchmark with testpmd as the vswitch,
> we can see the number of iTLB misses is reduced:
>
> - v19.05:
> # perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10
>
>  Performance counter stats for 'CPU(s) 2,3' (10 runs):
>
>              2,438      iTLB-load-miss                  ( +- 13.43% )
>
>        10.00058928 +- 0.00000336 seconds time elapsed  ( +-  0.00% )
>
> - +series:
> # perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10
>
>  Performance counter stats for 'CPU(s) 2,3' (10 runs):
>
>                 55      iTLB-load-miss                  ( +- 10.08% )
>
>        10.00059466 +- 0.00000283 seconds time elapsed  ( +-  0.00% )
>
> The series also forces the inlining of some rte_memcpy
> helpers: with the addition of packed ring support, some of
> them were no longer inlined but emitted as functions in
> the virtio_net object file, which was not expected.
>
> Finally, the series simplifies the prefetching of the
> descriptors' buffers, by doing it in the recently introduced
> descriptor buffer mapping function.
>
> Maxime Coquelin (4):
>   vhost: un-inline dirty pages logging functions
>   vhost: do not inline packed and split functions
>   vhost: do not inline unlikely fragmented buffers code
>   vhost: simplify descriptor's buffer prefetching
>
> root (1):
>   eal/x86: force inlining of all memcpy and mov helpers
>

root ? "oops" :-)


-- 
David Marchand


* Re: [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions
  2019-05-17 13:00   ` David Marchand
@ 2019-05-17 14:42     ` Maxime Coquelin
  0 siblings, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 14:42 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Tiwei Bie, Jens Freimann, Zhihong Wang, Bruce Richardson,
	Ananyev, Konstantin



On 5/17/19 3:00 PM, David Marchand wrote:
> 
> On Fri, May 17, 2019 at 2:23 PM Maxime Coquelin 
> <maxime.coquelin@redhat.com <mailto:maxime.coquelin@redhat.com>> wrote:
> 
>     At runtime either packed Tx/Rx functions will always be called,
>     or split Tx/Rx functions will always be called.
> 
>     This patch removes the forced inlining in order to reduce
>     the I-cache pressure.
> 
> 
> I just wonder if the compiler can't decide on its own to inline those 
> static functions.
> We have __rte_noinline for this.

Good idea, I think it did not happen in my case because the compiler
would find the functions too large to be inlined.

I'll fix that in v2.

Thanks,
Maxime


* Re: [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations
  2019-05-17 13:04 ` [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations David Marchand
@ 2019-05-17 14:42   ` Maxime Coquelin
  0 siblings, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-17 14:42 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Tiwei Bie, Jens Freimann, Zhihong Wang, Bruce Richardson,
	Ananyev, Konstantin



On 5/17/19 3:04 PM, David Marchand wrote:
> 
> 
> On Fri, May 17, 2019 at 2:23 PM Maxime Coquelin 
> <maxime.coquelin@redhat.com <mailto:maxime.coquelin@redhat.com>> wrote:
> 
>     Some OVS-DPDK PVP benchmarks show a performance drop
>     when switching from DPDK v17.11 to v18.11.
> 
>     With the addition of packed ring layout support,
>     rte_vhost_enqueue_burst and rte_vhost_dequeue_burst
>     became very large, while only part of their instructions
>     are executed (a ring is either packed or split).
> 
>     This series aims at reducing the I-cache pressure,
>     first by un-inlining the split and packed ring functions,
>     and by moving parts considered cold into dedicated
>     functions (dirty page logging, and the fragmented descriptor
>     buffer handling added for CVE-2018-1059).
> 
>     With the series applied, the sizes of the enqueue and
>     dequeue split paths are reduced significantly:
> 
>     +---------+--------------------+---------------------+
>     | Version | Enqueue split path |  Dequeue split path |
>     +---------+--------------------+---------------------+
>     | v19.05  | 16461B             | 25521B              |
>     | +series | 7286B              | 11285B              |
>     +---------+--------------------+---------------------+
> 
>     Using the perf tool to monitor the iTLB-load-misses event
>     while running a PVP benchmark with testpmd as the vswitch,
>     we can see the number of iTLB misses is reduced:
> 
>     - v19.05:
>     # perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10
> 
>       Performance counter stats for 'CPU(s) 2,3' (10 runs):
> 
>                   2,438      iTLB-load-miss      ( +- 13.43% )
> 
>             10.00058928 +- 0.00000336 seconds time elapsed  ( +-  0.00% )
> 
>     - +series:
>     # perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10
> 
>       Performance counter stats for 'CPU(s) 2,3' (10 runs):
> 
>                      55      iTLB-load-miss      ( +- 10.08% )
> 
>             10.00059466 +- 0.00000283 seconds time elapsed  ( +-  0.00% )
> 
>     The series also forces the inlining of some rte_memcpy
>     helpers: with the addition of packed ring support, some of
>     them were no longer inlined but emitted as functions in
>     the virtio_net object file, which was not expected.
> 
>     Finally, the series simplifies the prefetching of the
>     descriptors' buffers, by doing it in the recently introduced
>     descriptor buffer mapping function.
> 
>     Maxime Coquelin (4):
>        vhost: un-inline dirty pages logging functions
>        vhost: do not inline packed and split functions
>        vhost: do not inline unlikely fragmented buffers code
>        vhost: simplify descriptor's buffer prefetching
> 
>     root (1):
>        eal/x86: force inlining of all memcpy and mov helpers
> 
> 
> root ? "oops" :-)

Indeed... Oops!

> 
> 
> -- 
> David Marchand


* Re: [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-17 12:22 ` [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
  2019-05-17 12:57   ` Maxime Coquelin
@ 2019-05-21 19:43   ` Mattias Rönnblom
  2019-05-23 14:30     ` Maxime Coquelin
  1 sibling, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2019-05-21 19:43 UTC (permalink / raw)
  To: Maxime Coquelin, dev, tiwei.bie, jfreimann, zhihong.wang,
	bruce.richardson, konstantin.ananyev

On 2019-05-17 14:22, Maxime Coquelin wrote:
> Handling of fragmented virtio-net headers and indirect
> descriptor tables was implemented to fix CVE-2018-1059. It
> should never happen with healthy guests, and is already
> considered an unlikely code path.
> 
> This patch moves these bits into non-inline dedicated functions
> to reduce the I-cache pressure.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>   lib/librte_vhost/vhost.c      |  33 +++++++++++
>   lib/librte_vhost/vhost.h      |  35 +-----------
>   lib/librte_vhost/virtio_net.c | 102 +++++++++++++++++++---------------
>   3 files changed, 91 insertions(+), 79 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 4a54ad6bd1..8a4379bc13 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   }
>   
>   
> +void *
> +alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,

This function should have a prefix.

> +		uint64_t desc_addr, uint64_t desc_len)
> +{
> +	void *idesc;
> +	uint64_t src, dst;
> +	uint64_t len, remain = desc_len;
> +
> +	idesc = rte_malloc(__func__, desc_len, 0);
> +	if (unlikely(!idesc))

if (idesc == NULL)

> +		return NULL;
> +
> +	dst = (uint64_t)(uintptr_t)idesc;
> +
> +	while (remain) {
remain > 0
> +		len = remain;
> +		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
> +				VHOST_ACCESS_RO);
> +		if (unlikely(!src || !len)) {
> +			rte_free(idesc);
> +			return NULL;
> +		}
> +
> +		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);

Just for my understanding: what difference does that (uintptr_t) cast do?

> +
> +		remain -= len;
> +		dst += len;
> +		desc_addr += len;
> +	}
> +
> +	return idesc;
> +}
> +
>   void
>   cleanup_vq(struct vhost_virtqueue *vq, int destroy)
>   {
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 3ab7b4950f..ab26454e1c 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -488,6 +488,8 @@ void vhost_backend_cleanup(struct virtio_net *dev);
>   
>   uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   			uint64_t iova, uint64_t *len, uint8_t perm);
> +void *alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +			uint64_t desc_addr, uint64_t desc_len);
>   int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
>   void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);
>   
> @@ -601,39 +603,6 @@ vhost_vring_call_packed(struct virtio_net *dev, struct vhost_virtqueue *vq)
>   		eventfd_write(vq->callfd, (eventfd_t)1);
>   }
>   
> -static __rte_always_inline void *
> -alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -		uint64_t desc_addr, uint64_t desc_len)
> -{
> -	void *idesc;
> -	uint64_t src, dst;
> -	uint64_t len, remain = desc_len;
> -
> -	idesc = rte_malloc(__func__, desc_len, 0);
> -	if (unlikely(!idesc))
> -		return 0;
> -
> -	dst = (uint64_t)(uintptr_t)idesc;
> -
> -	while (remain) {
> -		len = remain;
> -		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
> -				VHOST_ACCESS_RO);
> -		if (unlikely(!src || !len)) {
> -			rte_free(idesc);
> -			return 0;
> -		}
> -
> -		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
> -
> -		remain -= len;
> -		dst += len;
> -		desc_addr += len;
> -	}
> -
> -	return idesc;
> -}
> -
>   static __rte_always_inline void
>   free_ind_table(void *idesc)
>   {
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 35ae4992c2..494dd9957e 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -610,6 +610,35 @@ reserve_avail_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   	return 0;
>   }
>   
> +static void
> +copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,

__rte_noinline? Or you don't care about this function being inlined or not?

> +		struct buf_vector *buf_vec,
> +		struct virtio_net_hdr_mrg_rxbuf *hdr){
> +	uint64_t len;
> +	uint64_t remain = dev->vhost_hlen;
> +	uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
> +	uint64_t iova = buf_vec->buf_iova;
> +
> +	while (remain) {
remain > 0
> +		len = RTE_MIN(remain,
> +				buf_vec->buf_len);
> +		dst = buf_vec->buf_addr;
> +		rte_memcpy((void *)(uintptr_t)dst,
> +				(void *)(uintptr_t)src,
> +				len);
> +
> +		PRINT_PACKET(dev, (uintptr_t)dst,
> +				(uint32_t)len, 0);
> +		vhost_log_cache_write(dev, vq,
> +				iova, len);
> +
> +		remain -= len;
> +		iova += len;
> +		src += len;
> +		buf_vec++;
> +	}
> +}
> +
>   static __rte_always_inline int
>   copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   			    struct rte_mbuf *m, struct buf_vector *buf_vec,
> @@ -703,30 +732,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   						num_buffers);
>   
>   			if (unlikely(hdr == &tmp_hdr)) {
> -				uint64_t len;
> -				uint64_t remain = dev->vhost_hlen;
> -				uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
> -				uint64_t iova = buf_vec[0].buf_iova;
> -				uint16_t hdr_vec_idx = 0;
> -
> -				while (remain) {
> -					len = RTE_MIN(remain,
> -						buf_vec[hdr_vec_idx].buf_len);
> -					dst = buf_vec[hdr_vec_idx].buf_addr;
> -					rte_memcpy((void *)(uintptr_t)dst,
> -							(void *)(uintptr_t)src,
> -							len);
> -
> -					PRINT_PACKET(dev, (uintptr_t)dst,
> -							(uint32_t)len, 0);
> -					vhost_log_cache_write(dev, vq,
> -							iova, len);
> -
> -					remain -= len;
> -					iova += len;
> -					src += len;
> -					hdr_vec_idx++;
> -				}
> +				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
>   			} else {
>   				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
>   						dev->vhost_hlen, 0);
> @@ -1063,6 +1069,31 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
>   	}
>   }
>   
> +static void
> +copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
> +		struct buf_vector *buf_vec)

__rte_noinline?

> +{
> +	uint64_t len;
> +	uint64_t remain = sizeof(struct virtio_net_hdr);
> +	uint64_t src;
> +	uint64_t dst = (uint64_t)(uintptr_t)hdr;
> +
> +	/*
> +	 * No luck, the virtio-net header doesn't fit
> +	 * in a contiguous virtual area.
> +	 */
> +	while (remain) {
> +		len = RTE_MIN(remain, buf_vec->buf_len);
> +		src = buf_vec->buf_addr;
> +		rte_memcpy((void *)(uintptr_t)dst,
> +				(void *)(uintptr_t)src, len);
> +
> +		remain -= len;
> +		dst += len;
> +		buf_vec++;
> +	}
> +}
> +
>   static __rte_always_inline int
>   copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   		  struct buf_vector *buf_vec, uint16_t nr_vec,
> @@ -1094,28 +1125,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>   
>   	if (virtio_net_with_host_offload(dev)) {
>   		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
> -			uint64_t len;
> -			uint64_t remain = sizeof(struct virtio_net_hdr);
> -			uint64_t src;
> -			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
> -			uint16_t hdr_vec_idx = 0;
> -
> -			/*
> -			 * No luck, the virtio-net header doesn't fit
> -			 * in a contiguous virtual area.
> -			 */
> -			while (remain) {
> -				len = RTE_MIN(remain,
> -					buf_vec[hdr_vec_idx].buf_len);
> -				src = buf_vec[hdr_vec_idx].buf_addr;
> -				rte_memcpy((void *)(uintptr_t)dst,
> -						   (void *)(uintptr_t)src, len);
> -
> -				remain -= len;
> -				dst += len;
> -				hdr_vec_idx++;
> -			}
> -
> +			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
>   			hdr = &tmp_hdr;
>   		} else {
>   			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-21 19:43   ` Mattias Rönnblom
@ 2019-05-23 14:30     ` Maxime Coquelin
  2019-05-23 15:17       ` Mattias Rönnblom
  0 siblings, 1 reply; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-23 14:30 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, tiwei.bie, jfreimann, zhihong.wang,
	bruce.richardson, konstantin.ananyev

Hi Mattias,

On 5/21/19 9:43 PM, Mattias Rönnblom wrote:
> On 2019-05-17 14:22, Maxime Coquelin wrote:
>> Handling of fragmented virtio-net header and indirect descriptors
>> tables was implemented to fix CVE-2018-1059. It should never
>> happen with healthy guests and so is already considered an
>> unlikely code path.
>>
>> This patch moves these bits into non-inline dedicated functions
>> to reduce the I-cache pressure.
>>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>> ---
>>   lib/librte_vhost/vhost.c      |  33 +++++++++++
>>   lib/librte_vhost/vhost.h      |  35 +-----------
>>   lib/librte_vhost/virtio_net.c | 102 +++++++++++++++++++---------------
>>   3 files changed, 91 insertions(+), 79 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
>> index 4a54ad6bd1..8a4379bc13 100644
>> --- a/lib/librte_vhost/vhost.c
>> +++ b/lib/librte_vhost/vhost.c
>> @@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, 
>> struct vhost_virtqueue *vq,
>>   }
>> +void *
>> +alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
> 
> This function should have a prefix.

This function is just moved from vhost.h to vhost.c; renaming it is not
the purpose of this patch.

But I agree with your comment; I'll send a patch to add a prefix.

> 
>> +        uint64_t desc_addr, uint64_t desc_len)
>> +{
>> +    void *idesc;
>> +    uint64_t src, dst;
>> +    uint64_t len, remain = desc_len;
>> +
>> +    idesc = rte_malloc(__func__, desc_len, 0);
>> +    if (unlikely(!idesc))
> 
> if (idesc == NULL)

Ditto, that is not the purpose of this patch, which is just moving the
function.

I agree this does not match the coding rules specified in the
documentation, though.

> 
>> +        return NULL;
>> +
>> +    dst = (uint64_t)(uintptr_t)idesc;
>> +
>> +    while (remain) {
> remain > 0

Ditto.

>> +        len = remain;
>> +        src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
>> +                VHOST_ACCESS_RO);
>> +        if (unlikely(!src || !len)) {
>> +            rte_free(idesc);
>> +            return NULL;
>> +        }
>> +
>> +        rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
> 
> Just for my understanding: what difference does that (uintptr_t) cast do?

This is required to build on 32 bits (-Werror=int-to-pointer-cast).
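To illustrate outside of the patch context (the helper name is invented
for the example): on 32-bit targets sizeof(void *) != sizeof(uint64_t),
so a direct cast is flagged as an integer-to-pointer cast of different
size. Narrowing through uintptr_t first keeps both builds happy:

#include <stdint.h>

/* Hypothetical helper, illustration only, not from the patch. */
static void *
guest_addr_to_ptr(uint64_t addr)
{
	/* Fine on both 32- and 64-bit builds. */
	return (void *)(uintptr_t)addr;
	/* return (void *)addr; would fail on 32-bit with
	 * -Werror=int-to-pointer-cast, as the sizes differ. */
}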

>> +
>> +        remain -= len;
>> +        dst += len;
>> +        desc_addr += len;
>> +    }
>> +
>> +    return idesc;
>> +}
>> +
>>   void
>>   cleanup_vq(struct vhost_virtqueue *vq, int destroy)
>>   {
>> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
>> index 3ab7b4950f..ab26454e1c 100644
>> --- a/lib/librte_vhost/vhost.h
>> +++ b/lib/librte_vhost/vhost.h
>> @@ -488,6 +488,8 @@ void vhost_backend_cleanup(struct virtio_net *dev);
>>   uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct 
>> vhost_virtqueue *vq,
>>               uint64_t iova, uint64_t *len, uint8_t perm);
>> +void *alloc_copy_ind_table(struct virtio_net *dev, struct 
>> vhost_virtqueue *vq,
>> +            uint64_t desc_addr, uint64_t desc_len);
>>   int vring_translate(struct virtio_net *dev, struct vhost_virtqueue 
>> *vq);
>>   void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue 
>> *vq);
>> @@ -601,39 +603,6 @@ vhost_vring_call_packed(struct virtio_net *dev, 
>> struct vhost_virtqueue *vq)
>>           eventfd_write(vq->callfd, (eventfd_t)1);
>>   }
>> -static __rte_always_inline void *
>> -alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> -        uint64_t desc_addr, uint64_t desc_len)
>> -{
>> -    void *idesc;
>> -    uint64_t src, dst;
>> -    uint64_t len, remain = desc_len;
>> -
>> -    idesc = rte_malloc(__func__, desc_len, 0);
>> -    if (unlikely(!idesc))
>> -        return 0;
>> -
>> -    dst = (uint64_t)(uintptr_t)idesc;
>> -
>> -    while (remain) {
>> -        len = remain;
>> -        src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
>> -                VHOST_ACCESS_RO);
>> -        if (unlikely(!src || !len)) {
>> -            rte_free(idesc);
>> -            return 0;
>> -        }
>> -
>> -        rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
>> -
>> -        remain -= len;
>> -        dst += len;
>> -        desc_addr += len;
>> -    }
>> -
>> -    return idesc;
>> -}
>> -
>>   static __rte_always_inline void
>>   free_ind_table(void *idesc)
>>   {
>> diff --git a/lib/librte_vhost/virtio_net.c 
>> b/lib/librte_vhost/virtio_net.c
>> index 35ae4992c2..494dd9957e 100644
>> --- a/lib/librte_vhost/virtio_net.c
>> +++ b/lib/librte_vhost/virtio_net.c
>> @@ -610,6 +610,35 @@ reserve_avail_buf_packed(struct virtio_net *dev, 
>> struct vhost_virtqueue *vq,
>>       return 0;
>>   }
>> +static void
>> +copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue 
>> *vq,
> 
> __rte_noinline? Or do you not care whether this function gets inlined?

Right, I'll add it here and there in the next revision.
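For reference, a sketch of the intended shape (assuming the
__rte_noinline hint maps to the compiler's noinline attribute; the
fallback define below is illustrative, not part of this series):

#ifndef __rte_noinline
#define __rte_noinline __attribute__((noinline)) /* assumed fallback */
#endif

static __rte_noinline void
copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
		struct buf_vector *buf_vec,
		struct virtio_net_hdr_mrg_rxbuf *hdr);

The hint keeps this cold, CVE-hardening path out of line even at high
optimization levels, so it no longer inflates the I-cache footprint of
the hot enqueue path.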

I'll try to send a patch to fix the kind of style issues you reported.
If you would rather do it yourself, that would be great; just let me know.

Thanks,
Maxime

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-23 14:30     ` Maxime Coquelin
@ 2019-05-23 15:17       ` Mattias Rönnblom
  2019-05-23 17:40         ` Maxime Coquelin
  0 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2019-05-23 15:17 UTC (permalink / raw)
  To: Maxime Coquelin, dev, tiwei.bie, jfreimann, zhihong.wang,
	bruce.richardson, konstantin.ananyev

On 2019-05-23 16:30, Maxime Coquelin wrote:
> Hi Mattias,
> 
> On 5/21/19 9:43 PM, Mattias Rönnblom wrote:
>> On 2019-05-17 14:22, Maxime Coquelin wrote:
>>> Handling of fragmented virtio-net header and indirect descriptors
>>> tables was implemented to fix CVE-2018-1059. It should never
>>> happen with healthy guests and so is already considered an
>>> unlikely code path.
>>>
>>> This patch moves these bits into non-inline dedicated functions
>>> to reduce the I-cache pressure.
>>>
>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> ---
>>>   lib/librte_vhost/vhost.c      |  33 +++++++++++
>>>   lib/librte_vhost/vhost.h      |  35 +-----------
>>>   lib/librte_vhost/virtio_net.c | 102 +++++++++++++++++++---------------
>>>   3 files changed, 91 insertions(+), 79 deletions(-)
>>>
>>> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
>>> index 4a54ad6bd1..8a4379bc13 100644
>>> --- a/lib/librte_vhost/vhost.c
>>> +++ b/lib/librte_vhost/vhost.c
>>> @@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, 
>>> struct vhost_virtqueue *vq,
>>>   }
>>> +void *
>>> +alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue 
>>> *vq,
>>
>> This function should have a prefix.
> 
> This function is just moved from vhost.h to vhost.c; renaming it is not
> the purpose of this patch.
> 

It was declared "static inline" in the header file, and thus only 
affected those who included the file, as opposed to polluting the whole 
DPDK library namespace.
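A toy illustration of the difference (function names invented):

/* In a header: each including file gets its own private copy and no
 * symbol is exported from the library. */
static inline int add_one(int x) { return x + 1; }

/* In a .c file: without 'static', the name becomes a global symbol of
 * librte_vhost, visible to every application linking against it,
 * hence the need for a vhost_ prefix. */
int vhost_add_one(int x) { return x + 1; }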

> But I agree with your comment; I'll send a patch to add a prefix.
> 
>>
>>> +        uint64_t desc_addr, uint64_t desc_len)
>>> +{
>>> +    void *idesc;
>>> +    uint64_t src, dst;
>>> +    uint64_t len, remain = desc_len;
>>> +
>>> +    idesc = rte_malloc(__func__, desc_len, 0);
>>> +    if (unlikely(!idesc))
>>
>> if (idesc == NULL)
> 
> Ditto, that is not the purpose of this patch, which is just moving the
> function.
> 
> I agree this does not match the coding rules specified in the
> documentation, though.
> 
>>
>>> +        return NULL;
>>> +
>>> +    dst = (uint64_t)(uintptr_t)idesc;
>>> +
>>> +    while (remain) {
>> remain > 0
> 
> Ditto.
> 
>>> +        len = remain;
>>> +        src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
>>> +                VHOST_ACCESS_RO);
>>> +        if (unlikely(!src || !len)) {
>>> +            rte_free(idesc);
>>> +            return NULL;
>>> +        }
>>> +
>>> +        rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, 
>>> len);
>>
>> Just for my understanding: what difference does that (uintptr_t) cast do?
> 
> This is required to build on 32 bits (-Werror=int-to-pointer-cast).
> 

Ah. Thanks.

>>> +
>>> +        remain -= len;
>>> +        dst += len;
>>> +        desc_addr += len;
>>> +    }
>>> +
>>> +    return idesc;
>>> +}
>>> +
>>>   void
>>>   cleanup_vq(struct vhost_virtqueue *vq, int destroy)
>>>   {
>>> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
>>> index 3ab7b4950f..ab26454e1c 100644
>>> --- a/lib/librte_vhost/vhost.h
>>> +++ b/lib/librte_vhost/vhost.h
>>> @@ -488,6 +488,8 @@ void vhost_backend_cleanup(struct virtio_net *dev);
>>>   uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct 
>>> vhost_virtqueue *vq,
>>>               uint64_t iova, uint64_t *len, uint8_t perm);
>>> +void *alloc_copy_ind_table(struct virtio_net *dev, struct 
>>> vhost_virtqueue *vq,
>>> +            uint64_t desc_addr, uint64_t desc_len);
>>>   int vring_translate(struct virtio_net *dev, struct vhost_virtqueue 
>>> *vq);
>>>   void vring_invalidate(struct virtio_net *dev, struct 
>>> vhost_virtqueue *vq);
>>> @@ -601,39 +603,6 @@ vhost_vring_call_packed(struct virtio_net *dev, 
>>> struct vhost_virtqueue *vq)
>>>           eventfd_write(vq->callfd, (eventfd_t)1);
>>>   }
>>> -static __rte_always_inline void *
>>> -alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue 
>>> *vq,
>>> -        uint64_t desc_addr, uint64_t desc_len)
>>> -{
>>> -    void *idesc;
>>> -    uint64_t src, dst;
>>> -    uint64_t len, remain = desc_len;
>>> -
>>> -    idesc = rte_malloc(__func__, desc_len, 0);
>>> -    if (unlikely(!idesc))
>>> -        return 0;
>>> -
>>> -    dst = (uint64_t)(uintptr_t)idesc;
>>> -
>>> -    while (remain) {
>>> -        len = remain;
>>> -        src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
>>> -                VHOST_ACCESS_RO);
>>> -        if (unlikely(!src || !len)) {
>>> -            rte_free(idesc);
>>> -            return 0;
>>> -        }
>>> -
>>> -        rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, 
>>> len);
>>> -
>>> -        remain -= len;
>>> -        dst += len;
>>> -        desc_addr += len;
>>> -    }
>>> -
>>> -    return idesc;
>>> -}
>>> -
>>>   static __rte_always_inline void
>>>   free_ind_table(void *idesc)
>>>   {
>>> diff --git a/lib/librte_vhost/virtio_net.c 
>>> b/lib/librte_vhost/virtio_net.c
>>> index 35ae4992c2..494dd9957e 100644
>>> --- a/lib/librte_vhost/virtio_net.c
>>> +++ b/lib/librte_vhost/virtio_net.c
>>> @@ -610,6 +610,35 @@ reserve_avail_buf_packed(struct virtio_net *dev, 
>>> struct vhost_virtqueue *vq,
>>>       return 0;
>>>   }
>>> +static void
>>> +copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue 
>>> *vq,
>>
>> __rte_noinline? Or do you not care whether this function gets
>> inlined?
> 
> Right, I'll add it here and there in the next revision.
> 
> I'll try to send a patch to fix the kind of style issues you reported.
> If you would rather do it yourself, that would be great; just let me know.
> 

I just figured it made sense to address some style issues when you were 
shuffling things around.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-23 15:17       ` Mattias Rönnblom
@ 2019-05-23 17:40         ` Maxime Coquelin
  0 siblings, 0 replies; 15+ messages in thread
From: Maxime Coquelin @ 2019-05-23 17:40 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, tiwei.bie, jfreimann, zhihong.wang,
	bruce.richardson, konstantin.ananyev



On 5/23/19 5:17 PM, Mattias Rönnblom wrote:
> On 2019-05-23 16:30, Maxime Coquelin wrote:
>> Hi Mattias,
>>
>> On 5/21/19 9:43 PM, Mattias Rönnblom wrote:
>>> On 2019-05-17 14:22, Maxime Coquelin wrote:
>>>> Handling of fragmented virtio-net header and indirect descriptors
>>>> tables was implemented to fix CVE-2018-1059. It should never
>>>> happen with healthy guests and so is already considered an
>>>> unlikely code path.
>>>>
>>>> This patch moves these bits into non-inline dedicated functions
>>>> to reduce the I-cache pressure.
>>>>
>>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>> ---
>>>>   lib/librte_vhost/vhost.c      |  33 +++++++++++
>>>>   lib/librte_vhost/vhost.h      |  35 +-----------
>>>>   lib/librte_vhost/virtio_net.c | 102 
>>>> +++++++++++++++++++---------------
>>>>   3 files changed, 91 insertions(+), 79 deletions(-)
>>>>
>>>> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
>>>> index 4a54ad6bd1..8a4379bc13 100644
>>>> --- a/lib/librte_vhost/vhost.c
>>>> +++ b/lib/librte_vhost/vhost.c
>>>> @@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, 
>>>> struct vhost_virtqueue *vq,
>>>>   }
>>>> +void *
>>>> +alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue 
>>>> *vq,
>>>
>>> This function should have a prefix.
>>
>> This function is just moved from vhost.h to vhost.c; renaming it is not
>> the purpose of this patch.
>>
> 
> It was declared "static inline" in the header file, and thus only 
> affected those who included the file, as opposed to polluting the whole 
> DPDK library namespace.

Right, I'll fix the name in next revision.

Thanks,
Maxime

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-05-23 17:40 UTC | newest]

Thread overview: 15+ messages
-- links below jump to the message on this page --
2019-05-17 12:22 [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
2019-05-17 12:22 ` [dpdk-dev] [PATCH 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
2019-05-17 12:22 ` [dpdk-dev] [PATCH 2/5] vhost: do not inline packed and split functions Maxime Coquelin
2019-05-17 13:00   ` David Marchand
2019-05-17 14:42     ` Maxime Coquelin
2019-05-17 12:22 ` [dpdk-dev] [PATCH 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
2019-05-17 12:57   ` Maxime Coquelin
2019-05-21 19:43   ` Mattias Rönnblom
2019-05-23 14:30     ` Maxime Coquelin
2019-05-23 15:17       ` Mattias Rönnblom
2019-05-23 17:40         ` Maxime Coquelin
2019-05-17 12:22 ` [dpdk-dev] [PATCH 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
2019-05-17 12:22 ` [dpdk-dev] [PATCH 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
2019-05-17 13:04 ` [dpdk-dev] [PATCH 0/5] vhost: I-cache pressure optimizations David Marchand
2019-05-17 14:42   ` Maxime Coquelin
