* [PATCH 0/8] Support vector instructions on ICE
@ 2019-02-28  7:48 Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 1/8] net/ice: fix TX function setting Wenzhuo Lu
                   ` (14 more replies)
  0 siblings, 15 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in the ICE RX and TX paths.
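
The vector paths are chosen at run time per device. Below is a
condensed, illustrative sketch of that selection (scattered RX is
omitted for brevity; the real logic lives in ice_set_rx_function()
and ice_set_tx_function() in ice_rxtx.c, and the function names
match this series):

    /* Sketch only -- mirrors, but simplifies, ice_set_rx_function(). */
    if (ice_rx_vec_dev_check(dev) == 0) {
            /* every RX queue satisfies the vector constraints */
            if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1)
                    dev->rx_pkt_burst = ice_recv_pkts_vec_avx2;
            else
                    dev->rx_pkt_burst = ice_recv_pkts_vec; /* SSE */
    } else {
            dev->rx_pkt_burst = ice_recv_pkts; /* scalar fallback */
    }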

Wenzhuo Lu (8):
  net/ice: fix TX function setting
  net/ice: add pointer for queue buffer release
  net/ice: support RX SSE vector
  net/ice: support RX scatter SSE vector
  net/ice: support TX SSE vector
  net/ice: support RX AVX2 vector
  net/ice: support RX scatter AVX2 vector
  net/ice: support TX AVX2 vector

 config/common_base                     |   1 +
 doc/guides/nics/features/ice_vec.ini   |  40 ++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             | 101 ++++-
 drivers/net/ice/ice_rxtx.h             |  39 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 764 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 288 +++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 663 ++++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  21 +
 12 files changed, 1935 insertions(+), 13 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3


* [PATCH 1/8] net/ice: fix TX function setting
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

The TX setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index a23c63a..b804be1 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3


* [PATCH 2/8] net/ice: add pointer for queue buffer release
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 1/8] net/ice: fix TX function setting Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 3/8] net/ice: support RX SSE vector Wenzhuo Lu
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of RX and
TX queues, as vector-specific release functions will be
added for RX and TX later in this series.
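
A minimal sketch of the callback pattern being introduced (illustrative
only; the actual typedefs and assignments are in the diff below, and
the vector release routine is added later in this series):

    typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);

    struct ice_rx_queue {
            /* ... existing fields ... */
            ice_rx_release_mbufs rx_rel_mbufs; /* per-queue callback */
    };

    /* queue setup installs the scalar release routine */
    rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
    /* a vector setup may override it with its own routine */
    rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
    /* callers always release mbufs through the pointer */
    rxq->rx_rel_mbufs(rxq);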

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..d540ed1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -366,7 +366,7 @@
 		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
 			    rx_queue_id);
 
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		return -EINVAL;
 	}
@@ -393,7 +393,7 @@
 				    rx_queue_id);
 			return -EINVAL;
 		}
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		dev->data->rx_queue_state[rx_queue_id] =
 			RTE_ETH_QUEUE_STATE_STOPPED;
@@ -555,7 +555,7 @@
 		return -EINVAL;
 	}
 
-	ice_tx_queue_release_mbufs(txq);
+	txq->tx_rel_mbufs(txq);
 	ice_reset_tx_queue(txq);
 	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
 
@@ -669,6 +669,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -701,7 +702,7 @@
 		return;
 	}
 
-	ice_rx_queue_release_mbufs(q);
+	q->rx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -866,6 +867,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
 
 	return 0;
 }
@@ -880,7 +882,7 @@
 		return;
 	}
 
-	ice_tx_queue_release_mbufs(q);
+	q->tx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -1552,18 +1554,22 @@
 void
 ice_clear_queues(struct rte_eth_dev *dev)
 {
+	struct ice_rx_queue *rxq;
+	struct ice_tx_queue *txq;
 	uint16_t i;
 
 	PMD_INIT_FUNC_TRACE();
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
-		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
-		ice_reset_tx_queue(dev->data->tx_queues[i]);
+		txq = dev->data->tx_queues[i];
+		txq->tx_rel_mbufs(txq);
+		ice_reset_tx_queue(txq);
 	}
 
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
-		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
-		ice_reset_rx_queue(dev->data->rx_queues[i]);
+		rxq = dev->data->rx_queues[i];
+		rxq->rx_rel_mbufs(rxq);
+		ice_reset_rx_queue(rxq);
 	}
 }
 
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..26380d3 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3


* [PATCH 3/8] net/ice: support RX SSE vector
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 1/8] net/ice: fix TX function setting Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-03-01  3:44   ` Zhang, Qi Z
  2019-02-28  7:48 ` [PATCH 4/8] net/ice: support RX scatter " Wenzhuo Lu
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 config/common_base                    |   1 +
 doc/guides/nics/features/ice_vec.ini  |  38 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  27 +-
 drivers/net/ice/ice_rxtx.h            |  21 ++
 drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 487 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   6 +
 10 files changed, 738 insertions(+), 4 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/config/common_base b/config/common_base
index 7c6da51..1d5ae2e 100644
--- a/config/common_base
+++ b/config/common_base
@@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
 CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
+CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
 
 # Compile burst-oriented AVF PMD driver
 #
diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1838f99
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,38 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+CRC offload          = Y
+VLAN offload         = Y
+QinQ offload         = Y
+L3 checksum offload  = Y
+L4 checksum offload  = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..33c7fc2 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index b804be1..8e7c7db 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d540ed1..543fefa 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -319,6 +317,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1490,6 +1491,12 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+
 	return NULL;
 }
 
@@ -2225,6 +2232,22 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 26380d3..2659176 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
 
@@ -52,6 +61,11 @@ struct ice_rx_queue {
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
 #endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +170,11 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
+#endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..73837f7
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned mask = rxq->nb_rx_desc - 1;
+	unsigned i;
+
+	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..d444be9
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,487 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+		  struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(
+			0x1c03804, 0x1c03804, 0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16(
+				0, 0, 0,    /* ignore non-length fields */
+				-rxq->crc_len, /* sub crc on data_len */
+				0,          /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				0, 0            /* ignore pkt_type field */
+			);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts has to be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8(
+		7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+		3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+		15, 14,      /* octet 15~14, 16 bits data_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		15, 14,      /* octet 15~14, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* pkt_type set as unknown */
+		0xFF, 0xFF  /*pkt_type set as unknown */
+		);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf point into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf points */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* B.1 load 2 mbuf point */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0x04, 0x0C, 0x00, 0x08
+					);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static void __attribute__((cold))
+ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	_ice_rx_queue_release_mbufs_vec(rxq);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..73122f8 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,9 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
+	dpdk_conf.set('RTE_LIBRTE_ICE_INC_VECTOR', 1)
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3


* [PATCH 4/8] net/ice: support RX scatter SSE vector
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (2 preceding siblings ...)
  2019-02-28  7:48 ` [PATCH 3/8] net/ice: support RX SSE vector Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 5/8] net/ice: support TX " Wenzhuo Lu
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 543fefa..55c8131 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1493,7 +1493,8 @@
 		return ptypes;
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 
@@ -2241,9 +2242,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2659176..aab4a3a 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -176,5 +176,7 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index d444be9..789cf07 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -464,6 +464,47 @@
 	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+					      split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+				      &split_flags[i]);
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
-- 
1.9.3


* [PATCH 5/8] net/ice: support TX SSE vector
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (3 preceding siblings ...)
  2019-02-28  7:48 ` [PATCH 4/8] net/ice: support RX scatter " Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 6/8] net/ice: support RX AVX2 vector Wenzhuo Lu
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  17 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
 5 files changed, 291 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1838f99..3b5f11d 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -27,6 +28,7 @@ L3 checksum offload  = Y
 L4 checksum offload  = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 55c8131..b6c9618 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2332,6 +2332,23 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index aab4a3a..02bb57e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -173,10 +173,14 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index 73837f7..8796ecb 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct ice_tx_entry *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -101,6 +168,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (!txq || !txq->sw_ring) {
+		PMD_DRV_LOG(DEBUG, "Pointer to txq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so need to free remains more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -137,6 +232,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -152,4 +270,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 789cf07..6babb8d 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -505,12 +505,131 @@
 				      &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing the tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reach the end of ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
 	_ice_rx_queue_release_mbufs_vec(rxq);
 }
 
+static void __attribute__((cold))
+ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	_ice_tx_queue_release_mbufs_vec(txq);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -522,7 +641,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue __rte_unused *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3


* [PATCH 6/8] net/ice: support RX AVX2 vector
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (4 preceding siblings ...)
  2019-02-28  7:48 ` [PATCH 5/8] net/ice: support TX " Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 7/8] net/ice: support RX scatter " Wenzhuo Lu
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  17 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 548 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 598 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 33c7fc2..e1cb632 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index b6c9618..342e8f1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1494,7 +1494,8 @@
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2236,21 +2237,31 @@ void __attribute__((cold))
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+#endif
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 02bb57e..63c552c 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -182,5 +182,7 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..f9cce2e
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,548 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/*
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 = _mm256_inserti128_si256(
+				_mm256_castsi128_si256(vaddr0), vaddr1, 1);
+		vaddr2_3 = _mm256_inserti128_si256(
+				_mm256_castsi128_si256(vaddr2), vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust = _mm256_set_epi16(
+			/* first descriptor */
+			0, 0, 0,       /* ignore non-length fields */
+			-rxq->crc_len, /* sub crc on data_len */
+			0,             /* ignore high-16bits of pkt_len */
+			-rxq->crc_len, /* sub crc on pkt_len */
+			0, 0,          /* ignore pkt_type field */
+			/* second descriptor */
+			0, 0, 0,       /* ignore non-length fields */
+			-rxq->crc_len, /* sub crc on data_len */
+			0,             /* ignore high-16bits of pkt_len */
+			-rxq->crc_len, /* sub crc on pkt_len */
+			0, 0           /* ignore pkt_type field */
+	);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk = _mm256_set_epi8(
+			/* first descriptor */
+			7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			15, 14,      /* octet 15~14, 16 bits data_len */
+			0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			0xFF, 0xFF,  /*pkt_type set as unknown */
+			/* second descriptor */
+			7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			15, 14,      /* octet 15~14, 16 bits data_len */
+			0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			0xFF, 0xFF   /*pkt_type set as unknown */
+	);
+	/*
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/*
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask = _mm256_set1_epi32(
+			(1 << 2) | (1 << 11) | (3 << 12) | (7 << 22));
+	/*
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf = _mm256_set_epi32(
+			0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+			0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/*
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf = _mm256_set_epi8(
+			0, 0, 0, 0, 0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0, /* end up 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	/*
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits.  This gives us the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask = _mm256_set1_epi32(
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+	for (i = 0, received = 0; i < nb_pkts;
+			i += ICE_DESCS_PER_LOOP_AVX,
+			rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				_mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256((void *)&rx_pkts[i + 4],
+				_mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		do {
+			const __m128i raw_desc7 = _mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 = _mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 = _mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 = _mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 = _mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 = _mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 = _mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 = _mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 = _mm256_inserti128_si256(
+					_mm256_castsi128_si256(raw_desc6), raw_desc7, 1);
+			raw_desc4_5 = _mm256_inserti128_si256(
+					_mm256_castsi128_si256(raw_desc4), raw_desc5, 1);
+			raw_desc2_3 = _mm256_inserti128_si256(
+					_mm256_castsi128_si256(raw_desc2), raw_desc3, 1);
+			raw_desc0_1 = _mm256_inserti128_si256(
+					_mm256_castsi128_si256(raw_desc0), raw_desc1, 1);
+		} while (0);
+
+		if (split_packet) {
+			int j;
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/*
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7, PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5, PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7, len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5, len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/*
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * and so ptype is in lower 8-bits in each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/*
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3, PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1, PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3, len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1, len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+				desc0_1);
+
+		/*
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+				status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits = _mm256_and_si256(
+				status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags = _mm256_shuffle_epi8(
+				vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags = _mm256_shuffle_epi8(
+				rss_flags_shuf, _mm256_srli_epi32(flag_bits, 11));
+		/*
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/*
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init data
+		 * so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend for
+		 * each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+				rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init, _mm256_slli_si256(mbuf_flags, 8), 0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init, _mm256_slli_si256(mbuf_flags, 4), 0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init, _mm256_srli_si256(mbuf_flags, 4), 0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data, rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data, rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data, rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data, rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags = _mm256_castsi128_si256(
+				_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init, _mm256_slli_si256(odd_flags, 8), 0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init, _mm256_slli_si256(odd_flags, 4), 0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init, _mm256_srli_si256(odd_flags, 4), 0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data, rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data, rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data, rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data, rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask = _mm_set1_epi16(
+					1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+					eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits = _mm_packus_epi32(
+					_mm256_castsi256_si128(eop_bits256),
+					_mm256_extractf128_si256(eop_bits256, 1));
+			/*
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit i.e. !EOP, rather than EOP one.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/*
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle = _mm_set_epi8(
+					0xFF, 0xFF, 0xFF, 0xFF, /* zero hi 64b */
+					0xFF, 0xFF, 0xFF, 0xFF,
+					8, 0, 10, 2, /* move values to lo 64b */
+					12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet = _mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+				_mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll(_mm_cvtsi128_si64(
+				_mm256_extracti128_si256(status0_7, 1)));
+		burst += __builtin_popcountll(_mm_cvtsi128_si64(
+				_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 73122f8..a1bd5b1 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -16,4 +16,19 @@ if arch_subdir == 'x86'
 	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
 	dpdk_conf.set('RTE_LIBRTE_ICE_INC_VECTOR', 1)
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. we have AVX supported in minimum instruction set baseline
+	# b. it's not minimum instruction set, but supported by compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
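
As a reading aid for the SIMD rearm block in this patch, a rough scalar
equivalent of one ice_rxq_rearm() refill cycle is sketched below. It is only
a sketch under the patch's own definitions (ice_rx_queue, sw_ring,
rxrearm_start, ICE_RXQ_REARM_THRESH, buf_physaddr); the allocation-failure
accounting, index wrap-around and QRX tail write are trimmed, and the
function name is illustrative.

static void
ice_rxq_rearm_scalar_sketch(struct ice_rx_queue *rxq)
{
	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rxrearm_start;
	int i;

	/* pull ICE_RXQ_REARM_THRESH fresh mbufs into the SW ring */
	if (rte_mempool_get_bulk(rxq->mp, (void *)rxep,
				 ICE_RXQ_REARM_THRESH) < 0)
		return; /* failure path omitted */

	for (i = 0; i < ICE_RXQ_REARM_THRESH; i++) {
		/* point the HW descriptor at the buffer, past the headroom;
		 * the vector code writes the same address to pkt_addr and
		 * hdr_addr with one 128-bit store
		 */
		uint64_t dma = rxep[i].mbuf->buf_physaddr +
			       RTE_PKTMBUF_HEADROOM;

		rxdp[i].read.hdr_addr = rte_cpu_to_le_64(dma);
		rxdp[i].read.pkt_addr = rte_cpu_to_le_64(dma);
	}
}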

* [PATCH 7/8] net/ice: support RX scatter AVX2 vector
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (5 preceding siblings ...)
  2019-02-28  7:48 ` [PATCH 6/8] net/ice: support RX AVX2 vector Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-02-28  7:48 ` [PATCH 8/8] net/ice: support TX " Wenzhuo Lu
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 63 +++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 342e8f1..465d389 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1495,7 +1495,8 @@
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2252,9 +2253,12 @@ void __attribute__((cold))
 #endif
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 63c552c..a918646 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -184,5 +184,8 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index f9cce2e..78dda2f 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -546,3 +546,66 @@
 {
 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles single burst of 32 scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl64[0] == 0 && split_fl64[1] == 0 &&
+			split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/*
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
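
For completeness, the scattered AVX2 routine above slots in behind the
normal burst API, so applications keep receiving fully chained mbufs and
need no changes. A minimal caller-side sketch follows (hypothetical:
port_id, queue_id and the consume() helper are placeholders, not part of
the patch):

	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST];
	uint16_t i, nb_rx;

	nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, ICE_VPMD_RX_BURST);
	for (i = 0; i < nb_rx; i++) {
		struct rte_mbuf *seg;

		/* a reassembled packet may span several segments */
		for (seg = pkts[i]; seg != NULL; seg = seg->next)
			consume(rte_pktmbuf_mtod(seg, void *), seg->data_len);

		rte_pktmbuf_free(pkts[i]); /* frees the whole chain */
	}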

* [PATCH 8/8] net/ice: support TX AVX2 vector
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (6 preceding siblings ...)
  2019-02-28  7:48 ` [PATCH 7/8] net/ice: support RX scatter " Wenzhuo Lu
@ 2019-02-28  7:48 ` Wenzhuo Lu
  2019-03-01  3:41 ` [PATCH 0/8] Support vector instructions on ICE Zhang, Qi Z
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-02-28  7:48 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  14 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 153 +++++++++++++++++++++++++++++++++
 4 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index c0390ca..2da89c0 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -71,6 +71,10 @@ New Features
 
    * Added firmware version reading.
 
+* **Added support of vector instructions on ICE.**
+
+   Added support of SSE and AVX2 instructions in ICE RX and TX path.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 465d389..8a09ea9 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2350,15 +2350,25 @@ void __attribute__((cold))
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+#endif
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index a918646..c5ac02d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -187,5 +187,7 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 78dda2f..c7c117b 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -609,3 +609,156 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on 32-bit boundary, do one to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 = _mm256_set_epi64x(
+				hi_qw3, pkt[3]->buf_physaddr + pkt[3]->data_off,
+				hi_qw2, pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 = _mm256_set_epi64x(
+				hi_qw1, pkt[1]->buf_physaddr + pkt[1]->data_off,
+				hi_qw0, pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			  uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* cross tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reach the end of ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		   uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
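
One usage note on the Tx path added here: ice_xmit_pkts_vec_avx2() works in
tx_rs_thresh-sized chunks and stops early when the ring runs out of free
descriptors, so it can legitimately return fewer packets than requested. A
hypothetical caller-side sketch of the usual handling (port_id, queue_id,
pkts/nb_pkts and the drop policy are illustrative only):

	uint16_t sent = 0;

	while (sent < nb_pkts) {
		uint16_t n = rte_eth_tx_burst(port_id, queue_id,
					      &pkts[sent], nb_pkts - sent);
		if (n == 0)
			break; /* ring full, give up for now */
		sent += n;
	}

	/* drop whatever could not be queued */
	for (; sent < nb_pkts; sent++)
		rte_pktmbuf_free(pkts[sent]);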

* Re: [PATCH 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (7 preceding siblings ...)
  2019-02-28  7:48 ` [PATCH 8/8] net/ice: support TX " Wenzhuo Lu
@ 2019-03-01  3:41 ` Zhang, Qi Z
  2019-03-04  1:24   ` Lu, Wenzhuo
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 121+ messages in thread
From: Zhang, Qi Z @ 2019-03-01  3:41 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev; +Cc: Lu, Wenzhuo

Hi Wenzhuo:

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Thursday, February 28, 2019 3:49 PM
> To: dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: [dpdk-dev] [PATCH 0/8] Support vector instructions on ICE
> 
> Use SSE and AVX2 instructions in ICE RX and TX path.
> 
> Wenzhuo Lu (8):
>   net/ice: fix TX function setting
>   net/ice: add pointer for queue buffer release
>   net/ice: support RX SSE vector
>   net/ice: support RX scatter SSE vector
>   net/ice: support TX SSE vector
>   net/ice: support RX AVX2 vector
>   net/ice: support RX scatter AVX2 vector
>   net/ice: support TX AVX2 vector

Should be "Rx" and "Tx" in the title to follow the headline uppercase rule.
The check-git-log.sh script reports this warning.

Regards
Qi

> 
>  config/common_base                     |   1 +
>  doc/guides/nics/features/ice_vec.ini   |  40 ++
>  doc/guides/rel_notes/release_19_05.rst |   4 +
>  drivers/net/ice/Makefile               |  22 +
>  drivers/net/ice/ice_ethdev.c           |   3 +-
>  drivers/net/ice/ice_ethdev.h           |   2 +
>  drivers/net/ice/ice_rxtx.c             | 101 ++++-
>  drivers/net/ice/ice_rxtx.h             |  39 ++
>  drivers/net/ice/ice_rxtx_vec_avx2.c    | 764
> +++++++++++++++++++++++++++++++++
>  drivers/net/ice/ice_rxtx_vec_common.h  | 288 +++++++++++++
>  drivers/net/ice/ice_rxtx_vec_sse.c     | 663
> ++++++++++++++++++++++++++++
>  drivers/net/ice/meson.build            |  21 +
>  12 files changed, 1935 insertions(+), 13 deletions(-)  create mode 100644
> doc/guides/nics/features/ice_vec.ini
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> 
> --
> 1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 3/8] net/ice: support RX SSE vector
  2019-02-28  7:48 ` [PATCH 3/8] net/ice: support RX SSE vector Wenzhuo Lu
@ 2019-03-01  3:44   ` Zhang, Qi Z
  2019-03-04  1:27     ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Zhang, Qi Z @ 2019-03-01  3:44 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev; +Cc: Lu, Wenzhuo

Hi

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Thursday, February 28, 2019 3:49 PM
> To: dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: [dpdk-dev] [PATCH 3/8] net/ice: support RX SSE vector
> 
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>  config/common_base                    |   1 +
>  doc/guides/nics/features/ice_vec.ini  |  38 +++
>  drivers/net/ice/Makefile              |   3 +
>  drivers/net/ice/ice_ethdev.c          |   2 -
>  drivers/net/ice/ice_ethdev.h          |   2 +
>  drivers/net/ice/ice_rxtx.c            |  27 +-
>  drivers/net/ice/ice_rxtx.h            |  21 ++
>  drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
>  drivers/net/ice/ice_rxtx_vec_sse.c    | 487
> ++++++++++++++++++++++++++++++++++
>  drivers/net/ice/meson.build           |   6 +
>  10 files changed, 738 insertions(+), 4 deletions(-)  create mode 100644
> doc/guides/nics/features/ice_vec.ini
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> 
> diff --git a/config/common_base b/config/common_base index
> 7c6da51..1d5ae2e 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
> CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
>  CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
>  CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
> +CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
> 
>  # Compile burst-oriented AVF PMD driver  # diff --git
> a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
> new file mode 100644
> index 0000000..1838f99
> --- /dev/null
> +++ b/doc/guides/nics/features/ice_vec.ini
> @@ -0,0 +1,38 @@
> +;
> +; Supported features of the 'ice_vec' network poll mode driver.
> +;
> +; Refer to default.ini for the full list of available PMD features.
> +;
> +[Features]
> +Speed capabilities   = Y
> +Link status          = Y
> +Link status event    = Y
> +Rx interrupt         = Y
> +Queue start/stop     = Y
> +MTU update           = Y
> +Jumbo frame          = Y
> +Scattered Rx         = Y
> +Promiscuous mode     = Y
> +Allmulticast mode    = Y
> +Unicast MAC filter   = Y
> +Multicast MAC filter = Y
> +RSS hash             = Y
> +RSS key update       = Y
> +RSS reta update      = Y
> +VLAN filter          = Y
> +CRC offload          = Y
> +VLAN offload         = Y
> +QinQ offload         = Y
> +L3 checksum offload  = Y
> +L4 checksum offload  = Y

I think the QinQ and checksum offloads are not supported by the vPMD, same as FVL, right?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 0/8] Support vector instructions on ICE
  2019-03-01  3:41 ` [PATCH 0/8] Support vector instructions on ICE Zhang, Qi Z
@ 2019-03-04  1:24   ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-04  1:24 UTC (permalink / raw)
  To: Zhang, Qi Z, dev

Hi Qi,

> -----Original Message-----
> From: Zhang, Qi Z
> Sent: Friday, March 1, 2019 11:41 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 0/8] Support vector instructions on ICE
> 
> HI Wenzhuo:
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> > Sent: Thursday, February 28, 2019 3:49 PM
> > To: dev@dpdk.org
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> > Subject: [dpdk-dev] [PATCH 0/8] Support vector instructions on ICE
> >
> > Use SSE and AVX2 instructions in ICE RX and TX path.
> >
> > Wenzhuo Lu (8):
> >   net/ice: fix TX function setting
> >   net/ice: add pointer for queue buffer release
> >   net/ice: support RX SSE vector
> >   net/ice: support RX scatter SSE vector
> >   net/ice: support TX SSE vector
> >   net/ice: support RX AVX2 vector
> >   net/ice: support RX scatter AVX2 vector
> >   net/ice: support TX AVX2 vector
> 
> Should be "Rx" and "Tx" in the title to follow the headline uppercase rule.
> The check-git-log.sh script reports this warning.
Will change them. Thanks.

> 
> Regards
> Qi
> 
> >
> >  config/common_base                     |   1 +
> >  doc/guides/nics/features/ice_vec.ini   |  40 ++
> >  doc/guides/rel_notes/release_19_05.rst |   4 +
> >  drivers/net/ice/Makefile               |  22 +
> >  drivers/net/ice/ice_ethdev.c           |   3 +-
> >  drivers/net/ice/ice_ethdev.h           |   2 +
> >  drivers/net/ice/ice_rxtx.c             | 101 ++++-
> >  drivers/net/ice/ice_rxtx.h             |  39 ++
> >  drivers/net/ice/ice_rxtx_vec_avx2.c    | 764
> > +++++++++++++++++++++++++++++++++
> >  drivers/net/ice/ice_rxtx_vec_common.h  | 288 +++++++++++++
> >  drivers/net/ice/ice_rxtx_vec_sse.c     | 663
> > ++++++++++++++++++++++++++++
> >  drivers/net/ice/meson.build            |  21 +
> >  12 files changed, 1935 insertions(+), 13 deletions(-)  create mode
> > 100644 doc/guides/nics/features/ice_vec.ini
> >  create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
> >  create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
> >  create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> >
> > --
> > 1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 3/8] net/ice: support RX SSE vector
  2019-03-01  3:44   ` Zhang, Qi Z
@ 2019-03-04  1:27     ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-04  1:27 UTC (permalink / raw)
  To: Zhang, Qi Z, dev

Hi Qi,

> -----Original Message-----
> From: Zhang, Qi Z
> Sent: Friday, March 1, 2019 11:44 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 3/8] net/ice: support RX SSE vector
> 
> Hi
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> > Sent: Thursday, February 28, 2019 3:49 PM
> > To: dev@dpdk.org
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> > Subject: [dpdk-dev] [PATCH 3/8] net/ice: support RX SSE vector
> >
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> > ---
> >  config/common_base                    |   1 +
> >  doc/guides/nics/features/ice_vec.ini  |  38 +++
> >  drivers/net/ice/Makefile              |   3 +
> >  drivers/net/ice/ice_ethdev.c          |   2 -
> >  drivers/net/ice/ice_ethdev.h          |   2 +
> >  drivers/net/ice/ice_rxtx.c            |  27 +-
> >  drivers/net/ice/ice_rxtx.h            |  21 ++
> >  drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
> >  drivers/net/ice/ice_rxtx_vec_sse.c    | 487
> > ++++++++++++++++++++++++++++++++++
> >  drivers/net/ice/meson.build           |   6 +
> >  10 files changed, 738 insertions(+), 4 deletions(-)  create mode
> > 100644 doc/guides/nics/features/ice_vec.ini
> >  create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
> >  create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> >
> > diff --git a/config/common_base b/config/common_base index
> > 7c6da51..1d5ae2e 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
> > CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
> >  CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
> >  CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
> > +CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
> >
> >  # Compile burst-oriented AVF PMD driver  # diff --git
> > a/doc/guides/nics/features/ice_vec.ini
> > b/doc/guides/nics/features/ice_vec.ini
> > new file mode 100644
> > index 0000000..1838f99
> > --- /dev/null
> > +++ b/doc/guides/nics/features/ice_vec.ini
> > @@ -0,0 +1,38 @@
> > +;
> > +; Supported features of the 'ice_vec' network poll mode driver.
> > +;
> > +; Refer to default.ini for the full list of available PMD features.
> > +;
> > +[Features]
> > +Speed capabilities   = Y
> > +Link status          = Y
> > +Link status event    = Y
> > +Rx interrupt         = Y
> > +Queue start/stop     = Y
> > +MTU update           = Y
> > +Jumbo frame          = Y
> > +Scattered Rx         = Y
> > +Promiscuous mode     = Y
> > +Allmulticast mode    = Y
> > +Unicast MAC filter   = Y
> > +Multicast MAC filter = Y
> > +RSS hash             = Y
> > +RSS key update       = Y
> > +RSS reta update      = Y
> > +VLAN filter          = Y
> > +CRC offload          = Y
> > +VLAN offload         = Y
> > +QinQ offload         = Y
> > +L3 checksum offload  = Y
> > +L4 checksum offload  = Y
> 
> I think the QinQ and checksum offloads are not supported by the vPMD,
> same as FVL, right?
Oh, they're not fully supported. Only Rx checksum and the QinQ filter are supported. I'll remove those entries.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v2 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (8 preceding siblings ...)
  2019-03-01  3:41 ` [PATCH 0/8] Support vector instructions on ICE Zhang, Qi Z
@ 2019-03-04  6:53 ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 1/8] net/ice: fix Tx function setting Wenzhuo Lu
                     ` (7 more replies)
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (4 subsequent siblings)
  14 siblings, 8 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in ICE RX and TX path.

---
v2:
 - Updated feature doc.
 - Fixed check-git-log and checkpatch issues.

Wenzhuo Lu (8):
  net/ice: fix Tx function setting
  net/ice: add pointer for queue buffer release
  net/ice: support vector SSE in RX
  net/ice: support Rx scatter SSE vector
  net/ice: support Tx SSE vector
  net/ice: support Rx AVX2 vector
  net/ice: support Rx scatter AVX2 vector
  net/ice: support vector AVX2 in TX

 config/common_base                     |   1 +
 doc/guides/nics/features/ice_vec.ini   |  35 ++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             | 101 +++-
 drivers/net/ice/ice_rxtx.h             |  39 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 835 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 288 ++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 663 ++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  21 +
 12 files changed, 2001 insertions(+), 13 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v2 1/8] net/ice: fix Tx function setting
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

The Tx setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index a23c63a..b804be1 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 2/8] net/ice: add pointer for queue buffer release
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of Rx and
Tx queues, as vector versions of these functions will be
added for Rx and Tx.

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..d540ed1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -366,7 +366,7 @@
 		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
 			    rx_queue_id);
 
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		return -EINVAL;
 	}
@@ -393,7 +393,7 @@
 				    rx_queue_id);
 			return -EINVAL;
 		}
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		dev->data->rx_queue_state[rx_queue_id] =
 			RTE_ETH_QUEUE_STATE_STOPPED;
@@ -555,7 +555,7 @@
 		return -EINVAL;
 	}
 
-	ice_tx_queue_release_mbufs(txq);
+	txq->tx_rel_mbufs(txq);
 	ice_reset_tx_queue(txq);
 	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
 
@@ -669,6 +669,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -701,7 +702,7 @@
 		return;
 	}
 
-	ice_rx_queue_release_mbufs(q);
+	q->rx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -866,6 +867,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
 
 	return 0;
 }
@@ -880,7 +882,7 @@
 		return;
 	}
 
-	ice_tx_queue_release_mbufs(q);
+	q->tx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -1552,18 +1554,22 @@
 void
 ice_clear_queues(struct rte_eth_dev *dev)
 {
+	struct ice_rx_queue *rxq;
+	struct ice_tx_queue *txq;
 	uint16_t i;
 
 	PMD_INIT_FUNC_TRACE();
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
-		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
-		ice_reset_tx_queue(dev->data->tx_queues[i]);
+		txq = dev->data->tx_queues[i];
+		txq->tx_rel_mbufs(txq);
+		ice_reset_tx_queue(txq);
 	}
 
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
-		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
-		ice_reset_rx_queue(dev->data->rx_queues[i]);
+		rxq = dev->data->rx_queues[i];
+		rxq->rx_rel_mbufs(rxq);
+		ice_reset_rx_queue(rxq);
 	}
 }
 
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..26380d3 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
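
To make the intent of the new rx_rel_mbufs/tx_rel_mbufs hooks concrete: the
vector patches that follow register their own release routines through them,
so the generic teardown code stays unaware of which Rx/Tx path is active. A
rough sketch of such a hookup, under the structures defined by this patch
(the release body below is a simplification and the function names are
illustrative; the real vector helpers appear later in the series):

static void
ice_rx_queue_release_mbufs_vec_sketch(struct ice_rx_queue *rxq)
{
	uint16_t i;

	if (!rxq || !rxq->sw_ring)
		return;

	/* free every mbuf still referenced by the SW ring; the real vector
	 * helper additionally skips the not-yet-rearmed window
	 */
	for (i = 0; i < rxq->nb_rx_desc; i++) {
		if (rxq->sw_ring[i].mbuf) {
			rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
			rxq->sw_ring[i].mbuf = NULL;
		}
	}
}

static int
ice_rxq_vec_setup_sketch(struct ice_rx_queue *rxq)
{
	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec_sketch;
	return 0;
}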

* [PATCH v2 3/8] net/ice: support vector SSE in RX
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 1/8] net/ice: fix Tx function setting Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-11  3:26     ` Zhang, Qi Z
  2019-03-04  6:53   ` [PATCH v2 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 config/common_base                    |   1 +
 doc/guides/nics/features/ice_vec.ini  |  33 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  27 +-
 drivers/net/ice/ice_rxtx.h            |  21 ++
 drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 487 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   6 +
 10 files changed, 733 insertions(+), 4 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/config/common_base b/config/common_base
index 0b09a93..1104fd0 100644
--- a/config/common_base
+++ b/config/common_base
@@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
 CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
+CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
 
 # Compile burst-oriented IAVF PMD driver
 #
diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1a19788
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,33 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..33c7fc2 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index b804be1..8e7c7db 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d540ed1..543fefa 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -319,6 +317,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1490,6 +1491,12 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+
 	return NULL;
 }
 
@@ -2225,6 +2232,22 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 26380d3..2659176 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
 
@@ -52,6 +61,11 @@ struct ice_rx_queue {
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
 #endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +170,11 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
+#endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..73837f7
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end = rxq->pkt_last_seg;
+	unsigned pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned mask = rxq->nb_rx_desc - 1;
+	unsigned i;
+
+	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..d444be9
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,487 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+		  struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(
+			0x1c03804, 0x1c03804, 0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are received
+ * - if nb_pkts > ICE_VPMD_RX_BURST, only the first ICE_VPMD_RX_BURST
+ *   descriptors' DD bits are scanned
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16(
+				0, 0, 0,    /* ignore non-length fields */
+				-rxq->crc_len, /* sub crc on data_len */
+				0,          /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				0, 0            /* ignore pkt_type field */
+			);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts shall be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8(
+		7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+		3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+		15, 14,      /* octet 15~14, 16 bits data_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		15, 14,      /* octet 15~14, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* pkt_type set as unknown */
+		0xFF, 0xFF  /* pkt_type set as unknown */
+		);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf point into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf points */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* A.1 load the remaining descriptors */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned */
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned */
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0x04, 0x0C, 0x00, 0x08
+					);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are received
+ * - if nb_pkts > ICE_VPMD_RX_BURST, only the first ICE_VPMD_RX_BURST
+ *   descriptors' DD bits are scanned
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static void __attribute__((cold))
+ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	_ice_rx_queue_release_mbufs_vec(rxq);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..73122f8 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,9 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
+	dpdk_conf.set('RTE_LIBRTE_ICE_INC_VECTOR', 1)
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
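
The Rx hot loop in this patch relies on the hardware completing descriptors in
order: the four status words are masked with dd_check, packed down to 16-bit
lanes and counted with a single popcount, and the loop stops as soon as fewer
than ICE_DESCS_PER_LOOP descriptors are done. Below is a minimal standalone
sketch of just that counting step; it is not driver code, the status values
are made up, and it assumes an x86-64 compiler with SSE2.

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* 4 packets DD mask: bit 0 of each 32-bit status word */
	const __m128i dd_check = _mm_set_epi64x(0x0000000100000001LL,
						0x0000000100000001LL);
	/* made-up status words for descriptors 3..0: DD set on 0, 1 and 2 */
	__m128i staterr = _mm_set_epi32(0, 1, 1, 1);
	__m128i zero = _mm_setzero_si128();

	staterr = _mm_and_si128(staterr, dd_check);
	/* pack the 32-bit lanes to 16-bit so all four DD bits land in the
	 * low 64 bits of the register
	 */
	staterr = _mm_packs_epi32(staterr, zero);

	uint64_t done = __builtin_popcountll(_mm_cvtsi128_si64(staterr));

	printf("%u descriptors done\n", (unsigned int)done); /* prints 3 */
	return 0;
}

Because the DD bits are set in order, the popcount equals the number of
packets that can be handed to the application from this group of four.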

* [PATCH v2 4/8] net/ice: support Rx scatter SSE vector
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
                     ` (2 preceding siblings ...)
  2019-03-04  6:53   ` [PATCH v2 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 5/8] net/ice: support Tx " Wenzhuo Lu
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 543fefa..55c8131 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1493,7 +1493,8 @@
 		return ptypes;
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 
@@ -2241,9 +2242,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2659176..aab4a3a 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -176,5 +176,7 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index d444be9..789cf07 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -464,6 +464,47 @@
 	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are received
+ * - if nb_pkts > ICE_VPMD_RX_BURST, only the first ICE_VPMD_RX_BURST
+ *   descriptors' DD bits are scanned
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+					      split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+				      &split_flags[i]);
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
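
The scattered Rx wrapper above first takes a cheap exit: _recv_raw_pkts_vec()
fills a 32-entry split_flags array (one byte per received packet), and reading
that array as four 64-bit words detects the common case of a burst with no
split packets in four compares. A small standalone sketch of that check
follows; burst_has_split() is not a driver function and the flag values are
illustrative only.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BURST 32

static int burst_has_split(const uint8_t *split_flags)
{
	uint64_t split_fl64[4];

	/* view the 32 one-byte flags as four 64-bit words */
	memcpy(split_fl64, split_flags, sizeof(split_fl64));
	return split_fl64[0] || split_fl64[1] ||
	       split_fl64[2] || split_fl64[3];
}

int main(void)
{
	uint8_t split_flags[BURST] = {0};

	printf("split? %d\n", burst_has_split(split_flags)); /* 0 */
	split_flags[17] = 1;	/* packet 17 continues in the next mbuf */
	printf("split? %d\n", burst_has_split(split_flags)); /* 1 */
	return 0;
}

The driver casts the array in place instead of copying; the memcpy here only
keeps the standalone sketch strict-aliasing clean.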

* [PATCH v2 5/8] net/ice: support Tx SSE vector
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
                     ` (3 preceding siblings ...)
  2019-03-04  6:53   ` [PATCH v2 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  17 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
 5 files changed, 291 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1a19788..173c8f2 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -22,6 +23,7 @@ RSS reta update      = Y
 VLAN filter          = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 55c8131..b6c9618 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2332,6 +2332,23 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index aab4a3a..02bb57e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -173,10 +173,14 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index 73837f7..8796ecb 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct ice_tx_entry *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -101,6 +168,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (!txq || !txq->sw_ring) {
+		PMD_DRV_LOG(DEBUG, "Pointer to txq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so the remaining mbufs need to be freed more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -137,6 +232,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -152,4 +270,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 789cf07..6babb8d 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -505,12 +505,131 @@
 				      &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing the tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
 	_ice_rx_queue_release_mbufs_vec(rxq);
 }
 
+static void __attribute__((cold))
+ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	_ice_tx_queue_release_mbufs_vec(txq);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -522,7 +641,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue __rte_unused *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
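
ice_xmit_pkts_vec() above never hands more than tx_rs_thresh packets to the
fixed-burst routine at once, which keeps the RS-bit and free-threshold
bookkeeping in ice_xmit_fixed_burst_vec() simple; when a chunk is only
partially accepted the ring is full, the loop stops and the number of packets
actually queued is returned. Below is a standalone sketch of that chunking
loop, with a made-up ring_space counter and an xmit_fixed_burst() stand-in
instead of the real descriptor ring.

#include <stdint.h>
#include <stdio.h>

#define TX_RS_THRESH 32

static uint16_t ring_space = 50;	/* pretend free descriptors */

/* stand-in for ice_xmit_fixed_burst_vec(): accepts at most 'num' packets */
static uint16_t xmit_fixed_burst(uint16_t num)
{
	uint16_t sent = num < ring_space ? num : ring_space;

	ring_space -= sent;
	return sent;
}

static uint16_t xmit_pkts(uint16_t nb_pkts)
{
	uint16_t nb_tx = 0;

	while (nb_pkts) {
		uint16_t num = nb_pkts < TX_RS_THRESH ? nb_pkts : TX_RS_THRESH;
		uint16_t ret = xmit_fixed_burst(num);

		nb_tx += ret;
		nb_pkts -= ret;
		if (ret < num)	/* ring full: the rest goes back to the app */
			break;
	}
	return nb_tx;
}

int main(void)
{
	printf("sent %u of 128\n", (unsigned int)xmit_pkts(128)); /* 50 */
	return 0;
}

In the real driver the stand-in also frees completed descriptors once
nb_tx_free drops below tx_free_thresh before accepting new packets.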

* [PATCH v2 6/8] net/ice: support Rx AVX2 vector
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
                     ` (4 preceding siblings ...)
  2019-03-04  6:53   ` [PATCH v2 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 7/8] net/ice: support Rx scatter " Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  17 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 613 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 663 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 33c7fc2..e1cb632 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index b6c9618..342e8f1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1494,7 +1494,8 @@
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2236,21 +2237,31 @@ void __attribute__((cold))
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+#endif
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 02bb57e..63c552c 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -182,5 +182,7 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..2b9dad7
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,613 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/**
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 = _mm256_inserti128_si256(
+				_mm256_castsi128_si256(vaddr0), vaddr1, 1);
+		vaddr2_3 = _mm256_inserti128_si256(
+				_mm256_castsi128_si256(vaddr2), vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+			uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust = _mm256_set_epi16(
+			/* first descriptor */
+			0, 0, 0,       /* ignore non-length fields */
+			-rxq->crc_len, /* sub crc on data_len */
+			0,             /* ignore high-16bits of pkt_len */
+			-rxq->crc_len, /* sub crc on pkt_len */
+			0, 0,          /* ignore pkt_type field */
+			/* second descriptor */
+			0, 0, 0,       /* ignore non-length fields */
+			-rxq->crc_len, /* sub crc on data_len */
+			0,             /* ignore high-16bits of pkt_len */
+			-rxq->crc_len, /* sub crc on pkt_len */
+			0, 0           /* ignore pkt_type field */
+	);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk = _mm256_set_epi8(
+			/* first descriptor */
+			7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			15, 14,      /* octet 15~14, 16 bits data_len */
+			0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			/* second descriptor */
+			7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			15, 14,      /* octet 15~14, 16 bits data_len */
+			0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			0xFF, 0xFF   /* pkt_type set as unknown */
+	);
+	/**
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/**
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask = _mm256_set1_epi32(
+			(1 << 2) | (1 << 11) | (3 << 12) | (7 << 22));
+	/**
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf = _mm256_set_epi32(
+			0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+			0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/**
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf = _mm256_set_epi8(
+			0, 0, 0, 0, 0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0, /* end up 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	/**
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits. This gives us the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask = _mm256_set1_epi32(
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+
+	for (i = 0, received = 0; i < nb_pkts;
+	     i += ICE_DESCS_PER_LOOP_AVX,
+	     rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				    _mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256
+			((void *)&rx_pkts[i + 4],
+			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		do {
+			const __m128i raw_desc7 =
+				_mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 =
+				_mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 =
+				_mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 =
+				_mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 =
+				_mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 =
+				_mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 =
+				_mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 =
+				_mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc6),
+					 raw_desc7, 1);
+			raw_desc4_5 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc4),
+					 raw_desc5, 1);
+			raw_desc2_3 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc2),
+					 raw_desc3, 1);
+			raw_desc0_1 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc0),
+					 raw_desc1, 1);
+		} while (0);
+
+		if (split_packet) {
+			int j;
+
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/**
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7,
+							 PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5,
+							 PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7,
+							   len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5,
+							   len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/**
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * and so ptype is in lower 8-bits in each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/**
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3,
+							 PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1,
+							 PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3,
+							   len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1,
+							   len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+								desc0_1);
+
+		/**
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+							  status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits =
+			_mm256_and_si256(status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags =
+			_mm256_shuffle_epi8(vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags =
+			_mm256_shuffle_epi8(rss_flags_shuf,
+					    _mm256_srli_epi32(flag_bits, 11));
+		/**
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/**
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init
+		 * data so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend
+		 * for each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				 offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				 RTE_ALIGN(offsetof(struct rte_mbuf,
+						    rearm_data),
+					   16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+			rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 8),
+					    0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 4),
+					    0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(mbuf_flags, 4),
+					    0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
+				    rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
+				    rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
+				    rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
+				    rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags = _mm256_castsi128_si256(
+				_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 8),
+					    0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 4),
+					    0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(odd_flags, 4),
+					    0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
+				    rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
+				    rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
+				    rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
+				    rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask = _mm_set1_epi16(
+					1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+					eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits =
+				_mm_packus_epi32
+					(_mm256_castsi256_si128(eop_bits256),
+					 _mm256_extractf128_si256(eop_bits256,
+								  1));
+			/**
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit (i.e. !EOP) rather than an EOP bit.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/**
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle = _mm_set_epi8(
+					/* zero hi 64b */
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					/* move values to lo 64b */
+					8, 0, 10, 2,
+					12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet =
+				_mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+					       _mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll(_mm_cvtsi128_si64(
+				_mm256_extracti128_si256(status0_7, 1)));
+		burst += __builtin_popcountll(_mm_cvtsi128_si64(
+				_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 73122f8..a1bd5b1 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -16,4 +16,19 @@ if arch_subdir == 'x86'
 	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
 	dpdk_conf.set('RTE_LIBRTE_ICE_INC_VECTOR', 1)
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. we have AVX supported in minimum instruction set baseline
+	# b. it's not minimum instruction set, but supported by compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 7/8] net/ice: support Rx scatter AVX2 vector
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
                     ` (5 preceding siblings ...)
  2019-03-04  6:53   ` [PATCH v2 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  2019-03-04  6:53   ` [PATCH v2 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 342e8f1..465d389 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1495,7 +1495,8 @@
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2252,9 +2253,12 @@ void __attribute__((cold))
 #endif
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 63c552c..a918646 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -184,5 +184,8 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 2b9dad7..a5f9b85 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -611,3 +611,67 @@
 {
 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles a single burst of 32 scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 8/8] net/ice: support vector AVX2 in TX
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
                     ` (6 preceding siblings ...)
  2019-03-04  6:53   ` [PATCH v2 7/8] net/ice: support Rx scatter " Wenzhuo Lu
@ 2019-03-04  6:53   ` Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-04  6:53 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  14 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
 4 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index 4a3e2a7..ef6f4c8 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -77,6 +77,10 @@ New Features
   which includes the directory name, lib name, filenames, makefile, docs,
   macros, functions, structs and any other strings in the code.
 
+* **Added support for vector instructions on ICE.**
+
+   Added support for SSE and AVX2 instructions in the ICE RX and TX paths.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 465d389..8a09ea9 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2350,15 +2350,25 @@ void __attribute__((cold))
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+#endif
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index a918646..c5ac02d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -187,5 +187,7 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index a5f9b85..985125d 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -675,3 +675,161 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on a 32-byte boundary, do one packet to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 =
+			_mm256_set_epi64x
+				(hi_qw3,
+				 pkt[3]->buf_physaddr + pkt[3]->data_off,
+				 hi_qw2,
+				 pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 =
+			_mm256_set_epi64x
+				(hi_qw1,
+				 pkt[1]->buf_physaddr + pkt[1]->data_off,
+				 hi_qw0,
+				 pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* crossing the tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		       uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 3/8] net/ice: support vector SSE in RX
  2019-03-04  6:53   ` [PATCH v2 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-11  3:26     ` Zhang, Qi Z
  2019-03-15  1:50       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Zhang, Qi Z @ 2019-03-11  3:26 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev; +Cc: Lu, Wenzhuo

Hi:

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Monday, March 4, 2019 2:53 PM
> To: dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: [dpdk-dev] [PATCH v2 3/8] net/ice: support vector SSE in RX
> 
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---

.....

> +
> +	if (!ice_rx_vec_dev_check(dev)) {
> +		for (i = 0; i < dev->data->nb_rx_queues; i++) {
> +			rxq = dev->data->rx_queues[i];
> +			(void)ice_rxq_vec_setup(rxq);
> +		}
> +		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
> +			    dev->data->port_id);
> +		dev->rx_pkt_burst = ice_recv_pkts_vec;
> +
> +		return;
> +	}
> +#endif
> 

Since the vPMD is only implemented on x86, I think the logic that sets up the vector path could be wrapped by the compile option #ifdef ARCH_X86;
otherwise I guess there will be compile errors on other platforms, because the function ice_rx_vec_dev_check, for example, is only defined in ice_rxtx_vec_sse.c.
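
Something along these lines is what I mean; this is just a sketch that reuses the
names already introduced in this patch (ice_rx_vec_dev_check, ice_rxq_vec_setup,
ice_recv_pkts_vec), not a final implementation:

/* fragment of ice_set_rx_function(), compiled only for x86 builds */
#ifdef RTE_ARCH_X86
	uint16_t i;

	if (!ice_rx_vec_dev_check(dev)) {
		/* per-queue vector setup, then switch the burst function */
		for (i = 0; i < dev->data->nb_rx_queues; i++)
			(void)ice_rxq_vec_setup(dev->data->rx_queues[i]);
		dev->rx_pkt_burst = ice_recv_pkts_vec;
		return;
	}
#endif
	/* fall through to the existing scalar Rx selection otherwise */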

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 3/8] net/ice: support vector SSE in RX
  2019-03-11  3:26     ` Zhang, Qi Z
@ 2019-03-15  1:50       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-15  1:50 UTC (permalink / raw)
  To: Zhang, Qi Z, dev

Hi Qi,

> -----Original Message-----
> From: Zhang, Qi Z
> Sent: Monday, March 11, 2019 11:27 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: RE: [dpdk-dev] [PATCH v2 3/8] net/ice: support vector SSE in RX
> 
> Hi:
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> > Sent: Monday, March 4, 2019 2:53 PM
> > To: dev@dpdk.org
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> > Subject: [dpdk-dev] [PATCH v2 3/8] net/ice: support vector SSE in RX
> >
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> > ---
> 
> .....
> 
> > +
> > +	if (!ice_rx_vec_dev_check(dev)) {
> > +		for (i = 0; i < dev->data->nb_rx_queues; i++) {
> > +			rxq = dev->data->rx_queues[i];
> > +			(void)ice_rxq_vec_setup(rxq);
> > +		}
> > +		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
> > +			    dev->data->port_id);
> > +		dev->rx_pkt_burst = ice_recv_pkts_vec;
> > +
> > +		return;
> > +	}
> > +#endif
> >
> 
> Since vPMD is only implemented on x86, I think the logic to setup vector path
> could be wrapped by compile option #ifdef ARCH_X86, otherwise I guess
> there will be some compile error on other platform, for example the
> function ice_rx_vec_dev_check is only defined in ice_rxtx_vec_sse.c
Thanks for the comments. There should indeed be compile errors if x86 is not supported. I will send a v3.
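
For reference, the idea is to keep the vector prototypes and the setup logic behind
the build guards, roughly as sketched below (the actual hunks are in the v3 patches
later in this thread):

#ifdef RTE_LIBRTE_ICE_INC_VECTOR
int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
			   uint16_t nb_pkts);
#endif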

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v3 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (9 preceding siblings ...)
  2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
@ 2019-03-15  6:22 ` Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 1/8] net/ice: fix Tx function setting Wenzhuo Lu
                     ` (8 more replies)
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
                   ` (3 subsequent siblings)
  14 siblings, 9 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in ICE RX and TX path.

---
v2:
 - Updated feature doc.
 - Fixed checklog and checkpatch issues.

v3:
 - Fixed potential compile issue on non-X86 platform.

Wenzhuo Lu (8):
  net/ice: fix Tx function setting
  net/ice: add pointer for queue buffer release
  net/ice: support vector SSE in RX
  net/ice: support Rx scatter SSE vector
  net/ice: support Tx SSE vector
  net/ice: support Rx AVX2 vector
  net/ice: support Rx scatter AVX2 vector
  net/ice: support vector AVX2 in TX

 config/common_base                     |   1 +
 doc/guides/nics/features/ice_vec.ini   |  35 ++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             | 105 ++++-
 drivers/net/ice/ice_rxtx.h             |  39 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 835 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 288 ++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 663 ++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  21 +
 12 files changed, 2005 insertions(+), 13 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v3 1/8] net/ice: fix Tx function setting
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15 17:52     ` Ferruh Yigit
  2019-03-15  6:22   ` [PATCH v3 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

The TX setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index a23c63a..b804be1 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v3 2/8] net/ice: add pointer for queue buffer release
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15 17:52     ` Ferruh Yigit
  2019-03-15  6:22   ` [PATCH v3 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of RX and
TX queues, because vector release functions will be added
for RX and TX.

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..d540ed1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -366,7 +366,7 @@
 		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
 			    rx_queue_id);
 
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		return -EINVAL;
 	}
@@ -393,7 +393,7 @@
 				    rx_queue_id);
 			return -EINVAL;
 		}
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		dev->data->rx_queue_state[rx_queue_id] =
 			RTE_ETH_QUEUE_STATE_STOPPED;
@@ -555,7 +555,7 @@
 		return -EINVAL;
 	}
 
-	ice_tx_queue_release_mbufs(txq);
+	txq->tx_rel_mbufs(txq);
 	ice_reset_tx_queue(txq);
 	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
 
@@ -669,6 +669,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -701,7 +702,7 @@
 		return;
 	}
 
-	ice_rx_queue_release_mbufs(q);
+	q->rx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -866,6 +867,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
 
 	return 0;
 }
@@ -880,7 +882,7 @@
 		return;
 	}
 
-	ice_tx_queue_release_mbufs(q);
+	q->tx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -1552,18 +1554,22 @@
 void
 ice_clear_queues(struct rte_eth_dev *dev)
 {
+	struct ice_rx_queue *rxq;
+	struct ice_tx_queue *txq;
 	uint16_t i;
 
 	PMD_INIT_FUNC_TRACE();
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
-		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
-		ice_reset_tx_queue(dev->data->tx_queues[i]);
+		txq = dev->data->tx_queues[i];
+		txq->tx_rel_mbufs(txq);
+		ice_reset_tx_queue(txq);
 	}
 
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
-		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
-		ice_reset_rx_queue(dev->data->rx_queues[i]);
+		rxq = dev->data->rx_queues[i];
+		rxq->rx_rel_mbufs(rxq);
+		ice_reset_rx_queue(rxq);
 	}
 }
 
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..26380d3 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v3 3/8] net/ice: support vector SSE in RX
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 1/8] net/ice: fix Tx function setting Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15 17:53     ` Ferruh Yigit
  2019-03-15  6:22   ` [PATCH v3 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 config/common_base                    |   1 +
 doc/guides/nics/features/ice_vec.ini  |  33 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  31 ++-
 drivers/net/ice/ice_rxtx.h            |  21 ++
 drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 487 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   6 +
 10 files changed, 737 insertions(+), 4 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/config/common_base b/config/common_base
index 0b09a93..1104fd0 100644
--- a/config/common_base
+++ b/config/common_base
@@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
 CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
+CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
 
 # Compile burst-oriented IAVF PMD driver
 #
diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1a19788
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,33 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..33c7fc2 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index b804be1..8e7c7db 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d540ed1..8694872 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -319,6 +317,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1490,6 +1491,14 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+#ifdef RTE_ARCH_X86
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+#endif
+
 	return NULL;
 }
 
@@ -2225,6 +2234,24 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+#ifdef RTE_ARCH_X86
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 26380d3..2659176 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
 
@@ -52,6 +61,11 @@ struct ice_rx_queue {
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
 #endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +170,11 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
+#endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..73837f7
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned mask = rxq->nb_rx_desc - 1;
+	unsigned i;
+
+	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..d444be9
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,487 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+		  struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(
+			0x1c03804, 0x1c03804, 0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD,
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16(
+				0, 0, 0,    /* ignore non-length fields */
+				-rxq->crc_len, /* sub crc on data_len */
+				0,          /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				0, 0            /* ignore pkt_type field */
+			);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts shall be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8(
+		7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+		3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+		15, 14,      /* octet 15~14, 16 bits data_len */
+		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+		15, 14,      /* octet 15~14, low 16 bits pkt_len */
+		0xFF, 0xFF,  /* pkt_type set as unknown */
+		0xFF, 0xFF  /*pkt_type set as unknown */
+		);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf point into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf points */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* B.1 load 2 mbuf point */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0x04, 0x0C, 0x00, 0x08
+					);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static void __attribute__((cold))
+ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	_ice_rx_queue_release_mbufs_vec(rxq);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..73122f8 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,9 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
+	dpdk_conf.set('RTE_LIBRTE_ICE_INC_VECTOR', 1)
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
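
The C.3/C.4 steps in the receive loop above are what terminate a burst early: the DD bits are masked, packed down to 16-bit lanes, and the number of completed descriptors is taken as a population count. A minimal standalone sketch of that trick, using made-up status values rather than real descriptors (SSE2 plus the GCC popcount builtin, x86_64 for _mm_cvtsi128_si64):

#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
	/* pretend descriptors 0 and 1 have DD set, 2 and 3 do not */
	__m128i staterr = _mm_set_epi32(0, 0, 1, 1);
	const __m128i dd_check = _mm_set1_epi32(1);

	/* keep only the DD bits, then pack the four 32-bit lanes to 16-bit */
	staterr = _mm_and_si128(staterr, dd_check);
	staterr = _mm_packs_epi32(staterr, _mm_setzero_si128());

	/* each completed descriptor now contributes exactly one set bit */
	int nb_done = __builtin_popcountll(_mm_cvtsi128_si64(staterr));

	printf("%d descriptors completed\n", nb_done);	/* prints 2 */
	return 0;
}

The popcount only equals the usable burst length because the hardware completes descriptors in order, so the DD bits are contiguous from the start of the group; the loop then breaks as soon as fewer than ICE_DESCS_PER_LOOP are done.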

* [PATCH v3 4/8] net/ice: support Rx scatter SSE vector
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (2 preceding siblings ...)
  2019-03-15  6:22   ` [PATCH v3 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 5/8] net/ice: support Tx " Wenzhuo Lu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 8694872..6529ae5 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1494,7 +1494,8 @@
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 #ifdef RTE_ARCH_X86
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 #endif
@@ -2244,9 +2245,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2659176..aab4a3a 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -176,5 +176,7 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index d444be9..789cf07 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -464,6 +464,47 @@
 	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+					      split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+				      &split_flags[i]);
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
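
ice_recv_scattered_pkts_vec() above only pays for reassembly when the burst actually contains a split packet: the 32-entry split_flags array is first scanned as four 64-bit words, and reassembly starts at the first non-zero flag. A simplified, self-contained sketch of just that decision, with a hypothetical rxq_state standing in for the real ice_rx_queue:

#include <stdint.h>
#include <stdio.h>

struct rxq_state {		/* hypothetical, not the real ice_rx_queue */
	void *pkt_first_seg;	/* non-NULL while a packet is still in flight */
};

/* Return the index of the first buffer that needs reassembly, or nb_bufs
 * if the whole burst consists of complete packets. split_flags is assumed
 * to have 32 entries, matching ICE_VPMD_RX_BURST.
 */
static uint16_t
first_split(const struct rxq_state *rxq, const uint8_t *split_flags,
	    uint16_t nb_bufs)
{
	const uint64_t *fl64 = (const uint64_t *)(const void *)split_flags;
	uint16_t i = 0;

	/* happy day case: nothing pending and no split flag set */
	if (!rxq->pkt_first_seg &&
	    fl64[0] == 0 && fl64[1] == 0 && fl64[2] == 0 && fl64[3] == 0)
		return nb_bufs;

	if (rxq->pkt_first_seg)
		return 0;	/* resume the in-flight packet first */

	while (i < nb_bufs && !split_flags[i])
		i++;
	return i;
}

int main(void)
{
	struct rxq_state rxq = { NULL };
	uint8_t flags[32] = { 0 };

	flags[5] = 1;	/* buffer 5 is the start of a split packet */
	printf("reassembly starts at %u\n",
	       (unsigned int)first_split(&rxq, flags, 32));
	return 0;
}

Everything before that index is handed back to the caller untouched; only the tail of the burst goes through reassemble_packets().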

* [PATCH v3 5/8] net/ice: support Tx SSE vector
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (3 preceding siblings ...)
  2019-03-15  6:22   ` [PATCH v3 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  19 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
 5 files changed, 293 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1a19788..173c8f2 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -22,6 +23,7 @@ RSS reta update      = Y
 VLAN filter          = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 6529ae5..6fbb5a2 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2336,6 +2336,25 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_LIBRTE_ICE_INC_VECTOR
+#ifdef RTE_ARCH_X86
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index aab4a3a..02bb57e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -173,10 +173,14 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index 73837f7..8796ecb 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct ice_tx_entry *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -101,6 +168,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (!txq || !txq->sw_ring) {
+		PMD_DRV_LOG(DEBUG, "Pointer to txq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so the remaining mbufs need to be freed more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -137,6 +232,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -152,4 +270,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 789cf07..6babb8d 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -505,12 +505,131 @@
 				      &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
 	_ice_rx_queue_release_mbufs_vec(rxq);
 }
 
+static void __attribute__((cold))
+ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	_ice_tx_queue_release_mbufs_vec(txq);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -522,7 +641,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
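
ice_xmit_pkts_vec() above never hands more than tx_rs_thresh packets to the fixed-burst routine and stops as soon as a chunk comes back short, leaving the rest for the caller to retry. The same chunk-and-stop-early pattern, stripped of all hardware detail, can be sketched as below; burst_send() is a toy stand-in for ice_xmit_fixed_burst_vec(), not driver code:

#include <stdint.h>
#include <stdio.h>

/* toy ring with only 70 free descriptors left in total */
static uint16_t ring_free = 70;

static uint16_t
burst_send(void **pkts, uint16_t num)
{
	uint16_t sent = num < ring_free ? num : ring_free;

	(void)pkts;
	ring_free -= sent;
	return sent;
}

static uint16_t
xmit_in_chunks(void **pkts, uint16_t nb_pkts, uint16_t rs_thresh)
{
	uint16_t nb_tx = 0;

	while (nb_pkts) {
		uint16_t num = nb_pkts < rs_thresh ? nb_pkts : rs_thresh;
		uint16_t ret = burst_send(&pkts[nb_tx], num);

		nb_tx += ret;
		nb_pkts -= ret;
		if (ret < num)	/* ring full: stop early, caller retries later */
			break;
	}
	return nb_tx;
}

int main(void)
{
	void *dummy[128] = { 0 };

	printf("sent %d of 128\n", xmit_in_chunks(dummy, 128, 32)); /* 70 */
	return 0;
}

Capping each chunk at tx_rs_thresh keeps every fixed burst within one RS interval, so ice_tx_free_bufs() can always reclaim a full tx_rs_thresh block once the threshold descriptor reports DD.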

* [PATCH v3 6/8] net/ice: support Rx AVX2 vector
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (4 preceding siblings ...)
  2019-03-15  6:22   ` [PATCH v3 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15 17:54     ` Ferruh Yigit
  2019-03-15  6:22   ` [PATCH v3 7/8] net/ice: support Rx scatter " Wenzhuo Lu
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  16 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 613 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 662 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 33c7fc2..e1cb632 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_INC_VECTOR) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 6fbb5a2..5f9c2ae 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1495,7 +1495,8 @@
 #ifdef RTE_LIBRTE_ICE_INC_VECTOR
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 #endif
@@ -2239,21 +2240,30 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 02bb57e..63c552c 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -182,5 +182,7 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..2b9dad7
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,613 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/**
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 = _mm256_inserti128_si256(
+				_mm256_castsi128_si256(vaddr0), vaddr1, 1);
+		vaddr2_3 = _mm256_inserti128_si256(
+				_mm256_castsi128_si256(vaddr2), vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+			uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust = _mm256_set_epi16(
+			/* first descriptor */
+			0, 0, 0,       /* ignore non-length fields */
+			-rxq->crc_len, /* sub crc on data_len */
+			0,             /* ignore high-16bits of pkt_len */
+			-rxq->crc_len, /* sub crc on pkt_len */
+			0, 0,          /* ignore pkt_type field */
+			/* second descriptor */
+			0, 0, 0,       /* ignore non-length fields */
+			-rxq->crc_len, /* sub crc on data_len */
+			0,             /* ignore high-16bits of pkt_len */
+			-rxq->crc_len, /* sub crc on pkt_len */
+			0, 0           /* ignore pkt_type field */
+	);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk = _mm256_set_epi8(
+			/* first descriptor */
+			7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			15, 14,      /* octet 15~14, 16 bits data_len */
+			0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			0xFF, 0xFF,  /*pkt_type set as unknown */
+			/* second descriptor */
+			7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			15, 14,      /* octet 15~14, 16 bits data_len */
+			0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			0xFF, 0xFF,  /* pkt_type set as unknown */
+			0xFF, 0xFF   /*pkt_type set as unknown */
+	);
+	/**
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/**
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask = _mm256_set1_epi32(
+			(1 << 2) | (1 << 11) | (3 << 12) | (7 << 22));
+	/**
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf = _mm256_set_epi32(
+			0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+			0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/**
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf = _mm256_set_epi8(
+			0, 0, 0, 0, 0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0, /* end up 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	/**
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits.  This gives use the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask = _mm256_set1_epi32(
+			PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+			PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+			PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+
+	for (i = 0, received = 0; i < nb_pkts;
+	     i += ICE_DESCS_PER_LOOP_AVX,
+	     rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				    _mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256
+			((void *)&rx_pkts[i + 4],
+			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		do {
+			const __m128i raw_desc7 =
+				_mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 =
+				_mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 =
+				_mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 =
+				_mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 =
+				_mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 =
+				_mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 =
+				_mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 =
+				_mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc6),
+					 raw_desc7, 1);
+			raw_desc4_5 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc4),
+					 raw_desc5, 1);
+			raw_desc2_3 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc2),
+					 raw_desc3, 1);
+			raw_desc0_1 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc0),
+					 raw_desc1, 1);
+		} while (0);
+
+		if (split_packet) {
+			int j;
+
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/**
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7,
+							 PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5,
+							 PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7,
+							   len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5,
+							   len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/**
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * so that the ptype is in the lower 8 bits of each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/**
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3,
+							 PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1,
+							 PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3,
+							   len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1,
+							   len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+								desc0_1);
+
+		/**
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+							  status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits =
+			_mm256_and_si256(status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags =
+			_mm256_shuffle_epi8(vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags =
+			_mm256_shuffle_epi8(rss_flags_shuf,
+					    _mm256_srli_epi32(flag_bits, 11));
+		/**
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/**
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init
+		 * data so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend
+		 * for each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				 offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				 RTE_ALIGN(offsetof(struct rte_mbuf,
+						    rearm_data),
+					   16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+			rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 8),
+					    0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 4),
+					    0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(mbuf_flags, 4),
+					    0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
+				    rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
+				    rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
+				    rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
+				    rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags = _mm256_castsi128_si256(
+				_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 8),
+					    0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 4),
+					    0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(odd_flags, 4),
+					    0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
+				    rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
+				    rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
+				    rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
+				    rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask = _mm_set1_epi16(
+					1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+					eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits =
+				_mm_packus_epi32
+					(_mm256_castsi256_si128(eop_bits256),
+					 _mm256_extractf128_si256(eop_bits256,
+								  1));
+			/**
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit, i.e. !EOP, rather than an EOP one.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/**
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle = _mm_set_epi8(
+					/* zero hi 64b */
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					/* move values to lo 64b */
+					8, 0, 10, 2,
+					12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet =
+				_mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+					       _mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll(_mm_cvtsi128_si64(
+				_mm256_extracti128_si256(status0_7, 1)));
+		burst += __builtin_popcountll(_mm_cvtsi128_si64(
+				_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 73122f8..a1bd5b1 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -16,4 +16,19 @@ if arch_subdir == 'x86'
 	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
 	dpdk_conf.set('RTE_LIBRTE_ICE_INC_VECTOR', 1)
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. we have AVX2 supported in minimum instruction set baseline
+	# b. it's not minimum instruction set, but supported by compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
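
The descriptor reads in the AVX2 loop above are deliberately done as 128-bit loads and then merged in pairs, because a 16-byte load of one write-back descriptor is atomic while a 32-byte load spanning two descriptors is not guaranteed to be. A tiny sketch of just that merge step on dummy data (assumes AVX2, compile with -mavx2):

#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

int main(void)
{
	/* two fake 16-byte "descriptors" */
	uint64_t d0[2] = { 0x1111111111111111ULL, 0x2222222222222222ULL };
	uint64_t d1[2] = { 0x3333333333333333ULL, 0x4444444444444444ULL };
	uint64_t out[4];

	const __m128i lo = _mm_loadu_si128((const __m128i *)d0);
	const __m128i hi = _mm_loadu_si128((const __m128i *)d1);

	/* widen the first load to 256 bits and insert the second into the
	 * upper 128-bit lane, as the Rx loop does for each descriptor pair
	 */
	__m256i pair = _mm256_inserti128_si256(_mm256_castsi128_si256(lo),
					       hi, 1);

	_mm256_storeu_si256((__m256i *)out, pair);
	printf("%llx %llx %llx %llx\n",
	       (unsigned long long)out[0], (unsigned long long)out[1],
	       (unsigned long long)out[2], (unsigned long long)out[3]);
	return 0;
}

The compiler barriers between the individual loads in the driver keep the compiler from fusing them back into wider, non-atomic accesses; they are not needed in this toy example.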

* [PATCH v3 7/8] net/ice: support Rx scatter AVX2 vector
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (5 preceding siblings ...)
  2019-03-15  6:22   ` [PATCH v3 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15  6:22   ` [PATCH v3 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  2019-03-15  8:08   ` [PATCH v3 0/8] Support vector instructions on ICE Zhang, Qi Z
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5f9c2ae..d8c3cdc 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1496,7 +1496,8 @@
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 #endif
@@ -2254,9 +2255,12 @@ void __attribute__((cold))
 
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 63c552c..a918646 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -184,5 +184,8 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 2b9dad7..a5f9b85 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -611,3 +611,67 @@
 {
 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles a single burst of 32 scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
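
reassemble_packets(), used by both the SSE and the AVX2 scattered paths, lives in ice_rxtx_vec_common.h and chains the non-EOP buffers onto the first segment while fixing up nb_segs and pkt_len. A heavily simplified sketch of that chaining step, using a hypothetical struct seg instead of rte_mbuf and ignoring the CRC handling across segments that the real helper also does:

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

struct seg {			/* hypothetical stand-in for rte_mbuf */
	struct seg *next;
	uint16_t data_len;
	uint16_t nb_segs;	/* meaningful in the first segment only */
	uint32_t pkt_len;	/* meaningful in the first segment only */
};

/* append 'tail' to the packet headed by 'head', whose current last
 * segment is 'last'; this mirrors how the vector Rx reassembly grows
 * a packet until the buffer carrying the EOP flag arrives
 */
static void
chain_seg(struct seg *head, struct seg *last, struct seg *tail)
{
	last->next = tail;
	head->nb_segs++;
	head->pkt_len += tail->data_len;
}

int main(void)
{
	struct seg a = { NULL, 1500, 1, 1500 };
	struct seg b = { NULL, 1500, 1, 1500 };
	struct seg c = { NULL, 200, 1, 200 };

	chain_seg(&a, &a, &b);	/* first split buffer */
	chain_seg(&a, &b, &c);	/* EOP buffer */
	printf("segs=%u pkt_len=%u\n",
	       (unsigned int)a.nb_segs, (unsigned int)a.pkt_len); /* 3, 3200 */
	return 0;
}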

* [PATCH v3 8/8] net/ice: support vector AVX2 in TX
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (6 preceding siblings ...)
  2019-03-15  6:22   ` [PATCH v3 7/8] net/ice: support Rx scatter " Wenzhuo Lu
@ 2019-03-15  6:22   ` Wenzhuo Lu
  2019-03-15 17:54     ` Ferruh Yigit
  2019-03-15  8:08   ` [PATCH v3 0/8] Support vector instructions on ICE Zhang, Qi Z
  8 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-15  6:22 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  13 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
 4 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index 61a2c73..610c4cd 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -91,6 +91,10 @@ New Features
 
   * Added promiscuous mode support.
 
+* **Added support for vector instructions on ICE.**
+
+   Added support for SSE and AVX2 instructions in the ICE RX and TX path.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d8c3cdc..aa7722e 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2354,15 +2354,24 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index a918646..c5ac02d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -187,5 +187,7 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index a5f9b85..985125d 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -675,3 +675,161 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on 32-byte boundary, do one to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 =
+			_mm256_set_epi64x
+				(hi_qw3,
+				 pkt[3]->buf_physaddr + pkt[3]->data_off,
+				 hi_qw2,
+				 pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 =
+			_mm256_set_epi64x
+				(hi_qw1,
+				 pkt[1]->buf_physaddr + pkt[1]->data_off,
+				 hi_qw0,
+				 pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* crossing tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		       uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
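
Both ice_vtx1() variants above (SSE and AVX2) build the same 16-byte data descriptor: a low quadword holding the buffer DMA address and a high quadword assembled from the descriptor type, the command flags and the buffer size, each shifted into place. A standalone sketch of that construction, with made-up shift values standing in for the real ICE_TXD_QW1_* / ICE_TX_DESC_* constants from the base code:

#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>

/* placeholder values, NOT the real ICE_TXD_QW1_* / ICE_TX_DESC_* macros */
#define DTYPE_DATA	0x0ULL
#define CMD_SHIFT	4
#define BUF_SZ_SHIFT	34

int main(void)
{
	uint64_t dma_addr = 0x12345000ULL;	/* buf addr + data offset */
	uint64_t data_len = 64;
	uint64_t cmd = 0x3;			/* e.g. EOP | RS */
	uint64_t desc[2];

	uint64_t high_qw = DTYPE_DATA |
			   (cmd << CMD_SHIFT) |
			   (data_len << BUF_SZ_SHIFT);

	/* low qword first in memory, hence _mm_set_epi64x(high, low) */
	_mm_storeu_si128((__m128i *)desc,
			 _mm_set_epi64x((long long)high_qw,
					(long long)dma_addr));

	printf("qw0=%#llx qw1=%#llx\n",
	       (unsigned long long)desc[0], (unsigned long long)desc[1]);
	return 0;
}

The AVX2 path only differs in that ice_vtx() packs two such descriptors into each 256-bit store once the descriptor pointer is 32-byte aligned, which is why it does a single scalar ice_vtx1() first when it is not.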

* Re: [PATCH v3 0/8] Support vector instructions on ICE
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (7 preceding siblings ...)
  2019-03-15  6:22   ` [PATCH v3 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
@ 2019-03-15  8:08   ` Zhang, Qi Z
  8 siblings, 0 replies; 121+ messages in thread
From: Zhang, Qi Z @ 2019-03-15  8:08 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev; +Cc: Lu, Wenzhuo



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Friday, March 15, 2019 2:23 PM
> To: dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: [dpdk-dev] [PATCH v3 0/8] Support vector instructions on ICE
> 
> Use SSE and AVX2 instructions in ICE RX and TX path.
> 
> ---
> v2:
>  - Updated feature doc.
>  - Fixed checklog and checkpatch issues.
> 
> v3:
>  - Fixed potential compile issue on non-X86 platform.
> 
> Wenzhuo Lu (8):
>   net/ice: fix Tx function setting
>   net/ice: add pointer for queue buffer release
>   net/ice: support vector SSE in RX
>   net/ice: support Rx scatter SSE vector
>   net/ice: support Tx SSE vector
>   net/ice: support Rx AVX2 vector
>   net/ice: support Rx scatter AVX2 vector
>   net/ice: support vector AVX2 in TX
> 
>  config/common_base                     |   1 +
>  doc/guides/nics/features/ice_vec.ini   |  35 ++
>  doc/guides/rel_notes/release_19_05.rst |   4 +
>  drivers/net/ice/Makefile               |  22 +
>  drivers/net/ice/ice_ethdev.c           |   3 +-
>  drivers/net/ice/ice_ethdev.h           |   2 +
>  drivers/net/ice/ice_rxtx.c             | 105 ++++-
>  drivers/net/ice/ice_rxtx.h             |  39 ++
>  drivers/net/ice/ice_rxtx_vec_avx2.c    | 835
> +++++++++++++++++++++++++++++++++
>  drivers/net/ice/ice_rxtx_vec_common.h  | 288 ++++++++++++
>  drivers/net/ice/ice_rxtx_vec_sse.c     | 663
> ++++++++++++++++++++++++++
>  drivers/net/ice/meson.build            |  21 +
>  12 files changed, 2005 insertions(+), 13 deletions(-)  create mode 100644
> doc/guides/nics/features/ice_vec.ini
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> 
> --
> 1.9.3

Acked-by: Qi Zhang <qi.z.zhang@intel.com>

Applied to dpdk-next-net-intel with minor fix on the patch title (TX->Tx, RX->Rx)

Thanks
Qi

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 1/8] net/ice: fix Tx function setting
  2019-03-15  6:22   ` [PATCH v3 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-15 17:52     ` Ferruh Yigit
  2019-03-18  1:08       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-15 17:52 UTC (permalink / raw)
  To: Wenzhuo Lu, dev

On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> The TX setting functions is not called.
> 
> Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")

Do we need stable@dpdk.org tag for this?

> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>  drivers/net/ice/ice_ethdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> index a23c63a..b804be1 100644
> --- a/drivers/net/ice/ice_ethdev.c
> +++ b/drivers/net/ice/ice_ethdev.c
> @@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
>  	}
>  
>  	ice_set_rx_function(dev);
> +	ice_set_tx_function(dev);
>  
>  	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
>  			ETH_VLAN_EXTEND_MASK;
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 2/8] net/ice: add pointer for queue buffer release
  2019-03-15  6:22   ` [PATCH v3 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-15 17:52     ` Ferruh Yigit
  2019-03-18  1:15       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-15 17:52 UTC (permalink / raw)
  To: Wenzhuo Lu, dev

On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> Add function pointers of buffer releasing for RX and
> TX queues, for vector functions will be added for RX
> and TX.
> 
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

<...>

> @@ -27,6 +27,9 @@
>  
>  #define ICE_SUPPORT_CHAIN_NUM 5
>  
> +typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
> +typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
> +
>  struct ice_rx_entry {
>  	struct rte_mbuf *mbuf;
>  };
> @@ -61,6 +64,7 @@ struct ice_rx_queue {
>  	uint16_t max_pkt_len; /* Maximum packet length */
>  	bool q_set; /* indicate if rx queue has been configured */
>  	bool rx_deferred_start; /* don't start this queue in dev start */
> +	ice_rx_release_mbufs rx_rel_mbufs;
>  };
>  
>  struct ice_tx_entry {
> @@ -100,6 +104,7 @@ struct ice_tx_queue {
>  	uint16_t tx_next_rs;
>  	bool tx_deferred_start; /* don't start this queue in dev start */
>  	bool q_set; /* indicate if tx queue has been configured */
> +	ice_tx_release_mbufs tx_rel_mbufs;

We are not using suffixes as a coding convention, and indeed the guidelines
explicitly say "Avoid typedefs ending in _t", but in this case it is not
clear that these are function pointers.

So what do you think about either appending a _t suffix, or putting the verb
at the end so it reads more like a function than an object:
ice_tx_mbufs_release_t
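
For illustration, a minimal header sketch of the first option (this is the
naming v4 of the series ends up using; the forward declarations are only
there to keep the snippet self-contained):

struct ice_rx_queue;
struct ice_tx_queue;

/* Function-pointer types for the per-queue mbuf release hooks.  The _t
 * suffix signals that these are callable types, not plain data members. */
typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);

Queue setup code can then assign either the scalar or the vector release
routine to rxq->rx_rel_mbufs / txq->tx_rel_mbufs.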

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 3/8] net/ice: support vector SSE in RX
  2019-03-15  6:22   ` [PATCH v3 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-15 17:53     ` Ferruh Yigit
  2019-03-18  1:22       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-15 17:53 UTC (permalink / raw)
  To: Wenzhuo Lu, dev

On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

<...>

> @@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
>  CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
>  CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
>  CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
> +CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y

Meson seems to set this config automatically. Do we need this compile time
option at all?
Would it work if we replaced this with a device arg, which could be used to
disable the vector path if set, and which 'ice_rx_vec_dev_check()' can check?
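
A rough sketch of that alternative, using a hypothetical "disable-vector"
key parsed via rte_kvargs; neither the key name nor this wiring exists in
the driver (the series later simply drops the compile flag instead), it only
illustrates how a runtime knob could feed into 'ice_rx_vec_dev_check()':

#include <stdbool.h>
#include <stdlib.h>
#include <rte_common.h>
#include <rte_devargs.h>
#include <rte_kvargs.h>

#define ICE_DISABLE_VECTOR_ARG "disable-vector"	/* hypothetical key */

static int
parse_bool_arg(const char *key __rte_unused, const char *value, void *opaque)
{
	bool *flag = opaque;

	*flag = strtoul(value, NULL, 10) != 0;
	return 0;
}

/* true when the user passed disable-vector=1 in the device arguments */
static bool
ice_vector_disabled(struct rte_devargs *devargs)
{
	static const char *const keys[] = { ICE_DISABLE_VECTOR_ARG, NULL };
	struct rte_kvargs *kvlist;
	bool disabled = false;

	if (!devargs || !devargs->args)
		return false;
	kvlist = rte_kvargs_parse(devargs->args, keys);
	if (!kvlist)
		return false;
	rte_kvargs_process(kvlist, ICE_DISABLE_VECTOR_ARG,
			   parse_bool_arg, &disabled);
	rte_kvargs_free(kvlist);
	return disabled;
}

ice_set_rx_function()/ice_set_tx_function() could then fall back to the
scalar paths whenever this returns true.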

<...>

> @@ -0,0 +1,155 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#ifndef _ICE_RXTX_VEC_COMMON_H_
> +#define _ICE_RXTX_VEC_COMMON_H_
> +
> +#include "ice_rxtx.h"
> +
> +static inline uint16_t
> +reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
> +		   uint16_t nb_bufs, uint8_t *split_flags)
> +{
> +	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
> +	struct rte_mbuf *start = rxq->pkt_first_seg;
> +	struct rte_mbuf *end =  rxq->pkt_last_seg;
> +	unsigned pkt_idx, buf_idx;
There are checkpatch warnings for using 'unsigned int' instead of 'unsigned',
can you please fix them? There are a few of them.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 6/8] net/ice: support Rx AVX2 vector
  2019-03-15  6:22   ` [PATCH v3 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-15 17:54     ` Ferruh Yigit
  2019-03-18  1:37       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-15 17:54 UTC (permalink / raw)
  To: Wenzhuo Lu, dev

On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
<...>

> +#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> +		/* for AVX we need alignment otherwise loads are not atomic */
> +		if (avx_aligned) {
> +			/* load in descriptors, 2 at a time, in reverse order */
> +			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
> +			rte_compiler_barrier();
> +			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
> +			rte_compiler_barrier();
> +			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
> +			rte_compiler_barrier();
> +			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
> +		} else
> +#endif
> +		do {
> +			const __m128i raw_desc7 =
> +				_mm_load_si128((void *)(rxdp + 7));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc6 =
> +				_mm_load_si128((void *)(rxdp + 6));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc5 =
> +				_mm_load_si128((void *)(rxdp + 5));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc4 =
> +				_mm_load_si128((void *)(rxdp + 4));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc3 =
> +				_mm_load_si128((void *)(rxdp + 3));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc2 =
> +				_mm_load_si128((void *)(rxdp + 2));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc1 =
> +				_mm_load_si128((void *)(rxdp + 1));
> +			rte_compiler_barrier();
> +			const __m128i raw_desc0 =
> +				_mm_load_si128((void *)(rxdp + 0));
> +
> +			raw_desc6_7 =
> +				_mm256_inserti128_si256
> +					(_mm256_castsi128_si256(raw_desc6),
> +					 raw_desc7, 1);
> +			raw_desc4_5 =
> +				_mm256_inserti128_si256
> +					(_mm256_castsi128_si256(raw_desc4),
> +					 raw_desc5, 1);
> +			raw_desc2_3 =
> +				_mm256_inserti128_si256
> +					(_mm256_castsi128_si256(raw_desc2),
> +					 raw_desc3, 1);
> +			raw_desc0_1 =
> +				_mm256_inserti128_si256
> +					(_mm256_castsi128_si256(raw_desc0),
> +					 raw_desc1, 1);
> +		} while (0);

Is this to provide the proper indentation because of the above #ifdef block?
If so, why not use simple { } braces for the scope? Does do { } while (0)
have any benefit over that?
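
A standalone sketch of the two constructs being compared, independent of the
driver code; both compile to the same thing, the question is only how the
fallback scope reads after the #ifdef'd "else":

#include <stdio.h>

/* stand-in for RTE_LIBRTE_ICE_16BYTE_RX_DESC */
#define USE_ALIGNED_LOADS 0

int main(void)
{
#if USE_ALIGNED_LOADS
	if (1) {
		puts("aligned 256-bit loads");
	} else
#endif
	{	/* plain braces open an anonymous scope for the fallback */
		puts("unaligned 128-bit loads merged with inserti128");
	}

	return 0;
}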

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 8/8] net/ice: support vector AVX2 in TX
  2019-03-15  6:22   ` [PATCH v3 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
@ 2019-03-15 17:54     ` Ferruh Yigit
  2019-03-18  1:38       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-15 17:54 UTC (permalink / raw)
  To: Wenzhuo Lu, dev, Qi Zhang

On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>  doc/guides/rel_notes/release_19_05.rst |   4 +
>  drivers/net/ice/ice_rxtx.c             |  13 ++-
>  drivers/net/ice/ice_rxtx.h             |   2 +
>  drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
>  4 files changed, 175 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
> index 61a2c73..610c4cd 100644
> --- a/doc/guides/rel_notes/release_19_05.rst
> +++ b/doc/guides/rel_notes/release_19_05.rst
> @@ -91,6 +91,10 @@ New Features
>  
>    * Added promiscuous mode support.
>  
> +* **Added support of vector instructions on ICE.**
> +
> +   Added support of SSE and AVX2 instructions in ICE RX and TX path.
> +

The ice documentation doesn't have any information about the vector path, can
you please update it?

I think it would be good to document when the vector path is used. How is it
decided whether to use the scalar, SSE or AVX path? What will prevent using
the vector path, like any offload or any specific config?

Thanks,
ferruh
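
For context, a hedged sketch of the selection logic being asked about; the
names follow the series, but the real decision lives in
ice_set_rx_function()/ice_set_tx_function(), which first checks the per-queue
vector preconditions and then prefers AVX2 over SSE when the CPU reports it:

#include <stdbool.h>
#include <rte_cpuflags.h>
#include "ice_rxtx.h"

static void
ice_pick_rx_burst_sketch(struct rte_eth_dev *dev)
{
	/* vector path only if every Rx queue meets the vector
	 * preconditions (power-of-two ring size, free threshold, ...) */
	if (ice_rx_vec_dev_check(dev) == 0) {
		bool use_avx2 =
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1;

		dev->rx_pkt_burst = use_avx2 ? ice_recv_pkts_vec_avx2 :
					       ice_recv_pkts_vec;
	} else {
		/* any queue failing the check forces the scalar path */
		dev->rx_pkt_burst = ice_recv_pkts;
	}
}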

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 1/8] net/ice: fix Tx function setting
  2019-03-15 17:52     ` Ferruh Yigit
@ 2019-03-18  1:08       ` Lu, Wenzhuo
  2019-03-20 17:22         ` Ferruh Yigit
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-18  1:08 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,

> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Saturday, March 16, 2019 1:52 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 1/8] net/ice: fix Tx function setting
> 
> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> > The TX setting functions is not called.
> >
> > Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
> 
> Do we need stable@dpdk.org tag for this?
This patch fixes a bug which was introduced in 19.02. That's why I think "stable" is not needed. Please let me know if I'm wrong. Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 2/8] net/ice: add pointer for queue buffer release
  2019-03-15 17:52     ` Ferruh Yigit
@ 2019-03-18  1:15       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-18  1:15 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,

> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Saturday, March 16, 2019 1:53 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 2/8] net/ice: add pointer for queue buffer
> release
> 
> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> > Add function pointers of buffer releasing for RX and TX queues, for
> > vector functions will be added for RX and TX.
> >
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> 
> <...>
> 
> > @@ -27,6 +27,9 @@
> >
> >  #define ICE_SUPPORT_CHAIN_NUM 5
> >
> > +typedef void (*ice_rx_release_mbufs)(struct ice_rx_queue *rxq);
> > +typedef void (*ice_tx_release_mbufs)(struct ice_tx_queue *txq);
> > +
> >  struct ice_rx_entry {
> >  	struct rte_mbuf *mbuf;
> >  };
> > @@ -61,6 +64,7 @@ struct ice_rx_queue {
> >  	uint16_t max_pkt_len; /* Maximum packet length */
> >  	bool q_set; /* indicate if rx queue has been configured */
> >  	bool rx_deferred_start; /* don't start this queue in dev start */
> > +	ice_rx_release_mbufs rx_rel_mbufs;
> >  };
> >
> >  struct ice_tx_entry {
> > @@ -100,6 +104,7 @@ struct ice_tx_queue {
> >  	uint16_t tx_next_rs;
> >  	bool tx_deferred_start; /* don't start this queue in dev start */
> >  	bool q_set; /* indicate if tx queue has been configured */
> > +	ice_tx_release_mbufs tx_rel_mbufs;
> 
> We are not using suffixes as coding convention, and indeed it says "Avoid
> typedefs ending in _t" explicitly, but for this case it is not clear that they are
> function pointers.
> 
> So what do you think either appending a _t suffix, or putting verb to the end
> to more sound like function more than object:
> ice_tx_mbufs_release_t
Thanks for the comment. It's better to distinguish the function pointers from other types. I'll add "_t".

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 3/8] net/ice: support vector SSE in RX
  2019-03-15 17:53     ` Ferruh Yigit
@ 2019-03-18  1:22       ` Lu, Wenzhuo
  2019-03-20 17:35         ` Ferruh Yigit
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-18  1:22 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,

> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Saturday, March 16, 2019 1:54 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 3/8] net/ice: support vector SSE in RX
> 
> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> 
> <...>
> 
> > @@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
> > CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
> >  CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
> >  CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
> > +CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
> 
> Meson seems setting this config automatically. Do we need this compile time
> option at all?
It's not for the meson build. It's for the traditional build. In my opinion, the meson build doesn't even use the configure file; it has its own configuration inside.

> Would it work if we replace this with a device arg, which can be used to
> disable vector path if set, and 'ice_rx_vec_dev_check()' can check it?
We've implemented the dynamic selection of the vector and normal paths. This compile setting is for the case where the user wants to remove the vector code entirely, so there may be a small performance benefit.

> 
> <...>
> 
> > @@ -0,0 +1,155 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation  */
> > +
> > +#ifndef _ICE_RXTX_VEC_COMMON_H_
> > +#define _ICE_RXTX_VEC_COMMON_H_
> > +
> > +#include "ice_rxtx.h"
> > +
> > +static inline uint16_t
> > +reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
> > +		   uint16_t nb_bufs, uint8_t *split_flags) {
> > +	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
> > +	struct rte_mbuf *start = rxq->pkt_first_seg;
> > +	struct rte_mbuf *end =  rxq->pkt_last_seg;
> > +	unsigned pkt_idx, buf_idx;
> There are checkpatch warnings for using 'unsigned int' instead of 'unsigned',
> can you please fix them? There are a few of them.
Sure, will fix them.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 6/8] net/ice: support Rx AVX2 vector
  2019-03-15 17:54     ` Ferruh Yigit
@ 2019-03-18  1:37       ` Lu, Wenzhuo
  2019-03-20 17:37         ` Ferruh Yigit
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-18  1:37 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,


> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Saturday, March 16, 2019 1:54 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 6/8] net/ice: support Rx AVX2 vector
> 
> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> <...>
> 
> > +#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> > +		/* for AVX we need alignment otherwise loads are not
> atomic */
> > +		if (avx_aligned) {
> > +			/* load in descriptors, 2 at a time, in reverse order */
> > +			raw_desc6_7 = _mm256_load_si256((void *)(rxdp +
> 6));
> > +			rte_compiler_barrier();
> > +			raw_desc4_5 = _mm256_load_si256((void *)(rxdp +
> 4));
> > +			rte_compiler_barrier();
> > +			raw_desc2_3 = _mm256_load_si256((void *)(rxdp +
> 2));
> > +			rte_compiler_barrier();
> > +			raw_desc0_1 = _mm256_load_si256((void *)(rxdp +
> 0));
> > +		} else
> > +#endif
> > +		do {
> > +			const __m128i raw_desc7 =
> > +				_mm_load_si128((void *)(rxdp + 7));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc6 =
> > +				_mm_load_si128((void *)(rxdp + 6));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc5 =
> > +				_mm_load_si128((void *)(rxdp + 5));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc4 =
> > +				_mm_load_si128((void *)(rxdp + 4));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc3 =
> > +				_mm_load_si128((void *)(rxdp + 3));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc2 =
> > +				_mm_load_si128((void *)(rxdp + 2));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc1 =
> > +				_mm_load_si128((void *)(rxdp + 1));
> > +			rte_compiler_barrier();
> > +			const __m128i raw_desc0 =
> > +				_mm_load_si128((void *)(rxdp + 0));
> > +
> > +			raw_desc6_7 =
> > +				_mm256_inserti128_si256
> > +
> 	(_mm256_castsi128_si256(raw_desc6),
> > +					 raw_desc7, 1);
> > +			raw_desc4_5 =
> > +				_mm256_inserti128_si256
> > +
> 	(_mm256_castsi128_si256(raw_desc4),
> > +					 raw_desc5, 1);
> > +			raw_desc2_3 =
> > +				_mm256_inserti128_si256
> > +
> 	(_mm256_castsi128_si256(raw_desc2),
> > +					 raw_desc3, 1);
> > +			raw_desc0_1 =
> > +				_mm256_inserti128_si256
> > +
> 	(_mm256_castsi128_si256(raw_desc0),
> > +					 raw_desc1, 1);
> > +		} while (0);
> 
> Is this to provide the proper indention because of the above #ifdef block? If
> so why not simple { } for the scope, is do{ }while(0) has benefit against it?
Yes, it's for the indentation. In my opinion, "do while" looks friendlier to readers because we always use it in macros. A bare "{}" looks like it is missing a function name or a "for ()" :)

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 8/8] net/ice: support vector AVX2 in TX
  2019-03-15 17:54     ` Ferruh Yigit
@ 2019-03-18  1:38       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-18  1:38 UTC (permalink / raw)
  To: Yigit, Ferruh, dev, Zhang, Qi Z

Hi Ferruh,


> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Saturday, March 16, 2019 1:55 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org; Zhang, Qi Z
> <qi.z.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 8/8] net/ice: support vector AVX2 in TX
> 
> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> > ---
> >  doc/guides/rel_notes/release_19_05.rst |   4 +
> >  drivers/net/ice/ice_rxtx.c             |  13 ++-
> >  drivers/net/ice/ice_rxtx.h             |   2 +
> >  drivers/net/ice/ice_rxtx_vec_avx2.c    | 158
> +++++++++++++++++++++++++++++++++
> >  4 files changed, 175 insertions(+), 2 deletions(-)
> >
> > diff --git a/doc/guides/rel_notes/release_19_05.rst
> > b/doc/guides/rel_notes/release_19_05.rst
> > index 61a2c73..610c4cd 100644
> > --- a/doc/guides/rel_notes/release_19_05.rst
> > +++ b/doc/guides/rel_notes/release_19_05.rst
> > @@ -91,6 +91,10 @@ New Features
> >
> >    * Added promiscuous mode support.
> >
> > +* **Added support of vector instructions on ICE.**
> > +
> > +   Added support of SSE and AVX2 instructions in ICE RX and TX path.
> > +
> 
> ice documentation doesn't have any information about vector path, can you
> please update it?
> 
> I think it can be good to document when vector path is used? How to decide
> scalar, sse or avx to use? What will prevent using vector path, like any
> offload or any specific config?
Thanks for the comments. Will add more info here. 
> 
> Thanks,
> ferruh

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 1/8] net/ice: fix Tx function setting
  2019-03-18  1:08       ` Lu, Wenzhuo
@ 2019-03-20 17:22         ` Ferruh Yigit
  2019-03-21  2:29           ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-20 17:22 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev

On 3/18/2019 1:08 AM, Lu, Wenzhuo wrote:
> Hi Ferruh,
> 
>> -----Original Message-----
>> From: Yigit, Ferruh
>> Sent: Saturday, March 16, 2019 1:52 AM
>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 1/8] net/ice: fix Tx function setting
>>
>> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
>>> The TX setting functions is not called.
>>>
>>> Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
>>
>> Do we need stable@dpdk.org tag for this?
> This patch fixed a bug which is introduced in 19.02. That's why I think "stable" is not needed. Please let me know if I'm wrong. Thanks.
> 

The 'stable' tag is not required if the commit you are fixing is not released
yet, i.e. it is in the current release, 19.05.
Otherwise the 'stable' tag is required.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 3/8] net/ice: support vector SSE in RX
  2019-03-18  1:22       ` Lu, Wenzhuo
@ 2019-03-20 17:35         ` Ferruh Yigit
  2019-03-21  2:48           ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-20 17:35 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev

On 3/18/2019 1:22 AM, Lu, Wenzhuo wrote:
> Hi Ferruh,
> 
>> -----Original Message-----
>> From: Yigit, Ferruh
>> Sent: Saturday, March 16, 2019 1:54 AM
>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 3/8] net/ice: support vector SSE in RX
>>
>> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
>>> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
>>
>> <...>
>>
>>> @@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
>>> CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
>>>  CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
>>>  CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
>>> +CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
>>
>> Meson seems setting this config automatically. Do we need this compile time
>> option at all?
> It's not for meson build. It's for the traditional build. To my opinion, meson build even doesn't use the configure file. It has its own configuration inside.

I am not saying this is for meson.

This is a compile time config option, to let the user enable/disable the
VECTOR path of the PMD.
But the meson build doesn't get this input from the user and enables it by
default. If enabling it by default works and is an acceptable solution, why
are we not doing the same thing for the makefile?

Why not just remove the config option completely and update the code as if
it were always enabled?

In this default-enabled case, if there is a need to let the user disable the
vector path, which there may be, why not add a device argument to disable
the vector path at runtime? I believe this can be done easily, as described
below.

What is the benefit of a compile time flag against runtime devargs?
Why would someone want to remove all the vector path code from the binary,
just to gain a few kilobytes in the final binary?
The other way around, having the vector path in the binary but disabling it
dynamically when needed has the advantage that it can easily be enabled
again, without recompiling, when the platform supports the vector path.

> 
>> Would it work if we replace this with a device arg, which can be used to
>> disable vector path if set, and 'ice_rx_vec_dev_check()' can check it?
> We've implemented the dynamic selection of vector and normal path. Here is the compile setting. In case the user wants to remove the vector code thoroughly, so there may be a little performance benefit.
> 
>>
>> <...>
>>
>>> @@ -0,0 +1,155 @@
>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>> + * Copyright(c) 2019 Intel Corporation  */
>>> +
>>> +#ifndef _ICE_RXTX_VEC_COMMON_H_
>>> +#define _ICE_RXTX_VEC_COMMON_H_
>>> +
>>> +#include "ice_rxtx.h"
>>> +
>>> +static inline uint16_t
>>> +reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
>>> +		   uint16_t nb_bufs, uint8_t *split_flags) {
>>> +	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
>>> +	struct rte_mbuf *start = rxq->pkt_first_seg;
>>> +	struct rte_mbuf *end =  rxq->pkt_last_seg;
>>> +	unsigned pkt_idx, buf_idx;
>> There are checkpatch warnings for using 'unsigned int' instead of 'unsigned',
>> can you please fix them? There are a few of them.
> Sure, will fix them.
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 6/8] net/ice: support Rx AVX2 vector
  2019-03-18  1:37       ` Lu, Wenzhuo
@ 2019-03-20 17:37         ` Ferruh Yigit
  2019-03-21  2:31           ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-20 17:37 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev

On 3/18/2019 1:37 AM, Lu, Wenzhuo wrote:
> Hi Ferruh,
> 
> 
>> -----Original Message-----
>> From: Yigit, Ferruh
>> Sent: Saturday, March 16, 2019 1:54 AM
>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 6/8] net/ice: support Rx AVX2 vector
>>
>> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
>>> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
>> <...>
>>
>>> +#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
>>> +		/* for AVX we need alignment otherwise loads are not
>> atomic */
>>> +		if (avx_aligned) {
>>> +			/* load in descriptors, 2 at a time, in reverse order */
>>> +			raw_desc6_7 = _mm256_load_si256((void *)(rxdp +
>> 6));
>>> +			rte_compiler_barrier();
>>> +			raw_desc4_5 = _mm256_load_si256((void *)(rxdp +
>> 4));
>>> +			rte_compiler_barrier();
>>> +			raw_desc2_3 = _mm256_load_si256((void *)(rxdp +
>> 2));
>>> +			rte_compiler_barrier();
>>> +			raw_desc0_1 = _mm256_load_si256((void *)(rxdp +
>> 0));
>>> +		} else
>>> +#endif
>>> +		do {
>>> +			const __m128i raw_desc7 =
>>> +				_mm_load_si128((void *)(rxdp + 7));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc6 =
>>> +				_mm_load_si128((void *)(rxdp + 6));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc5 =
>>> +				_mm_load_si128((void *)(rxdp + 5));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc4 =
>>> +				_mm_load_si128((void *)(rxdp + 4));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc3 =
>>> +				_mm_load_si128((void *)(rxdp + 3));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc2 =
>>> +				_mm_load_si128((void *)(rxdp + 2));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc1 =
>>> +				_mm_load_si128((void *)(rxdp + 1));
>>> +			rte_compiler_barrier();
>>> +			const __m128i raw_desc0 =
>>> +				_mm_load_si128((void *)(rxdp + 0));
>>> +
>>> +			raw_desc6_7 =
>>> +				_mm256_inserti128_si256
>>> +
>> 	(_mm256_castsi128_si256(raw_desc6),
>>> +					 raw_desc7, 1);
>>> +			raw_desc4_5 =
>>> +				_mm256_inserti128_si256
>>> +
>> 	(_mm256_castsi128_si256(raw_desc4),
>>> +					 raw_desc5, 1);
>>> +			raw_desc2_3 =
>>> +				_mm256_inserti128_si256
>>> +
>> 	(_mm256_castsi128_si256(raw_desc2),
>>> +					 raw_desc3, 1);
>>> +			raw_desc0_1 =
>>> +				_mm256_inserti128_si256
>>> +
>> 	(_mm256_castsi128_si256(raw_desc0),
>>> +					 raw_desc1, 1);
>>> +		} while (0);
>>
>> Is this to provide the proper indention because of the above #ifdef block? If
>> so why not simple { } for the scope, is do{ }while(0) has benefit against it?
> Yes, it's for the indention. To my opinion, "do while" looks friendly to the readers as we always use it in the macros. Only "{}" looks missing a function name or a "for ()" :)
> 

I find '{ }' clearer, but I have no strong opinion; perhaps a comment to
clarify the intention would be good, but that also looks optional, so up to
you.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 1/8] net/ice: fix Tx function setting
  2019-03-20 17:22         ` Ferruh Yigit
@ 2019-03-21  2:29           ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-21  2:29 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,


> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Thursday, March 21, 2019 1:22 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 1/8] net/ice: fix Tx function setting
> 
> On 3/18/2019 1:08 AM, Lu, Wenzhuo wrote:
> > Hi Ferruh,
> >
> >> -----Original Message-----
> >> From: Yigit, Ferruh
> >> Sent: Saturday, March 16, 2019 1:52 AM
> >> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 1/8] net/ice: fix Tx function
> >> setting
> >>
> >> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> >>> The TX setting functions is not called.
> >>>
> >>> Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
> >>
> >> Do we need stable@dpdk.org tag for this?
> > This patch fixed a bug which is introduced in 19.02. That's why I think
> "stable" is not needed. Please let me know if I'm wrong. Thanks.
> >
> 
> 'stable' tag is not required if the commit you are fixing is not released yet, it
> means it is in the current release, 19.05.
> Otherwise 'stable' tag is required.
Thanks. I'll add it.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 6/8] net/ice: support Rx AVX2 vector
  2019-03-20 17:37         ` Ferruh Yigit
@ 2019-03-21  2:31           ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-21  2:31 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,


> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Thursday, March 21, 2019 1:37 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 6/8] net/ice: support Rx AVX2 vector
> 
> On 3/18/2019 1:37 AM, Lu, Wenzhuo wrote:
> > Hi Ferruh,
> >
> >
> >> -----Original Message-----
> >> From: Yigit, Ferruh
> >> Sent: Saturday, March 16, 2019 1:54 AM
> >> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 6/8] net/ice: support Rx AVX2
> >> vector
> >>
> >> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> >>> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> >> <...>
> >>
> >>> +#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> >>> +		/* for AVX we need alignment otherwise loads are not
> >> atomic */
> >>> +		if (avx_aligned) {
> >>> +			/* load in descriptors, 2 at a time, in reverse order */
> >>> +			raw_desc6_7 = _mm256_load_si256((void *)(rxdp +
> >> 6));
> >>> +			rte_compiler_barrier();
> >>> +			raw_desc4_5 = _mm256_load_si256((void *)(rxdp +
> >> 4));
> >>> +			rte_compiler_barrier();
> >>> +			raw_desc2_3 = _mm256_load_si256((void *)(rxdp +
> >> 2));
> >>> +			rte_compiler_barrier();
> >>> +			raw_desc0_1 = _mm256_load_si256((void *)(rxdp +
> >> 0));
> >>> +		} else
> >>> +#endif
> >>> +		do {
> >>> +			const __m128i raw_desc7 =
> >>> +				_mm_load_si128((void *)(rxdp + 7));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc6 =
> >>> +				_mm_load_si128((void *)(rxdp + 6));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc5 =
> >>> +				_mm_load_si128((void *)(rxdp + 5));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc4 =
> >>> +				_mm_load_si128((void *)(rxdp + 4));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc3 =
> >>> +				_mm_load_si128((void *)(rxdp + 3));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc2 =
> >>> +				_mm_load_si128((void *)(rxdp + 2));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc1 =
> >>> +				_mm_load_si128((void *)(rxdp + 1));
> >>> +			rte_compiler_barrier();
> >>> +			const __m128i raw_desc0 =
> >>> +				_mm_load_si128((void *)(rxdp + 0));
> >>> +
> >>> +			raw_desc6_7 =
> >>> +				_mm256_inserti128_si256
> >>> +
> >> 	(_mm256_castsi128_si256(raw_desc6),
> >>> +					 raw_desc7, 1);
> >>> +			raw_desc4_5 =
> >>> +				_mm256_inserti128_si256
> >>> +
> >> 	(_mm256_castsi128_si256(raw_desc4),
> >>> +					 raw_desc5, 1);
> >>> +			raw_desc2_3 =
> >>> +				_mm256_inserti128_si256
> >>> +
> >> 	(_mm256_castsi128_si256(raw_desc2),
> >>> +					 raw_desc3, 1);
> >>> +			raw_desc0_1 =
> >>> +				_mm256_inserti128_si256
> >>> +
> >> 	(_mm256_castsi128_si256(raw_desc0),
> >>> +					 raw_desc1, 1);
> >>> +		} while (0);
> >>
> >> Is this to provide the proper indention because of the above #ifdef
> >> block? If so why not simple { } for the scope, is do{ }while(0) has benefit
> against it?
> > Yes, it's for the indention. To my opinion, "do while" looks friendly
> > to the readers as we always use it in the macros. Only "{}" looks
> > missing a function name or a "for ()" :)
> >
> 
> I found '{ }' more clear but no strong opinion, perhaps a comment to clarify
> to intention can be good but that also looks like optional, so up to you.
OK. I'll use '{}'. It also looks good to me.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3 3/8] net/ice: support vector SSE in RX
  2019-03-20 17:35         ` Ferruh Yigit
@ 2019-03-21  2:48           ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-21  2:48 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,


> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Thursday, March 21, 2019 1:35 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 3/8] net/ice: support vector SSE in RX
> 
> On 3/18/2019 1:22 AM, Lu, Wenzhuo wrote:
> > Hi Ferruh,
> >
> >> -----Original Message-----
> >> From: Yigit, Ferruh
> >> Sent: Saturday, March 16, 2019 1:54 AM
> >> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 3/8] net/ice: support vector SSE in
> >> RX
> >>
> >> On 3/15/2019 6:22 AM, Wenzhuo Lu wrote:
> >>> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> >>
> >> <...>
> >>
> >>> @@ -305,6 +305,7 @@ CONFIG_RTE_LIBRTE_ICE_DEBUG_TX=n
> >>> CONFIG_RTE_LIBRTE_ICE_DEBUG_TX_FREE=n
> >>>  CONFIG_RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC=y
> >>>  CONFIG_RTE_LIBRTE_ICE_16BYTE_RX_DESC=n
> >>> +CONFIG_RTE_LIBRTE_ICE_INC_VECTOR=y
> >>
> >> Meson seems setting this config automatically. Do we need this
> >> compile time option at all?
> > It's not for meson build. It's for the traditional build. To my opinion, meson
> build even doesn't use the configure file. It has its own configuration inside.
> 
> I am not saying this is for meson.
> 
> This is a compile time config option, to let user enable/disable VECTOR path
> of the PMD.
> But meson build doesn't get this input from the user and enables it by
> default.
> If enabling it by default works and an acceptable solution, why we are not
> doing same thing for makefile.
> 
> Why not just remove the config option completely and update code as it is
> enabled?
> 
> In this default enabled case, if there is a need to let user disable the vector
> path, which may be a need, why not add a device argument to disable the
> vector path on runtime, I believe this can be done easily by described below.
> 
> What is the benefit of a compile time flag against runtime devargs?
> Why someone would want to remove the all vector path from the binary,
> just to gain a few kilobytes from the final binary?
> But other way around, having the vector path in binary but disable it
> dynamically when needed has advantage of easily enable it back without
> need to recompile when the platform has vector path support.
> 
> >
> >> Would it work if we replace this with a device arg, which can be used
> >> to disable vector path if set, and 'ice_rx_vec_dev_check()' can check it?
> > We've implemented the dynamic selection of vector and normal path.
> Here is the compile setting. In case the user wants to remove the vector code
> thoroughly, so there may be a little performance benefit.
I think you're right. The vector and normal paths can be selected automatically, and we do recommend using the vector path if it satisfies the user's requirements. I'll remove this configuration.

> >
> >>
> >> <...>
> >>
> >>> @@ -0,0 +1,155 @@
> >>> +/* SPDX-License-Identifier: BSD-3-Clause
> >>> + * Copyright(c) 2019 Intel Corporation  */
> >>> +
> >>> +#ifndef _ICE_RXTX_VEC_COMMON_H_
> >>> +#define _ICE_RXTX_VEC_COMMON_H_
> >>> +
> >>> +#include "ice_rxtx.h"
> >>> +
> >>> +static inline uint16_t
> >>> +reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf
> **rx_bufs,
> >>> +		   uint16_t nb_bufs, uint8_t *split_flags) {
> >>> +	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
> >>> +	struct rte_mbuf *start = rxq->pkt_first_seg;
> >>> +	struct rte_mbuf *end =  rxq->pkt_last_seg;
> >>> +	unsigned pkt_idx, buf_idx;
> >> There are checkpatch warnings for using 'unsigned int' instead of
> >> 'unsigned', can you please fix them? There are a few of them.
> > Sure, will fix them.
> >


^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v4 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (10 preceding siblings ...)
  2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-03-21  6:26 ` Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 1/8] net/ice: fix Tx function setting Wenzhuo Lu
                     ` (7 more replies)
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (2 subsequent siblings)
  14 siblings, 8 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in ICE RX and TX path.

---
v2:
 - Updated feature doc.
 - Fixed checklog and checkpatch issues.

v3:
 - Fixed potential compile issue on non-X86 platform.

v4:
 - Removed compile configure, CONFIG_RTE_LIBRTE_ICE_INC_VECTOR.
 - Fixed checkpatch warnings.
 - Added more explanation of vector path in the device document.
 - Some other minor change.

Wenzhuo Lu (8):
  net/ice: fix Tx function setting
  net/ice: add pointer for queue buffer release
  net/ice: support vector SSE in RX
  net/ice: support Rx scatter SSE vector
  net/ice: support Tx SSE vector
  net/ice: support Rx AVX2 vector
  net/ice: support Rx scatter AVX2 vector
  net/ice: support vector AVX2 in TX

 doc/guides/nics/features/ice_vec.ini   |  35 ++
 doc/guides/nics/ice.rst                |  17 +
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             |  99 +++-
 drivers/net/ice/ice_rxtx.h             |  37 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 844 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 288 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 672 ++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  20 +
 12 files changed, 2030 insertions(+), 13 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v4 1/8] net/ice: fix Tx function setting
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-22  8:46     ` Maxime Coquelin
  2019-03-21  6:26   ` [PATCH v4 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, stable

The TX setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Cc: stable@dpdk.org

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index a23c63a..b804be1 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v4 2/8] net/ice: add pointer for queue buffer release
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-22  8:59     ` Maxime Coquelin
  2019-03-21  6:26   ` [PATCH v4 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of RX and
TX queues, because vector functions will be added for RX
and TX.

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..d540ed1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -366,7 +366,7 @@
 		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
 			    rx_queue_id);
 
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		return -EINVAL;
 	}
@@ -393,7 +393,7 @@
 				    rx_queue_id);
 			return -EINVAL;
 		}
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		dev->data->rx_queue_state[rx_queue_id] =
 			RTE_ETH_QUEUE_STATE_STOPPED;
@@ -555,7 +555,7 @@
 		return -EINVAL;
 	}
 
-	ice_tx_queue_release_mbufs(txq);
+	txq->tx_rel_mbufs(txq);
 	ice_reset_tx_queue(txq);
 	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
 
@@ -669,6 +669,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -701,7 +702,7 @@
 		return;
 	}
 
-	ice_rx_queue_release_mbufs(q);
+	q->rx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -866,6 +867,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
 
 	return 0;
 }
@@ -880,7 +882,7 @@
 		return;
 	}
 
-	ice_tx_queue_release_mbufs(q);
+	q->tx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -1552,18 +1554,22 @@
 void
 ice_clear_queues(struct rte_eth_dev *dev)
 {
+	struct ice_rx_queue *rxq;
+	struct ice_tx_queue *txq;
 	uint16_t i;
 
 	PMD_INIT_FUNC_TRACE();
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
-		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
-		ice_reset_tx_queue(dev->data->tx_queues[i]);
+		txq = dev->data->tx_queues[i];
+		txq->tx_rel_mbufs(txq);
+		ice_reset_tx_queue(txq);
 	}
 
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
-		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
-		ice_reset_rx_queue(dev->data->rx_queues[i]);
+		rxq = dev->data->rx_queues[i];
+		rxq->rx_rel_mbufs(rxq);
+		ice_reset_rx_queue(rxq);
 	}
 }
 
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..78b4928 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs_t rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs_t tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
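
To see where the new rx_rel_mbufs/tx_rel_mbufs hooks pay off: the vector Rx
setup added in the next patch overwrites the per-queue pointer with its own
release routine. A sketch along the lines of ice_rxq_vec_setup() in
ice_rxtx_vec_sse.c (names follow the series; details may differ slightly):

#include "ice_rxtx_vec_common.h"

int
ice_rxq_vec_setup(struct ice_rx_queue *rxq)
{
	if (!rxq)
		return -1;

	/* the vector path tracks rxrearm_start/rxrearm_nb, so it needs
	 * its own routine to free the mbufs still held in the ring */
	rxq->rx_rel_mbufs = _ice_rx_queue_release_mbufs_vec;
	return ice_rxq_vec_setup_default(rxq);
}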

* [PATCH v4 3/8] net/ice: support vector SSE in RX
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 1/8] net/ice: fix Tx function setting Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-21 19:02     ` Ferruh Yigit
  2019-03-21  6:26   ` [PATCH v4 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |  33 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  27 +-
 drivers/net/ice/ice_rxtx.h            |  19 ++
 drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 496 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   5 +
 9 files changed, 738 insertions(+), 4 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1a19788
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,33 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..92594bb 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index b804be1..8e7c7db 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d540ed1..ebb1cab 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -319,6 +317,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1490,6 +1491,12 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_ARCH_X86
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+
 	return NULL;
 }
 
@@ -2225,6 +2232,22 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 78b4928..bfafcf0 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
 
@@ -52,6 +61,11 @@ struct ice_rx_queue {
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
 #endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +170,9 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..cfef91b
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned int pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned int mask = rxq->nb_rx_desc - 1;
+	unsigned int i;
+
+	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..f6fe9ef
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,496 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+		  struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags:
+	 * bit 2 is the VLAN tag, bit 11 flow director indication,
+	 * bits 13:12 RSS indication, bits 24:22 IP/L4 checksum errors.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(0x1c03804, 0x1c03804,
+						   0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
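
The l3_l4e handling above relies on a byte-lookup trick: _mm_shuffle_epi8 can only return 8-bit table entries, so flag combinations that would exceed 255 are stored pre-shifted right by one bit and restored with the left shift after the shuffle. A standalone sketch of the idea with placeholder flag values (the real PKT_RX_* bit positions differ):

#include <stdio.h>

int main(void)
{
	/* placeholder bits standing in for two PKT_RX_*_CKSUM_GOOD flags */
	const unsigned int flag_a = 1u << 7;
	const unsigned int flag_b = 1u << 8;
	unsigned int combined = flag_a | flag_b;       /* 0x180, too big for a byte */

	unsigned char lut_entry = combined >> 1;       /* 0xc0, fits in the byte LUT */
	unsigned int restored = (unsigned int)lut_entry << 1;

	printf("combined=0x%x lut=0x%x restored=0x%x\n",
	       combined, lut_entry, restored);         /* restored == combined */
	return 0;
}
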
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are received
+ * - if nb_pkts > ICE_VPMD_RX_BURST, only ICE_VPMD_RX_BURST
+ *   descriptors' DD bits are scanned
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16
+				(0, 0, 0,    /* ignore non-length fields */
+				 -rxq->crc_len, /* sub crc on data_len */
+				 0,          /* ignore high-16bits of pkt_len */
+				 -rxq->crc_len, /* sub crc on pkt_len */
+				 0, 0            /* ignore pkt_type field */
+				);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts has to be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8
+			(7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF   /* pkt_type set as unknown */
+			);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf pointers from sw_ring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf pointers */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf pointers into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf pointers */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* A.1 load desc[1] */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf pointers into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned */
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned */
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0x04, 0x0C,
+							     0x00, 0x08);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are received
+ * - if nb_pkts > ICE_VPMD_RX_BURST, only ICE_VPMD_RX_BURST
+ *   descriptors' DD bits are scanned
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static void __attribute__((cold))
+ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	_ice_rx_queue_release_mbufs_vec(rxq);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
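
Steps C.3/C.4 in the receive loop above count completed descriptors by masking the DD bits, packing the 32-bit status words down to 16 bits and popcounting the low 64 bits, so the loop can stop as soon as a burst comes back short. A standalone sketch of that accounting with made-up status values (x86-64, SSE2 only):

#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
	/* pretend descriptors 0-2 have their DD bit set, descriptor 3 does not */
	__m128i staterr = _mm_set_epi32(0, 1, 1, 1);
	const __m128i dd_check = _mm_set_epi64x(0x0000000100000001LL,
						0x0000000100000001LL);

	staterr = _mm_and_si128(staterr, dd_check);
	staterr = _mm_packs_epi32(staterr, _mm_setzero_si128());

	/* one bit per completed descriptor ends up in the low 64 bits */
	int done = __builtin_popcountll(_mm_cvtsi128_si64(staterr));

	printf("%d descriptors done\n", done);	/* prints 3 */
	return 0;
}
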
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..94c780b 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,8 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v4 4/8] net/ice: support Rx scatter SSE vector
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
                     ` (2 preceding siblings ...)
  2019-03-21  6:26   ` [PATCH v4 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 5/8] net/ice: support Tx " Wenzhuo Lu
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ebb1cab..5409dd0 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1493,7 +1493,8 @@
 		return ptypes;
 
 #ifdef RTE_ARCH_X86
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 
@@ -2241,9 +2242,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index bfafcf0..8917b51 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -175,4 +175,6 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index f6fe9ef..e1f057a 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -473,6 +473,47 @@
 	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are received
+ * - if nb_pkts > ICE_VPMD_RX_BURST, only ICE_VPMD_RX_BURST
+ *   descriptors' DD bits are scanned
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+					      split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+				      &split_flags[i]);
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v4 5/8] net/ice: support Tx SSE vector
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
                     ` (3 preceding siblings ...)
  2019-03-21  6:26   ` [PATCH v4 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  17 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
 5 files changed, 291 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1a19788..173c8f2 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -22,6 +23,7 @@ RSS reta update      = Y
 VLAN filter          = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5409dd0..f9ecffa 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2332,6 +2332,23 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 8917b51..b0c755c 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -172,9 +172,13 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
 
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index cfef91b..be079a3 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct ice_tx_entry *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -101,6 +168,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (!txq || !txq->sw_ring) {
+		PMD_DRV_LOG(DEBUG, "Pointer to txq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so the remaining mbufs need to be freed more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -137,6 +232,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -152,4 +270,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
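
Mirroring the Rx checks, ice_tx_vec_queue_default() above only takes queues with no offloads from ICE_NO_VECTOR_FLAGS and a tx_rs_thresh between ICE_VPMD_TX_BURST and ICE_TX_MAX_FREE_BUF_SZ. A hypothetical setup that would pass ice_tx_vec_dev_check_default() (the values 32 and 1024 are illustrative assumptions; the real bounds live in ice_rxtx.h):

#include <rte_ethdev.h>

static int
setup_vec_friendly_txq(uint16_t port, uint16_t queue)
{
	struct rte_eth_txconf txconf = {
		.tx_rs_thresh = 32,	/* assumed to sit inside the allowed range */
		.tx_free_thresh = 32,
		.offloads = 0,		/* none of ICE_NO_VECTOR_FLAGS */
	};

	return rte_eth_tx_queue_setup(port, queue, 1024,
				      rte_eth_dev_socket_id(port), &txconf);
}
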
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index e1f057a..4e148d6 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -514,12 +514,131 @@
 				      &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing the tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
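
Since ice_xmit_pkts_vec() above can return fewer packets than requested once the ring fills up, callers typically loop until the burst is drained or the driver stops accepting. A generic caller-side sketch using the standard ethdev API (not part of this patch; port and queue are placeholders):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
send_all_or_drop(uint16_t port, uint16_t queue,
		 struct rte_mbuf **pkts, uint16_t n)
{
	uint16_t sent = 0;

	while (sent < n) {
		uint16_t ret = rte_eth_tx_burst(port, queue,
						&pkts[sent], n - sent);
		if (ret == 0)
			break;			/* ring full, give up for now */
		sent += ret;
	}

	while (sent < n)			/* free whatever was not queued */
		rte_pktmbuf_free(pkts[sent++]);
}
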
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
 	_ice_rx_queue_release_mbufs_vec(rxq);
 }
 
+static void __attribute__((cold))
+ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	_ice_tx_queue_release_mbufs_vec(txq);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -531,7 +650,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v4 6/8] net/ice: support Rx AVX2 vector
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
                     ` (4 preceding siblings ...)
  2019-03-21  6:26   ` [PATCH v4 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 7/8] net/ice: support Rx scatter " Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  16 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 622 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 671 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 92594bb..5ba59f4 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index f9ecffa..6191f34 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1494,7 +1494,8 @@
 
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2236,21 +2237,30 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index b0c755c..fc6b72e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -181,4 +181,6 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..763fa9f
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,622 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/**
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
+						vaddr1, 1);
+		vaddr2_3 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
+						vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+			uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust =
+		_mm256_set_epi16
+			(/* first descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0,          /* ignore pkt_type field */
+			 /* second descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0           /* ignore pkt_type field */
+			);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk =
+		_mm256_set_epi8
+			(/* first descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 /* second descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF   /* pkt_type set as unknown */
+			);
+	/**
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/**
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask =
+		 _mm256_set1_epi32((1 << 2) | (1 << 11) |
+				   (3 << 12) | (7 << 22));
+	/**
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf =
+		_mm256_set_epi32(0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+				 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/**
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf =
+		_mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0,/* end up 128-bits */
+				0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0);
+
+	/**
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits. This gives us the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask =
+		 _mm256_set1_epi32(PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+				   PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+				   PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+
+	for (i = 0, received = 0; i < nb_pkts;
+	     i += ICE_DESCS_PER_LOOP_AVX,
+	     rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				    _mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256
+			((void *)&rx_pkts[i + 4],
+			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		{
+			const __m128i raw_desc7 =
+				_mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 =
+				_mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 =
+				_mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 =
+				_mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 =
+				_mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 =
+				_mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 =
+				_mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 =
+				_mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc6),
+					 raw_desc7, 1);
+			raw_desc4_5 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc4),
+					 raw_desc5, 1);
+			raw_desc2_3 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc2),
+					 raw_desc3, 1);
+			raw_desc0_1 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc0),
+					 raw_desc1, 1);
+		}
+
+		if (split_packet) {
+			int j;
+
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/**
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7,
+							 PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5,
+							 PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7,
+							   len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5,
+							   len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/**
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * and so ptype is in lower 8-bits in each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/**
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3,
+							 PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1,
+							 PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3,
+							   len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1,
+							   len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+								desc0_1);
+
+		/**
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+							  status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits =
+			_mm256_and_si256(status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags =
+			_mm256_shuffle_epi8(vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags =
+			_mm256_shuffle_epi8(rss_flags_shuf,
+					    _mm256_srli_epi32(flag_bits, 11));
+		/**
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/**
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init
+		 * data so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend
+		 * for each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				 offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				 RTE_ALIGN(offsetof(struct rte_mbuf,
+						    rearm_data),
+					   16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+			rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 8),
+					    0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 4),
+					    0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(mbuf_flags, 4),
+					    0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
+				    rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
+				    rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
+				    rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
+				    rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags =
+			_mm256_castsi128_si256
+				(_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 8),
+					    0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 4),
+					    0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(odd_flags, 4),
+					    0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
+				    rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
+				    rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
+				    rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
+				    rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask =
+				_mm_set1_epi16(1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+								     eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits =
+				_mm_packus_epi32
+					(_mm256_castsi256_si128(eop_bits256),
+					 _mm256_extractf128_si256(eop_bits256,
+								  1));
+			/**
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit (i.e. !EOP) rather than an EOP bit.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/**
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle =
+				_mm_set_epi8(/* zero hi 64b */
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     /* move values to lo 64b */
+					     8, 0, 10, 2,
+					     12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet =
+				_mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+					       _mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll
+					(_mm_cvtsi128_si64
+						(_mm256_extracti128_si256
+							(status0_7, 1)));
+		burst += __builtin_popcountll
+				(_mm_cvtsi128_si64
+					(_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 94c780b..c9b87af 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -15,4 +15,19 @@ includes += include_directories('base')
 if arch_subdir == 'x86'
 	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. we have AVX supported in minimum instruction set baseline
+	# b. it's not minimum instruction set, but supported by compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
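
The meson logic above only decides whether the AVX2 object file is built at all (either because AVX2 is already in the baseline machine flags, or because the compiler accepts -mavx2); which Rx burst function actually runs is still chosen at device start from the CPU flags. A minimal sketch of that two-level dispatch, using placeholder burst functions rather than the real ice symbols:

#include <stdint.h>

#include <rte_cpuflags.h>

typedef uint16_t (*rx_burst_fn)(void *rxq, void **pkts, uint16_t n);

/* placeholders standing in for the SSE and AVX2 burst implementations */
static uint16_t
rx_burst_sse(void *rxq, void **pkts, uint16_t n)
{
	(void)rxq; (void)pkts; (void)n;
	return 0;
}

static uint16_t
rx_burst_avx2(void *rxq, void **pkts, uint16_t n)
{
	(void)rxq; (void)pkts; (void)n;
	return 0;
}

/* choose the widest path the running CPU supports; the AVX2 object itself
 * is only compiled when the toolchain can emit AVX2 (the meson logic above).
 */
static rx_burst_fn
select_rx_burst(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1)
		return rx_burst_avx2;
	return rx_burst_sse;
}

This split keeps a single binary usable on older CPUs: the AVX2 code is compiled with -mavx2 into its own object, but it is never called unless the runtime CPU-flag check succeeds.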

* [PATCH v4 7/8] net/ice: support Rx scatter AVX2 vector
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
                     ` (5 preceding siblings ...)
  2019-03-21  6:26   ` [PATCH v4 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-21  6:26   ` [PATCH v4 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 6191f34..34b8386 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1495,7 +1495,8 @@
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2251,9 +2252,12 @@ void __attribute__((cold))
 
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index fc6b72e..ddc7a3d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -183,4 +183,7 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 763fa9f..7bea3a9 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -620,3 +620,67 @@
 {
 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles a single burst of 32 scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
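
In ice_recv_scattered_burst_vec_avx2() above, the "happy day" test reads the 32-byte split_flags array as four 64-bit words, so deciding that no packet in the burst needs reassembly costs four loads instead of 32 byte compares. A standalone sketch of that idiom (the buffer size mirrors ICE_VPMD_RX_BURST; the function name is illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BURST 32 /* mirrors ICE_VPMD_RX_BURST */

/* return true if any packet in the burst carries a split (non-EOP) flag */
static bool
any_split(const uint8_t split_flags[BURST])
{
	uint64_t w[BURST / 8];

	/* memcpy is an aliasing-safe stand-in for the driver's pointer cast */
	memcpy(w, split_flags, sizeof(w));
	return (w[0] | w[1] | w[2] | w[3]) != 0;
}

The driver itself uses a plain (uint64_t *) cast over the array; the memcpy variant here is simply the portable way to express the same check.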

* [PATCH v4 8/8] net/ice: support vector AVX2 in TX
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
                     ` (6 preceding siblings ...)
  2019-03-21  6:26   ` [PATCH v4 7/8] net/ice: support Rx scatter " Wenzhuo Lu
@ 2019-03-21  6:26   ` Wenzhuo Lu
  2019-03-21 19:20     ` Ferruh Yigit
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-21  6:26 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/ice.rst                |  17 ++++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  13 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
 5 files changed, 192 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/ice.rst b/doc/guides/nics/ice.rst
index 3998d5e..0725669 100644
--- a/doc/guides/nics/ice.rst
+++ b/doc/guides/nics/ice.rst
@@ -64,6 +64,23 @@ Driver compilation and testing
 Refer to the document :ref:`compiling and testing a PMD for a NIC <pmd_build_and_test>`
 for details.
 
+Features
+--------
+
+Vector PMD
+~~~~~~~~~~
+
+Vector PMD for RX and TX path are selected automatically. The paths
+are chosen based on 2 conditions.
+ - CPU
+   On the X86 platform, the driver checks if the CPU supports AVX2.
+   If it's supported, AVX2 paths will be chosen. If not, SSE is chosen.
+
+ - Offload features
+   The supported HW offload features are described in the document ice_vec.ini.
+   If any not supported features are used, ICE vector PMD is disabled and the
+   normal paths are chosen.
+
 Sample Application Notes
 ------------------------
 
diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index 61a2c73..610c4cd 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -91,6 +91,10 @@ New Features
 
   * Added promiscuous mode support.
 
+* **Added support of vector instructions on ICE.**
+
+   Added support of SSE and AVX2 instructions in ICE RX and TX path.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 34b8386..4a09457 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2349,15 +2349,24 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ddc7a3d..f69cd80 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -186,4 +186,6 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 7bea3a9..730b882 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -684,3 +684,161 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on a 32-byte boundary, do one to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 =
+			_mm256_set_epi64x
+				(hi_qw3,
+				 pkt[3]->buf_physaddr + pkt[3]->data_off,
+				 hi_qw2,
+				 pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 =
+			_mm256_set_epi64x
+				(hi_qw1,
+				 pkt[1]->buf_physaddr + pkt[1]->data_off,
+				 hi_qw0,
+				 pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* crossing the tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		       uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
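
Because ice_xmit_pkts_vec_avx2() commits at most tx_rs_thresh descriptors per inner burst and stops early once the ring has no free slots, a caller can see rte_eth_tx_burst() return fewer packets than it passed in. A typical caller-side pattern, sketched under the assumption that dropping after a few retries is acceptable for the application:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* push a burst out, retrying the unsent tail a bounded number of times */
static void
send_burst(uint16_t port_id, uint16_t queue_id,
	   struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t sent = 0;
	int retries = 3; /* illustrative bound, not a driver constant */

	while (sent < nb_pkts && retries-- > 0)
		sent += rte_eth_tx_burst(port_id, queue_id,
					 pkts + sent, nb_pkts - sent);

	/* free whatever the ring could not absorb */
	while (sent < nb_pkts)
		rte_pktmbuf_free(pkts[sent++]);
}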

* Re: [PATCH v4 3/8] net/ice: support vector SSE in RX
  2019-03-21  6:26   ` [PATCH v4 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-21 19:02     ` Ferruh Yigit
  2019-03-22  1:46       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-21 19:02 UTC (permalink / raw)
  To: Wenzhuo Lu, dev

On 3/21/2019 6:26 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>

<...>

> @@ -11,3 +11,8 @@ sources = files(
>  
>  deps += ['hash']
>  includes += include_directories('base')
> +
> +if arch_subdir == 'x86'
> +	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)

Setting this config option seems unrelated with the patch, since it is already
used in the existing code, I guess it should be added when BULK_ALLOC define added.
Also is it x86 specific config option?

> +	sources += files('ice_rxtx_vec_sse.c')
> +endif
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v4 8/8] net/ice: support vector AVX2 in TX
  2019-03-21  6:26   ` [PATCH v4 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
@ 2019-03-21 19:20     ` Ferruh Yigit
  2019-03-22  1:45       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-21 19:20 UTC (permalink / raw)
  To: Wenzhuo Lu, dev

On 3/21/2019 6:26 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>  doc/guides/nics/ice.rst                |  17 ++++
>  doc/guides/rel_notes/release_19_05.rst |   4 +
>  drivers/net/ice/ice_rxtx.c             |  13 ++-
>  drivers/net/ice/ice_rxtx.h             |   2 +
>  drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
>  5 files changed, 192 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/guides/nics/ice.rst b/doc/guides/nics/ice.rst
> index 3998d5e..0725669 100644
> --- a/doc/guides/nics/ice.rst
> +++ b/doc/guides/nics/ice.rst
> @@ -64,6 +64,23 @@ Driver compilation and testing
>  Refer to the document :ref:`compiling and testing a PMD for a NIC <pmd_build_and_test>`
>  for details.
>  
> +Features
> +--------
> +
> +Vector PMD
> +~~~~~~~~~~
> +
> +Vector PMD for RX and TX path are selected automatically. The paths
> +are chosen based on 2 conditions.
> + - CPU
> +   On the X86 platform, the driver checks if the CPU supports AVX2.
> +   If it's supported, AVX2 paths will be chosen. If not, SSE is chosen.
> +
> + - Offload features
> +   The supported HW offload features are described in the document ice_vec.ini.
> +   If any not supported features are used, ICE vector PMD is disabled and the
> +   normal paths are chosen.
> +

doc build is complaining about the syntax:
doc/guides/nics/ice.rst:75: WARNING: Unexpected indentation.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v4 8/8] net/ice: support vector AVX2 in TX
  2019-03-21 19:20     ` Ferruh Yigit
@ 2019-03-22  1:45       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-22  1:45 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,

> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Friday, March 22, 2019 3:20 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4 8/8] net/ice: support vector AVX2 in TX
> 
> On 3/21/2019 6:26 AM, Wenzhuo Lu wrote:
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> > ---
> >  doc/guides/nics/ice.rst                |  17 ++++
> >  doc/guides/rel_notes/release_19_05.rst |   4 +
> >  drivers/net/ice/ice_rxtx.c             |  13 ++-
> >  drivers/net/ice/ice_rxtx.h             |   2 +
> >  drivers/net/ice/ice_rxtx_vec_avx2.c    | 158
> +++++++++++++++++++++++++++++++++
> >  5 files changed, 192 insertions(+), 2 deletions(-)
> >
> > diff --git a/doc/guides/nics/ice.rst b/doc/guides/nics/ice.rst index
> > 3998d5e..0725669 100644
> > --- a/doc/guides/nics/ice.rst
> > +++ b/doc/guides/nics/ice.rst
> > @@ -64,6 +64,23 @@ Driver compilation and testing  Refer to the
> > document :ref:`compiling and testing a PMD for a NIC
> > <pmd_build_and_test>`  for details.
> >
> > +Features
> > +--------
> > +
> > +Vector PMD
> > +~~~~~~~~~~
> > +
> > +Vector PMD for RX and TX path are selected automatically. The paths
> > +are chosen based on 2 conditions.
> > + - CPU
> > +   On the X86 platform, the driver checks if the CPU supports AVX2.
> > +   If it's supported, AVX2 paths will be chosen. If not, SSE is chosen.
> > +
> > + - Offload features
> > +   The supported HW offload features are described in the document
> ice_vec.ini.
> > +   If any not supported features are used, ICE vector PMD is disabled and
> the
> > +   normal paths are chosen.
> > +
> 
> doc build is complaining about the syntax:
> doc/guides/nics/ice.rst:75: WARNING: Unexpected indentation.
Will remove the indentation.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v4 3/8] net/ice: support vector SSE in RX
  2019-03-21 19:02     ` Ferruh Yigit
@ 2019-03-22  1:46       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-22  1:46 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

Hi Ferruh,


> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Friday, March 22, 2019 3:02 AM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4 3/8] net/ice: support vector SSE in RX
> 
> On 3/21/2019 6:26 AM, Wenzhuo Lu wrote:
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> 
> <...>
> 
> > @@ -11,3 +11,8 @@ sources = files(
> >
> >  deps += ['hash']
> >  includes += include_directories('base')
> > +
> > +if arch_subdir == 'x86'
> > +	dpdk_conf.set('RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC', 1)
> 
> Setting this config option seems unrelated with the patch, since it is already
> used in the existing code, I guess it should be added when BULK_ALLOC
> define added.
> Also is it x86 specific config option?
It's not related to vector or x86. Will correct it.

> 
> > +	sources += files('ice_rxtx_vec_sse.c') endif
> >


^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v5 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (11 preceding siblings ...)
  2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
@ 2019-03-22  2:58 ` Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 1/8] net/ice: fix Tx function setting Wenzhuo Lu
                     ` (7 more replies)
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
  14 siblings, 8 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in ICE RX and TX path.

---
v2:
 - Updated feature doc.
 - Fixed checklog and checkpatch issues.

v3:
 - Fixed potential compile issue on non-X86 platform.

v4:
 - Removed the compile-time config option CONFIG_RTE_LIBRTE_ICE_INC_VECTOR.
 - Fixed checkpatch warnings.
 - Added more explanation of the vector path in the device document.
 - Some other minor changes.

v5:
 - Fixed a compile issue.
 - Fixed a doc build warning.

Wenzhuo Lu (8):
  net/ice: fix Tx function setting
  net/ice: add pointer for queue buffer release
  net/ice: support vector SSE in RX
  net/ice: support Rx scatter SSE vector
  net/ice: support Tx SSE vector
  net/ice: support Rx AVX2 vector
  net/ice: support Rx scatter AVX2 vector
  net/ice: support vector AVX2 in TX

 doc/guides/nics/features/ice_vec.ini   |  35 ++
 doc/guides/nics/ice.rst                |  18 +
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             |  99 +++-
 drivers/net/ice/ice_rxtx.h             |  39 +-
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 844 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 288 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 672 ++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  19 +
 12 files changed, 2030 insertions(+), 15 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v5 1/8] net/ice: fix Tx function setting
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, stable

The Tx setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Cc: stable@dpdk.org

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index a23c63a..b804be1 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v5 2/8] net/ice: add pointer for queue buffer release
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of RX and
TX queues, because vector functions will be added for RX
and TX.

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..d540ed1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -366,7 +366,7 @@
 		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
 			    rx_queue_id);
 
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		return -EINVAL;
 	}
@@ -393,7 +393,7 @@
 				    rx_queue_id);
 			return -EINVAL;
 		}
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		dev->data->rx_queue_state[rx_queue_id] =
 			RTE_ETH_QUEUE_STATE_STOPPED;
@@ -555,7 +555,7 @@
 		return -EINVAL;
 	}
 
-	ice_tx_queue_release_mbufs(txq);
+	txq->tx_rel_mbufs(txq);
 	ice_reset_tx_queue(txq);
 	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
 
@@ -669,6 +669,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -701,7 +702,7 @@
 		return;
 	}
 
-	ice_rx_queue_release_mbufs(q);
+	q->rx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -866,6 +867,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
 
 	return 0;
 }
@@ -880,7 +882,7 @@
 		return;
 	}
 
-	ice_tx_queue_release_mbufs(q);
+	q->tx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -1552,18 +1554,22 @@
 void
 ice_clear_queues(struct rte_eth_dev *dev)
 {
+	struct ice_rx_queue *rxq;
+	struct ice_tx_queue *txq;
 	uint16_t i;
 
 	PMD_INIT_FUNC_TRACE();
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
-		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
-		ice_reset_tx_queue(dev->data->tx_queues[i]);
+		txq = dev->data->tx_queues[i];
+		txq->tx_rel_mbufs(txq);
+		ice_reset_tx_queue(txq);
 	}
 
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
-		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
-		ice_reset_rx_queue(dev->data->rx_queues[i]);
+		rxq = dev->data->rx_queues[i];
+		rxq->rx_rel_mbufs(rxq);
+		ice_reset_rx_queue(rxq);
 	}
 }
 
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..78b4928 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs_t rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs_t tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
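
The value of the new rx_rel_mbufs/tx_rel_mbufs pointers shows up in the later vector patches: queue setup installs the scalar release routine, the vector setup overrides it, and the common stop/clear code calls through the pointer without knowing which Rx/Tx path is active. A condensed sketch of the pattern (the struct and function names here are stand-ins, not the full ice types):

struct rxq; /* stand-in for struct ice_rx_queue */
typedef void (*rx_release_mbufs_t)(struct rxq *q);

struct rxq {
	rx_release_mbufs_t rx_rel_mbufs; /* installed once at queue setup */
	/* ... descriptor ring, sw_ring, rxrearm_nb, ... */
};

static void release_mbufs_scalar(struct rxq *q) { (void)q; /* walk sw_ring */ }
static void release_mbufs_vec(struct rxq *q)    { (void)q; /* honour rxrearm_nb */ }

/* default setup installs the scalar routine ... */
static void rxq_setup(struct rxq *q)     { q->rx_rel_mbufs = release_mbufs_scalar; }
/* ... and the vector setup overrides it, like ice_rxq_vec_setup() does */
static void rxq_vec_setup(struct rxq *q) { q->rx_rel_mbufs = release_mbufs_vec; }

/* common teardown calls through the pointer, path-agnostic */
static void rxq_stop(struct rxq *q)      { q->rx_rel_mbufs(q); }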

* [PATCH v5 3/8] net/ice: support vector SSE in RX
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 1/8] net/ice: fix Tx function setting Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22  9:42     ` Maxime Coquelin
  2019-03-22  2:58   ` [PATCH v5 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |  33 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  27 +-
 drivers/net/ice/ice_rxtx.h            |  21 +-
 drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 496 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   4 +
 9 files changed, 737 insertions(+), 6 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1a19788
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,33 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..92594bb 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index b804be1..8e7c7db 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d540ed1..ebb1cab 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -319,6 +317,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1490,6 +1491,12 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_ARCH_X86
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+
 	return NULL;
 }
 
@@ -2225,6 +2232,22 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 78b4928..656ca0d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
 
@@ -45,13 +54,16 @@ struct ice_rx_queue {
 	uint16_t nb_rx_hold; /* number of held free RX desc */
 	struct rte_mbuf *pkt_first_seg; /**< first segment of current packet */
 	struct rte_mbuf *pkt_last_seg; /**< last segment of current packet */
-#ifdef RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC
 	uint16_t rx_nb_avail; /**< number of staged packets ready */
 	uint16_t rx_next_avail; /**< index of next staged packets */
 	uint16_t rx_free_trigger; /**< triggers rx buffer allocation */
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
-#endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +168,9 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..cfef91b
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		   uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned int pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned int mask = rxq->nb_rx_desc - 1;
+	unsigned int i;
+
+	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..f6fe9ef
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,496 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+		  struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(0x1c03804, 0x1c03804,
+						   0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+static inline uint16_t
+_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		   uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16
+				(0, 0, 0,    /* ignore non-length fields */
+				 -rxq->crc_len, /* sub crc on data_len */
+				 0,          /* ignore high-16bits of pkt_len */
+				 -rxq->crc_len, /* sub crc on pkt_len */
+				 0, 0            /* ignore pkt_type field */
+				);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts has to be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8
+			(7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF  /*pkt_type set as unknown */
+			);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf point into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf points */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* B.1 load 2 mbuf point */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0x04, 0x0C,
+							     0x00, 0x08);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static void __attribute__((cold))
+ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	_ice_rx_queue_release_mbufs_vec(rxq);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..469264d 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,7 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v5 4/8] net/ice: support Rx scatter SSE vector
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (2 preceding siblings ...)
  2019-03-22  2:58   ` [PATCH v5 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 5/8] net/ice: support Tx " Wenzhuo Lu
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ebb1cab..5409dd0 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1493,7 +1493,8 @@
 		return ptypes;
 
 #ifdef RTE_ARCH_X86
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 
@@ -2241,9 +2242,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 656ca0d..6ef0a84 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -173,4 +173,6 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index f6fe9ef..e1f057a 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -473,6 +473,47 @@
 	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+					      split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+				      &split_flags[i]);
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v5 5/8] net/ice: support Tx SSE vector
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (3 preceding siblings ...)
  2019-03-22  2:58   ` [PATCH v5 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22  9:58     ` Maxime Coquelin
  2019-03-22  2:58   ` [PATCH v5 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  17 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
 5 files changed, 291 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1a19788..173c8f2 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -22,6 +23,7 @@ RSS reta update      = Y
 VLAN filter          = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5409dd0..f9ecffa 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2332,6 +2332,23 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6ef0a84..1dde4e7 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -170,9 +170,13 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
 
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index cfef91b..be079a3 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+tx_backlog_entry(struct ice_tx_entry *txep,
+		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -101,6 +168,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (!txq || !txq->sw_ring) {
+		PMD_DRV_LOG(DEBUG, "Pointer to txq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so the remaining mbufs need to be freed more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -137,6 +232,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -152,4 +270,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index e1f057a..4e148d6 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -514,12 +514,131 @@
 				      &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
 	_ice_rx_queue_release_mbufs_vec(rxq);
 }
 
+static void __attribute__((cold))
+ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	_ice_tx_queue_release_mbufs_vec(txq);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -531,7 +650,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (4 preceding siblings ...)
  2019-03-22  2:58   ` [PATCH v5 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22 10:12     ` Maxime Coquelin
  2019-03-22  2:58   ` [PATCH v5 7/8] net/ice: support Rx scatter " Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  7 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  16 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 622 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 671 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 92594bb..5ba59f4 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index f9ecffa..6191f34 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1494,7 +1494,8 @@
 
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2236,21 +2237,30 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1dde4e7..d1c9b92 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -179,4 +179,6 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..763fa9f
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,622 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/**
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
+						vaddr1, 1);
+		vaddr2_3 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
+						vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+			uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust =
+		_mm256_set_epi16
+			(/* first descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0,          /* ignore pkt_type field */
+			 /* second descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0           /* ignore pkt_type field */
+			);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk =
+		_mm256_set_epi8
+			(/* first descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF,  /*pkt_type set as unknown */
+			 /* second descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF   /*pkt_type set as unknown */
+			);
+	/**
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/**
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask =
+		 _mm256_set1_epi32((1 << 2) | (1 << 11) |
+				   (3 << 12) | (7 << 22));
+	/**
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf =
+		_mm256_set_epi32(0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+				 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/**
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf =
+		_mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0,/* end up 128-bits */
+				0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0);
+
+	/**
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits.  This gives us the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask =
+		 _mm256_set1_epi32(PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+				   PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+				   PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+
+	for (i = 0, received = 0; i < nb_pkts;
+	     i += ICE_DESCS_PER_LOOP_AVX,
+	     rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				    _mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256
+			((void *)&rx_pkts[i + 4],
+			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		{
+			const __m128i raw_desc7 =
+				_mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 =
+				_mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 =
+				_mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 =
+				_mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 =
+				_mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 =
+				_mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 =
+				_mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 =
+				_mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc6),
+					 raw_desc7, 1);
+			raw_desc4_5 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc4),
+					 raw_desc5, 1);
+			raw_desc2_3 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc2),
+					 raw_desc3, 1);
+			raw_desc0_1 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc0),
+					 raw_desc1, 1);
+		}
+
+		if (split_packet) {
+			int j;
+
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/**
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7,
+							 PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5,
+							 PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7,
+							   len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5,
+							   len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/**
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * and so ptype is in lower 8-bits in each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/**
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3,
+							 PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1,
+							 PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3,
+							   len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1,
+							   len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+								desc0_1);
+
+		/**
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+							  status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits =
+			_mm256_and_si256(status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags =
+			_mm256_shuffle_epi8(vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags =
+			_mm256_shuffle_epi8(rss_flags_shuf,
+					    _mm256_srli_epi32(flag_bits, 11));
+		/**
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/**
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init
+		 * data so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend
+		 * for each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				 offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				 RTE_ALIGN(offsetof(struct rte_mbuf,
+						    rearm_data),
+					   16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+			rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 8),
+					    0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 4),
+					    0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(mbuf_flags, 4),
+					    0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
+				    rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
+				    rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
+				    rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
+				    rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags =
+			_mm256_castsi128_si256
+				(_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 8),
+					    0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 4),
+					    0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(odd_flags, 4),
+					    0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
+				    rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
+				    rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
+				    rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
+				    rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask =
+				_mm_set1_epi16(1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+								     eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits =
+				_mm_packus_epi32
+					(_mm256_castsi256_si128(eop_bits256),
+					 _mm256_extractf128_si256(eop_bits256,
+								  1));
+			/**
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit i.e. !EOP, rather than EOP one.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/**
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle =
+				_mm_set_epi8(/* zero hi 64b */
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     /* move values to lo 64b */
+					     8, 0, 10, 2,
+					     12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet =
+				_mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+					       _mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll
+					(_mm_cvtsi128_si64
+						(_mm256_extracti128_si256
+							(status0_7, 1)));
+		burst += __builtin_popcountll
+				(_mm_cvtsi128_si64
+					(_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 469264d..2bec688 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -14,4 +14,19 @@ includes += include_directories('base')
 
 if arch_subdir == 'x86'
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. we have AVX2 supported in minimum instruction set baseline
+	# b. it's not minimum instruction set, but supported by compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v5 7/8] net/ice: support Rx scatter AVX2 vector
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (5 preceding siblings ...)
  2019-03-22  2:58   ` [PATCH v5 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  2019-03-22  2:58   ` [PATCH v5 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 6191f34..34b8386 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1495,7 +1495,8 @@
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2251,9 +2252,12 @@ void __attribute__((cold))
 
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index d1c9b92..dfc3224 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -181,4 +181,7 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 763fa9f..7bea3a9 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -620,3 +620,67 @@
 {
 	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles single burst of 32 scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v5 8/8] net/ice: support vector AVX2 in TX
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (6 preceding siblings ...)
  2019-03-22  2:58   ` [PATCH v5 7/8] net/ice: support Rx scatter " Wenzhuo Lu
@ 2019-03-22  2:58   ` Wenzhuo Lu
  7 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-22  2:58 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/ice.rst                |  18 ++++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  13 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
 5 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/ice.rst b/doc/guides/nics/ice.rst
index 3998d5e..fdbc02e 100644
--- a/doc/guides/nics/ice.rst
+++ b/doc/guides/nics/ice.rst
@@ -64,6 +64,24 @@ Driver compilation and testing
 Refer to the document :ref:`compiling and testing a PMD for a NIC <pmd_build_and_test>`
 for details.
 
+Features
+--------
+
+Vector PMD
+~~~~~~~~~~
+
+The vector PMDs for the RX and TX paths are selected automatically. The
+paths are chosen based on 2 conditions.
+
+- ``CPU``
+  On the X86 platform, the driver checks whether the CPU supports AVX2.
+  If it is supported, the AVX2 paths are chosen; if not, SSE is chosen.
+
+- ``Offload features``
+  The supported HW offload features are described in the document ice_vec.ini.
+  If any unsupported features are used, the ICE vector PMD is disabled and the
+  normal paths are chosen.
+
 Sample Application Notes
 ------------------------
 
diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index 61a2c73..610c4cd 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -91,6 +91,10 @@ New Features
 
   * Added promiscuous mode support.
 
+* **Added support of vector instructions on ICE.**
+
+   Added support of SSE and AVX2 instructions in ICE RX and TX path.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 34b8386..4a09457 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2349,15 +2349,24 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index dfc3224..64e9f20 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -184,4 +184,6 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 7bea3a9..730b882 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -684,3 +684,161 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on 32-bit boundary, do one to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 =
+			_mm256_set_epi64x
+				(hi_qw3,
+				 pkt[3]->buf_physaddr + pkt[3]->data_off,
+				 hi_qw2,
+				 pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 =
+			_mm256_set_epi64x
+				(hi_qw1,
+				 pkt[1]->buf_physaddr + pkt[1]->data_off,
+				 hi_qw0,
+				 pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* crossing tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		       uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v4 1/8] net/ice: fix Tx function setting
  2019-03-21  6:26   ` [PATCH v4 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-22  8:46     ` Maxime Coquelin
  2019-03-22  9:01       ` Maxime Coquelin
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-22  8:46 UTC (permalink / raw)
  To: Wenzhuo Lu, dev; +Cc: stable



On 3/21/19 7:26 AM, Wenzhuo Lu wrote:
> The TX setting functions is not called.

s/functions/function/

> 
> Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>   drivers/net/ice/ice_ethdev.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> index a23c63a..b804be1 100644
> --- a/drivers/net/ice/ice_ethdev.c
> +++ b/drivers/net/ice/ice_ethdev.c
> @@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
>   	}
>   
>   	ice_set_rx_function(dev);
> +	ice_set_tx_function(dev);
>   
>   	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
>   			ETH_VLAN_EXTEND_MASK;
> 

Other than that:

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v4 2/8] net/ice: add pointer for queue buffer release
  2019-03-21  6:26   ` [PATCH v4 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-22  8:59     ` Maxime Coquelin
  0 siblings, 0 replies; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-22  8:59 UTC (permalink / raw)
  To: Wenzhuo Lu, dev



On 3/21/19 7:26 AM, Wenzhuo Lu wrote:
> Add function pointers of buffer releasing for RX and
> TX queues, for vector functions will be added for RX
> and TX.
> 
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>   drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
>   drivers/net/ice/ice_rxtx.h |  5 +++++
>   2 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> index c794ee8..d540ed1 100644
> --- a/drivers/net/ice/ice_rxtx.c
> +++ b/drivers/net/ice/ice_rxtx.c
> @@ -366,7 +366,7 @@
>   		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
>   			    rx_queue_id);
>   
> -		ice_rx_queue_release_mbufs(rxq);
> +		rxq->rx_rel_mbufs(rxq);
>   		ice_reset_rx_queue(rxq);
>   		return -EINVAL;
>   	}
> @@ -393,7 +393,7 @@
>   				    rx_queue_id);
>   			return -EINVAL;
>   		}
> -		ice_rx_queue_release_mbufs(rxq);
> +		rxq->rx_rel_mbufs(rxq);
>   		ice_reset_rx_queue(rxq);
>   		dev->data->rx_queue_state[rx_queue_id] =
>   			RTE_ETH_QUEUE_STATE_STOPPED;
> @@ -555,7 +555,7 @@
>   		return -EINVAL;
>   	}
>   
> -	ice_tx_queue_release_mbufs(txq);
> +	txq->tx_rel_mbufs(txq);
>   	ice_reset_tx_queue(txq);
>   	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
>   
> @@ -669,6 +669,7 @@
>   	ice_reset_rx_queue(rxq);
>   	rxq->q_set = TRUE;
>   	dev->data->rx_queues[queue_idx] = rxq;
> +	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;

I think it could be cleaner to have ice_rx_queue_release_mbufs() call
the callback. So you would rename the current ice_rx_queue_release_mbufs
to something else.

It would make the code more consistent IMHO, and would avoid having to
patch all the call sites.
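
As a minimal standalone sketch of that suggestion (names simplified and
hypothetical, not the driver's actual code): the existing entry point keeps
its name and only dispatches through the per-queue callback, so the
stop/clear/release call sites stay untouched.

struct rx_queue;
typedef void (*release_mbufs_t)(struct rx_queue *rxq);

/* stripped-down stand-in for struct ice_rx_queue */
struct rx_queue {
	release_mbufs_t rx_rel_mbufs; /* scalar or vector variant */
};

/* default (scalar) release path; in the driver this would be the
 * renamed body of today's ice_rx_queue_release_mbufs()
 */
static void
rx_queue_release_mbufs_default(struct rx_queue *rxq)
{
	(void)rxq; /* the real code frees the sw_ring mbufs here */
}

/* the entry point keeps its name and only dispatches */
static void
rx_queue_release_mbufs(struct rx_queue *rxq)
{
	rxq->rx_rel_mbufs(rxq);
}

int main(void)
{
	struct rx_queue q = { .rx_rel_mbufs = rx_queue_release_mbufs_default };

	rx_queue_release_mbufs(&q); /* call site is unchanged */
	return 0;
}

Vector queue setup would then only overwrite rx_rel_mbufs with its own
variant.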

>   
>   	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
>   
> @@ -701,7 +702,7 @@
>   		return;
>   	}
>   
> -	ice_rx_queue_release_mbufs(q);
> +	q->rx_rel_mbufs(q);
>   	rte_free(q->sw_ring);
>   	rte_free(q);
>   }
> @@ -866,6 +867,7 @@
>   	ice_reset_tx_queue(txq);
>   	txq->q_set = TRUE;
>   	dev->data->tx_queues[queue_idx] = txq;
> +	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
>   
>   	return 0;
>   }
> @@ -880,7 +882,7 @@
>   		return;
>   	}
>   
> -	ice_tx_queue_release_mbufs(q);
> +	q->tx_rel_mbufs(q);
>   	rte_free(q->sw_ring);
>   	rte_free(q);
>   }
> @@ -1552,18 +1554,22 @@
>   void
>   ice_clear_queues(struct rte_eth_dev *dev)
>   {
> +	struct ice_rx_queue *rxq;
> +	struct ice_tx_queue *txq;
>   	uint16_t i;
>   
>   	PMD_INIT_FUNC_TRACE();
>   
>   	for (i = 0; i < dev->data->nb_tx_queues; i++) {
> -		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
> -		ice_reset_tx_queue(dev->data->tx_queues[i]);
> +		txq = dev->data->tx_queues[i];
> +		txq->tx_rel_mbufs(txq);
> +		ice_reset_tx_queue(txq);
>   	}
>   
>   	for (i = 0; i < dev->data->nb_rx_queues; i++) {
> -		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
> -		ice_reset_rx_queue(dev->data->rx_queues[i]);
> +		rxq = dev->data->rx_queues[i];
> +		rxq->rx_rel_mbufs(rxq);
> +		ice_reset_rx_queue(rxq);
>   	}
>   }
>   
> diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
> index ec0e52e..78b4928 100644
> --- a/drivers/net/ice/ice_rxtx.h
> +++ b/drivers/net/ice/ice_rxtx.h
> @@ -27,6 +27,9 @@
>   
>   #define ICE_SUPPORT_CHAIN_NUM 5
>   
> +typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
> +typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
> +
>   struct ice_rx_entry {
>   	struct rte_mbuf *mbuf;
>   };
> @@ -61,6 +64,7 @@ struct ice_rx_queue {
>   	uint16_t max_pkt_len; /* Maximum packet length */
>   	bool q_set; /* indicate if rx queue has been configured */
>   	bool rx_deferred_start; /* don't start this queue in dev start */
> +	ice_rx_release_mbufs_t rx_rel_mbufs;
>   };
>   
>   struct ice_tx_entry {
> @@ -100,6 +104,7 @@ struct ice_tx_queue {
>   	uint16_t tx_next_rs;
>   	bool tx_deferred_start; /* don't start this queue in dev start */
>   	bool q_set; /* indicate if tx queue has been configured */
> +	ice_tx_release_mbufs_t tx_rel_mbufs;
>   };
>   
>   /* Offload features */
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v4 1/8] net/ice: fix Tx function setting
  2019-03-22  8:46     ` Maxime Coquelin
@ 2019-03-22  9:01       ` Maxime Coquelin
  0 siblings, 0 replies; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-22  9:01 UTC (permalink / raw)
  To: Wenzhuo Lu, dev; +Cc: stable



On 3/22/19 9:46 AM, Maxime Coquelin wrote:
> 
> 
> On 3/21/19 7:26 AM, Wenzhuo Lu wrote:
>> The TX setting functions is not called.
> 
> s/functions/function/
> 
>>
>> Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
>> ---
>>   drivers/net/ice/ice_ethdev.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
>> index a23c63a..b804be1 100644
>> --- a/drivers/net/ice/ice_ethdev.c
>> +++ b/drivers/net/ice/ice_ethdev.c
>> @@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
>>       }
>>       ice_set_rx_function(dev);
>> +    ice_set_tx_function(dev);
>>       mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
>>               ETH_VLAN_EXTEND_MASK;
>>
> 
> Other than that:
> 
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Sorry, I just noticed I hadn't fetched my mailbox, so I didn't see the v5.
Anyway, my comments seem to apply to v5 too.

Maxime

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 3/8] net/ice: support vector SSE in RX
  2019-03-22  2:58   ` [PATCH v5 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-22  9:42     ` Maxime Coquelin
  2019-03-25  1:56       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-22  9:42 UTC (permalink / raw)
  To: Wenzhuo Lu, dev



On 3/22/19 3:58 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>   doc/guides/nics/features/ice_vec.ini  |  33 +++
>   drivers/net/ice/Makefile              |   3 +
>   drivers/net/ice/ice_ethdev.c          |   2 -
>   drivers/net/ice/ice_ethdev.h          |   2 +
>   drivers/net/ice/ice_rxtx.c            |  27 +-
>   drivers/net/ice/ice_rxtx.h            |  21 +-
>   drivers/net/ice/ice_rxtx_vec_common.h | 155 +++++++++++
>   drivers/net/ice/ice_rxtx_vec_sse.c    | 496 ++++++++++++++++++++++++++++++++++
>   drivers/net/ice/meson.build           |   4 +
>   9 files changed, 737 insertions(+), 6 deletions(-)
>   create mode 100644 doc/guides/nics/features/ice_vec.ini
>   create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
>   create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> 
> diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
> new file mode 100644
> index 0000000..1a19788
> --- /dev/null
> +++ b/doc/guides/nics/features/ice_vec.ini
> @@ -0,0 +1,33 @@
> +;
> +; Supported features of the 'ice_vec' network poll mode driver.
> +;
> +; Refer to default.ini for the full list of available PMD features.
> +;
> +[Features]
> +Speed capabilities   = Y
> +Link status          = Y
> +Link status event    = Y
> +Rx interrupt         = Y
> +Queue start/stop     = Y
> +MTU update           = Y
> +Jumbo frame          = Y
> +Scattered Rx         = Y
> +Promiscuous mode     = Y
> +Allmulticast mode    = Y
> +Unicast MAC filter   = Y
> +Multicast MAC filter = Y
> +RSS hash             = Y
> +RSS key update       = Y
> +RSS reta update      = Y
> +VLAN filter          = Y
> +Packet type parsing  = Y
> +Rx descriptor status = Y
> +Basic stats          = Y
> +Extended stats       = Y
> +FW version           = Y
> +Module EEPROM dump   = Y
> +BSD nic_uio          = Y
> +Linux UIO            = Y
> +Linux VFIO           = Y
> +x86-32               = Y
> +x86-64               = Y
> diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
> index 61846ca..92594bb 100644
> --- a/drivers/net/ice/Makefile
> +++ b/drivers/net/ice/Makefile
> @@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
>   
>   SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
>   SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
> +ifeq ($(CONFIG_RTE_ARCH_X86), y)
> +SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
> +endif
>   
>   include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> index b804be1..8e7c7db 100644
> --- a/drivers/net/ice/ice_ethdev.c
> +++ b/drivers/net/ice/ice_ethdev.c
> @@ -2,8 +2,6 @@
>    * Copyright(c) 2018 Intel Corporation
>    */
>   
> -#include <rte_ethdev_pci.h>
> -
>   #include "base/ice_sched.h"
>   #include "ice_ethdev.h"
>   #include "ice_rxtx.h"
> diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
> index 3cefa5b..151a09e 100644
> --- a/drivers/net/ice/ice_ethdev.h
> +++ b/drivers/net/ice/ice_ethdev.h
> @@ -7,6 +7,8 @@
>   
>   #include <rte_kvargs.h>
>   
> +#include <rte_ethdev_pci.h>
> +
>   #include "base/ice_common.h"
>   #include "base/ice_adminq_cmd.h"
>   
> diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> index d540ed1..ebb1cab 100644
> --- a/drivers/net/ice/ice_rxtx.c
> +++ b/drivers/net/ice/ice_rxtx.c
> @@ -7,8 +7,6 @@
>   
>   #include "ice_rxtx.h"
>   
> -#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
> -
>   #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
>   		PKT_TX_IP_CKSUM |		 \
>   		PKT_TX_L4_MASK |		 \
> @@ -319,6 +317,9 @@
>   	rxq->nb_rx_hold = 0;
>   	rxq->pkt_first_seg = NULL;
>   	rxq->pkt_last_seg = NULL;
> +
> +	rxq->rxrearm_start = 0;
> +	rxq->rxrearm_nb = 0;
>   }
>   
>   int
> @@ -1490,6 +1491,12 @@
>   #endif
>   	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
>   		return ptypes;
> +
> +#ifdef RTE_ARCH_X86
> +	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
> +		return ptypes;
> +#endif
> +
>   	return NULL;
>   }
>   
> @@ -2225,6 +2232,22 @@ void __attribute__((cold))
>   	PMD_INIT_FUNC_TRACE();
>   	struct ice_adapter *ad =
>   		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
> +#ifdef RTE_ARCH_X86
> +	struct ice_rx_queue *rxq;
> +	int i;
> +
> +	if (!ice_rx_vec_dev_check(dev)) {
> +		for (i = 0; i < dev->data->nb_rx_queues; i++) {
> +			rxq = dev->data->rx_queues[i];
> +			(void)ice_rxq_vec_setup(rxq);
> +		}
> +		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
> +			    dev->data->port_id);
> +		dev->rx_pkt_burst = ice_recv_pkts_vec;
> +
> +		return;
> +	}
> +#endif
>   
>   	if (dev->data->scattered_rx) {
>   		/* Set the non-LRO scattered function */
> diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
> index 78b4928..656ca0d 100644
> --- a/drivers/net/ice/ice_rxtx.h
> +++ b/drivers/net/ice/ice_rxtx.h
> @@ -27,6 +27,15 @@
>   
>   #define ICE_SUPPORT_CHAIN_NUM 5
>   
> +#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
> +
> +#define ICE_VPMD_RX_BURST           32
> +#define ICE_VPMD_TX_BURST           32
> +#define ICE_RXQ_REARM_THRESH        32
> +#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
> +#define ICE_TX_MAX_FREE_BUF_SZ      64
> +#define ICE_DESCS_PER_LOOP          4
> +
>   typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
>   typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
>   
> @@ -45,13 +54,16 @@ struct ice_rx_queue {
>   	uint16_t nb_rx_hold; /* number of held free RX desc */
>   	struct rte_mbuf *pkt_first_seg; /**< first segment of current packet */
>   	struct rte_mbuf *pkt_last_seg; /**< last segment of current packet */
> -#ifdef RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC
>   	uint16_t rx_nb_avail; /**< number of staged packets ready */
>   	uint16_t rx_next_avail; /**< index of next staged packets */
>   	uint16_t rx_free_trigger; /**< triggers rx buffer allocation */
>   	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
>   	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
> -#endif
> +
> +	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
> +	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
> +	uint64_t mbuf_initializer; /**< value to init mbufs */
> +
>   	uint8_t port_id; /* device port ID */
>   	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
>   	uint16_t queue_id; /* RX queue index */
> @@ -156,4 +168,9 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
>   int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
>   void ice_set_default_ptype_table(struct rte_eth_dev *dev);
>   const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
> +
> +int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
> +int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
> +uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> +			   uint16_t nb_pkts);
>   #endif /* _ICE_RXTX_H_ */
> diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
> new file mode 100644
> index 0000000..cfef91b
> --- /dev/null
> +++ b/drivers/net/ice/ice_rxtx_vec_common.h
> @@ -0,0 +1,155 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#ifndef _ICE_RXTX_VEC_COMMON_H_
> +#define _ICE_RXTX_VEC_COMMON_H_
> +
> +#include "ice_rxtx.h"
> +
> +static inline uint16_t
> +reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
As this is in the header file, I think it could be better to prefix it
with 'ice_'. Or maybe with 'ice_rx_' as it seems to be rx-only.
> +		   uint16_t nb_bufs, uint8_t *split_flags)
> +{
> +	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
> +	struct rte_mbuf *start = rxq->pkt_first_seg;
> +	struct rte_mbuf *end =  rxq->pkt_last_seg;
> +	unsigned int pkt_idx, buf_idx;
> +
> +	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
> +		if (end) {
> +			/* processing a split packet */
> +			end->next = rx_bufs[buf_idx];
> +			rx_bufs[buf_idx]->data_len += rxq->crc_len;
> +
> +			start->nb_segs++;
> +			start->pkt_len += rx_bufs[buf_idx]->data_len;
> +			end = end->next;
> +
> +			if (!split_flags[buf_idx]) {
> +				/* it's the last packet of the set */
> +				start->hash = end->hash;
> +				start->ol_flags = end->ol_flags;
> +				/* we need to strip crc for the whole packet */
> +				start->pkt_len -= rxq->crc_len;
> +				if (end->data_len > rxq->crc_len) {
> +					end->data_len -= rxq->crc_len;
> +				} else {
> +					/* free up last mbuf */
> +					struct rte_mbuf *secondlast = start;
> +
> +					start->nb_segs--;
> +					while (secondlast->next != end)
> +						secondlast = secondlast->next;
> +					secondlast->data_len -= (rxq->crc_len -
> +							end->data_len);
> +					secondlast->next = NULL;
> +					rte_pktmbuf_free_seg(end);
> +				}
> +				pkts[pkt_idx++] = start;
> +				start = NULL;
> +				end = NULL;
> +			}
> +		} else {
> +			/* not processing a split packet */
> +			if (!split_flags[buf_idx]) {
> +				/* not a split packet, save and skip */
> +				pkts[pkt_idx++] = rx_bufs[buf_idx];
> +				continue;
> +			}
> +			start = rx_bufs[buf_idx];
> +			end = start;
> +			rx_bufs[buf_idx]->data_len += rxq->crc_len;
> +			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
> +		}
> +	}
> +
> +	/* save the partial packet for next time */
> +	rxq->pkt_first_seg = start;
> +	rxq->pkt_last_seg = end;
> +	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
> +	return pkt_idx;
> +}
> +
> +static inline void
> +_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
> +{
> +	const unsigned int mask = rxq->nb_rx_desc - 1;
> +	unsigned int i;
> +
> +	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
> +		return;

Maybe not a big deal, but I understand that !rxq->sw_ring is not the
common case, more an error. If so, the if condition could be split in
two, and having the first one tagged with unlikely.

Looking at Tx patch, you should also ensure that rxq != NULL and also
print a debug/error message to be consistent.

> +
> +	/* free all mbufs that are valid in the ring */
> +	if (rxq->rxrearm_nb == 0) {
> +		for (i = 0; i < rxq->nb_rx_desc; i++) {
> +			if (rxq->sw_ring[i].mbuf)
> +				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
> +		}
> +	} else {
> +		for (i = rxq->rx_tail;
> +		     i != rxq->rxrearm_start;
> +		     i = (i + 1) & mask) {
> +			if (rxq->sw_ring[i].mbuf)
> +				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
> +		}
> +	}
> +
> +	rxq->rxrearm_nb = rxq->nb_rx_desc;
> +
> +	/* set all entries to NULL */
> +	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
> +}

...

> diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
> new file mode 100644
> index 0000000..f6fe9ef
> --- /dev/null
> +++ b/drivers/net/ice/ice_rxtx_vec_sse.c

...

> +
> +/**
> + * Notice:
> + * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
> + * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
> + *   numbers of DD bits
> + */
> +uint16_t
> +ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> +		  uint16_t nb_pkts)
> +{
> +	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);

Same as below comment.

> +}
> +
> +static void __attribute__((cold))
> +ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
> +{
> +	_ice_rx_queue_release_mbufs_vec(rxq);

What is the point of having _ice_rx_queue_release_mbufs_vec as it is
only called once here?

> +}
> +
> +int __attribute__((cold))
> +ice_rxq_vec_setup(struct ice_rx_queue *rxq)
> +{
> +	if (!rxq)
> +		return -1;
> +
> +	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
> +	return ice_rxq_vec_setup_default(rxq);
> +}
> +
> +int __attribute__((cold))
> +ice_rx_vec_dev_check(struct rte_eth_dev *dev)
> +{
> +	return ice_rx_vec_dev_check_default(dev);
> +}
> diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
> index 857dc0e..469264d 100644
> --- a/drivers/net/ice/meson.build
> +++ b/drivers/net/ice/meson.build
> @@ -11,3 +11,7 @@ sources = files(
>   
>   deps += ['hash']
>   includes += include_directories('base')
> +
> +if arch_subdir == 'x86'
> +	sources += files('ice_rxtx_vec_sse.c')
> +endif
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 5/8] net/ice: support Tx SSE vector
  2019-03-22  2:58   ` [PATCH v5 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-22  9:58     ` Maxime Coquelin
  2019-03-25  2:02       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-22  9:58 UTC (permalink / raw)
  To: Wenzhuo Lu, dev



On 3/22/19 3:58 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>   doc/guides/nics/features/ice_vec.ini  |   2 +
>   drivers/net/ice/ice_rxtx.c            |  17 +++++
>   drivers/net/ice/ice_rxtx.h            |   4 +
>   drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
>   drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
>   5 files changed, 291 insertions(+)
> 
> diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
> index 1a19788..173c8f2 100644
> --- a/doc/guides/nics/features/ice_vec.ini
> +++ b/doc/guides/nics/features/ice_vec.ini
> @@ -12,6 +12,7 @@ Queue start/stop     = Y
>   MTU update           = Y
>   Jumbo frame          = Y
>   Scattered Rx         = Y
> +TSO                  = Y
>   Promiscuous mode     = Y
>   Allmulticast mode    = Y
>   Unicast MAC filter   = Y
> @@ -22,6 +23,7 @@ RSS reta update      = Y
>   VLAN filter          = Y
>   Packet type parsing  = Y
>   Rx descriptor status = Y
> +Tx descriptor status = Y
>   Basic stats          = Y
>   Extended stats       = Y
>   FW version           = Y
> diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> index 5409dd0..f9ecffa 100644
> --- a/drivers/net/ice/ice_rxtx.c
> +++ b/drivers/net/ice/ice_rxtx.c
> @@ -2332,6 +2332,23 @@ void __attribute__((cold))
>   {
>   	struct ice_adapter *ad =
>   		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
> +#ifdef RTE_ARCH_X86
> +	struct ice_tx_queue *txq;
> +	int i;
> +
> +	if (!ice_tx_vec_dev_check(dev)) {
> +		for (i = 0; i < dev->data->nb_tx_queues; i++) {
> +			txq = dev->data->tx_queues[i];
> +			(void)ice_txq_vec_setup(txq);
> +		}
> +		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
> +			    dev->data->port_id);
> +		dev->tx_pkt_burst = ice_xmit_pkts_vec;
> +		dev->tx_pkt_prepare = NULL;
> +
> +		return;
> +	}
> +#endif
>   
>   	if (ad->tx_simple_allowed) {
>   		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
> diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
> index 6ef0a84..1dde4e7 100644
> --- a/drivers/net/ice/ice_rxtx.h
> +++ b/drivers/net/ice/ice_rxtx.h
> @@ -170,9 +170,13 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
>   const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
>   
>   int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
> +int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
>   int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
> +int ice_txq_vec_setup(struct ice_tx_queue *txq);
>   uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>   			   uint16_t nb_pkts);
>   uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>   				     uint16_t nb_pkts);
> +uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
> +			   uint16_t nb_pkts);
>   #endif /* _ICE_RXTX_H_ */
> diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
> index cfef91b..be079a3 100644
> --- a/drivers/net/ice/ice_rxtx_vec_common.h
> +++ b/drivers/net/ice/ice_rxtx_vec_common.h
> @@ -71,6 +71,73 @@
>   	return pkt_idx;
>   }
>   
> +static __rte_always_inline int
> +ice_tx_free_bufs(struct ice_tx_queue *txq)
> +{
> +	struct ice_tx_entry *txep;
> +	uint32_t n;
> +	uint32_t i;
> +	int nb_free = 0;
> +	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
> +
> +	/* check DD bits on threshold descriptor */
> +	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> +			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
> +			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
> +		return 0;
> +
> +	n = txq->tx_rs_thresh;
> +
> +	 /* first buffer to free from S/W ring is at index
> +	  * tx_next_dd - (tx_rs_thresh-1)
> +	  */
> +	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> +	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
> +	if (likely(m)) {
> +		free[0] = m;
> +		nb_free = 1;
> +		for (i = 1; i < n; i++) {
> +			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +			if (likely(m)) {
> +				if (likely(m->pool == free[0]->pool)) {
> +					free[nb_free++] = m;
> +				} else {
> +					rte_mempool_put_bulk(free[0]->pool,
> +							     (void *)free,
> +							     nb_free);
> +					free[0] = m;
> +					nb_free = 1;
> +				}
> +			}
> +		}
> +		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
> +	} else {
> +		for (i = 1; i < n; i++) {
> +			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +			if (m)
> +				rte_mempool_put(m->pool, m);
> +		}
> +	}
> +
> +	/* buffers were freed, update counters */
> +	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +	if (txq->tx_next_dd >= txq->nb_tx_desc)
> +		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> +	return txq->tx_rs_thresh;
> +}
> +
> +static __rte_always_inline void
> +tx_backlog_entry(struct ice_tx_entry *txep,
Consider prefixing it with 'ice_tx_'.

> +		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> +{
> +	int i;
> +
> +	for (i = 0; i < (int)nb_pkts; ++i)
> +		txep[i].mbuf = tx_pkts[i];
> +}
> +
>   static inline void
>   _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
>   {
> @@ -101,6 +168,34 @@
>   	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
>   }
>   
> +static inline void
> +_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
> +{
> +	uint16_t i;
> +
> +	if (!txq || !txq->sw_ring) {

if (unlikely(...)) {

> +		PMD_DRV_LOG(DEBUG, "Pointer to rxq or sw_ring is NULL");

s/rxq/txq/

> +		return;
> +	}
> +
> +	/**
> +	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
> +	 *  so need to free remains more carefully.
> +	 */
> +	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
> +	if (txq->tx_tail < i) {
> +		for (; i < txq->nb_tx_desc; i++) {
> +			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> +			txq->sw_ring[i].mbuf = NULL;
> +		}
> +		i = 0;
> +	}
> +	for (; i < txq->tx_tail; i++) {
> +		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> +		txq->sw_ring[i].mbuf = NULL;
> +	}
> +}
> +
>   static inline int
>   ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
>   {
> @@ -137,6 +232,29 @@
>   	return 0;
>   }
>   
> +#define ICE_NO_VECTOR_FLAGS (				 \
> +		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
> +		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
> +		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
> +		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
> +		DEV_TX_OFFLOAD_TCP_CKSUM)
> +
> +static inline int
> +ice_tx_vec_queue_default(struct ice_tx_queue *txq)
> +{
> +	if (!txq)
> +		return -1;
> +
> +	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
> +		return -1;
> +
> +	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
> +	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
> +		return -1;
> +
> +	return 0;
> +}
> +
>   static inline int
>   ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
>   {
> @@ -152,4 +270,19 @@
>   	return 0;
>   }
>   
> +static inline int
> +ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
> +{
> +	int i;
> +	struct ice_tx_queue *txq;
> +
> +	for (i = 0; i < dev->data->nb_tx_queues; i++) {
> +		txq = dev->data->tx_queues[i];
> +		if (ice_tx_vec_queue_default(txq))
> +			return -1;

return ice_tx_vec_queue_default(txq);

Applies also to rx path.

> +	}
> +
> +	return 0;
> +}
> +
>   #endif
> diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
> index e1f057a..4e148d6 100644
> --- a/drivers/net/ice/ice_rxtx_vec_sse.c
> +++ b/drivers/net/ice/ice_rxtx_vec_sse.c
> @@ -514,12 +514,131 @@
>   				      &split_flags[i]);
>   }
>   
> +static inline void
> +ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
> +	 uint64_t flags)
> +{
> +	uint64_t high_qw =
> +		(ICE_TX_DESC_DTYPE_DATA |
> +		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
> +		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
> +
> +	__m128i descriptor = _mm_set_epi64x(high_qw,
> +					    pkt->buf_iova + pkt->data_off);
> +	_mm_store_si128((__m128i *)txdp, descriptor);
> +}
> +
> +static inline void
> +ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
> +	uint16_t nb_pkts, uint64_t flags)
> +{
> +	int i;
> +
> +	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
> +		ice_vtx1(txdp, *pkt, flags);
> +}
> +
> +static uint16_t
> +ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
> +			 uint16_t nb_pkts)
> +{
> +	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
> +	volatile struct ice_tx_desc *txdp;
> +	struct ice_tx_entry *txep;
> +	uint16_t n, nb_commit, tx_id;
> +	uint64_t flags = ICE_TD_CMD;
> +	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
> +	int i;
> +
> +	/* cross rx_thresh boundary is not allowed */
> +	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
> +
> +	if (txq->nb_tx_free < txq->tx_free_thresh)
> +		ice_tx_free_bufs(txq);
> +
> +	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
> +	nb_commit = nb_pkts;
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	tx_id = txq->tx_tail;
> +	txdp = &txq->tx_ring[tx_id];
> +	txep = &txq->sw_ring[tx_id];
> +
> +	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
> +
> +	n = (uint16_t)(txq->nb_tx_desc - tx_id);
> +	if (nb_commit >= n) {
> +		tx_backlog_entry(txep, tx_pkts, n);
> +
> +		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
> +			ice_vtx1(txdp, *tx_pkts, flags);
> +
> +		ice_vtx1(txdp, *tx_pkts++, rs);
> +
> +		nb_commit = (uint16_t)(nb_commit - n);
> +
> +		tx_id = 0;
> +		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> +		/* avoid reach the end of ring */
> +		txdp = &txq->tx_ring[tx_id];
> +		txep = &txq->sw_ring[tx_id];
> +	}
> +
> +	tx_backlog_entry(txep, tx_pkts, nb_commit);
> +
> +	ice_vtx(txdp, tx_pkts, nb_commit, flags);
> +
> +	tx_id = (uint16_t)(tx_id + nb_commit);
> +	if (tx_id > txq->tx_next_rs) {
> +		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
> +			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
> +					 ICE_TXD_QW1_CMD_S);
> +		txq->tx_next_rs =
> +			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
> +	}
> +
> +	txq->tx_tail = tx_id;
> +
> +	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
> +
> +	return nb_pkts;
> +}
> +
> +uint16_t
> +ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
> +		  uint16_t nb_pkts)
> +{
> +	uint16_t nb_tx = 0;
> +	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
> +
> +	while (nb_pkts) {
> +		uint16_t ret, num;
> +
> +		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
> +		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
> +		nb_tx += ret;
> +		nb_pkts -= ret;
> +		if (ret < num)
> +			break;
> +	}
> +
> +	return nb_tx;
> +}
> +
>   static void __attribute__((cold))
>   ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
>   {
>   	_ice_rx_queue_release_mbufs_vec(rxq);
>   }
>   
> +static void __attribute__((cold))
> +ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
> +{
> +	_ice_tx_queue_release_mbufs_vec(txq);

As for Rx, consider putting the code directly here as
_ice_tx_queue_release_mbufs_vec() is called only once here.

> +}
> +
>   int __attribute__((cold))
>   ice_rxq_vec_setup(struct ice_rx_queue *rxq)
>   {
> @@ -531,7 +650,23 @@ int __attribute__((cold))
>   }
>   
>   int __attribute__((cold))
> +ice_txq_vec_setup(struct ice_tx_queue __rte_unused *txq)
> +{
> +	if (!txq)
> +		return -1;
> +
> +	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
> +	return 0;
> +}
> +
> +int __attribute__((cold))
>   ice_rx_vec_dev_check(struct rte_eth_dev *dev)
>   {
>   	return ice_rx_vec_dev_check_default(dev);
>   }
> +
> +int __attribute__((cold))
> +ice_tx_vec_dev_check(struct rte_eth_dev *dev)
> +{
> +	return ice_tx_vec_dev_check_default(dev);
> +}
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-22  2:58   ` [PATCH v5 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-22 10:12     ` Maxime Coquelin
  2019-03-25  2:22       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-22 10:12 UTC (permalink / raw)
  To: Wenzhuo Lu, dev



On 3/22/19 3:58 AM, Wenzhuo Lu wrote:
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>   drivers/net/ice/Makefile            |  19 ++
>   drivers/net/ice/ice_rxtx.c          |  16 +-
>   drivers/net/ice/ice_rxtx.h          |   2 +
>   drivers/net/ice/ice_rxtx_vec_avx2.c | 622 ++++++++++++++++++++++++++++++++++++
>   drivers/net/ice/meson.build         |  15 +
>   5 files changed, 671 insertions(+), 3 deletions(-)
>   create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
> 
> diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
> index 92594bb..5ba59f4 100644
> --- a/drivers/net/ice/Makefile
> +++ b/drivers/net/ice/Makefile
> @@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
>   SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
>   endif
>   
> +ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
> +	CC_AVX2_SUPPORT=1
> +else
> +	CC_AVX2_SUPPORT=\
> +	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
> +	grep -q AVX2 && echo 1)
> +	ifeq ($(CC_AVX2_SUPPORT), 1)
> +		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
> +			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
> +		else
> +			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
> +		endif
> +	endif
> +endif
> +
> +ifeq ($(CC_AVX2_SUPPORT), 1)
> +	SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_avx2.c
> +endif
> +
>   include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> index f9ecffa..6191f34 100644
> --- a/drivers/net/ice/ice_rxtx.c
> +++ b/drivers/net/ice/ice_rxtx.c
> @@ -1494,7 +1494,8 @@
>   
>   #ifdef RTE_ARCH_X86
>   	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
> -	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
> +	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
> +	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
>   		return ptypes;
>   #endif
>   
> @@ -2236,21 +2237,30 @@ void __attribute__((cold))
>   #ifdef RTE_ARCH_X86
>   	struct ice_rx_queue *rxq;
>   	int i;
> +	bool use_avx2 = false;
>   
>   	if (!ice_rx_vec_dev_check(dev)) {
>   		for (i = 0; i < dev->data->nb_rx_queues; i++) {
>   			rxq = dev->data->rx_queues[i];
>   			(void)ice_rxq_vec_setup(rxq);
>   		}
> +
> +		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
> +		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
> +			use_avx2 = true;
> +
>   		if (dev->data->scattered_rx) {
>   			PMD_DRV_LOG(DEBUG,
>   				    "Using Vector Scattered Rx (port %d).",
>   				    dev->data->port_id);
>   			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
>   		} else {
> -			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
> +			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
> +				    use_avx2 ? "avx2 " : "",
>   				    dev->data->port_id);
> -			dev->rx_pkt_burst = ice_recv_pkts_vec;
> +			dev->rx_pkt_burst = use_avx2 ?
> +					    ice_recv_pkts_vec_avx2 :
> +					    ice_recv_pkts_vec;
>   		}
>   
>   		return;
> diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
> index 1dde4e7..d1c9b92 100644
> --- a/drivers/net/ice/ice_rxtx.h
> +++ b/drivers/net/ice/ice_rxtx.h
> @@ -179,4 +179,6 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>   				     uint16_t nb_pkts);
>   uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>   			   uint16_t nb_pkts);
> +uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> +				uint16_t nb_pkts);
>   #endif /* _ICE_RXTX_H_ */
> diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
> new file mode 100644
> index 0000000..763fa9f
> --- /dev/null
> +++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
> @@ -0,0 +1,622 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#include "ice_rxtx_vec_common.h"
> +
> +#include <x86intrin.h>
> +
> +#ifndef __INTEL_COMPILER
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#endif
> +
> +static inline void
> +ice_rxq_rearm(struct ice_rx_queue *rxq)
> +{
> +	int i;
> +	uint16_t rx_id;
> +	volatile union ice_rx_desc *rxdp;
> +	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	rxdp = rxq->rx_ring + rxq->rxrearm_start;
> +
> +	/* Pull 'n' more MBUFs into the software ring */
> +	if (rte_mempool_get_bulk(rxq->mp,
> +				 (void *)rxep,
> +				 ICE_RXQ_REARM_THRESH) < 0) {
> +		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
> +		    rxq->nb_rx_desc) {
> +			__m128i dma_addr0;
> +
> +			dma_addr0 = _mm_setzero_si128();
> +			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
> +				rxep[i].mbuf = &rxq->fake_mbuf;
> +				_mm_store_si128((__m128i *)&rxdp[i].read,
> +						dma_addr0);
> +			}
> +		}
> +		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> +			ICE_RXQ_REARM_THRESH;
> +		return;
> +	}
> +
> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC

I see same is done for other Intel NICs, but I wonder what would be the
performance cost of making it dynamic, if any cost?

Having it dynamic (as a dev arg for instance) would make it possible to
change the value when the user is using dpdk from a distro. It would
also help testing coverage.

Btw, how do you select this option with meson build system?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 3/8] net/ice: support vector SSE in RX
  2019-03-22  9:42     ` Maxime Coquelin
@ 2019-03-25  1:56       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-25  1:56 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, March 22, 2019 5:43 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 3/8] net/ice: support vector SSE in RX
> 
> > +
> > +static inline uint16_t
> > +reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf
> > +**rx_bufs,
> As this is in the header file, I think it could be better to prefix it with 'ice_'. Or
> maybe with 'ice_rx_' as it seems to be rx-only.
Thanks for the comment. I'll add the prefix.

> > +static inline void
> > +_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq) {
> > +	const unsigned int mask = rxq->nb_rx_desc - 1;
> > +	unsigned int i;
> > +
> > +	if (!rxq->sw_ring || rxq->rxrearm_nb >= rxq->nb_rx_desc)
> > +		return;
> 
> Maybe not a big deal, but I understand that !rxq->sw_ring is not the
> common case, more an error. If so, the if condition could be split in two, and
> having the first one tagged with unlikely.
> 
> Looking at Tx patch, you should also ensure that rxq != NULL and also print a
> debug/error message to be consistent.
Thanks for the suggestion. I'll change it.

> > +/**
> > + * Notice:
> > + * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
> > + * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
> > + *   numbers of DD bits
> > + */
> > +uint16_t
> > +ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> > +		  uint16_t nb_pkts)
> > +{
> > +	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
> 
> Same as below comment.
_recv_raw_pkts_vec is used by both the normal RX and the scatter RX. It will be called again later, in patch 4, so we make it an inline function.
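In other words (a minimal sketch with simplified, hypothetical names, assuming the shared helper takes an optional split-flags array as described in the patch), the same raw burst routine backs both entry points:

#include <stdint.h>
#include <stddef.h>

#define VPMD_RX_BURST 32

/* shared raw burst helper; split_flags == NULL means the caller does
 * not need multi-segment handling
 */
static inline uint16_t
recv_raw_burst(void *rxq, void **pkts, uint16_t n, uint8_t *split_flags)
{
	(void)rxq; (void)pkts; (void)split_flags;
	return n; /* vectorized descriptor parsing would live here */
}

/* plain Rx entry point: no reassembly needed */
static uint16_t
recv_burst(void *rxq, void **pkts, uint16_t n)
{
	return recv_raw_burst(rxq, pkts, n, NULL);
}

/* scattered Rx entry point: pass a real array and reassemble afterwards */
static uint16_t
recv_scattered_burst(void *rxq, void **pkts, uint16_t n)
{
	uint8_t split_flags[VPMD_RX_BURST] = {0};
	uint16_t nb = recv_raw_burst(rxq, pkts, n, split_flags);

	/* reassembly of split packets would follow here */
	return nb;
}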

> 
> > +}
> > +
> > +static void __attribute__((cold))
> > +ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq) {
> > +	_ice_rx_queue_release_mbufs_vec(rxq);
> 
> What is the point of having _ice_rx_queue_release_mbufs_vec as it is only
> called once here?
In our experience, it can be reused when the vector path is implemented on other platforms, so we put it in common.h.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 5/8] net/ice: support Tx SSE vector
  2019-03-22  9:58     ` Maxime Coquelin
@ 2019-03-25  2:02       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-25  2:02 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,


> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, March 22, 2019 5:59 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 5/8] net/ice: support Tx SSE vector
> 


> > +
> > +static __rte_always_inline void
> > +tx_backlog_entry(struct ice_tx_entry *txep,
> Consider prefixing it with 'ice_tx_'.
Thanks. Will change it.

> > +static inline void
> > +_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq) {
> > +	uint16_t i;
> > +
> > +	if (!txq || !txq->sw_ring) {
> 
> if (unlikely(...)) {
> 
> > +		PMD_DRV_LOG(DEBUG, "Pointer to rxq or sw_ring is NULL");
> 
> s/rxq/txq/
Thanks. Will change it.

> >
> > +static inline int
> > +ice_tx_vec_dev_check_default(struct rte_eth_dev *dev) {
> > +	int i;
> > +	struct ice_tx_queue *txq;
> > +
> > +	for (i = 0; i < dev->data->nb_tx_queues; i++) {
> > +		txq = dev->data->tx_queues[i];
> > +		if (ice_tx_vec_queue_default(txq))
> > +			return -1;
> 
> return ice_tx_vec_queue_default(txq);
> 
> Applies also to rx path.
Thanks. Will change it.


> >
> > +static void __attribute__((cold))
> > +ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq) {
> > +	_ice_tx_queue_release_mbufs_vec(txq);
> 
> As for Rx, consider putting the code directly here as
> _ice_tx_queue_release_mbufs_vec() is called only once here.
Like the RX path, in our experience it can most probably be reused by other platforms.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-22 10:12     ` Maxime Coquelin
@ 2019-03-25  2:22       ` Lu, Wenzhuo
  2019-03-25  8:26         ` Maxime Coquelin
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-25  2:22 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,


> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, March 22, 2019 6:12 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector


> > +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> 
> I see same is done for other Intel NICs, but I wonder what would be the
> performance cost of making it dynamic, if any cost?
Currently we don't have a good way to make it dynamic. If we use pointers to different functions for 16-byte and 32-byte descriptors, there is too much duplicated code, which makes it hard to maintain. If we use the same function and check the configuration inside it, it impacts the performance.
As the HW does not support changing the configuration dynamically, the device must be stopped and restarted if the configuration is changed. So it's not very helpful to make it a dynamic configuration. We assume that users can make their choice at the beginning and will not change it.
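
For illustration only (a stripped-down, hypothetical sketch of the trade-off described above, not a proposal for the driver): selecting the descriptor layout once at setup through a function pointer keeps the hot path branch-free, at the price of two nearly identical burst functions to maintain.

#include <stdint.h>
#include <stdio.h>

typedef uint16_t (*rx_burst_t)(void *rxq, void **pkts, uint16_t n);

/* one burst function per descriptor layout; in a real driver these
 * would be near-duplicates, which is the maintenance cost mentioned
 */
static uint16_t rx_burst_16b_desc(void *rxq, void **pkts, uint16_t n)
{
	(void)rxq; (void)pkts;
	return n;
}

static uint16_t rx_burst_32b_desc(void *rxq, void **pkts, uint16_t n)
{
	(void)rxq; (void)pkts;
	return n;
}

int main(void)
{
	/* chosen once, before the queues start */
	int use_16byte_desc = 0;
	rx_burst_t rx_burst = use_16byte_desc ? rx_burst_16b_desc
					      : rx_burst_32b_desc;

	/* hot path: no per-burst check of the descriptor size */
	printf("%u packets handled\n", rx_burst(NULL, NULL, 4));
	return 0;
}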

> 
> Having it dynamic (as a dev arg for instance) would make it possible to
> change the value when the user is using dpdk from a distro. It would also
> help testing coverage.
> 
> Btw, how do you select this option with meson build system?
Not very familiar with meson. As far as I know, we can change meson.build to add the configuration.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v6 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (12 preceding siblings ...)
  2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-03-25  6:06 ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 1/8] net/ice: fix Tx function setting Wenzhuo Lu
                     ` (8 more replies)
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
  14 siblings, 9 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in ICE RX and TX path.

---
v2:
 - Updated feature doc.
 - Fixed checklog and checkpatch issues.

v3:
 - Fixed potential compile issue on non-X86 platform.

v4:
 - Removed the compile-time configuration CONFIG_RTE_LIBRTE_ICE_INC_VECTOR.
 - Fixed checkpatch warnings.
 - Added more explanation of the vector path in the device document.
 - Some other minor changes.

v5:
 - Fixed a compile issue.
 - Fixed a doc build warning.

v6:
 - Added prefix "ice_" for ICE specific functions.
 - Added unlikely for rarely used code.

Wenzhuo Lu (8):
  net/ice: fix Tx function setting
  net/ice: add pointer for queue buffer release
  net/ice: support vector SSE in RX
  net/ice: support Rx scatter SSE vector
  net/ice: support Tx SSE vector
  net/ice: support Rx AVX2 vector
  net/ice: support Rx scatter AVX2 vector
  net/ice: support vector AVX2 in TX

 doc/guides/nics/features/ice_vec.ini   |  35 ++
 doc/guides/nics/ice.rst                |  18 +
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             |  99 +++-
 drivers/net/ice/ice_rxtx.h             |  39 +-
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 844 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 293 ++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 672 ++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  19 +
 12 files changed, 2035 insertions(+), 15 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v6 1/8] net/ice: fix Tx function setting
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, stable

The TX setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Cc: stable@dpdk.org

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index a23c63a..b804be1 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v6 2/8] net/ice: add pointer for queue buffer release
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25 13:23     ` Maxime Coquelin
  2019-03-25  6:06   ` [PATCH v6 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of RX and
TX queues, as vector functions will be added for RX
and TX.

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..d540ed1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -366,7 +366,7 @@
 		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
 			    rx_queue_id);
 
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		return -EINVAL;
 	}
@@ -393,7 +393,7 @@
 				    rx_queue_id);
 			return -EINVAL;
 		}
-		ice_rx_queue_release_mbufs(rxq);
+		rxq->rx_rel_mbufs(rxq);
 		ice_reset_rx_queue(rxq);
 		dev->data->rx_queue_state[rx_queue_id] =
 			RTE_ETH_QUEUE_STATE_STOPPED;
@@ -555,7 +555,7 @@
 		return -EINVAL;
 	}
 
-	ice_tx_queue_release_mbufs(txq);
+	txq->tx_rel_mbufs(txq);
 	ice_reset_tx_queue(txq);
 	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
 
@@ -669,6 +669,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -701,7 +702,7 @@
 		return;
 	}
 
-	ice_rx_queue_release_mbufs(q);
+	q->rx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -866,6 +867,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
 
 	return 0;
 }
@@ -880,7 +882,7 @@
 		return;
 	}
 
-	ice_tx_queue_release_mbufs(q);
+	q->tx_rel_mbufs(q);
 	rte_free(q->sw_ring);
 	rte_free(q);
 }
@@ -1552,18 +1554,22 @@
 void
 ice_clear_queues(struct rte_eth_dev *dev)
 {
+	struct ice_rx_queue *rxq;
+	struct ice_tx_queue *txq;
 	uint16_t i;
 
 	PMD_INIT_FUNC_TRACE();
 
 	for (i = 0; i < dev->data->nb_tx_queues; i++) {
-		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
-		ice_reset_tx_queue(dev->data->tx_queues[i]);
+		txq = dev->data->tx_queues[i];
+		txq->tx_rel_mbufs(txq);
+		ice_reset_tx_queue(txq);
 	}
 
 	for (i = 0; i < dev->data->nb_rx_queues; i++) {
-		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
-		ice_reset_rx_queue(dev->data->rx_queues[i]);
+		rxq = dev->data->rx_queues[i];
+		rxq->rx_rel_mbufs(rxq);
+		ice_reset_rx_queue(rxq);
 	}
 }
 
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..78b4928 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs_t rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs_t tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v6 3/8] net/ice: support vector SSE in RX
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 1/8] net/ice: fix Tx function setting Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |  33 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  27 +-
 drivers/net/ice/ice_rxtx.h            |  21 +-
 drivers/net/ice/ice_rxtx_vec_common.h | 160 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 496 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   4 +
 9 files changed, 742 insertions(+), 6 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1a19788
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,33 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..92594bb 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index b804be1..8e7c7db 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d540ed1..ebb1cab 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -319,6 +317,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1490,6 +1491,12 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_ARCH_X86
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+
 	return NULL;
 }
 
@@ -2225,6 +2232,22 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 78b4928..656ca0d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
 
@@ -45,13 +54,16 @@ struct ice_rx_queue {
 	uint16_t nb_rx_hold; /* number of held free RX desc */
 	struct rte_mbuf *pkt_first_seg; /**< first segment of current packet */
 	struct rte_mbuf *pkt_last_seg; /**< last segment of current packet */
-#ifdef RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC
 	uint16_t rx_nb_avail; /**< number of staged packets ready */
 	uint16_t rx_next_avail; /**< index of next staged packets */
 	uint16_t rx_free_trigger; /**< triggers rx buffer allocation */
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
-#endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +168,9 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..d41232d
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+ice_rx_reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+			  uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned int pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned int mask = rxq->nb_rx_desc - 1;
+	unsigned int i;
+
+	if (unlikely(!rxq->sw_ring)) {
+		PMD_DRV_LOG(DEBUG, "sw_ring is NULL");
+		return;
+	}
+
+	if (rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..07cbbf3
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,496 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+ice_rx_desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+			 struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(0x1c03804, 0x1c03804,
+						   0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+ice_rx_desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		       uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+static inline uint16_t
+_ice_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16
+				(0, 0, 0,    /* ignore non-length fields */
+				 -rxq->crc_len, /* sub crc on data_len */
+				 0,          /* ignore high-16bits of pkt_len */
+				 -rxq->crc_len, /* sub crc on pkt_len */
+				 0, 0            /* ignore pkt_type field */
+				);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts shall be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8
+			(7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF  /*pkt_type set as unknown */
+			);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packets in one loop
+	 * [A*. mask out 4 unused dirty fields in desc]
+	 * B. copy 4 mbuf pointers from sw_ring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf point into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf points */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* B.1 load 2 mbuf point */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		ice_rx_desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0x04, 0x0C,
+							     0x00, 0x08);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		ice_rx_desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _ice_recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static void __attribute__((cold))
+ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	_ice_rx_queue_release_mbufs_vec(rxq);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..469264d 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,7 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
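
The SSE Rx path above is only selected when every queue passes ice_rx_vec_dev_check():
the descriptor ring size must be a power of two and a multiple of rx_free_thresh, and
rx_free_thresh must be at least ICE_VPMD_RX_BURST. Nothing changes on the application
side, since _ice_recv_raw_pkts_vec() caps a request at ICE_MAX_RX_BURST and floor-aligns
it to ICE_DESCS_PER_LOOP. A minimal receive loop such as the sketch below keeps working
unmodified; the port/queue numbers and the burst size of 32 are illustrative assumptions,
not part of the patch.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32	/* matches ICE_VPMD_RX_BURST here, but any value works */

static void
rx_loop(void)
{
	struct rte_mbuf *pkts[BURST];
	uint16_t nb, i;

	for (;;) {
		/* the vector path floor-aligns nb_pkts internally */
		nb = rte_eth_rx_burst(0, 0, pkts, BURST);
		for (i = 0; i < nb; i++) {
			/* process pkts[i] here ... */
			rte_pktmbuf_free(pkts[i]);
		}
	}
}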

* [PATCH v6 4/8] net/ice: support Rx scatter SSE vector
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (2 preceding siblings ...)
  2019-03-25  6:06   ` [PATCH v6 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 5/8] net/ice: support Tx " Wenzhuo Lu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ebb1cab..5409dd0 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1493,7 +1493,8 @@
 		return ptypes;
 
 #ifdef RTE_ARCH_X86
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 
@@ -2241,9 +2242,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 656ca0d..6ef0a84 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -173,4 +173,6 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 07cbbf3..639dc86 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -473,6 +473,47 @@
 	return _ice_recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   numbers of DD bits
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _ice_recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+						  split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + ice_rx_reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+					     &split_flags[i]);
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
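
The fast-path check in ice_recv_scattered_pkts_vec() above treats the 32-byte
split_flags array as four 64-bit words, so a full burst that contains no split packets
is detected with four loads instead of 32 byte-wise tests. A standalone sketch of the
same idea follows; the helper name is made up for illustration, and memcpy() is used
here instead of the driver's pointer cast to keep the snippet self-contained.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* true when none of the 32 per-packet split flags is set */
static bool
no_split_packets(const uint8_t split_flags[32])
{
	uint64_t w[4];

	memcpy(w, split_flags, sizeof(w));
	return (w[0] | w[1] | w[2] | w[3]) == 0;
}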

* [PATCH v6 5/8] net/ice: support Tx SSE vector
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (3 preceding siblings ...)
  2019-03-25  6:06   ` [PATCH v6 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  17 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 135 ++++++++++++++++++++++++++++++++++
 5 files changed, 291 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1a19788..173c8f2 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -22,6 +23,7 @@ RSS reta update      = Y
 VLAN filter          = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5409dd0..f9ecffa 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2332,6 +2332,23 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6ef0a84..1dde4e7 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -170,9 +170,13 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
 
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index d41232d..c5f0d56 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+ice_tx_backlog_entry(struct ice_tx_entry *txep,
+		     struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -106,6 +173,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (unlikely(!txq || !txq->sw_ring)) {
+		PMD_DRV_LOG(DEBUG, "Pointer to rxq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so the remaining mbufs need to be freed more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -142,6 +237,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -157,4 +275,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 639dc86..3acfaaf 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -514,12 +514,131 @@
 					     &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing the tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		ice_tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	ice_tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 static void __attribute__((cold))
 ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
 	_ice_rx_queue_release_mbufs_vec(rxq);
 }
 
+static void __attribute__((cold))
+ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	_ice_tx_queue_release_mbufs_vec(txq);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -531,7 +650,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue __rte_unused *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
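
In the Tx path above, ice_tx_free_bufs() recycles completed descriptors in batches of
tx_rs_thresh: once the DD bit of the threshold descriptor is set, the corresponding
mbufs are collected per mempool and returned with a single rte_mempool_put_bulk() call,
with a fallback to per-mbuf puts when the first prefree fails. The same grouping
pattern, reduced to a standalone helper, is sketched below; the function name and the
fixed 64-entry scratch array (ICE_TX_MAX_FREE_BUF_SZ in the patch) are illustrative
assumptions, not code from the patch.

#include <rte_common.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Free an array of mbufs, batching mempool puts per pool. */
static void
free_mbufs_bulk(struct rte_mbuf **mbufs, unsigned int n)
{
	struct rte_mbuf *free[64];
	unsigned int nb_free = 0, i;

	for (i = 0; i < n; i++) {
		struct rte_mbuf *m = rte_pktmbuf_prefree_seg(mbufs[i]);

		if (!m)
			continue;
		/* flush the batch when the pool changes or the array is full */
		if (nb_free > 0 && (m->pool != free[0]->pool ||
				    nb_free == RTE_DIM(free))) {
			rte_mempool_put_bulk(free[0]->pool, (void **)free,
					     nb_free);
			nb_free = 0;
		}
		free[nb_free++] = m;
	}
	if (nb_free > 0)
		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
}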

* [PATCH v6 6/8] net/ice: support Rx AVX2 vector
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (4 preceding siblings ...)
  2019-03-25  6:06   ` [PATCH v6 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 7/8] net/ice: support Rx scatter " Wenzhuo Lu
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  16 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 622 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 671 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 92594bb..5ba59f4 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index f9ecffa..6191f34 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1494,7 +1494,8 @@
 
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2236,21 +2237,30 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1dde4e7..d1c9b92 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -179,4 +179,6 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..42f761d
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,622 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/**
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
+						vaddr1, 1);
+		vaddr2_3 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
+						vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_ice_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust =
+		_mm256_set_epi16
+			(/* first descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0,          /* ignore pkt_type field */
+			 /* second descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0           /* ignore pkt_type field */
+			);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk =
+		_mm256_set_epi8
+			(/* first descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF,  /*pkt_type set as unknown */
+			 /* second descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF   /*pkt_type set as unknown */
+			);
+	/**
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/**
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask =
+		 _mm256_set1_epi32((1 << 2) | (1 << 11) |
+				   (3 << 12) | (7 << 22));
+	/**
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf =
+		_mm256_set_epi32(0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+				 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/**
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf =
+		_mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0,/* end up 128-bits */
+				0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0);
+
+	/**
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits. This gives us the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask =
+		 _mm256_set1_epi32(PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+				   PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+				   PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+
+	for (i = 0, received = 0; i < nb_pkts;
+	     i += ICE_DESCS_PER_LOOP_AVX,
+	     rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				    _mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256
+			((void *)&rx_pkts[i + 4],
+			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		{
+			const __m128i raw_desc7 =
+				_mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 =
+				_mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 =
+				_mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 =
+				_mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 =
+				_mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 =
+				_mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 =
+				_mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 =
+				_mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc6),
+					 raw_desc7, 1);
+			raw_desc4_5 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc4),
+					 raw_desc5, 1);
+			raw_desc2_3 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc2),
+					 raw_desc3, 1);
+			raw_desc0_1 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc0),
+					 raw_desc1, 1);
+		}
+
+		if (split_packet) {
+			int j;
+
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/**
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7,
+							 PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5,
+							 PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7,
+							   len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5,
+							   len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/**
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * and so ptype is in lower 8-bits in each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/**
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3,
+							 PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1,
+							 PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3,
+							   len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1,
+							   len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+								desc0_1);
+
+		/**
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+							  status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits =
+			_mm256_and_si256(status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags =
+			_mm256_shuffle_epi8(vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags =
+			_mm256_shuffle_epi8(rss_flags_shuf,
+					    _mm256_srli_epi32(flag_bits, 11));
+		/**
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in l3_l4_flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/**
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init
+		 * data so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend
+		 * for each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				 offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				 RTE_ALIGN(offsetof(struct rte_mbuf,
+						    rearm_data),
+					   16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+			rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 8),
+					    0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 4),
+					    0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(mbuf_flags, 4),
+					    0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
+				    rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
+				    rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
+				    rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
+				    rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags =
+			_mm256_castsi128_si256
+				(_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 8),
+					    0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 4),
+					    0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(odd_flags, 4),
+					    0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
+				    rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
+				    rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
+				    rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
+				    rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask =
+				_mm_set1_epi16(1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+								     eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits =
+				_mm_packus_epi32
+					(_mm256_castsi256_si128(eop_bits256),
+					 _mm256_extractf128_si256(eop_bits256,
+								  1));
+			/**
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit, i.e. !EOP, rather than an EOP bit.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/**
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle =
+				_mm_set_epi8(/* zero hi 64b */
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     /* move values to lo 64b */
+					     8, 0, 10, 2,
+					     12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet =
+				_mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+					       _mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll
+					(_mm_cvtsi128_si64
+						(_mm256_extracti128_si256
+							(status0_7, 1)));
+		burst += __builtin_popcountll
+				(_mm_cvtsi128_si64
+					(_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are returned
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts)
+{
+	return _ice_recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 469264d..2bec688 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -14,4 +14,19 @@ includes += include_directories('base')
 
 if arch_subdir == 'x86'
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. AVX2 is supported in the minimum instruction set baseline
+	# b. it is not in the minimum instruction set, but is supported by the compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v6 7/8] net/ice: support Rx scatter AVX2 vector
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (5 preceding siblings ...)
  2019-03-25  6:06   ` [PATCH v6 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  6:06   ` [PATCH v6 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  2019-03-25  7:56   ` [PATCH v6 0/8] Support vector instructions on ICE Zhang, Qi Z
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 6191f34..34b8386 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1495,7 +1495,8 @@
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2251,9 +2252,12 @@ void __attribute__((cold))
 
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index d1c9b92..dfc3224 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -181,4 +181,7 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 42f761d..2459ff3 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -620,3 +620,67 @@
 {
 	return _ice_recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles a single burst of 32 scattered packets
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are returned
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _ice_recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+						       split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + ice_rx_reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+					     &split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes.
+ * Notice:
+ * - if nb_pkts < ICE_DESCS_PER_LOOP, no packets are returned
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v6 8/8] net/ice: support vector AVX2 in TX
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (6 preceding siblings ...)
  2019-03-25  6:06   ` [PATCH v6 7/8] net/ice: support Rx scatter " Wenzhuo Lu
@ 2019-03-25  6:06   ` Wenzhuo Lu
  2019-03-25  7:56   ` [PATCH v6 0/8] Support vector instructions on ICE Zhang, Qi Z
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-25  6:06 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/ice.rst                |  18 ++++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  13 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
 5 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/ice.rst b/doc/guides/nics/ice.rst
index 3998d5e..fdbc02e 100644
--- a/doc/guides/nics/ice.rst
+++ b/doc/guides/nics/ice.rst
@@ -64,6 +64,24 @@ Driver compilation and testing
 Refer to the document :ref:`compiling and testing a PMD for a NIC <pmd_build_and_test>`
 for details.
 
+Features
+--------
+
+Vector PMD
+~~~~~~~~~~
+
+The vector PMD for the Rx and Tx paths is selected automatically. The paths
+are chosen based on two conditions.
+
+- ``CPU``
+  On the x86 platform, the driver checks whether the CPU supports AVX2.
+  If it is supported, the AVX2 paths are chosen; if not, SSE is chosen.
+
+- ``Offload features``
+  The supported HW offload features are described in the document ice_vec.ini.
+  If any unsupported features are used, the ICE vector PMD is disabled and the
+  normal paths are chosen.
+
 Sample Application Notes
 ------------------------
 
diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index 61a2c73..610c4cd 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -91,6 +91,10 @@ New Features
 
   * Added promiscuous mode support.
 
+* **Added support for vector instructions on ICE.**
+
+   Added support for SSE and AVX2 instructions in the ICE Rx and Tx path.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 34b8386..4a09457 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2349,15 +2349,24 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index dfc3224..64e9f20 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -184,4 +184,6 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 2459ff3..fac869a 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -684,3 +684,161 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on 32-byte boundary, do one to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 =
+			_mm256_set_epi64x
+				(hi_qw3,
+				 pkt[3]->buf_physaddr + pkt[3]->data_off,
+				 hi_qw2,
+				 pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 =
+			_mm256_set_epi64x
+				(hi_qw1,
+				 pkt[1]->buf_physaddr + pkt[1]->data_off,
+				 hi_qw0,
+				 pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* crossing tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		ice_tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reaching the end of the ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	ice_tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		       uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v6 0/8] Support vector instructions on ICE
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
                     ` (7 preceding siblings ...)
  2019-03-25  6:06   ` [PATCH v6 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
@ 2019-03-25  7:56   ` Zhang, Qi Z
  8 siblings, 0 replies; 121+ messages in thread
From: Zhang, Qi Z @ 2019-03-25  7:56 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev; +Cc: Lu, Wenzhuo



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Monday, March 25, 2019 2:06 PM
> To: dev@dpdk.org
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>
> Subject: [dpdk-dev] [PATCH v6 0/8] Support vector instructions on ICE
> 
> Use SSE and AVX2 instructions in ICE RX and TX path.
> 
> ---
> v2:
>  - Updated feature doc.
>  - Fixed checklog and checkpatch issues.
> 
> v3:
>  - Fixed potential compile issue on non-X86 platform.
> 
> v4:
>  - Removed compile configure, CONFIG_RTE_LIBRTE_ICE_INC_VECTOR.
>  - Fixed checkpatch warnings.
>  - Added more explanation of vector path in the device document.
>  - Some other minor change.
> 
> v5:
>  - Fixed a compile issue.
>  - Fixed a doc build warning.
> 
> v6:
>  - Added prefix "ice_" for ICE specific functions.
>  - Added unlikely for rarely used code.
> 
> Wenzhuo Lu (8):
>   net/ice: fix Tx function setting
>   net/ice: add pointer for queue buffer release
>   net/ice: support vector SSE in RX
>   net/ice: support Rx scatter SSE vector
>   net/ice: support Tx SSE vector
>   net/ice: support Rx AVX2 vector
>   net/ice: support Rx scatter AVX2 vector
>   net/ice: support vector AVX2 in TX
> 
>  doc/guides/nics/features/ice_vec.ini   |  35 ++
>  doc/guides/nics/ice.rst                |  18 +
>  doc/guides/rel_notes/release_19_05.rst |   4 +
>  drivers/net/ice/Makefile               |  22 +
>  drivers/net/ice/ice_ethdev.c           |   3 +-
>  drivers/net/ice/ice_ethdev.h           |   2 +
>  drivers/net/ice/ice_rxtx.c             |  99 +++-
>  drivers/net/ice/ice_rxtx.h             |  39 +-
>  drivers/net/ice/ice_rxtx_vec_avx2.c    | 844
> +++++++++++++++++++++++++++++++++
>  drivers/net/ice/ice_rxtx_vec_common.h  | 293 ++++++++++++
>  drivers/net/ice/ice_rxtx_vec_sse.c     | 672
> ++++++++++++++++++++++++++
>  drivers/net/ice/meson.build            |  19 +
>  12 files changed, 2035 insertions(+), 15 deletions(-)  create mode 100644
> doc/guides/nics/features/ice_vec.ini
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
>  create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c
> 
> --
> 1.9.3

Synced on dpdk-next-net-intel.

Thanks
Qi

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-25  2:22       ` Lu, Wenzhuo
@ 2019-03-25  8:26         ` Maxime Coquelin
  2019-03-26  1:00           ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-25  8:26 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev

Hi,

On 3/25/19 3:22 AM, Lu, Wenzhuo wrote:
> Hi Maxime,
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Friday, March 22, 2019 6:12 PM
>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector
> 
> 
>>> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
>>
>> I see same is done for other Intel NICs, but I wonder what would be the
>> performance cost of making it dynamic, if any cost?
> Currently we don't have a good idea to make it dynamic. If we use pointer to point to different functions for 16 byte and 32 byte, there's too much duplicate code to make it hard to maintain. If we use the same function, and check the configure in it. It impacts the performance.

Have you done some measurements? What would be the performance impact?

> As HW does not support to change the configuration dynamically. The device must be stopped and restarted if the configuration is changed. It's not very helpful to make it a dynamic configuration. We assume that the users can make their choice at the beginning and will not change it.

The problem is that the user has to recompile to switch between the two
configurations. And it may not be an option for the user if he uses dpdk 
packaged by a distribution, for example.

Maybe I was not clear, but I don't mean to be able to switch mode while 
the port is started. I think it would be better to make it possible to 
switch mode at application startup time.

> 
>>
>> Having it dynamic (as a dev arg for instance) would make it possible to
>> change the value when the user is using dpdk from a distro. It would also
>> help testing coverage.
>>
>> Btw, how do you select this option with meson build system?
> Not very familiar with meson. As I know, we can change the meson.build to add the configure.
> 

Ok, then please try to do it, because the legacy build system is going
to be deprecated.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v6 2/8] net/ice: add pointer for queue buffer release
  2019-03-25  6:06   ` [PATCH v6 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-25 13:23     ` Maxime Coquelin
  2019-03-26  1:15       ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-25 13:23 UTC (permalink / raw)
  To: Wenzhuo Lu, dev



On 3/25/19 7:06 AM, Wenzhuo Lu wrote:
> Add function pointers of buffer releasing for RX and
> TX queues, for vector functions will be added for RX
> and TX.
> 
> Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> ---
>   drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
>   drivers/net/ice/ice_rxtx.h |  5 +++++
>   2 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> index c794ee8..d540ed1 100644
> --- a/drivers/net/ice/ice_rxtx.c
> +++ b/drivers/net/ice/ice_rxtx.c
> @@ -366,7 +366,7 @@
>   		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
>   			    rx_queue_id);
>   
> -		ice_rx_queue_release_mbufs(rxq);
> +		rxq->rx_rel_mbufs(rxq);
>   		ice_reset_rx_queue(rxq);
>   		return -EINVAL;
>   	}
> @@ -393,7 +393,7 @@
>   				    rx_queue_id);
>   			return -EINVAL;
>   		}
> -		ice_rx_queue_release_mbufs(rxq);
> +		rxq->rx_rel_mbufs(rxq);
>   		ice_reset_rx_queue(rxq);
>   		dev->data->rx_queue_state[rx_queue_id] =
>   			RTE_ETH_QUEUE_STATE_STOPPED;
> @@ -555,7 +555,7 @@
>   		return -EINVAL;
>   	}
>   
> -	ice_tx_queue_release_mbufs(txq);
> +	txq->tx_rel_mbufs(txq);
>   	ice_reset_tx_queue(txq);
>   	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
>   
> @@ -669,6 +669,7 @@
>   	ice_reset_rx_queue(rxq);
>   	rxq->q_set = TRUE;
>   	dev->data->rx_queues[queue_idx] = rxq;
> +	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
>   
>   	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
>   
> @@ -701,7 +702,7 @@
>   		return;
>   	}
>   
> -	ice_rx_queue_release_mbufs(q);
> +	q->rx_rel_mbufs(q);
>   	rte_free(q->sw_ring);
>   	rte_free(q);
>   }
> @@ -866,6 +867,7 @@
>   	ice_reset_tx_queue(txq);
>   	txq->q_set = TRUE;
>   	dev->data->tx_queues[queue_idx] = txq;
> +	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;

I think it would be cleaner to have ice_rx_queue_release_mbufs() call
the callback, and to rename the current ice_rx_queue_release_mbufs() to
something else (a minimal sketch follows this message).

It would make the code more consistent IMHO, and would avoid patching all
the call sites.


>   
>   	return 0;
>   }
> @@ -880,7 +882,7 @@
>   		return;
>   	}
>   
> -	ice_tx_queue_release_mbufs(q);
> +	q->tx_rel_mbufs(q);
>   	rte_free(q->sw_ring);
>   	rte_free(q);
>   }
> @@ -1552,18 +1554,22 @@
>   void
>   ice_clear_queues(struct rte_eth_dev *dev)
>   {
> +	struct ice_rx_queue *rxq;
> +	struct ice_tx_queue *txq;
>   	uint16_t i;
>   
>   	PMD_INIT_FUNC_TRACE();
>   
>   	for (i = 0; i < dev->data->nb_tx_queues; i++) {
> -		ice_tx_queue_release_mbufs(dev->data->tx_queues[i]);
> -		ice_reset_tx_queue(dev->data->tx_queues[i]);
> +		txq = dev->data->tx_queues[i];
> +		txq->tx_rel_mbufs(txq);
> +		ice_reset_tx_queue(txq);
>   	}
>   
>   	for (i = 0; i < dev->data->nb_rx_queues; i++) {
> -		ice_rx_queue_release_mbufs(dev->data->rx_queues[i]);
> -		ice_reset_rx_queue(dev->data->rx_queues[i]);
> +		rxq = dev->data->rx_queues[i];
> +		rxq->rx_rel_mbufs(rxq);
> +		ice_reset_rx_queue(rxq);
>   	}
>   }
>   
> diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
> index ec0e52e..78b4928 100644
> --- a/drivers/net/ice/ice_rxtx.h
> +++ b/drivers/net/ice/ice_rxtx.h
> @@ -27,6 +27,9 @@
>   
>   #define ICE_SUPPORT_CHAIN_NUM 5
>   
> +typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
> +typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
> +
>   struct ice_rx_entry {
>   	struct rte_mbuf *mbuf;
>   };
> @@ -61,6 +64,7 @@ struct ice_rx_queue {
>   	uint16_t max_pkt_len; /* Maximum packet length */
>   	bool q_set; /* indicate if rx queue has been configured */
>   	bool rx_deferred_start; /* don't start this queue in dev start */
> +	ice_rx_release_mbufs_t rx_rel_mbufs;
>   };
>   
>   struct ice_tx_entry {
> @@ -100,6 +104,7 @@ struct ice_tx_queue {
>   	uint16_t tx_next_rs;
>   	bool tx_deferred_start; /* don't start this queue in dev start */
>   	bool q_set; /* indicate if tx queue has been configured */
> +	ice_tx_release_mbufs_t tx_rel_mbufs;
>   };
>   
>   /* Offload features */
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread
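
A minimal sketch of the indirection suggested above, and which the v7
revision of patch 2/8 later in this thread adopts: the helper keeps its
public name and simply dispatches through the per-queue callback, so none
of the call sites need to be patched, and the vector code can later install
its own release function. This is only an illustration against the types in
the driver's ice_rxtx.h (struct ice_rx_queue, sw_ring, rx_rel_mbufs); the
scalar body is simplified, and the vector variant named in the last comment
comes from the SSE patch in this series.

/* scalar implementation, previously named ice_rx_queue_release_mbufs() */
static void
_ice_rx_queue_release_mbufs(struct ice_rx_queue *rxq)
{
	uint16_t i;

	for (i = 0; i < rxq->nb_rx_desc; i++) {
		if (rxq->sw_ring[i].mbuf) {
			rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
			rxq->sw_ring[i].mbuf = NULL;
		}
	}
}

/* wrapper kept at the original name: existing call sites stay untouched */
static void
ice_rx_queue_release_mbufs(struct ice_rx_queue *rxq)
{
	rxq->rx_rel_mbufs(rxq);
}

/* ice_rx_queue_setup() installs the scalar implementation once:
 *	rxq->rx_rel_mbufs = _ice_rx_queue_release_mbufs;
 * and ice_rxq_vec_setup() can install the vector variant instead:
 *	rxq->rx_rel_mbufs = _ice_rx_queue_release_mbufs_vec;
 */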

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-25  8:26         ` Maxime Coquelin
@ 2019-03-26  1:00           ` Lu, Wenzhuo
  2019-03-26  9:28             ` Maxime Coquelin
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-26  1:00 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, March 25, 2019 4:26 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector
> 
> Hi,
> 
> On 3/25/19 3:22 AM, Lu, Wenzhuo wrote:
> > Hi Maxime,
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Friday, March 22, 2019 6:12 PM
> >> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
> >> vector
> >
> >
> >>> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> >>
> >> I see same is done for other Intel NICs, but I wonder what would be
> >> the performance cost of making it dynamic, if any cost?
> > Currently we don't have a good idea to make it dynamic. If we use pointer
> to point to different functions for 16 byte and 32 byte, there's too much
> duplicate code to make it hard to maintain. If we use the same function, and
> check the configure in it. It impacts the performance.
> 
> Have you done some measurements, what would be the performance
> impact?
I mean that if we check at run time whether the descriptor is 16 byte or 32 byte, this check will consume extra CPU cycles.
That's why I think the better way is to have separate paths for 16 byte and 32 byte, and to choose the appropriate path at the beginning.

> 
> > As HW does not support to change the configuration dynamically. The
> device must be stopped and restarted if the configuration is changed. It's not
> very helpful to make it a dynamic configuration. We assume that the users
> can make their choice at the beginning and will not change it.
> 
> The problem is that the user has to recompile to switch between the two
> configurations. And it may not be an option for the user if he uses dpdk
> packaged by a distribution, for example.
> 
> Maybe I was not clear, but I don't mean to be able to switch mode while the
> port is started. I think it would be better to make it possible to switch mode
> at application startup time.
Yes, I understand the problem is the recompiling. But we think the users will not change it after they have made their decision; that's why it was acceptable in previous drivers.
Agreed, it's better to remove all the compile-time configuration. That looks like what we're trying to do; we'd like to think about how to optimize it later.


> 
> >
> >>
> >> Having it dynamic (as a dev arg for instance) would make it possible
> >> to change the value when the user is using dpdk from a distro. It
> >> would also help testing coverage.
> >>
> >> Btw, how do you select this option with meson build system?
> > Not very familiar with meson. As I know, we can change the meson.build
> to add the configure.
> >
> 
> Ok, then please try to do it, because the legacy build system is going to be
> deprecated.

^ permalink raw reply	[flat|nested] 121+ messages in thread
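
A minimal sketch (not part of this series) of what the startup-time
selection discussed in this sub-thread could look like if the descriptor
format were exposed as a device argument instead of a build-time #define.
The devarg name "rx-desc-32b", the use_32b_desc flag and the 16-byte Rx
function named in the closing comment are hypothetical; the point is only
that the choice is parsed once at probe time and then selects a function
pointer, so the hot Rx loop never tests the descriptor format per packet.

#include <stdbool.h>
#include <string.h>

#include <rte_common.h>
#include <rte_devargs.h>
#include <rte_kvargs.h>

#define ICE_DEVARG_RX_DESC_32B	"rx-desc-32b"	/* hypothetical devarg */

/* kvargs handler: interpret "1" as true, anything else as false */
static int
ice_parse_bool(const char *key __rte_unused, const char *value, void *opaque)
{
	*(bool *)opaque = (strcmp(value, "1") == 0);
	return 0;
}

/* parse the hypothetical devarg once at probe time */
static int
ice_parse_rx_desc_devarg(struct rte_devargs *devargs, bool *use_32b_desc)
{
	static const char * const keys[] = { ICE_DEVARG_RX_DESC_32B, NULL };
	struct rte_kvargs *kvlist;
	int ret;

	*use_32b_desc = true;	/* keep today's default: 32-byte descriptors */
	if (!devargs)
		return 0;

	kvlist = rte_kvargs_parse(devargs->args, keys);
	if (!kvlist)
		return -EINVAL;

	ret = rte_kvargs_process(kvlist, ICE_DEVARG_RX_DESC_32B,
				 ice_parse_bool, use_32b_desc);
	rte_kvargs_free(kvlist);
	return ret;
}

/* Later, when the Rx burst function is selected (once, not per packet):
 *	dev->rx_pkt_burst = use_32b_desc ? ice_recv_pkts_vec_avx2
 *					 : ice_recv_pkts_vec_avx2_16b;
 * where ice_recv_pkts_vec_avx2_16b would be a separate, hypothetical
 * 16-byte-descriptor path.
 */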

* Re: [PATCH v6 2/8] net/ice: add pointer for queue buffer release
  2019-03-25 13:23     ` Maxime Coquelin
@ 2019-03-26  1:15       ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-26  1:15 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,


> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, March 25, 2019 9:24 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v6 2/8] net/ice: add pointer for queue buffer
> release
> 
> 
> 
> On 3/25/19 7:06 AM, Wenzhuo Lu wrote:
> > Add function pointers of buffer releasing for RX and TX queues, for
> > vector functions will be added for RX and TX.
> >
> > Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
> > ---
> >   drivers/net/ice/ice_rxtx.c | 24 +++++++++++++++---------
> >   drivers/net/ice/ice_rxtx.h |  5 +++++
> >   2 files changed, 20 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> > index c794ee8..d540ed1 100644
> > --- a/drivers/net/ice/ice_rxtx.c
> > +++ b/drivers/net/ice/ice_rxtx.c
> > @@ -366,7 +366,7 @@
> >   		PMD_DRV_LOG(ERR, "Failed to switch RX queue %u on",
> >   			    rx_queue_id);
> >
> > -		ice_rx_queue_release_mbufs(rxq);
> > +		rxq->rx_rel_mbufs(rxq);
> >   		ice_reset_rx_queue(rxq);
> >   		return -EINVAL;
> >   	}
> > @@ -393,7 +393,7 @@
> >   				    rx_queue_id);
> >   			return -EINVAL;
> >   		}
> > -		ice_rx_queue_release_mbufs(rxq);
> > +		rxq->rx_rel_mbufs(rxq);
> >   		ice_reset_rx_queue(rxq);
> >   		dev->data->rx_queue_state[rx_queue_id] =
> >   			RTE_ETH_QUEUE_STATE_STOPPED;
> > @@ -555,7 +555,7 @@
> >   		return -EINVAL;
> >   	}
> >
> > -	ice_tx_queue_release_mbufs(txq);
> > +	txq->tx_rel_mbufs(txq);
> >   	ice_reset_tx_queue(txq);
> >   	dev->data->tx_queue_state[tx_queue_id] =
> > RTE_ETH_QUEUE_STATE_STOPPED;
> >
> > @@ -669,6 +669,7 @@
> >   	ice_reset_rx_queue(rxq);
> >   	rxq->q_set = TRUE;
> >   	dev->data->rx_queues[queue_idx] = rxq;
> > +	rxq->rx_rel_mbufs = ice_rx_queue_release_mbufs;
> >
> >   	use_def_burst_func =
> > ice_check_rx_burst_bulk_alloc_preconditions(rxq);
> >
> > @@ -701,7 +702,7 @@
> >   		return;
> >   	}
> >
> > -	ice_rx_queue_release_mbufs(q);
> > +	q->rx_rel_mbufs(q);
> >   	rte_free(q->sw_ring);
> >   	rte_free(q);
> >   }
> > @@ -866,6 +867,7 @@
> >   	ice_reset_tx_queue(txq);
> >   	txq->q_set = TRUE;
> >   	dev->data->tx_queues[queue_idx] = txq;
> > +	txq->tx_rel_mbufs = ice_tx_queue_release_mbufs;
> 
> I think it could be cleaner to have ice_rx_queue_release_mbufs() to call the
> callback. So you would rename current ice_rx_queue_release_mbufs to
> something else.
> 
> I would make the code more consistent IMHO, and would avoid to patch all
> call sites.
Good suggestion. I'll change it.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v7 0/8] Support vector instructions on ICE
  2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
                   ` (13 preceding siblings ...)
  2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
@ 2019-03-26  6:16 ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 1/8] net/ice: fix Tx function setting Wenzhuo Lu
                     ` (8 more replies)
  14 siblings, 9 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Use SSE and AVX2 instructions in ICE RX and TX path.

---
v2:
 - Updated feature doc.
 - Fixed checklog and checkpatch issues.

v3:
 - Fixed potential compile issue on non-X86 platform.

v4:
 - Removed the compile-time option CONFIG_RTE_LIBRTE_ICE_INC_VECTOR.
 - Fixed checkpatch warnings.
 - Added more explanation of the vector path in the device document.
 - Some other minor changes.

v5:
 - Fixed a compile issue.
 - Fixed a doc build warning.

v6:
 - Added prefix "ice_" for ICE specific functions.
 - Added unlikely for rarely used code.

v7:
 - Kept the original buffer release functions.

Wenzhuo Lu (8):
  net/ice: fix Tx function setting
  net/ice: add pointer for queue buffer release
  net/ice: support vector SSE in RX
  net/ice: support Rx scatter SSE vector
  net/ice: support Tx SSE vector
  net/ice: support Rx AVX2 vector
  net/ice: support Rx scatter AVX2 vector
  net/ice: support vector AVX2 in TX

 doc/guides/nics/features/ice_vec.ini   |  35 ++
 doc/guides/nics/ice.rst                |  18 +
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/Makefile               |  22 +
 drivers/net/ice/ice_ethdev.c           |   3 +-
 drivers/net/ice/ice_ethdev.h           |   2 +
 drivers/net/ice/ice_rxtx.c             |  92 +++-
 drivers/net/ice/ice_rxtx.h             |  39 +-
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 844 +++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_common.h  | 293 ++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c     | 660 ++++++++++++++++++++++++++
 drivers/net/ice/meson.build            |  19 +
 12 files changed, 2023 insertions(+), 8 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v7 1/8] net/ice: fix Tx function setting
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, stable

The Tx setting function is not called.

Fixes: 17c7d0f9d6a4 ("net/ice: support basic Rx/Tx")
Cc: stable@dpdk.org

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index cdb5502..7233c3e 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1741,6 +1741,7 @@ static int ice_init_rss(struct ice_pf *pf)
 	}
 
 	ice_set_rx_function(dev);
+	ice_set_tx_function(dev);
 
 	mask = ETH_VLAN_STRIP_MASK | ETH_VLAN_FILTER_MASK |
 			ETH_VLAN_EXTEND_MASK;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v7 2/8] net/ice: add pointer for queue buffer release
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 1/8] net/ice: fix Tx function setting Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Add function pointers for releasing the buffers of Rx and
Tx queues, because vector release functions will be added
for Rx and Tx.

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c | 17 +++++++++++++++--
 drivers/net/ice/ice_rxtx.h |  5 +++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index c794ee8..22e1fb5 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -165,7 +165,7 @@
 
 /* Free all mbufs for descriptors in rx queue */
 static void
-ice_rx_queue_release_mbufs(struct ice_rx_queue *rxq)
+_ice_rx_queue_release_mbufs(struct ice_rx_queue *rxq)
 {
 	uint16_t i;
 
@@ -193,6 +193,12 @@
 #endif /* RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC */
 }
 
+static void
+ice_rx_queue_release_mbufs(struct ice_rx_queue *rxq)
+{
+	rxq->rx_rel_mbufs(rxq);
+}
+
 /* turn on or off rx queue
  * @q_idx: queue index in pf scope
  * @on: turn on or off the queue
@@ -468,7 +474,7 @@
 
 /* Free all mbufs for descriptors in tx queue */
 static void
-ice_tx_queue_release_mbufs(struct ice_tx_queue *txq)
+_ice_tx_queue_release_mbufs(struct ice_tx_queue *txq)
 {
 	uint16_t i;
 
@@ -484,6 +490,11 @@
 		}
 	}
 }
+static void
+ice_tx_queue_release_mbufs(struct ice_tx_queue *txq)
+{
+	txq->tx_rel_mbufs(txq);
+}
 
 static void
 ice_reset_tx_queue(struct ice_tx_queue *txq)
@@ -669,6 +680,7 @@
 	ice_reset_rx_queue(rxq);
 	rxq->q_set = TRUE;
 	dev->data->rx_queues[queue_idx] = rxq;
+	rxq->rx_rel_mbufs = _ice_rx_queue_release_mbufs;
 
 	use_def_burst_func = ice_check_rx_burst_bulk_alloc_preconditions(rxq);
 
@@ -866,6 +878,7 @@
 	ice_reset_tx_queue(txq);
 	txq->q_set = TRUE;
 	dev->data->tx_queues[queue_idx] = txq;
+	txq->tx_rel_mbufs = _ice_tx_queue_release_mbufs;
 
 	return 0;
 }
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index ec0e52e..78b4928 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,9 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
+typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
+
 struct ice_rx_entry {
 	struct rte_mbuf *mbuf;
 };
@@ -61,6 +64,7 @@ struct ice_rx_queue {
 	uint16_t max_pkt_len; /* Maximum packet length */
 	bool q_set; /* indicate if rx queue has been configured */
 	bool rx_deferred_start; /* don't start this queue in dev start */
+	ice_rx_release_mbufs_t rx_rel_mbufs;
 };
 
 struct ice_tx_entry {
@@ -100,6 +104,7 @@ struct ice_tx_queue {
 	uint16_t tx_next_rs;
 	bool tx_deferred_start; /* don't start this queue in dev start */
 	bool q_set; /* indicate if tx queue has been configured */
+	ice_tx_release_mbufs_t tx_rel_mbufs;
 };
 
 /* Offload features */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v7 3/8] net/ice: support vector SSE in RX
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 1/8] net/ice: fix Tx function setting Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |  33 +++
 drivers/net/ice/Makefile              |   3 +
 drivers/net/ice/ice_ethdev.c          |   2 -
 drivers/net/ice/ice_ethdev.h          |   2 +
 drivers/net/ice/ice_rxtx.c            |  27 +-
 drivers/net/ice/ice_rxtx.h            |  21 +-
 drivers/net/ice/ice_rxtx_vec_common.h | 160 +++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 490 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build           |   4 +
 9 files changed, 736 insertions(+), 6 deletions(-)
 create mode 100644 doc/guides/nics/features/ice_vec.ini
 create mode 100644 drivers/net/ice/ice_rxtx_vec_common.h
 create mode 100644 drivers/net/ice/ice_rxtx_vec_sse.c

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
new file mode 100644
index 0000000..1a19788
--- /dev/null
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -0,0 +1,33 @@
+;
+; Supported features of the 'ice_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = Y
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+MTU update           = Y
+Jumbo frame          = Y
+Scattered Rx         = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+RSS hash             = Y
+RSS key update       = Y
+RSS reta update      = Y
+VLAN filter          = Y
+Packet type parsing  = Y
+Rx descriptor status = Y
+Basic stats          = Y
+Extended stats       = Y
+FW version           = Y
+Module EEPROM dump   = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-32               = Y
+x86-64               = Y
diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 61846ca..92594bb 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -54,5 +54,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_flow.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx.c
+ifeq ($(CONFIG_RTE_ARCH_X86), y)
+SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
+endif
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 7233c3e..c468962 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -2,8 +2,6 @@
  * Copyright(c) 2018 Intel Corporation
  */
 
-#include <rte_ethdev_pci.h>
-
 #include "base/ice_sched.h"
 #include "ice_ethdev.h"
 #include "ice_rxtx.h"
diff --git a/drivers/net/ice/ice_ethdev.h b/drivers/net/ice/ice_ethdev.h
index 3cefa5b..151a09e 100644
--- a/drivers/net/ice/ice_ethdev.h
+++ b/drivers/net/ice/ice_ethdev.h
@@ -7,6 +7,8 @@
 
 #include <rte_kvargs.h>
 
+#include <rte_ethdev_pci.h>
+
 #include "base/ice_common.h"
 #include "base/ice_adminq_cmd.h"
 
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 22e1fb5..1c6121f 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -7,8 +7,6 @@
 
 #include "ice_rxtx.h"
 
-#define ICE_TD_CMD ICE_TX_DESC_CMD_EOP
-
 #define ICE_TX_CKSUM_OFFLOAD_MASK (		 \
 		PKT_TX_IP_CKSUM |		 \
 		PKT_TX_L4_MASK |		 \
@@ -325,6 +323,9 @@
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+
+	rxq->rxrearm_start = 0;
+	rxq->rxrearm_nb = 0;
 }
 
 int
@@ -1501,6 +1502,12 @@
 #endif
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts)
 		return ptypes;
+
+#ifdef RTE_ARCH_X86
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+		return ptypes;
+#endif
+
 	return NULL;
 }
 
@@ -2232,6 +2239,22 @@ void __attribute__((cold))
 	PMD_INIT_FUNC_TRACE();
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_rx_queue *rxq;
+	int i;
+
+	if (!ice_rx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_rx_queues; i++) {
+			rxq = dev->data->rx_queues[i];
+			(void)ice_rxq_vec_setup(rxq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			    dev->data->port_id);
+		dev->rx_pkt_burst = ice_recv_pkts_vec;
+
+		return;
+	}
+#endif
 
 	if (dev->data->scattered_rx) {
 		/* Set the non-LRO scattered function */
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 78b4928..656ca0d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -27,6 +27,15 @@
 
 #define ICE_SUPPORT_CHAIN_NUM 5
 
+#define ICE_TD_CMD                      ICE_TX_DESC_CMD_EOP
+
+#define ICE_VPMD_RX_BURST           32
+#define ICE_VPMD_TX_BURST           32
+#define ICE_RXQ_REARM_THRESH        32
+#define ICE_MAX_RX_BURST            ICE_RXQ_REARM_THRESH
+#define ICE_TX_MAX_FREE_BUF_SZ      64
+#define ICE_DESCS_PER_LOOP          4
+
 typedef void (*ice_rx_release_mbufs_t)(struct ice_rx_queue *rxq);
 typedef void (*ice_tx_release_mbufs_t)(struct ice_tx_queue *txq);
 
@@ -45,13 +54,16 @@ struct ice_rx_queue {
 	uint16_t nb_rx_hold; /* number of held free RX desc */
 	struct rte_mbuf *pkt_first_seg; /**< first segment of current packet */
 	struct rte_mbuf *pkt_last_seg; /**< last segment of current packet */
-#ifdef RTE_LIBRTE_ICE_RX_ALLOW_BULK_ALLOC
 	uint16_t rx_nb_avail; /**< number of staged packets ready */
 	uint16_t rx_next_avail; /**< index of next staged packets */
 	uint16_t rx_free_trigger; /**< triggers rx buffer allocation */
 	struct rte_mbuf fake_mbuf; /**< dummy mbuf */
 	struct rte_mbuf *rx_stage[ICE_RX_MAX_BURST * 2];
-#endif
+
+	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
+	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint64_t mbuf_initializer; /**< value to init mbufs */
+
 	uint8_t port_id; /* device port ID */
 	uint8_t crc_len; /* 0 if CRC stripped, 4 otherwise */
 	uint16_t queue_id; /* RX queue index */
@@ -156,4 +168,9 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_tx_descriptor_status(void *tx_queue, uint16_t offset);
 void ice_set_default_ptype_table(struct rte_eth_dev *dev);
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
+
+int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
new file mode 100644
index 0000000..d41232d
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _ICE_RXTX_VEC_COMMON_H_
+#define _ICE_RXTX_VEC_COMMON_H_
+
+#include "ice_rxtx.h"
+
+static inline uint16_t
+ice_rx_reassemble_packets(struct ice_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+			  uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[ICE_VPMD_RX_BURST] = {0}; /* finished pkts */
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned int pkt_idx, buf_idx;
+
+	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
+		if (end) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len) {
+					end->data_len -= rxq->crc_len;
+				} else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+
+					start->nb_segs--;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+				}
+				pkts[pkt_idx++] = start;
+				start = NULL;
+				end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				continue;
+			}
+			start = rx_bufs[buf_idx];
+			end = start;
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	rte_memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+static inline void
+_ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
+{
+	const unsigned int mask = rxq->nb_rx_desc - 1;
+	unsigned int i;
+
+	if (unlikely(!rxq->sw_ring)) {
+		PMD_DRV_LOG(DEBUG, "sw_ring is NULL");
+		return;
+	}
+
+	if (rxq->rxrearm_nb >= rxq->nb_rx_desc)
+		return;
+
+	/* free all mbufs that are valid in the ring */
+	if (rxq->rxrearm_nb == 0) {
+		for (i = 0; i < rxq->nb_rx_desc; i++) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	} else {
+		for (i = rxq->rx_tail;
+		     i != rxq->rxrearm_start;
+		     i = (i + 1) & mask) {
+			if (rxq->sw_ring[i].mbuf)
+				rte_pktmbuf_free_seg(rxq->sw_ring[i].mbuf);
+		}
+	}
+
+	rxq->rxrearm_nb = rxq->nb_rx_desc;
+
+	/* set all entries to NULL */
+	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
+}
+
+static inline int
+ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
+{
+	uintptr_t p;
+	struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+	mb_def.nb_segs = 1;
+	mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+	mb_def.port = rxq->port_id;
+	rte_mbuf_refcnt_set(&mb_def, 1);
+
+	/* prevent compiler reordering: rearm_data covers previous fields */
+	rte_compiler_barrier();
+	p = (uintptr_t)&mb_def.rearm_data;
+	rxq->mbuf_initializer = *(uint64_t *)p;
+	return 0;
+}
+
+static inline int
+ice_rx_vec_queue_default(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	if (!rte_is_power_of_2(rxq->nb_rx_desc))
+		return -1;
+
+	if (rxq->rx_free_thresh < ICE_VPMD_RX_BURST)
+		return -1;
+
+	if (rxq->nb_rx_desc % rxq->rx_free_thresh)
+		return -1;
+
+	return 0;
+}
+
+static inline int
+ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_rx_queue *rxq;
+
+	for (i = 0; i < dev->data->nb_rx_queues; i++) {
+		rxq = dev->data->rx_queues[i];
+		if (ice_rx_vec_queue_default(rxq))
+			return -1;
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
new file mode 100644
index 0000000..ec31e36
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -0,0 +1,490 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <tmmintrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+	struct rte_mbuf *mb0, *mb1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+					  RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				 offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+static inline void
+ice_rx_desc_to_olflags_v(struct ice_rx_queue *rxq, __m128i descs[4],
+			 struct rte_mbuf **rx_pkts)
+{
+	const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer);
+	__m128i rearm0, rearm1, rearm2, rearm3;
+
+	__m128i vlan0, vlan1, rss, l3_l4e;
+
+	/* mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication.
+	 */
+	const __m128i rss_vlan_msk = _mm_set_epi32(0x1c03804, 0x1c03804,
+						   0x1c03804, 0x1c03804);
+
+	const __m128i cksum_mask = _mm_set_epi32(PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD,
+						 PKT_RX_IP_CKSUM_GOOD |
+						 PKT_RX_IP_CKSUM_BAD |
+						 PKT_RX_L4_CKSUM_GOOD |
+						 PKT_RX_L4_CKSUM_BAD |
+						 PKT_RX_EIP_CKSUM_BAD);
+
+	/* map rss and vlan type to rss hash and vlan flag */
+	const __m128i vlan_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			0, 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED,
+			0, 0, 0, 0);
+
+	const __m128i rss_flags = _mm_set_epi8(0, 0, 0, 0,
+			0, 0, 0, 0,
+			PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH, 0, 0,
+			0, 0, PKT_RX_FDIR, 0);
+
+	const __m128i l3_l4e_flags = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	vlan0 = _mm_unpackhi_epi32(descs[0], descs[1]);
+	vlan1 = _mm_unpackhi_epi32(descs[2], descs[3]);
+	vlan0 = _mm_unpacklo_epi64(vlan0, vlan1);
+
+	vlan1 = _mm_and_si128(vlan0, rss_vlan_msk);
+	vlan0 = _mm_shuffle_epi8(vlan_flags, vlan1);
+
+	rss = _mm_srli_epi32(vlan1, 11);
+	rss = _mm_shuffle_epi8(rss_flags, rss);
+
+	l3_l4e = _mm_srli_epi32(vlan1, 22);
+	l3_l4e = _mm_shuffle_epi8(l3_l4e_flags, l3_l4e);
+	/* then we shift left 1 bit */
+	l3_l4e = _mm_slli_epi32(l3_l4e, 1);
+	/* we need to mask out the redundant bits */
+	l3_l4e = _mm_and_si128(l3_l4e, cksum_mask);
+
+	vlan0 = _mm_or_si128(vlan0, rss);
+	vlan0 = _mm_or_si128(vlan0, l3_l4e);
+
+	/**
+	 * At this point, we have the 4 sets of flags in the low 16-bits
+	 * of each 32-bit value in vlan0.
+	 * We want to extract these, and merge them with the mbuf init data
+	 * so we can do a single 16-byte write to the mbuf to set the flags
+	 * and all the other initialization fields. Extracting the
+	 * appropriate flags means that we have to do a shift and blend for
+	 * each mbuf before we do the write.
+	 */
+	rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 8), 0x10);
+	rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(vlan0, 4), 0x10);
+	rearm2 = _mm_blend_epi16(mbuf_init, vlan0, 0x10);
+	rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(vlan0, 4), 0x10);
+
+	/* write the rearm data and the olflags in one write */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+			 offsetof(struct rte_mbuf, rearm_data) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+			 RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16));
+	_mm_store_si128((__m128i *)&rx_pkts[0]->rearm_data, rearm0);
+	_mm_store_si128((__m128i *)&rx_pkts[1]->rearm_data, rearm1);
+	_mm_store_si128((__m128i *)&rx_pkts[2]->rearm_data, rearm2);
+	_mm_store_si128((__m128i *)&rx_pkts[3]->rearm_data, rearm3);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline void
+ice_rx_desc_to_ptype_v(__m128i descs[4], struct rte_mbuf **rx_pkts,
+		       uint32_t *ptype_tbl)
+{
+	__m128i ptype0 = _mm_unpackhi_epi64(descs[0], descs[1]);
+	__m128i ptype1 = _mm_unpackhi_epi64(descs[2], descs[3]);
+
+	ptype0 = _mm_srli_epi64(ptype0, 30);
+	ptype1 = _mm_srli_epi64(ptype1, 30);
+
+	rx_pkts[0]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 0)];
+	rx_pkts[1]->packet_type = ptype_tbl[_mm_extract_epi8(ptype0, 8)];
+	rx_pkts[2]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 0)];
+	rx_pkts[3]->packet_type = ptype_tbl[_mm_extract_epi8(ptype1, 8)];
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   number of DD bits
+ */
+static inline uint16_t
+_ice_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts, uint8_t *split_packet)
+{
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *sw_ring;
+	uint16_t nb_pkts_recd;
+	int pos;
+	uint64_t var;
+	__m128i shuf_msk;
+	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+
+	__m128i crc_adjust = _mm_set_epi16
+				(0, 0, 0,    /* ignore non-length fields */
+				 -rxq->crc_len, /* sub crc on data_len */
+				 0,          /* ignore high-16bits of pkt_len */
+				 -rxq->crc_len, /* sub crc on pkt_len */
+				 0, 0            /* ignore pkt_type field */
+				);
+	/**
+	 * compile-time check the above crc_adjust layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi16
+	 * call above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	__m128i dd_check, eop_check;
+
+	/* nb_pkts shall be less than or equal to ICE_MAX_RX_BURST */
+	nb_pkts = RTE_MIN(nb_pkts, ICE_MAX_RX_BURST);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP);
+
+	/* Just the act of getting into the function from the application is
+	 * going to cost about 7 cycles
+	 */
+	rxdp = rxq->rx_ring + rxq->rx_tail;
+
+	rte_prefetch0(rxdp);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+	      rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* 4 packets DD mask */
+	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
+
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
+	/* mask to shuffle from desc. to mbuf */
+	shuf_msk = _mm_set_epi8
+			(7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF  /*pkt_type set as unknown */
+			);
+	/**
+	 * Compile-time verify the shuffle mask
+	 * NOTE: some field positions already verified above, but duplicated
+	 * here for completeness in case of future modifications.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Cache is empty -> need to scan the buffer rings, but first move
+	 * the next 'n' mbufs into the cache
+	 */
+	sw_ring = &rxq->sw_ring[rxq->rx_tail];
+
+	/* A. load 4 packet in one loop
+	 * [A*. mask out 4 unused dirty field in desc]
+	 * B. copy 4 mbuf point from swring to rx_pkts
+	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
+	 * D. fill info. from desc to mbuf
+	 */
+
+	for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
+	     pos += ICE_DESCS_PER_LOOP,
+	     rxdp += ICE_DESCS_PER_LOOP) {
+		__m128i descs[ICE_DESCS_PER_LOOP];
+		__m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
+		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
+		/* 2 64 bit or 4 32 bit mbuf pointers in one XMM reg. */
+		__m128i mbp1;
+#if defined(RTE_ARCH_X86_64)
+		__m128i mbp2;
+#endif
+
+		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
+		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
+		/* Read desc statuses backwards to avoid race condition */
+		/* A.1 load 4 pkts desc */
+		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
+		rte_compiler_barrier();
+
+		/* B.2 copy 2 64 bit or 4 32 bit mbuf point into rx_pkts */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos], mbp1);
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.1 load 2 64 bit mbuf points */
+		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
+#endif
+
+		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
+		rte_compiler_barrier();
+		/* A.1 load remaining 2 pkts desc */
+		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
+		rte_compiler_barrier();
+		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
+
+#if defined(RTE_ARCH_X86_64)
+		/* B.2 copy 2 mbuf point into rx_pkts  */
+		_mm_storeu_si128((__m128i *)&rx_pkts[pos + 2], mbp2);
+#endif
+
+		if (split_packet) {
+			rte_mbuf_prefetch_part2(rx_pkts[pos]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
+			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+		}
+
+		/* avoid compiler reorder optimization */
+		rte_compiler_barrier();
+
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
+		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
+		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp1 = _mm_unpackhi_epi32(descs[1], descs[0]);
+
+		ice_rx_desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
+		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
+
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
+		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
+		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+
+		/* C.2 get 4 pkts staterr value  */
+		zero = _mm_xor_si128(dd_check, dd_check);
+		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
+
+		/* D.3 copy final 3,4 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+			 pkt_mb4);
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+			 pkt_mb3);
+
+		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
+
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0xFF, 0xFF,
+							     0x04, 0x0C,
+							     0x00, 0x08);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit
+			 */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += ICE_DESCS_PER_LOOP;
+		}
+
+		/* C.3 calc available number of desc */
+		staterr = _mm_and_si128(staterr, dd_check);
+		staterr = _mm_packs_epi32(staterr, zero);
+
+		/* D.3 copy final 1,2 data to rx_pkts */
+		_mm_storeu_si128
+			((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+			 pkt_mb2);
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				 pkt_mb1);
+		ice_rx_desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+		/* C.4 calc available number of desc */
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
+		nb_pkts_recd += var;
+		if (likely(var != ICE_DESCS_PER_LOOP))
+			break;
+	}
+
+	/* Update our internal tail pointer */
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd);
+	rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1));
+	rxq->rxrearm_nb = (uint16_t)(rxq->rxrearm_nb + nb_pkts_recd);
+
+	return nb_pkts_recd;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   number of DD bits
+ */
+uint16_t
+ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		  uint16_t nb_pkts)
+{
+	return _ice_recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+int __attribute__((cold))
+ice_rxq_vec_setup(struct ice_rx_queue *rxq)
+{
+	if (!rxq)
+		return -1;
+
+	rxq->rx_rel_mbufs = _ice_rx_queue_release_mbufs_vec;
+	return ice_rxq_vec_setup_default(rxq);
+}
+
+int __attribute__((cold))
+ice_rx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_rx_vec_dev_check_default(dev);
+}
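
The C.3/C.4 steps above reduce the four descriptors' status words to a count of completed descriptors: the DD bits are packed and popcounted, and the burst loop stops as soon as a group yields fewer than ICE_DESCS_PER_LOOP completions. A scalar illustration of that accounting, assuming the DD bit is the least-significant status bit (per ICE_RX_DESC_STATUS_DD_S):

#include <stdint.h>

/* Scalar sketch, not the driver code: count DD bits among the 4 descriptors
 * handled per loop iteration. The burst loop above breaks once this count
 * is smaller than ICE_DESCS_PER_LOOP.
 */
static inline unsigned int
count_dd_bits(const uint64_t status[4])
{
	unsigned int done = 0;
	unsigned int i;

	for (i = 0; i < 4; i++)
		done += (unsigned int)(status[i] & 0x1);	/* DD assumed to be bit 0 */

	return done;
}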
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 857dc0e..469264d 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -11,3 +11,7 @@ sources = files(
 
 deps += ['hash']
 includes += include_directories('base')
+
+if arch_subdir == 'x86'
+	sources += files('ice_rxtx_vec_sse.c')
+endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
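
Given the Notice constraints in this patch (a burst smaller than ICE_DESCS_PER_LOOP returns nothing, and at most ICE_VPMD_RX_BURST descriptors are scanned per call), applications polling the vector path usually request a full burst every call. A hedged usage sketch, assuming ICE_VPMD_RX_BURST is 32; the per-packet handling is a placeholder.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_BURST 32	/* assumed to match ICE_VPMD_RX_BURST */

static void
poll_rx_once(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[RX_BURST];
	uint16_t nb_rx, i;

	/* ask for a full vector burst; smaller requests may return 0 packets */
	nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, RX_BURST);
	for (i = 0; i < nb_rx; i++)
		rte_pktmbuf_free(pkts[i]);	/* placeholder for real processing */
}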

* [PATCH v7 4/8] net/ice: support Rx scatter SSE vector
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
                     ` (2 preceding siblings ...)
  2019-03-26  6:16   ` [PATCH v7 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 5/8] net/ice: support Tx " Wenzhuo Lu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c         | 16 +++++++++++----
 drivers/net/ice/ice_rxtx.h         |  2 ++
 drivers/net/ice/ice_rxtx_vec_sse.c | 41 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 1c6121f..748954f 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1504,7 +1504,8 @@
 		return ptypes;
 
 #ifdef RTE_ARCH_X86
-	if (dev->rx_pkt_burst == ice_recv_pkts_vec)
+	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
 		return ptypes;
 #endif
 
@@ -2248,9 +2249,16 @@ void __attribute__((cold))
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
-			    dev->data->port_id);
-		dev->rx_pkt_burst = ice_recv_pkts_vec;
+		if (dev->data->scattered_rx) {
+			PMD_DRV_LOG(DEBUG,
+				    "Using Vector Scattered Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+		} else {
+			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+				    dev->data->port_id);
+			dev->rx_pkt_burst = ice_recv_pkts_vec;
+		}
 
 		return;
 	}
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 656ca0d..6ef0a84 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -173,4 +173,6 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+				     uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index ec31e36..c49c344 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -473,6 +473,47 @@
 	return _ice_recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
 }
 
+/* vPMD receive routine that reassembles scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ * - nb_pkts > ICE_VPMD_RX_BURST, only scan ICE_VPMD_RX_BURST
+ *   number of DD bits
+ */
+uint16_t
+ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _ice_recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+						  split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + ice_rx_reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+					     &split_flags[i]);
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
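
ice_recv_scattered_pkts_vec above records, per received buffer, whether the EOP bit was clear (split_flags) and then hands the run over to ice_rx_reassemble_packets from the common header. A simplified, hedged sketch of the chaining idea follows; it is not the driver's helper and ignores CRC stripping and the pkt_first_seg carry-over between bursts.

#include <stddef.h>
#include <rte_mbuf.h>

/* Illustration only: chain one packet's segments. split[i] != 0 means
 * buffer i is not the end of its packet, so the next buffer continues it.
 */
static struct rte_mbuf *
chain_one_packet(struct rte_mbuf **bufs, const uint8_t *split, uint16_t n)
{
	struct rte_mbuf *first = bufs[0];
	struct rte_mbuf *last = first;
	uint16_t i;

	for (i = 1; i < n && split[i - 1]; i++) {
		last->next = bufs[i];			/* link next segment */
		first->nb_segs++;
		first->pkt_len += bufs[i]->data_len;	/* grow total length */
		last = bufs[i];
	}
	last->next = NULL;
	return first;	/* head mbuf of the reassembled packet */
}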

* [PATCH v7 5/8] net/ice: support Tx SSE vector
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
                     ` (3 preceding siblings ...)
  2019-03-26  6:16   ` [PATCH v7 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/features/ice_vec.ini  |   2 +
 drivers/net/ice/ice_rxtx.c            |  17 +++++
 drivers/net/ice/ice_rxtx.h            |   4 +
 drivers/net/ice/ice_rxtx_vec_common.h | 133 ++++++++++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx_vec_sse.c    | 129 +++++++++++++++++++++++++++++++++
 5 files changed, 285 insertions(+)

diff --git a/doc/guides/nics/features/ice_vec.ini b/doc/guides/nics/features/ice_vec.ini
index 1a19788..173c8f2 100644
--- a/doc/guides/nics/features/ice_vec.ini
+++ b/doc/guides/nics/features/ice_vec.ini
@@ -12,6 +12,7 @@ Queue start/stop     = Y
 MTU update           = Y
 Jumbo frame          = Y
 Scattered Rx         = Y
+TSO                  = Y
 Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
@@ -22,6 +23,7 @@ RSS reta update      = Y
 VLAN filter          = Y
 Packet type parsing  = Y
 Rx descriptor status = Y
+Tx descriptor status = Y
 Basic stats          = Y
 Extended stats       = Y
 FW version           = Y
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 748954f..715dcad 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2339,6 +2339,23 @@ void __attribute__((cold))
 {
 	struct ice_adapter *ad =
 		ICE_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+#ifdef RTE_ARCH_X86
+	struct ice_tx_queue *txq;
+	int i;
+
+	if (!ice_tx_vec_dev_check(dev)) {
+		for (i = 0; i < dev->data->nb_tx_queues; i++) {
+			txq = dev->data->tx_queues[i];
+			(void)ice_txq_vec_setup(txq);
+		}
+		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+			    dev->data->port_id);
+		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_prepare = NULL;
+
+		return;
+	}
+#endif
 
 	if (ad->tx_simple_allowed) {
 		PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6ef0a84..1dde4e7 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -170,9 +170,13 @@ void ice_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 const uint32_t *ice_dev_supported_ptypes_get(struct rte_eth_dev *dev);
 
 int ice_rx_vec_dev_check(struct rte_eth_dev *dev);
+int ice_tx_vec_dev_check(struct rte_eth_dev *dev);
 int ice_rxq_vec_setup(struct ice_rx_queue *rxq);
+int ice_txq_vec_setup(struct ice_tx_queue *txq);
 uint16_t ice_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			   uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_common.h b/drivers/net/ice/ice_rxtx_vec_common.h
index d41232d..c5f0d56 100644
--- a/drivers/net/ice/ice_rxtx_vec_common.h
+++ b/drivers/net/ice/ice_rxtx_vec_common.h
@@ -71,6 +71,73 @@
 	return pkt_idx;
 }
 
+static __rte_always_inline int
+ice_tx_free_bufs(struct ice_tx_queue *txq)
+{
+	struct ice_tx_entry *txep;
+	uint32_t n;
+	uint32_t i;
+	int nb_free = 0;
+	struct rte_mbuf *m, *free[ICE_TX_MAX_FREE_BUF_SZ];
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(ICE_TXD_QW1_DTYPE_M)) !=
+			rte_cpu_to_le_64(ICE_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	 /* first buffer to free from S/W ring is at index
+	  * tx_next_dd - (tx_rs_thresh-1)
+	  */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m)) {
+				if (likely(m->pool == free[0]->pool)) {
+					free[nb_free++] = m;
+				} else {
+					rte_mempool_put_bulk(free[0]->pool,
+							     (void *)free,
+							     nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m)
+				rte_mempool_put(m->pool, m);
+		}
+	}
+
+	/* buffers were freed, update counters */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return txq->tx_rs_thresh;
+}
+
+static __rte_always_inline void
+ice_tx_backlog_entry(struct ice_tx_entry *txep,
+		     struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	int i;
+
+	for (i = 0; i < (int)nb_pkts; ++i)
+		txep[i].mbuf = tx_pkts[i];
+}
+
 static inline void
 _ice_rx_queue_release_mbufs_vec(struct ice_rx_queue *rxq)
 {
@@ -106,6 +173,34 @@
 	memset(rxq->sw_ring, 0, sizeof(rxq->sw_ring[0]) * rxq->nb_rx_desc);
 }
 
+static inline void
+_ice_tx_queue_release_mbufs_vec(struct ice_tx_queue *txq)
+{
+	uint16_t i;
+
+	if (unlikely(!txq || !txq->sw_ring)) {
+		PMD_DRV_LOG(DEBUG, "Pointer to txq or sw_ring is NULL");
+		return;
+	}
+
+	/**
+	 *  vPMD tx will not set sw_ring's mbuf to NULL after free,
+	 *  so the remaining mbufs need to be freed more carefully.
+	 */
+	i = txq->tx_next_dd - txq->tx_rs_thresh + 1;
+	if (txq->tx_tail < i) {
+		for (; i < txq->nb_tx_desc; i++) {
+			rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+			txq->sw_ring[i].mbuf = NULL;
+		}
+		i = 0;
+	}
+	for (; i < txq->tx_tail; i++) {
+		rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+		txq->sw_ring[i].mbuf = NULL;
+	}
+}
+
 static inline int
 ice_rxq_vec_setup_default(struct ice_rx_queue *rxq)
 {
@@ -142,6 +237,29 @@
 	return 0;
 }
 
+#define ICE_NO_VECTOR_FLAGS (				 \
+		DEV_TX_OFFLOAD_MULTI_SEGS |		 \
+		DEV_TX_OFFLOAD_VLAN_INSERT |		 \
+		DEV_TX_OFFLOAD_SCTP_CKSUM |		 \
+		DEV_TX_OFFLOAD_UDP_CKSUM |		 \
+		DEV_TX_OFFLOAD_TCP_CKSUM)
+
+static inline int
+ice_tx_vec_queue_default(struct ice_tx_queue *txq)
+{
+	if (!txq)
+		return -1;
+
+	if (txq->offloads & ICE_NO_VECTOR_FLAGS)
+		return -1;
+
+	if (txq->tx_rs_thresh < ICE_VPMD_TX_BURST ||
+	    txq->tx_rs_thresh > ICE_TX_MAX_FREE_BUF_SZ)
+		return -1;
+
+	return 0;
+}
+
 static inline int
 ice_rx_vec_dev_check_default(struct rte_eth_dev *dev)
 {
@@ -157,4 +275,19 @@
 	return 0;
 }
 
+static inline int
+ice_tx_vec_dev_check_default(struct rte_eth_dev *dev)
+{
+	int i;
+	struct ice_tx_queue *txq;
+
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		txq = dev->data->tx_queues[i];
+		if (ice_tx_vec_queue_default(txq))
+			return -1;
+	}
+
+	return 0;
+}
+
 #endif
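
ice_tx_vec_queue_default above rejects any queue whose offloads intersect ICE_NO_VECTOR_FLAGS or whose tx_rs_thresh falls outside [ICE_VPMD_TX_BURST, ICE_TX_MAX_FREE_BUF_SZ]. Below is a hedged sketch of a Tx queue configuration that passes those checks; the value 32 is an assumption about where that range lies (it matches other Intel vector PMDs), not something this patch states.

#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Hedged sketch only: keep multi-seg/checksum offloads off and pick a
 * tx_rs_thresh assumed to sit inside the driver's accepted range.
 */
static int
setup_tx_queue_for_vpmd(uint16_t port_id, uint16_t queue_id)
{
	struct rte_eth_txconf txconf = {
		.tx_rs_thresh = 32,	/* assumed within the vector range */
		.tx_free_thresh = 32,
		.offloads = 0,		/* none of the ICE_NO_VECTOR_FLAGS */
	};

	return rte_eth_tx_queue_setup(port_id, queue_id, 1024,
				      rte_socket_id(), &txconf);
}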
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index c49c344..049f60d 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -514,6 +514,119 @@
 					     &split_flags[i]);
 }
 
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp, struct rte_mbuf *pkt,
+	 uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+					    pkt->buf_iova + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp, struct rte_mbuf **pkt,
+	uint16_t nb_pkts, uint64_t flags)
+{
+	int i;
+
+	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
+		ice_vtx1(txdp, *pkt, flags);
+}
+
+static uint16_t
+ice_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			 uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+	int i;
+
+	/* crossing tx_rs_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	nb_commit = nb_pkts;
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		ice_tx_backlog_entry(txep, tx_pkts, n);
+
+		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
+			ice_vtx1(txdp, *tx_pkts, flags);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reach the end of ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	ice_tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		  uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec(tx_queue, &tx_pkts[nb_tx], num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
+
 int __attribute__((cold))
 ice_rxq_vec_setup(struct ice_rx_queue *rxq)
 {
@@ -525,7 +638,23 @@ int __attribute__((cold))
 }
 
 int __attribute__((cold))
+ice_txq_vec_setup(struct ice_tx_queue __rte_unused *txq)
+{
+	if (!txq)
+		return -1;
+
+	txq->tx_rel_mbufs = _ice_tx_queue_release_mbufs_vec;
+	return 0;
+}
+
+int __attribute__((cold))
 ice_rx_vec_dev_check(struct rte_eth_dev *dev)
 {
 	return ice_rx_vec_dev_check_default(dev);
 }
+
+int __attribute__((cold))
+ice_tx_vec_dev_check(struct rte_eth_dev *dev)
+{
+	return ice_tx_vec_dev_check_default(dev);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread
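
Because ice_xmit_pkts_vec transmits in tx_rs_thresh-sized chunks and stops as soon as free descriptors run out, a caller cannot assume the whole burst was accepted. A hedged application-side sketch of the usual retry-then-drop pattern around rte_eth_tx_burst; the drop policy is a placeholder, not driver behaviour.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
send_burst(uint16_t port_id, uint16_t queue_id,
	   struct rte_mbuf **pkts, uint16_t count)
{
	uint16_t sent = 0;

	while (sent < count) {
		uint16_t n = rte_eth_tx_burst(port_id, queue_id,
					      &pkts[sent], count - sent);
		if (n == 0)
			break;		/* ring full; stop retrying here */
		sent += n;
	}

	while (sent < count)		/* free whatever was not accepted */
		rte_pktmbuf_free(pkts[sent++]);
}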

* [PATCH v7 6/8] net/ice: support Rx AVX2 vector
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
                     ` (4 preceding siblings ...)
  2019-03-26  6:16   ` [PATCH v7 5/8] net/ice: support Tx " Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 7/8] net/ice: support Rx scatter " Wenzhuo Lu
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/Makefile            |  19 ++
 drivers/net/ice/ice_rxtx.c          |  16 +-
 drivers/net/ice/ice_rxtx.h          |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c | 622 ++++++++++++++++++++++++++++++++++++
 drivers/net/ice/meson.build         |  15 +
 5 files changed, 671 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ice/ice_rxtx_vec_avx2.c

diff --git a/drivers/net/ice/Makefile b/drivers/net/ice/Makefile
index 92594bb..5ba59f4 100644
--- a/drivers/net/ice/Makefile
+++ b/drivers/net/ice/Makefile
@@ -58,4 +58,23 @@ ifeq ($(CONFIG_RTE_ARCH_X86), y)
 SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_sse.c
 endif
 
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX2,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX2)
+	CC_AVX2_SUPPORT=1
+else
+	CC_AVX2_SUPPORT=\
+	$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+	ifeq ($(CC_AVX2_SUPPORT), 1)
+		ifeq ($(CONFIG_RTE_TOOLCHAIN_ICC),y)
+			CFLAGS_ice_rxtx_vec_avx2.o += -march=core-avx2
+		else
+			CFLAGS_ice_rxtx_vec_avx2.o += -mavx2
+		endif
+	endif
+endif
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ICE_PMD) += ice_rxtx_vec_avx2.c
+endif
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 715dcad..28d5974 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1505,7 +1505,8 @@
 
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec)
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2243,21 +2244,30 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_rx_queue *rxq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_rx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_rx_queues; i++) {
 			rxq = dev->data->rx_queues[i];
 			(void)ice_rxq_vec_setup(rxq);
 		}
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
 				    "Using Vector Scattered Rx (port %d).",
 				    dev->data->port_id);
 			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
 		} else {
-			PMD_DRV_LOG(DEBUG, "Using Vector Rx (port %d).",
+			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_pkts_vec_avx2 :
+					    ice_recv_pkts_vec;
 		}
 
 		return;
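
The hunk above picks the AVX2 burst function at runtime with rte_cpu_get_flag_enabled(), independently of how the AVX2 object was compiled. An application can make the same check, for example to size its own bursts; a minimal sketch:

#include <rte_cpuflags.h>

/* Returns 1 when the running CPU reports AVX2, mirroring the driver's
 * runtime selection above.
 */
static int
cpu_supports_avx2(void)
{
	return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1;
}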
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1dde4e7..d1c9b92 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -179,4 +179,6 @@ uint16_t ice_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				     uint16_t nb_pkts);
 uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
+uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
new file mode 100644
index 0000000..42f761d
--- /dev/null
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -0,0 +1,622 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include "ice_rxtx_vec_common.h"
+
+#include <x86intrin.h>
+
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+
+static inline void
+ice_rxq_rearm(struct ice_rx_queue *rxq)
+{
+	int i;
+	uint16_t rx_id;
+	volatile union ice_rx_desc *rxdp;
+	struct ice_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+
+	/* Pull 'n' more MBUFs into the software ring */
+	if (rte_mempool_get_bulk(rxq->mp,
+				 (void *)rxep,
+				 ICE_RXQ_REARM_THRESH) < 0) {
+		if (rxq->rxrearm_nb + ICE_RXQ_REARM_THRESH >=
+		    rxq->nb_rx_desc) {
+			__m128i dma_addr0;
+
+			dma_addr0 = _mm_setzero_si128();
+			for (i = 0; i < ICE_DESCS_PER_LOOP; i++) {
+				rxep[i].mbuf = &rxq->fake_mbuf;
+				_mm_store_si128((__m128i *)&rxdp[i].read,
+						dma_addr0);
+			}
+		}
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+			ICE_RXQ_REARM_THRESH;
+		return;
+	}
+
+#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+	struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+	__m256i dma_addr0_1, dma_addr2_3;
+	__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+	for (i = 0; i < ICE_RXQ_REARM_THRESH;
+			i += 4, rxep += 4, rxdp += 4) {
+		__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+		__m256i vaddr0_1, vaddr2_3;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+		mb2 = rxep[2].mbuf;
+		mb3 = rxep[3].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_physaddr) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+		vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+		vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+		/**
+		 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+		 * into the high lanes. Similarly for 2 & 3
+		 */
+		vaddr0_1 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
+						vaddr1, 1);
+		vaddr2_3 =
+			_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
+						vaddr3, 1);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+		dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+		/* add headroom to pa values */
+		dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+		dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+		_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+	}
+
+#endif
+
+	rxq->rxrearm_start += ICE_RXQ_REARM_THRESH;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= ICE_RXQ_REARM_THRESH;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			     (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	ICE_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
+}
+
+#define PKTLEN_SHIFT     10
+
+static inline uint16_t
+_ice_recv_raw_pkts_vec_avx2(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts, uint8_t *split_packet)
+{
+#define ICE_DESCS_PER_LOOP_AVX 8
+
+	const uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
+	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
+			0, rxq->mbuf_initializer);
+	struct ice_rx_entry *sw_ring = &rxq->sw_ring[rxq->rx_tail];
+	volatile union ice_rx_desc *rxdp = rxq->rx_ring + rxq->rx_tail;
+	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
+
+	rte_prefetch0(rxdp);
+
+	/* nb_pkts has to be floor-aligned to ICE_DESCS_PER_LOOP_AVX */
+	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, ICE_DESCS_PER_LOOP_AVX);
+
+	/* See if we need to rearm the RX queue - gives the prefetch a bit
+	 * of time to act
+	 */
+	if (rxq->rxrearm_nb > ICE_RXQ_REARM_THRESH)
+		ice_rxq_rearm(rxq);
+
+	/* Before we start moving massive data around, check to see if
+	 * there is actually a packet available
+	 */
+	if (!(rxdp->wb.qword1.status_error_len &
+			rte_cpu_to_le_32(1 << ICE_RX_DESC_STATUS_DD_S)))
+		return 0;
+
+	/* constants used in processing loop */
+	const __m256i crc_adjust =
+		_mm256_set_epi16
+			(/* first descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0,          /* ignore pkt_type field */
+			 /* second descriptor */
+			 0, 0, 0,       /* ignore non-length fields */
+			 -rxq->crc_len, /* sub crc on data_len */
+			 0,             /* ignore high-16bits of pkt_len */
+			 -rxq->crc_len, /* sub crc on pkt_len */
+			 0, 0           /* ignore pkt_type field */
+			);
+
+	/* 8 packets DD mask, LSB in each 32-bit value */
+	const __m256i dd_check = _mm256_set1_epi32(1);
+
+	/* 8 packets EOP mask, second-LSB in each 32-bit value */
+	const __m256i eop_check = _mm256_slli_epi32(dd_check,
+			ICE_RX_DESC_STATUS_EOF_S);
+
+	/* mask to shuffle from desc. to mbuf (2 descriptors)*/
+	const __m256i shuf_msk =
+		_mm256_set_epi8
+			(/* first descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF,  /*pkt_type set as unknown */
+			 /* second descriptor */
+			 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
+			 3, 2,        /* octet 2~3, low 16 bits vlan_macip */
+			 15, 14,      /* octet 15~14, 16 bits data_len */
+			 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
+			 15, 14,      /* octet 15~14, low 16 bits pkt_len */
+			 0xFF, 0xFF,  /* pkt_type set as unknown */
+			 0xFF, 0xFF   /*pkt_type set as unknown */
+			);
+	/**
+	 * compile-time check the above crc and shuffle layout is correct.
+	 * NOTE: the first field (lowest address) is given last in set_epi
+	 * calls above.
+	 */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+
+	/* Status/Error flag masks */
+	/**
+	 * mask everything except RSS, flow director and VLAN flags
+	 * bit2 is for VLAN tag, bit11 for flow director indication
+	 * bit13:12 for RSS indication. Bits 3-5 of error
+	 * field (bits 22-24) are for IP/L4 checksum errors
+	 */
+	const __m256i flags_mask =
+		 _mm256_set1_epi32((1 << 2) | (1 << 11) |
+				   (3 << 12) | (7 << 22));
+	/**
+	 * data to be shuffled by result of flag mask. If VLAN bit is set,
+	 * (bit 2), then position 4 in this array will be used in the
+	 * destination
+	 */
+	const __m256i vlan_flags_shuf =
+		_mm256_set_epi32(0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0,
+				 0, 0, PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED, 0);
+	/**
+	 * data to be shuffled by result of flag mask, shifted down 11.
+	 * If RSS/FDIR bits are set, shuffle moves appropriate flags in
+	 * place.
+	 */
+	const __m256i rss_flags_shuf =
+		_mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0,/* end up 128-bits */
+				0, 0, 0, 0, 0, 0, 0, 0,
+				PKT_RX_RSS_HASH | PKT_RX_FDIR, PKT_RX_RSS_HASH,
+				0, 0, 0, 0, PKT_RX_FDIR, 0);
+
+	/**
+	 * data to be shuffled by the result of the flags mask shifted by 22
+	 * bits. This gives us the l3_l4 flags.
+	 */
+	const __m256i l3_l4_flags_shuf = _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+			/* shift right 1 bit to make sure it does not exceed 255 */
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1,
+			/* second 128-bits */
+			0, 0, 0, 0, 0, 0, 0, 0,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD |
+			 PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD |
+			 PKT_RX_L4_CKSUM_BAD) >> 1,
+			(PKT_RX_EIP_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_EIP_CKSUM_BAD) >> 1,
+			(PKT_RX_L4_CKSUM_BAD | PKT_RX_IP_CKSUM_BAD) >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD) >> 1,
+			PKT_RX_IP_CKSUM_BAD >> 1,
+			(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD) >> 1);
+
+	const __m256i cksum_mask =
+		 _mm256_set1_epi32(PKT_RX_IP_CKSUM_GOOD | PKT_RX_IP_CKSUM_BAD |
+				   PKT_RX_L4_CKSUM_GOOD | PKT_RX_L4_CKSUM_BAD |
+				   PKT_RX_EIP_CKSUM_BAD);
+
+	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
+
+	uint16_t i, received;
+
+	for (i = 0, received = 0; i < nb_pkts;
+	     i += ICE_DESCS_PER_LOOP_AVX,
+	     rxdp += ICE_DESCS_PER_LOOP_AVX) {
+		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
+		_mm256_storeu_si256((void *)&rx_pkts[i],
+				    _mm256_loadu_si256((void *)&sw_ring[i]));
+#ifdef RTE_ARCH_X86_64
+		_mm256_storeu_si256
+			((void *)&rx_pkts[i + 4],
+			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
+#endif
+
+		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
+#ifdef RTE_LIBRTE_ICE_16BYTE_RX_DESC
+		/* for AVX we need alignment otherwise loads are not atomic */
+		if (avx_aligned) {
+			/* load in descriptors, 2 at a time, in reverse order */
+			raw_desc6_7 = _mm256_load_si256((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			raw_desc4_5 = _mm256_load_si256((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			raw_desc2_3 = _mm256_load_si256((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			raw_desc0_1 = _mm256_load_si256((void *)(rxdp + 0));
+		} else
+#endif
+		{
+			const __m128i raw_desc7 =
+				_mm_load_si128((void *)(rxdp + 7));
+			rte_compiler_barrier();
+			const __m128i raw_desc6 =
+				_mm_load_si128((void *)(rxdp + 6));
+			rte_compiler_barrier();
+			const __m128i raw_desc5 =
+				_mm_load_si128((void *)(rxdp + 5));
+			rte_compiler_barrier();
+			const __m128i raw_desc4 =
+				_mm_load_si128((void *)(rxdp + 4));
+			rte_compiler_barrier();
+			const __m128i raw_desc3 =
+				_mm_load_si128((void *)(rxdp + 3));
+			rte_compiler_barrier();
+			const __m128i raw_desc2 =
+				_mm_load_si128((void *)(rxdp + 2));
+			rte_compiler_barrier();
+			const __m128i raw_desc1 =
+				_mm_load_si128((void *)(rxdp + 1));
+			rte_compiler_barrier();
+			const __m128i raw_desc0 =
+				_mm_load_si128((void *)(rxdp + 0));
+
+			raw_desc6_7 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc6),
+					 raw_desc7, 1);
+			raw_desc4_5 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc4),
+					 raw_desc5, 1);
+			raw_desc2_3 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc2),
+					 raw_desc3, 1);
+			raw_desc0_1 =
+				_mm256_inserti128_si256
+					(_mm256_castsi128_si256(raw_desc0),
+					 raw_desc1, 1);
+		}
+
+		if (split_packet) {
+			int j;
+
+			for (j = 0; j < ICE_DESCS_PER_LOOP_AVX; j++)
+				rte_mbuf_prefetch_part2(rx_pkts[i + j]);
+		}
+
+		/**
+		 * convert descriptors 4-7 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len6_7 = _mm256_slli_epi32(raw_desc6_7,
+							 PKTLEN_SHIFT);
+		const __m256i len4_5 = _mm256_slli_epi32(raw_desc4_5,
+							 PKTLEN_SHIFT);
+		const __m256i desc6_7 = _mm256_blend_epi16(raw_desc6_7,
+							   len6_7, 0x80);
+		const __m256i desc4_5 = _mm256_blend_epi16(raw_desc4_5,
+							   len4_5, 0x80);
+		__m256i mb6_7 = _mm256_shuffle_epi8(desc6_7, shuf_msk);
+		__m256i mb4_5 = _mm256_shuffle_epi8(desc4_5, shuf_msk);
+
+		mb6_7 = _mm256_add_epi16(mb6_7, crc_adjust);
+		mb4_5 = _mm256_add_epi16(mb4_5, crc_adjust);
+		/**
+		 * to get packet types, shift 64-bit values down 30 bits
+		 * and so ptype is in lower 8-bits in each
+		 */
+		const __m256i ptypes6_7 = _mm256_srli_epi64(desc6_7, 30);
+		const __m256i ptypes4_5 = _mm256_srli_epi64(desc4_5, 30);
+		const uint8_t ptype7 = _mm256_extract_epi8(ptypes6_7, 24);
+		const uint8_t ptype6 = _mm256_extract_epi8(ptypes6_7, 8);
+		const uint8_t ptype5 = _mm256_extract_epi8(ptypes4_5, 24);
+		const uint8_t ptype4 = _mm256_extract_epi8(ptypes4_5, 8);
+
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4);
+		mb6_7 = _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4);
+		mb4_5 = _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0);
+		/* merge the status bits into one register */
+		const __m256i status4_7 = _mm256_unpackhi_epi32(desc6_7,
+				desc4_5);
+
+		/**
+		 * convert descriptors 0-3 into mbufs, adjusting length and
+		 * re-arranging fields. Then write into the mbuf
+		 */
+		const __m256i len2_3 = _mm256_slli_epi32(raw_desc2_3,
+							 PKTLEN_SHIFT);
+		const __m256i len0_1 = _mm256_slli_epi32(raw_desc0_1,
+							 PKTLEN_SHIFT);
+		const __m256i desc2_3 = _mm256_blend_epi16(raw_desc2_3,
+							   len2_3, 0x80);
+		const __m256i desc0_1 = _mm256_blend_epi16(raw_desc0_1,
+							   len0_1, 0x80);
+		__m256i mb2_3 = _mm256_shuffle_epi8(desc2_3, shuf_msk);
+		__m256i mb0_1 = _mm256_shuffle_epi8(desc0_1, shuf_msk);
+
+		mb2_3 = _mm256_add_epi16(mb2_3, crc_adjust);
+		mb0_1 = _mm256_add_epi16(mb0_1, crc_adjust);
+		/* get the packet types */
+		const __m256i ptypes2_3 = _mm256_srli_epi64(desc2_3, 30);
+		const __m256i ptypes0_1 = _mm256_srli_epi64(desc0_1, 30);
+		const uint8_t ptype3 = _mm256_extract_epi8(ptypes2_3, 24);
+		const uint8_t ptype2 = _mm256_extract_epi8(ptypes2_3, 8);
+		const uint8_t ptype1 = _mm256_extract_epi8(ptypes0_1, 24);
+		const uint8_t ptype0 = _mm256_extract_epi8(ptypes0_1, 8);
+
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4);
+		mb2_3 = _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4);
+		mb0_1 = _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0);
+		/* merge the status bits into one register */
+		const __m256i status0_3 = _mm256_unpackhi_epi32(desc2_3,
+								desc0_1);
+
+		/**
+		 * take the two sets of status bits and merge to one
+		 * After merge, the packets status flags are in the
+		 * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6]
+		 */
+		__m256i status0_7 = _mm256_unpacklo_epi64(status4_7,
+							  status0_3);
+
+		/* now do flag manipulation */
+
+		/* get only flag/error bits we want */
+		const __m256i flag_bits =
+			_mm256_and_si256(status0_7, flags_mask);
+		/* set vlan and rss flags */
+		const __m256i vlan_flags =
+			_mm256_shuffle_epi8(vlan_flags_shuf, flag_bits);
+		const __m256i rss_flags =
+			_mm256_shuffle_epi8(rss_flags_shuf,
+					    _mm256_srli_epi32(flag_bits, 11));
+		/**
+		 * l3_l4_error flags, shuffle, then shift to correct adjustment
+		 * of flags in flags_shuf, and finally mask out extra bits
+		 */
+		__m256i l3_l4_flags = _mm256_shuffle_epi8(l3_l4_flags_shuf,
+				_mm256_srli_epi32(flag_bits, 22));
+		l3_l4_flags = _mm256_slli_epi32(l3_l4_flags, 1);
+		l3_l4_flags = _mm256_and_si256(l3_l4_flags, cksum_mask);
+
+		/* merge flags */
+		const __m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
+				_mm256_or_si256(rss_flags, vlan_flags));
+		/**
+		 * At this point, we have the 8 sets of flags in the low 16-bits
+		 * of each 32-bit value in mbuf_flags.
+		 * We want to extract these, and merge them with the mbuf init
+		 * data so we can do a single write to the mbuf to set the flags
+		 * and all the other initialization fields. Extracting the
+		 * appropriate flags means that we have to do a shift and blend
+		 * for each mbuf before we do the write. However, we can also
+		 * add in the previously computed rx_descriptor fields to
+		 * make a single 256-bit write per mbuf
+		 */
+		/* check the structure matches expectations */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
+				 offsetof(struct rte_mbuf, rearm_data) + 8);
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
+				 RTE_ALIGN(offsetof(struct rte_mbuf,
+						    rearm_data),
+					   16));
+		/* build up data and do writes */
+		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
+			rearm6, rearm7;
+		rearm6 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 8),
+					    0x04);
+		rearm4 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(mbuf_flags, 4),
+					    0x04);
+		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
+		rearm0 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(mbuf_flags, 4),
+					    0x04);
+		/* permute to add in the rx_descriptor e.g. rss fields */
+		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
+		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
+		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
+		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
+		/* write to mbuf */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
+				    rearm6);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
+				    rearm4);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
+				    rearm2);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
+				    rearm0);
+
+		/* repeat for the odd mbufs */
+		const __m256i odd_flags =
+			_mm256_castsi128_si256
+				(_mm256_extracti128_si256(mbuf_flags, 1));
+		rearm7 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 8),
+					    0x04);
+		rearm5 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_slli_si256(odd_flags, 4),
+					    0x04);
+		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
+		rearm1 = _mm256_blend_epi32(mbuf_init,
+					    _mm256_srli_si256(odd_flags, 4),
+					    0x04);
+		/* since odd mbufs are already in hi 128-bits use blend */
+		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
+		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
+		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
+		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
+		/* again write to mbufs */
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
+				    rearm7);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
+				    rearm5);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
+				    rearm3);
+		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
+				    rearm1);
+
+		/* extract and record EOP bit */
+		if (split_packet) {
+			const __m128i eop_mask =
+				_mm_set1_epi16(1 << ICE_RX_DESC_STATUS_EOF_S);
+			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
+								     eop_check);
+			/* pack status bits into a single 128-bit register */
+			const __m128i eop_bits =
+				_mm_packus_epi32
+					(_mm256_castsi256_si128(eop_bits256),
+					 _mm256_extractf128_si256(eop_bits256,
+								  1));
+			/**
+			 * flip bits, and mask out the EOP bit, which is now
+			 * a split-packet bit i.e. !EOP, rather than EOP one.
+			 */
+			__m128i split_bits = _mm_andnot_si128(eop_bits,
+					eop_mask);
+			/**
+			 * eop bits are out of order, so we need to shuffle them
+			 * back into order again. In doing so, only use low 8
+			 * bits, which acts like another pack instruction
+			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
+			 * [Since we use epi8, the 16-bit positions are
+			 * multiplied by 2 in the eop_shuffle value.]
+			 */
+			__m128i eop_shuffle =
+				_mm_set_epi8(/* zero hi 64b */
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     0xFF, 0xFF, 0xFF, 0xFF,
+					     /* move values to lo 64b */
+					     8, 0, 10, 2,
+					     12, 4, 14, 6);
+			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
+			*(uint64_t *)split_packet =
+				_mm_cvtsi128_si64(split_bits);
+			split_packet += ICE_DESCS_PER_LOOP_AVX;
+		}
+
+		/* perform dd_check */
+		status0_7 = _mm256_and_si256(status0_7, dd_check);
+		status0_7 = _mm256_packs_epi32(status0_7,
+					       _mm256_setzero_si256());
+
+		uint64_t burst = __builtin_popcountll
+					(_mm_cvtsi128_si64
+						(_mm256_extracti128_si256
+							(status0_7, 1)));
+		burst += __builtin_popcountll
+				(_mm_cvtsi128_si64
+					(_mm256_castsi256_si128(status0_7)));
+		received += burst;
+		if (burst != ICE_DESCS_PER_LOOP_AVX)
+			break;
+	}
+
+	/* update tail pointers */
+	rxq->rx_tail += received;
+	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
+	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
+		rxq->rx_tail--;
+		received--;
+	}
+	rxq->rxrearm_nb += received;
+	return received;
+}
+
+/**
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+		       uint16_t nb_pkts)
+{
+	return _ice_recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
+}
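
Like the SSE path, the AVX2 routine above resolves mbuf packet types by shifting each descriptor qword right by 30 bits, taking the low byte as the hardware ptype and indexing the adapter's ptype_tbl. A scalar sketch of that lookup; the 30-bit position is taken from the shifts in this patch, not asserted from the datasheet.

#include <stdint.h>

/* Illustrative scalar equivalent of the vector ptype extraction above. */
static inline uint32_t
resolve_ptype(uint64_t qword1, const uint32_t *ptype_tbl)
{
	uint8_t hw_ptype = (uint8_t)(qword1 >> 30);	/* same shift as above */

	return ptype_tbl[hw_ptype];
}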
diff --git a/drivers/net/ice/meson.build b/drivers/net/ice/meson.build
index 469264d..2bec688 100644
--- a/drivers/net/ice/meson.build
+++ b/drivers/net/ice/meson.build
@@ -14,4 +14,19 @@ includes += include_directories('base')
 
 if arch_subdir == 'x86'
 	sources += files('ice_rxtx_vec_sse.c')
+
+	# compile AVX2 version if either:
+	# a. we have AVX2 supported in the minimum instruction set baseline
+	# b. it's not minimum instruction set, but supported by compiler
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX2')
+		sources += files('ice_rxtx_vec_avx2.c')
+	elif cc.has_argument('-mavx2')
+		ice_avx2_lib = static_library('ice_avx2_lib',
+				'ice_rxtx_vec_avx2.c',
+				dependencies: [static_rte_ethdev,
+					static_rte_kvargs, static_rte_hash],
+				include_directories: includes,
+				c_args: [cflags, '-mavx2'])
+		objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')
+	endif
 endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v7 7/8] net/ice: support Rx scatter AVX2 vector
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
                     ` (5 preceding siblings ...)
  2019-03-26  6:16   ` [PATCH v7 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  6:16   ` [PATCH v7 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
  2019-03-26  9:50   ` [PATCH v7 0/8] Support vector instructions on ICE Ferruh Yigit
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 drivers/net/ice/ice_rxtx.c          | 10 ++++--
 drivers/net/ice/ice_rxtx.h          |  3 ++
 drivers/net/ice/ice_rxtx_vec_avx2.c | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 28d5974..860155f 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -1506,7 +1506,8 @@
 #ifdef RTE_ARCH_X86
 	if (dev->rx_pkt_burst == ice_recv_pkts_vec ||
 	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec ||
-	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2)
+	    dev->rx_pkt_burst == ice_recv_pkts_vec_avx2 ||
+	    dev->rx_pkt_burst == ice_recv_scattered_pkts_vec_avx2)
 		return ptypes;
 #endif
 
@@ -2258,9 +2259,12 @@ void __attribute__((cold))
 
 		if (dev->data->scattered_rx) {
 			PMD_DRV_LOG(DEBUG,
-				    "Using Vector Scattered Rx (port %d).",
+				    "Using %sVector Scattered Rx (port %d).",
+				    use_avx2 ? "avx2 " : "",
 				    dev->data->port_id);
-			dev->rx_pkt_burst = ice_recv_scattered_pkts_vec;
+			dev->rx_pkt_burst = use_avx2 ?
+					    ice_recv_scattered_pkts_vec_avx2 :
+					    ice_recv_scattered_pkts_vec;
 		} else {
 			PMD_DRV_LOG(DEBUG, "Using %sVector Rx (port %d).",
 				    use_avx2 ? "avx2 " : "",
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index d1c9b92..dfc3224 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -181,4 +181,7 @@ uint16_t ice_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 			   uint16_t nb_pkts);
 uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 				uint16_t nb_pkts);
+uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
+					  struct rte_mbuf **rx_pkts,
+					  uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 42f761d..2459ff3 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -620,3 +620,67 @@
 {
 	return _ice_recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
 }
+
+/**
+ * vPMD receive routine that reassembles single burst of 32 scattered packets
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+static uint16_t
+ice_recv_scattered_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				  uint16_t nb_pkts)
+{
+	struct ice_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[ICE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _ice_recv_raw_pkts_vec_avx2(rxq, rx_pkts, nb_pkts,
+						       split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint64_t *split_fl64 = (uint64_t *)split_flags;
+
+	if (!rxq->pkt_first_seg &&
+	    split_fl64[0] == 0 && split_fl64[1] == 0 &&
+	    split_fl64[2] == 0 && split_fl64[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly*/
+	unsigned int i = 0;
+
+	if (!rxq->pkt_first_seg) {
+		/* find the first split flag, and only reassemble then*/
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + ice_rx_reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+					     &split_flags[i]);
+}
+
+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < ICE_DESCS_PER_LOOP, just return no packet
+ */
+uint16_t
+ice_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
+				 uint16_t nb_pkts)
+{
+	uint16_t retval = 0;
+
+	while (nb_pkts > ICE_VPMD_RX_BURST) {
+		uint16_t burst = ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, ICE_VPMD_RX_BURST);
+		retval += burst;
+		nb_pkts -= burst;
+		if (burst < ICE_VPMD_RX_BURST)
+			return retval;
+	}
+	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
+				rx_pkts + retval, nb_pkts);
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v7 8/8] net/ice: support vector AVX2 in TX
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
                     ` (6 preceding siblings ...)
  2019-03-26  6:16   ` [PATCH v7 7/8] net/ice: support Rx scatter " Wenzhuo Lu
@ 2019-03-26  6:16   ` Wenzhuo Lu
  2019-03-26  9:50   ` [PATCH v7 0/8] Support vector instructions on ICE Ferruh Yigit
  8 siblings, 0 replies; 121+ messages in thread
From: Wenzhuo Lu @ 2019-03-26  6:16 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu

Signed-off-by: Wenzhuo Lu <wenzhuo.lu@intel.com>
---
 doc/guides/nics/ice.rst                |  18 ++++
 doc/guides/rel_notes/release_19_05.rst |   4 +
 drivers/net/ice/ice_rxtx.c             |  13 ++-
 drivers/net/ice/ice_rxtx.h             |   2 +
 drivers/net/ice/ice_rxtx_vec_avx2.c    | 158 +++++++++++++++++++++++++++++++++
 5 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/ice.rst b/doc/guides/nics/ice.rst
index 3998d5e..fdbc02e 100644
--- a/doc/guides/nics/ice.rst
+++ b/doc/guides/nics/ice.rst
@@ -64,6 +64,24 @@ Driver compilation and testing
 Refer to the document :ref:`compiling and testing a PMD for a NIC <pmd_build_and_test>`
 for details.
 
+Features
+--------
+
+Vector PMD
+~~~~~~~~~~
+
+Vector PMD for RX and TX path are selected automatically. The paths
+are chosen based on 2 conditions.
+
+- ``CPU``
+  On the X86 platform, the driver checks if the CPU supports AVX2.
+  If it's supported, AVX2 paths will be chosen. If not, SSE is chosen.
+
+- ``Offload features``
+  The supported HW offload features are described in the document ice_vec.ini.
+  If any not supported features are used, ICE vector PMD is disabled and the
+  normal paths are chosen.
+
 Sample Application Notes
 ------------------------
 
diff --git a/doc/guides/rel_notes/release_19_05.rst b/doc/guides/rel_notes/release_19_05.rst
index 6f76de3..fbea42f 100644
--- a/doc/guides/rel_notes/release_19_05.rst
+++ b/doc/guides/rel_notes/release_19_05.rst
@@ -96,6 +96,10 @@ New Features
   Improved testpmd application performance on ARM platform. For ``macswap``
   forwarding mode, NEON intrinsics were used to do swap to save CPU cycles.
 
+* **Added support of vector instructions on ICE.**
+
+   Added support of SSE and AVX2 instructions in ICE RX and TX path.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 860155f..5264055 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -2356,15 +2356,24 @@ void __attribute__((cold))
 #ifdef RTE_ARCH_X86
 	struct ice_tx_queue *txq;
 	int i;
+	bool use_avx2 = false;
 
 	if (!ice_tx_vec_dev_check(dev)) {
 		for (i = 0; i < dev->data->nb_tx_queues; i++) {
 			txq = dev->data->tx_queues[i];
 			(void)ice_txq_vec_setup(txq);
 		}
-		PMD_DRV_LOG(DEBUG, "Using Vector Tx (port %d).",
+
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
+		    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1)
+			use_avx2 = true;
+
+		PMD_DRV_LOG(DEBUG, "Using %sVector Tx (port %d).",
+			    use_avx2 ? "avx2 " : "",
 			    dev->data->port_id);
-		dev->tx_pkt_burst = ice_xmit_pkts_vec;
+		dev->tx_pkt_burst = use_avx2 ?
+				    ice_xmit_pkts_vec_avx2 :
+				    ice_xmit_pkts_vec;
 		dev->tx_pkt_prepare = NULL;
 
 		return;
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index dfc3224..64e9f20 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -184,4 +184,6 @@ uint16_t ice_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ice_recv_scattered_pkts_vec_avx2(void *rx_queue,
 					  struct rte_mbuf **rx_pkts,
 					  uint16_t nb_pkts);
+uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+				uint16_t nb_pkts);
 #endif /* _ICE_RXTX_H_ */
diff --git a/drivers/net/ice/ice_rxtx_vec_avx2.c b/drivers/net/ice/ice_rxtx_vec_avx2.c
index 2459ff3..fac869a 100644
--- a/drivers/net/ice/ice_rxtx_vec_avx2.c
+++ b/drivers/net/ice/ice_rxtx_vec_avx2.c
@@ -684,3 +684,161 @@
 	return retval + ice_recv_scattered_burst_vec_avx2(rx_queue,
 				rx_pkts + retval, nb_pkts);
 }
+
+static inline void
+ice_vtx1(volatile struct ice_tx_desc *txdp,
+	 struct rte_mbuf *pkt, uint64_t flags)
+{
+	uint64_t high_qw =
+		(ICE_TX_DESC_DTYPE_DATA |
+		 ((uint64_t)flags  << ICE_TXD_QW1_CMD_S) |
+		 ((uint64_t)pkt->data_len << ICE_TXD_QW1_TX_BUF_SZ_S));
+
+	__m128i descriptor = _mm_set_epi64x(high_qw,
+				pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)txdp, descriptor);
+}
+
+static inline void
+ice_vtx(volatile struct ice_tx_desc *txdp,
+	struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
+{
+	const uint64_t hi_qw_tmpl = (ICE_TX_DESC_DTYPE_DATA |
+			((uint64_t)flags  << ICE_TXD_QW1_CMD_S));
+
+	/* if unaligned on 32-bit boundary, do one to align */
+	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
+		ice_vtx1(txdp, *pkt, flags);
+		nb_pkts--, txdp++, pkt++;
+	}
+
+	/* do two at a time while possible, in bursts */
+	for (; nb_pkts > 3; txdp += 4, pkt += 4, nb_pkts -= 4) {
+		uint64_t hi_qw3 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[3]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw2 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[2]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw1 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[1]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+		uint64_t hi_qw0 =
+			hi_qw_tmpl |
+			((uint64_t)pkt[0]->data_len <<
+			 ICE_TXD_QW1_TX_BUF_SZ_S);
+
+		__m256i desc2_3 =
+			_mm256_set_epi64x
+				(hi_qw3,
+				 pkt[3]->buf_physaddr + pkt[3]->data_off,
+				 hi_qw2,
+				 pkt[2]->buf_physaddr + pkt[2]->data_off);
+		__m256i desc0_1 =
+			_mm256_set_epi64x
+				(hi_qw1,
+				 pkt[1]->buf_physaddr + pkt[1]->data_off,
+				 hi_qw0,
+				 pkt[0]->buf_physaddr + pkt[0]->data_off);
+		_mm256_store_si256((void *)(txdp + 2), desc2_3);
+		_mm256_store_si256((void *)txdp, desc0_1);
+	}
+
+	/* do any last ones */
+	while (nb_pkts) {
+		ice_vtx1(txdp, *pkt, flags);
+		txdp++, pkt++, nb_pkts--;
+	}
+}
+
+static inline uint16_t
+ice_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+			      uint16_t nb_pkts)
+{
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+	volatile struct ice_tx_desc *txdp;
+	struct ice_tx_entry *txep;
+	uint16_t n, nb_commit, tx_id;
+	uint64_t flags = ICE_TD_CMD;
+	uint64_t rs = ICE_TX_DESC_CMD_RS | ICE_TD_CMD;
+
+	/* cross rx_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
+	if (txq->nb_tx_free < txq->tx_free_thresh)
+		ice_tx_free_bufs(txq);
+
+	nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	tx_id = txq->tx_tail;
+	txdp = &txq->tx_ring[tx_id];
+	txep = &txq->sw_ring[tx_id];
+
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
+
+	n = (uint16_t)(txq->nb_tx_desc - tx_id);
+	if (nb_commit >= n) {
+		ice_tx_backlog_entry(txep, tx_pkts, n);
+
+		ice_vtx(txdp, tx_pkts, n - 1, flags);
+		tx_pkts += (n - 1);
+		txdp += (n - 1);
+
+		ice_vtx1(txdp, *tx_pkts++, rs);
+
+		nb_commit = (uint16_t)(nb_commit - n);
+
+		tx_id = 0;
+		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
+
+		/* avoid reach the end of ring */
+		txdp = &txq->tx_ring[tx_id];
+		txep = &txq->sw_ring[tx_id];
+	}
+
+	ice_tx_backlog_entry(txep, tx_pkts, nb_commit);
+
+	ice_vtx(txdp, tx_pkts, nb_commit, flags);
+
+	tx_id = (uint16_t)(tx_id + nb_commit);
+	if (tx_id > txq->tx_next_rs) {
+		txq->tx_ring[txq->tx_next_rs].cmd_type_offset_bsz |=
+			rte_cpu_to_le_64(((uint64_t)ICE_TX_DESC_CMD_RS) <<
+					 ICE_TXD_QW1_CMD_S);
+		txq->tx_next_rs =
+			(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
+	}
+
+	txq->tx_tail = tx_id;
+
+	ICE_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
+
+	return nb_pkts;
+}
+
+uint16_t
+ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
+		       uint16_t nb_pkts)
+{
+	uint16_t nb_tx = 0;
+	struct ice_tx_queue *txq = (struct ice_tx_queue *)tx_queue;
+
+	while (nb_pkts) {
+		uint16_t ret, num;
+
+		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+		ret = ice_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx],
+						    num);
+		nb_tx += ret;
+		nb_pkts -= ret;
+		if (ret < num)
+			break;
+	}
+
+	return nb_tx;
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-26  1:00           ` Lu, Wenzhuo
@ 2019-03-26  9:28             ` Maxime Coquelin
  2019-03-27  0:56               ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-26  9:28 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev

Hi,

On 3/26/19 2:00 AM, Lu, Wenzhuo wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Monday, March 25, 2019 4:26 PM
>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector
>>
>> Hi,
>>
>> On 3/25/19 3:22 AM, Lu, Wenzhuo wrote:
>>> Hi Maxime,
>>>
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Friday, March 22, 2019 6:12 PM
>>>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
>>>> vector
>>>
>>>
>>>>> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
>>>>
>>>> I see same is done for other Intel NICs, but I wonder what would be
>>>> the performance cost of making it dynamic, if any cost?
>>> Currently we don't have a good idea to make it dynamic. If we use pointer
>> to point to different functions for 16 byte and 32 byte, there's too much
>> duplicate code to make it hard to maintain. If we use the same function, and
>> check the configure in it. It impacts the performance.
>>
>> Have you done some measurements, what would be the performance
>> impact?
> I mean if we check the configuration is 16 byte or 32 byte, this check will consume extra CPU cycles.
> That why I think the better way is to have different paths for 16 byte and 32 byte. We should choose the appropriate path at the beginning.
> 
>>
>>> As HW does not support to change the configuration dynamically. The
>> device must be stopped and restarted if the configuration is changed. It's not
>> very helpful to make it a dynamic configuration. We assume that the users
>> can make their choice at the beginning and will not change it.
>>
>> The problem is that the user has to recompile to switch between the two
>> configurations. And it may not be an option for the user if he uses dpdk
>> packaged by a distribution, for example.
>>
>> Maybe I was not clear, but I don't mean to be able to switch mode while the
>> port is started. I think it would be better to make it possible to switch mode
>> at application startup time.
> Yes, I understand the problem is the recompiling. But we think the users will not change it after they made decision. That's why's acceptable in previous drivers.

The problem is that the user may not be able to change it, if he does
not get DPDK from source but from a distribution like Debian, Ubuntu or
Red Hat.

In this case, it means the user has no choice but to stick to 32-byte
descriptors.

> Agree it's better to remove all the compile configuration. Looks like that's what we're trying to do. We'd like to think about how to optimize it later.

My suggestion would be a devarg, so that you can have a per-port
policy (which is another advantage of doing so).

> 
> 
>>
>>>
>>>>
>>>> Having it dynamic (as a dev arg for instance) would make it possible
>>>> to change the value when the user is using dpdk from a distro. It
>>>> would also help testing coverage.
>>>>
>>>> Btw, how do you select this option with meson build system?
>>> Not very familiar with meson. As I know, we can change the meson.build
>> to add the configure.
>>>
>>
>> Ok, then please try to do it, because the legacy build system is going to be
>> deprecated.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
                     ` (7 preceding siblings ...)
  2019-03-26  6:16   ` [PATCH v7 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
@ 2019-03-26  9:50   ` Ferruh Yigit
  2019-03-31 15:52     ` Thomas Monjalon
  8 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-03-26  9:50 UTC (permalink / raw)
  To: Wenzhuo Lu, dev; +Cc: Qi Zhang

On 3/26/2019 6:16 AM, Wenzhuo Lu wrote:
> Use SSE and AVX2 instructions in ICE RX and TX path.
> 
> ---
> v2:
>  - Updated feature doc.
>  - Fixed checklog and checkpatch issues.
> 
> v3:
>  - Fixed potential compile issue on non-X86 platform.
> 
> v4:
>  - Removed compile configure, CONFIG_RTE_LIBRTE_ICE_INC_VECTOR.
>  - Fixed checkpatch warnings.
>  - Added more explanation of vector path in the device document.
>  - Some other minor change.
> 
> v5:
>  - Fixed a compile issue.
>  - Fixed a doc build warning.
> 
> v6:
>  - Added prefix "ice_" for ICE specific functions.
>  - Added unlikely for rarely used code.
> 
> v7:
>  - Reserved the original buffer release functions.
> 
> Wenzhuo Lu (8):
>   net/ice: fix Tx function setting
>   net/ice: add pointer for queue buffer release
>   net/ice: support vector SSE in RX
>   net/ice: support Rx scatter SSE vector
>   net/ice: support Tx SSE vector
>   net/ice: support Rx AVX2 vector
>   net/ice: support Rx scatter AVX2 vector
>   net/ice: support vector AVX2 in TX

This version (v7) pulled from next-net-intel to next-net.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-26  9:28             ` Maxime Coquelin
@ 2019-03-27  0:56               ` Lu, Wenzhuo
  2019-03-27  7:50                 ` Maxime Coquelin
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-27  0:56 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Tuesday, March 26, 2019 5:29 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector
> 
> Hi,
> 
> On 3/26/19 2:00 AM, Lu, Wenzhuo wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Monday, March 25, 2019 4:26 PM
> >> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
> >> vector
> >>
> >> Hi,
> >>
> >> On 3/25/19 3:22 AM, Lu, Wenzhuo wrote:
> >>> Hi Maxime,
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Friday, March 22, 2019 6:12 PM
> >>>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >>>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
> >>>> vector
> >>>
> >>>
> >>>>> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> >>>>
> >>>> I see same is done for other Intel NICs, but I wonder what would be
> >>>> the performance cost of making it dynamic, if any cost?
> >>> Currently we don't have a good idea to make it dynamic. If we use
> >>> pointer
> >> to point to different functions for 16 byte and 32 byte, there's too
> >> much duplicate code to make it hard to maintain. If we use the same
> >> function, and check the configure in it. It impacts the performance.
> >>
> >> Have you done some measurements, what would be the performance
> >> impact?
> > I mean if we check the configuration is 16 byte or 32 byte, this check will
> consume extra CPU cycles.
> > That why I think the better way is to have different paths for 16 byte and
> 32 byte. We should choose the appropriate path at the beginning.
> >
> >>
> >>> As HW does not support to change the configuration dynamically. The
> >> device must be stopped and restarted if the configuration is changed.
> >> It's not very helpful to make it a dynamic configuration. We assume
> >> that the users can make their choice at the beginning and will not change
> it.
> >>
> >> The problem is that the user has to recompile to switch between the
> >> two configurations. And it may not be an option for the user if he
> >> uses dpdk packaged by a distribution, for example.
> >>
> >> Maybe I was not clear, but I don't mean to be able to switch mode
> >> while the port is started. I think it would be better to make it
> >> possible to switch mode at application startup time.
> > Yes, I understand the problem is the recompiling. But we think the users
> will not change it after they made decision. That's why's acceptable in
> previous drivers.
> 
> The problem is that the user may not be able to change it, if he does not get
> DPDK from source but from a distribution like Debian, Ubuntu or Red Hat.
> 
> In this case, it means the user has no choice than sticking to 32 bytes
> descriptors.
Normally using 32 bytes is the default behavior and it's good to do that.
But I have to say I don't quite understand the scenario. DPDK is open source; whatever OS users are running, nothing prevents them from going to the DPDK website to get the code and customize it.

> 
> > Agree it's better to remove all the compile configuration. Looks like that's
> what we're trying to do. We'd like to think about how to optimize it later.
> 
> My suggestion would be a devarg, so that you can have a per-port policy
> (which is another advantage of doing so).
We're thinking about moving some configuration from per port to per queue.
In my opinion, this may also be a case where it's better to make it a per-queue parameter.
Obviously it’s an API change. So we have to be slow and careful :)

> 
> >
> >
> >>
> >>>
> >>>>
> >>>> Having it dynamic (as a dev arg for instance) would make it
> >>>> possible to change the value when the user is using dpdk from a
> >>>> distro. It would also help testing coverage.
> >>>>
> >>>> Btw, how do you select this option with meson build system?
> >>> Not very familiar with meson. As I know, we can change the
> >>> meson.build
> >> to add the configure.
> >>>
> >>
> >> Ok, then please try to do it, because the legacy build system is
> >> going to be deprecated.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-27  0:56               ` Lu, Wenzhuo
@ 2019-03-27  7:50                 ` Maxime Coquelin
  2019-03-28  1:56                   ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Maxime Coquelin @ 2019-03-27  7:50 UTC (permalink / raw)
  To: Lu, Wenzhuo, dev



On 3/27/19 1:56 AM, Lu, Wenzhuo wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Tuesday, March 26, 2019 5:29 PM
>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector
>>
>> Hi,
>>
>> On 3/26/19 2:00 AM, Lu, Wenzhuo wrote:
>>> Hi Maxime,
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Monday, March 25, 2019 4:26 PM
>>>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
>>>> vector
>>>>
>>>> Hi,
>>>>
>>>> On 3/25/19 3:22 AM, Lu, Wenzhuo wrote:
>>>>> Hi Maxime,
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>> Sent: Friday, March 22, 2019 6:12 PM
>>>>>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
>>>>>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
>>>>>> vector
>>>>>
>>>>>
>>>>>>> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
>>>>>>
>>>>>> I see same is done for other Intel NICs, but I wonder what would be
>>>>>> the performance cost of making it dynamic, if any cost?
>>>>> Currently we don't have a good idea to make it dynamic. If we use
>>>>> pointer
>>>> to point to different functions for 16 byte and 32 byte, there's too
>>>> much duplicate code to make it hard to maintain. If we use the same
>>>> function, and check the configure in it. It impacts the performance.
>>>>
>>>> Have you done some measurements, what would be the performance
>>>> impact?
>>> I mean if we check the configuration is 16 byte or 32 byte, this check will
>> consume extra CPU cycles.
>>> That why I think the better way is to have different paths for 16 byte and
>> 32 byte. We should choose the appropriate path at the beginning.
>>>
>>>>
>>>>> As HW does not support to change the configuration dynamically. The
>>>> device must be stopped and restarted if the configuration is changed.
>>>> It's not very helpful to make it a dynamic configuration. We assume
>>>> that the users can make their choice at the beginning and will not change
>> it.
>>>>
>>>> The problem is that the user has to recompile to switch between the
>>>> two configurations. And it may not be an option for the user if he
>>>> uses dpdk packaged by a distribution, for example.
>>>>
>>>> Maybe I was not clear, but I don't mean to be able to switch mode
>>>> while the port is started. I think it would be better to make it
>>>> possible to switch mode at application startup time.
>>> Yes, I understand the problem is the recompiling. But we think the users
>> will not change it after they made decision. That's why's acceptable in
>> previous drivers.
>>
>> The problem is that the user may not be able to change it, if he does not get
>> DPDK from source but from a distribution like Debian, Ubuntu or Red Hat.
>>
>> In this case, it means the user has no choice than sticking to 32 bytes
>> descriptors.
> Normally using 32 bytes is the default behavior and it's good to do that.
> But I have to say I don't quite understand the scenario. DPDK is open source, whatever OS that users are using, nothing prevents them going to dpdk website to get the code and customize it.

The user may prefer to use the distribution package for several reasons,
such as not losing the support he pays the distributor for by recompiling
the package, or losing the benefit of the validation done by the
distributor on the pre-built package.

For example, would it make sense to fix the queue size at build time
instead of using the --txd/--rxd run-time parameters to save a few
cycles here and there? I think not.

> 
>>
>>> Agree it's better to remove all the compile configuration. Looks like that's
>> what we're trying to do. We'd like to think about how to optimize it later.
>>
>> My suggestion would be a devarg, so that you can have a per-port policy
>> (which is another advantage of doing so).
> We're thinking about moving some configuration from per port to per queue.
> To my opinion, it's also a case that maybe it’s better to make it a queue's parameter.
> Obviously it’s an API change. So we have to be slow and careful :)

Having it per queue would be even better, but yes, it would certainly
mean an API change.

>>
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Having it dynamic (as a dev arg for instance) would make it
>>>>>> possible to change the value when the user is using dpdk from a
>>>>>> distro. It would also help testing coverage.
>>>>>>
>>>>>> Btw, how do you select this option with meson build system?
>>>>> Not very familiar with meson. As I know, we can change the
>>>>> meson.build
>>>> to add the configure.
>>>>>
>>>>
>>>> Ok, then please try to do it, because the legacy build system is
>>>> going to be deprecated.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v5 6/8] net/ice: support Rx AVX2 vector
  2019-03-27  7:50                 ` Maxime Coquelin
@ 2019-03-28  1:56                   ` Lu, Wenzhuo
  0 siblings, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-03-28  1:56 UTC (permalink / raw)
  To: Maxime Coquelin, dev

Hi Maxime,


> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Wednesday, March 27, 2019 3:50 PM
> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2 vector
> 
> 
> 
> On 3/27/19 1:56 AM, Lu, Wenzhuo wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Tuesday, March 26, 2019 5:29 PM
> >> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
> >> vector
> >>
> >> Hi,
> >>
> >> On 3/26/19 2:00 AM, Lu, Wenzhuo wrote:
> >>> Hi Maxime,
> >>>
> >>>> -----Original Message-----
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Monday, March 25, 2019 4:26 PM
> >>>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >>>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
> >>>> vector
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 3/25/19 3:22 AM, Lu, Wenzhuo wrote:
> >>>>> Hi Maxime,
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>>>> Sent: Friday, March 22, 2019 6:12 PM
> >>>>>> To: Lu, Wenzhuo <wenzhuo.lu@intel.com>; dev@dpdk.org
> >>>>>> Subject: Re: [dpdk-dev] [PATCH v5 6/8] net/ice: support Rx AVX2
> >>>>>> vector
> >>>>>
> >>>>>
> >>>>>>> +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC
> >>>>>>
> >>>>>> I see same is done for other Intel NICs, but I wonder what would
> >>>>>> be the performance cost of making it dynamic, if any cost?
> >>>>> Currently we don't have a good idea to make it dynamic. If we use
> >>>>> pointer
> >>>> to point to different functions for 16 byte and 32 byte, there's
> >>>> too much duplicate code to make it hard to maintain. If we use the
> >>>> same function, and check the configure in it. It impacts the
> performance.
> >>>>
> >>>> Have you done some measurements, what would be the performance
> >>>> impact?
> >>> I mean if we check the configuration is 16 byte or 32 byte, this
> >>> check will
> >> consume extra CPU cycles.
> >>> That why I think the better way is to have different paths for 16
> >>> byte and
> >> 32 byte. We should choose the appropriate path at the beginning.
> >>>
> >>>>
> >>>>> As HW does not support to change the configuration dynamically.
> >>>>> The
> >>>> device must be stopped and restarted if the configuration is changed.
> >>>> It's not very helpful to make it a dynamic configuration. We assume
> >>>> that the users can make their choice at the beginning and will not
> >>>> change
> >> it.
> >>>>
> >>>> The problem is that the user has to recompile to switch between the
> >>>> two configurations. And it may not be an option for the user if he
> >>>> uses dpdk packaged by a distribution, for example.
> >>>>
> >>>> Maybe I was not clear, but I don't mean to be able to switch mode
> >>>> while the port is started. I think it would be better to make it
> >>>> possible to switch mode at application startup time.
> >>> Yes, I understand the problem is the recompiling. But we think the
> >>> users
> >> will not change it after they made decision. That's why's acceptable
> >> in previous drivers.
> >>
> >> The problem is that the user may not be able to change it, if he does
> >> not get DPDK from source but from a distribution like Debian, Ubuntu or
> Red Hat.
> >>
> >> In this case, it means the user has no choice than sticking to 32
> >> bytes descriptors.
> > Normally using 32 bytes is the default behavior and it's good to do that.
> > But I have to say I don't quite understand the scenario. DPDK is open
> source, whatever OS that users are using, nothing prevents them going to
> dpdk website to get the code and customize it.
> 
> The user may prefer to use the distribution package for several reasons.
> Like not loosing the support he pays to the distributor by recompiling the
> package, or also not benefiting from the validation done by the distributor
> on the pre-built package.
Thanks for sharing the info about the deployment, good to know that. I have to admit I have always been looking at the problem from the developer's point of view.

> 
> For example, would it make sense to fix the queue size at build time instead
> of using the --txd/--rxd run-time paramaters to save a few cycles here and
> there? I think not.
Agree, we have to balance them if we cannot make it perfect. We'll think about how to optimize it.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-03-26  9:50   ` [PATCH v7 0/8] Support vector instructions on ICE Ferruh Yigit
@ 2019-03-31 15:52     ` Thomas Monjalon
  2019-04-01  5:46       ` Lu, Wenzhuo
  2019-04-01 12:51       ` Ferruh Yigit
  0 siblings, 2 replies; 121+ messages in thread
From: Thomas Monjalon @ 2019-03-31 15:52 UTC (permalink / raw)
  To: Ferruh Yigit, Wenzhuo Lu, Qi Zhang; +Cc: dev, cathal.ohare, john.mcnamara

26/03/2019 10:50, Ferruh Yigit:
> > Wenzhuo Lu (8):
> >   net/ice: fix Tx function setting
> >   net/ice: add pointer for queue buffer release
> >   net/ice: support vector SSE in RX
> >   net/ice: support Rx scatter SSE vector
> >   net/ice: support Tx SSE vector
> >   net/ice: support Rx AVX2 vector
> >   net/ice: support Rx scatter AVX2 vector
> >   net/ice: support vector AVX2 in TX
> 
> This version (v7) pulled from next-net-intel to next-net.

I assume these patches have been tested, or at least compiled.
However, when running devtools/test-meson-builds.sh, there is a
compilation error for build-x86-default:

In file included from ../drivers/net/ice/ice_ethdev.h:10:
rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found

It can be fixed in
	net/ice: support Rx AVX2 vector
by adding static_rte_pci and static_rte_bus_pci to the dependencies.
I fixed it even better in
	net/ice: support vector SSE in Rx
by replacing the useless include of rte_ethdev_pci.h in ice_ethdev.h
with rte_ethdev_driver.h.
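
For reference, the first fix is roughly the following change to the AVX2
fallback block in drivers/net/ice/meson.build (a sketch only, the exact
dependency list may be adjusted):

        ice_avx2_lib = static_library('ice_avx2_lib',
                        'ice_rxtx_vec_avx2.c',
                        dependencies: [static_rte_ethdev, static_rte_kvargs,
                                static_rte_hash, static_rte_pci,
                                static_rte_bus_pci],
                        include_directories: includes,
                        c_args: [cflags, '-mavx2'])
        objs += ice_avx2_lib.extract_objects('ice_rxtx_vec_avx2.c')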

I could just reject the next-net tree, but I don't really have such an option
if we want to close 19.05-rc1 quickly.

In summary, I am spending my Sunday hours fixing the mess in your driver,
which was supposed to be tested before submission, then before the merge in
next-net-intel, and compile-tested again before the pull in next-net.
I don't know what failed in the process, but I really don't like it.
I don't want to see any new patch for ice PMD in 19.05 cycle.
If you really need some fixes in 19.05 (very likely given the mass
code drop you are doing a few days before the -rc1 deadline),
then I advise you to double check everything and make commits fully
justified and explained.

Sorry for the bad mood, and I hope it won't happen again soon.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-03-31 15:52     ` Thomas Monjalon
@ 2019-04-01  5:46       ` Lu, Wenzhuo
  2019-04-01 12:51       ` Ferruh Yigit
  1 sibling, 0 replies; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-04-01  5:46 UTC (permalink / raw)
  To: Thomas Monjalon, Yigit, Ferruh, Zhang, Qi Z
  Cc: dev, O'Hare, Cathal, Mcnamara, John

Hi Thomas,


> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Sunday, March 31, 2019 11:52 PM
> To: Yigit, Ferruh <ferruh.yigit@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; O'Hare, Cathal <cathal.ohare@intel.com>; Mcnamara,
> John <john.mcnamara@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 0/8] Support vector instructions on ICE
> 
> 26/03/2019 10:50, Ferruh Yigit:
> > > Wenzhuo Lu (8):
> > >   net/ice: fix Tx function setting
> > >   net/ice: add pointer for queue buffer release
> > >   net/ice: support vector SSE in RX
> > >   net/ice: support Rx scatter SSE vector
> > >   net/ice: support Tx SSE vector
> > >   net/ice: support Rx AVX2 vector
> > >   net/ice: support Rx scatter AVX2 vector
> > >   net/ice: support vector AVX2 in TX
> >
> > This version (v7) pulled from next-net-intel to next-net.
> 
> I assume these patches have been tested, or at least compiled.
> However, when running devtools/test-meson-builds.sh, there is a
> compilation error for build-x86-default:
> 
> In file included from ../drivers/net/ice/ice_ethdev.h:10:
> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> 
> It can be fixed in
> 	net/ice: support Rx AVX2 vector
> by adding static_rte_pci and static_rte_bus_pci to the dependencies.
> I fixed it even better in
> 	net/ice: support vector SSE in Rx
> by replacing the useless include of rte_ethdev_pci.h in ice_ethdev.h with
> rte_ethdev_driver.h.
Really sorry for this, although I don't understand the issue. I do use the meson build and it works.
On my server, it works fine no matter whether "rte_ethdev_pci.h" or "rte_ethdev_driver.h" is included.
To be honest, the compile error looks weird to me. It seems any file which includes "rte_ethdev_pci.h" could hit the same problem, but I cannot tell more, as I cannot reproduce the error.
Again, I really appreciate you root-causing and fixing the error rather than rejecting the patches.

> 
> I could just reject the next-net tree, but I don't really have such option if we
> want to close 19.05-rc1 quickly.
> 
> In summary, I am spending my Sunday hours to fix the mess in your driver
> which was supposed to be tested before submitting, plus before merge in
> next-net-intel, plus compilation-tested before pull in next-net.
> I don't know what failed in the process, but I really don't like it.
> I don't want to see any new patch for ice PMD in 19.05 cycle.
> If you really need some fixes in 19.05 (very likely given the mass code drop
> you are doing few days before the -rc1 deadline), then I advise you to
> double check everything and make commits fully justified and explained.
> 
> Sorry for the bad mood, and I hope it won't happen again soon.
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-03-31 15:52     ` Thomas Monjalon
  2019-04-01  5:46       ` Lu, Wenzhuo
@ 2019-04-01 12:51       ` Ferruh Yigit
  2019-04-01 13:27         ` Thomas Monjalon
  2019-04-01 14:39         ` Bruce Richardson
  1 sibling, 2 replies; 121+ messages in thread
From: Ferruh Yigit @ 2019-04-01 12:51 UTC (permalink / raw)
  To: Thomas Monjalon, Wenzhuo Lu, Qi Zhang; +Cc: dev, cathal.ohare, john.mcnamara

On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> 26/03/2019 10:50, Ferruh Yigit:
>>> Wenzhuo Lu (8):
>>>   net/ice: fix Tx function setting
>>>   net/ice: add pointer for queue buffer release
>>>   net/ice: support vector SSE in RX
>>>   net/ice: support Rx scatter SSE vector
>>>   net/ice: support Tx SSE vector
>>>   net/ice: support Rx AVX2 vector
>>>   net/ice: support Rx scatter AVX2 vector
>>>   net/ice: support vector AVX2 in TX
>>
>> This version (v7) pulled from next-net-intel to next-net.
> 
> I assume these patches have been tested, or at least compiled.
> However, when running devtools/test-meson-builds.sh, there is a
> compilation error for build-x86-default:
> 
> In file included from ../drivers/net/ice/ice_ethdev.h:10:
> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found

I tested this with meson but was not able to catch the issue. Perhaps in my
case the dependencies were built fast enough that the problem did not show up.

> 
> It can be fixed in
> 	net/ice: support Rx AVX2 vector
> by adding static_rte_pci and static_rte_bus_pci to the dependencies.
> I fixed it even better in
> 	net/ice: support vector SSE in Rx
> by replacing the useless include of rte_ethdev_pci.h in ice_ethdev.h
> with rte_ethdev_driver.h.

Thanks.

> 
> I could just reject the next-net tree, but I don't really have such option
> if we want to close 19.05-rc1 quickly.
> 
> In summary, I am spending my Sunday hours to fix the mess in your driver
> which was supposed to be tested before submitting, plus before merge in
> next-net-intel, plus compilation-tested before pull in next-net.
> I don't know what failed in the process, but I really don't like it.
> I don't want to see any new patch for ice PMD in 19.05 cycle.
> If you really need some fixes in 19.05 (very likely given the mass
> code drop you are doing few days before the -rc1 deadline),
> then I advise you to double check everything and make commits fully
> justified and explained.
> 
> Sorry for the bad mood, and I hope it won't happen again soon.
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 12:51       ` Ferruh Yigit
@ 2019-04-01 13:27         ` Thomas Monjalon
  2019-04-01 15:12           ` Ferruh Yigit
  2019-04-01 14:39         ` Bruce Richardson
  1 sibling, 1 reply; 121+ messages in thread
From: Thomas Monjalon @ 2019-04-01 13:27 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

01/04/2019 14:51, Ferruh Yigit:
> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> > 26/03/2019 10:50, Ferruh Yigit:
> >>> Wenzhuo Lu (8):
> >>>   net/ice: fix Tx function setting
> >>>   net/ice: add pointer for queue buffer release
> >>>   net/ice: support vector SSE in RX
> >>>   net/ice: support Rx scatter SSE vector
> >>>   net/ice: support Tx SSE vector
> >>>   net/ice: support Rx AVX2 vector
> >>>   net/ice: support Rx scatter AVX2 vector
> >>>   net/ice: support vector AVX2 in TX
> >>
> >> This version (v7) pulled from next-net-intel to next-net.
> > 
> > I assume these patches have been tested, or at least compiled.
> > However, when running devtools/test-meson-builds.sh, there is a
> > compilation error for build-x86-default:
> > 
> > In file included from ../drivers/net/ice/ice_ethdev.h:10:
> > rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> 
> I tested this with meson but not able to catch the issue. Perhaps for my case
> dependencies were build fast enough to cause a problem.

No, it's not a matter of speed.
Are you running devtools/test-meson-builds.sh?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 12:51       ` Ferruh Yigit
  2019-04-01 13:27         ` Thomas Monjalon
@ 2019-04-01 14:39         ` Bruce Richardson
  2019-04-01 14:56           ` Ferruh Yigit
  1 sibling, 1 reply; 121+ messages in thread
From: Bruce Richardson @ 2019-04-01 14:39 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Thomas Monjalon, Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

On Mon, Apr 01, 2019 at 01:51:38PM +0100, Ferruh Yigit wrote:
> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> > 26/03/2019 10:50, Ferruh Yigit:
> >>> Wenzhuo Lu (8):
> >>>   net/ice: fix Tx function setting
> >>>   net/ice: add pointer for queue buffer release
> >>>   net/ice: support vector SSE in RX
> >>>   net/ice: support Rx scatter SSE vector
> >>>   net/ice: support Tx SSE vector
> >>>   net/ice: support Rx AVX2 vector
> >>>   net/ice: support Rx scatter AVX2 vector
> >>>   net/ice: support vector AVX2 in TX
> >>
> >> This version (v7) pulled from next-net-intel to next-net.
> > 
> > I assume these patches have been tested, or at least compiled.
> > However, when running devtools/test-meson-builds.sh, there is a
> > compilation error for build-x86-default:
> > 
> > In file included from ../drivers/net/ice/ice_ethdev.h:10:
> > rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> 
> I tested this with meson but not able to catch the issue. Perhaps for my case
> dependencies were build fast enough to cause a problem.
> 

That shouldn't be a problem with the meson builds. While with make builds, the
header files are picked up after they are copied to the "include"
directory by the build process, in meson no such copying occurs and the
header files are picked up by having the paths to them passed in the
"dependency object" to each build. If the dependency does not exist then
the build will never pass, irrespective of ordering, and if the dependency
exists, the build will always find the header in its original location.

[The biggest benefit of this is that when building with ninja there are no
dependencies between the individual .c files - each one can be compiled
in parallel with all the others. It's only at the linking step that we need
to wait for previous jobs to complete]
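
As a generic illustration (not the ice code), the header search paths come
from the dependency objects passed to each target, so nothing is copied and
nothing depends on build order:

        # static_rte_ethdev is a declare_dependency() object carrying the
        # include paths for rte_ethdev*.h plus ethdev's own dependencies
        example_lib = static_library('example', 'example.c',
                        dependencies: [static_rte_ethdev])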

In terms of this specific error with the header - did it get root caused?
Since it occurs on the "default" path, I'd suggest the fallback handling in
the meson.build file for the absence of AVX may be faulty, e.g. are you
replacing c flags or dependencies rather than appending to them?
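
For instance, a fallback block normally wants to append, something like
(the variable names below are only illustrative):

        avx2_cflags = cflags + ['-mavx2']
        avx2_deps = base_deps + [static_rte_pci, static_rte_bus_pci]

rather than starting again from a bare ['-mavx2'] or from a shortened
dependency list.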

/Bruce

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 14:39         ` Bruce Richardson
@ 2019-04-01 14:56           ` Ferruh Yigit
  2019-04-01 15:09             ` Ferruh Yigit
  2019-04-01 15:13             ` Thomas Monjalon
  0 siblings, 2 replies; 121+ messages in thread
From: Ferruh Yigit @ 2019-04-01 14:56 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Thomas Monjalon, Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

On 4/1/2019 3:39 PM, Bruce Richardson wrote:
> On Mon, Apr 01, 2019 at 01:51:38PM +0100, Ferruh Yigit wrote:
>> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
>>> 26/03/2019 10:50, Ferruh Yigit:
>>>>> Wenzhuo Lu (8):
>>>>>   net/ice: fix Tx function setting
>>>>>   net/ice: add pointer for queue buffer release
>>>>>   net/ice: support vector SSE in RX
>>>>>   net/ice: support Rx scatter SSE vector
>>>>>   net/ice: support Tx SSE vector
>>>>>   net/ice: support Rx AVX2 vector
>>>>>   net/ice: support Rx scatter AVX2 vector
>>>>>   net/ice: support vector AVX2 in TX
>>>>
>>>> This version (v7) pulled from next-net-intel to next-net.
>>>
>>> I assume these patches have been tested, or at least compiled.
>>> However, when running devtools/test-meson-builds.sh, there is a
>>> compilation error for build-x86-default:
>>>
>>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
>>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
>>
>> I tested this with meson but not able to catch the issue. Perhaps for my case
>> dependencies were build fast enough to cause a problem.
>>
> 
> That should be a problem with the meson builds. While with make builds, the
> headers files are picked up after they are copied to the "include"
> directory by the build process, in meson no such copying occurs and the
> header files are picked up by having the paths to them passed in the
> "dependency object" to each build. If the dependency does not exist then
> the build will never pass, irrespective of ordering, and if the dependency
> exists, the build will always find the header in its original location.

I was checking this and realized that no copying is happening. And I can see
many PMDs are using this header [1]; not sure why ice is failing.

> 
> [The biggest benefit of this is that when building with ninja there are no
> dependencies between the individual .c files - each one can be compiled
> in parallel with all the others. It's only at the linking step that we need
> to wait for previous jobs to complete]
> 
> In terms of this specific error with the header - did it get root caused?
> Since it occurs on the "default" path, I'd suggest the fallback handling in
> the meson.build file for the absense of AVX may be faulty, e.g. are you
> replacing c flags or dependencies rather than appending to them?

Trying to find out the root cause, but as you said it occurs on the 'default'
path only, and taking into account that dependent headers are not copied,
I have not been able to find it yet; still checking.

[1]
$ git grep  rte_ethdev_pci.h



drivers/net/ark/ark_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/atlantic/atl_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/avp/avp_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/axgbe/axgbe_common.h:#include <rte_ethdev_pci.h>
drivers/net/bnx2x/bnx2x_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/bnxt/bnxt_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/cxgbe/cxgbe_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/cxgbe/cxgbe_main.c:#include <rte_ethdev_pci.h>
drivers/net/cxgbe/cxgbevf_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/cxgbe/cxgbevf_main.c:#include <rte_ethdev_pci.h>
drivers/net/e1000/em_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/e1000/igb_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/e1000/igb_flow.c:#include <rte_ethdev_pci.h>
drivers/net/ena/ena_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/enetc/enetc_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/enic/enic_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/fm10k/fm10k_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/i40e/i40e_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/i40e/i40e_ethdev_vf.c:#include <rte_ethdev_pci.h>
drivers/net/iavf/iavf_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/ice/ice_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/ixgbe/ixgbe_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/ixgbe/ixgbe_ipsec.c:#include <rte_ethdev_pci.h>
drivers/net/liquidio/lio_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/mlx4/mlx4.c:#include <rte_ethdev_pci.h>
drivers/net/mlx5/mlx5.c:#include <rte_ethdev_pci.h>
drivers/net/nfp/nfp_net.c:#include <rte_ethdev_pci.h>
drivers/net/nfp/nfpcore/nfp_cpp.h:#include <rte_ethdev_pci.h>
drivers/net/nfp/nfpcore/nfp_cpp_pcie_ops.c:#include <rte_ethdev_pci.h>
drivers/net/nfp/nfpcore/nfp_cppcore.c:#include <rte_ethdev_pci.h>
drivers/net/qede/qede_ethdev.h:#include <rte_ethdev_pci.h>
drivers/net/sfc/sfc_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/szedata2/rte_eth_szedata2.c:#include <rte_ethdev_pci.h>
drivers/net/thunderx/nicvf_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/virtio/virtio_ethdev.c:#include <rte_ethdev_pci.h>
drivers/net/vmxnet3/vmxnet3_ethdev.c:#include <rte_ethdev_pci.h>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 14:56           ` Ferruh Yigit
@ 2019-04-01 15:09             ` Ferruh Yigit
  2019-04-01 15:13             ` Thomas Monjalon
  1 sibling, 0 replies; 121+ messages in thread
From: Ferruh Yigit @ 2019-04-01 15:09 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Thomas Monjalon, Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

On 4/1/2019 3:56 PM, Ferruh Yigit wrote:
> On 4/1/2019 3:39 PM, Bruce Richardson wrote:
>> On Mon, Apr 01, 2019 at 01:51:38PM +0100, Ferruh Yigit wrote:
>>> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
>>>> 26/03/2019 10:50, Ferruh Yigit:
>>>>>> Wenzhuo Lu (8):
>>>>>>   net/ice: fix Tx function setting
>>>>>>   net/ice: add pointer for queue buffer release
>>>>>>   net/ice: support vector SSE in RX
>>>>>>   net/ice: support Rx scatter SSE vector
>>>>>>   net/ice: support Tx SSE vector
>>>>>>   net/ice: support Rx AVX2 vector
>>>>>>   net/ice: support Rx scatter AVX2 vector
>>>>>>   net/ice: support vector AVX2 in TX
>>>>>
>>>>> This version (v7) pulled from next-net-intel to next-net.
>>>>
>>>> I assume these patches have been tested, or at least compiled.
>>>> However, when running devtools/test-meson-builds.sh, there is a
>>>> compilation error for build-x86-default:
>>>>
>>>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
>>>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
>>>
>>> I tested this with meson but not able to catch the issue. Perhaps for my case
>>> dependencies were build fast enough to cause a problem.
>>>
>>
>> That should be a problem with the meson builds. While with make builds, the
>> headers files are picked up after they are copied to the "include"
>> directory by the build process, in meson no such copying occurs and the
>> header files are picked up by having the paths to them passed in the
>> "dependency object" to each build. If the dependency does not exist then
>> the build will never pass, irrespective of ordering, and if the dependency
>> exists, the build will always find the header in its original location.
> 
> I was checking this and recognized that no copying is happening. And I can see
> many PMDs are using this header [1], not sure why ice is failing.
> 
>>
>> [The biggest benefit of this is that when building with ninja there are no
>> dependencies between the individual .c files - each one can be compiled
>> in parallel with all the others. It's only at the linking step that we need
>> to wait for previous jobs to complete]
>>
>> In terms of this specific error with the header - did it get root caused?
>> Since it occurs on the "default" path, I'd suggest the fallback handling in
>> the meson.build file for the absense of AVX may be faulty, e.g. are you
>> replacing c flags or dependencies rather than appending to them?
> 
> Trying to find out the root cause, but as you said it occurs on the 'default'
> path only, and taking into account that there is not copying dependent headers,
> I am not able to find it yet, checking.

This is the meson fallback handling, as Bruce guessed ...

> 
> [1]
> $ git grep  rte_ethdev_pci.h
> 
> 
> 
> drivers/net/ark/ark_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/atlantic/atl_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/avp/avp_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/axgbe/axgbe_common.h:#include <rte_ethdev_pci.h>
> drivers/net/bnx2x/bnx2x_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/bnxt/bnxt_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/cxgbe/cxgbe_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/cxgbe/cxgbe_main.c:#include <rte_ethdev_pci.h>
> drivers/net/cxgbe/cxgbevf_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/cxgbe/cxgbevf_main.c:#include <rte_ethdev_pci.h>
> drivers/net/e1000/em_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/e1000/igb_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/e1000/igb_flow.c:#include <rte_ethdev_pci.h>
> drivers/net/ena/ena_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/enetc/enetc_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/enic/enic_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/fm10k/fm10k_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/i40e/i40e_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/i40e/i40e_ethdev_vf.c:#include <rte_ethdev_pci.h>
> drivers/net/iavf/iavf_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/ice/ice_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/ixgbe/ixgbe_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/ixgbe/ixgbe_ipsec.c:#include <rte_ethdev_pci.h>
> drivers/net/liquidio/lio_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/mlx4/mlx4.c:#include <rte_ethdev_pci.h>
> drivers/net/mlx5/mlx5.c:#include <rte_ethdev_pci.h>
> drivers/net/nfp/nfp_net.c:#include <rte_ethdev_pci.h>
> drivers/net/nfp/nfpcore/nfp_cpp.h:#include <rte_ethdev_pci.h>
> drivers/net/nfp/nfpcore/nfp_cpp_pcie_ops.c:#include <rte_ethdev_pci.h>
> drivers/net/nfp/nfpcore/nfp_cppcore.c:#include <rte_ethdev_pci.h>
> drivers/net/qede/qede_ethdev.h:#include <rte_ethdev_pci.h>
> drivers/net/sfc/sfc_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/szedata2/rte_eth_szedata2.c:#include <rte_ethdev_pci.h>
> drivers/net/thunderx/nicvf_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/virtio/virtio_ethdev.c:#include <rte_ethdev_pci.h>
> drivers/net/vmxnet3/vmxnet3_ethdev.c:#include <rte_ethdev_pci.h>
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 13:27         ` Thomas Monjalon
@ 2019-04-01 15:12           ` Ferruh Yigit
  2019-04-01 15:14             ` Thomas Monjalon
  0 siblings, 1 reply; 121+ messages in thread
From: Ferruh Yigit @ 2019-04-01 15:12 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

On 4/1/2019 2:27 PM, Thomas Monjalon wrote:
> 01/04/2019 14:51, Ferruh Yigit:
>> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
>>> 26/03/2019 10:50, Ferruh Yigit:
>>>>> Wenzhuo Lu (8):
>>>>>   net/ice: fix Tx function setting
>>>>>   net/ice: add pointer for queue buffer release
>>>>>   net/ice: support vector SSE in RX
>>>>>   net/ice: support Rx scatter SSE vector
>>>>>   net/ice: support Tx SSE vector
>>>>>   net/ice: support Rx AVX2 vector
>>>>>   net/ice: support Rx scatter AVX2 vector
>>>>>   net/ice: support vector AVX2 in TX
>>>>
>>>> This version (v7) pulled from next-net-intel to next-net.
>>>
>>> I assume these patches have been tested, or at least compiled.
>>> However, when running devtools/test-meson-builds.sh, there is a
>>> compilation error for build-x86-default:
>>>
>>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
>>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
>>
>> I tested this with meson but not able to catch the issue. Perhaps for my case
>> dependencies were build fast enough to cause a problem.
> 
> No, it's not a matter of speed.

No, it is not.

> Are you running devtools/test-meson-builds.sh ?

No, I am using another script, but it does not build the 'default' architecture
with meson, which is where the problem happens; that is why I was not able to
catch this. I will try to catch it next time.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 14:56           ` Ferruh Yigit
  2019-04-01 15:09             ` Ferruh Yigit
@ 2019-04-01 15:13             ` Thomas Monjalon
  1 sibling, 0 replies; 121+ messages in thread
From: Thomas Monjalon @ 2019-04-01 15:13 UTC (permalink / raw)
  To: Ferruh Yigit, Bruce Richardson
  Cc: Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

01/04/2019 16:56, Ferruh Yigit:
> On 4/1/2019 3:39 PM, Bruce Richardson wrote:
> > On Mon, Apr 01, 2019 at 01:51:38PM +0100, Ferruh Yigit wrote:
> >> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> >>> 26/03/2019 10:50, Ferruh Yigit:
> >>>>> Wenzhuo Lu (8):
> >>>>>   net/ice: fix Tx function setting
> >>>>>   net/ice: add pointer for queue buffer release
> >>>>>   net/ice: support vector SSE in RX
> >>>>>   net/ice: support Rx scatter SSE vector
> >>>>>   net/ice: support Tx SSE vector
> >>>>>   net/ice: support Rx AVX2 vector
> >>>>>   net/ice: support Rx scatter AVX2 vector
> >>>>>   net/ice: support vector AVX2 in TX
> >>>>
> >>>> This version (v7) pulled from next-net-intel to next-net.
> >>>
> >>> I assume these patches have been tested, or at least compiled.
> >>> However, when running devtools/test-meson-builds.sh, there is a
> >>> compilation error for build-x86-default:
> >>>
> >>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
> >>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> >>
> >> I tested this with meson but not able to catch the issue. Perhaps for my case
> >> dependencies were build fast enough to cause a problem.
> > 
> > That should be a problem with the meson builds. While with make builds, the
> > headers files are picked up after they are copied to the "include"
> > directory by the build process, in meson no such copying occurs and the
> > header files are picked up by having the paths to them passed in the
> > "dependency object" to each build. If the dependency does not exist then
> > the build will never pass, irrespective of ordering, and if the dependency
> > exists, the build will always find the header in its original location.
> 
> I was checking this and recognized that no copying is happening. And I can see
> many PMDs are using this header [1], not sure why ice is failing.
> 
> > 
> > [The biggest benefit of this is that when building with ninja there are no
> > dependencies between the individual .c files - each one can be compiled
> > in parallel with all the others. It's only at the linking step that we need
> > to wait for previous jobs to complete]
> > 
> > In terms of this specific error with the header - did it get root caused?
> > Since it occurs on the "default" path, I'd suggest the fallback handling in
> > the meson.build file for the absense of AVX may be faulty, e.g. are you
> > replacing c flags or dependencies rather than appending to them?
> 
> Trying to find out the root cause, but as you said it occurs on the 'default'
> path only, and taking into account that there is not copying dependent headers,
> I am not able to find it yet, checking.

Guys! Are you kidding me?
I already described the root cause and the possible fixes:
	http://mails.dpdk.org/archives/dev/2019-March/128375.html
This is in the case of AVX2 being chosen at runtime.
rte_ethdev_pci.h is included in a header file.
The dependencies object is missing static_rte_pci and static_rte_bus_pci.
I chose to just use rte_ethdev_driver.h instead.
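
As a hedged illustration only (assumed variable names, not the committed
change), the two fix directions above look roughly like this:

	# Option 1: append -- do not replace -- the missing PCI dependency
	# objects on the fallback AVX2 library. 'avx2_deps' is a hypothetical
	# name for whatever list is passed as 'dependencies:' today.
	avx2_deps += [static_rte_pci, static_rte_bus_pci]

	# Option 2 (the one chosen here): stop pulling in rte_pci.h at all by
	# including rte_ethdev_driver.h instead of rte_ethdev_pci.h in
	# ice_ethdev.h, so the vector build no longer needs the PCI headers.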

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 15:12           ` Ferruh Yigit
@ 2019-04-01 15:14             ` Thomas Monjalon
  2019-04-02  1:01               ` Lu, Wenzhuo
  0 siblings, 1 reply; 121+ messages in thread
From: Thomas Monjalon @ 2019-04-01 15:14 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: Wenzhuo Lu, Qi Zhang, dev, cathal.ohare, john.mcnamara

01/04/2019 17:12, Ferruh Yigit:
> On 4/1/2019 2:27 PM, Thomas Monjalon wrote:
> > 01/04/2019 14:51, Ferruh Yigit:
> >> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> >>> 26/03/2019 10:50, Ferruh Yigit:
> >>>>> Wenzhuo Lu (8):
> >>>>>   net/ice: fix Tx function setting
> >>>>>   net/ice: add pointer for queue buffer release
> >>>>>   net/ice: support vector SSE in RX
> >>>>>   net/ice: support Rx scatter SSE vector
> >>>>>   net/ice: support Tx SSE vector
> >>>>>   net/ice: support Rx AVX2 vector
> >>>>>   net/ice: support Rx scatter AVX2 vector
> >>>>>   net/ice: support vector AVX2 in TX
> >>>>
> >>>> This version (v7) pulled from next-net-intel to next-net.
> >>>
> >>> I assume these patches have been tested, or at least compiled.
> >>> However, when running devtools/test-meson-builds.sh, there is a
> >>> compilation error for build-x86-default:
> >>>
> >>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
> >>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> >>
> >> I tested this with meson but not able to catch the issue. Perhaps for my case
> >> dependencies were build fast enough to cause a problem.
> > 
> > No, it's not a matter of speed.
> 
> No it is not,
> 
> > Are you running devtools/test-meson-builds.sh ?
> 
> No, I am using with another script but not building 'default' architecture with
> meson where problem happens, that is why not able to catch this. Will try to
> catch next time..

Ferruh, why not use the scripts we are building in devtools?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-01 15:14             ` Thomas Monjalon
@ 2019-04-02  1:01               ` Lu, Wenzhuo
  2019-04-02  7:12                 ` Thomas Monjalon
  0 siblings, 1 reply; 121+ messages in thread
From: Lu, Wenzhuo @ 2019-04-02  1:01 UTC (permalink / raw)
  To: Thomas Monjalon, Yigit, Ferruh
  Cc: Zhang, Qi Z, dev, O'Hare, Cathal, Mcnamara, John

Hi Thomas,


> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Monday, April 1, 2019 11:14 PM
> To: Yigit, Ferruh <ferruh.yigit@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Qi Z
> <qi.z.zhang@intel.com>; dev@dpdk.org; O'Hare, Cathal
> <cathal.ohare@intel.com>; Mcnamara, John <john.mcnamara@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 0/8] Support vector instructions on ICE
> 
> 01/04/2019 17:12, Ferruh Yigit:
> > On 4/1/2019 2:27 PM, Thomas Monjalon wrote:
> > > 01/04/2019 14:51, Ferruh Yigit:
> > >> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> > >>> 26/03/2019 10:50, Ferruh Yigit:
> > >>>>> Wenzhuo Lu (8):
> > >>>>>   net/ice: fix Tx function setting
> > >>>>>   net/ice: add pointer for queue buffer release
> > >>>>>   net/ice: support vector SSE in RX
> > >>>>>   net/ice: support Rx scatter SSE vector
> > >>>>>   net/ice: support Tx SSE vector
> > >>>>>   net/ice: support Rx AVX2 vector
> > >>>>>   net/ice: support Rx scatter AVX2 vector
> > >>>>>   net/ice: support vector AVX2 in TX
> > >>>>
> > >>>> This version (v7) pulled from next-net-intel to next-net.
> > >>>
> > >>> I assume these patches have been tested, or at least compiled.
> > >>> However, when running devtools/test-meson-builds.sh, there is a
> > >>> compilation error for build-x86-default:
> > >>>
> > >>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
> > >>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> > >>
> > >> I tested this with meson but not able to catch the issue. Perhaps
> > >> for my case dependencies were build fast enough to cause a problem.
> > >
> > > No, it's not a matter of speed.
> >
> > No it is not,
> >
> > > Are you running devtools/test-meson-builds.sh ?
> >
> > No, I am using with another script but not building 'default'
> > architecture with meson where problem happens, that is why not able to
> > catch this. Will try to catch next time..
> 
> Ferruh, why not using the scripts we are building in devtools?
I have a suggestion. Why not integrate all the compile checks here, like
http://mails.dpdk.org/archives/test-report/2019-March/077966.html? If a patch
fails, its state should be changed to "compile error" and it should not be
accepted. That would help everyone, especially the committer.
We cannot assume everyone knows everything. As for me, I have to say I am not
familiar with the meson build.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v7 0/8] Support vector instructions on ICE
  2019-04-02  1:01               ` Lu, Wenzhuo
@ 2019-04-02  7:12                 ` Thomas Monjalon
  0 siblings, 0 replies; 121+ messages in thread
From: Thomas Monjalon @ 2019-04-02  7:12 UTC (permalink / raw)
  To: Lu, Wenzhuo
  Cc: Yigit, Ferruh, Zhang, Qi Z, dev, O'Hare, Cathal, Mcnamara,
	John, bruce.richardson, qian.q.xu

02/04/2019 03:01, Lu, Wenzhuo:
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 01/04/2019 17:12, Ferruh Yigit:
> > > On 4/1/2019 2:27 PM, Thomas Monjalon wrote:
> > > > 01/04/2019 14:51, Ferruh Yigit:
> > > >> On 3/31/2019 4:52 PM, Thomas Monjalon wrote:
> > > >>> 26/03/2019 10:50, Ferruh Yigit:
> > > >>>>> Wenzhuo Lu (8):
> > > >>>>>   net/ice: fix Tx function setting
> > > >>>>>   net/ice: add pointer for queue buffer release
> > > >>>>>   net/ice: support vector SSE in RX
> > > >>>>>   net/ice: support Rx scatter SSE vector
> > > >>>>>   net/ice: support Tx SSE vector
> > > >>>>>   net/ice: support Rx AVX2 vector
> > > >>>>>   net/ice: support Rx scatter AVX2 vector
> > > >>>>>   net/ice: support vector AVX2 in TX
> > > >>>>
> > > >>>> This version (v7) pulled from next-net-intel to next-net.
> > > >>>
> > > >>> I assume these patches have been tested, or at least compiled.
> > > >>> However, when running devtools/test-meson-builds.sh, there is a
> > > >>> compilation error for build-x86-default:
> > > >>>
> > > >>> In file included from ../drivers/net/ice/ice_ethdev.h:10:
> > > >>> rte_ethdev_pci.h:38:10: fatal error: 'rte_pci.h' file not found
> > > >>
> > > >> I tested this with meson but not able to catch the issue. Perhaps
> > > >> for my case dependencies were build fast enough to cause a problem.
> > > >
> > > > No, it's not a matter of speed.
> > >
> > > No it is not,
> > >
> > > > Are you running devtools/test-meson-builds.sh ?
> > >
> > > No, I am using with another script but not building 'default'
> > > architecture with meson where problem happens, that is why not able to
> > > catch this. Will try to catch next time..
> > 
> > Ferruh, why not using the scripts we are building in devtools?
> I have a suggestion. Why not integrating all the compile check here, like http://mails.dpdk.org/archives/test-report/2019-March/077966.html.  If it fails, the patch state should be changed to "compile error". It will not be accepted. It can help everyone, especially the committer.
> We cannot assume everyone knows everything. To me, have to say not familiar with meson build.

Yes, it would help to have meson tested in CI.

^ permalink raw reply	[flat|nested] 121+ messages in thread

end of thread, other threads:[~2019-04-02  7:12 UTC | newest]

Thread overview: 121+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-28  7:48 [PATCH 0/8] Support vector instructions on ICE Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 1/8] net/ice: fix TX function setting Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 3/8] net/ice: support RX SSE vector Wenzhuo Lu
2019-03-01  3:44   ` Zhang, Qi Z
2019-03-04  1:27     ` Lu, Wenzhuo
2019-02-28  7:48 ` [PATCH 4/8] net/ice: support RX scatter " Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 5/8] net/ice: support TX " Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 6/8] net/ice: support RX AVX2 vector Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 7/8] net/ice: support RX scatter " Wenzhuo Lu
2019-02-28  7:48 ` [PATCH 8/8] net/ice: support TX " Wenzhuo Lu
2019-03-01  3:41 ` [PATCH 0/8] Support vector instructions on ICE Zhang, Qi Z
2019-03-04  1:24   ` Lu, Wenzhuo
2019-03-04  6:53 ` [PATCH v2 " Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 1/8] net/ice: fix Tx function setting Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
2019-03-11  3:26     ` Zhang, Qi Z
2019-03-15  1:50       ` Lu, Wenzhuo
2019-03-04  6:53   ` [PATCH v2 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 5/8] net/ice: support Tx " Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 7/8] net/ice: support Rx scatter " Wenzhuo Lu
2019-03-04  6:53   ` [PATCH v2 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
2019-03-15  6:22 ` [PATCH v3 0/8] Support vector instructions on ICE Wenzhuo Lu
2019-03-15  6:22   ` [PATCH v3 1/8] net/ice: fix Tx function setting Wenzhuo Lu
2019-03-15 17:52     ` Ferruh Yigit
2019-03-18  1:08       ` Lu, Wenzhuo
2019-03-20 17:22         ` Ferruh Yigit
2019-03-21  2:29           ` Lu, Wenzhuo
2019-03-15  6:22   ` [PATCH v3 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-03-15 17:52     ` Ferruh Yigit
2019-03-18  1:15       ` Lu, Wenzhuo
2019-03-15  6:22   ` [PATCH v3 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
2019-03-15 17:53     ` Ferruh Yigit
2019-03-18  1:22       ` Lu, Wenzhuo
2019-03-20 17:35         ` Ferruh Yigit
2019-03-21  2:48           ` Lu, Wenzhuo
2019-03-15  6:22   ` [PATCH v3 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
2019-03-15  6:22   ` [PATCH v3 5/8] net/ice: support Tx " Wenzhuo Lu
2019-03-15  6:22   ` [PATCH v3 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
2019-03-15 17:54     ` Ferruh Yigit
2019-03-18  1:37       ` Lu, Wenzhuo
2019-03-20 17:37         ` Ferruh Yigit
2019-03-21  2:31           ` Lu, Wenzhuo
2019-03-15  6:22   ` [PATCH v3 7/8] net/ice: support Rx scatter " Wenzhuo Lu
2019-03-15  6:22   ` [PATCH v3 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
2019-03-15 17:54     ` Ferruh Yigit
2019-03-18  1:38       ` Lu, Wenzhuo
2019-03-15  8:08   ` [PATCH v3 0/8] Support vector instructions on ICE Zhang, Qi Z
2019-03-21  6:26 ` [PATCH v4 " Wenzhuo Lu
2019-03-21  6:26   ` [PATCH v4 1/8] net/ice: fix Tx function setting Wenzhuo Lu
2019-03-22  8:46     ` Maxime Coquelin
2019-03-22  9:01       ` Maxime Coquelin
2019-03-21  6:26   ` [PATCH v4 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-03-22  8:59     ` Maxime Coquelin
2019-03-21  6:26   ` [PATCH v4 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
2019-03-21 19:02     ` Ferruh Yigit
2019-03-22  1:46       ` Lu, Wenzhuo
2019-03-21  6:26   ` [PATCH v4 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
2019-03-21  6:26   ` [PATCH v4 5/8] net/ice: support Tx " Wenzhuo Lu
2019-03-21  6:26   ` [PATCH v4 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
2019-03-21  6:26   ` [PATCH v4 7/8] net/ice: support Rx scatter " Wenzhuo Lu
2019-03-21  6:26   ` [PATCH v4 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
2019-03-21 19:20     ` Ferruh Yigit
2019-03-22  1:45       ` Lu, Wenzhuo
2019-03-22  2:58 ` [PATCH v5 0/8] Support vector instructions on ICE Wenzhuo Lu
2019-03-22  2:58   ` [PATCH v5 1/8] net/ice: fix Tx function setting Wenzhuo Lu
2019-03-22  2:58   ` [PATCH v5 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-03-22  2:58   ` [PATCH v5 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
2019-03-22  9:42     ` Maxime Coquelin
2019-03-25  1:56       ` Lu, Wenzhuo
2019-03-22  2:58   ` [PATCH v5 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
2019-03-22  2:58   ` [PATCH v5 5/8] net/ice: support Tx " Wenzhuo Lu
2019-03-22  9:58     ` Maxime Coquelin
2019-03-25  2:02       ` Lu, Wenzhuo
2019-03-22  2:58   ` [PATCH v5 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
2019-03-22 10:12     ` Maxime Coquelin
2019-03-25  2:22       ` Lu, Wenzhuo
2019-03-25  8:26         ` Maxime Coquelin
2019-03-26  1:00           ` Lu, Wenzhuo
2019-03-26  9:28             ` Maxime Coquelin
2019-03-27  0:56               ` Lu, Wenzhuo
2019-03-27  7:50                 ` Maxime Coquelin
2019-03-28  1:56                   ` Lu, Wenzhuo
2019-03-22  2:58   ` [PATCH v5 7/8] net/ice: support Rx scatter " Wenzhuo Lu
2019-03-22  2:58   ` [PATCH v5 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
2019-03-25  6:06 ` [PATCH v6 0/8] Support vector instructions on ICE Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 1/8] net/ice: fix Tx function setting Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-03-25 13:23     ` Maxime Coquelin
2019-03-26  1:15       ` Lu, Wenzhuo
2019-03-25  6:06   ` [PATCH v6 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 5/8] net/ice: support Tx " Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 7/8] net/ice: support Rx scatter " Wenzhuo Lu
2019-03-25  6:06   ` [PATCH v6 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
2019-03-25  7:56   ` [PATCH v6 0/8] Support vector instructions on ICE Zhang, Qi Z
2019-03-26  6:16 ` [PATCH v7 " Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 1/8] net/ice: fix Tx function setting Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 2/8] net/ice: add pointer for queue buffer release Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 3/8] net/ice: support vector SSE in RX Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 4/8] net/ice: support Rx scatter SSE vector Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 5/8] net/ice: support Tx " Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 6/8] net/ice: support Rx AVX2 vector Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 7/8] net/ice: support Rx scatter " Wenzhuo Lu
2019-03-26  6:16   ` [PATCH v7 8/8] net/ice: support vector AVX2 in TX Wenzhuo Lu
2019-03-26  9:50   ` [PATCH v7 0/8] Support vector instructions on ICE Ferruh Yigit
2019-03-31 15:52     ` Thomas Monjalon
2019-04-01  5:46       ` Lu, Wenzhuo
2019-04-01 12:51       ` Ferruh Yigit
2019-04-01 13:27         ` Thomas Monjalon
2019-04-01 15:12           ` Ferruh Yigit
2019-04-01 15:14             ` Thomas Monjalon
2019-04-02  1:01               ` Lu, Wenzhuo
2019-04-02  7:12                 ` Thomas Monjalon
2019-04-01 14:39         ` Bruce Richardson
2019-04-01 14:56           ` Ferruh Yigit
2019-04-01 15:09             ` Ferruh Yigit
2019-04-01 15:13             ` Thomas Monjalon
