* [PATCH 0/3] add ifcvf driver
@ 2018-03-09 23:08 Xiao Wang
  2018-03-09 23:08 ` [PATCH 1/3] eal/vfio: add support for multiple container Xiao Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-09 23:08 UTC (permalink / raw)
  To: dev
  Cc: zhihong.wang, maxime.coquelin, yliu, cunming.liang, rosen.xu,
	junjie.j.chen, dan.daly, Xiao Wang

This patch set depends on http://dpdk.org/dev/patchwork/patch/35635/
(vhost: support selective datapath).

The ifc VF is compatible with virtio vring operations; this driver implements
the vDPA driver ops that configure an ifc VF to be a vhost data path
accelerator.

The ifcvf driver uses a vdev as a control domain to manage the ifc VFs that
belong to it. It registers vDPA device ops with the vhost lib to enable these
VFs to be used as vhost data path accelerators.

Live migration is supported by the ifc VF, and this driver enables it
based on the vhost lib.

vDPA needs to create different containers for different devices, so this
patch set adds APIs to eal/vfio to support multiple containers.
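
As a rough illustration (not part of this series), a vDPA driver is expected
to use the new APIs along these lines; the helper names and the way the iommu
group number is obtained are hypothetical, and error handling is trimmed:

#include <rte_vfio.h>

/* hypothetical helper: give one device a dedicated VFIO container */
static int
setup_device_container(int iommu_group_no)
{
	int container_fd;

	/* allocate a new container, separate from the default one */
	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	/* attach the device's IOMMU group to this container */
	if (rte_vfio_bind_group_no(container_fd, iommu_group_no) != 0) {
		rte_vfio_destroy_container(container_fd);
		return -1;
	}

	return container_fd;
}

/* hypothetical helper: tear-down mirrors the setup */
static void
free_device_container(int container_fd, int iommu_group_no)
{
	rte_vfio_unbind_group_no(container_fd, iommu_group_no);
	rte_vfio_destroy_container(container_fd);
}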

Junjie Chen (1):
  eal/vfio: add support for multiple container

Xiao Wang (2):
  bus/pci: expose sysfs parsing API
  net/ifcvf: add ifcvf driver

 config/common_base                       |    6 +
 config/common_linuxapp                   |    1 +
 drivers/bus/pci/linux/pci.c              |    9 +-
 drivers/bus/pci/linux/pci_init.h         |    8 +
 drivers/bus/pci/rte_bus_pci_version.map  |    8 +
 drivers/net/Makefile                     |    1 +
 drivers/net/ifcvf/Makefile               |   40 +
 drivers/net/ifcvf/base/ifcvf.c           |  329 ++++++++
 drivers/net/ifcvf/base/ifcvf.h           |  156 ++++
 drivers/net/ifcvf/base/ifcvf_osdep.h     |   52 ++
 drivers/net/ifcvf/ifcvf_ethdev.c         | 1241 ++++++++++++++++++++++++++++++
 drivers/net/ifcvf/rte_ifcvf_version.map  |    4 +
 lib/librte_eal/bsdapp/eal/eal.c          |   51 +-
 lib/librte_eal/common/include/rte_vfio.h |  117 ++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   |  553 ++++++++++---
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |    2 +
 lib/librte_eal/rte_eal_version.map       |    7 +
 mk/rte.app.mk                            |    1 +
 18 files changed, 2480 insertions(+), 106 deletions(-)
 create mode 100644 drivers/net/ifcvf/Makefile
 create mode 100644 drivers/net/ifcvf/base/ifcvf.c
 create mode 100644 drivers/net/ifcvf/base/ifcvf.h
 create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
 create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 1/3] eal/vfio: add support for multiple container
  2018-03-09 23:08 [PATCH 0/3] add ifcvf driver Xiao Wang
@ 2018-03-09 23:08 ` Xiao Wang
  2018-03-14 12:08   ` Burakov, Anatoly
  2018-03-09 23:08 ` [PATCH 2/3] bus/pci: expose sysfs parsing API Xiao Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-03-09 23:08 UTC (permalink / raw)
  To: dev
  Cc: zhihong.wang, maxime.coquelin, yliu, cunming.liang, rosen.xu,
	junjie.j.chen, dan.daly, Xiao Wang

From: Junjie Chen <junjie.j.chen@intel.com>

Currently the EAL VFIO framework binds each VFIO group fd to the
default container fd. In some cases, e.g. vDPA (vhost data path
acceleration), we want to put a VFIO group into a new container and
program the DMA mappings via this new container, so this patch adds
APIs to support multiple containers.
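
As a minimal usage sketch (assuming a TYPE1 IOMMU and a container created
with the new API; the helper name and its parameters are hypothetical), DMA
programming against such a container can fill an rte_memseg by hand to
describe a guest memory region:

#include <string.h>
#include <stdint.h>
#include <rte_memory.h>
#include <rte_vfio.h>

/* hypothetical helper: map one guest region (HVA, GPA, len) into a container */
static int
container_map_guest_region(int container_fd, uint64_t hva, uint64_t gpa,
		uint64_t len)
{
	struct rte_memseg ms;

	memset(&ms, 0, sizeof(ms));
	ms.addr_64 = hva;	/* host virtual address of the region */
	ms.iova = gpa;		/* guest physical address used as IOVA */
	ms.len = len;

	return rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, &ms);
}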

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal.c          |  51 ++-
 lib/librte_eal/common/include/rte_vfio.h | 117 ++++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 553 ++++++++++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   2 +
 lib/librte_eal/rte_eal_version.map       |   7 +
 5 files changed, 629 insertions(+), 101 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..6cc321a70 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -38,6 +38,7 @@
 #include <rte_interrupts.h>
 #include <rte_bus.h>
 #include <rte_dev.h>
+#include <rte_vfio.h>
 #include <rte_devargs.h>
 #include <rte_version.h>
 #include <rte_atomic.h>
@@ -738,15 +739,6 @@ rte_eal_vfio_intr_mode(void)
 /* dummy forward declaration. */
 struct vfio_device_info;
 
-/* dummy prototypes. */
-int rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
-		int *vfio_dev_fd, struct vfio_device_info *device_info);
-int rte_vfio_release_device(const char *sysfs_base, const char *dev_addr, int fd);
-int rte_vfio_enable(const char *modname);
-int rte_vfio_is_enabled(const char *modname);
-int rte_vfio_noiommu_is_enabled(void);
-int rte_vfio_clear_group(int vfio_group_fd);
-
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
 		      __rte_unused int *vfio_dev_fd,
@@ -781,3 +773,44 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int rte_vfio_bind_group_no(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_unbind_group_no(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_get_group_fd(__rte_unused int iommu_group_no)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index e981a6228..3aad9cace 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -123,6 +123,121 @@ int rte_vfio_noiommu_is_enabled(void);
 int
 rte_vfio_clear_group(int vfio_group_fd);
 
-#endif /* VFIO_PRESENT */
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container
+ * @return
+ *    the container fd if successful
+ *    < 0 otherwise
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container and unbind all of its vfio groups.
+ * @param container_fd
+ *   the container fd to destroy
+ * @return
+ *    0 if successful
+ *   !0 otherwise
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind a group number to container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ * @return
+ *    group fd if successful
+ *    < 0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group_no(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind a group from specified container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group_no(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma mapping for a device in the specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *   the dma type for mapping
+ * @param ms
+ *   the dma address region to map
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma unmapping for a device in the specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *    the dma map type
+ * @param ms
+ *   the dma address region to unmap
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Get group fd via group number
+ * @param iommu_group_no
+ *  the group number
+ * @return
+ *     corresponding group fd if successful
+ *     -1 if failed
+ */
+int __rte_experimental
+rte_vfio_get_group_fd(int iommu_group_no);
 
+#endif /* VFIO_PRESENT */
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..939917da9 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,353 @@ vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+vfio_get_container(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+vfio_get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+static int
+vfio_find_container_idx(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return i;
+		}
+	}
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	vfio_cfg = vfio_cfgs[i];
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+
+	for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+		vfio_cfg->vfio_groups[j].group_no = -1;
+		vfio_cfg->vfio_groups[j].fd = -1;
+	}
+
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+	if (vfio_cfg->vfio_container_fd < 0) {
+		/* release the slot on failure to avoid leaking it */
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = vfio_get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	/* the default container cannot be destroyed */
+	if (!idx)
+		return 0;
+
+	vfio_cfg = vfio_cfgs[idx];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group_no(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group_no(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	/* Check room for new group */
+	if (cur_vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (cur_vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	cur_vfio_cfg->vfio_active_groups++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group_no(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	/* groups cannot be unbound from the default container */
+	if (!i)
+		return 0;
+
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (cur_vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (close(cur_grp->fd) < 0) {
+		RTE_LOG(INFO, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	cur_vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_find_container_idx(iommu_group_no);
+	vfio_cfg = vfio_cfgs[i];
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
 		if (i < 0)
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,9 +520,11 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
-	int ret;
+	int ret = 0;
+	int index;
 
 	/* get group number */
 	ret = vfio_get_group_no(sysfs_base, dev_addr, &iommu_group_no);
@@ -309,12 +570,14 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	index = vfio_find_container_idx(iommu_group_no);
+	vfio_container_fd = vfio_cfgs[index]->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +594,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfgs[index]->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +608,13 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +658,7 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +726,9 @@ rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +750,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +767,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +935,87 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd,
+	const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	struct vfio_iommu_type1_dma_map dma_map;
+	int ret;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	if (ms->addr == NULL) {
+		RTE_LOG(ERR, EAL, "invalid dma addr");
+		return -1;
+	}
 
-		if (ms[i].addr == NULL)
-			break;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
+	return 0;
+}
+
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd,
+	const struct rte_memseg *ms)
+{
+	int ret;
+	/* unmap the given memory segment from the container */
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
 			return -1;
-		}
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1159,59 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int
+rte_vfio_dma_map(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int
+rte_vfio_dma_unmap(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int rte_vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = vfio_cfgs[i];
+		if (!vfio_cfg)
+			continue;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg->vfio_groups[j].fd;
+		}
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..716fe4551 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -157,6 +157,8 @@ int vfio_mp_sync_setup(void);
 #define SOCKET_NO_FD 0x1
 #define SOCKET_ERR 0xFF
 
+#define VFIO_MAX_CONTAINERS 256
+
 #endif /* VFIO_PRESENT */
 
 #endif /* EAL_VFIO_H_ */
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d12360235..fc78a1581 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -254,5 +254,12 @@ EXPERIMENTAL {
 	rte_service_set_runstate_mapped_check;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_bind_group_no;
+	rte_vfio_unbind_group_no;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_get_group_fd;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 2/3] bus/pci: expose sysfs parsing API
  2018-03-09 23:08 [PATCH 0/3] add ifcvf driver Xiao Wang
  2018-03-09 23:08 ` [PATCH 1/3] eal/vfio: add support for multiple container Xiao Wang
@ 2018-03-09 23:08 ` Xiao Wang
  2018-03-14 11:19   ` Burakov, Anatoly
  2018-03-21 13:21   ` [PATCH v2 0/3] add ifcvf driver Xiao Wang
  2018-03-09 23:08 ` [PATCH 3/3] net/ifcvf: add ifcvf driver Xiao Wang
  2018-03-10 18:23 ` [PATCH 0/3] " Maxime Coquelin
  3 siblings, 2 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-09 23:08 UTC (permalink / raw)
  To: dev
  Cc: zhihong.wang, maxime.coquelin, yliu, cunming.liang, rosen.xu,
	junjie.j.chen, dan.daly, Xiao Wang

Some existing sysfs parsing functions are helpful for the vDPA driver
added later in this series, so this patch makes them global and exposes
them from the shared library.
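
For reference, below is a minimal sketch of how a consumer (such as the ifcvf
driver later in this series) can use the two exported functions; the helper
name and the sysfs directory handling are illustrative only:

#include <limits.h>
#include <stdio.h>
#include <string.h>

#include <rte_bus_pci.h>
#include <pci_init.h>	/* prototypes made global by this patch */

/* hypothetical check: parse BARs and verify the device is bound to vfio-pci */
static int
probe_sysfs_device(const char *dev_dir, struct rte_pci_device *dev)
{
	char filename[PATH_MAX];
	char driver[PATH_MAX];

	snprintf(filename, sizeof(filename), "%s/resource", dev_dir);
	if (rte_pci_parse_sysfs_resource(filename, dev) < 0)
		return -1;

	snprintf(filename, sizeof(filename), "%s/driver", dev_dir);
	if (pci_get_kernel_driver_by_path(filename, driver) != 0)
		return -1;

	return strcmp(driver, "vfio-pci") == 0 ? 0 : -1;
}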

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 drivers/bus/pci/linux/pci.c             | 9 ++++-----
 drivers/bus/pci/linux/pci_init.h        | 8 ++++++++
 drivers/bus/pci/rte_bus_pci_version.map | 8 ++++++++
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index abde64119..81e5e5650 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -32,7 +32,7 @@
 
 extern struct rte_pci_bus rte_pci_bus;
 
-static int
+int
 pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
 {
 	int count;
@@ -168,9 +168,8 @@ pci_parse_one_sysfs_resource(char *line, size_t len, uint64_t *phys_addr,
 	return 0;
 }
 
-/* parse the "resource" sysfs file */
-static int
-pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev)
+int
+rte_pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev)
 {
 	FILE *f;
 	char buf[BUFSIZ];
@@ -302,7 +301,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 
 	/* parse resources */
 	snprintf(filename, sizeof(filename), "%s/resource", dirname);
-	if (pci_parse_sysfs_resource(filename, dev) < 0) {
+	if (rte_pci_parse_sysfs_resource(filename, dev) < 0) {
 		RTE_LOG(ERR, EAL, "%s(): cannot parse resource\n", __func__);
 		free(dev);
 		return -1;
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index c2e603a37..e871c3942 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -83,6 +83,14 @@ int pci_vfio_unmap_resource(struct rte_pci_device *dev);
 
 int pci_vfio_is_enabled(void);
 
+/* parse sysfs file path */
+int
+pci_get_kernel_driver_by_path(const char *filename, char *dri_name);
+
+/* parse the "resource" sysfs file */
+int
+rte_pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev);
+
 #endif
 
 #endif /* EAL_PCI_INIT_H_ */
diff --git a/drivers/bus/pci/rte_bus_pci_version.map b/drivers/bus/pci/rte_bus_pci_version.map
index 27e9c4f10..dff2b52e8 100644
--- a/drivers/bus/pci/rte_bus_pci_version.map
+++ b/drivers/bus/pci/rte_bus_pci_version.map
@@ -16,3 +16,11 @@ DPDK_17.11 {
 
 	local: *;
 };
+
+DPDK_18.05 {
+	global:
+
+	pci_get_kernel_driver_by_path;
+	rte_pci_parse_sysfs_resource;
+
+} DPDK_17.11;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 3/3] net/ifcvf: add ifcvf driver
  2018-03-09 23:08 [PATCH 0/3] add ifcvf driver Xiao Wang
  2018-03-09 23:08 ` [PATCH 1/3] eal/vfio: add support for multiple container Xiao Wang
  2018-03-09 23:08 ` [PATCH 2/3] bus/pci: expose sysfs parsing API Xiao Wang
@ 2018-03-09 23:08 ` Xiao Wang
  2018-03-10 18:23 ` [PATCH 0/3] " Maxime Coquelin
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-09 23:08 UTC (permalink / raw)
  To: dev
  Cc: zhihong.wang, maxime.coquelin, yliu, cunming.liang, rosen.xu,
	junjie.j.chen, dan.daly, Xiao Wang

The ifcvf driver uses a vdev as a control domain to manage the ifc VFs that
belong to it. It registers vDPA device ops with the vhost lib to enable these
VFs to be used as vhost data path accelerators.

Live migration is supported by the ifc VF, and this driver enables it
based on the vhost lib.

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is supported currently.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
---
 config/common_base                      |    6 +
 config/common_linuxapp                  |    1 +
 drivers/net/Makefile                    |    1 +
 drivers/net/ifcvf/Makefile              |   40 +
 drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
 drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
 drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
 drivers/net/ifcvf/ifcvf_ethdev.c        | 1241 +++++++++++++++++++++++++++++++
 drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
 mk/rte.app.mk                           |    1 +
 10 files changed, 1831 insertions(+)
 create mode 100644 drivers/net/ifcvf/Makefile
 create mode 100644 drivers/net/ifcvf/base/ifcvf.c
 create mode 100644 drivers/net/ifcvf/base/ifcvf.h
 create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
 create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index ad03cf433..06fce1ebf 100644
--- a/config/common_base
+++ b/config/common_base
@@ -791,6 +791,12 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..358d00468 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..496acf2d2 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -53,6 +53,7 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
diff --git a/drivers/net/ifcvf/Makefile b/drivers/net/ifcvf/Makefile
new file mode 100644
index 000000000..f3670cdf2
--- /dev/null
+++ b/drivers/net/ifcvf/Makefile
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_mempool -lrte_pci
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost
+LDLIBS += -lrte_bus_vdev -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/linux
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+$(foreach obj, $(BASE_DRIVER_OBJS), $(eval CFLAGS_$(obj)+=$(CFLAGS_BASE_DRIVER)))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf_ethdev.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifcvf/base/ifcvf.c b/drivers/net/ifcvf/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifcvf/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->notify_base,
+			hw->isr, hw->dev_cfg,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)(ring_state >> 16);
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifcvf/base/ifcvf.h b/drivers/net/ifcvf/base/ifcvf.h
new file mode 100644
index 000000000..4a3a94c8c
--- /dev/null
+++ b/drivers/net/ifcvf/base/ifcvf.h
@@ -0,0 +1,156 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_MAX_QUEUES		1
+#define IFCVF_MAX_DEVICES		64
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifcvf/base/ifcvf_osdep.h b/drivers/net/ifcvf/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifcvf/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifcvf/ifcvf_ethdev.c b/drivers/net/ifcvf/ifcvf_ethdev.c
new file mode 100644
index 000000000..a924f7e0b
--- /dev/null
+++ b/drivers/net/ifcvf/ifcvf_ethdev.c
@@ -0,0 +1,1241 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+#include <sys/mman.h>
+
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_ethdev_vdev.h>
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_bus_vdev.h>
+#include <rte_bus_pci.h>
+#include <rte_kvargs.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <eal_vfio.h>
+#include <pci_init.h>
+
+#include "base/ifcvf.h"
+
+#define ETH_IFCVF_BDF_ARG	"bdf"
+#define ETH_IFCVF_DEVICES_ARG	"int"
+
+static const char *const valid_arguments[] = {
+	ETH_IFCVF_BDF_ARG,
+	ETH_IFCVF_DEVICES_ARG,
+	NULL
+};
+
+static struct ether_addr base_eth_addr = {
+	.addr_bytes = {
+		0x56 /* V */,
+		0x44 /* D */,
+		0x50 /* P */,
+		0x41 /* A */,
+		0x00,
+		0x00
+	}
+};
+
+struct ifcvf_info {
+	struct ifcvf_hw hw;
+	struct rte_pci_device pdev;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct ifcvf_internal {
+	char *dev_name;
+	uint16_t max_queues;
+	uint16_t max_devices;
+	uint64_t features;
+	struct rte_vdpa_eng_addr eng_addr;
+	int eid;
+	struct ifcvf_info vf_info[IFCVF_MAX_DEVICES];
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct rte_eth_dev *eth_dev;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct rte_eth_link vdpa_link = {
+		.link_speed = 10000,
+		.link_duplex = ETH_LINK_FULL_DUPLEX,
+		.link_status = ETH_LINK_DOWN
+};
+
+static struct internal_list *
+find_internal_resource_by_eid(int eid)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (eid == internal->eid) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_eng_addr(struct rte_vdpa_eng_addr *addr)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (addr == &internal->eng_addr) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+check_pci_dev(struct rte_pci_device *dev)
+{
+	char filename[PATH_MAX];
+	char dev_dir[PATH_MAX];
+	char driver[PATH_MAX];
+	int ret;
+
+	snprintf(dev_dir, sizeof(dev_dir), "%s/" PCI_PRI_FMT,
+			rte_pci_get_sysfs_path(),
+			dev->addr.domain, dev->addr.bus,
+			dev->addr.devid, dev->addr.function);
+	if (access(dev_dir, R_OK) != 0) {
+		RTE_LOG(ERR, PMD, "%s does not exist\n", dev_dir);
+		return -1;
+	}
+
+	/* parse resources */
+	snprintf(filename, sizeof(filename), "%s/resource", dev_dir);
+	if (rte_pci_parse_sysfs_resource(filename, dev) < 0) {
+		RTE_LOG(ERR, PMD, "cannot parse resource: %s\n", filename);
+		return -1;
+	}
+
+	/* parse driver */
+	snprintf(filename, sizeof(filename), "%s/driver", dev_dir);
+	ret = pci_get_kernel_driver_by_path(filename, driver);
+	if (ret != 0) {
+		RTE_LOG(ERR, PMD, "Fail to get kernel driver: %s\n", filename);
+		return -1;
+	}
+
+	if (strcmp(driver, "vfio-pci") != 0) {
+		RTE_LOG(ERR, PMD, "kernel driver %s is not vfio-pci\n", driver);
+		return -1;
+	}
+	dev->kdrv = RTE_KDRV_VFIO;
+	return 0;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_info *vf_info)
+{
+	struct rte_pci_device *dev = &vf_info->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	vf_info->vfio_container_fd = rte_vfio_create_container();
+	if (vf_info->vfio_container_fd < 0)
+		return -1;
+
+	ret = rte_vfio_bind_group_no(vf_info->vfio_container_fd,
+			iommu_group_no);
+	if (ret)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	vf_info->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+	vf_info->vfio_group_fd = rte_vfio_get_group_fd(iommu_group_no);
+	if (vf_info->vfio_group_fd < 0)
+		goto err;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		vf_info->hw.mem_resource[i].addr =
+			vf_info->pdev.mem_resource[i].addr;
+		vf_info->hw.mem_resource[i].phys_addr =
+			vf_info->pdev.mem_resource[i].phys_addr;
+		vf_info->hw.mem_resource[i].len =
+			vf_info->pdev.mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&vf_info->hw, &vf_info->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_destroy_container(vf_info->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_info *vf_info)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(vf_info->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = vf_info->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		RTE_LOG(INFO, PMD, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx\n", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_map(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_info *vf_info)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(vf_info->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = vf_info->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_unmap(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_info *vf_info)
+{
+	struct ifcvf_hw *hw = &vf_info->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = vf_info->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&vf_info->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_info *vf_info)
+{
+	struct ifcvf_hw *hw = &vf_info->hw;
+	int i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint8_t *log_buf;
+
+	vid = vf_info->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(vf_info->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for packet buffers only,
+		 * so SW marks the whole used ring as dirty here.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			pfn = hw->vring[i].used / 4096;
+			for (j = 0; j <= hw->vring[i].size * 8 / 4096; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_info *vf_info)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(vf_info->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = vf_info->pdev.intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vf_info->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(vf_info->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error enabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_info *vf_info)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(vf_info->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error disabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_info *vf_info = (struct ifcvf_info *)arg;
+	struct ifcvf_hw *hw = &vf_info->hw;
+
+	q_num = rte_vhost_get_vring_num(vf_info->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		RTE_LOG(ERR, PMD, "failed to create epoll instance\n");
+		return NULL;
+	}
+	vf_info->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(vf_info->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			RTE_LOG(ERR, PMD, "epoll add error, %s\n",
+					strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			RTE_LOG(ERR, PMD, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					RTE_LOG(INFO, PMD, "Error reading "
+						"kickfd: %s\n",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_info *vf_info)
+{
+	int ret;
+
+	ret = pthread_create(&vf_info->tid, NULL, notify_relay,
+			(void *)vf_info);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "failed to create notify relay pthread\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_info *vf_info)
+{
+	void *status;
+
+	if (vf_info->tid) {
+		pthread_cancel(vf_info->tid);
+		pthread_join(vf_info->tid, &status);
+	}
+	vf_info->tid = 0;
+
+	if (vf_info->epfd >= 0)
+		close(vf_info->epfd);
+	vf_info->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_info *vf_info)
+{
+	int ret;
+
+	rte_spinlock_lock(&vf_info->lock);
+
+	if (!rte_atomic32_read(&vf_info->running) &&
+	    (rte_atomic32_read(&vf_info->started) &&
+	     rte_atomic32_read(&vf_info->dev_attached))) {
+		ret = ifcvf_dma_map(vf_info);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(vf_info);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(vf_info);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(vf_info);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&vf_info->running, 1);
+	} else if (rte_atomic32_read(&vf_info->running) &&
+		   (!rte_atomic32_read(&vf_info->started) ||
+		    !rte_atomic32_read(&vf_info->dev_attached))) {
+		vdpa_ifcvf_stop(vf_info);
+
+		ret = unset_notify_relay(vf_info);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(vf_info);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(vf_info);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&vf_info->running, 0);
+	}
+
+	rte_spinlock_unlock(&vf_info->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&vf_info->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+	vf_info->vid = vid;
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_UP;
+
+	rte_atomic32_set(&vf_info->dev_attached, 1);
+	update_datapath(vf_info);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_DOWN;
+
+	rte_atomic32_set(&vf_info->dev_attached, 0);
+	update_datapath(vf_info);
+	vf_info->vid = -1;
+
+	return 0;
+}
+
+static int
+ifcvf_feature_set(int vid)
+{
+	uint64_t features;
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	uint64_t log_base, log_size;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+
+	rte_vhost_get_negotiated_features(vf_info->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(vf_info->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&vf_info->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+	return vf_info->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+	return vf_info->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+
+	reg.index = ifcvf_get_notify_region(&vf_info->hw);
+	ret = ioctl(vf_info->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Get not get device region info: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&vf_info->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+vdpa_eng_init(int eid, struct rte_vdpa_eng_addr *addr)
+{
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	uint64_t features;
+	int i;
+
+	list = find_internal_resource_by_eng_addr(addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		vf_info->vfio_dev_fd = -1;
+		vf_info->vfio_group_fd = -1;
+		vf_info->vfio_container_fd = -1;
+
+		if (check_pci_dev(&vf_info->pdev) < 0)
+			return -1;
+
+		if (ifcvf_vfio_setup(vf_info) < 0)
+			return -1;
+	}
+
+	internal->eid = eid;
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->vf_info[0].hw);
+	internal->features = (features & ~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << RTE_VHOST_USER_F_PROTOCOL_FEATURES);
+
+	return 0;
+}
+
+static int
+vdpa_eng_uninit(int eid)
+{
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	int i;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		rte_pci_unmap_device(&vf_info->pdev);
+		rte_vfio_destroy_container(vf_info->vfio_container_fd);
+	}
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << RTE_VHOST_USER_PROTOCOL_F_REPLY_ACK)
+static int
+vdpa_info_query(int eid, struct rte_vdpa_eng_attr *attr)
+{
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	attr->dev_num = internal->max_devices;
+	attr->queue_num = internal->max_queues;
+	attr->features = internal->features;
+	attr->protocol_features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+
+	return 0;
+}
+
+struct rte_vdpa_eng_driver vdpa_ifcvf_driver = {
+	.name = "ifcvf",
+	.eng_ops = {
+		.eng_init = vdpa_eng_init,
+		.eng_uninit = vdpa_eng_uninit,
+		.info_query = vdpa_info_query,
+	},
+	.dev_ops = {
+		.dev_conf = ifcvf_dev_config,
+		.dev_close = ifcvf_dev_close,
+		.vring_state_set = NULL,
+		.feature_set = ifcvf_feature_set,
+		.migration_done = NULL,
+		.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+		.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+		.get_notify_area = ifcvf_get_notify_area,
+	},
+};
+
+RTE_VDPA_REGISTER_DRIVER(ifcvf, vdpa_ifcvf_driver);
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	int i;
+
+	internal = dev->data->dev_private;
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		rte_atomic32_set(&vf_info->started, 1);
+		update_datapath(vf_info);
+	}
+
+	return 0;
+}
+
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	int i;
+
+	internal = dev->data->dev_private;
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		rte_atomic32_set(&vf_info->started, 0);
+		update_datapath(vf_info);
+	}
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	internal = dev->data->dev_private;
+	eth_dev_stop(dev);
+
+	list = find_internal_resource_by_eng_addr(&internal->eng_addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return;
+	}
+
+	rte_vdpa_unregister_engine(internal->eid);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+	rte_free(list);
+
+	rte_free(dev->data->mac_addrs);
+	free(internal->dev_name);
+	rte_free(internal);
+
+	dev->data->dev_private = NULL;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct ifcvf_internal *internal;
+
+	internal = dev->data->dev_private;
+	if (internal == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid device specified\n");
+		return;
+	}
+
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)-1;
+	dev_info->max_rx_queues = internal->max_queues;
+	dev_info->max_tx_queues = internal->max_queues;
+	dev_info->min_rx_bufsize = 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t rx_queue_id __rte_unused,
+		   uint16_t nb_rx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t tx_queue_id __rte_unused,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static uint16_t
+eth_ifcvf_rx(void *q __rte_unused, struct rte_mbuf **bufs __rte_unused,
+		uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static uint16_t
+eth_ifcvf_tx(void *q __rte_unused, struct rte_mbuf **bufs __rte_unused,
+		uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+};
+
+static int
+eth_dev_ifcvf_create(struct rte_vdev_device *dev,
+		struct rte_pci_addr *pci_addr, int devices)
+{
+	const char *name = rte_vdev_device_name(dev);
+	struct rte_eth_dev *eth_dev = NULL;
+	struct ether_addr *eth_addr = NULL;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+	struct rte_eth_dev_data *data = NULL;
+	struct rte_pci_addr pf_addr = *pci_addr;
+	int i;
+
+	list = rte_zmalloc_socket(name, sizeof(*list), 0,
+			dev->device.numa_node);
+	if (list == NULL)
+		goto error;
+
+	/* reserve an ethdev entry */
+	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
+	if (eth_dev == NULL)
+		goto error;
+
+	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
+			dev->device.numa_node);
+	if (eth_addr == NULL)
+		goto error;
+
+	*eth_addr = base_eth_addr;
+	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
+
+	internal = eth_dev->data->dev_private;
+	internal->dev_name = strdup(name);
+	if (internal->dev_name == NULL)
+		goto error;
+
+	internal->eng_addr.pci_addr = *pci_addr;
+	for (i = 0; i < devices; i++) {
+		pf_addr.domain = pci_addr->domain;
+		pf_addr.bus = pci_addr->bus;
+		pf_addr.devid = pci_addr->devid + (i + 1) / 8;
+		pf_addr.function = pci_addr->function + (i + 1) % 8;
+		internal->vf_info[i].pdev.addr = pf_addr;
+		rte_spinlock_init(&internal->vf_info[i].lock);
+	}
+	internal->max_devices = devices;
+
+	list->eth_dev = eth_dev;
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	data = eth_dev->data;
+	data->nb_rx_queues = IFCVF_MAX_QUEUES;
+	data->nb_tx_queues = IFCVF_MAX_QUEUES;
+	data->dev_link = vdpa_link;
+	data->mac_addrs = eth_addr;
+	data->dev_flags = RTE_ETH_DEV_INTR_LSC;
+	eth_dev->dev_ops = &ops;
+
+	/* assign rx and tx ops, could be used as vDPA fallback */
+	eth_dev->rx_pkt_burst = eth_ifcvf_rx;
+	eth_dev->tx_pkt_burst = eth_ifcvf_tx;
+
+	if (rte_vdpa_register_engine(vdpa_ifcvf_driver.name,
+				&internal->eng_addr) < 0)
+		goto error;
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(eth_addr);
+	if (internal && internal->dev_name)
+		free(internal->dev_name);
+	rte_free(internal);
+	if (eth_dev)
+		rte_eth_dev_release_port(eth_dev);
+
+	return -1;
+}
+
+static int
+get_pci_addr(const char *key __rte_unused, const char *value, void *extra_args)
+{
+	if (value == NULL || extra_args == NULL)
+		return -1;
+
+	return rte_pci_addr_parse(value, extra_args);
+}
+
+static inline int
+open_int(const char *key __rte_unused, const char *value, void *extra_args)
+{
+	uint16_t *n = extra_args;
+
+	if (value == NULL || extra_args == NULL)
+		return -EINVAL;
+
+	*n = (uint16_t)strtoul(value, NULL, 0);
+	if (*n == USHRT_MAX && errno == ERANGE)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * If this vdev is created by the user, then the ifc VFs will be
+ * taken over by this vdev.
+ */
+static int
+ifcvf_take_over(struct rte_pci_addr *pci_addr, int num)
+{
+	uint16_t port_id;
+	int i, ret;
+	char devname[RTE_DEV_NAME_MAX_LEN];
+	struct rte_pci_addr vf_addr = *pci_addr;
+
+	for (i = 0; i < num; i++) {
+		vf_addr.function += i % 8;
+		vf_addr.devid += i / 8;
+		rte_pci_device_name(&vf_addr, devname, RTE_DEV_NAME_MAX_LEN);
+		ret = rte_eth_dev_get_port_by_name(devname, &port_id);
+		if (ret == 0) {
+			rte_eth_dev_close(port_id);
+			if (rte_eth_dev_detach(port_id, devname) < 0)
+				return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int
+rte_ifcvf_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist = NULL;
+	int ret = 0;
+	struct rte_pci_addr pci_addr;
+	uint16_t devices;
+
+	RTE_LOG(INFO, PMD, "Initializing ifcvf for %s\n",
+			rte_vdev_device_name(dev));
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	if (rte_kvargs_count(kvlist, ETH_IFCVF_BDF_ARG) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_IFCVF_BDF_ARG,
+				&get_pci_addr, &pci_addr);
+		if (ret < 0)
+			goto out_free;
+
+	} else {
+		ret = -1;
+		goto out_free;
+	}
+
+	if (rte_kvargs_count(kvlist, ETH_IFCVF_DEVICES_ARG) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_IFCVF_DEVICES_ARG,
+				&open_int, &devices);
+		if (ret < 0 || devices > IFCVF_MAX_DEVICES)
+			goto out_free;
+	} else {
+		devices = 1;
+	}
+
+	ret = ifcvf_take_over(&pci_addr, devices);
+	if (ret < 0)
+		goto out_free;
+
+	eth_dev_ifcvf_create(dev, &pci_addr, devices);
+
+out_free:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
+static int
+rte_ifcvf_remove(struct rte_vdev_device *dev)
+{
+	const char *name;
+	struct rte_eth_dev *eth_dev = NULL;
+
+	name = rte_vdev_device_name(dev);
+	RTE_LOG(INFO, PMD, "Un-Initializing ifcvf for %s\n", name);
+
+	/* find an ethdev entry */
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	eth_dev_close(eth_dev);
+	rte_free(eth_dev->data);
+	rte_eth_dev_release_port(eth_dev);
+
+	return 0;
+}
+
+static struct rte_vdev_driver ifcvf_drv = {
+	.probe = rte_ifcvf_probe,
+	.remove = rte_ifcvf_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_ifcvf, ifcvf_drv);
+RTE_PMD_REGISTER_ALIAS(net_ifcvf, eth_ifcvf);
+RTE_PMD_REGISTER_PARAM_STRING(net_ifcvf,
+	"bdf=<bdf> "
+	"devices=<int>");
diff --git a/drivers/net/ifcvf/rte_ifcvf_version.map b/drivers/net/ifcvf/rte_ifcvf_version.map
new file mode 100644
index 000000000..33d237913
--- /dev/null
+++ b/drivers/net/ifcvf/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+EXPERIMENTAL {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..be5f765e4 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -171,6 +171,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF)          += -lrte_ifcvf
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-09 23:08 [PATCH 0/3] add ifcvf driver Xiao Wang
                   ` (2 preceding siblings ...)
  2018-03-09 23:08 ` [PATCH 3/3] net/ifcvf: add ifcvf driver Xiao Wang
@ 2018-03-10 18:23 ` Maxime Coquelin
  2018-03-15 16:49   ` Wang, Xiao W
  3 siblings, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-10 18:23 UTC (permalink / raw)
  To: Xiao Wang, dev
  Cc: zhihong.wang, yliu, cunming.liang, rosen.xu, junjie.j.chen, dan.daly

Hi Xiao,

On 03/10/2018 12:08 AM, Xiao Wang wrote:
> This patch set has dependency on http://dpdk.org/dev/patchwork/patch/35635/
> (vhost: support selective datapath);
> 
> ifc VF is compatible with virtio vring operations, this driver implements
> vDPA driver ops which configures ifc VF to be a vhost data path accelerator.
> 
> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> used as vhost data path accelerator.
> 
> Live migration feature is supported by ifc VF and this driver enables
> it based on vhost lib.
> 
> vDPA needs to create different containers for different devices, thus this
> patch set adds APIs in eal/vfio to support multiple container.
Thanks for this! That will avoid having to duplicate these functions
for every new offload driver.


> 
> Junjie Chen (1):
>    eal/vfio: add support for multiple container
> 
> Xiao Wang (2):
>    bus/pci: expose sysfs parsing API

Still, I'm not convinced the offload device should be a virtual device.
It is a real PCI device, so why not have a new device type for offload
devices and let the device be probed automatically by the existing
device model?

Thanks,
Maxime


>    net/ifcvf: add ifcvf driver
> 
>   config/common_base                       |    6 +
>   config/common_linuxapp                   |    1 +
>   drivers/bus/pci/linux/pci.c              |    9 +-
>   drivers/bus/pci/linux/pci_init.h         |    8 +
>   drivers/bus/pci/rte_bus_pci_version.map  |    8 +
>   drivers/net/Makefile                     |    1 +
>   drivers/net/ifcvf/Makefile               |   40 +
>   drivers/net/ifcvf/base/ifcvf.c           |  329 ++++++++
>   drivers/net/ifcvf/base/ifcvf.h           |  156 ++++
>   drivers/net/ifcvf/base/ifcvf_osdep.h     |   52 ++
>   drivers/net/ifcvf/ifcvf_ethdev.c         | 1241 ++++++++++++++++++++++++++++++
>   drivers/net/ifcvf/rte_ifcvf_version.map  |    4 +
>   lib/librte_eal/bsdapp/eal/eal.c          |   51 +-
>   lib/librte_eal/common/include/rte_vfio.h |  117 ++-
>   lib/librte_eal/linuxapp/eal/eal_vfio.c   |  553 ++++++++++---
>   lib/librte_eal/linuxapp/eal/eal_vfio.h   |    2 +
>   lib/librte_eal/rte_eal_version.map       |    7 +
>   mk/rte.app.mk                            |    1 +
>   18 files changed, 2480 insertions(+), 106 deletions(-)
>   create mode 100644 drivers/net/ifcvf/Makefile
>   create mode 100644 drivers/net/ifcvf/base/ifcvf.c
>   create mode 100644 drivers/net/ifcvf/base/ifcvf.h
>   create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
>   create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
>   create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/3] bus/pci: expose sysfs parsing API
  2018-03-09 23:08 ` [PATCH 2/3] bus/pci: expose sysfs parsing API Xiao Wang
@ 2018-03-14 11:19   ` Burakov, Anatoly
  2018-03-14 13:30     ` Gaëtan Rivet
  2018-03-21 13:21   ` [PATCH v2 0/3] add ifcvf driver Xiao Wang
  1 sibling, 1 reply; 98+ messages in thread
From: Burakov, Anatoly @ 2018-03-14 11:19 UTC (permalink / raw)
  To: Xiao Wang, dev
  Cc: zhihong.wang, maxime.coquelin, yliu, cunming.liang, rosen.xu,
	junjie.j.chen, dan.daly

On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> Some existing sysfs parsing functions are helpful for the later vDPA
> driver, this patch make them global and expose them to shared lib.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>   drivers/bus/pci/linux/pci.c             | 9 ++++-----
>   drivers/bus/pci/linux/pci_init.h        | 8 ++++++++
>   drivers/bus/pci/rte_bus_pci_version.map | 8 ++++++++
>   3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> index abde64119..81e5e5650 100644
> --- a/drivers/bus/pci/linux/pci.c
> +++ b/drivers/bus/pci/linux/pci.c
> @@ -32,7 +32,7 @@
>   
>   extern struct rte_pci_bus rte_pci_bus;
>   
> -static int
> +int
>   pci_get_kernel_driver_by_path(const char *filename, char *dri_name)

Here and in other places - shouldn't this too be prefixed with rte_?


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 1/3] eal/vfio: add support for multiple container
  2018-03-09 23:08 ` [PATCH 1/3] eal/vfio: add support for multiple container Xiao Wang
@ 2018-03-14 12:08   ` Burakov, Anatoly
  2018-03-15 16:49     ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Burakov, Anatoly @ 2018-03-14 12:08 UTC (permalink / raw)
  To: Xiao Wang, dev
  Cc: zhihong.wang, maxime.coquelin, yliu, cunming.liang, rosen.xu,
	junjie.j.chen, dan.daly

On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> From: Junjie Chen <junjie.j.chen@intel.com>
> 
> Currently eal vfio framework binds vfio group fd to the default
> container fd, while in some cases, e.g. vDPA (vhost data path
> acceleration), we want to set vfio group to a new container and
> program DMA mapping via this new container, so this patch adds
> APIs to support multiple container.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---

I'm not going to get into the virtual vs. real device debate, but I do have
some issues with the VFIO side of things.

I'm not completely convinced this change is needed in the first place. 
If the device driver manages its own groups anyway, it knows which VFIO 
groups belong to it, so it can add/remove them without putting them into 
separate containers. What is the purpose of keeping them in a separate 
container as opposed to just keeping track of group id's?

<...>


> +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> +
> +	if (vfio_cfg->vfio_container_fd < 0)
> +		return -1;
> +
> +	return vfio_cfg->vfio_container_fd;
> +}

Please correct me if I'm wrong, but this patch appears to be mistitled.
You're not really creating multiple containers, you're just partitioning
the existing one. Do we really need to open/store/close container fd's
separately, if all we have is a single container anyway?

The semantics of this are also weird in multiprocess. When secondary 
process requests a container, we always create a new one, send it over 
IPC and close it afterwards. It seems to be oblivious that you may have 
several container fd's, and does not know which one you are asking for. 
We know it's all the same container, but that's clearly not what the 
code appears to be doing.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/3] bus/pci: expose sysfs parsing API
  2018-03-14 11:19   ` Burakov, Anatoly
@ 2018-03-14 13:30     ` Gaëtan Rivet
  2018-03-15 16:49       ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Gaëtan Rivet @ 2018-03-14 13:30 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Xiao Wang, dev, zhihong.wang, maxime.coquelin, yliu,
	cunming.liang, rosen.xu, junjie.j.chen, dan.daly

Hi,

On Wed, Mar 14, 2018 at 11:19:31AM +0000, Burakov, Anatoly wrote:
> On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> > Some existing sysfs parsing functions are helpful for the later vDPA
> > driver, this patch make them global and expose them to shared lib.
> > 
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > ---
> >   drivers/bus/pci/linux/pci.c             | 9 ++++-----
> >   drivers/bus/pci/linux/pci_init.h        | 8 ++++++++
> >   drivers/bus/pci/rte_bus_pci_version.map | 8 ++++++++
> >   3 files changed, 20 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> > index abde64119..81e5e5650 100644
> > --- a/drivers/bus/pci/linux/pci.c
> > +++ b/drivers/bus/pci/linux/pci.c
> > @@ -32,7 +32,7 @@
> >   extern struct rte_pci_bus rte_pci_bus;
> > -static int
> > +int
> >   pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
> 
> Here and in other places - shouldn't this too be prefixed with rte_?
> 

A public PCI function should be prefixed by rte_pci_ yes.

Additionally, if this function was to be exposed, then there should be a
BSD implementation as well (shared map file).

I don't know how BSD works, I'm not sure parsing the filesystem is the
way to get a PCI driver name. If so, maybe the function should be called
another, generic, way, that would work for both linux and BSD (and
ideally, having a real BSD implementation).

> 
> -- 
> Thanks,
> Anatoly

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-10 18:23 ` [PATCH 0/3] " Maxime Coquelin
@ 2018-03-15 16:49   ` Wang, Xiao W
  2018-03-21 20:47     ` Maxime Coquelin
  0 siblings, 1 reply; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-15 16:49 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Sunday, March 11, 2018 2:24 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [PATCH 0/3] add ifcvf driver
> 
> Hi Xiao,
> 
> On 03/10/2018 12:08 AM, Xiao Wang wrote:
> > This patch set has dependency on
> http://dpdk.org/dev/patchwork/patch/35635/
> > (vhost: support selective datapath);
> >
> > ifc VF is compatible with virtio vring operations, this driver implements
> > vDPA driver ops which configures ifc VF to be a vhost data path accelerator.
> >
> > ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> > to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> > used as vhost data path accelerator.
> >
> > Live migration feature is supported by ifc VF and this driver enables
> > it based on vhost lib.
> >
> > vDPA needs to create different containers for different devices, thus this
> > patch set adds APIs in eal/vfio to support multiple container.
> Thanks for this! That will avoid having to duplicate these functions
> for every new offload driver.
> 
> 
> >
> > Junjie Chen (1):
> >    eal/vfio: add support for multiple container
> >
> > Xiao Wang (2):
> >    bus/pci: expose sysfs parsing API
> 
> Still, I'm not convinced the offload device should be a virtual device.
> It is a real PCI device, so why not have a new device type for offload
> devices and let the device be probed automatically by the existing
> device model?

IFC VFs are generated from SR-IOV, with the PF driven by a kernel driver.
In DPDK we need something to represent the PF and register itself as
a vDPA engine, so a virtual device is used for this purpose.

The VFs are used for vhost net offload, and we could implement exception-traffic
Rx/Tx functions on the VFs in the future via the port-representor mechanism. So
this patch keeps the device type as net.
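
Just to illustrate how the control domain is exposed to the user (the BDF and
device count below are only placeholders for this example), the vdev is created
with the bdf/devices parameters the driver registers, along the lines of:

    --vdev 'net_ifcvf,bdf=86:00.0,devices=2'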

BRs,
Xiao

> 
> Thanks,
> Maxime
> 
> 
> >    net/ifcvf: add ifcvf driver
> >
> >   config/common_base                       |    6 +
> >   config/common_linuxapp                   |    1 +
> >   drivers/bus/pci/linux/pci.c              |    9 +-
> >   drivers/bus/pci/linux/pci_init.h         |    8 +
> >   drivers/bus/pci/rte_bus_pci_version.map  |    8 +
> >   drivers/net/Makefile                     |    1 +
> >   drivers/net/ifcvf/Makefile               |   40 +
> >   drivers/net/ifcvf/base/ifcvf.c           |  329 ++++++++
> >   drivers/net/ifcvf/base/ifcvf.h           |  156 ++++
> >   drivers/net/ifcvf/base/ifcvf_osdep.h     |   52 ++
> >   drivers/net/ifcvf/ifcvf_ethdev.c         | 1241
> ++++++++++++++++++++++++++++++
> >   drivers/net/ifcvf/rte_ifcvf_version.map  |    4 +
> >   lib/librte_eal/bsdapp/eal/eal.c          |   51 +-
> >   lib/librte_eal/common/include/rte_vfio.h |  117 ++-
> >   lib/librte_eal/linuxapp/eal/eal_vfio.c   |  553 ++++++++++---
> >   lib/librte_eal/linuxapp/eal/eal_vfio.h   |    2 +
> >   lib/librte_eal/rte_eal_version.map       |    7 +
> >   mk/rte.app.mk                            |    1 +
> >   18 files changed, 2480 insertions(+), 106 deletions(-)
> >   create mode 100644 drivers/net/ifcvf/Makefile
> >   create mode 100644 drivers/net/ifcvf/base/ifcvf.c
> >   create mode 100644 drivers/net/ifcvf/base/ifcvf.h
> >   create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
> >   create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
> >   create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
> >

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 1/3] eal/vfio: add support for multiple container
  2018-03-14 12:08   ` Burakov, Anatoly
@ 2018-03-15 16:49     ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-15 16:49 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Wang, Zhihong, maxime.coquelin, yliu, Liang, Cunming, Xu, Rosen,
	Chen, Junjie J, Daly, Dan

Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Wednesday, March 14, 2018 8:08 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Wang, Zhihong <zhihong.wang@intel.com>;
> maxime.coquelin@redhat.com; yliu@fridaylinux.org; Liang, Cunming
> <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen, Junjie J
> <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/3] eal/vfio: add support for multiple
> container
> 
> On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> > From: Junjie Chen <junjie.j.chen@intel.com>
> >
> > Currently eal vfio framework binds vfio group fd to the default
> > container fd, while in some cases, e.g. vDPA (vhost data path
> > acceleration), we want to set vfio group to a new container and
> > program DMA mapping via this new container, so this patch adds
> > APIs to support multiple container.
> >
> > Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > ---
> 
> I'm not going to get into the virtual vs. real device debate, but I do have
> some issues with the VFIO side of things.
> 
> I'm not completely convinced this change is needed in the first place.
> If the device driver manages its own groups anyway, it knows which VFIO
> groups belong to it, so it can add/remove them without putting them into
> separate containers. What is the purpose of keeping them in a separate
> container as opposed to just keeping track of group id's?

The device driver needs a separate container to program the IOMMU for the
device with the VM's address translation table. So the driver needs the
devices to be put into new containers, rather than the default one.
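
To make that concrete, this is roughly what the driver does per device with the
new APIs (a trimmed sketch, not the exact patch code; setup_dev_container is
just an illustrative name):

#include <rte_vfio.h>
#include <rte_bus_pci.h>

/* give one device its own container and bind its IOMMU group to it */
static int
setup_dev_container(struct rte_pci_device *dev, int iommu_group_no)
{
	int cfd = rte_vfio_create_container();

	if (cfd < 0)
		return -1;

	/* bind the group to the new container instead of the default one */
	if (rte_vfio_bind_group_no(cfd, iommu_group_no) < 0 ||
			rte_pci_map_device(dev) != 0) {
		rte_vfio_destroy_container(cfd);
		return -1;
	}
	return cfd;
}

ifcvf_vfio_setup() in patch 3 follows this shape.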

> 
> <...>
> 
> 
> > +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> > +
> > +	if (vfio_cfg->vfio_container_fd < 0)
> > +		return -1;
> > +
> > +	return vfio_cfg->vfio_container_fd;
> > +}
> 
> Please correct me if I'm wrong, but this patch appears to be mistitled.
> You're not really creating multiple containers, you're just partitioning
> the existing one. Do we really need to open/store/close container fd's
> separately, if all we have is a single container anyway?

This driver is creating new containers for its devices; it needs each device
to have its own container, so that we can dma_map/unmap for the device
via its associated container.
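
The per-device DMA programming then looks like the sketch below, which follows
the shape of ifcvf_dma_map() in patch 3 (vid is the vhost device id, cfd the
container fd created for that device; error handling trimmed):

#include <stdlib.h>
#include <rte_memory.h>
#include <rte_vfio.h>
#include <rte_vhost.h>

/* map every guest memory region into this device's own container */
static int
map_vm_memory(int vid, int cfd)
{
	struct rte_vhost_memory *mem = NULL;
	uint32_t i;

	if (rte_vhost_get_mem_table(vid, &mem) < 0)
		return -1;

	for (i = 0; i < mem->nregions; i++) {
		struct rte_vhost_mem_region *reg = &mem->regions[i];
		struct rte_memseg ms;

		ms.addr_64 = reg->host_user_addr;	/* HVA */
		ms.iova = reg->guest_phys_addr;		/* GPA used as IOVA */
		ms.len = reg->size;
		rte_vfio_dma_map(cfd, VFIO_TYPE1_IOMMU, &ms);
	}

	free(mem);
	return 0;
}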

BRs,
Xiao

> 
> The semantics of this are also weird in multiprocess. When secondary
> process requests a container, we always create a new one, send it over
> IPC and close it afterwards. It seems to be oblivious that you may have
> several container fd's, and does not know which one you are asking for.
> We know it's all the same container, but that's clearly not what the
> code appears to be doing.
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/3] bus/pci: expose sysfs parsing API
  2018-03-14 13:30     ` Gaëtan Rivet
@ 2018-03-15 16:49       ` Wang, Xiao W
  2018-03-15 17:19         ` Gaëtan Rivet
  0 siblings, 1 reply; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-15 16:49 UTC (permalink / raw)
  To: Gaëtan Rivet, Burakov, Anatoly
  Cc: dev, Wang, Zhihong, maxime.coquelin, yliu, Liang, Cunming, Xu,
	Rosen, Chen, Junjie J, Daly, Dan

Hi Rivet,

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Wednesday, March 14, 2018 9:31 PM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> Cc: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org; Wang, Zhihong
> <zhihong.wang@intel.com>; maxime.coquelin@redhat.com;
> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Xu, Rosen
> <rosen.xu@intel.com>; Chen, Junjie J <junjie.j.chen@intel.com>; Daly, Dan
> <dan.daly@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 2/3] bus/pci: expose sysfs parsing API
> 
> Hi,
> 
> On Wed, Mar 14, 2018 at 11:19:31AM +0000, Burakov, Anatoly wrote:
> > On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> > > Some existing sysfs parsing functions are helpful for the later vDPA
> > > driver, this patch make them global and expose them to shared lib.
> > >
> > > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > > ---
> > >   drivers/bus/pci/linux/pci.c             | 9 ++++-----
> > >   drivers/bus/pci/linux/pci_init.h        | 8 ++++++++
> > >   drivers/bus/pci/rte_bus_pci_version.map | 8 ++++++++
> > >   3 files changed, 20 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> > > index abde64119..81e5e5650 100644
> > > --- a/drivers/bus/pci/linux/pci.c
> > > +++ b/drivers/bus/pci/linux/pci.c
> > > @@ -32,7 +32,7 @@
> > >   extern struct rte_pci_bus rte_pci_bus;
> > > -static int
> > > +int
> > >   pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
> >
> > Here and in other places - shouldn't this too be prefixed with rte_?
> >
> 
> A public PCI function should be prefixed by rte_pci_ yes.

OK, will add this prefix.

> 
> Additionally, if this function was to be exposed, then there should be a
> BSD implementation as well (shared map file).
> 
> I don't know how BSD works, I'm not sure parsing the filesystem is the
> way to get a PCI driver name. If so, maybe the function should be called
> another, generic, way, that would work for both linux and BSD (and
> ideally, having a real BSD implementation).

BSD does not parse the filesystem; it uses the PCIOCGETCONF ioctl to retrieve
PCI device information.
This function is quite Linux-specific, especially the API name. I'm afraid we
can only return an error on BSD for this API.

BRs,
Xiao

> 
> >
> > --
> > Thanks,
> > Anatoly
> 
> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/3] bus/pci: expose sysfs parsing API
  2018-03-15 16:49       ` Wang, Xiao W
@ 2018-03-15 17:19         ` Gaëtan Rivet
  2018-03-19  1:31           ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Gaëtan Rivet @ 2018-03-15 17:19 UTC (permalink / raw)
  To: Wang, Xiao W
  Cc: Burakov, Anatoly, dev, Wang, Zhihong, maxime.coquelin, yliu,
	Liang, Cunming, Xu, Rosen, Chen, Junjie J, Daly, Dan

On Thu, Mar 15, 2018 at 04:49:41PM +0000, Wang, Xiao W wrote:
> Hi Rivet,
> 
> > -----Original Message-----
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > Sent: Wednesday, March 14, 2018 9:31 PM
> > To: Burakov, Anatoly <anatoly.burakov@intel.com>
> > Cc: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org; Wang, Zhihong
> > <zhihong.wang@intel.com>; maxime.coquelin@redhat.com;
> > yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Xu, Rosen
> > <rosen.xu@intel.com>; Chen, Junjie J <junjie.j.chen@intel.com>; Daly, Dan
> > <dan.daly@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 2/3] bus/pci: expose sysfs parsing API
> > 
> > Hi,
> > 
> > On Wed, Mar 14, 2018 at 11:19:31AM +0000, Burakov, Anatoly wrote:
> > > On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> > > > Some existing sysfs parsing functions are helpful for the later vDPA
> > > > driver, this patch make them global and expose them to shared lib.
> > > >
> > > > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > > > ---
> > > >   drivers/bus/pci/linux/pci.c             | 9 ++++-----
> > > >   drivers/bus/pci/linux/pci_init.h        | 8 ++++++++
> > > >   drivers/bus/pci/rte_bus_pci_version.map | 8 ++++++++
> > > >   3 files changed, 20 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> > > > index abde64119..81e5e5650 100644
> > > > --- a/drivers/bus/pci/linux/pci.c
> > > > +++ b/drivers/bus/pci/linux/pci.c
> > > > @@ -32,7 +32,7 @@
> > > >   extern struct rte_pci_bus rte_pci_bus;
> > > > -static int
> > > > +int
> > > >   pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
> > >
> > > Here and in other places - shouldn't this too be prefixed with rte_?
> > >
> > 
> > A public PCI function should be prefixed by rte_pci_ yes.
> 
> OK, will add this prefix.
> 
> > 
> > Additionally, if this function was to be exposed, then there should be a
> > BSD implementation as well (shared map file).
> > 
> > I don't know how BSD works, I'm not sure parsing the filesystem is the
> > way to get a PCI driver name. If so, maybe the function should be called
> > another, generic, way, that would work for both linux and BSD (and
> > ideally, having a real BSD implementation).
> 
> BSD does not parse the filesystem; it uses the PCIOCGETCONF ioctl to retrieve
> PCI device information.
> This function is quite Linux-specific, especially the API name. I'm afraid we
> can only return an error on BSD for this API.

How about renaming the function to something like
rte_pci_device_kdriver_name();

and allowing for a sensible BSD implementation to happen if someone
needs it?

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/3] bus/pci: expose sysfs parsing API
  2018-03-15 17:19         ` Gaëtan Rivet
@ 2018-03-19  1:31           ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-19  1:31 UTC (permalink / raw)
  To: Gaëtan Rivet
  Cc: Burakov, Anatoly, dev, Wang, Zhihong, maxime.coquelin, yliu,
	Liang, Cunming, Xu, Rosen, Chen, Junjie J, Daly, Dan

Hi Rivet,

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Friday, March 16, 2018 1:19 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>
> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org; Wang,
> Zhihong <zhihong.wang@intel.com>; maxime.coquelin@redhat.com;
> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Xu, Rosen
> <rosen.xu@intel.com>; Chen, Junjie J <junjie.j.chen@intel.com>; Daly, Dan
> <dan.daly@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 2/3] bus/pci: expose sysfs parsing API
> 
> On Thu, Mar 15, 2018 at 04:49:41PM +0000, Wang, Xiao W wrote:
> > Hi Rivet,
> >
> > > -----Original Message-----
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > Sent: Wednesday, March 14, 2018 9:31 PM
> > > To: Burakov, Anatoly <anatoly.burakov@intel.com>
> > > Cc: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org; Wang,
> Zhihong
> > > <zhihong.wang@intel.com>; maxime.coquelin@redhat.com;
> > > yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Xu,
> Rosen
> > > <rosen.xu@intel.com>; Chen, Junjie J <junjie.j.chen@intel.com>; Daly, Dan
> > > <dan.daly@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 2/3] bus/pci: expose sysfs parsing API
> > >
> > > Hi,
> > >
> > > On Wed, Mar 14, 2018 at 11:19:31AM +0000, Burakov, Anatoly wrote:
> > > > On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> > > > > Some existing sysfs parsing functions are helpful for the later vDPA
> > > > > driver, this patch make them global and expose them to shared lib.
> > > > >
> > > > > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > > > > ---
> > > > >   drivers/bus/pci/linux/pci.c             | 9 ++++-----
> > > > >   drivers/bus/pci/linux/pci_init.h        | 8 ++++++++
> > > > >   drivers/bus/pci/rte_bus_pci_version.map | 8 ++++++++
> > > > >   3 files changed, 20 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> > > > > index abde64119..81e5e5650 100644
> > > > > --- a/drivers/bus/pci/linux/pci.c
> > > > > +++ b/drivers/bus/pci/linux/pci.c
> > > > > @@ -32,7 +32,7 @@
> > > > >   extern struct rte_pci_bus rte_pci_bus;
> > > > > -static int
> > > > > +int
> > > > >   pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
> > > >
> > > > Here and in other places - shouldn't this too be prefixed with rte_?
> > > >
> > >
> > > A public PCI function should be prefixed by rte_pci_ yes.
> >
> > OK, will add this prefix.
> >
> > >
> > > Additionally, if this function was to be exposed, then there should be a
> > > BSD implementation as well (shared map file).
> > >
> > > I don't know how BSD works, I'm not sure parsing the filesystem is the
> > > way to get a PCI driver name. If so, maybe the function should be called
> > > another, generic, way, that would work for both linux and BSD (and
> > > ideally, having a real BSD implementation).
> >
> > BSD does not parse the filesystem; it uses the PCIOCGETCONF ioctl to retrieve
> > PCI device information.
> > This function is quite Linux-specific, especially the API name. I'm afraid we
> > can only return an error on BSD for this API.
> 
> How about renaming the function to something like
> rte_pci_device_kdriver_name();
> 
> and allowing for a sensible BSD implementation to happen if someone
> needs it?

Yes, it looks more generic, and allows a BSD implementation to happen.
I will rename it as below in the next version.
rte_pci_device_kdriver_name(const struct rte_pci_addr *addr, char *dri_name)
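
For reference, the caller side in check_pci_dev() would then look roughly like
this (a sketch only, assuming the renamed function keeps the current "copy the
kernel driver name into the caller's buffer, return 0 on success" semantics;
dev_is_vfio_pci is just an illustrative name):

#include <limits.h>
#include <string.h>
#include <rte_bus_pci.h>

static int
dev_is_vfio_pci(const struct rte_pci_device *dev)
{
	char drv_name[PATH_MAX];

	if (rte_pci_device_kdriver_name(&dev->addr, drv_name) != 0)
		return 0;
	/* the vDPA driver only takes over VFs bound to vfio-pci */
	return strcmp(drv_name, "vfio-pci") == 0;
}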

BRs,
Xiao

> 
> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v2 0/3] add ifcvf driver
  2018-03-09 23:08 ` [PATCH 2/3] bus/pci: expose sysfs parsing API Xiao Wang
  2018-03-14 11:19   ` Burakov, Anatoly
@ 2018-03-21 13:21   ` Xiao Wang
  2018-03-21 13:21     ` [PATCH v2 1/3] eal/vfio: add support for multiple container Xiao Wang
                       ` (2 more replies)
  1 sibling, 3 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-21 13:21 UTC (permalink / raw)
  To: maxime.coquelin, yliu
  Cc: dev, zhihong.wang, tiwei.bie, junjie.j.chen, rosen.xu, dan.daly,
	cunming.liang, anatoly.burakov, gaetan.rivet, Xiao Wang

This patch set has dependency on http://dpdk.org/dev/patchwork/patch/36241/
(vhost: support selective datapath);

ifc VF is a virtio vring compatible device; it can be used to accelerate the
vhost data path. This patch set implements vDPA driver ops which configure the
ifc VF to be a vhost data path accelerator.

ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
to it. It registers vDPA device ops to vhost lib to enable these VFs to be
used as vhost data path accelerator.

Live migration feature is supported by ifc VF and this driver enables
it based on vhost lib.

vDPA needs to create different containers for different devices, thus this
patch set adds some APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_create_container
- rte_vfio_destroy_container
- rte_vfio_bind_group_no
- rte_vfio_unbind_group_no
By this extension, a device can be put into a new specific container, rather
than the previous default container.
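
For illustration, the intended lifecycle of these APIs is roughly as follows
(a sketch, not code from the patches; iommu_group_no is whatever IOMMU group
the device belongs to):

	int cfd = rte_vfio_create_container();

	rte_vfio_bind_group_no(cfd, iommu_group_no);
	/* ... rte_vfio_dma_map()/rte_vfio_dma_unmap() against cfd ... */
	rte_vfio_unbind_group_no(cfd, iommu_group_no);
	rte_vfio_destroy_container(cfd);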

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.

Junjie Chen (1):
  eal/vfio: add support for multiple container

Xiao Wang (2):
  bus/pci: expose sysfs parsing API
  net/ifcvf: add ifcvf driver

 config/common_base                       |    6 +
 config/common_linuxapp                   |    1 +
 drivers/bus/pci/Makefile                 |    2 +
 drivers/bus/pci/bsd/pci.c                |   14 +
 drivers/bus/pci/linux/pci.c              |   22 +-
 drivers/bus/pci/rte_bus_pci.h            |   32 +
 drivers/bus/pci/rte_bus_pci_version.map  |    8 +
 drivers/net/Makefile                     |    1 +
 drivers/net/ifcvf/Makefile               |   40 +
 drivers/net/ifcvf/base/ifcvf.c           |  329 ++++++++
 drivers/net/ifcvf/base/ifcvf.h           |  156 ++++
 drivers/net/ifcvf/base/ifcvf_osdep.h     |   52 ++
 drivers/net/ifcvf/ifcvf_ethdev.c         | 1240 ++++++++++++++++++++++++++++++
 drivers/net/ifcvf/rte_ifcvf_version.map  |    4 +
 lib/librte_eal/bsdapp/eal/eal.c          |   51 +-
 lib/librte_eal/common/include/rte_vfio.h |  117 ++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   |  553 ++++++++++---
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |    2 +
 lib/librte_eal/rte_eal_version.map       |    7 +
 mk/rte.app.mk                            |    1 +
 20 files changed, 2527 insertions(+), 111 deletions(-)
 create mode 100644 drivers/net/ifcvf/Makefile
 create mode 100644 drivers/net/ifcvf/base/ifcvf.c
 create mode 100644 drivers/net/ifcvf/base/ifcvf.h
 create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
 create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v2 1/3] eal/vfio: add support for multiple container
  2018-03-21 13:21   ` [PATCH v2 0/3] add ifcvf driver Xiao Wang
@ 2018-03-21 13:21     ` Xiao Wang
  2018-03-21 20:32       ` Thomas Monjalon
  2018-03-21 13:21     ` [PATCH v2 2/3] bus/pci: expose sysfs parsing API Xiao Wang
  2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
  2 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-03-21 13:21 UTC (permalink / raw)
  To: maxime.coquelin, yliu
  Cc: dev, zhihong.wang, tiwei.bie, junjie.j.chen, rosen.xu, dan.daly,
	cunming.liang, anatoly.burakov, gaetan.rivet, Xiao Wang

From: Junjie Chen <junjie.j.chen@intel.com>

Currently eal vfio framework binds vfio group fd to the default
container fd, while in some cases, e.g. vDPA (vhost data path
acceleration), we want to set vfio group to a new container and
program DMA mapping via this new container, so this patch adds
APIs to support multiple container.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
v2:
- Free memory when create container fails.
- Add boundary check on group idx.
---
 lib/librte_eal/bsdapp/eal/eal.c          |  51 ++-
 lib/librte_eal/common/include/rte_vfio.h | 117 ++++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 553 ++++++++++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   2 +
 lib/librte_eal/rte_eal_version.map       |   7 +
 5 files changed, 628 insertions(+), 102 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..6cc321a70 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -38,6 +38,7 @@
 #include <rte_interrupts.h>
 #include <rte_bus.h>
 #include <rte_dev.h>
+#include <rte_vfio.h>
 #include <rte_devargs.h>
 #include <rte_version.h>
 #include <rte_atomic.h>
@@ -738,15 +739,6 @@ rte_eal_vfio_intr_mode(void)
 /* dummy forward declaration. */
 struct vfio_device_info;
 
-/* dummy prototypes. */
-int rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
-		int *vfio_dev_fd, struct vfio_device_info *device_info);
-int rte_vfio_release_device(const char *sysfs_base, const char *dev_addr, int fd);
-int rte_vfio_enable(const char *modname);
-int rte_vfio_is_enabled(const char *modname);
-int rte_vfio_noiommu_is_enabled(void);
-int rte_vfio_clear_group(int vfio_group_fd);
-
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
 		      __rte_unused int *vfio_dev_fd,
@@ -781,3 +773,44 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int rte_vfio_bind_group_no(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_unbind_group_no(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_get_group_fd(__rte_unused int iommu_group_no)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index e981a6228..3aad9cace 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -123,6 +123,121 @@ int rte_vfio_noiommu_is_enabled(void);
 int
 rte_vfio_clear_group(int vfio_group_fd);
 
-#endif /* VFIO_PRESENT */
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container.
+ * @return
+ *    the container fd if successful
+ *    < 0 otherwise
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbinding all vfio groups bound to it.
+ * @param container_fd
+ *   the container fd to destroy
+ * @return
+ *    0 if successful.
+ *   !0 otherwise.
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind a group number to container.
+ *
+ * @param container_fd
+ *   the fd of the container
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ * @return
+ *    group fd if successful
+ *    < 0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group_no(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind a group from specified container.
+ *
+ * @param container_fd
+ *   the fd of the container
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group_no(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA mapping for a device in the specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *   the dma type for mapping
+ * @param ms
+ *   the dma address region to map
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA unmapping for a device in the specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *    the dma map type
+ * @param ms
+ *   the dma address region to unmap
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Get group fd via group number.
+ * @param iommu_group_no
+ *  the group number
+ * @return
+ *     corresponding group fd if successful
+ *     -1 if failed
+ */
+int __rte_experimental
+rte_vfio_get_group_fd(int iommu_group_no);
 
+#endif /* VFIO_PRESENT */
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..8a7ed84d7 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
 
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
-
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,351 @@ vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+vfio_get_container(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+vfio_get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+static int
+vfio_find_container_idx(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return i;
+		}
+	}
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+	vfio_cfg = vfio_cfgs[i];
+
+	/* get a container fd while i still indexes the new config slot */
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+	if (vfio_cfg->vfio_container_fd < 0) {
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+	}
+	vfio_cfg->vfio_active_groups = 0;
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = vfio_get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[idx];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group_no(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group_no(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	/* Check room for new group */
+	if (cur_vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (cur_vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	cur_vfio_cfg->vfio_active_groups++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group_no(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (cur_vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	cur_vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_find_container_idx(iommu_group_no);
+	vfio_cfg = vfio_cfgs[i];
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
-		if (i < 0)
+		if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
+			RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		}
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,9 +518,11 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
-	int ret;
+	int ret = 0;
+	int index;
 
 	/* get group number */
 	ret = vfio_get_group_no(sysfs_base, dev_addr, &iommu_group_no);
@@ -309,12 +568,14 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	index = vfio_find_container_idx(iommu_group_no);
+	vfio_container_fd = vfio_cfgs[index]->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +592,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfgs[index]->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +606,13 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +656,7 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +724,9 @@ rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +748,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +765,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +933,87 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd,
+	const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	struct vfio_iommu_type1_dma_map dma_map;
+	int ret;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	if (ms->addr == NULL) {
+		RTE_LOG(ERR, EAL, "invalid dma addr\n");
+		return -1;
+	}
 
-		if (ms[i].addr == NULL)
-			break;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
+	return 0;
+}
+
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd,
+	const struct rte_memseg *ms)
+{
+	int ret;
+	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
 			return -1;
-		}
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1157,59 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int
+rte_vfio_dma_map(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int
+rte_vfio_dma_unmap(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int rte_vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = vfio_cfgs[i];
+		if (!vfio_cfg)
+			continue;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg->vfio_groups[j].fd;
+		}
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..716fe4551 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -157,6 +157,8 @@ int vfio_mp_sync_setup(void);
 #define SOCKET_NO_FD 0x1
 #define SOCKET_ERR 0xFF
 
+#define VFIO_MAX_CONTAINERS 256
+
 #endif /* VFIO_PRESENT */
 
 #endif /* EAL_VFIO_H_ */
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d12360235..fc78a1581 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -254,5 +254,12 @@ EXPERIMENTAL {
 	rte_service_set_runstate_mapped_check;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_vfio_bind_group_no;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_get_group_fd;
+	rte_vfio_unbind_group_no;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v2 2/3] bus/pci: expose sysfs parsing API
  2018-03-21 13:21   ` [PATCH v2 0/3] add ifcvf driver Xiao Wang
  2018-03-21 13:21     ` [PATCH v2 1/3] eal/vfio: add support for multiple container Xiao Wang
@ 2018-03-21 13:21     ` Xiao Wang
  2018-03-21 20:44       ` Thomas Monjalon
  2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
  2 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-03-21 13:21 UTC (permalink / raw)
  To: maxime.coquelin, yliu
  Cc: dev, zhihong.wang, tiwei.bie, junjie.j.chen, rosen.xu, dan.daly,
	cunming.liang, anatoly.burakov, gaetan.rivet, Xiao Wang

Some existing sysfs parsing functions are helpful for the later vDPA
driver, so this patch makes them global and exposes them in the shared
library.
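
Not part of the patch, but a rough sketch of how a vDPA driver can consume
the exposed helpers (mirroring the later ifcvf usage); "dev" is assumed to
be an already-filled struct rte_pci_device and error handling is trimmed:

	char driver[PATH_MAX];
	char filename[PATH_MAX];

	/* kernel driver currently bound to the device */
	if (rte_pci_device_kdriver_name(&dev->addr, driver) != 0)
		return -1;
	if (strcmp(driver, "vfio-pci") != 0)
		return -1;

	/* fill dev->mem_resource[] from the sysfs "resource" file */
	snprintf(filename, sizeof(filename), "%s/" PCI_PRI_FMT "/resource",
		 rte_pci_get_sysfs_path(), dev->addr.domain, dev->addr.bus,
		 dev->addr.devid, dev->addr.function);
	if (rte_pci_parse_sysfs_resource(filename, dev) < 0)
		return -1;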

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
---
 drivers/bus/pci/Makefile                |  2 ++
 drivers/bus/pci/bsd/pci.c               | 14 ++++++++++++++
 drivers/bus/pci/linux/pci.c             | 22 +++++++++++++---------
 drivers/bus/pci/rte_bus_pci.h           | 32 ++++++++++++++++++++++++++++++++
 drivers/bus/pci/rte_bus_pci_version.map |  8 ++++++++
 5 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
index f3df1c4ce..eb494e0f2 100644
--- a/drivers/bus/pci/Makefile
+++ b/drivers/bus/pci/Makefile
@@ -49,6 +49,8 @@ CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/$(SYSTEM)
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/$(SYSTEM)app/eal
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_pci
 
diff --git a/drivers/bus/pci/bsd/pci.c b/drivers/bus/pci/bsd/pci.c
index 655b34b7e..08fbe085e 100644
--- a/drivers/bus/pci/bsd/pci.c
+++ b/drivers/bus/pci/bsd/pci.c
@@ -649,3 +649,17 @@ rte_pci_ioport_unmap(struct rte_pci_ioport *p)
 
 	return ret;
 }
+
+int __rte_experimental
+rte_pci_device_kdriver_name(__rte_unused const struct rte_pci_addr *addr,
+		__rte_unused char *dri_name)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_pci_parse_sysfs_resource(__rte_unused const char *filename,
+		__rte_unused struct rte_pci_device *dev)
+{
+	return -1;
+}
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index abde64119..4ba411c7c 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -32,17 +32,22 @@
 
 extern struct rte_pci_bus rte_pci_bus;
 
-static int
-pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
+int __rte_experimental
+rte_pci_device_kdriver_name(const struct rte_pci_addr *addr, char *dri_name)
 {
 	int count;
+	char link[PATH_MAX];
 	char path[PATH_MAX];
 	char *name;
 
-	if (!filename || !dri_name)
+	if (!addr || !dri_name)
 		return -1;
 
-	count = readlink(filename, path, PATH_MAX);
+	snprintf(link, sizeof(link), "%s/" PCI_PRI_FMT "/driver",
+		 rte_pci_get_sysfs_path(), addr->domain, addr->bus, addr->devid,
+		 addr->function);
+
+	count = readlink(link, path, PATH_MAX);
 	if (count >= PATH_MAX)
 		return -1;
 
@@ -168,9 +173,8 @@ pci_parse_one_sysfs_resource(char *line, size_t len, uint64_t *phys_addr,
 	return 0;
 }
 
-/* parse the "resource" sysfs file */
-static int
-pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev)
+int __rte_experimental
+rte_pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev)
 {
 	FILE *f;
 	char buf[BUFSIZ];
@@ -302,7 +306,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 
 	/* parse resources */
 	snprintf(filename, sizeof(filename), "%s/resource", dirname);
-	if (pci_parse_sysfs_resource(filename, dev) < 0) {
+	if (rte_pci_parse_sysfs_resource(filename, dev) < 0) {
 		RTE_LOG(ERR, EAL, "%s(): cannot parse resource\n", __func__);
 		free(dev);
 		return -1;
@@ -310,7 +314,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 
 	/* parse driver */
 	snprintf(filename, sizeof(filename), "%s/driver", dirname);
-	ret = pci_get_kernel_driver_by_path(filename, driver);
+	ret = rte_pci_device_kdriver_name(addr, driver);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "Fail to get kernel driver\n");
 		free(dev);
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index 357afb912..98b643356 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -304,6 +304,38 @@ void rte_pci_ioport_read(struct rte_pci_ioport *p,
 void rte_pci_ioport_write(struct rte_pci_ioport *p,
 		const void *data, size_t len, off_t offset);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Get the name of the kernel driver bound to a PCI device.
+ *
+ * @param addr
+ *   The PCI device's location.
+ * @param dri_name
+ *   Output buffer pointer.
+ * @return
+ *   0 on success, negative on error.
+ */
+int __rte_experimental
+rte_pci_device_kdriver_name(const struct rte_pci_addr *addr, char *dri_name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Parse the "resource" sysfs file.
+ *
+ * @param filename
+ *   The PCI resource file path.
+ * @param dev
+ *   Pointer of rte_pci_device object, into which the parse result is recorded.
+ * @return
+ *   0 on success, -1 on error, 1 on no driver found.
+ */
+int __rte_experimental
+rte_pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/bus/pci/rte_bus_pci_version.map b/drivers/bus/pci/rte_bus_pci_version.map
index 27e9c4f10..d5576302b 100644
--- a/drivers/bus/pci/rte_bus_pci_version.map
+++ b/drivers/bus/pci/rte_bus_pci_version.map
@@ -16,3 +16,11 @@ DPDK_17.11 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	global:
+
+	rte_pci_device_kdriver_name;
+	rte_pci_parse_sysfs_resource;
+
+} DPDK_17.11;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-21 13:21   ` [PATCH v2 0/3] add ifcvf driver Xiao Wang
  2018-03-21 13:21     ` [PATCH v2 1/3] eal/vfio: add support for multiple container Xiao Wang
  2018-03-21 13:21     ` [PATCH v2 2/3] bus/pci: expose sysfs parsing API Xiao Wang
@ 2018-03-21 13:21     ` Xiao Wang
  2018-03-21 20:52       ` Thomas Monjalon
                         ` (3 more replies)
  2 siblings, 4 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-21 13:21 UTC (permalink / raw)
  To: maxime.coquelin, yliu
  Cc: dev, zhihong.wang, tiwei.bie, junjie.j.chen, rosen.xu, dan.daly,
	cunming.liang, anatoly.burakov, gaetan.rivet, Xiao Wang

ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
to it. It registers vDPA device ops to vhost lib to enable these VFs to be
used as vhost data path accelerator.

Live migration feature is supported by ifc VF and this driver enables
it based on vhost lib.

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is supported currently.
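
As a usage sketch only (not taken from this patch): the control vdev is
attached on the EAL command line with the VF location passed via device
arguments, e.g.

	--vdev 'net_ifcvf,bdf=0000:06:00.2'

where the vdev name net_ifcvf and the bdf value are assumptions here; only
the "bdf"/"int" argument keys appear in this patch.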

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
---
v2:
- Rebase on Zhihong's vDPA v3 patch set.
---
 config/common_base                      |    6 +
 config/common_linuxapp                  |    1 +
 drivers/net/Makefile                    |    1 +
 drivers/net/ifcvf/Makefile              |   40 +
 drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
 drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
 drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
 drivers/net/ifcvf/ifcvf_ethdev.c        | 1240 +++++++++++++++++++++++++++++++
 drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
 mk/rte.app.mk                           |    1 +
 10 files changed, 1830 insertions(+)
 create mode 100644 drivers/net/ifcvf/Makefile
 create mode 100644 drivers/net/ifcvf/base/ifcvf.c
 create mode 100644 drivers/net/ifcvf/base/ifcvf.h
 create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
 create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index ad03cf433..06fce1ebf 100644
--- a/config/common_base
+++ b/config/common_base
@@ -791,6 +791,12 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..358d00468 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..496acf2d2 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -53,6 +53,7 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
diff --git a/drivers/net/ifcvf/Makefile b/drivers/net/ifcvf/Makefile
new file mode 100644
index 000000000..f3670cdf2
--- /dev/null
+++ b/drivers/net/ifcvf/Makefile
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_mempool -lrte_pci
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost
+LDLIBS += -lrte_bus_vdev -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/linux
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+$(foreach obj, $(BASE_DRIVER_OBJS), $(eval CFLAGS_$(obj)+=$(CFLAGS_BASE_DRIVER)))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf_ethdev.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifcvf/base/ifcvf.c b/drivers/net/ifcvf/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifcvf/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->notify_base,
+			hw->isr, hw->dev_cfg,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)(ring_state >> 16);
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifcvf/base/ifcvf.h b/drivers/net/ifcvf/base/ifcvf.h
new file mode 100644
index 000000000..4a3a94c8c
--- /dev/null
+++ b/drivers/net/ifcvf/base/ifcvf.h
@@ -0,0 +1,156 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_MAX_QUEUES		1
+#define IFCVF_MAX_DEVICES		64
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifcvf/base/ifcvf_osdep.h b/drivers/net/ifcvf/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifcvf/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifcvf/ifcvf_ethdev.c b/drivers/net/ifcvf/ifcvf_ethdev.c
new file mode 100644
index 000000000..3d6250959
--- /dev/null
+++ b/drivers/net/ifcvf/ifcvf_ethdev.c
@@ -0,0 +1,1240 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+#include <sys/mman.h>
+
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_ethdev_vdev.h>
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_bus_vdev.h>
+#include <rte_bus_pci.h>
+#include <rte_kvargs.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <eal_vfio.h>
+#include <pci_init.h>
+
+#include "base/ifcvf.h"
+
+#define ETH_IFCVF_BDF_ARG	"bdf"
+#define ETH_IFCVF_DEVICES_ARG	"int"
+
+static const char *const valid_arguments[] = {
+	ETH_IFCVF_BDF_ARG,
+	ETH_IFCVF_DEVICES_ARG,
+	NULL
+};
+
+static struct ether_addr base_eth_addr = {
+	.addr_bytes = {
+		0x56 /* V */,
+		0x44 /* D */,
+		0x50 /* P */,
+		0x41 /* A */,
+		0x00,
+		0x00
+	}
+};
+
+struct ifcvf_info {
+	struct ifcvf_hw hw;
+	struct rte_pci_device pdev;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct ifcvf_internal {
+	char *dev_name;
+	uint16_t max_queues;
+	uint16_t max_devices;
+	uint64_t features;
+	struct rte_vdpa_eng_addr eng_addr;
+	int eid;
+	struct ifcvf_info vf_info[IFCVF_MAX_DEVICES];
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct rte_eth_dev *eth_dev;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct rte_eth_link vdpa_link = {
+		.link_speed = 10000,
+		.link_duplex = ETH_LINK_FULL_DUPLEX,
+		.link_status = ETH_LINK_DOWN
+};
+
+static struct internal_list *
+find_internal_resource_by_eid(int eid)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (eid == internal->eid) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_eng_addr(struct rte_vdpa_eng_addr *addr)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (addr == &internal->eng_addr) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+check_pci_dev(struct rte_pci_device *dev)
+{
+	char filename[PATH_MAX];
+	char dev_dir[PATH_MAX];
+	char driver[PATH_MAX];
+	int ret;
+
+	snprintf(dev_dir, sizeof(dev_dir), "%s/" PCI_PRI_FMT,
+			rte_pci_get_sysfs_path(),
+			dev->addr.domain, dev->addr.bus,
+			dev->addr.devid, dev->addr.function);
+	if (access(dev_dir, R_OK) != 0) {
+		RTE_LOG(ERR, PMD, "%s not exist\n", dev_dir);
+		return -1;
+	}
+
+	/* parse resources */
+	snprintf(filename, sizeof(filename), "%s/resource", dev_dir);
+	if (rte_pci_parse_sysfs_resource(filename, dev) < 0) {
+		RTE_LOG(ERR, PMD, "cannot parse resource: %s\n", filename);
+		return -1;
+	}
+
+	/* parse driver */
+	ret = rte_pci_device_kdriver_name(&dev->addr, driver);
+	if (ret != 0) {
+		RTE_LOG(ERR, PMD, "Fail to get kernel driver: %s\n", dev_dir);
+		return -1;
+	}
+
+	if (strcmp(driver, "vfio-pci") != 0) {
+		RTE_LOG(ERR, PMD, "kernel driver %s is not vfio-pci\n", driver);
+		return -1;
+	}
+	dev->kdrv = RTE_KDRV_VFIO;
+	return 0;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_info *vf_info)
+{
+	struct rte_pci_device *dev = &vf_info->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	vf_info->vfio_container_fd = rte_vfio_create_container();
+	if (vf_info->vfio_container_fd < 0)
+		return -1;
+
+	ret = rte_vfio_bind_group_no(vf_info->vfio_container_fd,
+			iommu_group_no);
+	if (ret)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	vf_info->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+	vf_info->vfio_group_fd = rte_vfio_get_group_fd(iommu_group_no);
+	if (vf_info->vfio_group_fd < 0)
+		goto err;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		vf_info->hw.mem_resource[i].addr =
+			vf_info->pdev.mem_resource[i].addr;
+		vf_info->hw.mem_resource[i].phys_addr =
+			vf_info->pdev.mem_resource[i].phys_addr;
+		vf_info->hw.mem_resource[i].len =
+			vf_info->pdev.mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&vf_info->hw, &vf_info->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_destroy_container(vf_info->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_info *vf_info)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(vf_info->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = vf_info->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		RTE_LOG(INFO, PMD, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx\n", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_map(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_info *vf_info)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(vf_info->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = vf_info->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_unmap(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_info *vf_info)
+{
+	struct ifcvf_hw *hw = &vf_info->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = vf_info->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&vf_info->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_info *vf_info)
+{
+	struct ifcvf_hw *hw = &vf_info->hw;
+	int i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint8_t *log_buf;
+
+	vid = vf_info->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(vf_info->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages only for the packet buffers;
+		 * software helps to mark the used rings as dirty.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			pfn = hw->vring[i].used / 4096;
+			for (j = 0; j <= hw->vring[i].size * 8 / 4096; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_info *vf_info)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(vf_info->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = vf_info->pdev.intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vf_info->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(vf_info->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error enabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_info *vf_info)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(vf_info->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error disabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_info *vf_info = (struct ifcvf_info *)arg;
+	struct ifcvf_hw *hw = &vf_info->hw;
+
+	q_num = rte_vhost_get_vring_num(vf_info->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		RTE_LOG(ERR, PMD, "failed to create epoll instance\n");
+		return NULL;
+	}
+	vf_info->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(vf_info->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			RTE_LOG(ERR, PMD, "epoll add error, %s\n",
+					strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			RTE_LOG(ERR, PMD, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					RTE_LOG(INFO, PMD, "Error reading "
+						"kickfd: %s\n",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_info *vf_info)
+{
+	int ret;
+
+	ret = pthread_create(&vf_info->tid, NULL, notify_relay,
+			(void *)vf_info);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "failed to create notify relay pthread\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_info *vf_info)
+{
+	void *status;
+
+	if (vf_info->tid) {
+		pthread_cancel(vf_info->tid);
+		pthread_join(vf_info->tid, &status);
+	}
+	vf_info->tid = 0;
+
+	if (vf_info->epfd >= 0)
+		close(vf_info->epfd);
+	vf_info->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_info *vf_info)
+{
+	int ret;
+
+	rte_spinlock_lock(&vf_info->lock);
+
+	if (!rte_atomic32_read(&vf_info->running) &&
+	    (rte_atomic32_read(&vf_info->started) &&
+	     rte_atomic32_read(&vf_info->dev_attached))) {
+		ret = ifcvf_dma_map(vf_info);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(vf_info);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(vf_info);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(vf_info);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&vf_info->running, 1);
+	} else if (rte_atomic32_read(&vf_info->running) &&
+		   (!rte_atomic32_read(&vf_info->started) ||
+		    !rte_atomic32_read(&vf_info->dev_attached))) {
+		vdpa_ifcvf_stop(vf_info);
+
+		ret = unset_notify_relay(vf_info);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(vf_info);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(vf_info);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&vf_info->running, 0);
+	}
+
+	rte_spinlock_unlock(&vf_info->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&vf_info->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+	vf_info->vid = vid;
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_UP;
+
+	rte_atomic32_set(&vf_info->dev_attached, 1);
+	update_datapath(vf_info);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_DOWN;
+
+	rte_atomic32_set(&vf_info->dev_attached, 0);
+	update_datapath(vf_info);
+	vf_info->vid = -1;
+
+	return 0;
+}
+
+static int
+ifcvf_feature_set(int vid)
+{
+	uint64_t features;
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	uint64_t log_base, log_size;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+
+	rte_vhost_get_negotiated_features(vf_info->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(vf_info->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&vf_info->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+	return vf_info->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+	return vf_info->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int eid, did;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+	vf_info = &internal->vf_info[did];
+
+	reg.index = ifcvf_get_notify_region(&vf_info->hw);
+	ret = ioctl(vf_info->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Failed to get device region info: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&vf_info->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+vdpa_eng_init(int eid, struct rte_vdpa_eng_addr *addr)
+{
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	uint64_t features;
+	int i;
+
+	list = find_internal_resource_by_eng_addr(addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		vf_info->vfio_dev_fd = -1;
+		vf_info->vfio_group_fd = -1;
+		vf_info->vfio_container_fd = -1;
+
+		if (check_pci_dev(&vf_info->pdev) < 0)
+			return -1;
+
+		if (ifcvf_vfio_setup(vf_info) < 0)
+			return -1;
+	}
+
+	internal->eid = eid;
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->vf_info[0].hw);
+	internal->features = (features & ~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES);
+
+	return 0;
+}
+
+static int
+vdpa_eng_uninit(int eid)
+{
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	int i;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		rte_pci_unmap_device(&vf_info->pdev);
+		rte_vfio_destroy_container(vf_info->vfio_container_fd);
+	}
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK)
+static int
+vdpa_info_query(int eid, struct rte_vdpa_eng_attr *attr)
+{
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	attr->dev_num = internal->max_devices;
+	attr->queue_num = internal->max_queues;
+	attr->features = internal->features;
+	attr->protocol_features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+
+	return 0;
+}
+
+struct rte_vdpa_eng_driver vdpa_ifcvf_driver = {
+	.name = "ifcvf",
+	.eng_ops = {
+		.eng_init = vdpa_eng_init,
+		.eng_uninit = vdpa_eng_uninit,
+		.info_query = vdpa_info_query,
+	},
+	.dev_ops = {
+		.dev_conf = ifcvf_dev_config,
+		.dev_close = ifcvf_dev_close,
+		.vring_state_set = NULL,
+		.feature_set = ifcvf_feature_set,
+		.migration_done = NULL,
+		.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+		.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+		.get_notify_area = ifcvf_get_notify_area,
+	},
+};
+
+RTE_VDPA_REGISTER_DRIVER(ifcvf, vdpa_ifcvf_driver);
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	int i;
+
+	internal = dev->data->dev_private;
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		rte_atomic32_set(&vf_info->started, 1);
+		update_datapath(vf_info);
+	}
+
+	return 0;
+}
+
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	struct ifcvf_internal *internal;
+	struct ifcvf_info *vf_info;
+	int i;
+
+	internal = dev->data->dev_private;
+	for (i = 0; i < internal->max_devices; i++) {
+		vf_info = &internal->vf_info[i];
+		rte_atomic32_set(&vf_info->started, 0);
+		update_datapath(vf_info);
+	}
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	internal = dev->data->dev_private;
+	eth_dev_stop(dev);
+
+	list = find_internal_resource_by_eng_addr(&internal->eng_addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return;
+	}
+
+	rte_vdpa_unregister_engine(internal->eid);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+	rte_free(list);
+
+	rte_free(dev->data->mac_addrs);
+	free(internal->dev_name);
+	rte_free(internal);
+
+	dev->data->dev_private = NULL;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct ifcvf_internal *internal;
+
+	internal = dev->data->dev_private;
+	if (internal == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid device specified\n");
+		return;
+	}
+
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)-1;
+	dev_info->max_rx_queues = internal->max_queues;
+	dev_info->max_tx_queues = internal->max_queues;
+	dev_info->min_rx_bufsize = 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t rx_queue_id __rte_unused,
+		   uint16_t nb_rx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t tx_queue_id __rte_unused,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static uint16_t
+eth_ifcvf_rx(void *q __rte_unused, struct rte_mbuf **bufs __rte_unused,
+		uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static uint16_t
+eth_ifcvf_tx(void *q __rte_unused, struct rte_mbuf **bufs __rte_unused,
+		uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+};
+
+static int
+eth_dev_ifcvf_create(struct rte_vdev_device *dev,
+		struct rte_pci_addr *pci_addr, int devices)
+{
+	const char *name = rte_vdev_device_name(dev);
+	struct rte_eth_dev *eth_dev = NULL;
+	struct ether_addr *eth_addr = NULL;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+	struct rte_eth_dev_data *data = NULL;
+	struct rte_pci_addr pf_addr = *pci_addr;
+	int i;
+
+	list = rte_zmalloc_socket(name, sizeof(*list), 0,
+			dev->device.numa_node);
+	if (list == NULL)
+		goto error;
+
+	/* reserve an ethdev entry */
+	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
+	if (eth_dev == NULL)
+		goto error;
+
+	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
+			dev->device.numa_node);
+	if (eth_addr == NULL)
+		goto error;
+
+	*eth_addr = base_eth_addr;
+	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
+
+	internal = eth_dev->data->dev_private;
+	internal->dev_name = strdup(name);
+	if (internal->dev_name == NULL)
+		goto error;
+
+	internal->eng_addr.pci_addr = *pci_addr;
+	for (i = 0; i < devices; i++) {
+		pf_addr.domain = pci_addr->domain;
+		pf_addr.bus = pci_addr->bus;
+		pf_addr.devid = pci_addr->devid + (i + 1) / 8;
+		pf_addr.function = pci_addr->function + (i + 1) % 8;
+		internal->vf_info[i].pdev.addr = pf_addr;
+		rte_spinlock_init(&internal->vf_info[i].lock);
+	}
+	internal->max_devices = devices;
+
+	list->eth_dev = eth_dev;
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	data = eth_dev->data;
+	data->nb_rx_queues = IFCVF_MAX_QUEUES;
+	data->nb_tx_queues = IFCVF_MAX_QUEUES;
+	data->dev_link = vdpa_link;
+	data->mac_addrs = eth_addr;
+	data->dev_flags = RTE_ETH_DEV_INTR_LSC;
+	eth_dev->dev_ops = &ops;
+
+	/* assign rx and tx ops, could be used as vDPA fallback */
+	eth_dev->rx_pkt_burst = eth_ifcvf_rx;
+	eth_dev->tx_pkt_burst = eth_ifcvf_tx;
+
+	if (rte_vdpa_register_engine(vdpa_ifcvf_driver.name,
+				&internal->eng_addr) < 0)
+		goto error;
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(eth_addr);
+	if (internal && internal->dev_name)
+		free(internal->dev_name);
+	rte_free(internal);
+	if (eth_dev)
+		rte_eth_dev_release_port(eth_dev);
+
+	return -1;
+}
+
+static int
+get_pci_addr(const char *key __rte_unused, const char *value, void *extra_args)
+{
+	if (value == NULL || extra_args == NULL)
+		return -1;
+
+	return rte_pci_addr_parse(value, extra_args);
+}
+
+static inline int
+open_int(const char *key __rte_unused, const char *value, void *extra_args)
+{
+	uint16_t *n = extra_args;
+
+	if (value == NULL || extra_args == NULL)
+		return -EINVAL;
+
+	*n = (uint16_t)strtoul(value, NULL, 0);
+	if (*n == USHRT_MAX && errno == ERANGE)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * If this vdev is created by user, then ifcvf will be taken by
+ * this vdev.
+ */
+static int
+ifcvf_take_over(struct rte_pci_addr *pci_addr, int num)
+{
+	uint16_t port_id;
+	int i, ret;
+	char devname[RTE_DEV_NAME_MAX_LEN];
+	struct rte_pci_addr vf_addr = *pci_addr;
+
+	for (i = 0; i < num; i++) {
+		vf_addr.function += i % 8;
+		vf_addr.devid += i / 8;
+		rte_pci_device_name(&vf_addr, devname, RTE_DEV_NAME_MAX_LEN);
+		ret = rte_eth_dev_get_port_by_name(devname, &port_id);
+		if (ret == 0) {
+			rte_eth_dev_close(port_id);
+			if (rte_eth_dev_detach(port_id, devname) < 0)
+				return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int
+rte_ifcvf_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist = NULL;
+	int ret = 0;
+	struct rte_pci_addr pci_addr;
+	int devices;
+
+	RTE_LOG(INFO, PMD, "Initializing ifcvf for %s\n",
+			rte_vdev_device_name(dev));
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	if (rte_kvargs_count(kvlist, ETH_IFCVF_BDF_ARG) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_IFCVF_BDF_ARG,
+				&get_pci_addr, &pci_addr);
+		if (ret < 0)
+			goto out_free;
+
+	} else {
+		ret = -1;
+		goto out_free;
+	}
+
+	if (rte_kvargs_count(kvlist, ETH_IFCVF_DEVICES_ARG) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_IFCVF_DEVICES_ARG,
+				&open_int, &devices);
+		if (ret < 0 || devices > IFCVF_MAX_DEVICES)
+			goto out_free;
+	} else {
+		devices = 1;
+	}
+
+	ret = ifcvf_take_over(&pci_addr, devices);
+	if (ret < 0)
+		goto out_free;
+
+	eth_dev_ifcvf_create(dev, &pci_addr, devices);
+
+out_free:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
+static int
+rte_ifcvf_remove(struct rte_vdev_device *dev)
+{
+	const char *name;
+	struct rte_eth_dev *eth_dev = NULL;
+
+	name = rte_vdev_device_name(dev);
+	RTE_LOG(INFO, PMD, "Un-Initializing ifcvf for %s\n", name);
+
+	/* find an ethdev entry */
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	eth_dev_close(eth_dev);
+	rte_free(eth_dev->data);
+	rte_eth_dev_release_port(eth_dev);
+
+	return 0;
+}
+
+static struct rte_vdev_driver ifcvf_drv = {
+	.probe = rte_ifcvf_probe,
+	.remove = rte_ifcvf_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_ifcvf, ifcvf_drv);
+RTE_PMD_REGISTER_ALIAS(net_ifcvf, eth_ifcvf);
+RTE_PMD_REGISTER_PARAM_STRING(net_ifcvf,
+	"bdf=<bdf> "
+	"devices=<int>");
diff --git a/drivers/net/ifcvf/rte_ifcvf_version.map b/drivers/net/ifcvf/rte_ifcvf_version.map
new file mode 100644
index 000000000..33d237913
--- /dev/null
+++ b/drivers/net/ifcvf/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+EXPERIMENTAL {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..be5f765e4 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -171,6 +171,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF)          += -lrte_ifcvf
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/3] eal/vfio: add support for multiple container
  2018-03-21 13:21     ` [PATCH v2 1/3] eal/vfio: add support for multiple container Xiao Wang
@ 2018-03-21 20:32       ` Thomas Monjalon
  2018-03-21 21:37         ` Gaëtan Rivet
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2018-03-21 20:32 UTC (permalink / raw)
  To: Xiao Wang, junjie.j.chen
  Cc: dev, maxime.coquelin, yliu, zhihong.wang, tiwei.bie, rosen.xu,
	dan.daly, cunming.liang, anatoly.burakov, gaetan.rivet

Hi,

21/03/2018 14:21, Xiao Wang:
> +#endif /* VFIO_PRESENT */
>  #endif /* _RTE_VFIO_H_ */

Please keep the empty line which was present between endif.

> +	rte_vfio_create_container;
> +	rte_vfio_destroy_container;
> +	rte_vfio_bind_group_no;
> +	rte_vfio_unbind_group_no;
> +	rte_vfio_dma_map;
> +	rte_vfio_dma_unmap;
> +	rte_vfio_get_group_fd;

Please keep alphabetical order.

About the naming, I see "no" and "idx" are used.
Other APIs in DPDK are using "num" and "id". Any strong opinion?
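
For instance, keeping alphabetical order, the block could look like this
(just a sketch; the "num" spelling below is only an assumption until the
naming question is settled):

	rte_vfio_bind_group_num;
	rte_vfio_create_container;
	rte_vfio_destroy_container;
	rte_vfio_dma_map;
	rte_vfio_dma_unmap;
	rte_vfio_get_group_fd;
	rte_vfio_unbind_group_num;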

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/3] bus/pci: expose sysfs parsing API
  2018-03-21 13:21     ` [PATCH v2 2/3] bus/pci: expose sysfs parsing API Xiao Wang
@ 2018-03-21 20:44       ` Thomas Monjalon
  2018-03-22  2:46         ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2018-03-21 20:44 UTC (permalink / raw)
  To: Xiao Wang
  Cc: dev, maxime.coquelin, yliu, zhihong.wang, tiwei.bie,
	junjie.j.chen, rosen.xu, dan.daly, cunming.liang,
	anatoly.burakov, gaetan.rivet

21/03/2018 14:21, Xiao Wang:
> Some existing sysfs parsing functions are helpful for the later vDPA
> driver, this patch make them global and expose them to shared lib.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>  	/* parse driver */
>  	snprintf(filename, sizeof(filename), "%s/driver", dirname);
> -	ret = pci_get_kernel_driver_by_path(filename, driver);
> +	ret = rte_pci_device_kdriver_name(addr, driver);

I guess the snprintf above becomes useless.

> + * @param dri_name
> + *   Output buffer pointer.

Parameter name and comment can be improved here:
"kdrv_name" would be more meaningful.
As a comment, "Output buffer for kernel driver name"

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Parse the "resource" sysfs file.
> + *
> + * @param filename
> + *   The PCI resource file path.
> + * @dev
> + *   Pointer of rte_pci_device object, into which the parse result is recorded.
> + * @return
> + *   0 on success, -1 on error, 1 on no driver found.
> + */
> +int __rte_experimental
> +rte_pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev);

This is a Linux specific API.
Maybe remove "sysfs" and replace "filename" by "resource"?
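
Something along these lines, for example (only a sketch of the suggested
rename, not a definitive prototype):

	int __rte_experimental
	rte_pci_parse_resource(const char *resource, struct rte_pci_device *dev);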

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-15 16:49   ` Wang, Xiao W
@ 2018-03-21 20:47     ` Maxime Coquelin
  2018-03-23 10:27       ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-21 20:47 UTC (permalink / raw)
  To: Wang, Xiao W, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan

Hi Xiao,

On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Sunday, March 11, 2018 2:24 AM
>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>
>> Hi Xiao,
>>
>> On 03/10/2018 12:08 AM, Xiao Wang wrote:
>>> This patch set has dependency on
>> http://dpdk.org/dev/patchwork/patch/35635/
>>> (vhost: support selective datapath);
>>>
>>> ifc VF is compatible with virtio vring operations, this driver implements
>>> vDPA driver ops which configures ifc VF to be a vhost data path accelerator.
>>>
>>> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
>>> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
>>> used as vhost data path accelerator.
>>>
>>> Live migration feature is supported by ifc VF and this driver enables
>>> it based on vhost lib.
>>>
>>> vDPA needs to create different containers for different devices, thus this
>>> patch set adds APIs in eal/vfio to support multiple container.
>> Thanks for this! That will avoid having to duplicate these functions
>> for every new offload driver.
>>
>>
>>>
>>> Junjie Chen (1):
>>>     eal/vfio: add support for multiple container
>>>
>>> Xiao Wang (2):
>>>     bus/pci: expose sysfs parsing API
>>
>> Still, I'm not convinced the offload device should be a virtual device.
>> It is a real PCI device, why not having a new device type for offload
>> devices, and let the device to be probed automatically by the existing
>> device model?
> 
> IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
> In DPDK we need to have something to represent PF, to register itself as
> a vDPA engine, so a virtual device is used for this purpose.
I went through the code, and something is not clear to me.

Why do we need to have a representation of the PF in DPDK?
Why cannot we just bind at VF level?


> The VFs are used for vhost net offload, and we could implement exception traffic
> Rx/Tx function on the VFs in future via port-representor mechanism. So this patch
> keeps the device type as net.
> 
> BRs,
> Xiao
> 
>>
>> Thanks,
>> Maxime
>>
>>
>>>     net/ifcvf: add ifcvf driver
>>>
>>>    config/common_base                       |    6 +
>>>    config/common_linuxapp                   |    1 +
>>>    drivers/bus/pci/linux/pci.c              |    9 +-
>>>    drivers/bus/pci/linux/pci_init.h         |    8 +
>>>    drivers/bus/pci/rte_bus_pci_version.map  |    8 +
>>>    drivers/net/Makefile                     |    1 +
>>>    drivers/net/ifcvf/Makefile               |   40 +
>>>    drivers/net/ifcvf/base/ifcvf.c           |  329 ++++++++
>>>    drivers/net/ifcvf/base/ifcvf.h           |  156 ++++
>>>    drivers/net/ifcvf/base/ifcvf_osdep.h     |   52 ++
>>>    drivers/net/ifcvf/ifcvf_ethdev.c         | 1241
>> ++++++++++++++++++++++++++++++
>>>    drivers/net/ifcvf/rte_ifcvf_version.map  |    4 +
>>>    lib/librte_eal/bsdapp/eal/eal.c          |   51 +-
>>>    lib/librte_eal/common/include/rte_vfio.h |  117 ++-
>>>    lib/librte_eal/linuxapp/eal/eal_vfio.c   |  553 ++++++++++---
>>>    lib/librte_eal/linuxapp/eal/eal_vfio.h   |    2 +
>>>    lib/librte_eal/rte_eal_version.map       |    7 +
>>>    mk/rte.app.mk                            |    1 +
>>>    18 files changed, 2480 insertions(+), 106 deletions(-)
>>>    create mode 100644 drivers/net/ifcvf/Makefile
>>>    create mode 100644 drivers/net/ifcvf/base/ifcvf.c
>>>    create mode 100644 drivers/net/ifcvf/base/ifcvf.h
>>>    create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
>>>    create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
>>>    create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
>>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
@ 2018-03-21 20:52       ` Thomas Monjalon
  2018-03-23 10:39         ` Wang, Xiao W
  2018-03-21 20:57       ` Maxime Coquelin
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2018-03-21 20:52 UTC (permalink / raw)
  To: Xiao Wang, rosen.xu
  Cc: dev, maxime.coquelin, yliu, zhihong.wang, tiwei.bie,
	junjie.j.chen, dan.daly, cunming.liang, anatoly.burakov,
	gaetan.rivet

21/03/2018 14:21, Xiao Wang:
> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> used as vhost data path accelerator.

Not everybody works at Intel.
Please explain what ifcvf means and what a control domain is.

> Live migration feature is supported by ifc VF and this driver enables
> it based on vhost lib.
> 
> Because vDPA driver needs to set up MSI-X vector to interrupt the guest,
> only vfio-pci is supported currently.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> ---
> v2:
> - Rebase on Zhihong's vDPA v3 patch set.
> ---
>  config/common_base                      |    6 +
>  config/common_linuxapp                  |    1 +
>  drivers/net/Makefile                    |    1 +
>  drivers/net/ifcvf/Makefile              |   40 +
>  drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
>  drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
>  drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
>  drivers/net/ifcvf/ifcvf_ethdev.c        | 1240 +++++++++++++++++++++++++++++++
>  drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
>  mk/rte.app.mk                           |    1 +

This feature needs to be explained and documented.
It will be helpful for understanding the mechanism and for having a good review.
Please do not merge it until there is good documentation.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
  2018-03-21 20:52       ` Thomas Monjalon
@ 2018-03-21 20:57       ` Maxime Coquelin
  2018-03-23 10:37         ` Wang, Xiao W
  2018-03-22  8:51       ` Ferruh Yigit
  2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
  3 siblings, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-21 20:57 UTC (permalink / raw)
  To: Xiao Wang, yliu
  Cc: dev, zhihong.wang, tiwei.bie, junjie.j.chen, rosen.xu, dan.daly,
	cunming.liang, anatoly.burakov, gaetan.rivet



On 03/21/2018 02:21 PM, Xiao Wang wrote:
> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> used as vhost data path accelerator.
> 
> Live migration feature is supported by ifc VF and this driver enables
> it based on vhost lib.
> 
> Because vDPA driver needs to set up MSI-X vector to interrupt the guest,
> only vfio-pci is supported currently.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> ---
> v2:
> - Rebase on Zhihong's vDPA v3 patch set.
> ---
>   config/common_base                      |    6 +
>   config/common_linuxapp                  |    1 +
>   drivers/net/Makefile                    |    1 +
>   drivers/net/ifcvf/Makefile              |   40 +
>   drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
>   drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
>   drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
>   drivers/net/ifcvf/ifcvf_ethdev.c        | 1240 +++++++++++++++++++++++++++++++
>   drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
>   mk/rte.app.mk                           |    1 +
>   10 files changed, 1830 insertions(+)
>   create mode 100644 drivers/net/ifcvf/Makefile
>   create mode 100644 drivers/net/ifcvf/base/ifcvf.c
>   create mode 100644 drivers/net/ifcvf/base/ifcvf.h
>   create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
>   create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
>   create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
> 

...

> +static int
> +eth_dev_ifcvf_create(struct rte_vdev_device *dev,
> +		struct rte_pci_addr *pci_addr, int devices)
> +{
> +	const char *name = rte_vdev_device_name(dev);
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct ether_addr *eth_addr = NULL;
> +	struct ifcvf_internal *internal = NULL;
> +	struct internal_list *list = NULL;
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_addr pf_addr = *pci_addr;
> +	int i;
> +
> +	list = rte_zmalloc_socket(name, sizeof(*list), 0,
> +			dev->device.numa_node);
> +	if (list == NULL)
> +		goto error;
> +
> +	/* reserve an ethdev entry */
> +	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
> +	if (eth_dev == NULL)
> +		goto error;
> +
> +	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
> +			dev->device.numa_node);
> +	if (eth_addr == NULL)
> +		goto error;
> +
> +	*eth_addr = base_eth_addr;
> +	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
> +
> +	internal = eth_dev->data->dev_private;
> +	internal->dev_name = strdup(name);
> +	if (internal->dev_name == NULL)
> +		goto error;
> +
> +	internal->eng_addr.pci_addr = *pci_addr;
> +	for (i = 0; i < devices; i++) {
> +		pf_addr.domain = pci_addr->domain;
> +		pf_addr.bus = pci_addr->bus;
> +		pf_addr.devid = pci_addr->devid + (i + 1) / 8;
> +		pf_addr.function = pci_addr->function + (i + 1) % 8;
> +		internal->vf_info[i].pdev.addr = pf_addr;
> +		rte_spinlock_init(&internal->vf_info[i].lock);
> +	}
> +	internal->max_devices = devices;
> +
> +	list->eth_dev = eth_dev;
> +	pthread_mutex_lock(&internal_list_lock);
> +	TAILQ_INSERT_TAIL(&internal_list, list, next);
> +	pthread_mutex_unlock(&internal_list_lock);
> +
> +	data = eth_dev->data;
> +	data->nb_rx_queues = IFCVF_MAX_QUEUES;
> +	data->nb_tx_queues = IFCVF_MAX_QUEUES;
> +	data->dev_link = vdpa_link;
> +	data->mac_addrs = eth_addr;

We might want one ethernet device per VF; for example, you set
dev_link.link_status to UP as soon as a VF is configured, and DOWN
when a single VF is removed.

> +	data->dev_flags = RTE_ETH_DEV_INTR_LSC;
> +	eth_dev->dev_ops = &ops;
> +
> +	/* assign rx and tx ops, could be used as vDPA fallback */
> +	eth_dev->rx_pkt_burst = eth_ifcvf_rx;
> +	eth_dev->tx_pkt_burst = eth_ifcvf_tx;
> +
> +	if (rte_vdpa_register_engine(vdpa_ifcvf_driver.name,
> +				&internal->eng_addr) < 0)
> +		goto error;
> +
> +	return 0;
> +
> +error:
> +	rte_free(list);
> +	rte_free(eth_addr);
> +	if (internal && internal->dev_name)
> +		free(internal->dev_name);
> +	rte_free(internal);
> +	if (eth_dev)
> +		rte_eth_dev_release_port(eth_dev);
> +
> +	return -1;
> +}
> +
> +static int
> +get_pci_addr(const char *key __rte_unused, const char *value, void *extra_args)
> +{
> +	if (value == NULL || extra_args == NULL)
> +		return -1;
> +
> +	return rte_pci_addr_parse(value, extra_args);
> +}
> +
> +static inline int
> +open_int(const char *key __rte_unused, const char *value, void *extra_args)
> +{
> +	uint16_t *n = extra_args;
> +
> +	if (value == NULL || extra_args == NULL)
> +		return -EINVAL;
> +
> +	*n = (uint16_t)strtoul(value, NULL, 0);
> +	if (*n == USHRT_MAX && errno == ERANGE)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +/*
> + * If this vdev is created by user, then ifcvf will be taken by
> + * this vdev.
> + */
> +static int
> +ifcvf_take_over(struct rte_pci_addr *pci_addr, int num)
> +{
> +	uint16_t port_id;
> +	int i, ret;
> +	char devname[RTE_DEV_NAME_MAX_LEN];
> +	struct rte_pci_addr vf_addr = *pci_addr;
> +
> +	for (i = 0; i < num; i++) {
> +		vf_addr.function += i % 8;
> +		vf_addr.devid += i / 8;
> +		rte_pci_device_name(&vf_addr, devname, RTE_DEV_NAME_MAX_LEN);
> +		ret = rte_eth_dev_get_port_by_name(devname, &port_id);
> +		if (ret == 0) {
> +			rte_eth_dev_close(port_id);
> +			if (rte_eth_dev_detach(port_id, devname) < 0)
> +				return -1;
> +		}
That seems a bit hard.
Shouldn't we at least check the port is not started?
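
Something like the sketch below could work; it assumes the dev_started flag
in rte_eth_dev_data is the right thing to look at here:

		ret = rte_eth_dev_get_port_by_name(devname, &port_id);
		if (ret == 0) {
			if (rte_eth_devices[port_id].data->dev_started) {
				RTE_LOG(ERR, PMD,
					"%s is still started, not taking it over\n",
					devname);
				return -1;
			}
			rte_eth_dev_close(port_id);
			if (rte_eth_dev_detach(port_id, devname) < 0)
				return -1;
		}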

> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_ifcvf_probe(struct rte_vdev_device *dev)
> +{
> +	struct rte_kvargs *kvlist = NULL;
> +	int ret = 0;
> +	struct rte_pci_addr pci_addr;
> +	int devices;
> +
> +	RTE_LOG(INFO, PMD, "Initializing ifcvf for %s\n",
> +			rte_vdev_device_name(dev));
> +
> +	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
> +	if (kvlist == NULL)
> +		return -1;
> +
> +	if (rte_kvargs_count(kvlist, ETH_IFCVF_BDF_ARG) == 1) {
> +		ret = rte_kvargs_process(kvlist, ETH_IFCVF_BDF_ARG,
> +				&get_pci_addr, &pci_addr);
> +		if (ret < 0)
> +			goto out_free;
> +
> +	} else {
> +		ret = -1;
> +		goto out_free;
> +	}
> +
> +	if (rte_kvargs_count(kvlist, ETH_IFCVF_DEVICES_ARG) == 1) {
> +		ret = rte_kvargs_process(kvlist, ETH_IFCVF_DEVICES_ARG,
> +				&open_int, &devices);
> +		if (ret < 0 || devices > IFCVF_MAX_DEVICES)
> +			goto out_free;
> +	} else {
> +		devices = 1;
> +	}
> +
> +	ret = ifcvf_take_over(&pci_addr, devices);
> +	if (ret < 0)
> +		goto out_free;
> +
> +	eth_dev_ifcvf_create(dev, &pci_addr, devices);
> +
> +out_free:
> +	rte_kvargs_free(kvlist);
> +	return ret;
> +}
> +
> +static int
> +rte_ifcvf_remove(struct rte_vdev_device *dev)
> +{
> +	const char *name;
> +	struct rte_eth_dev *eth_dev = NULL;
> +
> +	name = rte_vdev_device_name(dev);
> +	RTE_LOG(INFO, PMD, "Un-Initializing ifcvf for %s\n", name);
> +
> +	/* find an ethdev entry */
> +	eth_dev = rte_eth_dev_allocated(name);
> +	if (eth_dev == NULL)
> +		return -ENODEV;
> +
> +	eth_dev_close(eth_dev);
> +	rte_free(eth_dev->data);
> +	rte_eth_dev_release_port(eth_dev);
> +
> +	return 0;
> +}
> +
> +static struct rte_vdev_driver ifcvf_drv = {
> +	.probe = rte_ifcvf_probe,
> +	.remove = rte_ifcvf_remove,
> +};
> +
> +RTE_PMD_REGISTER_VDEV(net_ifcvf, ifcvf_drv);
> +RTE_PMD_REGISTER_ALIAS(net_ifcvf, eth_ifcvf);
> +RTE_PMD_REGISTER_PARAM_STRING(net_ifcvf,
> +	"bdf=<bdf> "
> +	"devices=<int>");
> diff --git a/drivers/net/ifcvf/rte_ifcvf_version.map b/drivers/net/ifcvf/rte_ifcvf_version.map
> new file mode 100644
> index 000000000..33d237913
> --- /dev/null
> +++ b/drivers/net/ifcvf/rte_ifcvf_version.map
> @@ -0,0 +1,4 @@
> +EXPERIMENTAL {
> +
> +	local: *;
> +};
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 3eb41d176..be5f765e4 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -171,6 +171,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
>   ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF)          += -lrte_ifcvf
>   endif # $(CONFIG_RTE_LIBRTE_VHOST)
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
>   
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/3] eal/vfio: add support for multiple container
  2018-03-21 20:32       ` Thomas Monjalon
@ 2018-03-21 21:37         ` Gaëtan Rivet
  2018-03-22  3:00           ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Gaëtan Rivet @ 2018-03-21 21:37 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Xiao Wang, junjie.j.chen, dev, maxime.coquelin, yliu,
	zhihong.wang, tiwei.bie, rosen.xu, dan.daly, cunming.liang,
	anatoly.burakov

On Wed, Mar 21, 2018 at 09:32:18PM +0100, Thomas Monjalon wrote:
> Hi,
> 
> 21/03/2018 14:21, Xiao Wang:
> > +#endif /* VFIO_PRESENT */
> >  #endif /* _RTE_VFIO_H_ */
> 
> Please keep the empty line which was present between endif.
> 
> > +	rte_vfio_create_container;
> > +	rte_vfio_destroy_container;
> > +	rte_vfio_bind_group_no;
> > +	rte_vfio_unbind_group_no;
> > +	rte_vfio_dma_map;
> > +	rte_vfio_dma_unmap;
> > +	rte_vfio_get_group_fd;
> 
> Please keep alphabetical order.
> 
> About the naming, I see "no" and "idx" are used.
> Other APIs in DPDK are using "num" and "id". Any strong opinion?

{bind,unbind}_group is sufficient as a name.
_no is redundant, as it is implied by the parameter type.
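
Roughly, for example (a sketch of the shorter names only; the parameter list
is assumed to stay as in the patch):

	int rte_vfio_bind_group(int container_fd, int iommu_group_num);
	int rte_vfio_unbind_group(int container_fd, int iommu_group_num);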

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/3] bus/pci: expose sysfs parsing API
  2018-03-21 20:44       ` Thomas Monjalon
@ 2018-03-22  2:46         ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-22  2:46 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, maxime.coquelin, yliu, Wang, Zhihong, Bie, Tiwei, Chen,
	Junjie J, Xu, Rosen, Daly, Dan, Liang, Cunming, Burakov, Anatoly,
	gaetan.rivet

Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, March 22, 2018 4:45 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; yliu@fridaylinux.org; Wang,
> Zhihong <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Chen,
> Junjie J <junjie.j.chen@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Daly,
> Dan <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>;
> Burakov, Anatoly <anatoly.burakov@intel.com>; gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v2 2/3] bus/pci: expose sysfs parsing API
> 
> 21/03/2018 14:21, Xiao Wang:
> > Some existing sysfs parsing functions are helpful for the later vDPA
> > driver, this patch make them global and expose them to shared lib.
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > ---
> >  	/* parse driver */
> >  	snprintf(filename, sizeof(filename), "%s/driver", dirname);
> > -	ret = pci_get_kernel_driver_by_path(filename, driver);
> > +	ret = rte_pci_device_kdriver_name(addr, driver);
> 
> I guess the snprintf above becomes useless.

Will remove it.
> 
> > + * @param dri_name
> > + *   Output buffer pointer.
> 
> Parameter name and comment can be improved here:
> "kdrv_name" would be more meaningful.
> As a comment, "Output buffer for kernel driver name"

Thanks for the suggestion. Will improve it.

> 
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> > + *
> > + * Parse the "resource" sysfs file.
> > + *
> > + * @param filename
> > + *   The PCI resource file path.
> > + * @dev
> > + *   Pointer of rte_pci_device object, into which the parse result is recorded.
> > + * @return
> > + *   0 on success, -1 on error, 1 on no driver found.
> > + */
> > +int __rte_experimental
> > +rte_pci_parse_sysfs_resource(const char *filename, struct rte_pci_device
> *dev);
> 
> This is a Linux specific API.
> Maybe remove "sysfs" and replace "filename" by "resource"?

Yes, "sysfs" makes it Linux specific. Will change it.
Thanks for the above comments.

BRs,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/3] eal/vfio: add support for multiple container
  2018-03-21 21:37         ` Gaëtan Rivet
@ 2018-03-22  3:00           ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-22  3:00 UTC (permalink / raw)
  To: Gaëtan Rivet, Thomas Monjalon
  Cc: Chen, Junjie J, dev, maxime.coquelin, yliu, Wang, Zhihong, Bie,
	Tiwei, Xu, Rosen, Daly, Dan, Liang, Cunming, Burakov, Anatoly

Hi Thomas, Rivet,

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Thursday, March 22, 2018 5:38 AM
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Wang, Xiao W <xiao.w.wang@intel.com>; Chen, Junjie J
> <junjie.j.chen@intel.com>; dev@dpdk.org; maxime.coquelin@redhat.com;
> yliu@fridaylinux.org; Wang, Zhihong <zhihong.wang@intel.com>; Bie, Tiwei
> <tiwei.bie@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Daly, Dan
> <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Burakov,
> Anatoly <anatoly.burakov@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/vfio: add support for multiple
> container
> 
> On Wed, Mar 21, 2018 at 09:32:18PM +0100, Thomas Monjalon wrote:
> > Hi,
> >
> > 21/03/2018 14:21, Xiao Wang:
> > > +#endif /* VFIO_PRESENT */
> > >  #endif /* _RTE_VFIO_H_ */
> >
> > Please keep the empty line which was present between endif.

OK.
> >
> > > +	rte_vfio_create_container;
> > > +	rte_vfio_destroy_container;
> > > +	rte_vfio_bind_group_no;
> > > +	rte_vfio_unbind_group_no;
> > > +	rte_vfio_dma_map;
> > > +	rte_vfio_dma_unmap;
> > > +	rte_vfio_get_group_fd;
> >
> > Please keep alphabetical order.

OK. Will do.
> >
> > About the naming, I see "no" and "idx" are used.
> > Other APIs in DPDK are using "num" and "id". Any strong opinion?
> 
> {bind,unbind}_group is sufficient as a name.
> _no is redundant as implicit from the parameter type.

{bind,unbind}_group looks very neat. Will remove "_no".
For the eal_vfio.c internal APIs with the "_idx" postfix, the return value is
an index into an array, so I think "idx" is appropriate there. Before this
patch, we already have the function get_vfio_group_idx.

Thanks for the comments.
-Xiao
> 
> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
  2018-03-21 20:52       ` Thomas Monjalon
  2018-03-21 20:57       ` Maxime Coquelin
@ 2018-03-22  8:51       ` Ferruh Yigit
  2018-03-22 17:23         ` Wang, Xiao W
  2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
  3 siblings, 1 reply; 98+ messages in thread
From: Ferruh Yigit @ 2018-03-22  8:51 UTC (permalink / raw)
  To: Xiao Wang, maxime.coquelin, yliu
  Cc: dev, zhihong.wang, tiwei.bie, junjie.j.chen, rosen.xu, dan.daly,
	cunming.liang, anatoly.burakov, gaetan.rivet

On 3/21/2018 1:21 PM, Xiao Wang wrote:
> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> used as vhost data path accelerator.
> 
> Live migration feature is supported by ifc VF and this driver enables
> it based on vhost lib.
> 
> Because vDPA driver needs to set up MSI-X vector to interrupt the guest,
> only vfio-pci is supported currently.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> ---
> v2:
> - Rebase on Zhihong's vDPA v3 patch set.
> ---
>  config/common_base                      |    6 +
>  config/common_linuxapp                  |    1 +
>  drivers/net/Makefile                    |    1 +
>  drivers/net/ifcvf/Makefile              |   40 +
>  drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
>  drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
>  drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
>  drivers/net/ifcvf/ifcvf_ethdev.c        | 1240 +++++++++++++++++++++++++++++++
>  drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
>  mk/rte.app.mk                           |    1 +

A .ini file is needed to represent the driver features.
It would also be good to add driver documentation and a note in the release
notes to announce the new driver.

>  10 files changed, 1830 insertions(+)
>  create mode 100644 drivers/net/ifcvf/Makefile
>  create mode 100644 drivers/net/ifcvf/base/ifcvf.c
>  create mode 100644 drivers/net/ifcvf/base/ifcvf.h
>  create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
>  create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
>  create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
> 
> diff --git a/config/common_base b/config/common_base
> index ad03cf433..06fce1ebf 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -791,6 +791,12 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
>  #
>  CONFIG_RTE_LIBRTE_PMD_VHOST=n
>  
> +#
> +# Compile IFCVF driver
> +# To compile, CONFIG_RTE_LIBRTE_VHOST should be enabled.
> +#
> +CONFIG_RTE_LIBRTE_IFCVF=n
> +
>  #
>  # Compile the test application
>  #
> diff --git a/config/common_linuxapp b/config/common_linuxapp
> index ff98f2355..358d00468 100644
> --- a/config/common_linuxapp
> +++ b/config/common_linuxapp
> @@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
>  CONFIG_RTE_LIBRTE_VHOST=y
>  CONFIG_RTE_LIBRTE_VHOST_NUMA=y
>  CONFIG_RTE_LIBRTE_PMD_VHOST=y
> +CONFIG_RTE_LIBRTE_IFCVF=y

Current syntax for PMD config options:
Virtual ones: CONFIG_RTE_LIBRTE_PMD_XXX
Physical ones: CONFIG_RTE_LIBRTE_XXX_PMD

The virtual/physical difference was most probably not done intentionally, but
that is what it is right now.

Is "PMD" not added intentionally to the config option?
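
For example, following the existing virtual-driver convention (just a sketch,
the final option name is up to you):

	CONFIG_RTE_LIBRTE_PMD_IFCVF=n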

And what are the config-time dependencies of the driver? I assume VHOST is one
of them, but are there more?

>  CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
>  CONFIG_RTE_LIBRTE_PMD_TAP=y
>  CONFIG_RTE_LIBRTE_AVP_PMD=y
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index e1127326b..496acf2d2 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -53,6 +53,7 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
>  
>  ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
> +DIRS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf

Since this is mainly a vDPA driver, does it make sense to put it under
drivers/net/virtio/vdpa/ifcvf?

When there are more vDPA drivers they can go into drivers/net/virtio/vdpa/*

Combined with the "not registering an ethdev" comment below, the virtual driver
could register itself as vdpa_ifcvf:
RTE_PMD_REGISTER_VDEV(vdpa_ifcvf, ifcvf_drv);

>  endif # $(CONFIG_RTE_LIBRTE_VHOST)
>  
>  ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
> diff --git a/drivers/net/ifcvf/Makefile b/drivers/net/ifcvf/Makefile
> new file mode 100644
> index 000000000..f3670cdf2
> --- /dev/null
> +++ b/drivers/net/ifcvf/Makefile
> @@ -0,0 +1,40 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2018 Intel Corporation
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_ifcvf.a
> +
> +LDLIBS += -lpthread
> +LDLIBS += -lrte_eal -lrte_mempool -lrte_pci
> +LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost
> +LDLIBS += -lrte_bus_vdev -lrte_bus_pci
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
> +CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/linux
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
> +
> +#
> +# Add extra flags for base driver source files to disable warnings in them
> +#
> +BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
> +$(foreach obj, $(BASE_DRIVER_OBJS), $(eval CFLAGS_$(obj)+=$(CFLAGS_BASE_DRIVER)))

It seems CFLAGS_BASE_DRIVER is not defined yet; the above lines can be removed for now.

> +
> +VPATH += $(SRCDIR)/base
> +
> +EXPORT_MAP := rte_ifcvf_version.map
> +
> +LIBABIVER := 1
> +
> +#
> +# all source are stored in SRCS-y
> +#
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += ifcvf_ethdev.c
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += ifcvf.c

Is "RTE_LIBRTE_PMD_VHOST" used intentionally here because of the dependency, or is it a typo?

> +
> +include $(RTE_SDK)/mk/rte.lib.mk
<...>

> +static int
> +eth_dev_ifcvf_create(struct rte_vdev_device *dev,
> +		struct rte_pci_addr *pci_addr, int devices)
> +{
> +	const char *name = rte_vdev_device_name(dev);
> +	struct rte_eth_dev *eth_dev = NULL;
> +	struct ether_addr *eth_addr = NULL;
> +	struct ifcvf_internal *internal = NULL;
> +	struct internal_list *list = NULL;
> +	struct rte_eth_dev_data *data = NULL;
> +	struct rte_pci_addr pf_addr = *pci_addr;
> +	int i;
> +
> +	list = rte_zmalloc_socket(name, sizeof(*list), 0,
> +			dev->device.numa_node);
> +	if (list == NULL)
> +		goto error;
> +
> +	/* reserve an ethdev entry */
> +	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));

Is this eth_dev used at all? It looks like it is only used for its private
data; if so, would it be possible to use something like:

struct ifdev {
	void *private;
	struct rte_device *dev;
}

Allocate memory for private and add this struct to the list; this may save the
ethdev overhead.

But I can see dev_start() and dev_stop() are used; I am not sure if they are
the reason for the ethdev.
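
For illustration, the list entry could then be reduced to something roughly
like this (a sketch only, built around the hypothetical struct ifdev above):

	struct internal_list {
		TAILQ_ENTRY(internal_list) next;
		struct ifdev *ifdev;	/* private data plus rte_device back-pointer */
	};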

> +	if (eth_dev == NULL)
> +		goto error;
> +
> +	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
> +			dev->device.numa_node);
> +	if (eth_addr == NULL)
> +		goto error;
> +
> +	*eth_addr = base_eth_addr;
> +	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
> +
> +	internal = eth_dev->data->dev_private;
> +	internal->dev_name = strdup(name);

Need to free this later and on error paths

> +	if (internal->dev_name == NULL)
> +		goto error;
> +
> +	internal->eng_addr.pci_addr = *pci_addr;
> +	for (i = 0; i < devices; i++) {
> +		pf_addr.domain = pci_addr->domain;
> +		pf_addr.bus = pci_addr->bus;
> +		pf_addr.devid = pci_addr->devid + (i + 1) / 8;
> +		pf_addr.function = pci_addr->function + (i + 1) % 8;
> +		internal->vf_info[i].pdev.addr = pf_addr;
> +		rte_spinlock_init(&internal->vf_info[i].lock);
> +	}
> +	internal->max_devices = devices;

Is it max_devices or the number of devices?

<...>

> +/*
> + * If this vdev is created by user, then ifcvf will be taken by

created by user?

> + * this vdev.
> + */
> +static int
> +ifcvf_take_over(struct rte_pci_addr *pci_addr, int num)
> +{
> +	uint16_t port_id;
> +	int i, ret;
> +	char devname[RTE_DEV_NAME_MAX_LEN];
> +	struct rte_pci_addr vf_addr = *pci_addr;
> +
> +	for (i = 0; i < num; i++) {
> +		vf_addr.function += i % 8;
> +		vf_addr.devid += i / 8;
> +		rte_pci_device_name(&vf_addr, devname, RTE_DEV_NAME_MAX_LEN);
> +		ret = rte_eth_dev_get_port_by_name(devname, &port_id);

Who probed this device in the first place?

> +		if (ret == 0) {
> +			rte_eth_dev_close(port_id);
> +			if (rte_eth_dev_detach(port_id, devname) < 0)

This will call the driver remove() and will also remove the device from the
device list, is that OK?

> +				return -1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +rte_ifcvf_probe(struct rte_vdev_device *dev)
> +{
> +	struct rte_kvargs *kvlist = NULL;
> +	int ret = 0;
> +	struct rte_pci_addr pci_addr;
> +	int devices;

"devices" can't be negative, and according to open_int() it is uint16_t, so it
is possible to pick an unsigned storage type for it.

<...>

> +static int
> +rte_ifcvf_remove(struct rte_vdev_device *dev)
> +{
> +	const char *name;
> +	struct rte_eth_dev *eth_dev = NULL;
> +
> +	name = rte_vdev_device_name(dev);
> +	RTE_LOG(INFO, PMD, "Un-Initializing ifcvf for %s\n", name);
> +
> +	/* find an ethdev entry */
> +	eth_dev = rte_eth_dev_allocated(name);
> +	if (eth_dev == NULL)
> +		return -ENODEV;
> +
> +	eth_dev_close(eth_dev);
> +	rte_free(eth_dev->data);
> +	rte_eth_dev_release_port(eth_dev);

This does memset(eth_dev->data, ...), so it should be called before rte_free(eth_dev->data).

> +
> +	return 0;
> +}
> +
> +static struct rte_vdev_driver ifcvf_drv = {
> +	.probe = rte_ifcvf_probe,
> +	.remove = rte_ifcvf_remove,
> +};
> +
> +RTE_PMD_REGISTER_VDEV(net_ifcvf, ifcvf_drv);
> +RTE_PMD_REGISTER_ALIAS(net_ifcvf, eth_ifcvf);

Alias for backport support, not needed for new drivers.

> +RTE_PMD_REGISTER_PARAM_STRING(net_ifcvf,
> +	"bdf=<bdf> "
> +	"devices=<int>");

Above says:
  #define ETH_IFCVF_DEVICES_ARG	"int"

Is argument "int" or "devices"? Using macro here helps preventing errors.

> diff --git a/drivers/net/ifcvf/rte_ifcvf_version.map b/drivers/net/ifcvf/rte_ifcvf_version.map
> new file mode 100644
> index 000000000..33d237913
> --- /dev/null
> +++ b/drivers/net/ifcvf/rte_ifcvf_version.map
> @@ -0,0 +1,4 @@
> +EXPERIMENTAL {

Please put the release version here.

<...>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-22  8:51       ` Ferruh Yigit
@ 2018-03-22 17:23         ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-22 17:23 UTC (permalink / raw)
  To: Yigit, Ferruh, maxime.coquelin, yliu
  Cc: dev, Wang, Zhihong, Bie, Tiwei, Chen, Junjie J, Xu, Rosen, Daly,
	Dan, Liang, Cunming, Burakov, Anatoly, gaetan.rivet

Hi Ferruh,

> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Thursday, March 22, 2018 4:51 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; maxime.coquelin@redhat.com;
> yliu@fridaylinux.org
> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Bie, Tiwei
> <tiwei.bie@intel.com>; Chen, Junjie J <junjie.j.chen@intel.com>; Xu, Rosen
> <rosen.xu@intel.com>; Daly, Dan <dan.daly@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>; Burakov, Anatoly <anatoly.burakov@intel.com>;
> gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v2 3/3] net/ifcvf: add ifcvf driver
> 
> On 3/21/2018 1:21 PM, Xiao Wang wrote:
> > ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> > to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> > used as vhost data path accelerator.
> >
> > Live migration feature is supported by ifc VF and this driver enables
> > it based on vhost lib.
> >
> > Because vDPA driver needs to set up MSI-X vector to interrupt the guest,
> > only vfio-pci is supported currently.
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> > ---
> > v2:
> > - Rebase on Zhihong's vDPA v3 patch set.
> > ---
> >  config/common_base                      |    6 +
> >  config/common_linuxapp                  |    1 +
> >  drivers/net/Makefile                    |    1 +
> >  drivers/net/ifcvf/Makefile              |   40 +
> >  drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
> >  drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
> >  drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
> >  drivers/net/ifcvf/ifcvf_ethdev.c        | 1240
> +++++++++++++++++++++++++++++++
> >  drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
> >  mk/rte.app.mk                           |    1 +
> 
> Need .ini file to represent driver features.
> Also it is good to add driver documentation and a note into release note to
> announce new driver.

Will do.

> 
> >  10 files changed, 1830 insertions(+)
> >  create mode 100644 drivers/net/ifcvf/Makefile
> >  create mode 100644 drivers/net/ifcvf/base/ifcvf.c
> >  create mode 100644 drivers/net/ifcvf/base/ifcvf.h
> >  create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
> >  create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
> >  create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
> >
> > diff --git a/config/common_base b/config/common_base
> > index ad03cf433..06fce1ebf 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -791,6 +791,12 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
> >  #
> >  CONFIG_RTE_LIBRTE_PMD_VHOST=n
> >
> > +#
> > +# Compile IFCVF driver
> > +# To compile, CONFIG_RTE_LIBRTE_VHOST should be enabled.
> > +#
> > +CONFIG_RTE_LIBRTE_IFCVF=n
> > +
> >  #
> >  # Compile the test application
> >  #
> > diff --git a/config/common_linuxapp b/config/common_linuxapp
> > index ff98f2355..358d00468 100644
> > --- a/config/common_linuxapp
> > +++ b/config/common_linuxapp
> > @@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
> >  CONFIG_RTE_LIBRTE_VHOST=y
> >  CONFIG_RTE_LIBRTE_VHOST_NUMA=y
> >  CONFIG_RTE_LIBRTE_PMD_VHOST=y
> > +CONFIG_RTE_LIBRTE_IFCVF=y
> 
> Current syntax for PMD config options:
> Virtual ones: CONFIG_RTE_LIBRTE_PMD_XXX
> Physical ones: CONFIG_RTE_LIBRTE_XXX_PMD
> 
> Virtual / Physical difference most probably not done intentionally but that is
> what it is right now.
> 
> Is "PMD" not added intentionally to the config option?

I think the vDPA driver is not a polling-mode driver, so I didn't put "PMD" here. Do you think CONFIG_RTE_LIBRTE_VDPA_IFCVF is better?

> 
> And what is the config time dependency of the driver, I assume VHOST is one
> of
> them but are there more?

This dependency is described in drivers/net/Makefile; CONFIG_RTE_EAL_VFIO is another one, will add it.

> 
> >  CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
> >  CONFIG_RTE_LIBRTE_PMD_TAP=y
> >  CONFIG_RTE_LIBRTE_AVP_PMD=y
> > diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> > index e1127326b..496acf2d2 100644
> > --- a/drivers/net/Makefile
> > +++ b/drivers/net/Makefile
> > @@ -53,6 +53,7 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
> >
> >  ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
> >  DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
> > +DIRS-$(CONFIG_RTE_LIBRTE_IFCVF) += ifcvf
> 
> Since this is mainly vpda driver, does it make sense to put it under
> drivers/net/virtio/vpda/ifcvf
> 
> When there are more vpda driver they can go into drivers/net/virtio/vpda/*

vDPA is for vhost offloading/acceleration; the device can be quite different from virtio,
it just needs to be virtio ring compatible, and the usage model is quite different from the virtio PMD.
I think the vDPA driver should not go into the drivers/net/virtio dir.

> 
> Combining with below not registering ethdev comment, virtual driver can
> register
> itself as vpda_ifcvf:
> RTE_PMD_REGISTER_VDEV(vpda_ifcvf, ifcvf_drv);

Yes, only a very limited set of ethdev APIs can be implemented for this ethdev; I'll try to remove the ethdev registering.

> 
> >  endif # $(CONFIG_RTE_LIBRTE_VHOST)
> >
> >  ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
> > diff --git a/drivers/net/ifcvf/Makefile b/drivers/net/ifcvf/Makefile
> > new file mode 100644
> > index 000000000..f3670cdf2
> > --- /dev/null
> > +++ b/drivers/net/ifcvf/Makefile
> > @@ -0,0 +1,40 @@
> > +# SPDX-License-Identifier: BSD-3-Clause
> > +# Copyright(c) 2018 Intel Corporation
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_ifcvf.a
> > +
> > +LDLIBS += -lpthread
> > +LDLIBS += -lrte_eal -lrte_mempool -lrte_pci
> > +LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost
> > +LDLIBS += -lrte_bus_vdev -lrte_bus_pci
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
> > +CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/linux
> > +CFLAGS += -DALLOW_EXPERIMENTAL_API
> > +
> > +#
> > +# Add extra flags for base driver source files to disable warnings in them
> > +#
> > +BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard
> $(SRCDIR)/base/*.c))))
> > +$(foreach obj, $(BASE_DRIVER_OBJS), $(eval
> CFLAGS_$(obj)+=$(CFLAGS_BASE_DRIVER)))
> 
> It seems no CFLAGS_BASE_DRIVER defined yet, above lines can be removed for
> now.

Will remove it.

> 
> > +
> > +VPATH += $(SRCDIR)/base
> > +
> > +EXPORT_MAP := rte_ifcvf_version.map
> > +
> > +LIBABIVER := 1
> > +
> > +#
> > +# all source are stored in SRCS-y
> > +#
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += ifcvf_ethdev.c
> > +SRCS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += ifcvf.c
> 
> Is it intentionally used "RTE_LIBRTE_PMD_VHOST" because of dependency or
> typo?

Sorry for the typo.

> 
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> <...>
> 
> > +static int
> > +eth_dev_ifcvf_create(struct rte_vdev_device *dev,
> > +		struct rte_pci_addr *pci_addr, int devices)
> > +{
> > +	const char *name = rte_vdev_device_name(dev);
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +	struct ether_addr *eth_addr = NULL;
> > +	struct ifcvf_internal *internal = NULL;
> > +	struct internal_list *list = NULL;
> > +	struct rte_eth_dev_data *data = NULL;
> > +	struct rte_pci_addr pf_addr = *pci_addr;
> > +	int i;
> > +
> > +	list = rte_zmalloc_socket(name, sizeof(*list), 0,
> > +			dev->device.numa_node);
> > +	if (list == NULL)
> > +		goto error;
> > +
> > +	/* reserve an ethdev entry */
> > +	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
> 
> Is this eth_dev used at all? It looks like it is only used for its private data,
> if so can it be possible to use something like:
> 
> struct ifdev {
> 	void *private;
> 	struct rte_device *dev;
> }
> 
> allocate memory for private and add this struct to the list, this may save
> ethdev overhead.
> 
> But I can see dev_start() and dev_stop() are used, not sure if they are the
> reason of the ethdev.

Registering an ethdev allows dev_start()/dev_stop(), but it seems this overhead doesn't bring much benefit.
Your suggestion looks good.
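
A possible shape for the list entry once the ethdev registration is dropped
could be something like this (a sketch only, field names illustrative):

struct internal_list {
	TAILQ_ENTRY(internal_list) next;
	struct ifcvf_internal *internal;
	struct rte_device *dev;
};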

> 
> > +	if (eth_dev == NULL)
> > +		goto error;
> > +
> > +	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
> > +			dev->device.numa_node);
> > +	if (eth_addr == NULL)
> > +		goto error;
> > +
> > +	*eth_addr = base_eth_addr;
> > +	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
> > +
> > +	internal = eth_dev->data->dev_private;
> > +	internal->dev_name = strdup(name);
> 
> Need to free this later and on error paths

The error path has it:
        if (internal && internal->dev_name)
                free(internal->dev_name);

> 
> > +	if (internal->dev_name == NULL)
> > +		goto error;
> > +
> > +	internal->eng_addr.pci_addr = *pci_addr;
> > +	for (i = 0; i < devices; i++) {
> > +		pf_addr.domain = pci_addr->domain;
> > +		pf_addr.bus = pci_addr->bus;
> > +		pf_addr.devid = pci_addr->devid + (i + 1) / 8;
> > +		pf_addr.function = pci_addr->function + (i + 1) % 8;
> > +		internal->vf_info[i].pdev.addr = pf_addr;
> > +		rte_spinlock_init(&internal->vf_info[i].lock);
> > +	}
> > +	internal->max_devices = devices;
> 
> is it max_devices or number of devices?

It's a field describing how many devices are contained in this vDPA engine. The value is min(user argument, IFCVF MAX VFs).
Renaming it to dev_num looks better.

> 
> <...>
> 
> > +/*
> > + * If this vdev is created by user, then ifcvf will be taken by
> 
> created by user?

I mean when the app creates this vdev, we can assume the app wants ifcvf to be used as a vDPA device.
ifcvf has virtio's vendor ID and device ID, but it has its own specific subsystem vendor ID and device ID.
So the virtio PMD can take ifcvf first; then the app stops the virtio port and creates the ifcvf vdev to drive ifcvf.
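
For example (the device address below is only illustrative), the app could then
create the control vdev with an EAL argument like
--vdev 'net_ifcvf,bdf=0000:06:00.0,devices=2'.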

> 
> > + * this vdev.
> > + */
> > +static int
> > +ifcvf_take_over(struct rte_pci_addr *pci_addr, int num)
> > +{
> > +	uint16_t port_id;
> > +	int i, ret;
> > +	char devname[RTE_DEV_NAME_MAX_LEN];
> > +	struct rte_pci_addr vf_addr = *pci_addr;
> > +
> > +	for (i = 0; i < num; i++) {
> > +		vf_addr.function += i % 8;
> > +		vf_addr.devid += i / 8;
> > +		rte_pci_device_name(&vf_addr, devname,
> RTE_DEV_NAME_MAX_LEN);
> > +		ret = rte_eth_dev_get_port_by_name(devname, &port_id);
> 
> Who probed this device at first place?

If no whitelist is specified, the virtio PMD will probe it first.

> 
> > +		if (ret == 0) {
> > +			rte_eth_dev_close(port_id);
> > +			if (rte_eth_dev_detach(port_id, devname) < 0)
> 
> This will call the driver remov() also will remove device from device list, is
> it OK?

Or we can just call rte_eth_dev_release_port() to keep the device in the device list.
That will be better.
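
Roughly, keeping the rest of the loop unchanged (a sketch only):

		if (ret == 0) {
			rte_eth_dev_close(port_id);
			rte_eth_dev_release_port(&rte_eth_devices[port_id]);
		}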

> 
> > +				return -1;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +rte_ifcvf_probe(struct rte_vdev_device *dev)
> > +{
> > +	struct rte_kvargs *kvlist = NULL;
> > +	int ret = 0;
> > +	struct rte_pci_addr pci_addr;
> > +	int devices;
> 
> devices can't be negative, and according open_int() it is uint16_t, it is
> possible to pick an unsigned storage type for it.

Will use unsigned type.

> 
> <...>
> 
> > +static int
> > +rte_ifcvf_remove(struct rte_vdev_device *dev)
> > +{
> > +	const char *name;
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +
> > +	name = rte_vdev_device_name(dev);
> > +	RTE_LOG(INFO, PMD, "Un-Initializing ifcvf for %s\n", name);
> > +
> > +	/* find an ethdev entry */
> > +	eth_dev = rte_eth_dev_allocated(name);
> > +	if (eth_dev == NULL)
> > +		return -ENODEV;
> > +
> > +	eth_dev_close(eth_dev);
> > +	rte_free(eth_dev->data);
> > +	rte_eth_dev_release_port(eth_dev);
> 
> This does memset(ethdev->data, ..), so should be called before rte_free(data)

Agree, will change it.

> 
> > +
> > +	return 0;
> > +}
> > +
> > +static struct rte_vdev_driver ifcvf_drv = {
> > +	.probe = rte_ifcvf_probe,
> > +	.remove = rte_ifcvf_remove,
> > +};
> > +
> > +RTE_PMD_REGISTER_VDEV(net_ifcvf, ifcvf_drv);
> > +RTE_PMD_REGISTER_ALIAS(net_ifcvf, eth_ifcvf);
> 
> Alias for backport support, not needed for new drivers.

OK, will remove it.

> 
> > +RTE_PMD_REGISTER_PARAM_STRING(net_ifcvf,
> > +	"bdf=<bdf> "
> > +	"devices=<int>");
> 
> Above says:
>   #define ETH_IFCVF_DEVICES_ARG	"int"
> 
> Is argument "int" or "devices"? Using macro here helps preventing errors.

It's "devices", will fix it with using macro.

> 
> > diff --git a/drivers/net/ifcvf/rte_ifcvf_version.map
> b/drivers/net/ifcvf/rte_ifcvf_version.map
> > new file mode 100644
> > index 000000000..33d237913
> > --- /dev/null
> > +++ b/drivers/net/ifcvf/rte_ifcvf_version.map
> > @@ -0,0 +1,4 @@
> > +EXPERIMENTAL {
> 
> Please put release version here.

OK, will put "DPDK_18.05"

Thanks for the comments,
-Xiao

> 
> <...>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-21 20:47     ` Maxime Coquelin
@ 2018-03-23 10:27       ` Wang, Xiao W
  2018-03-25  9:51         ` Maxime Coquelin
  0 siblings, 1 reply; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-23 10:27 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Thursday, March 22, 2018 4:48 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [PATCH 0/3] add ifcvf driver
> 
> Hi Xiao,
> 
> On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Sunday, March 11, 2018 2:24 AM
> >> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> >> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
> Chen,
> >> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> >> Subject: Re: [PATCH 0/3] add ifcvf driver
> >>
> >> Hi Xiao,
> >>
> >> On 03/10/2018 12:08 AM, Xiao Wang wrote:
> >>> This patch set has dependency on
> >> http://dpdk.org/dev/patchwork/patch/35635/
> >>> (vhost: support selective datapath);
> >>>
> >>> ifc VF is compatible with virtio vring operations, this driver implements
> >>> vDPA driver ops which configures ifc VF to be a vhost data path accelerator.
> >>>
> >>> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> >>> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> >>> used as vhost data path accelerator.
> >>>
> >>> Live migration feature is supported by ifc VF and this driver enables
> >>> it based on vhost lib.
> >>>
> >>> vDPA needs to create different containers for different devices, thus this
> >>> patch set adds APIs in eal/vfio to support multiple container.
> >> Thanks for this! That will avoind having to duplicate these functions
> >> for every new offload driver.
> >>
> >>
> >>>
> >>> Junjie Chen (1):
> >>>     eal/vfio: add support for multiple container
> >>>
> >>> Xiao Wang (2):
> >>>     bus/pci: expose sysfs parsing API
> >>
> >> Still, I'm not convinced the offload device should be a virtual device.
> >> It is a real PCI device, why not having a new device type for offload
> >> devices, and let the device to be probed automatically by the existing
> >> device model?
> >
> > IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
> > In DPDK we need to have something to represent PF, to register itself as
> > a vDPA engine, so a virtual device is used for this purpose.
> I went through the code, and something is not clear to me.
> 
> Why do we need to have a representation of the PF in DPDK?
> Why cannot we just bind at VF level?

1. With the vdev representation we could use it to talk to the PF kernel driver to do flow configuration; we can implement
the flow API on the vdev in the future for this purpose. Using a vdev allows introducing this kind of control plane functionality.

2. When the port representor is ready, we will integrate it into the ifcvf driver; then each VF will have a
representor port. For now we don't have the port representor, so this patch set manages the VF resources internally.

BRs,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-21 20:57       ` Maxime Coquelin
@ 2018-03-23 10:37         ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-23 10:37 UTC (permalink / raw)
  To: Maxime Coquelin, yliu
  Cc: dev, Wang, Zhihong, Bie, Tiwei, Chen, Junjie J, Xu, Rosen, Daly,
	Dan, Liang, Cunming, Burakov, Anatoly, gaetan.rivet

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Thursday, March 22, 2018 4:58 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; yliu@fridaylinux.org
> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Bie, Tiwei
> <tiwei.bie@intel.com>; Chen, Junjie J <junjie.j.chen@intel.com>; Xu, Rosen
> <rosen.xu@intel.com>; Daly, Dan <dan.daly@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>; Burakov, Anatoly <anatoly.burakov@intel.com>;
> gaetan.rivet@6wind.com
> Subject: Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
> 
> 
> 
> On 03/21/2018 02:21 PM, Xiao Wang wrote:
> > ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> > to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> > used as vhost data path accelerator.
> >
> > Live migration feature is supported by ifc VF and this driver enables
> > it based on vhost lib.
> >
> > Because vDPA driver needs to set up MSI-X vector to interrupt the guest,
> > only vfio-pci is supported currently.
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> > ---
> > v2:
> > - Rebase on Zhihong's vDPA v3 patch set.
> > ---
> >   config/common_base                      |    6 +
> >   config/common_linuxapp                  |    1 +
> >   drivers/net/Makefile                    |    1 +
> >   drivers/net/ifcvf/Makefile              |   40 +
> >   drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
> >   drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
> >   drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
> >   drivers/net/ifcvf/ifcvf_ethdev.c        | 1240
> +++++++++++++++++++++++++++++++
> >   drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
> >   mk/rte.app.mk                           |    1 +
> >   10 files changed, 1830 insertions(+)
> >   create mode 100644 drivers/net/ifcvf/Makefile
> >   create mode 100644 drivers/net/ifcvf/base/ifcvf.c
> >   create mode 100644 drivers/net/ifcvf/base/ifcvf.h
> >   create mode 100644 drivers/net/ifcvf/base/ifcvf_osdep.h
> >   create mode 100644 drivers/net/ifcvf/ifcvf_ethdev.c
> >   create mode 100644 drivers/net/ifcvf/rte_ifcvf_version.map
> >
> 
> ...
> 
> > +static int
> > +eth_dev_ifcvf_create(struct rte_vdev_device *dev,
> > +		struct rte_pci_addr *pci_addr, int devices)
> > +{
> > +	const char *name = rte_vdev_device_name(dev);
> > +	struct rte_eth_dev *eth_dev = NULL;
> > +	struct ether_addr *eth_addr = NULL;
> > +	struct ifcvf_internal *internal = NULL;
> > +	struct internal_list *list = NULL;
> > +	struct rte_eth_dev_data *data = NULL;
> > +	struct rte_pci_addr pf_addr = *pci_addr;
> > +	int i;
> > +
> > +	list = rte_zmalloc_socket(name, sizeof(*list), 0,
> > +			dev->device.numa_node);
> > +	if (list == NULL)
> > +		goto error;
> > +
> > +	/* reserve an ethdev entry */
> > +	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
> > +	if (eth_dev == NULL)
> > +		goto error;
> > +
> > +	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
> > +			dev->device.numa_node);
> > +	if (eth_addr == NULL)
> > +		goto error;
> > +
> > +	*eth_addr = base_eth_addr;
> > +	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
> > +
> > +	internal = eth_dev->data->dev_private;
> > +	internal->dev_name = strdup(name);
> > +	if (internal->dev_name == NULL)
> > +		goto error;
> > +
> > +	internal->eng_addr.pci_addr = *pci_addr;
> > +	for (i = 0; i < devices; i++) {
> > +		pf_addr.domain = pci_addr->domain;
> > +		pf_addr.bus = pci_addr->bus;
> > +		pf_addr.devid = pci_addr->devid + (i + 1) / 8;
> > +		pf_addr.function = pci_addr->function + (i + 1) % 8;
> > +		internal->vf_info[i].pdev.addr = pf_addr;
> > +		rte_spinlock_init(&internal->vf_info[i].lock);
> > +	}
> > +	internal->max_devices = devices;
> > +
> > +	list->eth_dev = eth_dev;
> > +	pthread_mutex_lock(&internal_list_lock);
> > +	TAILQ_INSERT_TAIL(&internal_list, list, next);
> > +	pthread_mutex_unlock(&internal_list_lock);
> > +
> > +	data = eth_dev->data;
> > +	data->nb_rx_queues = IFCVF_MAX_QUEUES;
> > +	data->nb_tx_queues = IFCVF_MAX_QUEUES;
> > +	data->dev_link = vdpa_link;
> > +	data->mac_addrs = eth_addr;
> 
> We might want one ethernet device per VF, as for example you set
> dev_link.link_status to UP as soon as a VF is configured, and DOWN
> as when a single VF is removed.

Ideally it will be one representor port per VF, with each representor port
having its own link_status. Will integrate the port representor when it's ready.
I will remove the vdev's ethdev registering for now, and add it back when we
need to implement flow APIs on the vdev.

> 
> > +	data->dev_flags = RTE_ETH_DEV_INTR_LSC;
> > +	eth_dev->dev_ops = &ops;
> > +
> > +	/* assign rx and tx ops, could be used as vDPA fallback */
> > +	eth_dev->rx_pkt_burst = eth_ifcvf_rx;
> > +	eth_dev->tx_pkt_burst = eth_ifcvf_tx;
> > +
> > +	if (rte_vdpa_register_engine(vdpa_ifcvf_driver.name,
> > +				&internal->eng_addr) < 0)
> > +		goto error;
> > +
> > +	return 0;
> > +
> > +error:
> > +	rte_free(list);
> > +	rte_free(eth_addr);
> > +	if (internal && internal->dev_name)
> > +		free(internal->dev_name);
> > +	rte_free(internal);
> > +	if (eth_dev)
> > +		rte_eth_dev_release_port(eth_dev);
> > +
> > +	return -1;
> > +}
> > +
> > +static int
> > +get_pci_addr(const char *key __rte_unused, const char *value, void
> *extra_args)
> > +{
> > +	if (value == NULL || extra_args == NULL)
> > +		return -1;
> > +
> > +	return rte_pci_addr_parse(value, extra_args);
> > +}
> > +
> > +static inline int
> > +open_int(const char *key __rte_unused, const char *value, void
> *extra_args)
> > +{
> > +	uint16_t *n = extra_args;
> > +
> > +	if (value == NULL || extra_args == NULL)
> > +		return -EINVAL;
> > +
> > +	*n = (uint16_t)strtoul(value, NULL, 0);
> > +	if (*n == USHRT_MAX && errno == ERANGE)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * If this vdev is created by user, then ifcvf will be taken by
> > + * this vdev.
> > + */
> > +static int
> > +ifcvf_take_over(struct rte_pci_addr *pci_addr, int num)
> > +{
> > +	uint16_t port_id;
> > +	int i, ret;
> > +	char devname[RTE_DEV_NAME_MAX_LEN];
> > +	struct rte_pci_addr vf_addr = *pci_addr;
> > +
> > +	for (i = 0; i < num; i++) {
> > +		vf_addr.function += i % 8;
> > +		vf_addr.devid += i / 8;
> > +		rte_pci_device_name(&vf_addr, devname,
> RTE_DEV_NAME_MAX_LEN);
> > +		ret = rte_eth_dev_get_port_by_name(devname, &port_id);
> > +		if (ret == 0) {
> > +			rte_eth_dev_close(port_id);
> > +			if (rte_eth_dev_detach(port_id, devname) < 0)
> > +				return -1;
> > +		}
> That seems a bit hard.
> Shouldn't we at least check the port is not started?

That looks better, will do it.

Thanks for the comments,
-Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/3] net/ifcvf: add ifcvf driver
  2018-03-21 20:52       ` Thomas Monjalon
@ 2018-03-23 10:39         ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-23 10:39 UTC (permalink / raw)
  To: Thomas Monjalon, Xu, Rosen
  Cc: dev, maxime.coquelin, yliu, Wang, Zhihong, Bie, Tiwei, Chen,
	Junjie J, Daly, Dan, Liang, Cunming, Burakov, Anatoly,
	gaetan.rivet

Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, March 22, 2018 4:52 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Xu, Rosen <rosen.xu@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; yliu@fridaylinux.org; Wang,
> Zhihong <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Chen,
> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>; Liang,
> Cunming <cunming.liang@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>; gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v2 3/3] net/ifcvf: add ifcvf driver
> 
> 21/03/2018 14:21, Xiao Wang:
> > ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> > to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> > used as vhost data path accelerator.
> 
> Not everybody work at Intel.
> Please explain what means ifcvf and what is a control domain.

OK, and I will add a document.
> 
> > Live migration feature is supported by ifc VF and this driver enables
> > it based on vhost lib.
> >
> > Because vDPA driver needs to set up MSI-X vector to interrupt the guest,
> > only vfio-pci is supported currently.
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> > ---
> > v2:
> > - Rebase on Zhihong's vDPA v3 patch set.
> > ---
> >  config/common_base                      |    6 +
> >  config/common_linuxapp                  |    1 +
> >  drivers/net/Makefile                    |    1 +
> >  drivers/net/ifcvf/Makefile              |   40 +
> >  drivers/net/ifcvf/base/ifcvf.c          |  329 ++++++++
> >  drivers/net/ifcvf/base/ifcvf.h          |  156 ++++
> >  drivers/net/ifcvf/base/ifcvf_osdep.h    |   52 ++
> >  drivers/net/ifcvf/ifcvf_ethdev.c        | 1240
> +++++++++++++++++++++++++++++++
> >  drivers/net/ifcvf/rte_ifcvf_version.map |    4 +
> >  mk/rte.app.mk                           |    1 +
> 
> This feature needs to be explained and documented.
> It will be helpful to understand the mechanism and to have a good review.
> Please do not merge it until there is a good documentation.
> 

Will add a doc with more details.

BRs,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-23 10:27       ` Wang, Xiao W
@ 2018-03-25  9:51         ` Maxime Coquelin
  2018-03-26  9:05           ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-25  9:51 UTC (permalink / raw)
  To: Wang, Xiao W, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan



On 03/23/2018 11:27 AM, Wang, Xiao W wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Thursday, March 22, 2018 4:48 AM
>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>
>> Hi Xiao,
>>
>> On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
>>> Hi Maxime,
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Sunday, March 11, 2018 2:24 AM
>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
>> Chen,
>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>>>
>>>> Hi Xiao,
>>>>
>>>> On 03/10/2018 12:08 AM, Xiao Wang wrote:
>>>>> This patch set has dependency on
>>>> http://dpdk.org/dev/patchwork/patch/35635/
>>>>> (vhost: support selective datapath);
>>>>>
>>>>> ifc VF is compatible with virtio vring operations, this driver implements
>>>>> vDPA driver ops which configures ifc VF to be a vhost data path accelerator.
>>>>>
>>>>> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
>>>>> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
>>>>> used as vhost data path accelerator.
>>>>>
>>>>> Live migration feature is supported by ifc VF and this driver enables
>>>>> it based on vhost lib.
>>>>>
>>>>> vDPA needs to create different containers for different devices, thus this
>>>>> patch set adds APIs in eal/vfio to support multiple container.
>>>> Thanks for this! That will avoind having to duplicate these functions
>>>> for every new offload driver.
>>>>
>>>>
>>>>>
>>>>> Junjie Chen (1):
>>>>>      eal/vfio: add support for multiple container
>>>>>
>>>>> Xiao Wang (2):
>>>>>      bus/pci: expose sysfs parsing API
>>>>
>>>> Still, I'm not convinced the offload device should be a virtual device.
>>>> It is a real PCI device, why not having a new device type for offload
>>>> devices, and let the device to be probed automatically by the existing
>>>> device model?
>>>
>>> IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
>>> In DPDK we need to have something to represent PF, to register itself as
>>> a vDPA engine, so a virtual device is used for this purpose.
>> I went through the code, and something is not clear to me.
>>
>> Why do we need to have a representation of the PF in DPDK?
>> Why cannot we just bind at VF level?
> 
> 1. With the vdev representation we could use it to talk to PF kernel driver to do flow configuration, we can implement
> flow API on the vdev in future for this purpose. Using a vdev allows introducing this kind of control plane thing.
> 
> 2. When port representor is ready, we would integrate it into ifcvf driver, then each VF will have a
> Representor port. For now we don’t have port representor, so this patch set manages VF resource internally.

Ok, we may need to have a vdev to represent the PF, but we need to be
able to bind at VF level anyway.

Else, how do you support passing two VFs of the same PF to different
DPDK applications?
Or having some VFs managed by the kernel or QEMU and some by the DPDK
application? My feeling is that the current implementation is creating an
artificial constraint.

Isn't there a possibility to have the virtual representation for the PF
probed separately? Or created automatically when the first VF of a
PF is probed (with later VFs attaching to the PF rep when probed)?

Doing this, we could use the generic device probing.
For IFCVF, to specify that we want it to be probed as an offload device
instead of a virtio device, we could have a new EAL parameter that marks
a given device as one to be probed in offload mode (for example
--offload=00:01.1).
Offload drivers would register by passing a flag that specifies they are
an offload driver.
This new argument would be optional and only used to force the device to
be probed in offload mode. For devices that have their own device
IDs for offload mode, it would be automatic.

Regards,
Maxime
> BRs,
> Xiao
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-25  9:51         ` Maxime Coquelin
@ 2018-03-26  9:05           ` Wang, Xiao W
  2018-03-26 13:29             ` Maxime Coquelin
  0 siblings, 1 reply; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-26  9:05 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Sunday, March 25, 2018 5:51 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [PATCH 0/3] add ifcvf driver
> 
> 
> 
> On 03/23/2018 11:27 AM, Wang, Xiao W wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Thursday, March 22, 2018 4:48 AM
> >> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> >> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
> Chen,
> >> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> >> Subject: Re: [PATCH 0/3] add ifcvf driver
> >>
> >> Hi Xiao,
> >>
> >> On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
> >>> Hi Maxime,
> >>>
> >>>> -----Original Message-----
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Sunday, March 11, 2018 2:24 AM
> >>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org;
> Liang,
> >>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
> >> Chen,
> >>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> >>>> Subject: Re: [PATCH 0/3] add ifcvf driver
> >>>>
> >>>> Hi Xiao,
> >>>>
> >>>> On 03/10/2018 12:08 AM, Xiao Wang wrote:
> >>>>> This patch set has dependency on
> >>>> http://dpdk.org/dev/patchwork/patch/35635/
> >>>>> (vhost: support selective datapath);
> >>>>>
> >>>>> ifc VF is compatible with virtio vring operations, this driver implements
> >>>>> vDPA driver ops which configures ifc VF to be a vhost data path
> accelerator.
> >>>>>
> >>>>> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
> >>>>> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
> >>>>> used as vhost data path accelerator.
> >>>>>
> >>>>> Live migration feature is supported by ifc VF and this driver enables
> >>>>> it based on vhost lib.
> >>>>>
> >>>>> vDPA needs to create different containers for different devices, thus this
> >>>>> patch set adds APIs in eal/vfio to support multiple container.
> >>>> Thanks for this! That will avoind having to duplicate these functions
> >>>> for every new offload driver.
> >>>>
> >>>>
> >>>>>
> >>>>> Junjie Chen (1):
> >>>>>      eal/vfio: add support for multiple container
> >>>>>
> >>>>> Xiao Wang (2):
> >>>>>      bus/pci: expose sysfs parsing API
> >>>>
> >>>> Still, I'm not convinced the offload device should be a virtual device.
> >>>> It is a real PCI device, why not having a new device type for offload
> >>>> devices, and let the device to be probed automatically by the existing
> >>>> device model?
> >>>
> >>> IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
> >>> In DPDK we need to have something to represent PF, to register itself as
> >>> a vDPA engine, so a virtual device is used for this purpose.
> >> I went through the code, and something is not clear to me.
> >>
> >> Why do we need to have a representation of the PF in DPDK?
> >> Why cannot we just bind at VF level?
> >
> > 1. With the vdev representation we could use it to talk to PF kernel driver to
> do flow configuration, we can implement
> > flow API on the vdev in future for this purpose. Using a vdev allows
> introducing this kind of control plane thing.
> >
> > 2. When port representor is ready, we would integrate it into ifcvf driver,
> then each VF will have a
> > Representor port. For now we don’t have port representor, so this patch set
> manages VF resource internally.
> 
> Ok, we may need to have a vdev to represent the PF, but we need to be
> able to bind at VF level anyway.

Device management at the VF level is feasible, according to the previous port-representor patch.
A tuple of (PF_addr, VF_index) can identify a certain VF; we have vport_mask
and the device addr to describe a PF, and we can specify a VF index to create a representor port,
so the VF port creation will be on-demand at the VF level.

+struct port_rep_parameters {
+	uint64_t vport_mask;
+	struct {
+		char bus[RTE_DEV_NAME_MAX_LEN];
+		char device[RTE_DEV_NAME_MAX_LEN];
+	} parent;
+};

+int
+rte_representor_port_register(char *pf_addr_str,
+		uint32_t vport_id, uint16_t *port_id)
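
A hypothetical usage following the above prototype (the PF address and VF
index are only examples):

	uint16_t repr_port_id;

	/* create a representor port for VF 1 of PF 0000:06:00.0 */
	rte_representor_port_register("0000:06:00.0", 1, &repr_port_id);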

Besides, IFCVF supports live migration, and vDPA exercises the IFCVF device better than QEMU does (this patch has enabled the LM feature).
vDPA is the main usage model for IFCVF, and one DPDK application taking control of all the VF resource
management is a straightforward usage model.

Best Regards,
Xiao

> 
> Else, how do you support passing two VFs of the same PF to different
> DPDK applications?
> Or have some VFs managed by Kernel or QEMU and some by the DPDK
> application? My feeling is that current implementation is creating an
> artificial constraint.
> 
> Isn't there a possibility to have the virtual representation for the PF
> to be probed separately? Or created automatically when the first VF of a
> PF is probed (and later VFs attach to the PF rep when probed)?
> 
> Doing this, we could use the generic device probing.
> For IFCVF, to specify we want it to be probed as an offload device
> instead of a virtio device, we could have a new EAL parameter to specify
> for a given device if we want it to be probed as an offload device (for
> example --offload=00:01.1).
> Offload drivers would register by passing a flag that specifies they are
> an offload driver.
> This new argument would be optional and only used to force the device to
> being probed in offload mode. For devices that have their own device
> IDs for offload mode, then it would be automatic.
> 
> Regards,
> Maxime
> > BRs,
> > Xiao
> >

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-26  9:05           ` Wang, Xiao W
@ 2018-03-26 13:29             ` Maxime Coquelin
  2018-03-27  4:40               ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-26 13:29 UTC (permalink / raw)
  To: Wang, Xiao W, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan



On 03/26/2018 11:05 AM, Wang, Xiao W wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Sunday, March 25, 2018 5:51 PM
>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>
>>
>>
>> On 03/23/2018 11:27 AM, Wang, Xiao W wrote:
>>> Hi Maxime,
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Thursday, March 22, 2018 4:48 AM
>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
>> Chen,
>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>>>
>>>> Hi Xiao,
>>>>
>>>> On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
>>>>> Hi Maxime,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>> Sent: Sunday, March 11, 2018 2:24 AM
>>>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>>>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org;
>> Liang,
>>>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
>>>> Chen,
>>>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>>>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>>>>>
>>>>>> Hi Xiao,
>>>>>>
>>>>>> On 03/10/2018 12:08 AM, Xiao Wang wrote:
>>>>>>> This patch set has dependency on
>>>>>> http://dpdk.org/dev/patchwork/patch/35635/
>>>>>>> (vhost: support selective datapath);
>>>>>>>
>>>>>>> ifc VF is compatible with virtio vring operations, this driver implements
>>>>>>> vDPA driver ops which configures ifc VF to be a vhost data path
>> accelerator.
>>>>>>>
>>>>>>> ifcvf driver uses vdev as a control domain to manage ifc VFs that belong
>>>>>>> to it. It registers vDPA device ops to vhost lib to enable these VFs to be
>>>>>>> used as vhost data path accelerator.
>>>>>>>
>>>>>>> Live migration feature is supported by ifc VF and this driver enables
>>>>>>> it based on vhost lib.
>>>>>>>
>>>>>>> vDPA needs to create different containers for different devices, thus this
>>>>>>> patch set adds APIs in eal/vfio to support multiple container.
>>>>>> Thanks for this! That will avoind having to duplicate these functions
>>>>>> for every new offload driver.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Junjie Chen (1):
>>>>>>>       eal/vfio: add support for multiple container
>>>>>>>
>>>>>>> Xiao Wang (2):
>>>>>>>       bus/pci: expose sysfs parsing API
>>>>>>
>>>>>> Still, I'm not convinced the offload device should be a virtual device.
>>>>>> It is a real PCI device, why not having a new device type for offload
>>>>>> devices, and let the device to be probed automatically by the existing
>>>>>> device model?
>>>>>
>>>>> IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
>>>>> In DPDK we need to have something to represent PF, to register itself as
>>>>> a vDPA engine, so a virtual device is used for this purpose.
>>>> I went through the code, and something is not clear to me.
>>>>
>>>> Why do we need to have a representation of the PF in DPDK?
>>>> Why cannot we just bind at VF level?
>>>
>>> 1. With the vdev representation we could use it to talk to PF kernel driver to
>> do flow configuration, we can implement
>>> flow API on the vdev in future for this purpose. Using a vdev allows
>> introducing this kind of control plane thing.
>>>
>>> 2. When port representor is ready, we would integrate it into ifcvf driver,
>> then each VF will have a
>>> Representor port. For now we don’t have port representor, so this patch set
>> manages VF resource internally.
>>
>> Ok, we may need to have a vdev to represent the PF, but we need to be
>> able to bind at VF level anyway.
> 
> Device management on VF level is feasible, according to the previous port-representor patch.
> A tuple of (PF_addr, VF_index) can identify a certain VF, we have vport_mask
> and device addr to describe a PF, and we can specify a VF index to create a representor port,
> so , the VF port creation will be on-demand at VF level.
> 
> +struct port_rep_parameters {
> +	uint64_t vport_mask;
> +	struct {
> +		char bus[RTE_DEV_NAME_MAX_LEN];
> +		char device[RTE_DEV_NAME_MAX_LEN];
> +	} parent;
> +};
> 
> +int
> +rte_representor_port_register(char *pf_addr_str,
> +		uint32_t vport_id, uint16_t *port_id)

IIUC, even when using the port representor, we'll still have the
problem of the VFs being probed first by the virtio driver, right?

In my opinion, managing the IFCVF devices in offload mode is a different
problem from having a way to represent, on the host side, VFs assigned to
a VM.

In the offload case, you have a real device to deal with, otherwise we
wouldn't have to bind it with VFIO.

Maybe we could have a real device probed as proposed yesterday [0], and 
this device gets registered to the port representor for the PF?

Thanks,
Maxime
> Besides, IFCVF supports live migration, vDPA exerts IFCVF device better than QEMU (this patch has enabled LM feature).
> vDPA is the main usage model for IFCVF, and one DPDK application taking control of all the VF resource
> management is a straightforward usage model.
> 
> Best Regards,
> Xiao
> 
>>
>> Else, how do you support passing two VFs of the same PF to different
>> DPDK applications?
>> Or have some VFs managed by Kernel or QEMU and some by the DPDK
>> application? My feeling is that current implementation is creating an
>> artificial constraint.
>>
>> Isn't there a possibility to have the virtual representation for the PF
>> to be probed separately? Or created automatically when the first VF of a
>> PF is probed (and later VFs attach to the PF rep when probed)?
>>
>> Doing this, we could use the generic device probing.

[0]:
>> For IFCVF, to specify we want it to be probed as an offload device
>> instead of a virtio device, we could have a new EAL parameter to specify
>> for a given device if we want it to be probed as an offload device (for
>> example --offload=00:01.1).
>> Offload drivers would register by passing a flag that specifies they are
>> an offload driver.
>> This new argument would be optional and only used to force the device to
>> being probed in offload mode. For devices that have their own device
>> IDs for offload mode, then it would be automatic.
>>
>> Regards,
>> Maxime
>>> BRs,
>>> Xiao
>>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-26 13:29             ` Maxime Coquelin
@ 2018-03-27  4:40               ` Wang, Xiao W
  2018-03-27  5:09                 ` Maxime Coquelin
  0 siblings, 1 reply; 98+ messages in thread
From: Wang, Xiao W @ 2018-03-27  4:40 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, March 26, 2018 9:30 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [PATCH 0/3] add ifcvf driver
> 
> 
> 
> On 03/26/2018 11:05 AM, Wang, Xiao W wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Sunday, March 25, 2018 5:51 PM
> >> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
> >> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
> Chen,
> >> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> >> Subject: Re: [PATCH 0/3] add ifcvf driver
> >>
> >>
> >>
> >> On 03/23/2018 11:27 AM, Wang, Xiao W wrote:
> >>> Hi Maxime,
> >>>
> >>>> -----Original Message-----
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Thursday, March 22, 2018 4:48 AM
> >>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org;
> Liang,
> >>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
> >> Chen,
> >>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> >>>> Subject: Re: [PATCH 0/3] add ifcvf driver
> >>>>
> >>>> Hi Xiao,
> >>>>
> >>>> On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
> >>>>> Hi Maxime,
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>>>> Sent: Sunday, March 11, 2018 2:24 AM
> >>>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >>>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org;
> >> Liang,
> >>>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen
> <rosen.xu@intel.com>;
> >>>> Chen,
> >>>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> >>>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
> >>>>>>
> >>>>>> Hi Xiao,
> >>>>>>
> >>>>>> On 03/10/2018 12:08 AM, Xiao Wang wrote:
> >>>>>>> This patch set has dependency on
> >>>>>> http://dpdk.org/dev/patchwork/patch/35635/
> >>>>>>> (vhost: support selective datapath);
> >>>>>>>
> >>>>>>> ifc VF is compatible with virtio vring operations, this driver
> implements
> >>>>>>> vDPA driver ops which configures ifc VF to be a vhost data path
> >> accelerator.
> >>>>>>>
> >>>>>>> ifcvf driver uses vdev as a control domain to manage ifc VFs that
> belong
> >>>>>>> to it. It registers vDPA device ops to vhost lib to enable these VFs to
> be
> >>>>>>> used as vhost data path accelerator.
> >>>>>>>
> >>>>>>> Live migration feature is supported by ifc VF and this driver enables
> >>>>>>> it based on vhost lib.
> >>>>>>>
> >>>>>>> vDPA needs to create different containers for different devices, thus
> this
> >>>>>>> patch set adds APIs in eal/vfio to support multiple container.
> >>>>>> Thanks for this! That will avoind having to duplicate these functions
> >>>>>> for every new offload driver.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Junjie Chen (1):
> >>>>>>>       eal/vfio: add support for multiple container
> >>>>>>>
> >>>>>>> Xiao Wang (2):
> >>>>>>>       bus/pci: expose sysfs parsing API
> >>>>>>
> >>>>>> Still, I'm not convinced the offload device should be a virtual device.
> >>>>>> It is a real PCI device, why not having a new device type for offload
> >>>>>> devices, and let the device to be probed automatically by the existing
> >>>>>> device model?
> >>>>>
> >>>>> IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
> >>>>> In DPDK we need to have something to represent PF, to register itself as
> >>>>> a vDPA engine, so a virtual device is used for this purpose.
> >>>> I went through the code, and something is not clear to me.
> >>>>
> >>>> Why do we need to have a representation of the PF in DPDK?
> >>>> Why cannot we just bind at VF level?
> >>>
> >>> 1. With the vdev representation we could use it to talk to PF kernel driver
> to
> >> do flow configuration, we can implement
> >>> flow API on the vdev in future for this purpose. Using a vdev allows
> >> introducing this kind of control plane thing.
> >>>
> >>> 2. When port representor is ready, we would integrate it into ifcvf driver,
> >> then each VF will have a
> >>> Representor port. For now we don’t have port representor, so this patch
> set
> >> manages VF resource internally.
> >>
> >> Ok, we may need to have a vdev to represent the PF, but we need to be
> >> able to bind at VF level anyway.
> >
> > Device management on VF level is feasible, according to the previous port-
> representor patch.
> > A tuple of (PF_addr, VF_index) can identify a certain VF, we have vport_mask
> > and device addr to describe a PF, and we can specify a VF index to create a
> representor port,
> > so , the VF port creation will be on-demand at VF level.
> >
> > +struct port_rep_parameters {
> > +	uint64_t vport_mask;
> > +	struct {
> > +		char bus[RTE_DEV_NAME_MAX_LEN];
> > +		char device[RTE_DEV_NAME_MAX_LEN];
> > +	} parent;
> > +};
> >
> > +int
> > +rte_representor_port_register(char *pf_addr_str,
> > +		uint32_t vport_id, uint16_t *port_id)
> 
> IIUC, even with this using port representor, we'll still have the
> problem of having the VFs probed first as Virtio driver, right?
> 
> In my opinion, the IFCVF devices in offload mode are to be managed
> differently than having a way to represent on host side VFs assigned to
> a VM.
> 
> In offload case, you have a real device to deal with, else we
> wouldn't have to bind it with VFIO.
> 
> Maybe we could have a real device probed as proposed yesterday [0], and
> this device gets registered to the port representor for the PF?

Adding a list of offload devices in EAL is one way to skip the virtio PMD probe [0].
I think using device devargs could also achieve that: add a parameter "vdpa=1"
to the device; the virtio PMD parses the devargs, detects that the device is in
vdpa mode, and quits its probe immediately.
Devargs could be flexible enough to allow the one-device-with-multiple-drivers case.
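
A rough sketch of such a check in the virtio PMD probe path (the "vdpa" key
name and the helper are assumptions, not existing code; it needs rte_devargs.h
and rte_kvargs.h):

static int
virtio_devargs_has_vdpa(const struct rte_devargs *devargs)
{
	struct rte_kvargs *kvlist;
	int vdpa;

	if (devargs == NULL)
		return 0;

	kvlist = rte_kvargs_parse(devargs->args, NULL);
	if (kvlist == NULL)
		return 0;

	/* "vdpa=1" present: let the virtio PMD skip this device */
	vdpa = rte_kvargs_count(kvlist, "vdpa") > 0;
	rte_kvargs_free(kvlist);
	return vdpa;
}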

BRs,
Xiao

> 
> Thanks,
> Maxime
> > Besides, IFCVF supports live migration, vDPA exerts IFCVF device better than
> QEMU (this patch has enabled LM feature).
> > vDPA is the main usage model for IFCVF, and one DPDK application taking
> control of all the VF resource
> > management is a straightforward usage model.
> >
> > Best Regards,
> > Xiao
> >
> >>
> >> Else, how do you support passing two VFs of the same PF to different
> >> DPDK applications?
> >> Or have some VFs managed by Kernel or QEMU and some by the DPDK
> >> application? My feeling is that current implementation is creating an
> >> artificial constraint.
> >>
> >> Isn't there a possibility to have the virtual representation for the PF
> >> to be probed separately? Or created automatically when the first VF of a
> >> PF is probed (and later VFs attach to the PF rep when probed)?
> >>
> >> Doing this, we could use the generic device probing.
> 
> [0]:
> >> For IFCVF, to specify we want it to be probed as an offload device
> >> instead of a virtio device, we could have a new EAL parameter to specify
> >> for a given device if we want it to be probed as an offload device (for
> >> example --offload=00:01.1).
> >> Offload drivers would register by passing a flag that specifies they are
> >> an offload driver.
> >> This new argument would be optional and only used to force the device to
> >> being probed in offload mode. For devices that have their own device
> >> IDs for offload mode, then it would be automatic.
> >>
> >> Regards,
> >> Maxime
> >>> BRs,
> >>> Xiao
> >>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/3] add ifcvf driver
  2018-03-27  4:40               ` Wang, Xiao W
@ 2018-03-27  5:09                 ` Maxime Coquelin
  0 siblings, 0 replies; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-27  5:09 UTC (permalink / raw)
  To: Wang, Xiao W, dev
  Cc: Wang, Zhihong, yliu, Liang, Cunming, Xu, Rosen, Chen, Junjie J,
	Daly, Dan

Hi Xiao,

On 03/27/2018 06:40 AM, Wang, Xiao W wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Monday, March 26, 2018 9:30 PM
>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen,
>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>
>>
>>
>> On 03/26/2018 11:05 AM, Wang, Xiao W wrote:
>>> Hi Maxime,
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Sunday, March 25, 2018 5:51 PM
>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org; Liang,
>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
>> Chen,
>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>>>
>>>>
>>>>
>>>> On 03/23/2018 11:27 AM, Wang, Xiao W wrote:
>>>>> Hi Maxime,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>> Sent: Thursday, March 22, 2018 4:48 AM
>>>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>>>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org;
>> Liang,
>>>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>;
>>>> Chen,
>>>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>>>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>>>>>
>>>>>> Hi Xiao,
>>>>>>
>>>>>> On 03/15/2018 05:49 PM, Wang, Xiao W wrote:
>>>>>>> Hi Maxime,
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>>>> Sent: Sunday, March 11, 2018 2:24 AM
>>>>>>>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>>>>>>>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; yliu@fridaylinux.org;
>>>> Liang,
>>>>>>>> Cunming <cunming.liang@intel.com>; Xu, Rosen
>> <rosen.xu@intel.com>;
>>>>>> Chen,
>>>>>>>> Junjie J <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
>>>>>>>> Subject: Re: [PATCH 0/3] add ifcvf driver
>>>>>>>>
>>>>>>>> Hi Xiao,
>>>>>>>>
>>>>>>>> On 03/10/2018 12:08 AM, Xiao Wang wrote:
>>>>>>>>> This patch set has dependency on
>>>>>>>> http://dpdk.org/dev/patchwork/patch/35635/
>>>>>>>>> (vhost: support selective datapath);
>>>>>>>>>
>>>>>>>>> ifc VF is compatible with virtio vring operations, this driver
>> implements
>>>>>>>>> vDPA driver ops which configures ifc VF to be a vhost data path
>>>> accelerator.
>>>>>>>>>
>>>>>>>>> ifcvf driver uses vdev as a control domain to manage ifc VFs that
>> belong
>>>>>>>>> to it. It registers vDPA device ops to vhost lib to enable these VFs to
>> be
>>>>>>>>> used as vhost data path accelerator.
>>>>>>>>>
>>>>>>>>> Live migration feature is supported by ifc VF and this driver enables
>>>>>>>>> it based on vhost lib.
>>>>>>>>>
>>>>>>>>> vDPA needs to create different containers for different devices, thus
>> this
>>>>>>>>> patch set adds APIs in eal/vfio to support multiple container.
>>>>>>>> Thanks for this! That will avoind having to duplicate these functions
>>>>>>>> for every new offload driver.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Junjie Chen (1):
>>>>>>>>>        eal/vfio: add support for multiple container
>>>>>>>>>
>>>>>>>>> Xiao Wang (2):
>>>>>>>>>        bus/pci: expose sysfs parsing API
>>>>>>>>
>>>>>>>> Still, I'm not convinced the offload device should be a virtual device.
>>>>>>>> It is a real PCI device, why not having a new device type for offload
>>>>>>>> devices, and let the device to be probed automatically by the existing
>>>>>>>> device model?
>>>>>>>
>>>>>>> IFC VFs are generated from SRIOV, with the PF driven by kernel driver.
>>>>>>> In DPDK we need to have something to represent PF, to register itself as
>>>>>>> a vDPA engine, so a virtual device is used for this purpose.
>>>>>> I went through the code, and something is not clear to me.
>>>>>>
>>>>>> Why do we need to have a representation of the PF in DPDK?
>>>>>> Why cannot we just bind at VF level?
>>>>>
>>>>> 1. With the vdev representation we could use it to talk to PF kernel driver
>> to
>>>> do flow configuration, we can implement
>>>>> flow API on the vdev in future for this purpose. Using a vdev allows
>>>> introducing this kind of control plane thing.
>>>>>
>>>>> 2. When port representor is ready, we would integrate it into ifcvf driver,
>>>> then each VF will have a
>>>>> Representor port. For now we don’t have port representor, so this patch
>> set
>>>> manages VF resource internally.
>>>>
>>>> Ok, we may need to have a vdev to represent the PF, but we need to be
>>>> able to bind at VF level anyway.
>>>
>>> Device management on VF level is feasible, according to the previous port-
>> representor patch.
>>> A tuple of (PF_addr, VF_index) can identify a certain VF, we have vport_mask
>>> and device addr to describe a PF, and we can specify a VF index to create a
>> representor port,
>>> so , the VF port creation will be on-demand at VF level.
>>>
>>> +struct port_rep_parameters {
>>> +	uint64_t vport_mask;
>>> +	struct {
>>> +		char bus[RTE_DEV_NAME_MAX_LEN];
>>> +		char device[RTE_DEV_NAME_MAX_LEN];
>>> +	} parent;
>>> +};
>>>
>>> +int
>>> +rte_representor_port_register(char *pf_addr_str,
>>> +		uint32_t vport_id, uint16_t *port_id)
>>
>> IIUC, even with this using port representor, we'll still have the
>> problem of having the VFs probed first as Virtio driver, right?
>>
>> In my opinion, the IFCVF devices in offload mode are to be managed
>> differently than having a way to represent on host side VFs assigned to
>> a VM.
>>
>> In offload case, you have a real device to deal with, else we
>> wouldn't have to bind it with VFIO.
>>
>> Maybe we could have a real device probed as proposed yesterday [0], and
>> this device gets registered to the port representor for the PF?
> 
> Adding a list of offload devices in EAL is one way to skip the virtio pmd probe [0].
> I think using device devargs could also achieve that: add a parameter "vdpa=1"
> to the device; the virtio pmd parses the devargs, detects that the device is in
> vdpa mode, and quits its probe immediately.
> Devargs would also be flexible enough to cover the case of one device supported
> by multiple drivers.

That's a good idea, but if we want the vDPA VF to be probed in a generic
way, will it work? I think the Virtio PMD won't be probed, but no other
driver will be tried either; I might be wrong.

Thanks,
Maxime
> BRs,
> Xiao
> 
>>
>> Thanks,
>> Maxime
>>> Besides, IFCVF supports live migration, vDPA exerts IFCVF device better than
>> QEMU (this patch has enabled LM feature).
>>> vDPA is the main usage model for IFCVF, and one DPDK application taking
>> control of all the VF resource
>>> management is a straightforward usage model.
>>>
>>> Best Regards,
>>> Xiao
>>>
>>>>
>>>> Else, how do you support passing two VFs of the same PF to different
>>>> DPDK applications?
>>>> Or have some VFs managed by Kernel or QEMU and some by the DPDK
>>>> application? My feeling is that current implementation is creating an
>>>> artificial constraint.
>>>>
>>>> Isn't there a possibility to have the virtual representation for the PF
>>>> to be probed separately? Or created automatically when the first VF of a
>>>> PF is probed (and later VFs attach to the PF rep when probed)?
>>>>
>>>> Doing this, we could use the generic device probing.
>>
>> [0]:
>>>> For IFCVF, to specify we want it to be probed as an offload device
>>>> instead of a virtio device, we could have a new EAL parameter to specify
>>>> for a given device if we want it to be probed as an offload device (for
>>>> example --offload=00:01.1).
>>>> Offload drivers would register by passing a flag that specifies they are
>>>> an offload driver.
>>>> This new argument would be optional and only used to force the device to
>>>> being probed in offload mode. For devices that have their own device
>>>> IDs for offload mode, then it would be automatic.
>>>>
>>>> Regards,
>>>> Maxime
>>>>> BRs,
>>>>> Xiao
>>>>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v3 0/3] add ifcvf vdpa driver
  2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
                         ` (2 preceding siblings ...)
  2018-03-22  8:51       ` Ferruh Yigit
@ 2018-03-31  2:29       ` Xiao Wang
  2018-03-31  2:29         ` [PATCH v3 1/4] eal/vfio: add support for multiple container Xiao Wang
                           ` (3 more replies)
  3 siblings, 4 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-31  2:29 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Xiao Wang

This patch set has dependency on http://dpdk.org/dev/patchwork/patch/36772/
(vhost: support selective datapath).

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. Besides, it supports dirty page logging and device state
report/restore. This driver enables IFCVF's vDPA functionality, including the
live migration feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net pci
device, but it has its own specific subsystem vendor ID and device ID. To let
the device be probed by the IFCVF driver, add the "vdpa=1" devarg to specify
that this device is to be used in vDPA mode rather than polling mode; the
virtio pmd will skip the device when it detects this parameter.
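
For illustration, a minimal sketch of how such a devarg could be passed from an
application, assuming the standard EAL whitelist syntax for per-device devargs
(the application name, core mask and BDF below are placeholders):

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_common.h>

int
main(void)
{
	/* example only: core mask and BDF are placeholders */
	char *eal_args[] = {
		"vdpa-app", "-c", "0x2",
		"-w", "0000:06:00.3,vdpa=1",	/* probe this VF in vDPA mode */
	};

	if (rte_eal_init(RTE_DIM(eal_args), eal_args) < 0)
		return EXIT_FAILURE;

	/* ... register the vDPA device with the vhost lib, start vhost-user ... */
	return 0;
}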

Container per device
====================
vDPA needs to create different containers for different devices, thus this
patch set adds some APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_create_container
- rte_vfio_destroy_container
- rte_vfio_bind_group
- rte_vfio_unbind_group

By this extension, a device can be put into a new specific container, rather
than the previous default container.
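
For illustration, a minimal sketch of the intended calling sequence for one
device, using the signatures added by this patch set (error handling and the
walk over all memory segments are simplified; setup_device_container is just a
hypothetical helper name):

#include <rte_vfio.h>
#include <rte_memory.h>

/* hypothetical helper: put one device's IOMMU group into its own container
 * and program a DMA mapping for one memory segment.
 */
static int
setup_device_container(int iommu_group_no, const struct rte_memseg *ms)
{
	int container_fd;

	/* create a container dedicated to this device */
	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	/* attach the device's IOMMU group to the new container */
	if (rte_vfio_bind_group(container_fd, iommu_group_no) < 0)
		goto err;

	/* program the DMA remapping table for this memory segment */
	if (rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, ms) < 0)
		goto err;

	return container_fd;

err:
	rte_vfio_destroy_container(container_fd);
	return -1;
}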

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable the VF data path with the virtio information provided by the vhost lib:
  IOMMU programming to enable VF DMA to the VM's memory, VFIO interrupt setup to
  route HW interrupts to the virtio driver, creation of a notify relay thread to
  translate the virtio driver's kicks into MMIO writes onto HW (see the sketch
  after this section), and HW queue configuration.

  This function gets called to set up the HW data path backend when the virtio
  driver in the VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when the virtio driver stops the device in the VM.
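
A rough sketch of what the notify relay boils down to for a single queue,
assuming the relay thread already holds the vring's kick eventfd obtained from
the vhost lib; ifcvf_notify_queue() is the doorbell helper from the IFCVF base
code:

#include <stdint.h>
#include <unistd.h>

#include "base/ifcvf.h"

/* forward one guest kick: consume the eventfd, then ring the VF doorbell */
static void
relay_one_kick(struct ifcvf_hw *hw, int kick_fd, uint16_t qid)
{
	uint64_t count;

	/* the guest virtio driver signalled this eventfd (a "kick") */
	if (read(kick_fd, &count, sizeof(count)) < 0)
		return;

	/* translate the kick into an MMIO write onto the VF notify register */
	ifcvf_notify_queue(hw, qid);
}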

Change log
==========
v3:
- Add doc and release note for the new driver.
- Remove the vdev concept; make the driver a PCI driver so that it gets probed
  by the PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of an engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let virtio pmd skip probing when a virtio device needs to work in vDPA mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Junjie Chen (1):
  eal/vfio: add support for multiple container

Xiao Wang (3):
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  net/ifcvf: add driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  85 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  36 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 842 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  51 +-
 lib/librte_eal/common/include/rte_vfio.h | 116 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 552 ++++++++++++++++----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   7 +
 mk/rte.app.mk                            |   3 +
 20 files changed, 2210 insertions(+), 101 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v3 1/4] eal/vfio: add support for multiple container
  2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
@ 2018-03-31  2:29         ` Xiao Wang
  2018-03-31 11:06           ` Maxime Coquelin
  2018-03-31  2:29         ` [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-03-31  2:29 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Junjie Chen,
	Xiao Wang

From: Junjie Chen <junjie.j.chen@intel.com>

Currently eal vfio framework binds vfio group fd to the default
container fd, while in some cases, e.g. vDPA (vhost data path
acceleration), we want to set vfio group to a new container and
program DMA mapping via this new container, so this patch adds
APIs to support multiple container.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 config/common_base                       |   1 +
 lib/librte_eal/bsdapp/eal/eal.c          |  51 ++-
 lib/librte_eal/common/include/rte_vfio.h | 116 +++++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 552 +++++++++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   7 +
 6 files changed, 627 insertions(+), 101 deletions(-)

diff --git a/config/common_base b/config/common_base
index ad03cf433..b2df1b482 100644
--- a/config/common_base
+++ b/config/common_base
@@ -74,6 +74,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 
diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..be4590e41 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -38,6 +38,7 @@
 #include <rte_interrupts.h>
 #include <rte_bus.h>
 #include <rte_dev.h>
+#include <rte_vfio.h>
 #include <rte_devargs.h>
 #include <rte_version.h>
 #include <rte_atomic.h>
@@ -738,15 +739,6 @@ rte_eal_vfio_intr_mode(void)
 /* dummy forward declaration. */
 struct vfio_device_info;
 
-/* dummy prototypes. */
-int rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
-		int *vfio_dev_fd, struct vfio_device_info *device_info);
-int rte_vfio_release_device(const char *sysfs_base, const char *dev_addr, int fd);
-int rte_vfio_enable(const char *modname);
-int rte_vfio_is_enabled(const char *modname);
-int rte_vfio_noiommu_is_enabled(void);
-int rte_vfio_clear_group(int vfio_group_fd);
-
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
 		      __rte_unused int *vfio_dev_fd,
@@ -781,3 +773,44 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int rte_vfio_bind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_unbind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_get_group_fd(__rte_unused int iommu_group_no)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index e981a6228..d6131d4c8 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -123,6 +123,122 @@ int rte_vfio_noiommu_is_enabled(void);
 int
 rte_vfio_clear_group(int vfio_group_fd);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container
+ * @return
+ *    the container fd if successful
+ *    else < 0
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio group numbers.
+ * @param container_fd
+ *   the container fd to destroy
+ * @return
+ *    0 if successful.
+ *   !0 otherwise.
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind a group number to container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ * @return
+ *    0 if successful
+ *    < 0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind a group from specified container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma mapping for device in specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *   the dma type for mapping
+ * @param ms
+ *   the dma address region to map
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma unmapping for device in specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *    the dma map type
+ * @param ms
+ *   the dma address region to unmap
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Get group fd via group number
+ * @param iommu_group_no
+ *  the group number
+ * @return
+ *     corresponding group fd if successful
+ *     -1 if failed
+ */
+int __rte_experimental
+rte_vfio_get_group_fd(int iommu_group_no);
+
 #endif /* VFIO_PRESENT */
 
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..987f316f7 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
 
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
-
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,350 @@ vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+vfio_get_container(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+vfio_get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+static int
+vfio_find_container_idx(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return i;
+		}
+	}
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+	vfio_cfg = vfio_cfgs[i];
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+
+	if (vfio_cfg->vfio_container_fd < 0) {
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+	}
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = vfio_get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[idx];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	/* Check room for new group */
+	if (cur_vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (cur_vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	cur_vfio_cfg->vfio_active_groups++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (cur_vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	cur_vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_find_container_idx(iommu_group_no);
+	vfio_cfg = vfio_cfgs[i];
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
-		if (i < 0)
+		if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
+			RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		}
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,9 +517,11 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
-	int ret;
+	int ret = 0;
+	int index;
 
 	/* get group number */
 	ret = vfio_get_group_no(sysfs_base, dev_addr, &iommu_group_no);
@@ -309,12 +567,14 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	index = vfio_find_container_idx(iommu_group_no);
+	vfio_container_fd = vfio_cfgs[index]->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +591,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfgs[index]->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +605,13 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +655,7 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +723,9 @@ rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +747,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +764,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +932,87 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd,
+	const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	struct vfio_iommu_type1_dma_map dma_map;
+	int ret;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	if (ms->addr == NULL) {
+		RTE_LOG(ERR, EAL, "invalid dma addr");
+		return -1;
+	}
 
-		if (ms[i].addr == NULL)
-			break;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
+	return 0;
+}
+
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd,
+	const struct rte_memseg *ms)
+{
+	int ret;
+	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
 			return -1;
-		}
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1156,59 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int
+rte_vfio_dma_map(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not support yet.");
+			return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int
+rte_vfio_dma_unmap(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not support yet.");
+			return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int rte_vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = vfio_cfgs[i];
+		if (!vfio_cfg)
+			continue;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg->vfio_groups[j].fd;
+		}
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..23a1e3608 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -86,6 +86,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d12360235..cdf211b20 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -254,5 +254,12 @@ EXPERIMENTAL {
 	rte_service_set_runstate_mapped_check;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_vfio_bind_group;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_get_group_fd;
+	rte_vfio_unbind_group;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode
  2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
  2018-03-31  2:29         ` [PATCH v3 1/4] eal/vfio: add support for multiple container Xiao Wang
@ 2018-03-31  2:29         ` Xiao Wang
  2018-03-31 11:13           ` Maxime Coquelin
  2018-03-31  2:29         ` [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
  2018-03-31  2:29         ` [PATCH v3 4/4] net/ifcvf: add " Xiao Wang
  3 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-03-31  2:29 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we could add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio pmd skip device probe when it detects this parameter.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 884f74ad0..6551a367f 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -29,6 +29,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1744,9 +1745,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
  2018-03-31  2:29         ` [PATCH v3 1/4] eal/vfio: add support for multiple container Xiao Wang
  2018-03-31  2:29         ` [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-03-31  2:29         ` Xiao Wang
  2018-03-31 11:26           ` Maxime Coquelin
  2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
  2018-03-31  2:29         ` [PATCH v3 4/4] net/ifcvf: add " Xiao Wang
  3 siblings, 2 replies; 98+ messages in thread
From: Xiao Wang @ 2018-03-31  2:29 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Xiao Wang,
	Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible;
it works as a HW vhost backend which can send/receive packets to/from
virtio directly by DMA.

Different VF devices serve different virtio frontends which are in
different VMs, so each VF needs to have its own DMA address translation
service. During the driver probe a new container is created; with this
container the vDPA driver can program the DMA remapping table with the VM's
memory region information.

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib,
  including IOMMU programming to enable VF DMA to VM's memory, VFIO
  interrupt setup to route HW interrupt to virtio driver, create notify
  relay thread to translate virtio driver's kick to a MMIO write onto HW,
  HW queues configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

The live migration feature is supported by IFCVF and this driver enables
it. For dirty page logging, the VF logs guest writes to packet buffers, and
the driver marks the used rings as dirty when the device stops.

Because vDPA driver needs to set up MSI-X vector to interrupt the
guest, only vfio-pci is supported currently.
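
As an illustration of the logging flow described above, a minimal sketch using
the base code helpers added in this patch (the log region base/size are assumed
to be provided by the vhost lib; marking the used rings dirty on stop goes
through the vhost logging API and is only indicated by a comment):

#include <stdint.h>

#include "base/ifcvf.h"

/* live migration starts: let the VF log guest writes to packet buffers */
static void
vdpa_logging_start(struct ifcvf_hw *hw, uint64_t log_base, uint64_t log_size)
{
	ifcvf_enable_logging(hw, log_base, log_size);
}

/* device stops: the driver marks the used rings dirty via the vhost log
 * (not shown here), then turns HW logging off.
 */
static void
vdpa_logging_stop(struct ifcvf_hw *hw)
{
	ifcvf_disable_logging(hw);
}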

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  36 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 842 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1437 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index b2df1b482..f63f5c7c4 100644
--- a/config/common_base
+++ b/config/common_base
@@ -792,6 +792,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..1a5513c20 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..aec3a32f8 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -53,6 +53,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..f08fcaad8
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->notify_base,
+			hw->isr, hw->dev_cfg,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)(ring_state >> 16);
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..ff87f5153
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,842 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+#include <eal_vfio.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	internal->vfio_container_fd = rte_vfio_create_container();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	ret = rte_vfio_bind_group(internal->vfio_container_fd,
+			iommu_group_no);
+	if (ret)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+	internal->vfio_group_fd = rte_vfio_get_group_fd(iommu_group_no);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx.", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_map(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_unmap(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for packet buffers only;
+		 * SW helps to mark the used ring as dirty after the device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			pfn = hw->vring[i].used / 4096;
+			for (j = 0; j <= hw->vring[i].size * 8 / 4096; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
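+		/* Pack the queue id into the low 32 bits and the kickfd
+		 * into the high 32 bits of the epoll user data.
+		 */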
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_feature_set(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_did(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Get not get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_feature(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK)
+static int
+ifcvf_get_protocol_feature(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.queue_num_get = ifcvf_get_queue_num,
+	.feature_get = ifcvf_get_vdpa_feature,
+	.protocol_feature_get = ifcvf_get_protocol_feature,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.vring_state_set = NULL,
+	.feature_set = ifcvf_feature_set,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		return -1;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (rte_vdpa_register_device(&internal->dev_addr,
+				&ifcvf_ops) < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * The set of PCI devices this driver supports.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.driver = {
+		.name = "net_ifcvf",
+	},
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..46f76146e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -171,6 +171,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA)     += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v3 4/4] net/ifcvf: add driver document and release note
  2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
                           ` (2 preceding siblings ...)
  2018-03-31  2:29         ` [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-03-31  2:29         ` Xiao Wang
  2018-03-31 11:28           ` Maxime Coquelin
  3 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-03-31  2:29 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 ++++
 doc/guides/nics/ifcvf.rst              | 85 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 103 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..5d82bd25e
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,85 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
+works as a HW vhost backend which can send/receive packets to/from virtio
+directly by DMA. In addition, it supports dirty page logging and device state
+report/restore. This driver enables its vDPA functionality with the live
+migration feature.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, but it has its own specific subsystem vendor ID and device ID. To let
+the device be probed by the IFCVF driver, the "vdpa=1" devarg specifies that
+this device is to be used in vDPA mode rather than polling mode; the virtio
+PMD skips the device when it detects this parameter, as shown in the example
+below.
+
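+For example, a DPDK application would pass the devarg on the device whitelist
+(the application name and PCI address below are only placeholders):
+
+.. code-block:: console
+
+    ./dpdk-app -c 0x3 -n 4 -w 0000:06:00.3,vdpa=1 -- ...
+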
+Different VF devices serve different virtio frontends which are in different
+VMs, so each VF needs to have its own DMA address translation service. During
+driver probe a new VFIO container is created for this device; with this
+container the vDPA driver can program the DMA remapping table with the VM's
+memory region information.
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable the VF data path with the virtio information provided by the vhost
+  lib, including IOMMU programming to enable VF DMA to the VM's memory, VFIO
+  interrupt setup to route HW interrupts to the virtio driver, creation of a
+  notify relay thread to translate the virtio driver's kick into an MMIO write
+  onto the HW, and HW queue configuration.
+
+  This function gets called to set up the HW data path backend when the virtio
+  driver in the VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup in ifcvf_dev_config.
+
+  This function gets called when the virtio driver stops the device in the VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via the
+  vhost API (see the sketch below). When the QEMU vhost connection gets ready,
+  the assigned VF will get configured automatically.
+
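+A minimal sketch of this flow is given below, assuming the selective datapath
+vhost API as it landed in DPDK 18.05 (``rte_vdpa_find_device_id``,
+``rte_vhost_driver_attach_vdpa_device``); the socket path and PCI address are
+only placeholders:
+
+.. code-block:: c
+
+    #include <rte_pci.h>
+    #include <rte_vdpa.h>
+    #include <rte_vhost.h>
+
+    static int
+    create_ifcvf_vhost_port(void)
+    {
+        struct rte_vdpa_dev_addr addr = { .type = PCI_ADDR };
+        const char *path = "/tmp/vhost-user0"; /* example socket path */
+        int did;
+
+        /* 0000:06:00.3 is a placeholder VF address. */
+        if (rte_pci_addr_parse("0000:06:00.3", &addr.pci_addr) != 0)
+            return -1;
+
+        /* Look up the vDPA device ID registered by the IFCVF driver. */
+        did = rte_vdpa_find_device_id(&addr);
+        if (did < 0)
+            return -1;
+
+        /* Create the vhost-user socket and bind the vDPA device to it. */
+        if (rte_vhost_driver_register(path, 0) != 0)
+            return -1;
+        if (rte_vhost_driver_attach_vdpa_device(path, did) != 0)
+            return -1;
+
+        return rte_vhost_driver_start(path);
+    }
+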
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- A platform with IOMMU support. The IFC VF needs the address translation
+  service to Rx/Tx directly with the virtio driver in the VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up VF MSI-X interrupts; each queue's interrupt
+vector is mapped to a callfd associated with a virtio ring. Currently only
+vfio-pci allows multiple interrupts, so the IFCVF driver depends on vfio-pci.
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+IFC VF doesn't support RARP packet generation; a virtio frontend supporting
+the VIRTIO_NET_F_GUEST_ANNOUNCE feature can help to do that.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 59419f432..379798c20 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -44,6 +44,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index 3923dc253..dc6854035 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -41,6 +41,15 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================
 
+* **Added IFCVF vDPA driver.**
+
+  Added the IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and is
+  compatible with virtio 0.95 and 1.0. The driver registers the ifcvf vDPA
+  device to the vhost lib; when a virtio connection is established, the
+  registered vDPA driver configures the assigned VF to Rx/Tx directly to/from
+  the VM's virtio vrings.
+
 
 API Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 1/4] eal/vfio: add support for multiple container
  2018-03-31  2:29         ` [PATCH v3 1/4] eal/vfio: add support for multiple container Xiao Wang
@ 2018-03-31 11:06           ` Maxime Coquelin
  0 siblings, 0 replies; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-31 11:06 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Junjie Chen



On 03/31/2018 04:29 AM, Xiao Wang wrote:
> From: Junjie Chen <junjie.j.chen@intel.com>
> 
> Currently eal vfio framework binds vfio group fd to the default
> container fd, while in some cases, e.g. vDPA (vhost data path
> acceleration), we want to set vfio group to a new container and
> program DMA mapping via this new container, so this patch adds
> APIs to support multiple container.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>   config/common_base                       |   1 +
>   lib/librte_eal/bsdapp/eal/eal.c          |  51 ++-
>   lib/librte_eal/common/include/rte_vfio.h | 116 +++++++
>   lib/librte_eal/linuxapp/eal/eal_vfio.c   | 552 +++++++++++++++++++++++++------
>   lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
>   lib/librte_eal/rte_eal_version.map       |   7 +
>   6 files changed, 627 insertions(+), 101 deletions(-)
> 

FWIW:

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode
  2018-03-31  2:29         ` [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-03-31 11:13           ` Maxime Coquelin
  2018-03-31 13:16             ` Thomas Monjalon
  0 siblings, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-31 11:13 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov

Hi Xiao,

On 03/31/2018 04:29 AM, Xiao Wang wrote:
> If we want a virtio device to work in vDPA (vhost data path acceleration)
> mode, we could add a "vdpa=1" devarg for this device to specify the mode.
> 
> This patch let virtio pmd skip device probe when detecting this parameter.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>   drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
>   1 file changed, 43 insertions(+)
> 

As we discussed, I would prefer a generic solution at EAL level.
But as a start, I agree with this solution:

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks!
Maxime

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-03-31  2:29         ` [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-03-31 11:26           ` Maxime Coquelin
  2018-04-03  9:38             ` Wang, Xiao W
  2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
  1 sibling, 1 reply; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-31 11:26 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, Rosen Xu



On 03/31/2018 04:29 AM, Xiao Wang wrote:
> The IFCVF vDPA (vhost data path acceleration) driver provides support for
> the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible,
> it works as a HW vhost backend which can send/receive packets to/from
> virtio directly by DMA.
> 
> Different VF devices serve different virtio frontends which are in
> different VMs, so each VF needs to have its own DMA address translation
> service. During the driver probe a new container is created, with this
> container vDPA driver can program DMA remapping table with the VM's memory
> region information.
> 
> Key vDPA driver ops implemented:
> 
> - ifcvf_dev_config:
>    Enable VF data path with virtio information provided by vhost lib,
>    including IOMMU programming to enable VF DMA to VM's memory, VFIO
>    interrupt setup to route HW interrupt to virtio driver, create notify
>    relay thread to translate virtio driver's kick to a MMIO write onto HW,
>    HW queues configuration.
> 
> - ifcvf_dev_close:
>    Revoke all the setup in ifcvf_dev_config.
> 
> Live migration feature is supported by IFCVF and this driver enables
> it. For the dirty page logging, VF helps to log for packet buffer write,
> driver helps to make the used ring as dirty when device stops.
> 
> Because vDPA driver needs to set up MSI-X vector to interrupt the
> guest, only vfio-pci is supported currently.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> ---
>   config/common_base                    |   7 +
>   config/common_linuxapp                |   1 +
>   drivers/net/Makefile                  |   3 +
>   drivers/net/ifc/Makefile              |  36 ++
>   drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
>   drivers/net/ifc/base/ifcvf.h          | 160 +++++++
>   drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
>   drivers/net/ifc/ifcvf_vdpa.c          | 842 ++++++++++++++++++++++++++++++++++
>   drivers/net/ifc/rte_ifcvf_version.map |   4 +
>   mk/rte.app.mk                         |   3 +
>   10 files changed, 1437 insertions(+)
>   create mode 100644 drivers/net/ifc/Makefile
>   create mode 100644 drivers/net/ifc/base/ifcvf.c
>   create mode 100644 drivers/net/ifc/base/ifcvf.h
>   create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
>   create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
>   create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

Thanks for having handled the changes, please see minor comments below.

Feel free to add my:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks!
Maxime

> +static uint64_t
> +qva_to_gpa(int vid, uint64_t qva)

We might want to have this in vhost-lib to avoid duplication,
but that can be done later.

> +{
> +	struct rte_vhost_memory *mem = NULL;
> +	struct rte_vhost_mem_region *reg;
> +	uint32_t i;
> +	uint64_t gpa = 0;
> +
> +	if (rte_vhost_get_mem_table(vid, &mem) < 0)
> +		goto exit;
> +
> +	for (i = 0; i < mem->nregions; i++) {
> +		reg = &mem->regions[i];
> +
> +		if (qva >= reg->host_user_addr &&
> +				qva < reg->host_user_addr + reg->size) {
> +			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
> +			break;
> +		}
> +	}
> +
> +exit:
> +	if (gpa == 0)
> +		rte_panic("failed to get gpa\n");
> +	if (mem)
> +		free(mem);
> +	return gpa;
> +}
> +
> +static int
> +vdpa_ifcvf_start(struct ifcvf_internal *internal)
> +{
> +	struct ifcvf_hw *hw = &internal->hw;
> +	int i, nr_vring;
> +	int vid;
> +	struct rte_vhost_vring vq;
> +
> +	vid = internal->vid;
> +	nr_vring = rte_vhost_get_vring_num(vid);
> +	rte_vhost_get_negotiated_features(vid, &hw->req_features);
> +
> +	for (i = 0; i < nr_vring; i++) {
> +		rte_vhost_get_vhost_vring(vid, i, &vq);
> +		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
> +		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
> +		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
> +		hw->vring[i].size = vq.size;
> +		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
> +				&hw->vring[i].last_used_idx);
> +	}
> +	hw->nr_vring = i;
> +
> +	return ifcvf_start_hw(&internal->hw);
> +}
> +
> +static void
> +vdpa_ifcvf_stop(struct ifcvf_internal *internal)
> +{
> +	struct ifcvf_hw *hw = &internal->hw;
> +	int i, j;
> +	int vid;
> +	uint64_t features, pfn;
> +	uint64_t log_base, log_size;
> +	uint8_t *log_buf;
> +
> +	vid = internal->vid;
> +	ifcvf_stop_hw(hw);
> +
> +	for (i = 0; i < hw->nr_vring; i++)
> +		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
> +				hw->vring[i].last_used_idx);
> +
> +	rte_vhost_get_negotiated_features(vid, &features);
> +	if (RTE_VHOST_NEED_LOG(features)) {
> +		ifcvf_disable_logging(hw);
> +		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
> +		/*
> +		 * IFCVF marks dirty memory pages for only packet buffer,
> +		 * SW helps to mark the used ring as dirty after device stops.
> +		 */
> +		log_buf = (uint8_t *)(uintptr_t)log_base;
> +		for (i = 0; i < hw->nr_vring; i++) {
> +			pfn = hw->vring[i].used / 4096;
> +			for (j = 0; j <= hw->vring[i].size * 8 / 4096; j++)
> +				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
> +						 1 << ((pfn + j) % 8));
> +		}
> +	}
> +}
> +
> +#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
> +		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
> +static int
> +vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
> +{
> +	int ret;
> +	uint32_t i, nr_vring;
> +	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
> +	struct vfio_irq_set *irq_set;
> +	int *fd_ptr;
> +	struct rte_vhost_vring vring;
> +
> +	nr_vring = rte_vhost_get_vring_num(internal->vid);
> +
> +	irq_set = (struct vfio_irq_set *)irq_set_buf;
> +	irq_set->argsz = sizeof(irq_set_buf);
> +	irq_set->count = nr_vring + 1;
> +	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> +			 VFIO_IRQ_SET_ACTION_TRIGGER;
> +	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
> +	irq_set->start = 0;
> +	fd_ptr = (int *)&irq_set->data;
> +	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
> +
> +	for (i = 0; i < nr_vring; i++) {
> +		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
> +		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
> +	}
> +
> +	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +	if (ret) {
> +		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
> +				strerror(errno));
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
> +{
> +	int ret;
> +	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
> +	struct vfio_irq_set *irq_set;
> +
> +	irq_set = (struct vfio_irq_set *)irq_set_buf;
> +	irq_set->argsz = sizeof(irq_set_buf);
> +	irq_set->count = 0;
> +	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
> +	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
> +	irq_set->start = 0;
> +
> +	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +	if (ret) {
> +		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
> +				strerror(errno));
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void *
> +notify_relay(void *arg)
> +{
> +	int i, kickfd, epfd, nfds = 0;
> +	uint32_t qid, q_num;
> +	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
> +	struct epoll_event ev;
> +	uint64_t buf;
> +	int nbytes;
> +	struct rte_vhost_vring vring;
> +	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
> +	struct ifcvf_hw *hw = &internal->hw;
> +
> +	q_num = rte_vhost_get_vring_num(internal->vid);
> +
> +	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
> +	if (epfd < 0) {
> +		DRV_LOG(ERR, "failed to create epoll instance.");
> +		return NULL;
> +	}
> +	internal->epfd = epfd;
> +
> +	for (qid = 0; qid < q_num; qid++) {
> +		ev.events = EPOLLIN | EPOLLPRI;
> +		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
> +		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
> +		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
> +			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
> +			return NULL;
> +		}
> +	}
> +
> +	for (;;) {
> +		nfds = epoll_wait(epfd, events, q_num, -1);
> +		if (nfds < 0) {
> +			if (errno == EINTR)
> +				continue;
> +			DRV_LOG(ERR, "epoll_wait return fail\n");
> +			return NULL;
> +		}
> +
> +		for (i = 0; i < nfds; i++) {
> +			qid = events[i].data.u32;
> +			kickfd = (uint32_t)(events[i].data.u64 >> 32);
> +			do {
> +				nbytes = read(kickfd, &buf, 8);
> +				if (nbytes < 0) {
> +					if (errno == EINTR ||
> +					    errno == EWOULDBLOCK ||
> +					    errno == EAGAIN)
> +						continue;
> +					DRV_LOG(INFO, "Error reading "
> +						"kickfd: %s",
> +						strerror(errno));
> +				}
> +				break;
> +			} while (1);
> +
> +			ifcvf_notify_queue(hw, qid);
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static int
> +setup_notify_relay(struct ifcvf_internal *internal)
> +{
> +	int ret;
> +
> +	ret = pthread_create(&internal->tid, NULL, notify_relay,
> +			(void *)internal);
> +	if (ret) {
> +		DRV_LOG(ERR, "failed to create notify relay pthread.");
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static int
> +unset_notify_relay(struct ifcvf_internal *internal)
> +{
> +	void *status;
> +
> +	if (internal->tid) {
> +		pthread_cancel(internal->tid);
> +		pthread_join(internal->tid, &status);
> +	}
> +	internal->tid = 0;
> +
> +	if (internal->epfd >= 0)
> +		close(internal->epfd);
> +	internal->epfd = -1;
> +
> +	return 0;
> +}
> +
> +static int
> +update_datapath(struct ifcvf_internal *internal)
> +{
> +	int ret;
> +
> +	rte_spinlock_lock(&internal->lock);
> +
> +	if (!rte_atomic32_read(&internal->running) &&
> +	    (rte_atomic32_read(&internal->started) &&
> +	     rte_atomic32_read(&internal->dev_attached))) {
> +		ret = ifcvf_dma_map(internal);
> +		if (ret)
> +			goto err;
> +
> +		ret = vdpa_enable_vfio_intr(internal);
> +		if (ret)
> +			goto err;
> +
> +		ret = setup_notify_relay(internal);
> +		if (ret)
> +			goto err;
> +
> +		ret = vdpa_ifcvf_start(internal);
> +		if (ret)
> +			goto err;
> +
> +		rte_atomic32_set(&internal->running, 1);
> +	} else if (rte_atomic32_read(&internal->running) &&
> +		   (!rte_atomic32_read(&internal->started) ||
> +		    !rte_atomic32_read(&internal->dev_attached))) {
> +		vdpa_ifcvf_stop(internal);
> +
> +		ret = unset_notify_relay(internal);
> +		if (ret)
> +			goto err;
> +
> +		ret = vdpa_disable_vfio_intr(internal);
> +		if (ret)
> +			goto err;
> +
> +		ret = ifcvf_dma_unmap(internal);
> +		if (ret)
> +			goto err;
> +
> +		rte_atomic32_set(&internal->running, 0);
> +	}
> +
> +	rte_spinlock_unlock(&internal->lock);
> +	return 0;
> +err:
> +	rte_spinlock_unlock(&internal->lock);
> +	return ret;
> +}
> +
> +static int
> +ifcvf_dev_config(int vid)
> +{
> +	int did;
> +	struct internal_list *list;
> +	struct ifcvf_internal *internal;
> +
> +	did = rte_vhost_get_vdpa_did(vid);
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	internal = list->internal;
> +	internal->vid = vid;
> +	rte_atomic32_set(&internal->dev_attached, 1);
> +	update_datapath(internal);
> +
> +	return 0;
> +}
> +
> +static int
> +ifcvf_dev_close(int vid)
> +{
> +	int did;
> +	struct internal_list *list;
> +	struct ifcvf_internal *internal;
> +
> +	did = rte_vhost_get_vdpa_did(vid);
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	internal = list->internal;
> +	rte_atomic32_set(&internal->dev_attached, 0);
> +	update_datapath(internal);
> +
> +	return 0;
> +}
> +
> +static int
> +ifcvf_feature_set(int vid)
> +{
> +	uint64_t features;
> +	int did;
> +	struct internal_list *list;
> +	struct ifcvf_internal *internal;
> +	uint64_t log_base, log_size;
> +
> +	did = rte_vhost_get_vdpa_did(vid);
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	internal = list->internal;
> +	rte_vhost_get_negotiated_features(internal->vid, &features);
> +
> +	if (RTE_VHOST_NEED_LOG(features)) {
> +		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
> +		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
> +		ifcvf_enable_logging(&internal->hw, log_base, log_size);
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +ifcvf_get_vfio_group_fd(int vid)
> +{
> +	int did;
> +	struct internal_list *list;
> +
> +	did = rte_vhost_get_vdpa_did(vid);
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	return list->internal->vfio_group_fd;
> +}
> +
> +static int
> +ifcvf_get_vfio_device_fd(int vid)
> +{
> +	int did;
> +	struct internal_list *list;
> +
> +	did = rte_vhost_get_vdpa_did(vid);
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	return list->internal->vfio_dev_fd;
> +}
> +
> +static int
> +ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
> +{
> +	int did;
> +	struct internal_list *list;
> +	struct ifcvf_internal *internal;
> +	struct vfio_region_info reg = { .argsz = sizeof(reg) };
> +	int ret;
> +
> +	did = rte_vhost_get_vdpa_did(vid);
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	internal = list->internal;
> +
> +	reg.index = ifcvf_get_notify_region(&internal->hw);
> +	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
> +	if (ret) {
> +		DRV_LOG(ERR, "Get not get device region info: %s",
> +				strerror(errno));
> +		return -1;
> +	}
> +
> +	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
> +	*size = 0x1000;
> +
> +	return 0;
> +}
> +
> +static int
> +ifcvf_get_queue_num(int did, uint32_t *queue_num)
> +{
> +	struct internal_list *list;
> +
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	*queue_num = list->internal->max_queues;
> +
> +	return 0;
> +}
> +
> +static int
> +ifcvf_get_vdpa_feature(int did, uint64_t *features)
> +{
> +	struct internal_list *list;
> +
> +	list = find_internal_resource_by_did(did);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device id: %d", did);
> +		return -1;
> +	}
> +
> +	*features = list->internal->features;
> +
> +	return 0;
> +}
> +
> +#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
> +		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK)
> +static int
> +ifcvf_get_protocol_feature(int did __rte_unused, uint64_t *features)
> +{
> +	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
> +	return 0;
> +}
> +
> +struct rte_vdpa_dev_ops ifcvf_ops = {
> +	.queue_num_get = ifcvf_get_queue_num,
> +	.feature_get = ifcvf_get_vdpa_feature,
> +	.protocol_feature_get = ifcvf_get_protocol_feature,

I have proposed in the vDPA series to rename the ops so that they are
consistent with the Vhost-user protocol:
e.g. get_protocol_features, get_features...

So you might have to rebase if this change is implemented.

> +	.dev_conf = ifcvf_dev_config,
> +	.dev_close = ifcvf_dev_close,
> +	.vring_state_set = NULL,
> +	.feature_set = ifcvf_feature_set,
> +	.migration_done = NULL,
> +	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
> +	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
> +	.get_notify_area = ifcvf_get_notify_area,
> +};
> +
> +static int
> +ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
> +		struct rte_pci_device *pci_dev)
> +{
> +	uint64_t features;
> +	struct ifcvf_internal *internal = NULL;
> +	struct internal_list *list = NULL;
> +
> +	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
> +		return 0;
> +
> +	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
> +	if (list == NULL)
> +		goto error;
> +
> +	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
> +	if (internal == NULL)
> +		goto error;
> +
> +	internal->pdev = pci_dev;
> +	rte_spinlock_init(&internal->lock);
> +	if (ifcvf_vfio_setup(internal) < 0)
> +		return -1;
> +
> +	internal->max_queues = IFCVF_MAX_QUEUES;
> +	features = ifcvf_get_features(&internal->hw);
> +	internal->features = (features &
> +		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
> +		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES);
> +
> +	internal->dev_addr.pci_addr = pci_dev->addr;
> +	internal->dev_addr.type = PCI_ADDR;
> +	list->internal = internal;
> +
> +	pthread_mutex_lock(&internal_list_lock);
> +	TAILQ_INSERT_TAIL(&internal_list, list, next);
> +	pthread_mutex_unlock(&internal_list_lock);
> +
> +	if (rte_vdpa_register_device(&internal->dev_addr,
> +				&ifcvf_ops) < 0)
> +		goto error;
> +
> +	rte_atomic32_set(&internal->started, 1);
> +	update_datapath(internal);
> +
> +	return 0;
> +
> +error:
> +	rte_free(list);
> +	rte_free(internal);
> +	return -1;
> +}
> +
> +static int
> +ifcvf_pci_remove(struct rte_pci_device *pci_dev)
> +{
> +	struct ifcvf_internal *internal;
> +	struct internal_list *list;
> +
> +	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
> +		return 0;
> +
> +	list = find_internal_resource_by_dev(pci_dev);
> +	if (list == NULL) {
> +		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
> +		return -1;
> +	}
> +
> +	internal = list->internal;
> +	rte_atomic32_set(&internal->started, 0);
> +	update_datapath(internal);
> +
> +	rte_pci_unmap_device(internal->pdev);
> +	rte_vfio_destroy_container(internal->vfio_container_fd);
> +	rte_vdpa_unregister_device(internal->did);
> +
> +	pthread_mutex_lock(&internal_list_lock);
> +	TAILQ_REMOVE(&internal_list, list, next);
> +	pthread_mutex_unlock(&internal_list_lock);
> +
> +	rte_free(list);
> +	rte_free(internal);
> +
> +	return 0;
> +}
> +
> +/*
> + * The set of PCI devices this driver supports.
> + */
> +static const struct rte_pci_id pci_id_ifcvf_map[] = {
> +	{ .class_id = RTE_CLASS_ANY_ID,
> +	  .vendor_id = IFCVF_VENDOR_ID,
> +	  .device_id = IFCVF_DEVICE_ID,
> +	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
> +	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
> +	},
> +
> +	{ .vendor_id = 0, /* sentinel */
> +	},
> +};
> +
> +static struct rte_pci_driver rte_ifcvf_vdpa = {
> +	.driver = {
> +		.name = "net_ifcvf",
> +	},
> +	.id_table = pci_id_ifcvf_map,
> +	.drv_flags = 0,
> +	.probe = ifcvf_pci_probe,
> +	.remove = ifcvf_pci_remove,
> +};
> +
> +RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
> +RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
> +RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
> +
> +RTE_INIT(ifcvf_vdpa_init_log);
> +static void
> +ifcvf_vdpa_init_log(void)
> +{
> +	ifcvf_vdpa_logtype = rte_log_register("net.ifcvf_vdpa");
> +	if (ifcvf_vdpa_logtype >= 0)
> +		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
> +}
> diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
> new file mode 100644
> index 000000000..9b9ab1a4c
> --- /dev/null
> +++ b/drivers/net/ifc/rte_ifcvf_version.map
> @@ -0,0 +1,4 @@
> +DPDK_18.05 {
> +
> +	local: *;
> +};
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 3eb41d176..46f76146e 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -171,6 +171,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
>   ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
> +ifeq ($(CONFIG_RTE_EAL_VFIO),y)
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA)     += -lrte_ifcvf_vdpa
> +endif # $(CONFIG_RTE_EAL_VFIO)
>   endif # $(CONFIG_RTE_LIBRTE_VHOST)
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
>   
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 4/4] net/ifcvf: add driver document and release note
  2018-03-31  2:29         ` [PATCH v3 4/4] net/ifcvf: add " Xiao Wang
@ 2018-03-31 11:28           ` Maxime Coquelin
  0 siblings, 0 replies; 98+ messages in thread
From: Maxime Coquelin @ 2018-03-31 11:28 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov



On 03/31/2018 04:29 AM, Xiao Wang wrote:
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>   doc/guides/nics/features/ifcvf.ini     |  8 ++++
>   doc/guides/nics/ifcvf.rst              | 85 ++++++++++++++++++++++++++++++++++
>   doc/guides/nics/index.rst              |  1 +
>   doc/guides/rel_notes/release_18_05.rst |  9 ++++
>   4 files changed, 103 insertions(+)
>   create mode 100644 doc/guides/nics/features/ifcvf.ini
>   create mode 100644 doc/guides/nics/ifcvf.rst
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks!
Maxime

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode
  2018-03-31 11:13           ` Maxime Coquelin
@ 2018-03-31 13:16             ` Thomas Monjalon
  2018-04-02  4:08               ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2018-03-31 13:16 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Xiao Wang, ferruh.yigit, dev, zhihong.wang, yliu, jianfeng.tan,
	tiwei.bie, cunming.liang, dan.daly, gaetan.rivet,
	anatoly.burakov

Hi,

31/03/2018 13:13, Maxime Coquelin:
> On 03/31/2018 04:29 AM, Xiao Wang wrote:
> > If we want a virtio device to work in vDPA (vhost data path acceleration)
> > mode, we could add a "vdpa=1" devarg for this device to specify the mode.
> > 
> > This patch let virtio pmd skip device probe when detecting this parameter.
> 
> As we discussed, I would prefer a generic solution at EAL level.

Please could you explain the requirement and the context?
Can we use RTE_ETH_DEV_DEFERRED state and device ownership?

Without knowing what's behind, I would say that a PMD should
never skip a device by itself, but let other entities decide
what to do with the probed device (thanks to probe notifications).

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode
  2018-03-31 13:16             ` Thomas Monjalon
@ 2018-04-02  4:08               ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-02  4:08 UTC (permalink / raw)
  To: Thomas Monjalon, Maxime Coquelin
  Cc: Yigit, Ferruh, dev, Wang, Zhihong, yliu, Tan, Jianfeng, Bie,
	Tiwei, Liang, Cunming, Daly, Dan, gaetan.rivet, Burakov, Anatoly

Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Saturday, March 31, 2018 9:16 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; dev@dpdk.org; Wang, Zhihong
> <zhihong.wang@intel.com>; yliu@fridaylinux.org; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>; Daly, Dan <dan.daly@intel.com>;
> gaetan.rivet@6wind.com; Burakov, Anatoly <anatoly.burakov@intel.com>
> Subject: Re: [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode
> 
> Hi,
> 
> 31/03/2018 13:13, Maxime Coquelin:
> > On 03/31/2018 04:29 AM, Xiao Wang wrote:
> > > If we want a virtio device to work in vDPA (vhost data path acceleration)
> > > mode, we could add a "vdpa=1" devarg for this device to specify the mode.
> > >
> > > This patch let virtio pmd skip device probe when detecting this parameter.
> >
> > As we discussed, I would prefer a generic solution at EAL level.
> 
> Please could you explain the requirement and the context?
> Can we use RTE_ETH_DEV_DEFERRED state and device ownership?
> 
> Without knowing what's behind, I would say that a PMD should
> never skip a device by itself, but let other entities decide
> what to do with the probed device (thanks to probe notifications).
> 

IFCVF's vendor ID and device ID are the same as those of the virtio net pci
device, but it has its own subsystem vendor ID and device ID. The context is
that IFCVF can be driven by both the virtio pmd and the IFCVF driver, so we add
this devarg to specify whether we want the device to work in vDPA mode or not.
For a vdpa-mode IFCVF, the virtio pmd should not take over it, so we let it
skip the device.

BRs,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-03-31 11:26           ` Maxime Coquelin
@ 2018-04-03  9:38             ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-03  9:38 UTC (permalink / raw)
  To: Maxime Coquelin, Yigit, Ferruh
  Cc: dev, Wang, Zhihong, yliu, Tan, Jianfeng, Bie, Tiwei, Liang,
	Cunming, Daly, Dan, thomas, gaetan.rivet, Burakov, Anatoly, Xu,
	Rosen

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Saturday, March 31, 2018 7:27 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>;
> yliu@fridaylinux.org; Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei
> <tiwei.bie@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> <dan.daly@intel.com>; thomas@monjalon.net; gaetan.rivet@6wind.com;
> Burakov, Anatoly <anatoly.burakov@intel.com>; Xu, Rosen
> <rosen.xu@intel.com>
> Subject: Re: [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver
> 
> 
> 
> On 03/31/2018 04:29 AM, Xiao Wang wrote:
> > The IFCVF vDPA (vhost data path acceleration) driver provides support for
> > the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible,
> > it works as a HW vhost backend which can send/receive packets to/from
> > virtio directly by DMA.
> >
> > Different VF devices serve different virtio frontends which are in
> > different VMs, so each VF needs to have its own DMA address translation
> > service. During the driver probe a new container is created, with this
> > container vDPA driver can program DMA remapping table with the VM's
> memory
> > region information.
> >
> > Key vDPA driver ops implemented:
> >
> > - ifcvf_dev_config:
> >    Enable VF data path with virtio information provided by vhost lib,
> >    including IOMMU programming to enable VF DMA to VM's memory, VFIO
> >    interrupt setup to route HW interrupt to virtio driver, create notify
> >    relay thread to translate virtio driver's kick to a MMIO write onto HW,
> >    HW queues configuration.
> >
> > - ifcvf_dev_close:
> >    Revoke all the setup in ifcvf_dev_config.
> >
> > Live migration feature is supported by IFCVF and this driver enables
> > it. For the dirty page logging, VF helps to log for packet buffer write,
> > driver helps to make the used ring as dirty when device stops.
> >
> > Because vDPA driver needs to set up MSI-X vector to interrupt the
> > guest, only vfio-pci is supported currently.
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> > ---
> >   config/common_base                    |   7 +
> >   config/common_linuxapp                |   1 +
> >   drivers/net/Makefile                  |   3 +
> >   drivers/net/ifc/Makefile              |  36 ++
> >   drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
> >   drivers/net/ifc/base/ifcvf.h          | 160 +++++++
> >   drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
> >   drivers/net/ifc/ifcvf_vdpa.c          | 842
> ++++++++++++++++++++++++++++++++++
> >   drivers/net/ifc/rte_ifcvf_version.map |   4 +
> >   mk/rte.app.mk                         |   3 +
> >   10 files changed, 1437 insertions(+)
> >   create mode 100644 drivers/net/ifc/Makefile
> >   create mode 100644 drivers/net/ifc/base/ifcvf.c
> >   create mode 100644 drivers/net/ifc/base/ifcvf.h
> >   create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
> >   create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
> >   create mode 100644 drivers/net/ifc/rte_ifcvf_version.map
> 
> Thanks for having handled the changes, please see minor comments below.
> 
> Feel free to add my:
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> 
> Thanks!
> Maxime
> 
> > +static uint64_t
> > +qva_to_gpa(int vid, uint64_t qva)
> 
> We might want to have this in vhost-lib to avoid duplication,
> but that can be done later.
> 
> > +{
> > +	struct rte_vhost_memory *mem = NULL;
> > +	struct rte_vhost_mem_region *reg;
> > +	uint32_t i;
> > +	uint64_t gpa = 0;
> > +
> > +	if (rte_vhost_get_mem_table(vid, &mem) < 0)
> > +		goto exit;
> > +
> > +	for (i = 0; i < mem->nregions; i++) {

[...]

> > +
> > +struct rte_vdpa_dev_ops ifcvf_ops = {
> > +	.queue_num_get = ifcvf_get_queue_num,
> > +	.feature_get = ifcvf_get_vdpa_feature,
> > +	.protocol_feature_get = ifcvf_get_protocol_feature,
> 
> I have proposed in vDPA series to rename the ops so that it is
> consistent with the Vhost-user protocol:
> e.g. get_protocol_features, get_features...
> 
> So you might have to rebase if this change is implemented.
> 

Will rebase on the latest vDPA series.

Thanks,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v4 0/4] add ifcvf vdpa driver
  2018-03-31  2:29         ` [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
  2018-03-31 11:26           ` Maxime Coquelin
@ 2018-04-04 14:40           ` Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 1/4] eal/vfio: add multiple container support Xiao Wang
                               ` (3 more replies)
  1 sibling, 4 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-04 14:40 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, hemant.agrawal,
	Xiao Wang

This patch set has dependency on http://dpdk.org/dev/patchwork/patch/36772/
(vhost: support selective datapath).

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible, it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. Besides, it supports dirty page logging and device state
report/restore. This driver enables its vDPA functionality with live migration
feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net pci
device, but it has its own subsystem vendor ID and device ID. To let the device
be probed by the IFCVF driver, add a "vdpa=1" devarg to specify that this
device is to be used in vDPA mode rather than polling mode; the virtio pmd will
skip the device when it detects this parameter.
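
For example, the devarg can be passed together with the EAL PCI whitelist
option (the VF address below is only a placeholder):

  -w 0000:06:00.3,vdpa=1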

Container per device
====================
vDPA needs to create different containers for different devices, thus this
patch set adds some APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_create_container
- rte_vfio_destroy_container
- rte_vfio_bind_group
- rte_vfio_unbind_group

With this extension, a device can be put into its own container rather than
the single default container.
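
As a rough usage sketch of these APIs (a hypothetical helper, error handling
trimmed; the iommu group number and the memseg are assumed to come from the
caller's probe path, and RTE_VFIO_TYPE1 is assumed to be the type-1 IOMMU
dma_type visible to the caller):

#include <rte_vfio.h>

static int
setup_private_container(int iommu_group_no, const struct rte_memseg *ms)
{
	int container_fd, group_fd;

	/* new per-device container instead of the default one */
	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	/* bind the device's iommu group; returns the group fd */
	group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);
	if (group_fd < 0)
		goto err;

	/* rte_vfio_setup_device() will then use the bound container;
	 * DMA-map the guest memory region through it.
	 */
	if (rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, ms) < 0)
		goto err;

	return container_fd;

err:
	/* destroying the container also unbinds any groups still in it */
	rte_vfio_destroy_container(container_fd);
	return -1;
}

rte_vfio_dma_unmap(), rte_vfio_unbind_group() and rte_vfio_destroy_container()
undo the setup when the device is released.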

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib, including
  IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
  route HW interrupt to virtio driver, create notify relay thread to translate
  virtio driver's kick to an MMIO write onto HW (see the relay sketch after
  this list), HW queues configuration.

  This function gets called to set up HW data path backend when virtio driver
  in VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when virtio driver stops device in VM.
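
To illustrate the notify relay idea above, here is a hypothetical sketch for a
single queue (the real driver services all queues from one relay thread; the
kick eventfd is assumed to come from the vhost lib, and ifcvf_notify_queue()
is the doorbell helper from the base code):

#include <stdint.h>
#include <unistd.h>
#include "base/ifcvf.h"

static void
relay_one_queue(struct ifcvf_hw *hw, int kick_fd, uint16_t qid)
{
	uint64_t buf;

	for (;;) {
		/* block until the guest's virtio driver kicks the queue */
		if (read(kick_fd, &buf, sizeof(buf)) != sizeof(buf))
			continue;
		/* turn the kick into an MMIO doorbell write on the VF */
		ifcvf_notify_queue(hw, qid);
	}
}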

Change log
==========
v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
- Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
- Align the internal vfio_cfg search APIs naming.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept, make the driver a PCI driver; it will get probed
  by the PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of an engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let virtio pmd skip when a virtio device needs to work in vDPA mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (4):
  eal/vfio: add multiple container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  85 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  36 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 840 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  52 +-
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 521 +++++++++++++++----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 20 files changed, 2174 insertions(+), 101 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v4 1/4] eal/vfio: add multiple container support
  2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
@ 2018-04-04 14:40             ` Xiao Wang
  2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-04-04 14:40 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, hemant.agrawal,
	Xiao Wang, Junjie Chen

Currently the eal vfio framework binds a vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put a vfio group
into a separate container and program the IOMMU via this container.

This patch adds APIs to support creating a container and binding a
device's group to a container.

A driver can use the "rte_vfio_create_container" helper to create a
new container from eal, and "rte_vfio_bind_group" to bind a device's
iommu group to the newly created container.

During rte_vfio_setup_device, the container bound with the device
will be used for IOMMU setup.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
v4:
- Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
- Align the internal vfio_cfg search APIs naming.
---
 config/common_base                       |   1 +
 lib/librte_eal/bsdapp/eal/eal.c          |  52 ++-
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 521 +++++++++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 6 files changed, 593 insertions(+), 101 deletions(-)

diff --git a/config/common_base b/config/common_base
index 7abf7c6fc..2c40b2603 100644
--- a/config/common_base
+++ b/config/common_base
@@ -74,6 +74,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 
diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..76f3beb39 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -38,6 +38,7 @@
 #include <rte_interrupts.h>
 #include <rte_bus.h>
 #include <rte_dev.h>
+#include <rte_vfio.h>
 #include <rte_devargs.h>
 #include <rte_version.h>
 #include <rte_atomic.h>
@@ -738,15 +739,6 @@ rte_eal_vfio_intr_mode(void)
 /* dummy forward declaration. */
 struct vfio_device_info;
 
-/* dummy prototypes. */
-int rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
-		int *vfio_dev_fd, struct vfio_device_info *device_info);
-int rte_vfio_release_device(const char *sysfs_base, const char *dev_addr, int fd);
-int rte_vfio_enable(const char *modname);
-int rte_vfio_is_enabled(const char *modname);
-int rte_vfio_noiommu_is_enabled(void);
-int rte_vfio_clear_group(int vfio_group_fd);
-
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
 		      __rte_unused int *vfio_dev_fd,
@@ -781,3 +773,45 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index 249095e46..b6eb7bdb4 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -32,6 +32,8 @@
 extern "C" {
 #endif
 
+struct rte_memseg;
+
 /**
  * Setup vfio_cfg for the device identified by its address.
  * It discovers the configured I/O MMU groups or sets a new one for the device.
@@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
 }
 #endif
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind a group number to container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind a group from specified container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma mapping for devices in the specified container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *   the dma type for mapping
+ *
+ * @param ms
+ *   the dma address region to map
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma unmapping for devices in the specified container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *    the dma map type
+ *
+ * @param ms
+ *   the dma address region to unmap
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
+
 #endif /* VFIO_PRESENT */
 
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..1685745ac 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,348 @@ vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_no(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg;
+		}
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+	vfio_cfg = vfio_cfgs[i];
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+
+	if (vfio_cfg->vfio_container_fd < 0) {
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+	}
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[idx];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	cur_vfio_cfg = vfio_cfgs[i];
+	/* Check room for new group */
+	if (cur_vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (cur_vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	cur_vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	cur_vfio_cfg = vfio_cfgs[i];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (cur_vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	cur_vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
-		if (i < 0)
+		if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
+			RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		}
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,6 +515,8 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
 	int ret;
@@ -309,12 +565,14 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +589,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfg->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +603,13 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +653,7 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +721,9 @@ rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +745,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +762,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +930,80 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	int ret;
+	struct vfio_iommu_type1_dma_map dma_map;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
 
-		if (ms[i].addr == NULL)
-			break;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	return 0;
+}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
-			return -1;
-		}
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)
+{
+	int ret;
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1147,37 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not supported yet.");
+			return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not supported yet.");
+			return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..23a1e3608 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -86,6 +86,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index f331f54c9..fcf9494d1 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -255,5 +255,11 @@ EXPERIMENTAL {
 	rte_service_set_runstate_mapped_check;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_vfio_bind_group;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_unbind_group;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v4 2/4] net/virtio: skip device probe in vdpa mode
  2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 1/4] eal/vfio: add multiple container support Xiao Wang
@ 2018-04-04 14:40             ` Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 4/4] doc: add ifcvf driver document and release note Xiao Wang
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-04 14:40 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, hemant.agrawal,
	Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we could add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio pmd skip device probe when it detects this parameter.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 2ef213d1a..afb096804 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -28,6 +28,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1708,9 +1709,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v4 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 1/4] eal/vfio: add multiple container support Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-04 14:40             ` Xiao Wang
  2018-04-04 14:40             ` [PATCH v4 4/4] doc: add ifcvf driver document and release note Xiao Wang
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-04 14:40 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, hemant.agrawal,
	Xiao Wang, Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible,
it works as a HW vhost backend which can send/receive packets to/from
virtio directly by DMA.

Different VF devices serve different virtio frontends which are in
different VMs, so each VF needs to have its own DMA address translation
service. During the driver probe a new container is created; with this
container the vDPA driver can program the DMA remapping table with the VM's
memory region information.

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib,
  including IOMMU programming to enable VF DMA to VM's memory, VFIO
  interrupt setup to route HW interrupt to virtio driver, create notify
  relay thread to translate virtio driver's kick to a MMIO write onto HW,
  HW queues configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

The live migration feature is supported by IFCVF and this driver enables
it. For dirty page logging, the VF helps to log packet buffer writes, and the
driver marks the used ring as dirty when the device stops.

Because vDPA driver needs to set up MSI-X vector to interrupt the
guest, only vfio-pci is supported currently.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  36 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 840 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1435 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index 2c40b2603..5d4f9e75c 100644
--- a/config/common_base
+++ b/config/common_base
@@ -796,6 +796,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index d0437e5d6..e88e20f02 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 37ca19aa7..3fa51cca3 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -57,6 +57,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..f08fcaad8
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->dev_cfg,
+			hw->isr, hw->notify_base,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)ring_state >> 16;
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if no resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..bafd42153
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,840 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+#include <eal_vfio.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	internal->vfio_container_fd = rte_vfio_create_container();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	internal->vfio_group_fd = rte_vfio_bind_group(
+			internal->vfio_container_fd, iommu_group_no);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx.", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_map(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_unmap(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for only packet buffer,
+		 * SW helps to mark the used ring as dirty after device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			pfn = hw->vring[i].used / 4096;
+			for (j = 0; j <= hw->vring[i].size * 8 / 4096; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_features_set(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Get not get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_features(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK | \
+		 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+static int
+ifcvf_get_protocol_features(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.get_queue_num = ifcvf_get_queue_num,
+	.get_features = ifcvf_get_vdpa_features,
+	.get_protocol_features = ifcvf_get_protocol_features,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.set_vring_state = NULL,
+	.set_features = ifcvf_features_set,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		goto error;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (rte_vdpa_register_device(&internal->dev_addr,
+				&ifcvf_ops) < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * The set of PCI devices this driver supports.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.driver = {
+		.name = "net_ifcvf",
+	},
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index a9b4b0502..65f28cc1c 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -181,6 +181,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA)     += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v4 4/4] doc: add ifcvf driver document and release note
  2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
                               ` (2 preceding siblings ...)
  2018-04-04 14:40             ` [PATCH v4 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-04 14:40             ` Xiao Wang
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-04 14:40 UTC (permalink / raw)
  To: ferruh.yigit, maxime.coquelin
  Cc: dev, zhihong.wang, yliu, jianfeng.tan, tiwei.bie, cunming.liang,
	dan.daly, thomas, gaetan.rivet, anatoly.burakov, hemant.agrawal,
	Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 ++++
 doc/guides/nics/ifcvf.rst              | 85 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 103 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..5d82bd25e
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,85 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
+works as a HW vhost backend which can send/receive packets to/from virtio
+directly by DMA. It also supports dirty page logging and device state
+report/restore. This driver enables the device's vDPA functionality together
+with the live migration feature.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, but it has its own subsystem vendor ID and device ID. To let the
+device be probed by the IFCVF driver, the "vdpa=1" devarg specifies that this
+device is to be used in vDPA mode rather than polling mode; the virtio PMD
+skips such a device when it detects this parameter.
+
+Different VF devices serve different virtio frontends which are in different
+VMs, so each VF needs to have its own DMA address translation service. During
+driver probe a new container is created for the device; with this container
+the vDPA driver can program the DMA remapping table with the VM's memory
+region information.
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable the VF data path with the virtio information provided by the vhost
+  lib: IOMMU programming to enable VF DMA to the VM's memory, VFIO interrupt
+  setup to route HW interrupts to the virtio driver, creation of a notify
+  relay thread to translate the virtio driver's kicks into MMIO writes onto
+  HW, and HW queue configuration.
+
+  This function gets called to set up HW data path backend when virtio driver
+  in VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup in ifcvf_dev_config.
+
+  This function gets called when virtio driver stops device in VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via
+  vhost API. When QEMU vhost connection gets ready, the assigned VF will
+  get configured automatically.
+
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- Platform with IOMMU feature. IFC VF needs the address translation service to
+  Rx/Tx directly with the virtio driver in the VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up VF MSI-X interrupts; each queue's interrupt
+vector is mapped to a callfd associated with a virtio ring. Currently only
+vfio-pci allows multiple interrupts, so the IFCVF driver depends on vfio-pci.
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+IFC VF does not support RARP packet generation; a virtio frontend supporting
+the VIRTIO_NET_F_GUEST_ANNOUNCE feature can help to do that.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 51c453d9c..a294ab389 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -44,6 +44,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index 9cc77f893..c3d996fdc 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -58,6 +58,15 @@ New Features
   * Added support for NVGRE, VXLAN and GENEVE filters in flow API.
   * Added support for DROP action in flow API.
 
+* **Added IFCVF vDPA driver.**
+
+  Added the IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and
+  is compatible with virtio 0.95 and 1.0. The driver registers the ifcvf vDPA
+  driver with the vhost lib; when virtio is connected, the registered vDPA
+  driver configures the assigned VF to Rx/Tx directly to the VM's virtio
+  vrings.
+
 
 API Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v5 0/4] add ifcvf vdpa driver
  2018-04-04 14:40             ` [PATCH v4 1/4] eal/vfio: add multiple container support Xiao Wang
@ 2018-04-05 18:06               ` Xiao Wang
  2018-04-05 18:06                 ` [PATCH v5 1/4] eal/vfio: add multiple container support Xiao Wang
                                   ` (4 more replies)
  0 siblings, 5 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-05 18:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang

This patch set has dependency on http://dpdk.org/dev/patchwork/patch/36772/
(vhost: support selective datapath).

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. It also supports dirty page logging and device state
report/restore. This driver enables the device's vDPA functionality together
with the live migration feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
device, but it has its own subsystem vendor ID and device ID. To let the
device be probed by the IFCVF driver, the "vdpa=1" devarg specifies that this
device is to be used in vDPA mode rather than polling mode; the virtio PMD
skips such a device when it detects this parameter.

Container per device
====================
vDPA needs to create different containers for different devices, thus this
patch set adds some APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_create_container
- rte_vfio_destroy_container
- rte_vfio_bind_group
- rte_vfio_unbind_group

By this extension, a device can be put into a new specific container, rather
than the previous default container.
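
A minimal sketch of how a driver could use these helpers (names and
signatures as added in patch 1 of this series, error handling trimmed;
this is illustration only, not part of the patches):

  #include <rte_vfio.h>

  static int
  setup_device_container(int iommu_group_no)
  {
  	int container_fd, group_fd;

  	/* create a container separate from the default one used by EAL */
  	container_fd = rte_vfio_create_container();
  	if (container_fd < 0)
  		return -1;

  	/* move the device's IOMMU group into the new container */
  	group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);
  	if (group_fd < 0) {
  		rte_vfio_destroy_container(container_fd);
  		return -1;
  	}

  	/*
  	 * rte_vfio_setup_device() (e.g. called via rte_pci_map_device())
  	 * will reuse this container; DMA mappings are programmed later
  	 * with rte_vfio_dma_map().
  	 */
  	return container_fd;
  }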

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable the VF data path with the virtio information provided by the vhost
  lib: IOMMU programming to enable VF DMA to the VM's memory, VFIO interrupt
  setup to route HW interrupts to the virtio driver, creation of a notify
  relay thread to translate the virtio driver's kicks into MMIO writes onto
  HW, and HW queue configuration.

  This function gets called to set up HW data path backend when virtio driver
  in VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when virtio driver stops device in VM.
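
ifcvf_dev_config gets invoked once an application has created a vhost-user
socket and assigned the VF's vDPA device id to it. A hedged sketch of that
application-side setup is below; it uses the vhost/vDPA APIs from the
selective datapath series this set depends on (rte_vdpa_find_device_id,
rte_vhost_driver_attach_vdpa_device), so the exact names may differ:

  #include <rte_vhost.h>
  #include <rte_vdpa.h>

  static int
  create_ifcvf_vhost_port(const char *path, const struct rte_pci_addr *vf)
  {
  	struct rte_vdpa_dev_addr addr = {
  		.type = PCI_ADDR,
  		.pci_addr = *vf,
  	};
  	/* device id registered by the ifcvf PMD at probe time */
  	int did = rte_vdpa_find_device_id(&addr);

  	if (did < 0)
  		return -1;

  	if (rte_vhost_driver_register(path, 0) < 0)
  		return -1;

  	/* assign the VF's vDPA device id to this vhost-user socket */
  	if (rte_vhost_driver_attach_vdpa_device(path, did) < 0)
  		return -1;

  	return rte_vhost_driver_start(path);
  }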

Change log
==========
v5:
- Fix compilation in BSD, remove the rte_vfio.h including in BSD.

v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
- Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
- Align the vfio_cfg search internal APIs naming.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept, make the driver as a PCI driver, it will get probed
  by PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of a engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let virtio pmd skips when a virtio device needs to work in vDPA mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic cross Linux and BSD, make it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (4):
  eal/vfio: add multiple container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  85 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  36 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 840 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  50 ++
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 522 +++++++++++++++----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 20 files changed, 2182 insertions(+), 92 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v5 1/4] eal/vfio: add multiple container support
  2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
@ 2018-04-05 18:06                 ` Xiao Wang
  2018-04-05 18:06                 ` [PATCH v5 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-05 18:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang, Junjie Chen

Currently eal vfio framework binds vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put the vfio group
into a separate container and program the IOMMU via this container.

This patch adds some APIs to support container creating and device
binding with a container.

A driver could use "rte_vfio_create_container" helper to create a
new container from eal, use "rte_vfio_bind_group" to bind a device
to the newly created container.

During rte_vfio_setup_device, the container bound with the device
will be used for IOMMU setup.
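
Note that for a driver-created container the default memory segments are
not mapped automatically (only the default container gets the DMA map in
rte_vfio_setup_device), so the driver programs the IOMMU itself with
rte_vfio_dma_map. A rough sketch, mirroring what the ifcvf driver later in
this series does with the guest memory table (VFIO_TYPE1_IOMMU comes from
linux/vfio.h):

  #include <stdlib.h>
  #include <linux/vfio.h>
  #include <rte_vfio.h>
  #include <rte_vhost.h>
  #include <rte_memory.h>

  static int
  map_guest_memory(int container_fd, int vid)
  {
  	struct rte_vhost_memory *mem = NULL;
  	uint32_t i;

  	if (rte_vhost_get_mem_table(vid, &mem) < 0)
  		return -1;

  	for (i = 0; i < mem->nregions; i++) {
  		struct rte_vhost_mem_region *reg = &mem->regions[i];
  		struct rte_memseg ms;

  		ms.addr_64 = reg->host_user_addr;	/* HVA */
  		ms.iova = reg->guest_phys_addr;		/* GPA used as IOVA */
  		ms.len = reg->size;
  		rte_vfio_dma_map(container_fd, VFIO_TYPE1_IOMMU, &ms);
  	}

  	free(mem);
  	return 0;
  }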

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 config/common_base                       |   1 +
 lib/librte_eal/bsdapp/eal/eal.c          |  50 +++
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 522 +++++++++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 6 files changed, 601 insertions(+), 92 deletions(-)

diff --git a/config/common_base b/config/common_base
index 7abf7c6fc..2c40b2603 100644
--- a/config/common_base
+++ b/config/common_base
@@ -74,6 +74,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 
diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..0a3d8783d 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -746,6 +746,14 @@ int rte_vfio_enable(const char *modname);
 int rte_vfio_is_enabled(const char *modname);
 int rte_vfio_noiommu_is_enabled(void);
 int rte_vfio_clear_group(int vfio_group_fd);
+int rte_vfio_create_container(void);
+int rte_vfio_destroy_container(int container_fd);
+int rte_vfio_bind_group(int container_fd, int iommu_group_no);
+int rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+int rte_vfio_dma_map(int container_fd, int dma_type,
+		const struct rte_memseg *ms);
+int rte_vfio_dma_unmap(int container_fd, int dma_type,
+		const struct rte_memseg *ms);
 
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
@@ -781,3 +789,45 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index 249095e46..9bb026703 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -32,6 +32,8 @@
 extern "C" {
 #endif
 
+struct rte_memseg;
+
 /**
  * Setup vfio_cfg for the device identified by its address.
  * It discovers the configured I/O MMU groups or sets a new one for the device.
@@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
 }
 #endif
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container for device binding.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind an IOMMU group to a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind an IOMMU group from a container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA mapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *   the dma map type
+ *
+ * @param ms
+ *   the dma address region to map
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA unmapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *    the dma map type
+ *
+ * @param ms
+ *   the dma address region to unmap
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
+
 #endif /* VFIO_PRESENT */
 
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..e474f6e9f 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_no(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg;
+		}
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+	vfio_cfg = vfio_cfgs[i];
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+
+	if (vfio_cfg->vfio_container_fd < 0) {
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+	}
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[idx];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[i];
+	/* Check room for new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[i];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
-		if (i < 0)
+		if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
+			RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		}
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,6 +516,8 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
 	int ret;
@@ -309,12 +566,14 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +590,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfg->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +604,13 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +654,7 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +722,9 @@ rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +746,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +763,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	int ret;
+	struct vfio_iommu_type1_dma_map dma_map;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
 
-		if (ms[i].addr == NULL)
-			break;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	return 0;
+}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
-			return -1;
-		}
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)
+{
+	int ret;
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1148,37 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..23a1e3608 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -86,6 +86,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index f331f54c9..fcf9494d1 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -255,5 +255,11 @@ EXPERIMENTAL {
 	rte_service_set_runstate_mapped_check;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_vfio_bind_group;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_unbind_group;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v5 2/4] net/virtio: skip device probe in vdpa mode
  2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
  2018-04-05 18:06                 ` [PATCH v5 1/4] eal/vfio: add multiple container support Xiao Wang
@ 2018-04-05 18:06                 ` Xiao Wang
  2018-04-11 18:58                   ` Ferruh Yigit
  2018-04-05 18:07                 ` [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
                                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-04-05 18:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we could add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio PMD skip device probe when it detects this parameter.

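As an illustrative usage sketch (not part of this patch): with the experimental
EAL hotplug API an application can attach such a device itself, or the same
devarg can be given on the command line as a whitelist option such as
"-w 0000:06:00.3,vdpa=1"; the PCI address below is a placeholder.

#include <rte_dev.h>

/* Attach a VF in vDPA mode; the PCI address is hypothetical. */
static int
attach_vf_in_vdpa_mode(void)
{
	/*
	 * With the "vdpa=1" devarg, eth_virtio_pci_probe() returns 1,
	 * leaving the device free to be claimed by a vDPA driver.
	 */
	return rte_eal_hotplug_add("pci", "0000:06:00.3", "vdpa=1");
}
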
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 2ef213d1a..afb096804 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -28,6 +28,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1708,9 +1709,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
  2018-04-05 18:06                 ` [PATCH v5 1/4] eal/vfio: add multiple container support Xiao Wang
  2018-04-05 18:06                 ` [PATCH v5 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-05 18:07                 ` Xiao Wang
  2018-04-11 18:58                   ` Ferruh Yigit
  2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
  2018-04-05 18:07                 ` [PATCH v5 " Xiao Wang
  2018-04-11 18:59                 ` [PATCH v5 0/4] add ifcvf vdpa driver Ferruh Yigit
  4 siblings, 2 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-05 18:07 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang, Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible;
it works as a HW vhost backend that can send/receive packets to/from
virtio directly by DMA.

Different VF devices serve different virtio frontends which are in
different VMs, so each VF needs to have its own DMA address translation
service. During the driver probe a new container is created; with this
container the vDPA driver can program the DMA remapping table with the VM's
memory region information.
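
Illustrative only (not part of this patch): the per-device flow maps onto the
new container APIs from patch 1/4 roughly as below; the IOMMU group number and
the memory region values are placeholders.

#include <string.h>
#include <stdint.h>
#include <rte_vfio.h>
#include <rte_memory.h>

/* Give one VF a private IOMMU domain and map one guest memory region. */
static int
map_region_in_private_container(int iommu_group_no,
		uint64_t hva, uint64_t gpa, uint64_t len)
{
	struct rte_memseg ms;
	int container_fd, group_fd;

	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);
	if (group_fd < 0) {
		rte_vfio_destroy_container(container_fd);
		return -1;
	}

	/* HVA -> GPA mapping so the VF can DMA directly into VM memory. */
	memset(&ms, 0, sizeof(ms));
	ms.addr_64 = hva;
	ms.iova = gpa;
	ms.len = len;
	return rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, &ms);
}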

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib,
  including IOMMU programming to enable VF DMA to VM's memory, VFIO
  interrupt setup to route HW interrupt to virtio driver, create notify
  relay thread to translate virtio driver's kick to an MMIO write onto HW,
  HW queues configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.
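
Both ops are driven by the vhost library once the application has tied a
vhost-user socket to this vDPA device. A hedged sketch of that wiring follows;
the API names come from the selective datapath vhost series this set depends
on, and the socket path/PCI address are placeholders.

#include <rte_vhost.h>
#include <rte_vdpa.h>
#include <rte_pci.h>

static int
bind_socket_to_ifcvf(const char *path, const struct rte_pci_addr *pci)
{
	struct rte_vdpa_dev_addr addr = {
		.type = PCI_ADDR,
		.pci_addr = *pci,
	};
	int did;

	/* Device id registered by ifcvf_pci_probe() at probe time. */
	did = rte_vdpa_find_device_id(&addr);
	if (did < 0)
		return -1;

	if (rte_vhost_driver_register(path, 0) < 0)
		return -1;

	/* From here on, dev_conf/dev_close follow the vhost connection. */
	if (rte_vhost_driver_attach_vdpa_device(path, did) < 0)
		return -1;

	return rte_vhost_driver_start(path);
}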

Live migration is supported by IFCVF and this driver enables it. For
dirty page logging, the VF logs packet buffer writes, and the driver
marks the used rings as dirty when the device stops.
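
A rough sketch of the used-ring part (mirroring vdpa_ifcvf_stop() in this
patch): the vhost dirty log is a bitmap with one bit per 4K guest page, and a
used ring of 'size' descriptors occupies roughly 8 * size bytes, so the driver
sets every bit covering that range. Names below are illustrative only.

#include <stdint.h>

#define LOG_PAGE_SIZE 4096

/* Mark every 4K page backing a used ring as dirty in the log bitmap. */
static void
mark_used_ring_dirty(uint8_t *log_buf, uint64_t used_gpa, uint16_t size)
{
	uint64_t page;
	uint64_t first = used_gpa / LOG_PAGE_SIZE;
	uint64_t last = (used_gpa + size * 8) / LOG_PAGE_SIZE;

	for (page = first; page <= last; page++)
		__sync_fetch_and_or(&log_buf[page / 8],
				(uint8_t)(1 << (page % 8)));
}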

Because vDPA driver needs to set up MSI-X vector to interrupt the
guest, only vfio-pci is supported currently.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  36 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 840 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1435 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index 2c40b2603..5d4f9e75c 100644
--- a/config/common_base
+++ b/config/common_base
@@ -796,6 +796,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index d0437e5d6..e88e20f02 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 37ca19aa7..3fa51cca3 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -57,6 +57,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..f08fcaad8
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->notify_base,
+			hw->isr, hw->dev_cfg,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)(ring_state >> 16);
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..bafd42153
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,840 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+#include <eal_vfio.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	internal->vfio_container_fd = rte_vfio_create_container();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	internal->vfio_group_fd = rte_vfio_bind_group(
+			internal->vfio_container_fd, iommu_group_no);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx.", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_map(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_unmap(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for only packet buffer,
+		 * SW helps to mark the used ring as dirty after device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			pfn = hw->vring[i].used / 4096;
+			for (j = 0; j <= hw->vring[i].size * 8 / 4096; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait returned error");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_features_set(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Cannot get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_features(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK | \
+		 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+static int
+ifcvf_get_protocol_features(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.get_queue_num = ifcvf_get_queue_num,
+	.get_features = ifcvf_get_vdpa_features,
+	.get_protocol_features = ifcvf_get_protocol_features,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.set_vring_state = NULL,
+	.set_features = ifcvf_features_set,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		goto error;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (rte_vdpa_register_device(&internal->dev_addr,
+				&ifcvf_ops) < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * The set of PCI devices this driver supports.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.driver = {
+		.name = "net_ifcvf",
+	},
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index a9b4b0502..65f28cc1c 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -181,6 +181,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA)     += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v5 4/4] doc: add ifcvf driver document and release note
  2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
                                   ` (2 preceding siblings ...)
  2018-04-05 18:07                 ` [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-05 18:07                 ` Xiao Wang
  2018-04-11 18:59                 ` [PATCH v5 0/4] add ifcvf vdpa driver Ferruh Yigit
  4 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-05 18:07 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 ++++
 doc/guides/nics/ifcvf.rst              | 85 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 103 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..5d82bd25e
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,85 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
+works as a HW vhost backend that can send/receive packets to/from virtio
+directly by DMA. In addition, it supports dirty page logging and device state
+report/restore. This driver enables the vDPA functionality together with the
+live migration feature.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, but it has its own subsystem vendor ID and device ID. To let the
+device be probed by the IFCVF driver, add the "vdpa=1" devarg to specify that
+this device is to be used in vDPA mode rather than polling mode; the virtio
+PMD will skip the device when it detects this parameter.
+
+Different VF devices serve different virtio frontends which are in different
+VMs, so each VF needs to have its own DMA address translation service. During
+the driver probe a new container is created for this device; with this
+container the vDPA driver can program the DMA remapping table with the VM's
+memory region information.
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable VF data path with virtio information provided by vhost lib, including
+  IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
+  route HW interrupt to virtio driver, create notify relay thread to translate
+  virtio driver's kick to an MMIO write onto HW, HW queues configuration.
+
+  This function gets called to set up HW data path backend when virtio driver
+  in VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup in ifcvf_dev_config.
+
+  This function gets called when virtio driver stops device in VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via
+  vhost API. When QEMU vhost connection gets ready, the assigned VF will
+  get configured automatically.
+
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- Platform with IOMMU feature. The IFC VF needs an address translation
+  service to Rx/Tx directly with the virtio driver in the VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up VF MSI-X interrupts; each queue's interrupt
+vector is mapped to a callfd associated with a virtio ring. Currently only
+vfio-pci allows multiple interrupts, so the IFCVF driver depends on vfio-pci.
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The IFC VF doesn't support RARP packet generation; a virtio frontend that
+supports the VIRTIO_NET_F_GUEST_ANNOUNCE feature can help to do that.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 51c453d9c..a294ab389 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -44,6 +44,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index 9cc77f893..c3d996fdc 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -58,6 +58,15 @@ New Features
   * Added support for NVGRE, VXLAN and GENEVE filters in flow API.
   * Added support for DROP action in flow API.
 
+* **Added IFCVF vDPA driver.**
+
+  Added IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and
+  is compatible with virtio 0.95 and 1.0. The driver registers the ifcvf vDPA
+  driver with the vhost lib; when a virtio connection is established, the
+  registered vDPA driver configures the assigned VF to Rx/Tx directly to the
+  VM's virtio vrings.
+
 
 API Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-04-05 18:07                 ` [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-11 18:58                   ` Ferruh Yigit
  2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
  1 sibling, 0 replies; 98+ messages in thread
From: Ferruh Yigit @ 2018-04-11 18:58 UTC (permalink / raw)
  To: Xiao Wang
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Rosen Xu

On 4/5/2018 7:07 PM, Xiao Wang wrote:
> The IFCVF vDPA (vhost data path acceleration) driver provides support for
> the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible,
> it works as a HW vhost backend which can send/receive packets to/from
> virtio directly by DMA.
> 
> Different VF devices serve different virtio frontends which are in
> different VMs, so each VF needs to have its own DMA address translation
> service. During the driver probe a new container is created, with this
> container vDPA driver can program DMA remapping table with the VM's memory
> region information.
> 
> Key vDPA driver ops implemented:
> 
> - ifcvf_dev_config:
>   Enable VF data path with virtio information provided by vhost lib,
>   including IOMMU programming to enable VF DMA to VM's memory, VFIO
>   interrupt setup to route HW interrupt to virtio driver, create notify
>   relay thread to translate virtio driver's kick to a MMIO write onto HW,
>   HW queues configuration.
> 
> - ifcvf_dev_close:
>   Revoke all the setup in ifcvf_dev_config.
> 
> Live migration feature is supported by IFCVF and this driver enables
> it. For the dirty page logging, VF helps to log for packet buffer write,
> driver helps to make the used ring as dirty when device stops.
> 
> Because vDPA driver needs to set up MSI-X vector to interrupt the
> guest, only vfio-pci is supported currently.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Signed-off-by: Rosen Xu <rosen.xu@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  config/common_base                    |   7 +
>  config/common_linuxapp                |   1 +
>  drivers/net/Makefile                  |   3 +
>  drivers/net/ifc/Makefile              |  36 ++
>  drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
>  drivers/net/ifc/base/ifcvf.h          | 160 +++++++
>  drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
>  drivers/net/ifc/ifcvf_vdpa.c          | 840 ++++++++++++++++++++++++++++++++++
>  drivers/net/ifc/rte_ifcvf_version.map |   4 +
>  mk/rte.app.mk                         |   3 +
>  10 files changed, 1435 insertions(+)
>  create mode 100644 drivers/net/ifc/Makefile
>  create mode 100644 drivers/net/ifc/base/ifcvf.c
>  create mode 100644 drivers/net/ifc/base/ifcvf.h
>  create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
>  create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
>  create mode 100644 drivers/net/ifc/rte_ifcvf_version.map
> 
> diff --git a/config/common_base b/config/common_base
> index 2c40b2603..5d4f9e75c 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -796,6 +796,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
>  #
>  CONFIG_RTE_LIBRTE_PMD_VHOST=n
>  
> +#
> +# Compile IFCVF driver
> +# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
> +# should be enabled.
> +#
> +CONFIG_RTE_LIBRTE_IFCVF_VDPA=n

I believe better to keep "PMD" in config option for consistency:
CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD
And add this into PMD section of the doc.

<...>

> +/*
> + * The set of PCI devices this driver supports.
> + */
> +static const struct rte_pci_id pci_id_ifcvf_map[] = {
> +	{ .class_id = RTE_CLASS_ANY_ID,
> +	  .vendor_id = IFCVF_VENDOR_ID,
> +	  .device_id = IFCVF_DEVICE_ID,
> +	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
> +	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,

Can be good to add comment that these can be same with virtio device id to
clarify this is known/expected.

> +	},
> +
> +	{ .vendor_id = 0, /* sentinel */
> +	},
> +};
> +
> +static struct rte_pci_driver rte_ifcvf_vdpa = {
> +	.driver = {
> +		.name = "net_ifcvf",
> +	},

No need to set name, already done by RTE_PMD_REGISTER_PCI

<...>

> +RTE_INIT(ifcvf_vdpa_init_log);
> +static void
> +ifcvf_vdpa_init_log(void)
> +{
> +	ifcvf_vdpa_logtype = rte_log_register("net.ifcvf_vdpa");

latest format is "pmd.net.ifcvf_vdpa"
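
For reference, a rough sketch of what the suggested registration could look
like (the rte_log_set_level() default level is only an illustration):

static int ifcvf_vdpa_logtype;

RTE_INIT(ifcvf_vdpa_init_log);
static void
ifcvf_vdpa_init_log(void)
{
	/* driver log types use the "pmd.net." prefix */
	ifcvf_vdpa_logtype = rte_log_register("pmd.net.ifcvf_vdpa");
	if (ifcvf_vdpa_logtype >= 0)
		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
}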

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v5 2/4] net/virtio: skip device probe in vdpa mode
  2018-04-05 18:06                 ` [PATCH v5 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-11 18:58                   ` Ferruh Yigit
  0 siblings, 0 replies; 98+ messages in thread
From: Ferruh Yigit @ 2018-04-11 18:58 UTC (permalink / raw)
  To: Xiao Wang
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal

On 4/5/2018 7:06 PM, Xiao Wang wrote:
> If we want a virtio device to work in vDPA (vhost data path acceleration)
> mode, we could add a "vdpa=1" devarg for this device to specify the mode.
> 
> This patch let virtio pmd skip device probe when detecting this parameter.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
> index 2ef213d1a..afb096804 100644
> --- a/drivers/net/virtio/virtio_ethdev.c
> +++ b/drivers/net/virtio/virtio_ethdev.c

This devargs needs to be documented in virtio documentation.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v5 0/4] add ifcvf vdpa driver
  2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
                                   ` (3 preceding siblings ...)
  2018-04-05 18:07                 ` [PATCH v5 " Xiao Wang
@ 2018-04-11 18:59                 ` Ferruh Yigit
  2018-04-12  5:47                   ` Wang, Xiao W
  4 siblings, 1 reply; 98+ messages in thread
From: Ferruh Yigit @ 2018-04-11 18:59 UTC (permalink / raw)
  To: Xiao Wang
  Cc: maxime.coquelin, dev, zhihong.wang, jianfeng.tan, tiwei.bie,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal

On 4/5/2018 7:06 PM, Xiao Wang wrote:
> This patch set has dependency on http://dpdk.org/dev/patchwork/patch/36772/
> (vhost: support selective datapath).
> 
> IFCVF driver
> ============
> The IFCVF vDPA (vhost data path acceleration) driver provides support for the
> Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible, it
> works as a HW vhost backend which can send/receive packets to/from virtio
> directly by DMA. Besides, it supports dirty page logging and device state
> report/restore. This driver enables its vDPA functionality with live migration
> feature.
> 
> vDPA mode
> =========
> IFCVF's vendor ID and device ID are same as that of virtio net pci device,
> with its specific subsystem vendor ID and device ID. To let the device be
> probed by IFCVF driver, adding "vdpa=1" parameter helps to specify that this
> device is to be used in vDPA mode, rather than polling mode, virtio pmd will
> skip when it detects this message.
> 
> Container per device
> ====================
> vDPA needs to create different containers for different devices, thus this
> patch set adds some APIs in eal/vfio to support multiple container, e.g.
> - rte_vfio_create_container
> - rte_vfio_destroy_container
> - rte_vfio_bind_group
> - rte_vfio_unbind_group
> 
> By this extension, a device can be put into a new specific container, rather
> than the previous default container.
> 
> IFCVF vDPA details
> ==================
> Key vDPA driver ops implemented:
> - ifcvf_dev_config:
>   Enable VF data path with virtio information provided by vhost lib, including
>   IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
>   route HW interrupt to virtio driver, create notify relay thread to translate
>   virtio driver's kick to a MMIO write onto HW, HW queues configuration.
> 
>   This function gets called to set up HW data path backend when virtio driver
>   in VM gets ready.
> 
> - ifcvf_dev_close:
>   Revoke all the setup in ifcvf_dev_config.
> 
>   This function gets called when virtio driver stops device in VM.
> 
> Change log
> ==========
> v5:
> - Fix compilation in BSD, remove the rte_vfio.h including in BSD.
> 
> v4:
> - Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
> - Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
> - Align the vfio_cfg search internal APIs naming.
> 
> v3:
> - Add doc and release note for the new driver.
> - Remove the vdev concept, make the driver as a PCI driver, it will get probed
>   by PCI bus driver.
> - Rebase on the v4 vDPA lib patch, register a vDPA device instead of a engine.
> - Remove the PCI API exposure accordingly.
> - Move the MAX_VFIO_CONTAINERS definition to config file.
> - Let virtio pmd skips when a virtio device needs to work in vDPA mode.
> 
> v2:
> - Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
>   to make the API generic cross Linux and BSD, make it as EXPERIMENTAL.
> - Rebase on Zhihong's vDPA v3 patch set.
> - Minor code cleanup on vfio extension.
> 
> 
> Xiao Wang (4):
>   eal/vfio: add multiple container support
>   net/virtio: skip device probe in vdpa mode
>   net/ifcvf: add ifcvf vdpa driver
>   doc: add ifcvf driver document and release note

Hi Xiao,

Current patch doesn't apply cleanly after latest updates, can you please rebase
it onto latest next-net, also there are a few minor comments I put into
individual patches can you please check them?

After above changes done, please add for series:
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v5 0/4] add ifcvf vdpa driver
  2018-04-11 18:59                 ` [PATCH v5 0/4] add ifcvf vdpa driver Ferruh Yigit
@ 2018-04-12  5:47                   ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-12  5:47 UTC (permalink / raw)
  To: Yigit, Ferruh
  Cc: maxime.coquelin, dev, Wang, Zhihong, Tan, Jianfeng, Bie, Tiwei,
	Liang, Cunming, Daly, Dan, thomas, gaetan.rivet, Burakov,
	Anatoly, hemant.agrawal

Hi,

> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Thursday, April 12, 2018 2:59 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>
> Cc: maxime.coquelin@redhat.com; dev@dpdk.org; Wang, Zhihong
> <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>; Bie,
> Tiwei <tiwei.bie@intel.com>; Liang, Cunming <cunming.liang@intel.com>;
> Daly, Dan <dan.daly@intel.com>; thomas@monjalon.net;
> gaetan.rivet@6wind.com; Burakov, Anatoly <anatoly.burakov@intel.com>;
> hemant.agrawal@nxp.com
> Subject: Re: [PATCH v5 0/4] add ifcvf vdpa driver
> 
> On 4/5/2018 7:06 PM, Xiao Wang wrote:
> > This patch set has dependency on
> http://dpdk.org/dev/patchwork/patch/36772/
> > (vhost: support selective datapath).
> >
> > IFCVF driver
> > ============
> > The IFCVF vDPA (vhost data path acceleration) driver provides support for
> the
> > Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible, it
> > works as a HW vhost backend which can send/receive packets to/from virtio
> > directly by DMA. Besides, it supports dirty page logging and device state
> > report/restore. This driver enables its vDPA functionality with live migration
> > feature.
> >
> > vDPA mode
> > =========
> > IFCVF's vendor ID and device ID are same as that of virtio net pci device,
> > with its specific subsystem vendor ID and device ID. To let the device be
> > probed by IFCVF driver, adding "vdpa=1" parameter helps to specify that this
> > device is to be used in vDPA mode, rather than polling mode, virtio pmd will
> > skip when it detects this message.
> >
> > Container per device
> > ====================
> > vDPA needs to create different containers for different devices, thus this
> > patch set adds some APIs in eal/vfio to support multiple container, e.g.
> > - rte_vfio_create_container
> > - rte_vfio_destroy_container
> > - rte_vfio_bind_group
> > - rte_vfio_unbind_group
> >
> > By this extension, a device can be put into a new specific container, rather
> > than the previous default container.
> >
> > IFCVF vDPA details
> > ==================
> > Key vDPA driver ops implemented:
> > - ifcvf_dev_config:
> >   Enable VF data path with virtio information provided by vhost lib, including
> >   IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt
> setup to
> >   route HW interrupt to virtio driver, create notify relay thread to translate
> >   virtio driver's kick to a MMIO write onto HW, HW queues configuration.
> >
> >   This function gets called to set up HW data path backend when virtio driver
> >   in VM gets ready.
> >
> > - ifcvf_dev_close:
> >   Revoke all the setup in ifcvf_dev_config.
> >
> >   This function gets called when virtio driver stops device in VM.
> >
> > Change log
> > ==========
> > v5:
> > - Fix compilation in BSD, remove the rte_vfio.h including in BSD.
> >
> > v4:
> > - Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
> > - Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the
> fd.
> > - Align the vfio_cfg search internal APIs naming.
> >
> > v3:
> > - Add doc and release note for the new driver.
> > - Remove the vdev concept, make the driver as a PCI driver, it will get probed
> >   by PCI bus driver.
> > - Rebase on the v4 vDPA lib patch, register a vDPA device instead of a engine.
> > - Remove the PCI API exposure accordingly.
> > - Move the MAX_VFIO_CONTAINERS definition to config file.
> > - Let virtio pmd skips when a virtio device needs to work in vDPA mode.
> >
> > v2:
> > - Rename function pci_get_kernel_driver_by_path to
> rte_pci_device_kdriver_name
> >   to make the API generic cross Linux and BSD, make it as EXPERIMENTAL.
> > - Rebase on Zhihong's vDPA v3 patch set.
> > - Minor code cleanup on vfio extension.
> >
> >
> > Xiao Wang (4):
> >   eal/vfio: add multiple container support
> >   net/virtio: skip device probe in vdpa mode
> >   net/ifcvf: add ifcvf vdpa driver
> >   doc: add ifcvf driver document and release note
> 
> Hi Xiao,
> 
> Current patch doesn't apply cleanly after latest updates, can you please rebase
> it onto latest next-net, also there are a few minor comments I put into
> individual patches can you please check them?
> 
> After above changes done, please add for series:
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>

Thanks, will update according to that.

BRs,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v6 0/4] add ifcvf vdpa driver
  2018-04-05 18:07                 ` [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
  2018-04-11 18:58                   ` Ferruh Yigit
@ 2018-04-12  7:19                   ` Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 1/4] eal/vfio: add multiple container support Xiao Wang
                                       ` (3 more replies)
  1 sibling, 4 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-12  7:19 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. Besides, it supports dirty page logging and device state
report/restore. This driver enables its vDPA functionality with the live
migration feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net pci
device, but it has its own subsystem vendor ID and device ID. To let the
device be probed by the IFCVF driver, add the "vdpa=1" devarg to specify that
this device is to be used in vDPA mode rather than polling mode; the virtio
pmd will skip the device when it detects this parameter.
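
For illustration only (the helper below is hypothetical and not part of this
patch set), handing a device to a vDPA driver with this devarg could look
roughly like:

#include <rte_dev.h>

/* "-w 0000:06:00.3,vdpa=1" on the EAL command line expresses the same
 * devarg; the PCI address here is just an example
 */
static int
attach_in_vdpa_mode(const char *pci_addr)
{
	return rte_eal_hotplug_add("pci", pci_addr, "vdpa=1");
}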

Container per device
====================
vDPA needs to create different containers for different devices, thus this
patch set adds some APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_create_container
- rte_vfio_destroy_container
- rte_vfio_bind_group
- rte_vfio_unbind_group

By this extension, a device can be put into a new specific container, rather
than the previous default container.
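
For reference, a minimal sketch of the intended call flow for these APIs
(not code from this patch set; where iommu_group_no and the memseg to map
come from is up to the driver):

#include <rte_vfio.h>

static int
setup_private_container(int iommu_group_no, const struct rte_memseg *ms)
{
	int container_fd, group_fd;

	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	/* bind the device's IOMMU group to the new container */
	group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);
	if (group_fd < 0)
		goto err;

	/* program this container's DMA remapping table */
	if (rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, ms) < 0)
		goto err_unbind;

	return container_fd;

err_unbind:
	rte_vfio_unbind_group(container_fd, iommu_group_no);
err:
	rte_vfio_destroy_container(container_fd);
	return -1;
}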

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib, including
  IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
  route HW interrupt to virtio driver, create notify relay thread to translate
  virtio driver's kick to a MMIO write onto HW, HW queues configuration.

  This function gets called to set up HW data path backend when virtio driver
  in VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when virtio driver stops device in VM.
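
As an aside, a self-contained sketch of the notify relay idea mentioned under
ifcvf_dev_config (the names are placeholders, not the driver's internals):

#include <stdint.h>
#include <unistd.h>

/* wait for the virtio driver's kick on an eventfd and turn it into a
 * 16-bit doorbell write of the queue index onto the device notify address
 */
static void
relay_one_kick(int kickfd, volatile uint16_t *notify_addr, uint16_t qid)
{
	uint64_t kick;

	if (read(kickfd, &kick, sizeof(kick)) == sizeof(kick))
		*notify_addr = qid;	/* MMIO write onto HW */
}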

Change log
==========
v6:
- Rebase on master branch.
- Document "vdpa" devarg in virtio documentation.
- Rename ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
  consistency, and add it into the driver documentation.
- Add comments for ifcvf device ID.
- Minor code cleaning.

v5:
- Fix compilation in BSD, remove the rte_vfio.h including in BSD.

v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
- Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
- Align the vfio_cfg search internal APIs naming.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept, make the driver as a PCI driver, it will get probed
  by PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of a engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let virtio pmd skips when a virtio device needs to work in vDPA mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic cross Linux and BSD, make it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (4):
  eal/vfio: add multiple container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  98 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/nics/virtio.rst               |  13 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  36 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 845 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  50 ++
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 522 +++++++++++++++----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 21 files changed, 2213 insertions(+), 92 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v6 1/4] eal/vfio: add multiple container support
  2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
@ 2018-04-12  7:19                     ` Xiao Wang
  2018-04-12 14:03                       ` Burakov, Anatoly
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
                                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-12  7:19 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang, Junjie Chen

Currently the eal vfio framework binds the vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put the vfio group
into a separate container and program the IOMMU via this container.

This patch adds some APIs to support container creating and device
binding with a container.

A driver could use the "rte_vfio_create_container" helper to create a
new container from eal, and "rte_vfio_bind_group" to bind a device's
IOMMU group to the newly created container.

During rte_vfio_setup_device, the container bound with the device
will be used for IOMMU setup.
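
For completeness, the teardown side would mirror the flow above (a sketch,
not code from this patch):

#include <rte_vfio.h>

static void
teardown_private_container(int container_fd, int iommu_group_no)
{
	/* rte_vfio_destroy_container() also unbinds any groups still bound,
	 * but unbinding explicitly keeps the flow symmetric
	 */
	rte_vfio_unbind_group(container_fd, iommu_group_no);
	rte_vfio_destroy_container(container_fd);
}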

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                       |   1 +
 lib/librte_eal/bsdapp/eal/eal.c          |  50 +++
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 522 +++++++++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 6 files changed, 601 insertions(+), 92 deletions(-)

diff --git a/config/common_base b/config/common_base
index c09c7cf88..90c2821ae 100644
--- a/config/common_base
+++ b/config/common_base
@@ -74,6 +74,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..0a3d8783d 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -746,6 +746,14 @@ int rte_vfio_enable(const char *modname);
 int rte_vfio_is_enabled(const char *modname);
 int rte_vfio_noiommu_is_enabled(void);
 int rte_vfio_clear_group(int vfio_group_fd);
+int rte_vfio_create_container(void);
+int rte_vfio_destroy_container(int container_fd);
+int rte_vfio_bind_group(int container_fd, int iommu_group_no);
+int rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+int rte_vfio_dma_map(int container_fd, int dma_type,
+		const struct rte_memseg *ms);
+int rte_vfio_dma_unmap(int container_fd, int dma_type,
+		const struct rte_memseg *ms);
 
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
@@ -781,3 +789,45 @@ int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index 249095e46..9bb026703 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -32,6 +32,8 @@
 extern "C" {
 #endif
 
+struct rte_memseg;
+
 /**
  * Setup vfio_cfg for the device identified by its address.
  * It discovers the configured I/O MMU groups or sets a new one for the device.
@@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
 }
 #endif
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container for device binding.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind an IOMMU group to a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind an IOMMU group from a container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA mapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *   the dma map type
+ *
+ * @param ms
+ *   the dma address region to map
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA unmapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *    the dma map type
+ *
+ * @param ms
+ *   the dma address region to unmap
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
+
 #endif /* VFIO_PRESENT */
 
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..e474f6e9f 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_no(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg;
+		}
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+	vfio_cfg = vfio_cfgs[i];
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+
+	if (vfio_cfg->vfio_container_fd < 0) {
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+	}
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[idx];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[i];
+	/* Check room for new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[i];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
-		if (i < 0)
+		if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
+			RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		}
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,6 +516,8 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
 	int ret;
@@ -309,12 +566,14 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +590,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfg->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +604,13 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +654,7 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +722,9 @@ rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +746,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +763,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	int ret;
+	struct vfio_iommu_type1_dma_map dma_map;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
 
-		if (ms[i].addr == NULL)
-			break;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	return 0;
+}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
-			return -1;
-		}
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)
+{
+	int ret;
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1148,37 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not supported yet.");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not supported yet.");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..23a1e3608 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -86,6 +86,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index dd38783a2..a62833ed1 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -258,5 +258,11 @@ EXPERIMENTAL {
 	rte_service_start_with_defaults;
 	rte_socket_count;
 	rte_socket_id_by_idx;
+	rte_vfio_bind_group;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_unbind_group;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v6 2/4] net/virtio: skip device probe in vdpa mode
  2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 1/4] eal/vfio: add multiple container support Xiao Wang
@ 2018-04-12  7:19                     ` Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 4/4] doc: add ifcvf driver document and release note Xiao Wang
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-12  7:19 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we could add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio pmd skip device probe when it detects this parameter.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/virtio.rst         | 13 ++++++++++++
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index ca09cd203..8922f9c0b 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -318,3 +318,16 @@ Here we use l3fwd-power as an example to show how to get started.
 
         $ l3fwd-power -l 0-1 -- -p 1 -P --config="(0,0,1)" \
                                                --no-numa --parse-ptype
+
+
+Virtio PMD arguments
+--------------------
+
+The user can specify the following argument in devargs.
+
+#.  ``vdpa``:
+
+    A virtio device could also be driven by a vDPA (vhost data path
+    acceleration) driver, and work as a HW vhost backend. This argument is
+    used to specify that a virtio device needs to work in vDPA mode.
+    (Default: 0 (disabled))
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 11f758929..6d6c50e89 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -28,6 +28,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1708,9 +1709,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v6 3/4] net/ifcvf: add ifcvf vdpa driver
  2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 1/4] eal/vfio: add multiple container support Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-12  7:19                     ` Xiao Wang
  2018-04-12  7:19                     ` [PATCH v6 4/4] doc: add ifcvf driver document and release note Xiao Wang
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-12  7:19 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang, Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible,
it works as a HW vhost backend which can send/receive packets to/from
virtio directly by DMA.

Different VF devices serve different virtio frontends which are in
different VMs, so each VF needs to have its own DMA address translation
service. During the driver probe a new container is created; with this
container the vDPA driver can program the DMA remapping table with the
VM's memory region information.

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib,
  including IOMMU programming to enable VF DMA to VM's memory, VFIO
  interrupt setup to route HW interrupt to virtio driver, create notify
  relay thread to translate virtio driver's kick to a MMIO write onto HW,
  HW queues configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

The live migration feature is supported by IFCVF and this driver enables
it. For dirty page logging, the VF helps to log packet buffer writes, and
the driver helps to mark the used ring as dirty when the device stops.

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is supported currently.
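
For reference, routing one queue's callfd to an MSI-X vector through vfio-pci
uses the standard VFIO_DEVICE_SET_IRQS ioctl; a rough sketch (not the driver
code) of wiring a single vector:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int
route_msix_to_callfd(int vfio_dev_fd, int vector, int callfd)
{
	char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
	struct vfio_irq_set *irq_set = (struct vfio_irq_set *)buf;

	irq_set->argsz = sizeof(buf);
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
			 VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	irq_set->start = vector;
	irq_set->count = 1;
	/* the eventfd (virtio callfd) this vector should signal */
	memcpy(irq_set->data, &callfd, sizeof(int));

	return ioctl(vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
}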

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  36 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 845 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1440 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index 90c2821ae..8d5d95868 100644
--- a/config/common_base
+++ b/config/common_base
@@ -790,6 +790,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index d0437e5d6..14e56cb4d 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 37ca19aa7..d3fafbfe1 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -57,6 +57,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..95bb8d769
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->notify_base,
+			hw->isr, hw->dev_cfg,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)(ring_state >> 16);
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..d16ffc3b6
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,845 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+#include <eal_vfio.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+#ifndef PAGE_SIZE
+#define PAGE_SIZE 4096
+#endif
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	internal->vfio_container_fd = rte_vfio_create_container();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	internal->vfio_group_fd = rte_vfio_bind_group(
+			internal->vfio_container_fd, iommu_group_no);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx.", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_map(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+		struct rte_memseg ms;
+
+		reg = &mem->regions[i];
+		ms.addr_64 = reg->host_user_addr;
+		ms.iova = reg->guest_phys_addr;
+		ms.len = reg->size;
+		rte_vfio_dma_unmap(vfio_container_fd, VFIO_TYPE1_IOMMU, &ms);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	uint32_t i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint32_t size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for only packet buffer,
+		 * SW helps to mark the used ring as dirty after device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			size = hw->vring[i].size * 8 + 4;
+			pfn = hw->vring[i].used / PAGE_SIZE;
+			for (j = 0; j <= size / PAGE_SIZE; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_set_features(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Can not get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_features(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK | \
+		 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+static int
+ifcvf_get_protocol_features(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.get_queue_num = ifcvf_get_queue_num,
+	.get_features = ifcvf_get_vdpa_features,
+	.get_protocol_features = ifcvf_get_protocol_features,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.set_vring_state = NULL,
+	.set_features = ifcvf_set_features,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		goto error;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES) |
+		(1ULL << VHOST_F_LOG_ALL);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	internal->did = rte_vdpa_register_device(&internal->dev_addr,
+			&ifcvf_ops);
+	if (internal->did < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_destroy_container(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * IFCVF has the same vendor ID and device ID as virtio net PCI
+ * device, with its specific subsystem vendor ID and device ID.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("pmd.net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 258590819..9f01ff9e7 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -185,6 +185,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v6 4/4] doc: add ifcvf driver document and release note
  2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
                                       ` (2 preceding siblings ...)
  2018-04-12  7:19                     ` [PATCH v6 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-12  7:19                     ` Xiao Wang
  3 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-12  7:19 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, gaetan.rivet, anatoly.burakov,
	hemant.agrawal, Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 +++
 doc/guides/nics/ifcvf.rst              | 98 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 116 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..d7e76353c
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,98 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
+works as a HW vhost backend which can send/receive packets to/from virtio
+directly by DMA. In addition, it supports dirty page logging and device state
+report/restore. This driver enables its vDPA functionality with the live
+migration feature.
+
+
+Pre-Installation Configuration
+------------------------------
+
+Config File Options
+~~~~~~~~~~~~~~~~~~~
+
+The following option can be modified in the ``config`` file.
+
+- ``CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD`` (default ``y`` for linux)
+
+  Toggle compilation of the ``librte_ifcvf_vdpa`` driver.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, with its own specific subsystem vendor ID and device ID. To have the
+device probed by the IFCVF driver, the "vdpa=1" parameter is used to specify
+that this device is to be used in vDPA mode rather than polling mode; the
+virtio PMD will skip the device when it detects this parameter.
+
+Different VF devices serve different virtio frontends residing in different
+VMs, so each VF needs its own DMA address translation service. During driver
+probe a new VFIO container is created for the device; through this container
+the vDPA driver can program the DMA remapping table with the VM's memory
+region information.
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable the VF data path with virtio information provided by vhost lib:
+  IOMMU programming to enable VF DMA to the VM's memory, VFIO interrupt setup
+  to route HW interrupts to the virtio driver, a notify relay thread to turn
+  virtio kicks into MMIO writes onto HW, and HW queue configuration.
+
+  This function gets called to set up HW data path backend when virtio driver
+  in VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup in ifcvf_dev_config.
+
+  This function gets called when virtio driver stops device in VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via
+  vhost API. When QEMU vhost connection gets ready, the assigned VF will
+  get configured automatically.
+
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- Platform with IOMMU feature. IFC VF needs address translation service to
+  Rx/Tx directly with virtio driver in VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up VF MSI-X interrupts; each queue's interrupt
+vector is mapped to a callfd associated with a virtio ring. Currently only
+vfio-pci allows multiple interrupts, so the IFCVF driver depends on vfio-pci.
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+IFC VF doesn't support RARP packet generation; a virtio frontend supporting
+the VIRTIO_NET_F_GUEST_ANNOUNCE feature can help to do that.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 51c453d9c..a294ab389 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -44,6 +44,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index 3e1ae0cfd..1bf609f6b 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -84,6 +84,15 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* **Added IFCVF vDPA driver.**
+
+  Added IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and
+  is compatible with virtio 0.95 and 1.0. The driver registers itself as a
+  vDPA driver with the vhost lib, and once virtio is connected, the assigned
+  VF gets configured to Rx/Tx directly to/from the VM's virtio vrings with
+  the help of the registered vDPA driver.
+
 
 ABI Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v6 1/4] eal/vfio: add multiple container support
  2018-04-12  7:19                     ` [PATCH v6 1/4] eal/vfio: add multiple container support Xiao Wang
@ 2018-04-12 14:03                       ` Burakov, Anatoly
  2018-04-12 16:07                         ` Wang, Xiao W
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
  1 sibling, 1 reply; 98+ messages in thread
From: Burakov, Anatoly @ 2018-04-12 14:03 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, gaetan.rivet, hemant.agrawal,
	Junjie Chen

On 12-Apr-18 8:19 AM, Xiao Wang wrote:
> Currently eal vfio framework binds vfio group fd to the default
> container fd during rte_vfio_setup_device, while in some cases,
> e.g. vDPA (vhost data path acceleration), we want to put vfio group
> to a separate container and program IOMMU via this container.
> 
> This patch adds some APIs to support container creating and device
> binding with a container.
> 
> A driver could use "rte_vfio_create_container" helper to create a
> new container from eal, use "rte_vfio_bind_group" to bind a device
> to the newly created container.
> 
> During rte_vfio_setup_device, the container bound with the device
> will be used for IOMMU setup.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---

Apologies for late review. Some comments below.

<...>

>   
> +struct rte_memseg;
> +
>   /**
>    * Setup vfio_cfg for the device identified by its address.
>    * It discovers the configured I/O MMU groups or sets a new one for the device.
> @@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
>   }
>   #endif
>   

<...>

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma mapping for devices in a conainer.
> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param dma_type
> + *   the dma map type
> + *
> + * @param ms
> + *   the dma address region to map
> + *
> + * @return
> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
> +

First of all, why memseg, instead of va/iova/len? This seems like 
unnecessary attachment to internals of DPDK memory representation. Not 
all memory comes in memsegs, this makes the API unnecessarily specific 
to DPDK memory.

Also, why providing DMA type? There's already a VFIO type pointer in 
vfio_config - you can set this pointer for every new created container, 
so the user wouldn't have to care about IOMMU type. Is it not possible 
to figure out DMA type from within EAL VFIO? If not, maybe provide an 
API to do so, e.g. rte_vfio_container_set_dma_type()?

This will also need to be rebased on top of latest HEAD because there 
already is a similar DMA map/unmap API added, only without the container 
parameter. Perhaps rename these new functions to 
rte_vfio_container_(create|destroy|dma_map|dma_unmap)?
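
For illustration only, the kind of prototypes i have in mind (names and
exact signatures are just a suggestion, not something already in the
tree):

/* plain va/iova/len, no IOMMU type argument, container_* naming */
int rte_vfio_container_create(void);
int rte_vfio_container_destroy(int container_fd);
int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
		uint64_t iova, uint64_t len);
int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
		uint64_t iova, uint64_t len);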

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma unmapping for devices in a conainer.
> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param dma_type
> + *    the dma map type
> + *
> + * @param ms
> + *   the dma address region to unmap
> + *
> + * @return
> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
> +
>   #endif /* VFIO_PRESENT */
>   

<...>

> @@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
>   		if (vfio_group_fd < 0) {
>   			/* if file not found, it's not an error */
>   			if (errno != ENOENT) {
> -				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> -						strerror(errno));
> +				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
> +					filename, strerror(errno));

This looks like unintended change.

>   				return -1;
>   			}
>   
> @@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
>   			vfio_group_fd = open(filename, O_RDWR);
>   			if (vfio_group_fd < 0) {
>   				if (errno != ENOENT) {
> -					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> -							strerror(errno));
> +					RTE_LOG(ERR, EAL,
> +						"Cannot open %s: %s\n",
> +						filename,
> +						strerror(errno));

This looks like unintended change.

>   					return -1;
>   				}
>   				return 0;
> @@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
>   			/* noiommu group found */
>   		}
>   
> -		cur_grp->group_no = iommu_group_no;
> -		cur_grp->fd = vfio_group_fd;
> -		vfio_cfg.vfio_active_groups++;
>   		return vfio_group_fd;
>   	}
> -	/* if we're in a secondary process, request group fd from the primary
> +	/*
> +	 * if we're in a secondary process, request group fd from the primary
>   	 * process via our socket
>   	 */

This looks like unintended change.

>   	else {
> -		int socket_fd, ret;
> -
> -		socket_fd = vfio_mp_sync_connect_to_primary();
> +		int ret;
> +		int socket_fd = vfio_mp_sync_connect_to_primary();
>   
>   		if (socket_fd < 0) {
> -			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
> +			RTE_LOG(ERR, EAL,
> +				"  cannot connect to primary process!\n");

This looks like unintended change.

>   			return -1;
>   		}
>   		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
> @@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
>   			close(socket_fd);
>   			return -1;
>   		}
> +
>   		ret = vfio_mp_sync_receive_request(socket_fd);

This looks like unintended change.

(hint: "git revert -n HEAD && git add -p" is your friend :) )

>   		switch (ret) {
>   		case SOCKET_NO_FD:
> @@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
>   			/* if we got the fd, store it and return it */
>   			if (vfio_group_fd > 0) {
>   				close(socket_fd);
> -				cur_grp->group_no = iommu_group_no;
> -				cur_grp->fd = vfio_group_fd;
> -				vfio_cfg.vfio_active_groups++;
>   				return vfio_group_fd;
>   			}
>   			/* fall-through on error */
> @@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
>   	return -1;

<...>

> +int __rte_experimental
> +rte_vfio_create_container(void)
> +{
> +	struct vfio_config *vfio_cfg;
> +	int i;
> +
> +	/* Find an empty slot to store new vfio config */
> +	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
> +		if (vfio_cfgs[i] == NULL)
> +			break;
> +	}
> +
> +	if (i == VFIO_MAX_CONTAINERS) {
> +		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
> +		return -1;
> +	}
> +
> +	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
> +		RTE_CACHE_LINE_SIZE);
> +	if (vfio_cfgs[i] == NULL)
> +		return -ENOMEM;

Is there a specific reason why 1) dynamic allocation is used (as opposed 
to just storing a static array), and 2) DPDK memory allocation is used? 
This seems like unnecessary complication.

Even if you were to decide to allocate memory instead of having a static 
array, you'll have to register for rte_eal_cleanup() to delete any 
allocated containers on DPDK exit. But, as i said, i think it would be 
better to keep it as static array.
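
I.e. something along these lines (illustration only):

/* containers tracked in a plain static array - no allocation needed */
static struct vfio_config vfio_cfgs[VFIO_MAX_CONTAINERS];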

> +
> +	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
> +	vfio_cfg = vfio_cfgs[i];
> +	vfio_cfg->vfio_active_groups = 0;
> +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> +
> +	if (vfio_cfg->vfio_container_fd < 0) {
> +		rte_free(vfio_cfgs[i]);
> +		vfio_cfgs[i] = NULL;
> +		return -1;
> +	}
> +
> +	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> +		vfio_cfg->vfio_groups[i].group_no = -1;
> +		vfio_cfg->vfio_groups[i].fd = -1;
> +		vfio_cfg->vfio_groups[i].devices = 0;
> +	}

<...>

> @@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
>   }
>   
>   static int
> -vfio_type1_dma_map(int vfio_container_fd)
> +do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)

<...>


> +static int
> +do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)

API's such as these two were recently added to DPDK.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v6 1/4] eal/vfio: add multiple container support
  2018-04-12 14:03                       ` Burakov, Anatoly
@ 2018-04-12 16:07                         ` Wang, Xiao W
  2018-04-12 16:24                           ` Burakov, Anatoly
  0 siblings, 1 reply; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-12 16:07 UTC (permalink / raw)
  To: Burakov, Anatoly, Yigit, Ferruh
  Cc: dev, maxime.coquelin, Wang, Zhihong, Bie, Tiwei, Tan, Jianfeng,
	Liang, Cunming, Daly, Dan, thomas, gaetan.rivet, hemant.agrawal,
	Chen, Junjie J

Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Thursday, April 12, 2018 10:04 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly,
> Dan <dan.daly@intel.com>; thomas@monjalon.net; gaetan.rivet@6wind.com;
> hemant.agrawal@nxp.com; Chen, Junjie J <junjie.j.chen@intel.com>
> Subject: Re: [PATCH v6 1/4] eal/vfio: add multiple container support
> 
> On 12-Apr-18 8:19 AM, Xiao Wang wrote:
> > Currently eal vfio framework binds vfio group fd to the default
> > container fd during rte_vfio_setup_device, while in some cases,
> > e.g. vDPA (vhost data path acceleration), we want to put vfio group
> > to a separate container and program IOMMU via this container.
> >
> > This patch adds some APIs to support container creating and device
> > binding with a container.
> >
> > A driver could use "rte_vfio_create_container" helper to create a
> > new container from eal, use "rte_vfio_bind_group" to bind a device
> > to the newly created container.
> >
> > During rte_vfio_setup_device, the container bound with the device
> > will be used for IOMMU setup.
> >
> > Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> > ---
> 
> Apologies for late review. Some comments below.
> 
> <...>
> 
> >
> > +struct rte_memseg;
> > +
> >   /**
> >    * Setup vfio_cfg for the device identified by its address.
> >    * It discovers the configured I/O MMU groups or sets a new one for the
> device.
> > @@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
> >   }
> >   #endif
> >
> 
> <...>
> 
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> > + *
> > + * Perform dma mapping for devices in a conainer.
> > + *
> > + * @param container_fd
> > + *   the specified container fd
> > + *
> > + * @param dma_type
> > + *   the dma map type
> > + *
> > + * @param ms
> > + *   the dma address region to map
> > + *
> > + * @return
> > + *    0 if successful
> > + *   <0 if failed
> > + */
> > +int __rte_experimental
> > +rte_vfio_dma_map(int container_fd, int dma_type, const struct
> rte_memseg *ms);
> > +
> 
> First of all, why memseg, instead of va/iova/len? This seems like
> unnecessary attachment to internals of DPDK memory representation. Not
> all memory comes in memsegs, this makes the API unnecessarily specific
> to DPDK memory.

Agree, will use va/iova/len.

> 
> Also, why providing DMA type? There's already a VFIO type pointer in
> vfio_config - you can set this pointer for every new created container,
> so the user wouldn't have to care about IOMMU type. Is it not possible
> to figure out DMA type from within EAL VFIO? If not, maybe provide an
> API to do so, e.g. rte_vfio_container_set_dma_type()?

It's possible, EAL VFIO should be able to figure out a container's DMA type.

> 
> This will also need to be rebased on top of latest HEAD because there
> already is a similar DMA map/unmap API added, only without the container
> parameter. Perhaps rename these new functions to
> rte_vfio_container_(create|destroy|dma_map|dma_unmap)?

OK, will check the latest HEAD and rebase on that.

> 
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> > + *
> > + * Perform dma unmapping for devices in a conainer.
> > + *
> > + * @param container_fd
> > + *   the specified container fd
> > + *
> > + * @param dma_type
> > + *    the dma map type
> > + *
> > + * @param ms
> > + *   the dma address region to unmap
> > + *
> > + * @return
> > + *    0 if successful
> > + *   <0 if failed
> > + */
> > +int __rte_experimental
> > +rte_vfio_dma_unmap(int container_fd, int dma_type, const struct
> rte_memseg *ms);
> > +
> >   #endif /* VFIO_PRESENT */
> >
> 
> <...>
> 
> > @@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
> >   		if (vfio_group_fd < 0) {
> >   			/* if file not found, it's not an error */
> >   			if (errno != ENOENT) {
> > -				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
> filename,
> > -						strerror(errno));
> > +				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
> > +					filename, strerror(errno));
> 
> This looks like unintended change.
> 
> >   				return -1;
> >   			}
> >
> > @@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
> >   			vfio_group_fd = open(filename, O_RDWR);
> >   			if (vfio_group_fd < 0) {
> >   				if (errno != ENOENT) {
> > -					RTE_LOG(ERR, EAL, "Cannot
> open %s: %s\n", filename,
> > -							strerror(errno));
> > +					RTE_LOG(ERR, EAL,
> > +						"Cannot open %s: %s\n",
> > +						filename,
> > +						strerror(errno));
> 
> This looks like unintended change.
> 
> >   					return -1;
> >   				}
> >   				return 0;
> > @@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
> >   			/* noiommu group found */
> >   		}
> >
> > -		cur_grp->group_no = iommu_group_no;
> > -		cur_grp->fd = vfio_group_fd;
> > -		vfio_cfg.vfio_active_groups++;
> >   		return vfio_group_fd;
> >   	}
> > -	/* if we're in a secondary process, request group fd from the primary
> > +	/*
> > +	 * if we're in a secondary process, request group fd from the primary
> >   	 * process via our socket
> >   	 */
> 
> This looks like unintended change.
> 
> >   	else {
> > -		int socket_fd, ret;
> > -
> > -		socket_fd = vfio_mp_sync_connect_to_primary();
> > +		int ret;
> > +		int socket_fd = vfio_mp_sync_connect_to_primary();
> >
> >   		if (socket_fd < 0) {
> > -			RTE_LOG(ERR, EAL, "  cannot connect to primary
> process!\n");
> > +			RTE_LOG(ERR, EAL,
> > +				"  cannot connect to primary process!\n");
> 
> This looks like unintended change.
> 
> >   			return -1;
> >   		}
> >   		if (vfio_mp_sync_send_request(socket_fd,
> SOCKET_REQ_GROUP) < 0) {
> > @@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
> >   			close(socket_fd);
> >   			return -1;
> >   		}
> > +
> >   		ret = vfio_mp_sync_receive_request(socket_fd);
> 
> This looks like unintended change.
> 
> (hint: "git revert -n HEAD && git add -p" is your friend :) )

Thanks, will remove these diff.

> 
> >   		switch (ret) {
> >   		case SOCKET_NO_FD:
> > @@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
> >   			/* if we got the fd, store it and return it */
> >   			if (vfio_group_fd > 0) {
> >   				close(socket_fd);
> > -				cur_grp->group_no = iommu_group_no;
> > -				cur_grp->fd = vfio_group_fd;
> > -				vfio_cfg.vfio_active_groups++;
> >   				return vfio_group_fd;
> >   			}
> >   			/* fall-through on error */
> > @@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
> >   	return -1;
> 
> <...>
> 
> > +int __rte_experimental
> > +rte_vfio_create_container(void)
> > +{
> > +	struct vfio_config *vfio_cfg;
> > +	int i;
> > +
> > +	/* Find an empty slot to store new vfio config */
> > +	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
> > +		if (vfio_cfgs[i] == NULL)
> > +			break;
> > +	}
> > +
> > +	if (i == VFIO_MAX_CONTAINERS) {
> > +		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
> > +		return -1;
> > +	}
> > +
> > +	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
> > +		RTE_CACHE_LINE_SIZE);
> > +	if (vfio_cfgs[i] == NULL)
> > +		return -ENOMEM;
> 
> Is there a specific reason why 1) dynamic allocation is used (as opposed
> to just storing a static array), and 2) DPDK memory allocation is used?
> This seems like unnecessary complication.
> 
> Even if you were to decide to allocate memory instead of having a static
> array, you'll have to register for rte_eal_cleanup() to delete any
> allocated containers on DPDK exit. But, as i said, i think it would be
> better to keep it as static array.
>

Thanks for the suggestion, static array looks simpler and cleaner.
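
Roughly like this (sketch only, following the rename you suggested; it reuses
the vfio_config struct and vfio_get_container_fd() already in eal_vfio.c and
assumes the container fds are initialized to -1 in rte_vfio_enable()):

static struct vfio_config vfio_cfgs[VFIO_MAX_CONTAINERS];
static struct vfio_config *default_vfio_cfg = &vfio_cfgs[0];

int __rte_experimental
rte_vfio_container_create(void)
{
	int i;

	/* find an empty slot, slot 0 is kept for the default container */
	for (i = 1; i < VFIO_MAX_CONTAINERS; i++)
		if (vfio_cfgs[i].vfio_container_fd == -1)
			break;

	if (i == VFIO_MAX_CONTAINERS) {
		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
		return -1;
	}

	vfio_cfgs[i].vfio_container_fd = vfio_get_container_fd();
	if (vfio_cfgs[i].vfio_container_fd < 0)
		return -1;

	return vfio_cfgs[i].vfio_container_fd;
}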
 
> > +
> > +	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
> > +	vfio_cfg = vfio_cfgs[i];
> > +	vfio_cfg->vfio_active_groups = 0;
> > +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> > +
> > +	if (vfio_cfg->vfio_container_fd < 0) {
> > +		rte_free(vfio_cfgs[i]);
> > +		vfio_cfgs[i] = NULL;
> > +		return -1;
> > +	}
> > +
> > +	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> > +		vfio_cfg->vfio_groups[i].group_no = -1;
> > +		vfio_cfg->vfio_groups[i].fd = -1;
> > +		vfio_cfg->vfio_groups[i].devices = 0;
> > +	}
> 
> <...>
> 
> > @@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
> >   }
> >
> >   static int
> > -vfio_type1_dma_map(int vfio_container_fd)
> > +do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg
> *ms)
> 
> <...>
> 
> 
> > +static int
> > +do_vfio_type1_dma_unmap(int vfio_container_fd, const struct
> rte_memseg *ms)
> 
> API's such as these two were recently added to DPDK.

Will check and rebase.

BRs,
Xiao

> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v6 1/4] eal/vfio: add multiple container support
  2018-04-12 16:07                         ` Wang, Xiao W
@ 2018-04-12 16:24                           ` Burakov, Anatoly
  2018-04-13  9:18                             ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Burakov, Anatoly @ 2018-04-12 16:24 UTC (permalink / raw)
  To: Wang, Xiao W, Yigit, Ferruh
  Cc: dev, maxime.coquelin, Wang, Zhihong, Bie, Tiwei, Tan, Jianfeng,
	Liang, Cunming, Daly, Dan, thomas, gaetan.rivet, hemant.agrawal,
	Chen, Junjie J

On 12-Apr-18 5:07 PM, Wang, Xiao W wrote:
> Hi Anatoly,
> 

<...>

>>
>> Also, why providing DMA type? There's already a VFIO type pointer in
>> vfio_config - you can set this pointer for every new created container,
>> so the user wouldn't have to care about IOMMU type. Is it not possible
>> to figure out DMA type from within EAL VFIO? If not, maybe provide an
>> API to do so, e.g. rte_vfio_container_set_dma_type()?
> 
> It's possible, EAL VFIO should be able to figure out a container's DMA type.

You probably won't be able to do it until you add a group into the 
container, so probably best place to do it would be on group_bind?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v6 1/4] eal/vfio: add multiple container support
  2018-04-12 16:24                           ` Burakov, Anatoly
@ 2018-04-13  9:18                             ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-13  9:18 UTC (permalink / raw)
  To: Burakov, Anatoly, Yigit, Ferruh
  Cc: dev, maxime.coquelin, Wang, Zhihong, Bie, Tiwei, Tan, Jianfeng,
	Liang, Cunming, Daly, Dan, thomas, gaetan.rivet, hemant.agrawal,
	Chen, Junjie J



> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Friday, April 13, 2018 12:24 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly,
> Dan <dan.daly@intel.com>; thomas@monjalon.net; gaetan.rivet@6wind.com;
> hemant.agrawal@nxp.com; Chen, Junjie J <junjie.j.chen@intel.com>
> Subject: Re: [PATCH v6 1/4] eal/vfio: add multiple container support
> 
> On 12-Apr-18 5:07 PM, Wang, Xiao W wrote:
> > Hi Anatoly,
> >
> 
> <...>
> 
> >>
> >> Also, why providing DMA type? There's already a VFIO type pointer in
> >> vfio_config - you can set this pointer for every new created container,
> >> so the user wouldn't have to care about IOMMU type. Is it not possible
> >> to figure out DMA type from within EAL VFIO? If not, maybe provide an
> >> API to do so, e.g. rte_vfio_container_set_dma_type()?
> >
> > It's possible, EAL VFIO should be able to figure out a container's DMA type.
> 
> You probably won't be able to do it until you add a group into the
> container, so probably best place to do it would be on group_bind?

Yes, the IOMMU type pointer could be set when group binding.

BRs,
Xiao

> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v7 0/5] add ifcvf vdpa driver
  2018-04-12  7:19                     ` [PATCH v6 1/4] eal/vfio: add multiple container support Xiao Wang
  2018-04-12 14:03                       ` Burakov, Anatoly
@ 2018-04-15 15:33                       ` Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 1/5] vfio: extend data structure for multi container Xiao Wang
                                           ` (4 more replies)
  1 sibling, 5 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-15 15:33 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
works as a HW vhost backend which can send/receive packets to/from the virtio
driver directly by DMA. In addition, it supports dirty page logging and device
state report/restore. This driver enables its vDPA functionality together with
the live migration feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net pci
device, but it has its own specific subsystem vendor ID and device ID. To let
the device be probed by the IFCVF driver, a "vdpa=1" devarg is added to specify
that this device is to be used in vDPA mode rather than polling mode; the
virtio pmd will skip the device when it detects this parameter.

Container per device
====================
vDPA needs to create different containers for different devices, thus this
patch set adds APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_container_create
- rte_vfio_container_destroy
- rte_vfio_bind_group
- rte_vfio_unbind_group

By this extension, a device can be put into a new specific container, rather
than the previous default container.
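
A minimal usage sketch from a driver's point of view (illustrative only;
iommu_group_no, vaddr, iova and len stand for whatever the driver already
knows about its device and the memory it wants to map):

	int container_fd, group_fd;

	container_fd = rte_vfio_container_create();
	if (container_fd < 0)
		return -1;

	/* move this device's IOMMU group out of the default container */
	group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);
	if (group_fd < 0)
		return -1;

	/* program only this container's IOMMU, e.g. with the VM's memory */
	if (rte_vfio_container_dma_map(container_fd, vaddr, iova, len) < 0)
		return -1;

	/* teardown */
	rte_vfio_container_dma_unmap(container_fd, vaddr, iova, len);
	rte_vfio_unbind_group(container_fd, iommu_group_no);
	rte_vfio_container_destroy(container_fd);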

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable the VF data path with virtio information provided by the vhost lib,
  including: IOMMU programming to enable VF DMA to the VM's memory, VFIO
  interrupt setup to route HW interrupts to the virtio driver, creation of a
  notify relay thread to translate the virtio driver's kick into an MMIO write
  onto the HW, and HW queue configuration.

  This function gets called to set up HW data path backend when virtio driver
  in VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when virtio driver stops device in VM.
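
Both callbacks are plugged into the vDPA ops structure that the driver
registers with the vhost lib, roughly as sketched below (the exact
rte_vdpa_dev_ops field names are assumed from the selective datapath series
this set depends on; the remaining ops are omitted):

static struct rte_vdpa_dev_ops ifcvf_ops = {
	.dev_conf = ifcvf_dev_config,
	.dev_close = ifcvf_dev_close,
	/* feature/queue-number query ops omitted from this sketch */
};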

Change log
==========
v7:
- Rebase on HEAD.
- Split the vfio patch into 2 parts, one for data structure extension, one for
  adding new API.
- Use static vfio_config array instead of dynamic allocation.
- Change rte_vfio_container_dma_map/unmap's parameters to use (va, iova, len).

v6:
- Rebase on master branch.
- Document "vdpa" devarg in virtio documentation.
- Rename ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
  consistency, and add it into driver documentation.
- Add comments for ifcvf device ID.
- Minor code cleaning.

v5:
- Fix compilation in BSD, remove the rte_vfio.h including in BSD.

v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
- Remove API "rte_vfio_get_group_fd"; "rte_vfio_bind_group" will return the fd.
- Align the vfio_cfg search internal APIs naming.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept, make the driver a PCI driver, so it will get probed
  by the PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of an engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let virtio pmd skip probing when a virtio device needs to work in vDPA mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (5):
  vfio: extend data structure for multi container
  vfio: add multi container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  98 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/nics/virtio.rst               |  13 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  36 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 842 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  52 ++
 lib/librte_eal/common/include/rte_vfio.h | 119 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 723 +++++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |  19 +-
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 21 files changed, 2377 insertions(+), 152 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v7 1/5] vfio: extend data structure for multi container
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
@ 2018-04-15 15:33                         ` Xiao Wang
  2018-04-16 10:02                           ` Burakov, Anatoly
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 2/5] vfio: add multi container support Xiao Wang
                                           ` (3 subsequent siblings)
  4 siblings, 2 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-15 15:33 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang, Junjie Chen

Currently eal vfio framework binds vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put the vfio group
into a separate container and program the IOMMU via this container.

This patch extends the vfio_config structure to contain per-container
user_mem_maps and defines an array of vfio_config. The next patch will
base on this to add container API.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                     |   1 +
 lib/librte_eal/linuxapp/eal/eal_vfio.c | 407 ++++++++++++++++++++++-----------
 lib/librte_eal/linuxapp/eal/eal_vfio.h |  19 +-
 3 files changed, 275 insertions(+), 152 deletions(-)

diff --git a/config/common_base b/config/common_base
index c4236fd1f..4a76d2f14 100644
--- a/config/common_base
+++ b/config/common_base
@@ -87,6 +87,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 589d7d478..46fba2d8d 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -22,8 +22,46 @@
 
 #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
 
+/*
+ * we don't need to store device fd's anywhere since they can be obtained from
+ * the group fd via an ioctl() call.
+ */
+struct vfio_group {
+	int group_no;
+	int fd;
+	int devices;
+};
+
+/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
+ * recreate the mappings for DPDK segments, but we cannot do so for memory that
+ * was registered by the user themselves, so we need to store the user mappings
+ * somewhere, to recreate them later.
+ */
+#define VFIO_MAX_USER_MEM_MAPS 256
+struct user_mem_map {
+	uint64_t addr;
+	uint64_t iova;
+	uint64_t len;
+};
+
+struct user_mem_maps {
+	rte_spinlock_recursive_t lock;
+	int n_maps;
+	struct user_mem_map maps[VFIO_MAX_USER_MEM_MAPS];
+};
+
+struct vfio_config {
+	int vfio_enabled;
+	int vfio_container_fd;
+	int vfio_active_groups;
+	const struct vfio_iommu_type *vfio_iommu_type;
+	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
+	struct user_mem_maps mem_maps;
+};
+
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config vfio_cfgs[VFIO_MAX_CONTAINERS];
+static struct vfio_config *default_vfio_cfg = &vfio_cfgs[0];
 
 static int vfio_type1_dma_map(int);
 static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
@@ -31,8 +69,8 @@ static int vfio_spapr_dma_map(int);
 static int vfio_spapr_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
 static int vfio_noiommu_dma_map(int);
 static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
-static int vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len,
-		int do_map);
+static int vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map);
 
 /* IOMMU types we support */
 static const struct vfio_iommu_type iommu_types[] = {
@@ -59,25 +97,6 @@ static const struct vfio_iommu_type iommu_types[] = {
 	},
 };
 
-/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
- * recreate the mappings for DPDK segments, but we cannot do so for memory that
- * was registered by the user themselves, so we need to store the user mappings
- * somewhere, to recreate them later.
- */
-#define VFIO_MAX_USER_MEM_MAPS 256
-struct user_mem_map {
-	uint64_t addr;
-	uint64_t iova;
-	uint64_t len;
-};
-static struct {
-	rte_spinlock_recursive_t lock;
-	int n_maps;
-	struct user_mem_map maps[VFIO_MAX_USER_MEM_MAPS];
-} user_mem_maps = {
-	.lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER
-};
-
 /* for sPAPR IOMMU, we will need to walk memseg list, but we cannot use
  * rte_memseg_walk() because by the time we enter callback we will be holding a
  * write lock, so regular rte-memseg_walk will deadlock. copying the same
@@ -206,14 +225,15 @@ merge_map(struct user_mem_map *left, struct user_mem_map *right)
 }
 
 static struct user_mem_map *
-find_user_mem_map(uint64_t addr, uint64_t iova, uint64_t len)
+find_user_mem_map(struct user_mem_maps *user_mem_maps, uint64_t addr,
+		uint64_t iova, uint64_t len)
 {
 	uint64_t va_end = addr + len;
 	uint64_t iova_end = iova + len;
 	int i;
 
-	for (i = 0; i < user_mem_maps.n_maps; i++) {
-		struct user_mem_map *map = &user_mem_maps.maps[i];
+	for (i = 0; i < user_mem_maps->n_maps; i++) {
+		struct user_mem_map *map = &user_mem_maps->maps[i];
 		uint64_t map_va_end = map->addr + map->len;
 		uint64_t map_iova_end = map->iova + map->len;
 
@@ -239,20 +259,20 @@ find_user_mem_map(uint64_t addr, uint64_t iova, uint64_t len)
 
 /* this will sort all user maps, and merge/compact any adjacent maps */
 static void
-compact_user_maps(void)
+compact_user_maps(struct user_mem_maps *user_mem_maps)
 {
 	int i, n_merged, cur_idx;
 
-	qsort(user_mem_maps.maps, user_mem_maps.n_maps,
-			sizeof(user_mem_maps.maps[0]), user_mem_map_cmp);
+	qsort(user_mem_maps->maps, user_mem_maps->n_maps,
+			sizeof(user_mem_maps->maps[0]), user_mem_map_cmp);
 
 	/* we'll go over the list backwards when merging */
 	n_merged = 0;
-	for (i = user_mem_maps.n_maps - 2; i >= 0; i--) {
+	for (i = user_mem_maps->n_maps - 2; i >= 0; i--) {
 		struct user_mem_map *l, *r;
 
-		l = &user_mem_maps.maps[i];
-		r = &user_mem_maps.maps[i + 1];
+		l = &user_mem_maps->maps[i];
+		r = &user_mem_maps->maps[i + 1];
 
 		if (is_null_map(l) || is_null_map(r))
 			continue;
@@ -266,12 +286,12 @@ compact_user_maps(void)
 	 */
 	if (n_merged > 0) {
 		cur_idx = 0;
-		for (i = 0; i < user_mem_maps.n_maps; i++) {
-			if (!is_null_map(&user_mem_maps.maps[i])) {
+		for (i = 0; i < user_mem_maps->n_maps; i++) {
+			if (!is_null_map(&user_mem_maps->maps[i])) {
 				struct user_mem_map *src, *dst;
 
-				src = &user_mem_maps.maps[i];
-				dst = &user_mem_maps.maps[cur_idx++];
+				src = &user_mem_maps->maps[i];
+				dst = &user_mem_maps->maps[cur_idx++];
 
 				if (src != dst) {
 					memcpy(dst, src, sizeof(*src));
@@ -279,41 +299,16 @@ compact_user_maps(void)
 				}
 			}
 		}
-		user_mem_maps.n_maps = cur_idx;
+		user_mem_maps->n_maps = cur_idx;
 	}
 }
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
 	/* if primary, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
@@ -343,9 +338,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
 	/* if we're in a secondary process, request group fd from the primary
@@ -380,9 +372,6 @@ vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -392,56 +381,164 @@ vfio_get_group_fd(int iommu_group_no)
 			return -1;
 		}
 	}
-	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_no(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg;
+		}
+	}
+
+	return default_vfio_cfg;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
 
-static int
-get_vfio_group_idx(int vfio_group_fd)
+	return default_vfio_cfg;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_container_fd(int container_fd)
 {
 	int i;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i].vfio_container_fd == container_fd)
+			return &vfio_cfgs[i];
+	}
+
+	return NULL;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	int i;
+	int vfio_group_fd;
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
 	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
-			return i;
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 static void
@@ -457,9 +554,11 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len)
 	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
 		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(vfio_va, vfio_va, len, 1);
+			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
+					len, 1);
 		else
-			vfio_dma_mem_map(vfio_va, vfio_va, len, 0);
+			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
+					len, 0);
 		return;
 	}
 
@@ -467,9 +566,11 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len)
 	ms = rte_mem_virt2memseg(addr, msl);
 	while (cur_len < len) {
 		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(ms->addr_64, ms->iova, ms->len, 1);
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
 		else
-			vfio_dma_mem_map(ms->addr_64, ms->iova, ms->len, 0);
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
 
 		cur_len += ms->len;
 		++ms;
@@ -481,16 +582,19 @@ rte_vfio_clear_group(int vfio_group_fd)
 {
 	int i;
 	int socket_fd, ret;
+	struct vfio_config *vfio_cfg;
+
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
 
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
 		if (i < 0)
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -543,6 +647,9 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
 	int i, ret;
@@ -591,12 +698,17 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+	user_mem_maps = &vfio_cfg->mem_maps;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
 
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -614,11 +726,11 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * functionality.
 		 */
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfg->vfio_active_groups == 1) {
 			const struct vfio_iommu_type *t;
 
 			/* select an IOMMU type which we will be using */
-			t = vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+			t = vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -631,7 +743,10 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 			 * after registering callback, to prevent races
 			 */
 			rte_rwlock_read_lock(mem_lock);
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			if (vfio_cfg == default_vfio_cfg)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -642,22 +757,22 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				return -1;
 			}
 
-			vfio_cfg.vfio_iommu_type = t;
+			vfio_cfg->vfio_iommu_type = t;
 
 			/* re-map all user-mapped segments */
-			rte_spinlock_recursive_lock(&user_mem_maps.lock);
+			rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 			/* this IOMMU type may not support DMA mapping, but
 			 * if we have mappings in the list - that means we have
 			 * previously mapped something successfully, so we can
 			 * be sure that DMA mapping is supported.
 			 */
-			for (i = 0; i < user_mem_maps.n_maps; i++) {
+			for (i = 0; i < user_mem_maps->n_maps; i++) {
 				struct user_mem_map *map;
-				map = &user_mem_maps.maps[i];
+				map = &user_mem_maps->maps[i];
 
 				ret = t->dma_user_map_func(
-						vfio_cfg.vfio_container_fd,
+						vfio_container_fd,
 						map->addr, map->iova, map->len,
 						1);
 				if (ret) {
@@ -668,17 +783,20 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 							map->addr, map->iova,
 							map->len);
 					rte_spinlock_recursive_unlock(
-							&user_mem_maps.lock);
+							&user_mem_maps->lock);
 					rte_rwlock_read_unlock(mem_lock);
 					return -1;
 				}
 			}
-			rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+			rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 
 			/* register callback for mem events */
-			ret = rte_mem_event_callback_register(
+			if (vfio_cfg == default_vfio_cfg)
+				ret = rte_mem_event_callback_register(
 					VFIO_MEM_EVENT_CLB_NAME,
 					vfio_mem_event_callback);
+			else
+				ret = 0;
 			/* unlock memory hotplug */
 			rte_rwlock_read_unlock(mem_lock);
 
@@ -732,6 +850,7 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
 	int vfio_group_fd;
 	int iommu_group_no;
 	int ret;
@@ -761,6 +880,9 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 		goto out;
 	}
 
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+
 	/* At this point we got an active group. Closing it will make the
 	 * container detachment. If this is the last active group, VFIO kernel
 	 * code will unset the container and the IOMMU mappings.
@@ -798,7 +920,7 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 	/* if there are no active device groups, unregister the callback to
 	 * avoid spurious attempts to map/unmap memory from VFIO.
 	 */
-	if (vfio_cfg.vfio_active_groups == 0)
+	if (vfio_cfg == default_vfio_cfg && vfio_cfg->vfio_active_groups == 0)
 		rte_mem_event_callback_unregister(VFIO_MEM_EVENT_CLB_NAME);
 
 	/* success */
@@ -813,13 +935,21 @@ int
 rte_vfio_enable(const char *modname)
 {
 	/* initialize group list */
-	int i;
+	int i, j;
 	int vfio_available;
-
-	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+	rte_spinlock_recursive_t lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfgs[i].vfio_container_fd = -1;
+		vfio_cfgs[i].vfio_active_groups = 0;
+		vfio_cfgs[i].vfio_iommu_type = NULL;
+		vfio_cfgs[i].mem_maps.lock = lock;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			vfio_cfgs[i].vfio_groups[j].fd = -1;
+			vfio_cfgs[i].vfio_groups[j].group_no = -1;
+			vfio_cfgs[i].vfio_groups[j].devices = 0;
+		}
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -841,12 +971,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg->vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg->vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg->vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -858,7 +988,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg->vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -1220,9 +1350,13 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	struct vfio_iommu_spapr_tce_create create = {
 		.argsz = sizeof(create),
 	};
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
 	int i, ret = 0;
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
+	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
+	user_mem_maps = &vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* check if window size needs to be adjusted */
 	memset(&param, 0, sizeof(param));
@@ -1235,9 +1369,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	}
 
 	/* also check user maps */
-	for (i = 0; i < user_mem_maps.n_maps; i++) {
-		uint64_t max = user_mem_maps.maps[i].iova +
-				user_mem_maps.maps[i].len;
+	for (i = 0; i < user_mem_maps->n_maps; i++) {
+		uint64_t max = user_mem_maps->maps[i].iova +
+				user_mem_maps->maps[i].len;
 		create.window_size = RTE_MAX(create.window_size, max);
 	}
 
@@ -1263,9 +1397,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				goto out;
 			}
 			/* remap all user maps */
-			for (i = 0; i < user_mem_maps.n_maps; i++) {
+			for (i = 0; i < user_mem_maps->n_maps; i++) {
 				struct user_mem_map *map =
-						&user_mem_maps.maps[i];
+						&user_mem_maps->maps[i];
 				if (vfio_spapr_dma_do_map(vfio_container_fd,
 						map->addr, map->iova, map->len,
 						1)) {
@@ -1306,7 +1440,7 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
@@ -1358,9 +1492,10 @@ vfio_noiommu_dma_mem_map(int __rte_unused vfio_container_fd,
 }
 
 static int
-vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
+vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len, int do_map)
 {
-	const struct vfio_iommu_type *t = vfio_cfg.vfio_iommu_type;
+	const struct vfio_iommu_type *t = vfio_cfg->vfio_iommu_type;
 
 	if (!t) {
 		RTE_LOG(ERR, EAL, "  VFIO support not initialized\n");
@@ -1376,7 +1511,7 @@ vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
 		return -1;
 	}
 
-	return t->dma_user_map_func(vfio_cfg.vfio_container_fd, vaddr, iova,
+	return t->dma_user_map_func(vfio_cfg->vfio_container_fd, vaddr, iova,
 			len, do_map);
 }
 
@@ -1384,6 +1519,7 @@ int __rte_experimental
 rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 {
 	struct user_mem_map *new_map;
+	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
 	if (len == 0) {
@@ -1391,15 +1527,16 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		return -1;
 	}
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
-	if (user_mem_maps.n_maps == VFIO_MAX_USER_MEM_MAPS) {
+	user_mem_maps = &default_vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
+	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
 		rte_errno = ENOMEM;
 		ret = -1;
 		goto out;
 	}
 	/* map the entry */
-	if (vfio_dma_mem_map(vaddr, iova, len, 1)) {
+	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 1)) {
 		/* technically, this will fail if there are currently no devices
 		 * plugged in, even if a device were added later, this mapping
 		 * might have succeeded. however, since we cannot verify if this
@@ -1412,14 +1549,14 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		goto out;
 	}
 	/* create new user mem map entry */
-	new_map = &user_mem_maps.maps[user_mem_maps.n_maps++];
+	new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
 	new_map->addr = vaddr;
 	new_map->iova = iova;
 	new_map->len = len;
 
-	compact_user_maps();
+	compact_user_maps(user_mem_maps);
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
@@ -1427,6 +1564,7 @@ int __rte_experimental
 rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 {
 	struct user_mem_map *map, *new_map = NULL;
+	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
 	if (len == 0) {
@@ -1434,10 +1572,11 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 		return -1;
 	}
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
+	user_mem_maps = &default_vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* find our mapping */
-	map = find_user_mem_map(vaddr, iova, len);
+	map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
 	if (!map) {
 		RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
 		rte_errno = EINVAL;
@@ -1448,17 +1587,17 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
-		if (user_mem_maps.n_maps == VFIO_MAX_USER_MEM_MAPS) {
+		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
 			ret = -1;
 			goto out;
 		}
-		new_map = &user_mem_maps.maps[user_mem_maps.n_maps++];
+		new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
 	}
 
 	/* unmap the entry */
-	if (vfio_dma_mem_map(vaddr, iova, len, 0)) {
+	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 0)) {
 		/* there may not be any devices plugged in, so unmapping will
 		 * fail with ENODEV/ENOTSUP rte_errno values, but that doesn't
 		 * stop us from removing the mapping, as the assumption is we
@@ -1481,19 +1620,19 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 
 		/* if we've created a new map by splitting, sort everything */
 		if (!is_null_map(new_map)) {
-			compact_user_maps();
+			compact_user_maps(user_mem_maps);
 		} else {
 			/* we've created a new mapping, but it was unused */
-			user_mem_maps.n_maps--;
+			user_mem_maps->n_maps--;
 		}
 	} else {
 		memset(map, 0, sizeof(*map));
-		compact_user_maps();
-		user_mem_maps.n_maps--;
+		compact_user_maps(user_mem_maps);
+		user_mem_maps->n_maps--;
 	}
 
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 549f4427e..e14d5be99 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -88,6 +88,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
@@ -98,24 +99,6 @@ int vfio_mp_sync_send_fd(int socket, int fd);
 int vfio_mp_sync_receive_fd(int socket);
 int vfio_mp_sync_connect_to_primary(void);
 
-/*
- * we don't need to store device fd's anywhere since they can be obtained from
- * the group fd via an ioctl() call.
- */
-struct vfio_group {
-	int group_no;
-	int fd;
-	int devices;
-};
-
-struct vfio_config {
-	int vfio_enabled;
-	int vfio_container_fd;
-	int vfio_active_groups;
-	const struct vfio_iommu_type *vfio_iommu_type;
-	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
-};
-
 /* DMA mapping function prototype.
  * Takes VFIO container fd as a parameter.
  * Returns 0 on success, -1 on error.
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v7 2/5] vfio: add multi container support
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 1/5] vfio: extend data structure for multi container Xiao Wang
@ 2018-04-15 15:33                         ` Xiao Wang
  2018-04-16 10:03                           ` Burakov, Anatoly
  2018-04-15 15:33                         ` [PATCH v7 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
                                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-04-15 15:33 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang, Junjie Chen

This patch adds APIs to support container create/destroy and device
bind/unbind with a container. It also provides an API for IOMMU programming
on a specified container.

A driver could use the "rte_vfio_container_create" helper to create a
new container from EAL, then use "rte_vfio_bind_group" to bind a device's
IOMMU group to the newly created container. During rte_vfio_setup_device
the container bound with the device will be used for IOMMU setup.
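
For example, a vDPA driver could program the IOMMU of its own container with
the guest memory layout obtained from the vhost lib (sketch only; vid and
container_fd are assumed to be held by the driver, rte_vhost_get_mem_table()
comes from the vhost lib, and error handling is trimmed):

	struct rte_vhost_memory *mem = NULL;
	uint32_t i;

	if (rte_vhost_get_mem_table(vid, &mem) < 0 || mem == NULL)
		return -1;

	for (i = 0; i < mem->nregions; i++) {
		struct rte_vhost_mem_region *reg = &mem->regions[i];

		/* host VA -> guest PA mapping for VF DMA into the VM */
		rte_vfio_container_dma_map(container_fd,
				reg->host_user_addr,
				reg->guest_phys_addr,
				reg->size);
	}

	free(mem);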

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal.c          |  52 +++++
 lib/librte_eal/common/include/rte_vfio.h | 119 ++++++++++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 316 +++++++++++++++++++++++++++++++
 lib/librte_eal/rte_eal_version.map       |   6 +
 4 files changed, 493 insertions(+)

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 727adc5d2..c5106d0d6 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -769,6 +769,14 @@ int rte_vfio_noiommu_is_enabled(void);
 int rte_vfio_clear_group(int vfio_group_fd);
 int rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
 int rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
+int rte_vfio_container_create(void);
+int rte_vfio_container_destroy(int container_fd);
+int rte_vfio_bind_group(int container_fd, int iommu_group_no);
+int rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
 
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
@@ -818,3 +826,47 @@ rte_vfio_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova,
 {
 	return -1;
 }
+
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(__rte_unused int container_fd,
+		__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(__rte_unused int container_fd,
+		__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(__rte_unused int container_fd,
+			__rte_unused uint64_t vaddr,
+			__rte_unused uint64_t iova,
+			__rte_unused uint64_t len)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(__rte_unused int container_fd,
+			__rte_unused uint64_t vaddr,
+			__rte_unused uint64_t iova,
+			__rte_unused uint64_t len)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index d26ab01cb..0c1509b29 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -168,6 +168,125 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
 int __rte_experimental
 rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container for device binding.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_create(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_destroy(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind an IOMMU group to a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind an IOMMU group from a container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma mapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param vaddr
+ *   Starting virtual address of memory to be mapped.
+ *
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ *
+ * @param len
+ *   Length of memory segment being mapped.
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma unmapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param vaddr
+ *   Starting virtual address of memory to be unmapped.
+ *
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ *
+ * @param len
+ *   Length of memory segment being unmapped.
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 46fba2d8d..2f566a621 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1668,6 +1668,278 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i].vfio_container_fd == -1)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i].vfio_container_fd = vfio_get_container_fd();
+	if (vfio_cfgs[i].vfio_container_fd < 0) {
+		RTE_LOG(NOTICE, EAL, "fail to create a new container\n");
+		return -1;
+	}
+
+	return vfio_cfgs[i].vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	close(container_fd);
+	vfio_cfg->vfio_container_fd = -1;
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_iommu_type = NULL;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	/* Check room for new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	cur_grp->devices = 0;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+			" iommu_group_no %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	cur_grp->devices = 0;
+	vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct user_mem_map *new_map;
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
+	int ret = 0;
+
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	user_mem_maps = &vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
+	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
+		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
+		rte_errno = ENOMEM;
+		ret = -1;
+		goto out;
+	}
+	/* map the entry */
+	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 1)) {
+		/* technically, this will fail if there are currently no devices
+		 * plugged in, even if a device were added later, this mapping
+		 * might have succeeded. however, since we cannot verify if this
+		 * is a valid mapping without having a device attached, consider
+		 * this to be unsupported, because we can't just store any old
+		 * mapping and pollute list of active mappings willy-nilly.
+		 */
+		RTE_LOG(ERR, EAL, "Couldn't map new region for DMA\n");
+		ret = -1;
+		goto out;
+	}
+	/* create new user mem map entry */
+	new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
+	new_map->addr = vaddr;
+	new_map->iova = iova;
+	new_map->len = len;
+
+	compact_user_maps(user_mem_maps);
+out:
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+	return ret;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct user_mem_map *map, *new_map = NULL;
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
+	int ret = 0;
+
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	user_mem_maps = &vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
+
+	/* find our mapping */
+	map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
+	if (!map) {
+		RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
+		rte_errno = EINVAL;
+		ret = -1;
+		goto out;
+	}
+	if (map->addr != vaddr || map->iova != iova || map->len != len) {
+		/* we're partially unmapping a previously mapped region, so we
+		 * need to split entry into two.
+		 */
+		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
+			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
+			rte_errno = ENOMEM;
+			ret = -1;
+			goto out;
+		}
+		new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
+	}
+
+	/* unmap the entry */
+	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 0)) {
+		/* there may not be any devices plugged in, so unmapping will
+		 * fail with ENODEV/ENOTSUP rte_errno values, but that doesn't
+		 * stop us from removing the mapping, as the assumption is we
+		 * won't be needing this memory any more and thus will want to
+		 * prevent it from being remapped again on hotplug. so, only
+		 * fail if we indeed failed to unmap (e.g. if the mapping was
+		 * within our mapped range but had invalid alignment).
+		 */
+		if (rte_errno != ENODEV && rte_errno != ENOTSUP) {
+			RTE_LOG(ERR, EAL, "Couldn't unmap region for DMA\n");
+			ret = -1;
+			goto out;
+		} else {
+			RTE_LOG(DEBUG, EAL, "DMA unmapping failed, but removing mappings anyway\n");
+		}
+	}
+	/* remove map from the list of active mappings */
+	if (new_map != NULL) {
+		adjust_map(map, new_map, vaddr, len);
+
+		/* if we've created a new map by splitting, sort everything */
+		if (!is_null_map(new_map)) {
+			compact_user_maps(user_mem_maps);
+		} else {
+			/* we've created a new mapping, but it was unused */
+			user_mem_maps->n_maps--;
+		}
+	} else {
+		memset(map, 0, sizeof(*map));
+		compact_user_maps(user_mem_maps);
+		user_mem_maps->n_maps--;
+	}
+
+out:
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+	return ret;
+}
+
 #else
 
 int __rte_experimental
@@ -1684,4 +1956,48 @@ rte_vfio_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova,
 	return -1;
 }
 
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(__rte_unused int container_fd,
+		__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(__rte_unused int container_fd,
+		__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(__rte_unused int container_fd,
+		__rte_unused uint64_t vaddr,
+		__rte_unused uint64_t iova,
+		__rte_unused uint64_t len)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(__rte_unused int container_fd,
+		__rte_unused uint64_t vaddr,
+		__rte_unused uint64_t iova,
+		__rte_unused uint64_t len)
+{
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index 2b5b1dcf5..c5eff065e 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -284,7 +284,13 @@ EXPERIMENTAL {
 	rte_service_start_with_defaults;
 	rte_socket_count;
 	rte_socket_id_by_idx;
+	rte_vfio_bind_group;
+	rte_vfio_container_create;
+	rte_vfio_container_destroy;
+	rte_vfio_container_dma_map;
+	rte_vfio_container_dma_unmap;
 	rte_vfio_dma_map;
 	rte_vfio_dma_unmap;
+	rte_vfio_unbind_group;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v7 3/5] net/virtio: skip device probe in vdpa mode
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 1/5] vfio: extend data structure for multi container Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 2/5] vfio: add multi container support Xiao Wang
@ 2018-04-15 15:33                         ` Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 5/5] doc: add ifcvf driver document and release note Xiao Wang
  4 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-15 15:33 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we could add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio PMD skip device probe when it detects this parameter.
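
As an illustrative sketch (not part of this patch; the PCI address is
hypothetical), an application that wants the VF handled by a vDPA driver
would pass the devarg through the EAL PCI whitelist:

#include <rte_common.h>
#include <rte_eal.h>

static int
init_eal_for_vdpa(void)
{
	char *argv[] = {
		"vdpa-app", "-l", "0-1",
		/* "vdpa=1" makes the virtio PMD skip this VF */
		"-w", "0000:06:00.3,vdpa=1",
	};

	return rte_eal_init(RTE_DIM(argv), argv);
}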

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/virtio.rst         | 13 ++++++++++++
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index ca09cd203..8922f9c0b 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -318,3 +318,16 @@ Here we use l3fwd-power as an example to show how to get started.
 
         $ l3fwd-power -l 0-1 -- -p 1 -P --config="(0,0,1)" \
                                                --no-numa --parse-ptype
+
+
+Virtio PMD arguments
+--------------------
+
+The user can specify the below argument in devargs.
+
+#.  ``vdpa``:
+
+    A virtio device can also be driven by a vDPA (vhost data path acceleration)
+    driver, and work as a HW vhost backend. This argument is used to specify
+    that a virtio device needs to work in vDPA mode.
+    (Default: 0 (disabled))
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 41042cb23..5833dad73 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -28,6 +28,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1713,9 +1714,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v7 4/5] net/ifcvf: add ifcvf vdpa driver
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
                                           ` (2 preceding siblings ...)
  2018-04-15 15:33                         ` [PATCH v7 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-15 15:33                         ` Xiao Wang
  2018-04-15 15:33                         ` [PATCH v7 5/5] doc: add ifcvf driver document and release note Xiao Wang
  4 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-15 15:33 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang, Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible;
it works as a HW vhost backend which can send/receive packets to/from
virtio directly by DMA.

Different VF devices serve different virtio frontends which are in
different VMs, so each VF needs to have its own DMA address translation
service. During the driver probe a new container is created; with this
container the vDPA driver can program the DMA remapping table with the VM's
memory region information.

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable VF data path with virtio information provided by vhost lib,
  including IOMMU programming to enable VF DMA to VM's memory, VFIO
  interrupt setup to route HW interrupt to virtio driver, create notify
  relay thread to translate virtio driver's kick to a MMIO write onto HW,
  HW queues configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

Live migration is supported by IFCVF and this driver enables it. For
dirty page logging, the VF helps to log packet buffer writes, and the
driver helps to mark the used ring as dirty when the device stops.

Because vDPA driver needs to set up MSI-X vector to interrupt the
guest, only vfio-pci is supported currently.
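
For context, a host application is expected to consume this driver through
the vhost lib's selective datapath API that this series depends on. The
sketch below is illustrative only (socket path and error handling are
simplified, and it assumes the rte_vdpa/rte_vhost calls from that series):

#include <rte_pci.h>
#include <rte_vhost.h>
#include <rte_vdpa.h>

static int
attach_ifcvf_to_vhost_socket(const char *path, const struct rte_pci_addr *vf)
{
	struct rte_vdpa_dev_addr addr = {
		.type = PCI_ADDR,
		.pci_addr = *vf,
	};
	int did = rte_vdpa_find_device_id(&addr);

	if (did < 0)
		return -1;

	if (rte_vhost_driver_register(path, 0) < 0)
		return -1;

	/* bind the vhost-user socket to the IFC VF's vDPA device id */
	if (rte_vhost_driver_attach_vdpa_device(path, did) < 0)
		return -1;

	return rte_vhost_driver_start(path);
}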

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  36 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 842 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1437 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index 4a76d2f14..a08d370c5 100644
--- a/config/common_base
+++ b/config/common_base
@@ -809,6 +809,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index d0437e5d6..14e56cb4d 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index dc5047e04..9f9da6651 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -58,6 +58,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..95bb8d769
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->notify_base,
+			hw->isr, hw->dev_cfg,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)(ring_state >> 16);
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..d2925251e
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,842 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+#include <eal_vfio.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+#ifndef PAGE_SIZE
+#define PAGE_SIZE 4096
+#endif
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_no;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	vfio_get_group_no(rte_pci_get_sysfs_path(), devname, &iommu_group_no);
+
+	internal->vfio_container_fd = rte_vfio_container_create();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	internal->vfio_group_fd = rte_vfio_bind_group(
+			internal->vfio_container_fd, iommu_group_no);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_container_destroy(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx.", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		rte_vfio_container_dma_map(vfio_container_fd,
+				reg->host_user_addr, reg->guest_phys_addr,
+				reg->size);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+
+		reg = &mem->regions[i];
+		rte_vfio_container_dma_unmap(vfio_container_fd,
+				reg->host_user_addr, reg->guest_phys_addr,
+				reg->size);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	uint32_t i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint32_t size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for only packet buffer,
+		 * SW helps to mark the used ring as dirty after device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			size = hw->vring[i].size * 8 + 4;
+			pfn = hw->vring[i].used / PAGE_SIZE;
+			for (j = 0; j <= size / PAGE_SIZE; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait failed: %s", strerror(errno));
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_set_features(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Cannot get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_features(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK | \
+		 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+static int
+ifcvf_get_protocol_features(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.get_queue_num = ifcvf_get_queue_num,
+	.get_features = ifcvf_get_vdpa_features,
+	.get_protocol_features = ifcvf_get_protocol_features,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.set_vring_state = NULL,
+	.set_features = ifcvf_set_features,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		goto error;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES) |
+		(1ULL << VHOST_F_LOG_ALL);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	internal->did = rte_vdpa_register_device(&internal->dev_addr,
+				&ifcvf_ops);
+	if (internal->did < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_container_destroy(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * IFCVF has the same vendor ID and device ID as virtio net PCI
+ * device, with its specific subsystem vendor ID and device ID.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("pmd.net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 005803a56..f6e7ccc37 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -186,6 +186,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v7 5/5] doc: add ifcvf driver document and release note
  2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
                                           ` (3 preceding siblings ...)
  2018-04-15 15:33                         ` [PATCH v7 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-15 15:33                         ` Xiao Wang
  4 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-15 15:33 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 +++
 doc/guides/nics/ifcvf.rst              | 98 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 116 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..d7e76353c
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,98 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
+works as a HW vhost backend which can send/receive packets to/from virtio
+directly by DMA. In addition, it supports dirty page logging and device state
+report/restore. This driver enables its vDPA functionality with live migration
+feature.
+
+
+Pre-Installation Configuration
+------------------------------
+
+Config File Options
+~~~~~~~~~~~~~~~~~~~
+
+The following option can be modified in the ``config`` file.
+
+- ``CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD`` (default ``y`` for linux)
+
+  Toggle compilation of the ``librte_ifcvf_vdpa`` driver.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, but it has its own subsystem vendor ID and device ID. To let the
+device be probed by the IFCVF driver, add the "vdpa=1" devarg to specify that
+this device is to be used in vDPA mode rather than polling mode; the virtio
+PMD will skip the device when it detects this parameter.
+
+Different VF devices serve different virtio frontends which are in different
+VMs, so each VF needs to have its own DMA address translation service. During
+the driver probe a new container is created for this device; with this
+container the vDPA driver can program the DMA remapping table with the VM's
+memory region information.
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable VF data path with virtio information provided by vhost lib, including
+  IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
+  route HW interrupt to virtio driver, create notify relay thread to translate
+  virtio driver's kick to a MMIO write onto HW, HW queues configuration.
+
+  This function gets called to set up the HW data path backend when the virtio
+  driver in the VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup in ifcvf_dev_config.
+
+  This function gets called when virtio driver stops device in VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via
+  vhost API. When QEMU vhost connection gets ready, the assigned VF will
+  get configured automatically.
+
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- Platform with IOMMU feature. IFC VF needs address translation service to
+  Rx/Tx directly with virtio driver in VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up VF MSI-X interrupts; each queue's interrupt
+vector is mapped to a callfd associated with a virtio ring. Currently only
+vfio-pci allows multiple interrupts, so the IFCVF driver depends on vfio-pci.
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+IFC VF doesn't support RARP packet generation; a virtio frontend supporting
+the VIRTIO_NET_F_GUEST_ANNOUNCE feature can help to do that.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index ea9110c81..9b98c620f 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -45,6 +45,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index a5f816f8a..d84c7de8f 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -82,6 +82,15 @@ New Features
   backend connects to. This means that if the backend restarts, it can reconnect
   to virtio-user and continue communications.
 
+* **Added IFCVF vDPA driver.**
+
+  Added the IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and is
+  compatible with virtio 0.95 and 1.0. The driver registers the ifcvf vDPA
+  driver to the vhost lib; when virtio is connected, the registered vDPA
+  driver configures the assigned VF to Rx/Tx directly to the VM's virtio
+  vrings.
+
 
 API Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v7 1/5] vfio: extend data structure for multi container
  2018-04-15 15:33                         ` [PATCH v7 1/5] vfio: extend data structure for multi container Xiao Wang
@ 2018-04-16 10:02                           ` Burakov, Anatoly
  2018-04-16 12:22                             ` Wang, Xiao W
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
  1 sibling, 1 reply; 98+ messages in thread
From: Burakov, Anatoly @ 2018-04-16 10:02 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Junjie Chen

On 15-Apr-18 4:33 PM, Xiao Wang wrote:
> Currently eal vfio framework binds vfio group fd to the default
> container fd during rte_vfio_setup_device, while in some cases,
> e.g. vDPA (vhost data path acceleration), we want to put vfio group
> to a separate container and program IOMMU via this container.
> 
> This patch extends the vfio_config structure to contain per-container
> user_mem_maps and defines an array of vfio_config. The next patch will
> base on this to add container API.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---
>   config/common_base                     |   1 +
>   lib/librte_eal/linuxapp/eal/eal_vfio.c | 407 ++++++++++++++++++++++-----------
>   lib/librte_eal/linuxapp/eal/eal_vfio.h |  19 +-
>   3 files changed, 275 insertions(+), 152 deletions(-)
> 
> diff --git a/config/common_base b/config/common_base
> index c4236fd1f..4a76d2f14 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -87,6 +87,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
>   CONFIG_RTE_EAL_IGB_UIO=n
>   CONFIG_RTE_EAL_VFIO=n
>   CONFIG_RTE_MAX_VFIO_GROUPS=64
> +CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>   CONFIG_RTE_MALLOC_DEBUG=n
>   CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>   CONFIG_RTE_USE_LIBBSD=n
> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> index 589d7d478..46fba2d8d 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> @@ -22,8 +22,46 @@
>   
>   #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
>   
> +/*
> + * we don't need to store device fd's anywhere since they can be obtained from
> + * the group fd via an ioctl() call.
> + */
> +struct vfio_group {
> +	int group_no;
> +	int fd;
> +	int devices;
> +};

What is the purpose of moving this into .c file? Seems like an 
unnecessary change.

> +
> +/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
> + * recreate the mappings for DPDK segments, but we cannot do so for memory that
> + * was registered by the user themselves, so we need to store the user mappings
> + * somewhere, to recreate them later.
> + */
> +#define VFIO_MAX_USER_MEM_MAPS 256
> +struct user_mem_map {
> +	uint64_t addr;
> +	uint64_t iova;
> +	uint64_t len;
> +};
> +

<...>

> +static struct vfio_config *
> +get_vfio_cfg_by_group_no(int iommu_group_no)
> +{
> +	struct vfio_config *vfio_cfg;
> +	int i, j;
> +
> +	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
> +		vfio_cfg = &vfio_cfgs[i];
> +		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
> +			if (vfio_cfg->vfio_groups[j].group_no ==
> +					iommu_group_no)
> +				return vfio_cfg;
> +		}
> +	}
> +
> +	return default_vfio_cfg;

Here and in other places: i'm not sure returning default vfio config if 
group not found is such a good idea. It would be better if calling code 
explicitly handled case of group not existing yet.
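
A minimal sketch of that (illustrative only, reusing the names from the
patch) would be to return NULL and let the caller pick the fallback:

static struct vfio_config *
get_vfio_cfg_by_group_no(int iommu_group_no)
{
	int i, j;

	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
			if (vfio_cfgs[i].vfio_groups[j].group_no ==
					iommu_group_no)
				return &vfio_cfgs[i];
		}
	}

	/* caller decides whether to fall back to default_vfio_cfg */
	return NULL;
}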

> +}
> +
> +static struct vfio_config *
> +get_vfio_cfg_by_group_fd(int vfio_group_fd)
> +{
> +	struct vfio_config *vfio_cfg;
> +	int i, j;
> +
> +	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
> +		vfio_cfg = &vfio_cfgs[i];
> +		for (j = 0; j < VFIO_MAX_GROUPS; j++)
> +			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
> +				return vfio_cfg;
> +	}
>   

<...>

> -	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> -		vfio_cfg.vfio_groups[i].fd = -1;
> -		vfio_cfg.vfio_groups[i].group_no = -1;
> -		vfio_cfg.vfio_groups[i].devices = 0;
> +	rte_spinlock_recursive_t lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER;
> +
> +	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
> +		vfio_cfgs[i].vfio_container_fd = -1;
> +		vfio_cfgs[i].vfio_active_groups = 0;
> +		vfio_cfgs[i].vfio_iommu_type = NULL;
> +		vfio_cfgs[i].mem_maps.lock = lock;

Nitpick - why copy, instead of straight up initializing with 
RTE_SPINLOCK_RECURSIVE_INITIALIZER?

> +
> +		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
> +			vfio_cfgs[i].vfio_groups[j].fd = -1;
> +			vfio_cfgs[i].vfio_groups[j].group_no = -1;
> +			vfio_cfgs[i].vfio_groups[j].devices = 0;
> +		}
>   	}
>   
>   	/* inform the user that we are probing for VFIO */
> @@ -841,12 +971,12 @@ rte_vfio_enable(const char *modname)
>   		return 0;
>   	}

<...>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v7 2/5] vfio: add multi container support
  2018-04-15 15:33                         ` [PATCH v7 2/5] vfio: add multi container support Xiao Wang
@ 2018-04-16 10:03                           ` Burakov, Anatoly
  2018-04-16 12:44                             ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Burakov, Anatoly @ 2018-04-16 10:03 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Junjie Chen

On 15-Apr-18 4:33 PM, Xiao Wang wrote:
> This patch adds APIs to support container create/destroy and device
> bind/unbind with a container. It also provides API for IOMMU programing
> on a specified container.
> 
> A driver could use "rte_vfio_create_container" helper to create a

^^ wrong API name in commit message :)

> new container from eal, use "rte_vfio_bind_group" to bind a device
> to the newly created container. During rte_vfio_setup_device the
> container bound with the device will be used for IOMMU setup.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---
>   lib/librte_eal/bsdapp/eal/eal.c          |  52 +++++
>   lib/librte_eal/common/include/rte_vfio.h | 119 ++++++++++++
>   lib/librte_eal/linuxapp/eal/eal_vfio.c   | 316 +++++++++++++++++++++++++++++++
>   lib/librte_eal/rte_eal_version.map       |   6 +
>   4 files changed, 493 insertions(+)
> 
> diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
> index 727adc5d2..c5106d0d6 100644
> --- a/lib/librte_eal/bsdapp/eal/eal.c
> +++ b/lib/librte_eal/bsdapp/eal/eal.c
> @@ -769,6 +769,14 @@ int rte_vfio_noiommu_is_enabled(void);
>   int rte_vfio_clear_group(int vfio_group_fd);
>   int rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
>   int rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
> +int rte_vfio_container_create(void);
> +int rte_vfio_container_destroy(int container_fd);
> +int rte_vfio_bind_group(int container_fd, int iommu_group_no);
> +int rte_vfio_unbind_group(int container_fd, int iommu_group_no);

Maybe have these under "container" too? e.g. 
rte_vfio_container_group_bind/unbind? Seems like it would be more 
consistent that way - anything to do with custom containers would be 
under rte_vfio_container_* namespace.
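
For illustration, that would give declarations along these lines:

int rte_vfio_container_group_bind(int container_fd, int iommu_group_no);
int rte_vfio_container_group_unbind(int container_fd, int iommu_group_no);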

> +int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
> +		uint64_t iova, uint64_t len);
> +int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
> +		uint64_t iova, uint64_t len);
>   
>   int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
>   		      __rte_unused const char *dev_addr,
> @@ -818,3 +826,47 @@ rte_vfio_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova,
>   {
>   	return -1;
>   }
> +

<...>

> diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
> index d26ab01cb..0c1509b29 100644
> --- a/lib/librte_eal/common/include/rte_vfio.h
> +++ b/lib/librte_eal/common/include/rte_vfio.h
> @@ -168,6 +168,125 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
>   int __rte_experimental
>   rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Create a new container for device binding.

I would add a note that any newly allocated DPDK memory will not be 
mapped into these containers by default.

> + *
> + * @return
> + *   the container fd if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_container_create(void);
> +

<...>

> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_unbind_group(int container_fd, int iommu_group_no);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma mapping for devices in a conainer.

Here and in other places: "dma" should be DMA, and typo: "conainer" :)

I think you should also add a note to the original API (not this one, 
but the old one) that DMA maps done via that API will only apply to 
default container and will not apply to any of the containers created 
via container_create(). IOW, documentation should make it clear that if 
you use this functionality, you're on your own and you have to manage 
your own DMA mappings for any containers you create.
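
For example, the old API's doc block could gain something along these lines
(sketch, exact wording up to you):

 * @note DMA maps done via this API will only apply to the default container
 *       and will not apply to any of the containers created with
 *       rte_vfio_container_create().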

> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param vaddr
> + *   Starting virtual address of memory to be mapped.
> + *

<...>

> +
> +int __rte_experimental
> +rte_vfio_container_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
> +		uint64_t len)
> +{
> +	struct user_mem_map *new_map;
> +	struct vfio_config *vfio_cfg;
> +	struct user_mem_maps *user_mem_maps;
> +	int ret = 0;
> +
> +	if (len == 0) {
> +		rte_errno = EINVAL;
> +		return -1;
> +	}
> +
> +	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
> +	if (vfio_cfg == NULL) {
> +		RTE_LOG(ERR, EAL, "Invalid container fd\n");
> +		return -1;
> +	}
> +
> +	user_mem_maps = &vfio_cfg->mem_maps;
> +	rte_spinlock_recursive_lock(&user_mem_maps->lock);
> +	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
> +		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
> +		rte_errno = ENOMEM;
> +		ret = -1;
> +		goto out;
> +	}
> +	/* map the entry */
> +	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 1)) {
> +		/* technically, this will fail if there are currently no devices
> +		 * plugged in, even if a device were added later, this mapping
> +		 * might have succeeded. however, since we cannot verify if this
> +		 * is a valid mapping without having a device attached, consider
> +		 * this to be unsupported, because we can't just store any old
> +		 * mapping and pollute list of active mappings willy-nilly.
> +		 */
> +		RTE_LOG(ERR, EAL, "Couldn't map new region for DMA\n");
> +		ret = -1;
> +		goto out;
> +	}
> +	/* create new user mem map entry */
> +	new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
> +	new_map->addr = vaddr;
> +	new_map->iova = iova;
> +	new_map->len = len;
> +
> +	compact_user_maps(user_mem_maps);
> +out:
> +	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
> +	return ret;

Please correct me if i'm wrong, but it looks like you've just duplicated 
the code for rte_vfio_dma_map() here and made a few small changes. It 
would be better if you moved most of this into a static function (e.g. 
static int container_dma_map(vfio_cfg, vaddr, iova, len)) and called it 
with either default vfio_cfg from rte_vfio_dma_map, or found vfio_cfg 
from rte_vfio_container_dma_map. Same applies to function below.
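
Roughly (sketch only, assuming the lookup helpers from patch 1/5):

/* shared body, operating on an explicit vfio_cfg rather than the default
 * one; the len/limit checks, locking and mapping logic currently in
 * rte_vfio_dma_map() would live here
 */
static int
container_dma_map(struct vfio_config *vfio_cfg, uint64_t vaddr,
		uint64_t iova, uint64_t len);

int __rte_experimental
rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
{
	return container_dma_map(default_vfio_cfg, vaddr, iova, len);
}

int __rte_experimental
rte_vfio_container_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
		uint64_t len)
{
	struct vfio_config *vfio_cfg;

	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
	if (vfio_cfg == NULL) {
		RTE_LOG(ERR, EAL, "Invalid container fd\n");
		return -1;
	}

	return container_dma_map(vfio_cfg, vaddr, iova, len);
}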

> +}
> +
> +int __rte_experimental
> +rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr, uint64_t iova,
> +		uint64_t len)
> +{
> +	struct user_mem_map *map, *new_map = NULL;
> +	struct vfio_config *vfio_cfg;
> +	struct user_mem_maps *user_mem_maps;
> +	int ret = 0;
> +
> +	if (len == 0) {
> +		rte_errno = EINVAL;
> +		return -1;
> +	}
> +

<...>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v7 1/5] vfio: extend data structure for multi container
  2018-04-16 10:02                           ` Burakov, Anatoly
@ 2018-04-16 12:22                             ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-16 12:22 UTC (permalink / raw)
  To: Burakov, Anatoly, Yigit, Ferruh
  Cc: dev, maxime.coquelin, Wang, Zhihong, Bie, Tiwei, Tan, Jianfeng,
	Liang, Cunming, Daly, Dan, thomas, Chen, Junjie J

Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Monday, April 16, 2018 6:03 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly,
> Dan <dan.daly@intel.com>; thomas@monjalon.net; Chen, Junjie J
> <junjie.j.chen@intel.com>
> Subject: Re: [PATCH v7 1/5] vfio: extend data structure for multi container
> 
> On 15-Apr-18 4:33 PM, Xiao Wang wrote:
> > Currently eal vfio framework binds vfio group fd to the default
> > container fd during rte_vfio_setup_device, while in some cases,
> > e.g. vDPA (vhost data path acceleration), we want to put vfio group
> > to a separate container and program IOMMU via this container.
> >
> > This patch extends the vfio_config structure to contain per-container
> > user_mem_maps and defines an array of vfio_config. The next patch will
> > base on this to add container API.
> >
> > Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> > ---
> >   config/common_base                     |   1 +
> >   lib/librte_eal/linuxapp/eal/eal_vfio.c | 407 ++++++++++++++++++++++-------
> ----
> >   lib/librte_eal/linuxapp/eal/eal_vfio.h |  19 +-
> >   3 files changed, 275 insertions(+), 152 deletions(-)
> >
> > diff --git a/config/common_base b/config/common_base
> > index c4236fd1f..4a76d2f14 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -87,6 +87,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
> >   CONFIG_RTE_EAL_IGB_UIO=n
> >   CONFIG_RTE_EAL_VFIO=n
> >   CONFIG_RTE_MAX_VFIO_GROUPS=64
> > +CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >   CONFIG_RTE_MALLOC_DEBUG=n
> >   CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >   CONFIG_RTE_USE_LIBBSD=n
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > index 589d7d478..46fba2d8d 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > @@ -22,8 +22,46 @@
> >
> >   #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
> >
> > +/*
> > + * we don't need to store device fd's anywhere since they can be obtained
> from
> > + * the group fd via an ioctl() call.
> > + */
> > +struct vfio_group {
> > +	int group_no;
> > +	int fd;
> > +	int devices;
> > +};
> 
> What is the purpose of moving this into .c file? Seems like an
> unnecessary change.

Yes, we can let vfio_group stay at .h, and move vfio_config into .c

> 
> > +
> > +/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped.
> we can
> > + * recreate the mappings for DPDK segments, but we cannot do so for
> memory that
> > + * was registered by the user themselves, so we need to store the user
> mappings
> > + * somewhere, to recreate them later.
> > + */
> > +#define VFIO_MAX_USER_MEM_MAPS 256
> > +struct user_mem_map {
> > +	uint64_t addr;
> > +	uint64_t iova;
> > +	uint64_t len;
> > +};
> > +
> 
> <...>
> 
> > +static struct vfio_config *
> > +get_vfio_cfg_by_group_no(int iommu_group_no)
> > +{
> > +	struct vfio_config *vfio_cfg;
> > +	int i, j;
> > +
> > +	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
> > +		vfio_cfg = &vfio_cfgs[i];
> > +		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
> > +			if (vfio_cfg->vfio_groups[j].group_no ==
> > +					iommu_group_no)
> > +				return vfio_cfg;
> > +		}
> > +	}
> > +
> > +	return default_vfio_cfg;
> 
> Here and in other places: i'm not sure returning default vfio config if
> group not found is such a good idea. It would be better if calling code
> explicitly handled case of group not existing yet.

Agree. It would be explicit.

> 
> > +}
> > +
> > +static struct vfio_config *
> > +get_vfio_cfg_by_group_fd(int vfio_group_fd)
> > +{
> > +	struct vfio_config *vfio_cfg;
> > +	int i, j;
> > +
> > +	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
> > +		vfio_cfg = &vfio_cfgs[i];
> > +		for (j = 0; j < VFIO_MAX_GROUPS; j++)
> > +			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
> > +				return vfio_cfg;
> > +	}
> >
> 
> <...>
> 
> > -	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> > -		vfio_cfg.vfio_groups[i].fd = -1;
> > -		vfio_cfg.vfio_groups[i].group_no = -1;
> > -		vfio_cfg.vfio_groups[i].devices = 0;
> > +	rte_spinlock_recursive_t lock =
> RTE_SPINLOCK_RECURSIVE_INITIALIZER;
> > +
> > +	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
> > +		vfio_cfgs[i].vfio_container_fd = -1;
> > +		vfio_cfgs[i].vfio_active_groups = 0;
> > +		vfio_cfgs[i].vfio_iommu_type = NULL;
> > +		vfio_cfgs[i].mem_maps.lock = lock;
> 
> Nitpick - why copy, instead of straight up initializing with
> RTE_SPINLOCK_RECURSIVE_INITIALIZER?

I tried, but the compiler doesn't allow this assignment.
RTE_SPINLOCK_RECURSIVE_INITIALIZER can only be used for initialization.
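
To illustrate (minimal sketch):

	/* initialization - accepted by the compiler */
	rte_spinlock_recursive_t lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER;

	/* assignment - this form is what the compiler rejects */
	vfio_cfgs[i].mem_maps.lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER;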

Thanks for the comments,
Xiao

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v7 2/5] vfio: add multi container support
  2018-04-16 10:03                           ` Burakov, Anatoly
@ 2018-04-16 12:44                             ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-16 12:44 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, maxime.coquelin, Wang, Zhihong, Bie, Tiwei, Tan, Jianfeng,
	Liang, Cunming, Daly, Dan, thomas, Chen, Junjie J, Yigit, Ferruh

Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Monday, April 16, 2018 6:03 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly,
> Dan <dan.daly@intel.com>; thomas@monjalon.net; Chen, Junjie J
> <junjie.j.chen@intel.com>
> Subject: Re: [PATCH v7 2/5] vfio: add multi container support
> 
> On 15-Apr-18 4:33 PM, Xiao Wang wrote:
> > This patch adds APIs to support container create/destroy and device
> > bind/unbind with a container. It also provides API for IOMMU programing
> > on a specified container.
> >
> > A driver could use "rte_vfio_create_container" helper to create a
> 
> ^^ wrong API name in commit message :)

Thanks for the catch. Will fix it.

> 
> > new container from eal, use "rte_vfio_bind_group" to bind a device
> > to the newly created container. During rte_vfio_setup_device the
> > container bound with the device will be used for IOMMU setup.
> >
> > Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> > ---
> >   lib/librte_eal/bsdapp/eal/eal.c          |  52 +++++
> >   lib/librte_eal/common/include/rte_vfio.h | 119 ++++++++++++
> >   lib/librte_eal/linuxapp/eal/eal_vfio.c   | 316
> +++++++++++++++++++++++++++++++
> >   lib/librte_eal/rte_eal_version.map       |   6 +
> >   4 files changed, 493 insertions(+)
> >
> > diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
> > index 727adc5d2..c5106d0d6 100644
> > --- a/lib/librte_eal/bsdapp/eal/eal.c
> > +++ b/lib/librte_eal/bsdapp/eal/eal.c
> > @@ -769,6 +769,14 @@ int rte_vfio_noiommu_is_enabled(void);
> >   int rte_vfio_clear_group(int vfio_group_fd);
> >   int rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
> >   int rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
> > +int rte_vfio_container_create(void);
> > +int rte_vfio_container_destroy(int container_fd);
> > +int rte_vfio_bind_group(int container_fd, int iommu_group_no);
> > +int rte_vfio_unbind_group(int container_fd, int iommu_group_no);
> 
> Maybe have these under "container" too? e.g.
> rte_vfio_container_group_bind/unbind? Seems like it would be more
> consistent that way - anything to do with custom containers would be
> under rte_vfio_container_* namespace.

Agree.

> 
> > +int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
> > +		uint64_t iova, uint64_t len);
> > +int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
> > +		uint64_t iova, uint64_t len);
> >
> >   int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
> >   		      __rte_unused const char *dev_addr,
> > @@ -818,3 +826,47 @@ rte_vfio_dma_unmap(uint64_t __rte_unused vaddr,
> uint64_t __rte_unused iova,
> >   {
> >   	return -1;
> >   }
> > +
> 
> <...>
> 
> > diff --git a/lib/librte_eal/common/include/rte_vfio.h
> b/lib/librte_eal/common/include/rte_vfio.h
> > index d26ab01cb..0c1509b29 100644
> > --- a/lib/librte_eal/common/include/rte_vfio.h
> > +++ b/lib/librte_eal/common/include/rte_vfio.h
> > @@ -168,6 +168,125 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova,
> uint64_t len);
> >   int __rte_experimental
> >   rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> > + *
> > + * Create a new container for device binding.
> 
> I would add a note that any newly allocated DPDK memory will not be
> mapped into these containers by default.

Will add it.

> 
> > + *
> > + * @return
> > + *   the container fd if successful
> > + *   <0 if failed
> > + */
> > +int __rte_experimental
> > +rte_vfio_container_create(void);
> > +
> 
> <...>
> 
> > + *    0 if successful
> > + *   <0 if failed
> > + */
> > +int __rte_experimental
> > +rte_vfio_unbind_group(int container_fd, int iommu_group_no);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> > + *
> > + * Perform dma mapping for devices in a conainer.
> 
> Here and in other places: "dma" should be DMA, and typo: "conainer" :)
> 
> I think you should also add a note to the original API (not this one,
> but the old one) that DMA maps done via that API will only apply to
> default container and will not apply to any of the containers created
> via container_create(). IOW, documentation should make it clear that if
> you use this functionality, you're on your own and you have to manage
> your own DMA mappings for any containers you create.

OK, will add note to clearly describe it.

> 
> > + *
> > + * @param container_fd
> > + *   the specified container fd
> > + *
> > + * @param vaddr
> > + *   Starting virtual address of memory to be mapped.
> > + *
> 
> <...>
> 
> > +
> > +int __rte_experimental
> > +rte_vfio_container_dma_map(int container_fd, uint64_t vaddr, uint64_t
> iova,
> > +		uint64_t len)
> > +{
> > +	struct user_mem_map *new_map;
> > +	struct vfio_config *vfio_cfg;
> > +	struct user_mem_maps *user_mem_maps;
> > +	int ret = 0;
> > +
> > +	if (len == 0) {
> > +		rte_errno = EINVAL;
> > +		return -1;
> > +	}
> > +
> > +	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
> > +	if (vfio_cfg == NULL) {
> > +		RTE_LOG(ERR, EAL, "Invalid container fd\n");
> > +		return -1;
> > +	}
> > +
> > +	user_mem_maps = &vfio_cfg->mem_maps;
> > +	rte_spinlock_recursive_lock(&user_mem_maps->lock);
> > +	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
> > +		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
> > +		rte_errno = ENOMEM;
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +	/* map the entry */
> > +	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 1)) {
> > +		/* technically, this will fail if there are currently no devices
> > +		 * plugged in, even if a device were added later, this mapping
> > +		 * might have succeeded. however, since we cannot verify if
> this
> > +		 * is a valid mapping without having a device attached,
> consider
> > +		 * this to be unsupported, because we can't just store any old
> > +		 * mapping and pollute list of active mappings willy-nilly.
> > +		 */
> > +		RTE_LOG(ERR, EAL, "Couldn't map new region for DMA\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +	/* create new user mem map entry */
> > +	new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
> > +	new_map->addr = vaddr;
> > +	new_map->iova = iova;
> > +	new_map->len = len;
> > +
> > +	compact_user_maps(user_mem_maps);
> > +out:
> > +	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
> > +	return ret;
> 
> Please correct me if i'm wrong, but it looks like you've just duplicated
> the code for rte_vfio_dma_map() here and made a few small changes. It
> would be better if you moved most of this into a static function (e.g.
> static int container_dma_map(vfio_cfg, vaddr, iova, len)) and called it
> with either default vfio_cfg from rte_vfio_dma_map, or found vfio_cfg
> from rte_vfio_container_dma_map. Same applies to function below.

Agree, will do it in v8.

BRs,
Xiao

> 
> > +}
> > +
> > +int __rte_experimental
> > +rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr, uint64_t
> iova,
> > +		uint64_t len)
> > +{
> > +	struct user_mem_map *map, *new_map = NULL;
> > +	struct vfio_config *vfio_cfg;
> > +	struct user_mem_maps *user_mem_maps;
> > +	int ret = 0;
> > +
> > +	if (len == 0) {
> > +		rte_errno = EINVAL;
> > +		return -1;
> > +	}
> > +
> 
> <...>
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v8 0/5] add ifcvf vdpa driver
  2018-04-15 15:33                         ` [PATCH v7 1/5] vfio: extend data structure for multi container Xiao Wang
  2018-04-16 10:02                           ` Burakov, Anatoly
@ 2018-04-16 15:34                           ` Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 1/5] vfio: extend data structure for multi container Xiao Wang
                                               ` (5 more replies)
  1 sibling, 6 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-16 15:34 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. In addition, it supports dirty page logging and device state
report/restore, based on which this driver enables vDPA functionality with
live migration support.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
device, but it has its own subsystem vendor ID and device ID. To let the
device be probed by the IFCVF driver, add the "vdpa=1" devarg to specify that
this device is to be used in vDPA mode rather than polling mode; the virtio
PMD will skip the device when it detects this parameter.

Container per device
====================
vDPA needs to create different containers for different devices, thus this
patch set adds some APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_container_create
- rte_vfio_container_destroy
- rte_vfio_container_group_bind
- rte_vfio_container_group_unbind

With this extension, a device can be put into a new, dedicated container,
rather than the default container used previously.
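
A typical flow for a vDPA driver using these APIs is roughly as follows
(sketch only, error handling omitted; sysfs_base/dev_addr/vaddr/iova/len
stand for the device location and the memory region to be mapped):

	int container_fd, group_num, vfio_dev_fd;
	struct vfio_device_info device_info;

	/* create a dedicated container for this device */
	container_fd = rte_vfio_container_create();

	/* bind the device's IOMMU group to the new container */
	rte_vfio_get_group_num(sysfs_base, dev_addr, &group_num);
	rte_vfio_container_group_bind(container_fd, group_num);

	/* IOMMU setup in rte_vfio_setup_device now uses the bound container */
	rte_vfio_setup_device(sysfs_base, dev_addr, &vfio_dev_fd, &device_info);

	/* DMA mappings for this container are programmed explicitly */
	rte_vfio_container_dma_map(container_fd, vaddr, iova, len);
	rte_vfio_container_dma_unmap(container_fd, vaddr, iova, len);

	/* teardown */
	rte_vfio_container_group_unbind(container_fd, group_num);
	rte_vfio_container_destroy(container_fd);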

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable the VF data path with the virtio information provided by the vhost
  lib, including: IOMMU programming to enable VF DMA to the VM's memory, VFIO
  interrupt setup to route HW interrupts to the virtio driver, creation of a
  notify relay thread to translate the virtio driver's kick into an MMIO
  write onto the HW, and HW queue configuration.

  This function gets called to set up the HW data path backend when the
  virtio driver in the VM becomes ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when the virtio driver stops the device in the VM.

Change log
==========
v8:
- Rebase on HEAD.
- Move vfio_group definition back to eal_vfio.h.
- Return NULL when the vfio group num/fd is not found, and let the caller
  handle that.
- Fix wrong API name in commit log.
- Rename the bind/unbind functions to rte_vfio_container_group_bind/unbind
  for consistency.
- Add notes for rte_vfio_container_create and rte_vfio_dma_map, and fix a
  typo in a comment.
- Extract the shared code of rte_vfio_dma_map and rte_vfio_container_dma_map
  into a common helper to avoid code duplication; same for the unmap path.

v7:
- Rebase on HEAD.
- Split the vfio patch into 2 parts, one for data structure extension, one for
  adding new API.
- Use a static vfio_config array instead of dynamic allocation.
- Change rte_vfio_container_dma_map/unmap's parameters to use (va, iova, len).

v6:
- Rebase on master branch.
- Document "vdpa" devarg in virtio documentation.
- Rename the ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
  consistency, and add it to the driver documentation.
- Add comments for ifcvf device ID.
- Minor code cleaning.

v5:
- Fix compilation on BSD; remove the rte_vfio.h include on BSD.

v4:
- Rebase on Zhihong's latest vDPA lib patch, with the vDPA ops name changes.
- Remove the "rte_vfio_get_group_fd" API; "rte_vfio_bind_group" will return
  the fd.
- Align the naming of the internal vfio_cfg search APIs.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept and make the driver a PCI driver, so it will get
  probed by the PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of an engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to the config file.
- Let the virtio PMD skip probing when a virtio device needs to work in vDPA
  mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (5):
  vfio: extend data structure for multi container
  vfio: add multi container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  98 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/nics/virtio.rst               |  13 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  35 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 842 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  52 ++
 lib/librte_eal/common/include/rte_vfio.h | 128 ++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 681 +++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   9 +-
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 21 files changed, 2329 insertions(+), 156 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v8 1/5] vfio: extend data structure for multi container
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
@ 2018-04-16 15:34                             ` Xiao Wang
  2018-04-16 15:56                               ` Burakov, Anatoly
  2018-04-16 15:34                             ` [PATCH v8 2/5] vfio: add multi container support Xiao Wang
                                               ` (4 subsequent siblings)
  5 siblings, 1 reply; 98+ messages in thread
From: Xiao Wang @ 2018-04-16 15:34 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang, Junjie Chen

Currently eal vfio framework binds vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put vfio group
to a separate container and program IOMMU via this container.

This patch extends the vfio_config structure to contain per-container
user_mem_maps and defines an array of vfio_config. The next patch will
base on this to add container API.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                     |   1 +
 lib/librte_eal/linuxapp/eal/eal_vfio.c | 420 ++++++++++++++++++++++-----------
 lib/librte_eal/linuxapp/eal/eal_vfio.h |   9 +-
 3 files changed, 289 insertions(+), 141 deletions(-)

diff --git a/config/common_base b/config/common_base
index 0b977a544..74ed0d8b1 100644
--- a/config/common_base
+++ b/config/common_base
@@ -87,6 +87,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 16ee7302a..6289f6316 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -22,8 +22,36 @@
 
 #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
 
+/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
+ * recreate the mappings for DPDK segments, but we cannot do so for memory that
+ * was registered by the user themselves, so we need to store the user mappings
+ * somewhere, to recreate them later.
+ */
+#define VFIO_MAX_USER_MEM_MAPS 256
+struct user_mem_map {
+	uint64_t addr;
+	uint64_t iova;
+	uint64_t len;
+};
+
+struct user_mem_maps {
+	rte_spinlock_recursive_t lock;
+	int n_maps;
+	struct user_mem_map maps[VFIO_MAX_USER_MEM_MAPS];
+};
+
+struct vfio_config {
+	int vfio_enabled;
+	int vfio_container_fd;
+	int vfio_active_groups;
+	const struct vfio_iommu_type *vfio_iommu_type;
+	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
+	struct user_mem_maps mem_maps;
+};
+
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config vfio_cfgs[VFIO_MAX_CONTAINERS];
+static struct vfio_config *default_vfio_cfg = &vfio_cfgs[0];
 
 static int vfio_type1_dma_map(int);
 static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
@@ -31,8 +59,8 @@ static int vfio_spapr_dma_map(int);
 static int vfio_spapr_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
 static int vfio_noiommu_dma_map(int);
 static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
-static int vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len,
-		int do_map);
+static int vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map);
 
 /* IOMMU types we support */
 static const struct vfio_iommu_type iommu_types[] = {
@@ -59,25 +87,6 @@ static const struct vfio_iommu_type iommu_types[] = {
 	},
 };
 
-/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
- * recreate the mappings for DPDK segments, but we cannot do so for memory that
- * was registered by the user themselves, so we need to store the user mappings
- * somewhere, to recreate them later.
- */
-#define VFIO_MAX_USER_MEM_MAPS 256
-struct user_mem_map {
-	uint64_t addr;
-	uint64_t iova;
-	uint64_t len;
-};
-static struct {
-	rte_spinlock_recursive_t lock;
-	int n_maps;
-	struct user_mem_map maps[VFIO_MAX_USER_MEM_MAPS];
-} user_mem_maps = {
-	.lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER
-};
-
 /* for sPAPR IOMMU, we will need to walk memseg list, but we cannot use
  * rte_memseg_walk() because by the time we enter callback we will be holding a
  * write lock, so regular rte-memseg_walk will deadlock. copying the same
@@ -206,14 +215,15 @@ merge_map(struct user_mem_map *left, struct user_mem_map *right)
 }
 
 static struct user_mem_map *
-find_user_mem_map(uint64_t addr, uint64_t iova, uint64_t len)
+find_user_mem_map(struct user_mem_maps *user_mem_maps, uint64_t addr,
+		uint64_t iova, uint64_t len)
 {
 	uint64_t va_end = addr + len;
 	uint64_t iova_end = iova + len;
 	int i;
 
-	for (i = 0; i < user_mem_maps.n_maps; i++) {
-		struct user_mem_map *map = &user_mem_maps.maps[i];
+	for (i = 0; i < user_mem_maps->n_maps; i++) {
+		struct user_mem_map *map = &user_mem_maps->maps[i];
 		uint64_t map_va_end = map->addr + map->len;
 		uint64_t map_iova_end = map->iova + map->len;
 
@@ -239,20 +249,20 @@ find_user_mem_map(uint64_t addr, uint64_t iova, uint64_t len)
 
 /* this will sort all user maps, and merge/compact any adjacent maps */
 static void
-compact_user_maps(void)
+compact_user_maps(struct user_mem_maps *user_mem_maps)
 {
 	int i, n_merged, cur_idx;
 
-	qsort(user_mem_maps.maps, user_mem_maps.n_maps,
-			sizeof(user_mem_maps.maps[0]), user_mem_map_cmp);
+	qsort(user_mem_maps->maps, user_mem_maps->n_maps,
+			sizeof(user_mem_maps->maps[0]), user_mem_map_cmp);
 
 	/* we'll go over the list backwards when merging */
 	n_merged = 0;
-	for (i = user_mem_maps.n_maps - 2; i >= 0; i--) {
+	for (i = user_mem_maps->n_maps - 2; i >= 0; i--) {
 		struct user_mem_map *l, *r;
 
-		l = &user_mem_maps.maps[i];
-		r = &user_mem_maps.maps[i + 1];
+		l = &user_mem_maps->maps[i];
+		r = &user_mem_maps->maps[i + 1];
 
 		if (is_null_map(l) || is_null_map(r))
 			continue;
@@ -266,12 +276,12 @@ compact_user_maps(void)
 	 */
 	if (n_merged > 0) {
 		cur_idx = 0;
-		for (i = 0; i < user_mem_maps.n_maps; i++) {
-			if (!is_null_map(&user_mem_maps.maps[i])) {
+		for (i = 0; i < user_mem_maps->n_maps; i++) {
+			if (!is_null_map(&user_mem_maps->maps[i])) {
 				struct user_mem_map *src, *dst;
 
-				src = &user_mem_maps.maps[i];
-				dst = &user_mem_maps.maps[cur_idx++];
+				src = &user_mem_maps->maps[i];
+				dst = &user_mem_maps->maps[cur_idx++];
 
 				if (src != dst) {
 					memcpy(dst, src, sizeof(*src));
@@ -279,41 +289,16 @@ compact_user_maps(void)
 				}
 			}
 		}
-		user_mem_maps.n_maps = cur_idx;
+		user_mem_maps->n_maps = cur_idx;
 	}
 }
 
-int
-rte_vfio_get_group_fd(int iommu_group_num)
+static int
+vfio_open_group_fd(int iommu_group_num)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_num == iommu_group_num)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_num == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
 	/* if primary, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
@@ -344,9 +329,6 @@ rte_vfio_get_group_fd(int iommu_group_num)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_num = iommu_group_num;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
 	/* if we're in a secondary process, request group fd from the primary
@@ -381,9 +363,6 @@ rte_vfio_get_group_fd(int iommu_group_num)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_num = iommu_group_num;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -393,56 +372,177 @@ rte_vfio_get_group_fd(int iommu_group_num)
 			return -1;
 		}
 	}
-	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_num(int iommu_group_num)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_num ==
+					iommu_group_num)
+				return vfio_cfg;
+		}
+	}
 
-static int
-get_vfio_group_idx(int vfio_group_fd)
+	return NULL;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return NULL;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_container_fd(int container_fd)
+{
+	int i;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i].vfio_container_fd == container_fd)
+			return &vfio_cfgs[i];
+	}
+
+	return NULL;
+}
+
+int
+rte_vfio_get_group_fd(int iommu_group_num)
 {
 	int i;
+	int vfio_group_fd;
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_num(iommu_group_num);
+	vfio_cfg = vfio_cfg ? vfio_cfg : default_vfio_cfg;
+
+	/* check if we already have the group descriptor open */
 	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
-			return i;
+		if (vfio_cfg->vfio_groups[i].group_num == iommu_group_num)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_num == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_num);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_num);
+		return -1;
+	}
+
+	cur_grp->group_num = iommu_group_num;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return;
+	}
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return;
+	}
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return -1;
+	}
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 static void
@@ -458,9 +558,11 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len)
 	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
 		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(vfio_va, vfio_va, len, 1);
+			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
+					len, 1);
 		else
-			vfio_dma_mem_map(vfio_va, vfio_va, len, 0);
+			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
+					len, 0);
 		return;
 	}
 
@@ -468,9 +570,11 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len)
 	ms = rte_mem_virt2memseg(addr, msl);
 	while (cur_len < len) {
 		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(ms->addr_64, ms->iova, ms->len, 1);
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
 		else
-			vfio_dma_mem_map(ms->addr_64, ms->iova, ms->len, 0);
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
 
 		cur_len += ms->len;
 		++ms;
@@ -482,16 +586,23 @@ rte_vfio_clear_group(int vfio_group_fd)
 {
 	int i;
 	int socket_fd, ret;
+	struct vfio_config *vfio_cfg;
+
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return -1;
+	}
 
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
 		if (i < 0)
 			return -1;
-		vfio_cfg.vfio_groups[i].group_num = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		vfio_cfg->vfio_groups[i].group_num = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -544,6 +655,9 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_num;
 	int i, ret;
@@ -592,12 +706,18 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_num(iommu_group_num);
+	vfio_cfg = vfio_cfg ? vfio_cfg : default_vfio_cfg;
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+	user_mem_maps = &vfio_cfg->mem_maps;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
 
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -615,12 +735,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * functionality.
 		 */
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1 &&
+				vfio_cfg->vfio_active_groups == 1 &&
 				vfio_group_device_count(vfio_group_fd) == 0) {
 			const struct vfio_iommu_type *t;
 
 			/* select an IOMMU type which we will be using */
-			t = vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+			t = vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -633,7 +753,10 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 			 * after registering callback, to prevent races
 			 */
 			rte_rwlock_read_lock(mem_lock);
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			if (vfio_cfg == default_vfio_cfg)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -644,22 +767,22 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				return -1;
 			}
 
-			vfio_cfg.vfio_iommu_type = t;
+			vfio_cfg->vfio_iommu_type = t;
 
 			/* re-map all user-mapped segments */
-			rte_spinlock_recursive_lock(&user_mem_maps.lock);
+			rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 			/* this IOMMU type may not support DMA mapping, but
 			 * if we have mappings in the list - that means we have
 			 * previously mapped something successfully, so we can
 			 * be sure that DMA mapping is supported.
 			 */
-			for (i = 0; i < user_mem_maps.n_maps; i++) {
+			for (i = 0; i < user_mem_maps->n_maps; i++) {
 				struct user_mem_map *map;
-				map = &user_mem_maps.maps[i];
+				map = &user_mem_maps->maps[i];
 
 				ret = t->dma_user_map_func(
-						vfio_cfg.vfio_container_fd,
+						vfio_container_fd,
 						map->addr, map->iova, map->len,
 						1);
 				if (ret) {
@@ -670,17 +793,20 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 							map->addr, map->iova,
 							map->len);
 					rte_spinlock_recursive_unlock(
-							&user_mem_maps.lock);
+							&user_mem_maps->lock);
 					rte_rwlock_read_unlock(mem_lock);
 					return -1;
 				}
 			}
-			rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+			rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 
 			/* register callback for mem events */
-			ret = rte_mem_event_callback_register(
+			if (vfio_cfg == default_vfio_cfg)
+				ret = rte_mem_event_callback_register(
 					VFIO_MEM_EVENT_CLB_NAME,
 					vfio_mem_event_callback);
+			else
+				ret = 0;
 			/* unlock memory hotplug */
 			rte_rwlock_read_unlock(mem_lock);
 
@@ -734,6 +860,7 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
 	int vfio_group_fd;
 	int iommu_group_num;
 	int ret;
@@ -763,6 +890,10 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 		goto out;
 	}
 
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_num(iommu_group_num);
+	vfio_cfg = vfio_cfg ? vfio_cfg : default_vfio_cfg;
+
 	/* At this point we got an active group. Closing it will make the
 	 * container detachment. If this is the last active group, VFIO kernel
 	 * code will unset the container and the IOMMU mappings.
@@ -800,7 +931,7 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 	/* if there are no active device groups, unregister the callback to
 	 * avoid spurious attempts to map/unmap memory from VFIO.
 	 */
-	if (vfio_cfg.vfio_active_groups == 0)
+	if (vfio_cfg == default_vfio_cfg && vfio_cfg->vfio_active_groups == 0)
 		rte_mem_event_callback_unregister(VFIO_MEM_EVENT_CLB_NAME);
 
 	/* success */
@@ -815,13 +946,22 @@ int
 rte_vfio_enable(const char *modname)
 {
 	/* initialize group list */
-	int i;
+	int i, j;
 	int vfio_available;
 
-	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_num = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+	rte_spinlock_recursive_t lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfgs[i].vfio_container_fd = -1;
+		vfio_cfgs[i].vfio_active_groups = 0;
+		vfio_cfgs[i].vfio_iommu_type = NULL;
+		vfio_cfgs[i].mem_maps.lock = lock;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			vfio_cfgs[i].vfio_groups[j].fd = -1;
+			vfio_cfgs[i].vfio_groups[j].group_num = -1;
+			vfio_cfgs[i].vfio_groups[j].devices = 0;
+		}
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -843,12 +983,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = rte_vfio_get_container_fd();
+	default_vfio_cfg->vfio_container_fd = rte_vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg->vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg->vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -860,7 +1000,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg->vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -1222,9 +1362,18 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	struct vfio_iommu_spapr_tce_create create = {
 		.argsz = sizeof(create),
 	};
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
 	int i, ret = 0;
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
+	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
+		return -1;
+	}
+
+	user_mem_maps = &vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* check if window size needs to be adjusted */
 	memset(&param, 0, sizeof(param));
@@ -1237,9 +1386,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	}
 
 	/* also check user maps */
-	for (i = 0; i < user_mem_maps.n_maps; i++) {
-		uint64_t max = user_mem_maps.maps[i].iova +
-				user_mem_maps.maps[i].len;
+	for (i = 0; i < user_mem_maps->n_maps; i++) {
+		uint64_t max = user_mem_maps->maps[i].iova +
+				user_mem_maps->maps[i].len;
 		create.window_size = RTE_MAX(create.window_size, max);
 	}
 
@@ -1265,9 +1414,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				goto out;
 			}
 			/* remap all user maps */
-			for (i = 0; i < user_mem_maps.n_maps; i++) {
+			for (i = 0; i < user_mem_maps->n_maps; i++) {
 				struct user_mem_map *map =
-						&user_mem_maps.maps[i];
+						&user_mem_maps->maps[i];
 				if (vfio_spapr_dma_do_map(vfio_container_fd,
 						map->addr, map->iova, map->len,
 						1)) {
@@ -1308,7 +1457,7 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
@@ -1360,9 +1509,10 @@ vfio_noiommu_dma_mem_map(int __rte_unused vfio_container_fd,
 }
 
 static int
-vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
+vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len, int do_map)
 {
-	const struct vfio_iommu_type *t = vfio_cfg.vfio_iommu_type;
+	const struct vfio_iommu_type *t = vfio_cfg->vfio_iommu_type;
 
 	if (!t) {
 		RTE_LOG(ERR, EAL, "  VFIO support not initialized\n");
@@ -1378,7 +1528,7 @@ vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
 		return -1;
 	}
 
-	return t->dma_user_map_func(vfio_cfg.vfio_container_fd, vaddr, iova,
+	return t->dma_user_map_func(vfio_cfg->vfio_container_fd, vaddr, iova,
 			len, do_map);
 }
 
@@ -1386,6 +1536,7 @@ int __rte_experimental
 rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 {
 	struct user_mem_map *new_map;
+	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
 	if (len == 0) {
@@ -1393,15 +1544,16 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		return -1;
 	}
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
-	if (user_mem_maps.n_maps == VFIO_MAX_USER_MEM_MAPS) {
+	user_mem_maps = &default_vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
+	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
 		rte_errno = ENOMEM;
 		ret = -1;
 		goto out;
 	}
 	/* map the entry */
-	if (vfio_dma_mem_map(vaddr, iova, len, 1)) {
+	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 1)) {
 		/* technically, this will fail if there are currently no devices
 		 * plugged in, even if a device were added later, this mapping
 		 * might have succeeded. however, since we cannot verify if this
@@ -1414,14 +1566,14 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		goto out;
 	}
 	/* create new user mem map entry */
-	new_map = &user_mem_maps.maps[user_mem_maps.n_maps++];
+	new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
 	new_map->addr = vaddr;
 	new_map->iova = iova;
 	new_map->len = len;
 
-	compact_user_maps();
+	compact_user_maps(user_mem_maps);
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
@@ -1429,6 +1581,7 @@ int __rte_experimental
 rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 {
 	struct user_mem_map *map, *new_map = NULL;
+	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
 	if (len == 0) {
@@ -1436,10 +1589,11 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 		return -1;
 	}
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
+	user_mem_maps = &default_vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* find our mapping */
-	map = find_user_mem_map(vaddr, iova, len);
+	map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
 	if (!map) {
 		RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
 		rte_errno = EINVAL;
@@ -1450,17 +1604,17 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
-		if (user_mem_maps.n_maps == VFIO_MAX_USER_MEM_MAPS) {
+		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
 			ret = -1;
 			goto out;
 		}
-		new_map = &user_mem_maps.maps[user_mem_maps.n_maps++];
+		new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
 	}
 
 	/* unmap the entry */
-	if (vfio_dma_mem_map(vaddr, iova, len, 0)) {
+	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 0)) {
 		/* there may not be any devices plugged in, so unmapping will
 		 * fail with ENODEV/ENOTSUP rte_errno values, but that doesn't
 		 * stop us from removing the mapping, as the assumption is we
@@ -1483,19 +1637,19 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 
 		/* if we've created a new map by splitting, sort everything */
 		if (!is_null_map(new_map)) {
-			compact_user_maps();
+			compact_user_maps(user_mem_maps);
 		} else {
 			/* we've created a new mapping, but it was unused */
-			user_mem_maps.n_maps--;
+			user_mem_maps->n_maps--;
 		}
 	} else {
 		memset(map, 0, sizeof(*map));
-		compact_user_maps();
-		user_mem_maps.n_maps--;
+		compact_user_maps(user_mem_maps);
+		user_mem_maps->n_maps--;
 	}
 
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index c788bba44..18f85fb4f 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -82,6 +82,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
@@ -102,14 +103,6 @@ struct vfio_group {
 	int devices;
 };
 
-struct vfio_config {
-	int vfio_enabled;
-	int vfio_container_fd;
-	int vfio_active_groups;
-	const struct vfio_iommu_type *vfio_iommu_type;
-	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
-};
-
 /* DMA mapping function prototype.
  * Takes VFIO container fd as a parameter.
  * Returns 0 on success, -1 on error.
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 2/5] vfio: add multi container support
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 1/5] vfio: extend data structure for multi container Xiao Wang
@ 2018-04-16 15:34                             ` Xiao Wang
  2018-04-16 15:58                               ` Burakov, Anatoly
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
                                               ` (3 subsequent siblings)
  5 siblings, 2 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-16 15:34 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang, Junjie Chen

This patch adds APIs to support container create/destroy and device
bind/unbind with a container. It also provides an API for IOMMU programming
on a specified container.

A driver could use "rte_vfio_container_create" helper to create a new
container from eal, use "rte_vfio_container_group_bind" to bind a device
to the newly created container. During rte_vfio_setup_device the container
bound with the device will be used for IOMMU setup.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal.c          |  52 ++++++
 lib/librte_eal/common/include/rte_vfio.h | 128 ++++++++++++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 269 ++++++++++++++++++++++++++++---
 lib/librte_eal/rte_eal_version.map       |   6 +
 4 files changed, 436 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index bfbec0d7f..b5c0386e4 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -769,6 +769,14 @@ int rte_vfio_noiommu_is_enabled(void);
 int rte_vfio_clear_group(int vfio_group_fd);
 int rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
 int rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
+int rte_vfio_container_create(void);
+int rte_vfio_container_destroy(int container_fd);
+int rte_vfio_container_group_bind(int container_fd, int iommu_group_num);
+int rte_vfio_container_group_unbind(int container_fd, int iommu_group_num);
+int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
 
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
@@ -838,3 +846,47 @@ rte_vfio_get_group_fd(__rte_unused int iommu_group_num)
 {
 	return -1;
 }
+
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_bind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_unbind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(__rte_unused int container_fd,
+			__rte_unused uint64_t vaddr,
+			__rte_unused uint64_t iova,
+			__rte_unused uint64_t len)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(__rte_unused int container_fd,
+			__rte_unused uint64_t vaddr,
+			__rte_unused uint64_t iova,
+			__rte_unused uint64_t len)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index c4a2e606f..c10c206a3 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -154,7 +154,10 @@ rte_vfio_clear_group(int vfio_group_fd);
 /**
  * Map memory region for use with VFIO.
  *
- * @note requires at least one device to be attached at the time of mapping.
+ * @note Requires at least one device to be attached at the time of
+ *       mapping. DMA maps done via this API will only apply to the
+ *       default container and will not apply to any of the containers
+ *       created via rte_vfio_container_create().
  *
  * @param vaddr
  *   Starting virtual address of memory to be mapped.
@@ -245,6 +248,129 @@ rte_vfio_get_container_fd(void);
 int __rte_experimental
 rte_vfio_get_group_fd(int iommu_group_num);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container for device binding.
+ *
+ * @note Any newly allocated DPDK memory will not be mapped into these
+ *       containers by default; the user needs to manage DMA mappings
+ *       for any container created by this API.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_create(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_destroy(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind an IOMMU group to a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_num
+ *   the iommu group number to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_group_bind(int container_fd, int iommu_group_num);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind an IOMMU group from a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_num
+ *   the iommu group number to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_group_unbind(int container_fd, int iommu_group_num);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA mapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param vaddr
+ *   Starting virtual address of memory to be mapped.
+ *
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ *
+ * @param len
+ *   Length of memory segment being mapped.
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA unmapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param vaddr
+ *   Starting virtual address of memory to be unmapped.
+ *
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ *
+ * @param len
+ *   Length of memory segment being unmapped.
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 6289f6316..64ea194f0 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1532,19 +1532,15 @@ vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 			len, do_map);
 }
 
-int __rte_experimental
-rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
+static int
+container_dma_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
 {
 	struct user_mem_map *new_map;
 	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
-	if (len == 0) {
-		rte_errno = EINVAL;
-		return -1;
-	}
-
-	user_mem_maps = &default_vfio_cfg->mem_maps;
+	user_mem_maps = &vfio_cfg->mem_maps;
 	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
@@ -1553,7 +1549,7 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		goto out;
 	}
 	/* map the entry */
-	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 1)) {
+	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 1)) {
 		/* technically, this will fail if there are currently no devices
 		 * plugged in, even if a device were added later, this mapping
 		 * might have succeeded. however, since we cannot verify if this
@@ -1577,19 +1573,15 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 	return ret;
 }
 
-int __rte_experimental
-rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
+static int
+container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
 {
 	struct user_mem_map *map, *new_map = NULL;
 	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
-	if (len == 0) {
-		rte_errno = EINVAL;
-		return -1;
-	}
-
-	user_mem_maps = &default_vfio_cfg->mem_maps;
+	user_mem_maps = &vfio_cfg->mem_maps;
 	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* find our mapping */
@@ -1614,7 +1606,7 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 	}
 
 	/* unmap the entry */
-	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 0)) {
+	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 0)) {
 		/* there may not be any devices plugged in, so unmapping will
 		 * fail with ENODEV/ENOTSUP rte_errno values, but that doesn't
 		 * stop us from removing the mapping, as the assumption is we
@@ -1653,6 +1645,28 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 	return ret;
 }
 
+int __rte_experimental
+rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
+{
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return container_dma_map(default_vfio_cfg, vaddr, iova, len);
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
+{
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return container_dma_unmap(default_vfio_cfg, vaddr, iova, len);
+}
+
 int
 rte_vfio_noiommu_is_enabled(void)
 {
@@ -1685,6 +1699,181 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i].vfio_container_fd == -1)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i].vfio_container_fd = rte_vfio_get_container_fd();
+	if (vfio_cfgs[i].vfio_container_fd < 0) {
+		RTE_LOG(NOTICE, EAL, "fail to create a new container\n");
+		return -1;
+	}
+
+	return vfio_cfgs[i].vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_num != -1)
+			rte_vfio_container_group_unbind(container_fd,
+				vfio_cfg->vfio_groups[i].group_num);
+
+	close(container_fd);
+	vfio_cfg->vfio_container_fd = -1;
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_iommu_type = NULL;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_container_group_bind(int container_fd, int iommu_group_num)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	/* Check room for new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_num == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_num);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_num);
+		return -1;
+	}
+	cur_grp->group_num = iommu_group_num;
+	cur_grp->fd = vfio_group_fd;
+	cur_grp->devices = 0;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_container_group_unbind(int container_fd, int iommu_group_num)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (vfio_cfg->vfio_groups[i].group_num == iommu_group_num) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+			" iommu_group_num %d\n", iommu_group_num);
+		return -1;
+	}
+	cur_grp->group_num = -1;
+	cur_grp->fd = -1;
+	cur_grp->devices = 0;
+	vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct vfio_config *vfio_cfg;
+
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	return container_dma_map(vfio_cfg, vaddr, iova, len);
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct vfio_config *vfio_cfg;
+
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	return container_dma_unmap(vfio_cfg, vaddr, iova, len);
+}
+
 #else
 
 int __rte_experimental
@@ -1701,4 +1890,48 @@ rte_vfio_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova,
 	return -1;
 }
 
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_bind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_unbind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(__rte_unused int container_fd,
+		__rte_unused uint64_t vaddr,
+		__rte_unused uint64_t iova,
+		__rte_unused uint64_t len)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(__rte_unused int container_fd,
+		__rte_unused uint64_t vaddr,
+		__rte_unused uint64_t iova,
+		__rte_unused uint64_t len)
+{
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d02d80b8a..28f51f8d2 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -293,5 +293,11 @@ EXPERIMENTAL {
 	rte_vfio_get_container_fd;
 	rte_vfio_get_group_fd;
 	rte_vfio_get_group_num;
+	rte_vfio_container_create;
+	rte_vfio_container_destroy;
+	rte_vfio_container_dma_map;
+	rte_vfio_container_dma_unmap;
+	rte_vfio_container_group_bind;
+	rte_vfio_container_group_unbind;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 3/5] net/virtio: skip device probe in vdpa mode
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 1/5] vfio: extend data structure for multi container Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 2/5] vfio: add multi container support Xiao Wang
@ 2018-04-16 15:34                             ` Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
                                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-16 15:34 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we can add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio PMD skip device probe when it detects this
parameter.
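
For example (the PCI address below is only illustrative), the mode is
selected by attaching the devarg to the device on the EAL command line,
e.g. via the PCI whitelist option, so that another driver such as a vDPA
one can take the device:

    -w 0000:06:00.3,vdpa=1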

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/virtio.rst         | 13 ++++++++++++
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index ca09cd203..8922f9c0b 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -318,3 +318,16 @@ Here we use l3fwd-power as an example to show how to get started.
 
         $ l3fwd-power -l 0-1 -- -p 1 -P --config="(0,0,1)" \
                                                --no-numa --parse-ptype
+
+
+Virtio PMD arguments
+--------------------
+
+The user can specify the following argument in devargs.
+
+#.  ``vdpa``:
+
+    A virtio device can also be driven by a vDPA (vhost data path acceleration)
+    driver and work as a HW vhost backend. This argument is used to specify
+    that a virtio device needs to work in vDPA mode.
+    (Default: 0 (disabled))
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 41042cb23..5833dad73 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -28,6 +28,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1713,9 +1714,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 4/5] net/ifcvf: add ifcvf vdpa driver
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
                                               ` (2 preceding siblings ...)
  2018-04-16 15:34                             ` [PATCH v8 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-16 15:34                             ` Xiao Wang
  2018-04-16 15:34                             ` [PATCH v8 5/5] doc: add ifcvf driver document and release note Xiao Wang
  2018-04-16 16:36                             ` [PATCH v8 0/5] add ifcvf vdpa driver Ferruh Yigit
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-16 15:34 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang, Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible;
it works as a HW vhost backend which can send/receive packets to/from
virtio directly by DMA.

Different VF devices serve different virtio frontends which are in
different VMs, so each VF needs to have its own DMA address translation
service. During the driver probe a new container is created for the
device; with this container the vDPA driver can program the DMA remapping
table with the VM's memory region information.
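
A simplified sketch of that programming step (error handling omitted; the
vhost device id "vid" and the device's container fd are assumed to already
be known):

    #include <stdint.h>
    #include <stdlib.h>
    #include <rte_vhost.h>
    #include <rte_vfio.h>

    /* Mirror the VM's memory regions into a device-private container. */
    static int
    map_vm_memory(int vid, int container_fd)
    {
        struct rte_vhost_memory *mem = NULL;
        uint32_t i;

        if (rte_vhost_get_mem_table(vid, &mem) < 0)
            return -1;

        for (i = 0; i < mem->nregions; i++) {
            struct rte_vhost_mem_region *reg = &mem->regions[i];

            rte_vfio_container_dma_map(container_fd,
                    reg->host_user_addr, reg->guest_phys_addr,
                    reg->size);
        }

        free(mem);
        return 0;
    }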

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable the VF data path with virtio information provided by the vhost
  lib, including IOMMU programming to enable VF DMA to the VM's memory,
  VFIO interrupt setup to route HW interrupts to the virtio driver,
  creation of a notify relay thread to translate the virtio driver's kicks
  into MMIO writes onto HW, and HW queue configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

The live migration feature is supported by IFCVF and this driver enables
it. For dirty page logging, the VF logs packet buffer writes, and the
driver marks the used ring pages as dirty when the device stops.
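
Marking the used ring dirty boils down to setting one bit per guest page
in the vhost dirty log; a minimal sketch (the log_buf pointer, the gpa/len
of the used ring and the page size are assumptions supplied by the caller):

    #include <stdint.h>

    /* Mark every guest page in [gpa, gpa + len] dirty in the log bitmap. */
    static void
    mark_dirty(uint8_t *log_buf, uint64_t gpa, uint64_t len, uint64_t pg_sz)
    {
        uint64_t pfn;

        for (pfn = gpa / pg_sz; pfn <= (gpa + len) / pg_sz; pfn++)
            __sync_fetch_and_or(&log_buf[pfn / 8],
                    (uint8_t)(1 << (pfn % 8)));
    }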

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is currently supported.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  35 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 842 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1436 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index 74ed0d8b1..651188943 100644
--- a/config/common_base
+++ b/config/common_base
@@ -804,6 +804,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index d0437e5d6..14e56cb4d 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index dc5047e04..9f9da6651 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -58,6 +58,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..1011995bc
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->dev_cfg,
+			hw->isr, hw->notify_base,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)ring_state >> 16;
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_device_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..0a4666660
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,842 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+#ifndef PAGE_SIZE
+#define PAGE_SIZE 4096
+#endif
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_num;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	rte_vfio_get_group_num(rte_pci_get_sysfs_path(), devname,
+			&iommu_group_num);
+
+	internal->vfio_container_fd = rte_vfio_container_create();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	internal->vfio_group_fd = rte_vfio_container_group_bind(
+			internal->vfio_container_fd, iommu_group_num);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_container_destroy(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx.", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		rte_vfio_container_dma_map(vfio_container_fd,
+				reg->host_user_addr, reg->guest_phys_addr,
+				reg->size);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+
+		reg = &mem->regions[i];
+		rte_vfio_container_dma_unmap(vfio_container_fd,
+				reg->host_user_addr, reg->guest_phys_addr,
+				reg->size);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	uint32_t i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint32_t size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty pages only for the packet buffer; SW
+		 * helps to mark the used ring as dirty after device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			size = hw->vring[i].size * 8 + 4;
+			pfn = hw->vring[i].used / PAGE_SIZE;
+			for (j = 0; j <= size / PAGE_SIZE; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait returned failure");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_set_features(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Can not get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_features(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK | \
+		 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+static int
+ifcvf_get_protocol_features(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.get_queue_num = ifcvf_get_queue_num,
+	.get_features = ifcvf_get_vdpa_features,
+	.get_protocol_features = ifcvf_get_protocol_features,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.set_vring_state = NULL,
+	.set_features = ifcvf_set_features,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		goto error;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES) |
+		(1ULL << VHOST_F_LOG_ALL);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	internal->did = rte_vdpa_register_device(&internal->dev_addr,
+				&ifcvf_ops);
+	if (internal->did < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_container_destroy(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * IFCVF has the same vendor ID and device ID as virtio net PCI
+ * device, with its specific subsystem vendor ID and device ID.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("pmd.net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 005803a56..f6e7ccc37 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -186,6 +186,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v8 5/5] doc: add ifcvf driver document and release note
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
                                               ` (3 preceding siblings ...)
  2018-04-16 15:34                             ` [PATCH v8 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-16 15:34                             ` Xiao Wang
  2018-04-16 16:36                             ` [PATCH v8 0/5] add ifcvf vdpa driver Ferruh Yigit
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-16 15:34 UTC (permalink / raw)
  To: ferruh.yigit, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 +++
 doc/guides/nics/ifcvf.rst              | 98 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 116 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..d7e76353c
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,98 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
+works as a HW vhost backend which can send/receive packets to/from virtio
+directly by DMA. It also supports dirty page logging and device state
+report/restore. This driver enables the vDPA functionality together with the
+live migration feature.
+
+
+Pre-Installation Configuration
+------------------------------
+
+Config File Options
+~~~~~~~~~~~~~~~~~~~
+
+The following option can be modified in the ``config`` file.
+
+- ``CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD`` (default ``y`` for linux)
+
+  Toggle compilation of the ``librte_ifcvf_vdpa`` driver.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, but it has its own subsystem vendor ID and device ID. To have the
+device probed by the IFCVF driver, add the "vdpa=1" devarg to specify that the
+device is to be used in vDPA mode rather than polling mode; the virtio PMD
+will then skip such a device when it detects this devarg.
+
+Different VF devices serve different virtio frontends in different VMs, so
+each VF needs its own DMA address translation service. During driver probe a
+new VFIO container is created for the device; through this container the vDPA
+driver can program the DMA remapping table with the memory region information
+of the VM that the device serves.
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable the VF data path with virtio information provided by the vhost lib:
+  IOMMU programming to enable VF DMA to the VM's memory, VFIO interrupt setup
+  to route HW interrupts to the virtio driver, a notify relay thread to turn
+  the virtio driver's kick into an MMIO write onto HW, and HW queue setup.
+
+  This function gets called to set up HW data path backend when virtio driver
+  in VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup in ifcvf_dev_config.
+
+  This function gets called when virtio driver stops device in VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via the
+  vhost API. When the QEMU vhost connection gets ready, the assigned VF will
+  be configured automatically.
+
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- Platform with IOMMU feature. IFC VF needs an address translation service to
+  Rx/Tx directly with the virtio driver in the VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up VF MSI-X interrupts; each queue's interrupt
+vector is mapped to a callfd associated with a virtio ring. Currently only
+vfio-pci allows multiple interrupts, so the IFCVF driver depends on vfio-pci.
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+IFC VF does not support RARP packet generation; a virtio frontend that
+supports the VIRTIO_NET_F_GUEST_ANNOUNCE feature can do this instead.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index ea9110c81..9b98c620f 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -45,6 +45,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index 961820592..6742f4b5d 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -108,6 +108,15 @@ New Features
 
   Linux uevent is supported as backend of this device event notification framework.
 
+* **Added IFCVF vDPA driver.**
+
+  Added the IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and
+  is compatible with virtio 0.95 and 1.0. The driver registers the IFCVF vDPA
+  device with the vhost lib; once a virtio connection is established, the
+  registered vDPA driver configures the assigned VF to Rx/Tx directly to the
+  VM's virtio vrings.
+
 
 API Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 1/5] vfio: extend data structure for multi container
  2018-04-16 15:34                             ` [PATCH v8 1/5] vfio: extend data structure for multi container Xiao Wang
@ 2018-04-16 15:56                               ` Burakov, Anatoly
  0 siblings, 0 replies; 98+ messages in thread
From: Burakov, Anatoly @ 2018-04-16 15:56 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Junjie Chen

On 16-Apr-18 4:34 PM, Xiao Wang wrote:
> Currently eal vfio framework binds vfio group fd to the default
> container fd during rte_vfio_setup_device, while in some cases,
> e.g. vDPA (vhost data path acceleration), we want to put vfio group
> to a separate container and program IOMMU via this container.
> 
> This patch extends the vfio_config structure to contain per-container
> user_mem_maps and defines an array of vfio_config. The next patch will
> base on this to add container API.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 2/5] vfio: add multi container support
  2018-04-16 15:34                             ` [PATCH v8 2/5] vfio: add multi container support Xiao Wang
@ 2018-04-16 15:58                               ` Burakov, Anatoly
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
  1 sibling, 0 replies; 98+ messages in thread
From: Burakov, Anatoly @ 2018-04-16 15:58 UTC (permalink / raw)
  To: Xiao Wang, ferruh.yigit
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas, Junjie Chen

On 16-Apr-18 4:34 PM, Xiao Wang wrote:
> This patch adds APIs to support container create/destroy and device
> bind/unbind with a container. It also provides API for IOMMU programing
> on a specified container.
> 
> A driver could use "rte_vfio_container_create" helper to create a new
> container from eal, use "rte_vfio_container_group_bind" to bind a device
> to the newly created container. During rte_vfio_setup_device the container
> bound with the device will be used for IOMMU setup.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 0/5] add ifcvf vdpa driver
  2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
                                               ` (4 preceding siblings ...)
  2018-04-16 15:34                             ` [PATCH v8 5/5] doc: add ifcvf driver document and release note Xiao Wang
@ 2018-04-16 16:36                             ` Ferruh Yigit
  2018-04-16 18:07                               ` Thomas Monjalon
  5 siblings, 1 reply; 98+ messages in thread
From: Ferruh Yigit @ 2018-04-16 16:36 UTC (permalink / raw)
  To: Xiao Wang, anatoly.burakov
  Cc: dev, maxime.coquelin, zhihong.wang, tiwei.bie, jianfeng.tan,
	cunming.liang, dan.daly, thomas

On 4/16/2018 4:34 PM, Xiao Wang wrote:
> IFCVF driver
> ============
> The IFCVF vDPA (vhost data path acceleration) driver provides support for the
> Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible, it
> works as a HW vhost backend which can send/receive packets to/from virtio
> directly by DMA. Besides, it supports dirty page logging and device state
> report/restore. This driver enables its vDPA functionality with live migration
> feature.
> 
> vDPA mode
> =========
> IFCVF's vendor ID and device ID are same as that of virtio net pci device,
> with its specific subsystem vendor ID and device ID. To let the device be
> probed by IFCVF driver, adding "vdpa=1" parameter helps to specify that this
> device is to be used in vDPA mode, rather than polling mode, virtio pmd will
> skip when it detects this message.
> 
> Container per device
> ====================
> vDPA needs to create different containers for different devices, thus this
> patch set adds some APIs in eal/vfio to support multiple container, e.g.
> - rte_vfio_create_container
> - rte_vfio_destroy_container
> - rte_vfio_bind_group
> - rte_vfio_unbind_group
> 
> By this extension, a device can be put into a new specific container, rather
> than the previous default container.
> 
> IFCVF vDPA details
> ==================
> Key vDPA driver ops implemented:
> - ifcvf_dev_config:
>   Enable VF data path with virtio information provided by vhost lib, including
>   IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
>   route HW interrupt to virtio driver, create notify relay thread to translate
>   virtio driver's kick to a MMIO write onto HW, HW queues configuration.
> 
>   This function gets called to set up HW data path backend when virtio driver
>   in VM gets ready.
> 
> - ifcvf_dev_close:
>   Revoke all the setup in ifcvf_dev_config.
> 
>   This function gets called when virtio driver stops device in VM.
> 
> Change log
> ==========
> v8:
> - Rebase on HEAD.
> - Move vfio_group definition back to eal_vfio.h.
> - Return NULL when vfio group num/fd is not found, let caller handle that.
> - Fix wrong API name in commit log.
> - Rename bind/unbind function to rte_vfio_container_group_bind/unbind for
>   consistensy.
> - Add note for rte_vfio_container_create and rte_vfio_dma_map and fix typo
>   in comment.
> - Extract out the shared code snip of rte_vfio_dma_map and
>   rte_vfio_container_dma_map to avoid code duplication. So do for the unmap.
> 
> v7:
> - Rebase on HEAD.
> - Split the vfio patch into 2 parts, one for data structure extension, one for
>   adding new API.
> - Use static vfio_config array instead of dynamic alloating.
> - Change rte_vfio_container_dma_map/unmap's parameters to use (va, iova, len).
> 
> v6:
> - Rebase on master branch.
> - Document "vdpa" devarg in virtio documentation.
> - Rename ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
>   consistensy, and add it into driver documentation.
> - Add comments for ifcvf device ID.
> - Minor code cleaning.
> 
> v5:
> - Fix compilation in BSD, remove the rte_vfio.h including in BSD.
> 
> v4:
> - Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
> - Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
> - Align the vfio_cfg search internal APIs naming.
> 
> v3:
> - Add doc and release note for the new driver.
> - Remove the vdev concept, make the driver as a PCI driver, it will get probed
>   by PCI bus driver.
> - Rebase on the v4 vDPA lib patch, register a vDPA device instead of a engine.
> - Remove the PCI API exposure accordingly.
> - Move the MAX_VFIO_CONTAINERS definition to config file.
> - Let virtio pmd skips when a virtio device needs to work in vDPA mode.
> 
> v2:
> - Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
>   to make the API generic cross Linux and BSD, make it as EXPERIMENTAL.
> - Rebase on Zhihong's vDPA v3 patch set.
> - Minor code cleanup on vfio extension.
> 
> 
> Xiao Wang (5):
>   vfio: extend data structure for multi container
>   vfio: add multi container support
>   net/virtio: skip device probe in vdpa mode
>   net/ifcvf: add ifcvf vdpa driver
>   doc: add ifcvf driver document and release note


Hi Xiao,

Getting the following build errors for 32-bit [1], can you please check them?

[1]
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c: In function ‘ifcvf_dma_map’:
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:24:3: error: format ‘%lx’ expects argument
of type ‘long unsigned int’, but argument 6 has type ‘uint64_t {aka long long
unsigned int}’ [-Werror=format=]
   "%s(): " fmt "\n", __func__, ##args)
   ^
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:178:22:
    "size 0x%lx.", i, reg->host_user_addr,
                      ~~~~~~
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:177:3: note: in expansion of macro ‘DRV_LOG’
   DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
   ^~~~~~~
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:177:37: note: format string is defined here
   DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
                                   ~~^
                                   %llx
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:24:3: error: format ‘%lx’ expects argument
of type ‘long unsigned int’, but argument 7 has type ‘uint64_t {aka long long
unsigned int}’ [-Werror=format=]
   "%s(): " fmt "\n", __func__, ##args)
   ^
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:179:4:
    reg->guest_phys_addr, reg->size);
    ~~~~~~
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:177:3: note: in expansion of macro ‘DRV_LOG’
   DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
   ^~~~~~~
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:177:48: note: format string is defined here
   DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
                                              ~~^
                                              %llx
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:24:3: error: format ‘%lx’ expects argument
of type ‘long unsigned int’, but argument 8 has type ‘uint64_t {aka long long
unsigned int}’ [-Werror=format=]
   "%s(): " fmt "\n", __func__, ##args)
   ^
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:179:26:
    reg->guest_phys_addr, reg->size);
                          ~~~~~~
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:177:3: note: in expansion of macro ‘DRV_LOG’
   DRV_LOG(INFO, "region %u: HVA 0x%lx, GPA 0x%lx, "
   ^~~~~~~
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:178:14: note: format string is defined here
    "size 0x%lx.", i, reg->host_user_addr,
            ~~^
            %llx
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c: In function ‘vdpa_ifcvf_start’:
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:266:39: error: cast from pointer to
integer of different size [-Werror=pointer-to-int-cast]
   hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)vq.desc);
                                       ^
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:267:40: error: cast from pointer to
integer of different size [-Werror=pointer-to-int-cast]
   hw->vring[i].avail = qva_to_gpa(vid, (uint64_t)vq.avail);
                                        ^
.../dpdk/drivers/net/ifc/ifcvf_vdpa.c:268:39: error: cast from pointer to
integer of different size [-Werror=pointer-to-int-cast]
   hw->vring[i].used = qva_to_gpa(vid, (uint64_t)vq.used);
                                       ^
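
For reference, a sketch of the usual portable fixes for both classes of error
(the actual change in v9 may differ): print uint64_t with the PRIx64 macro
from <inttypes.h>, and cast pointers through uintptr_t before widening them
to uint64_t, e.g.

  DRV_LOG(INFO, "region %u: HVA 0x%" PRIx64 ", GPA 0x%" PRIx64 ", "
          "size 0x%" PRIx64 ".", i, reg->host_user_addr,
          reg->guest_phys_addr, reg->size);

  hw->vring[i].desc = qva_to_gpa(vid, (uint64_t)(uintptr_t)vq.desc);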

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 0/5] add ifcvf vdpa driver
  2018-04-16 16:36                             ` [PATCH v8 0/5] add ifcvf vdpa driver Ferruh Yigit
@ 2018-04-16 18:07                               ` Thomas Monjalon
  2018-04-17  5:36                                 ` Wang, Xiao W
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2018-04-16 18:07 UTC (permalink / raw)
  To: Xiao Wang
  Cc: Ferruh Yigit, anatoly.burakov, dev, maxime.coquelin,
	zhihong.wang, tiwei.bie, jianfeng.tan, cunming.liang, dan.daly

16/04/2018 18:36, Ferruh Yigit:
> Hi Xiao,
> 
> Getting following build error for 32bit [1], can you please check them?
> 
> [1]
> .../dpdk/drivers/net/ifc/ifcvf_vdpa.c: In function ‘ifcvf_dma_map’:
> .../dpdk/drivers/net/ifc/ifcvf_vdpa.c:24:3: error: format ‘%lx’ expects argument
> of type ‘long unsigned int’, but argument 6 has type ‘uint64_t {aka long long
> unsigned int}’ [-Werror=format=]

Reminder from this recent post:
	http://dpdk.org/ml/archives/dev/2018-February/090882.html
"
Most of the times, using %l is wrong (except when printing a long).
So next time you write %l, please think "I am probably wrong".
"

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v8 0/5] add ifcvf vdpa driver
  2018-04-16 18:07                               ` Thomas Monjalon
@ 2018-04-17  5:36                                 ` Wang, Xiao W
  0 siblings, 0 replies; 98+ messages in thread
From: Wang, Xiao W @ 2018-04-17  5:36 UTC (permalink / raw)
  To: Thomas Monjalon, Yigit, Ferruh
  Cc: Burakov, Anatoly, dev, maxime.coquelin, Wang, Zhihong, Bie,
	Tiwei, Tan, Jianfeng, Liang, Cunming, Daly, Dan

Thanks for the reminder. Will fix it.

BRs,
Xiao

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, April 17, 2018 2:07 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>
> Cc: Yigit, Ferruh <ferruh.yigit@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>; dev@dpdk.org; maxime.coquelin@redhat.com;
> Wang, Zhihong <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
> Tan, Jianfeng <jianfeng.tan@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [PATCH v8 0/5] add ifcvf vdpa driver
> 
> 16/04/2018 18:36, Ferruh Yigit:
> > Hi Xiao,
> >
> > Getting following build error for 32bit [1], can you please check them?
> >
> > [1]
> > .../dpdk/drivers/net/ifc/ifcvf_vdpa.c: In function ‘ifcvf_dma_map’:
> > .../dpdk/drivers/net/ifc/ifcvf_vdpa.c:24:3: error: format ‘%lx’ expects
> argument
> > of type ‘long unsigned int’, but argument 6 has type ‘uint64_t {aka long long
> > unsigned int}’ [-Werror=format=]
> 
> Reminder from this recent post:
> 	http://dpdk.org/ml/archives/dev/2018-February/090882.html
> "
> Most of the times, using %l is wrong (except when printing a long).
> So next time you write %l, please think "I am probably wrong".
> "
> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v9 0/5] add ifcvf vdpa driver
  2018-04-16 15:34                             ` [PATCH v8 2/5] vfio: add multi container support Xiao Wang
  2018-04-16 15:58                               ` Burakov, Anatoly
@ 2018-04-17  7:06                               ` Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 1/5] vfio: extend data structure for multi container Xiao Wang
                                                   ` (5 more replies)
  1 sibling, 6 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-17  7:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas, Xiao Wang

IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible; it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. It also supports dirty page logging and device state
report/restore. This driver enables the vDPA functionality together with the
live migration feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
device, but it has its own subsystem vendor ID and device ID. To have the
device probed by the IFCVF driver, add the "vdpa=1" devarg to specify that the
device is to be used in vDPA mode rather than polling mode; the virtio PMD
will then skip such a device when it detects this devarg.

Container per device
====================
vDPA needs to create a separate container for each device, so this patch set
adds APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_container_create
- rte_vfio_container_destroy
- rte_vfio_container_group_bind
- rte_vfio_container_group_unbind

By this extension, a device can be put into a new specific container, rather
than the previous default container.

Two APIs are added for IOMMU programming for a specified container:
- rte_vfio_container_dma_map
- rte_vfio_container_dma_unmap
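
For illustration, a minimal usage sketch of these APIs from a driver's point
of view (variable names are placeholders and error handling is omitted; the
ifc driver patch below is the real user):

  int container_fd, iommu_group_num;

  container_fd = rte_vfio_container_create();
  rte_vfio_get_group_num(pci_sysfs_base, pci_dev_addr, &iommu_group_num);
  rte_vfio_container_group_bind(container_fd, iommu_group_num);

  /* program the IOMMU of this container with the VM's memory regions */
  rte_vfio_container_dma_map(container_fd, host_va, guest_iova, size);
  rte_vfio_container_dma_unmap(container_fd, host_va, guest_iova, size);

  /* teardown */
  rte_vfio_container_group_unbind(container_fd, iommu_group_num);
  rte_vfio_container_destroy(container_fd);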

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable the VF data path with the virtio information provided by the vhost
  lib: IOMMU programming to enable VF DMA to the VM's memory, VFIO interrupt
  setup to route HW interrupts to the virtio driver, a notify relay thread to
  translate the virtio driver's kick into an MMIO write onto HW, and HW queue
  configuration.

  This function gets called to set up HW data path backend when virtio driver
  in VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when virtio driver stops device in VM.
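
To expose such a VF as a vhost-user port, an application creates a vhost
socket and attaches the VF's vDPA device ID to it. A rough sketch, assuming
the selective datapath vhost API from the dependency series (the socket path
and PCI address below are placeholders):

  struct rte_vdpa_dev_addr addr = { .type = PCI_ADDR };
  /* addr.pci_addr = <the VF's PCI address> */
  int did = rte_vdpa_find_device_id(&addr);

  rte_vhost_driver_register("/tmp/vhost-user-0", 0);
  rte_vhost_driver_attach_vdpa_device("/tmp/vhost-user-0", did);
  rte_vhost_driver_start("/tmp/vhost-user-0");

Once the QEMU vhost connection gets ready, the assigned VF is configured
automatically through ifcvf_dev_config.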

Change log
==========
v9:
- Rebase on master tree's HEAD.
- Fix compile error on 32-bit platform.

v8:
- Rebase on HEAD.
- Move vfio_group definition back to eal_vfio.h.
- Return NULL when vfio group num/fd is not found, let caller handle that.
- Fix wrong API name in commit log.
- Rename bind/unbind function to rte_vfio_container_group_bind/unbind for
  consistency.
- Add note for rte_vfio_container_create and rte_vfio_dma_map and fix typo
  in comment.
- Extract out the shared code snip of rte_vfio_dma_map and
  rte_vfio_container_dma_map to avoid code duplication. So do for the unmap.

v7:
- Rebase on HEAD.
- Split the vfio patch into 2 parts, one for data structure extension, one for
  adding new API.
- Use a static vfio_config array instead of dynamic allocation.
- Change rte_vfio_container_dma_map/unmap's parameters to use (va, iova, len).

v6:
- Rebase on master branch.
- Document "vdpa" devarg in virtio documentation.
- Rename ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
  consistency, and add it into the driver documentation.
- Add comments for ifcvf device ID.
- Minor code cleaning.

v5:
- Fix compilation in BSD, remove the rte_vfio.h including in BSD.

v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
- Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
- Align the vfio_cfg search internal APIs naming.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept, make the driver a PCI driver, so it will get probed
  by the PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of an engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let virtio pmd skips when a virtio device needs to work in vDPA mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (5):
  vfio: extend data structure for multi container
  vfio: add multi container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  98 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/nics/virtio.rst               |  13 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  35 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 846 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  44 ++
 lib/librte_eal/common/include/rte_vfio.h | 128 ++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 681 +++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   9 +-
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 21 files changed, 2325 insertions(+), 156 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

-- 
2.15.1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v9 1/5] vfio: extend data structure for multi container
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
@ 2018-04-17  7:06                                 ` Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 2/5] vfio: add multi container support Xiao Wang
                                                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-17  7:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas, Xiao Wang,
	Junjie Chen

Currently the eal vfio framework binds the vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put the vfio group
into a separate container and program the IOMMU via this container.

This patch extends the vfio_config structure to contain per-container
user_mem_maps and defines an array of vfio_config. The next patch will
build on this to add the container API.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base                     |   1 +
 lib/librte_eal/linuxapp/eal/eal_vfio.c | 420 ++++++++++++++++++++++-----------
 lib/librte_eal/linuxapp/eal/eal_vfio.h |   9 +-
 3 files changed, 289 insertions(+), 141 deletions(-)

diff --git a/config/common_base b/config/common_base
index c2b0d91e0..9b9f79ff8 100644
--- a/config/common_base
+++ b/config/common_base
@@ -87,6 +87,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index def71a668..974dcbe6d 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -22,8 +22,36 @@
 
 #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
 
+/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
+ * recreate the mappings for DPDK segments, but we cannot do so for memory that
+ * was registered by the user themselves, so we need to store the user mappings
+ * somewhere, to recreate them later.
+ */
+#define VFIO_MAX_USER_MEM_MAPS 256
+struct user_mem_map {
+	uint64_t addr;
+	uint64_t iova;
+	uint64_t len;
+};
+
+struct user_mem_maps {
+	rte_spinlock_recursive_t lock;
+	int n_maps;
+	struct user_mem_map maps[VFIO_MAX_USER_MEM_MAPS];
+};
+
+struct vfio_config {
+	int vfio_enabled;
+	int vfio_container_fd;
+	int vfio_active_groups;
+	const struct vfio_iommu_type *vfio_iommu_type;
+	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
+	struct user_mem_maps mem_maps;
+};
+
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config vfio_cfgs[VFIO_MAX_CONTAINERS];
+static struct vfio_config *default_vfio_cfg = &vfio_cfgs[0];
 
 static int vfio_type1_dma_map(int);
 static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
@@ -31,8 +59,8 @@ static int vfio_spapr_dma_map(int);
 static int vfio_spapr_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
 static int vfio_noiommu_dma_map(int);
 static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
-static int vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len,
-		int do_map);
+static int vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map);
 
 /* IOMMU types we support */
 static const struct vfio_iommu_type iommu_types[] = {
@@ -59,25 +87,6 @@ static const struct vfio_iommu_type iommu_types[] = {
 	},
 };
 
-/* hot plug/unplug of VFIO groups may cause all DMA maps to be dropped. we can
- * recreate the mappings for DPDK segments, but we cannot do so for memory that
- * was registered by the user themselves, so we need to store the user mappings
- * somewhere, to recreate them later.
- */
-#define VFIO_MAX_USER_MEM_MAPS 256
-struct user_mem_map {
-	uint64_t addr;
-	uint64_t iova;
-	uint64_t len;
-};
-static struct {
-	rte_spinlock_recursive_t lock;
-	int n_maps;
-	struct user_mem_map maps[VFIO_MAX_USER_MEM_MAPS];
-} user_mem_maps = {
-	.lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER
-};
-
 /* for sPAPR IOMMU, we will need to walk memseg list, but we cannot use
  * rte_memseg_walk() because by the time we enter callback we will be holding a
  * write lock, so regular rte-memseg_walk will deadlock. copying the same
@@ -206,14 +215,15 @@ merge_map(struct user_mem_map *left, struct user_mem_map *right)
 }
 
 static struct user_mem_map *
-find_user_mem_map(uint64_t addr, uint64_t iova, uint64_t len)
+find_user_mem_map(struct user_mem_maps *user_mem_maps, uint64_t addr,
+		uint64_t iova, uint64_t len)
 {
 	uint64_t va_end = addr + len;
 	uint64_t iova_end = iova + len;
 	int i;
 
-	for (i = 0; i < user_mem_maps.n_maps; i++) {
-		struct user_mem_map *map = &user_mem_maps.maps[i];
+	for (i = 0; i < user_mem_maps->n_maps; i++) {
+		struct user_mem_map *map = &user_mem_maps->maps[i];
 		uint64_t map_va_end = map->addr + map->len;
 		uint64_t map_iova_end = map->iova + map->len;
 
@@ -239,20 +249,20 @@ find_user_mem_map(uint64_t addr, uint64_t iova, uint64_t len)
 
 /* this will sort all user maps, and merge/compact any adjacent maps */
 static void
-compact_user_maps(void)
+compact_user_maps(struct user_mem_maps *user_mem_maps)
 {
 	int i, n_merged, cur_idx;
 
-	qsort(user_mem_maps.maps, user_mem_maps.n_maps,
-			sizeof(user_mem_maps.maps[0]), user_mem_map_cmp);
+	qsort(user_mem_maps->maps, user_mem_maps->n_maps,
+			sizeof(user_mem_maps->maps[0]), user_mem_map_cmp);
 
 	/* we'll go over the list backwards when merging */
 	n_merged = 0;
-	for (i = user_mem_maps.n_maps - 2; i >= 0; i--) {
+	for (i = user_mem_maps->n_maps - 2; i >= 0; i--) {
 		struct user_mem_map *l, *r;
 
-		l = &user_mem_maps.maps[i];
-		r = &user_mem_maps.maps[i + 1];
+		l = &user_mem_maps->maps[i];
+		r = &user_mem_maps->maps[i + 1];
 
 		if (is_null_map(l) || is_null_map(r))
 			continue;
@@ -266,12 +276,12 @@ compact_user_maps(void)
 	 */
 	if (n_merged > 0) {
 		cur_idx = 0;
-		for (i = 0; i < user_mem_maps.n_maps; i++) {
-			if (!is_null_map(&user_mem_maps.maps[i])) {
+		for (i = 0; i < user_mem_maps->n_maps; i++) {
+			if (!is_null_map(&user_mem_maps->maps[i])) {
 				struct user_mem_map *src, *dst;
 
-				src = &user_mem_maps.maps[i];
-				dst = &user_mem_maps.maps[cur_idx++];
+				src = &user_mem_maps->maps[i];
+				dst = &user_mem_maps->maps[cur_idx++];
 
 				if (src != dst) {
 					memcpy(dst, src, sizeof(*src));
@@ -279,41 +289,16 @@ compact_user_maps(void)
 				}
 			}
 		}
-		user_mem_maps.n_maps = cur_idx;
+		user_mem_maps->n_maps = cur_idx;
 	}
 }
 
-int
-rte_vfio_get_group_fd(int iommu_group_num)
+static int
+vfio_open_group_fd(int iommu_group_num)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_num == iommu_group_num)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_num == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
 	/* if primary, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
@@ -344,9 +329,6 @@ rte_vfio_get_group_fd(int iommu_group_num)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_num = iommu_group_num;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
 	/* if we're in a secondary process, request group fd from the primary
@@ -381,9 +363,6 @@ rte_vfio_get_group_fd(int iommu_group_num)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_num = iommu_group_num;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -393,56 +372,177 @@ rte_vfio_get_group_fd(int iommu_group_num)
 			return -1;
 		}
 	}
-	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_num(int iommu_group_num)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_num ==
+					iommu_group_num)
+				return vfio_cfg;
+		}
+	}
 
-static int
-get_vfio_group_idx(int vfio_group_fd)
+	return NULL;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return NULL;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_container_fd(int container_fd)
+{
+	int i;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i].vfio_container_fd == container_fd)
+			return &vfio_cfgs[i];
+	}
+
+	return NULL;
+}
+
+int
+rte_vfio_get_group_fd(int iommu_group_num)
 {
 	int i;
+	int vfio_group_fd;
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_num(iommu_group_num);
+	vfio_cfg = vfio_cfg ? vfio_cfg : default_vfio_cfg;
+
+	/* check if we already have the group descriptor open */
 	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
-			return i;
+		if (vfio_cfg->vfio_groups[i].group_num == iommu_group_num)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_num == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_num);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_num);
+		return -1;
+	}
+
+	cur_grp->group_num = iommu_group_num;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = &vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return;
+	}
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return;
+	}
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return -1;
+	}
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 static void
@@ -458,9 +558,11 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len)
 	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
 		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(vfio_va, vfio_va, len, 1);
+			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
+					len, 1);
 		else
-			vfio_dma_mem_map(vfio_va, vfio_va, len, 0);
+			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
+					len, 0);
 		return;
 	}
 
@@ -468,9 +570,11 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len)
 	ms = rte_mem_virt2memseg(addr, msl);
 	while (cur_len < len) {
 		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(ms->addr_64, ms->iova, ms->len, 1);
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
 		else
-			vfio_dma_mem_map(ms->addr_64, ms->iova, ms->len, 0);
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
 
 		cur_len += ms->len;
 		++ms;
@@ -482,16 +586,23 @@ rte_vfio_clear_group(int vfio_group_fd)
 {
 	int i;
 	int socket_fd, ret;
+	struct vfio_config *vfio_cfg;
+
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid group fd!\n");
+		return -1;
+	}
 
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
 		if (i < 0)
 			return -1;
-		vfio_cfg.vfio_groups[i].group_num = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		vfio_cfg->vfio_groups[i].group_num = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -544,6 +655,9 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_num;
 	int i, ret;
@@ -592,12 +706,18 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_num(iommu_group_num);
+	vfio_cfg = vfio_cfg ? vfio_cfg : default_vfio_cfg;
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+	user_mem_maps = &vfio_cfg->mem_maps;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
 
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -615,12 +735,12 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * functionality.
 		 */
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1 &&
+				vfio_cfg->vfio_active_groups == 1 &&
 				vfio_group_device_count(vfio_group_fd) == 0) {
 			const struct vfio_iommu_type *t;
 
 			/* select an IOMMU type which we will be using */
-			t = vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+			t = vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -633,7 +753,10 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 			 * after registering callback, to prevent races
 			 */
 			rte_rwlock_read_lock(mem_lock);
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			if (vfio_cfg == default_vfio_cfg)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -644,22 +767,22 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				return -1;
 			}
 
-			vfio_cfg.vfio_iommu_type = t;
+			vfio_cfg->vfio_iommu_type = t;
 
 			/* re-map all user-mapped segments */
-			rte_spinlock_recursive_lock(&user_mem_maps.lock);
+			rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 			/* this IOMMU type may not support DMA mapping, but
 			 * if we have mappings in the list - that means we have
 			 * previously mapped something successfully, so we can
 			 * be sure that DMA mapping is supported.
 			 */
-			for (i = 0; i < user_mem_maps.n_maps; i++) {
+			for (i = 0; i < user_mem_maps->n_maps; i++) {
 				struct user_mem_map *map;
-				map = &user_mem_maps.maps[i];
+				map = &user_mem_maps->maps[i];
 
 				ret = t->dma_user_map_func(
-						vfio_cfg.vfio_container_fd,
+						vfio_container_fd,
 						map->addr, map->iova, map->len,
 						1);
 				if (ret) {
@@ -670,17 +793,20 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 							map->addr, map->iova,
 							map->len);
 					rte_spinlock_recursive_unlock(
-							&user_mem_maps.lock);
+							&user_mem_maps->lock);
 					rte_rwlock_read_unlock(mem_lock);
 					return -1;
 				}
 			}
-			rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+			rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 
 			/* register callback for mem events */
-			ret = rte_mem_event_callback_register(
+			if (vfio_cfg == default_vfio_cfg)
+				ret = rte_mem_event_callback_register(
 					VFIO_MEM_EVENT_CLB_NAME,
 					vfio_mem_event_callback);
+			else
+				ret = 0;
 			/* unlock memory hotplug */
 			rte_rwlock_read_unlock(mem_lock);
 
@@ -734,6 +860,7 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
 	int vfio_group_fd;
 	int iommu_group_num;
 	int ret;
@@ -763,6 +890,10 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 		goto out;
 	}
 
+	/* get the vfio_config it belongs to */
+	vfio_cfg = get_vfio_cfg_by_group_num(iommu_group_num);
+	vfio_cfg = vfio_cfg ? vfio_cfg : default_vfio_cfg;
+
 	/* At this point we got an active group. Closing it will make the
 	 * container detachment. If this is the last active group, VFIO kernel
 	 * code will unset the container and the IOMMU mappings.
@@ -800,7 +931,7 @@ rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
 	/* if there are no active device groups, unregister the callback to
 	 * avoid spurious attempts to map/unmap memory from VFIO.
 	 */
-	if (vfio_cfg.vfio_active_groups == 0)
+	if (vfio_cfg == default_vfio_cfg && vfio_cfg->vfio_active_groups == 0)
 		rte_mem_event_callback_unregister(VFIO_MEM_EVENT_CLB_NAME);
 
 	/* success */
@@ -815,13 +946,22 @@ int
 rte_vfio_enable(const char *modname)
 {
 	/* initialize group list */
-	int i;
+	int i, j;
 	int vfio_available;
 
-	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_num = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+	rte_spinlock_recursive_t lock = RTE_SPINLOCK_RECURSIVE_INITIALIZER;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfgs[i].vfio_container_fd = -1;
+		vfio_cfgs[i].vfio_active_groups = 0;
+		vfio_cfgs[i].vfio_iommu_type = NULL;
+		vfio_cfgs[i].mem_maps.lock = lock;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			vfio_cfgs[i].vfio_groups[j].fd = -1;
+			vfio_cfgs[i].vfio_groups[j].group_num = -1;
+			vfio_cfgs[i].vfio_groups[j].devices = 0;
+		}
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -843,12 +983,12 @@ rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = rte_vfio_get_container_fd();
+	default_vfio_cfg->vfio_container_fd = rte_vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg->vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg->vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -860,7 +1000,7 @@ int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg->vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -1222,9 +1362,18 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	struct vfio_iommu_spapr_tce_create create = {
 		.argsz = sizeof(create),
 	};
+	struct vfio_config *vfio_cfg;
+	struct user_mem_maps *user_mem_maps;
 	int i, ret = 0;
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
+	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
+		return -1;
+	}
+
+	user_mem_maps = &vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* check if window size needs to be adjusted */
 	memset(&param, 0, sizeof(param));
@@ -1237,9 +1386,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	}
 
 	/* also check user maps */
-	for (i = 0; i < user_mem_maps.n_maps; i++) {
-		uint64_t max = user_mem_maps.maps[i].iova +
-				user_mem_maps.maps[i].len;
+	for (i = 0; i < user_mem_maps->n_maps; i++) {
+		uint64_t max = user_mem_maps->maps[i].iova +
+				user_mem_maps->maps[i].len;
 		create.window_size = RTE_MAX(create.window_size, max);
 	}
 
@@ -1265,9 +1414,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				goto out;
 			}
 			/* remap all user maps */
-			for (i = 0; i < user_mem_maps.n_maps; i++) {
+			for (i = 0; i < user_mem_maps->n_maps; i++) {
 				struct user_mem_map *map =
-						&user_mem_maps.maps[i];
+						&user_mem_maps->maps[i];
 				if (vfio_spapr_dma_do_map(vfio_container_fd,
 						map->addr, map->iova, map->len,
 						1)) {
@@ -1308,7 +1457,7 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
@@ -1360,9 +1509,10 @@ vfio_noiommu_dma_mem_map(int __rte_unused vfio_container_fd,
 }
 
 static int
-vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
+vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len, int do_map)
 {
-	const struct vfio_iommu_type *t = vfio_cfg.vfio_iommu_type;
+	const struct vfio_iommu_type *t = vfio_cfg->vfio_iommu_type;
 
 	if (!t) {
 		RTE_LOG(ERR, EAL, "  VFIO support not initialized\n");
@@ -1378,7 +1528,7 @@ vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
 		return -1;
 	}
 
-	return t->dma_user_map_func(vfio_cfg.vfio_container_fd, vaddr, iova,
+	return t->dma_user_map_func(vfio_cfg->vfio_container_fd, vaddr, iova,
 			len, do_map);
 }
 
@@ -1386,6 +1536,7 @@ int __rte_experimental
 rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 {
 	struct user_mem_map *new_map;
+	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
 	if (len == 0) {
@@ -1393,15 +1544,16 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		return -1;
 	}
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
-	if (user_mem_maps.n_maps == VFIO_MAX_USER_MEM_MAPS) {
+	user_mem_maps = &default_vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
+	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
 		rte_errno = ENOMEM;
 		ret = -1;
 		goto out;
 	}
 	/* map the entry */
-	if (vfio_dma_mem_map(vaddr, iova, len, 1)) {
+	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 1)) {
 		/* technically, this will fail if there are currently no devices
 		 * plugged in, even if a device were added later, this mapping
 		 * might have succeeded. however, since we cannot verify if this
@@ -1414,14 +1566,14 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		goto out;
 	}
 	/* create new user mem map entry */
-	new_map = &user_mem_maps.maps[user_mem_maps.n_maps++];
+	new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
 	new_map->addr = vaddr;
 	new_map->iova = iova;
 	new_map->len = len;
 
-	compact_user_maps();
+	compact_user_maps(user_mem_maps);
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
@@ -1429,6 +1581,7 @@ int __rte_experimental
 rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 {
 	struct user_mem_map *map, *new_map = NULL;
+	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
 	if (len == 0) {
@@ -1436,10 +1589,11 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 		return -1;
 	}
 
-	rte_spinlock_recursive_lock(&user_mem_maps.lock);
+	user_mem_maps = &default_vfio_cfg->mem_maps;
+	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* find our mapping */
-	map = find_user_mem_map(vaddr, iova, len);
+	map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
 	if (!map) {
 		RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
 		rte_errno = EINVAL;
@@ -1450,17 +1604,17 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
-		if (user_mem_maps.n_maps == VFIO_MAX_USER_MEM_MAPS) {
+		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
 			ret = -1;
 			goto out;
 		}
-		new_map = &user_mem_maps.maps[user_mem_maps.n_maps++];
+		new_map = &user_mem_maps->maps[user_mem_maps->n_maps++];
 	}
 
 	/* unmap the entry */
-	if (vfio_dma_mem_map(vaddr, iova, len, 0)) {
+	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 0)) {
 		/* there may not be any devices plugged in, so unmapping will
 		 * fail with ENODEV/ENOTSUP rte_errno values, but that doesn't
 		 * stop us from removing the mapping, as the assumption is we
@@ -1483,19 +1637,19 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 
 		/* if we've created a new map by splitting, sort everything */
 		if (!is_null_map(new_map)) {
-			compact_user_maps();
+			compact_user_maps(user_mem_maps);
 		} else {
 			/* we've created a new mapping, but it was unused */
-			user_mem_maps.n_maps--;
+			user_mem_maps->n_maps--;
 		}
 	} else {
 		memset(map, 0, sizeof(*map));
-		compact_user_maps();
-		user_mem_maps.n_maps--;
+		compact_user_maps(user_mem_maps);
+		user_mem_maps->n_maps--;
 	}
 
 out:
-	rte_spinlock_recursive_unlock(&user_mem_maps.lock);
+	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
 	return ret;
 }
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index c788bba44..18f85fb4f 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -82,6 +82,7 @@ struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
@@ -102,14 +103,6 @@ struct vfio_group {
 	int devices;
 };
 
-struct vfio_config {
-	int vfio_enabled;
-	int vfio_container_fd;
-	int vfio_active_groups;
-	const struct vfio_iommu_type *vfio_iommu_type;
-	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
-};
-
 /* DMA mapping function prototype.
  * Takes VFIO container fd as a parameter.
  * Returns 0 on success, -1 on error.
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v9 2/5] vfio: add multi container support
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 1/5] vfio: extend data structure for multi container Xiao Wang
@ 2018-04-17  7:06                                 ` Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
                                                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-17  7:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas, Xiao Wang,
	Junjie Chen

This patch adds APIs to support container create/destroy and device
bind/unbind with a container. It also provides APIs for IOMMU programming
on a specified container.

A driver can use the "rte_vfio_container_create" helper to create a new
container from EAL, and "rte_vfio_container_group_bind" to bind a device
to the newly created container. During rte_vfio_setup_device, the container
bound with the device will be used for IOMMU setup.
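
As a minimal, illustrative sketch (not part of this patch; the device
name, addresses, includes and error handling are placeholders), a driver
could combine these APIs roughly as follows:

    #include <rte_vfio.h>
    #include <rte_bus_pci.h>

    /* Hypothetical helper: create a per-device container, bind the
     * device's IOMMU group to it and program one DMA mapping.
     */
    static int
    example_setup_container(const char *devname, uint64_t vaddr,
                            uint64_t iova, uint64_t len)
    {
        int container_fd, group_fd, iommu_group_num;

        if (rte_vfio_get_group_num(rte_pci_get_sysfs_path(), devname,
                        &iommu_group_num) <= 0)
            return -1;

        container_fd = rte_vfio_container_create();
        if (container_fd < 0)
            return -1;

        group_fd = rte_vfio_container_group_bind(container_fd,
                        iommu_group_num);
        if (group_fd < 0)
            goto err;

        /* DMA mappings done here apply only to this container,
         * not to the default one.
         */
        if (rte_vfio_container_dma_map(container_fd, vaddr, iova, len) < 0)
            goto err;

        return container_fd;

    err:
        rte_vfio_container_destroy(container_fd);
        return -1;
    }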

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal.c          |  44 +++++
 lib/librte_eal/common/include/rte_vfio.h | 128 ++++++++++++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 269 ++++++++++++++++++++++++++++---
 lib/librte_eal/rte_eal_version.map       |   6 +
 4 files changed, 428 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index d996190fe..719789ba2 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -825,3 +825,47 @@ rte_vfio_get_group_fd(__rte_unused int iommu_group_num)
 {
 	return -1;
 }
+
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_bind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_unbind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(__rte_unused int container_fd,
+			__rte_unused uint64_t vaddr,
+			__rte_unused uint64_t iova,
+			__rte_unused uint64_t len)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(__rte_unused int container_fd,
+			__rte_unused uint64_t vaddr,
+			__rte_unused uint64_t iova,
+			__rte_unused uint64_t len)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index 890006484..f90972faa 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -161,7 +161,10 @@ rte_vfio_clear_group(int vfio_group_fd);
 /**
  * Map memory region for use with VFIO.
  *
- * @note requires at least one device to be attached at the time of mapping.
+ * @note Requires at least one device to be attached at the time of
+ *       mapping. DMA maps done via this API will only apply to the default
+ *       container and will not apply to any of the containers created
+ *       via rte_vfio_container_create().
  *
  * @param vaddr
  *   Starting virtual address of memory to be mapped.
@@ -252,6 +255,129 @@ rte_vfio_get_container_fd(void);
 int __rte_experimental
 rte_vfio_get_group_fd(int iommu_group_num);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container for device binding.
+ *
+ * @note Any newly allocated DPDK memory will not be mapped into these
+ *       containers by default; the user needs to manage DMA mappings for
+ *       any container created by this API.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_create(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_destroy(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind an IOMMU group to a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_num
+ *   the iommu group number to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_group_bind(int container_fd, int iommu_group_num);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind an IOMMU group from a container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ *
+ * @param iommu_group_num
+ *   the iommu group number to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_group_unbind(int container_fd, int iommu_group_num);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA mapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param vaddr
+ *   Starting virtual address of memory to be mapped.
+ *
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ *
+ * @param len
+ *   Length of memory segment being mapped.
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA unmapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param vaddr
+ *   Starting virtual address of memory to be unmapped.
+ *
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ *
+ * @param len
+ *   Length of memory segment being unmapped.
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 974dcbe6d..8bc0381c7 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1532,19 +1532,15 @@ vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 			len, do_map);
 }
 
-int __rte_experimental
-rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
+static int
+container_dma_map(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
 {
 	struct user_mem_map *new_map;
 	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
-	if (len == 0) {
-		rte_errno = EINVAL;
-		return -1;
-	}
-
-	user_mem_maps = &default_vfio_cfg->mem_maps;
+	user_mem_maps = &vfio_cfg->mem_maps;
 	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 	if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 		RTE_LOG(ERR, EAL, "No more space for user mem maps\n");
@@ -1553,7 +1549,7 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 		goto out;
 	}
 	/* map the entry */
-	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 1)) {
+	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 1)) {
 		/* technically, this will fail if there are currently no devices
 		 * plugged in, even if a device were added later, this mapping
 		 * might have succeeded. however, since we cannot verify if this
@@ -1577,19 +1573,15 @@ rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
 	return ret;
 }
 
-int __rte_experimental
-rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
+static int
+container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
 {
 	struct user_mem_map *map, *new_map = NULL;
 	struct user_mem_maps *user_mem_maps;
 	int ret = 0;
 
-	if (len == 0) {
-		rte_errno = EINVAL;
-		return -1;
-	}
-
-	user_mem_maps = &default_vfio_cfg->mem_maps;
+	user_mem_maps = &vfio_cfg->mem_maps;
 	rte_spinlock_recursive_lock(&user_mem_maps->lock);
 
 	/* find our mapping */
@@ -1614,7 +1606,7 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 	}
 
 	/* unmap the entry */
-	if (vfio_dma_mem_map(default_vfio_cfg, vaddr, iova, len, 0)) {
+	if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 0)) {
 		/* there may not be any devices plugged in, so unmapping will
 		 * fail with ENODEV/ENOTSUP rte_errno values, but that doesn't
 		 * stop us from removing the mapping, as the assumption is we
@@ -1653,6 +1645,28 @@ rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
 	return ret;
 }
 
+int __rte_experimental
+rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
+{
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return container_dma_map(default_vfio_cfg, vaddr, iova, len);
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
+{
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return container_dma_unmap(default_vfio_cfg, vaddr, iova, len);
+}
+
 int
 rte_vfio_noiommu_is_enabled(void)
 {
@@ -1685,6 +1699,181 @@ rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i].vfio_container_fd == -1)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i].vfio_container_fd = rte_vfio_get_container_fd();
+	if (vfio_cfgs[i].vfio_container_fd < 0) {
+		RTE_LOG(NOTICE, EAL, "fail to create a new container\n");
+		return -1;
+	}
+
+	return vfio_cfgs[i].vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_num != -1)
+			rte_vfio_container_group_unbind(container_fd,
+				vfio_cfg->vfio_groups[i].group_num);
+
+	close(container_fd);
+	vfio_cfg->vfio_container_fd = -1;
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_iommu_type = NULL;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_container_group_bind(int container_fd, int iommu_group_num)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	/* Check room for new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_num == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_num);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_num);
+		return -1;
+	}
+	cur_grp->group_num = iommu_group_num;
+	cur_grp->fd = vfio_group_fd;
+	cur_grp->devices = 0;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_container_group_unbind(int container_fd, int iommu_group_num)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (vfio_cfg->vfio_groups[i].group_num == iommu_group_num) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+			" iommu_group_num %d\n", iommu_group_num);
+		return -1;
+	}
+	cur_grp->group_num = -1;
+	cur_grp->fd = -1;
+	cur_grp->devices = 0;
+	vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct vfio_config *vfio_cfg;
+
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	return container_dma_map(vfio_cfg, vaddr, iova, len);
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct vfio_config *vfio_cfg;
+
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
+	if (vfio_cfg == NULL) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	return container_dma_unmap(vfio_cfg, vaddr, iova, len);
+}
+
 #else
 
 int __rte_experimental
@@ -1761,4 +1950,48 @@ rte_vfio_get_group_fd(__rte_unused int iommu_group_num)
 	return -1;
 }
 
+int __rte_experimental
+rte_vfio_container_create(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_destroy(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_bind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_group_unbind(__rte_unused int container_fd,
+		__rte_unused int iommu_group_num)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_map(__rte_unused int container_fd,
+		__rte_unused uint64_t vaddr,
+		__rte_unused uint64_t iova,
+		__rte_unused uint64_t len)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_container_dma_unmap(__rte_unused int container_fd,
+		__rte_unused uint64_t vaddr,
+		__rte_unused uint64_t iova,
+		__rte_unused uint64_t len)
+{
+	return -1;
+}
+
 #endif /* VFIO_PRESENT */
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d02d80b8a..28f51f8d2 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -293,5 +293,11 @@ EXPERIMENTAL {
 	rte_vfio_get_container_fd;
 	rte_vfio_get_group_fd;
 	rte_vfio_get_group_num;
+	rte_vfio_container_create;
+	rte_vfio_container_destroy;
+	rte_vfio_container_dma_map;
+	rte_vfio_container_dma_unmap;
+	rte_vfio_container_group_bind;
+	rte_vfio_container_group_unbind;
 
 } DPDK_18.02;
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v9 3/5] net/virtio: skip device probe in vdpa mode
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 1/5] vfio: extend data structure for multi container Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 2/5] vfio: add multi container support Xiao Wang
@ 2018-04-17  7:06                                 ` Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
                                                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-17  7:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas, Xiao Wang

If we want a virtio device to work in vDPA (vhost data path acceleration)
mode, we can add a "vdpa=1" devarg for this device to specify the mode.

This patch lets the virtio PMD skip device probe when it detects this parameter.
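
For example (the application name and PCI address below are only
placeholders), the devarg can be passed through the EAL PCI whitelist
option:

    ./your_app -l 0-1 -n 4 -w 0000:06:00.3,vdpa=1 -- ...

With "vdpa=1" the virtio PMD returns early from probe, leaving the
device free to be taken over by a vDPA driver.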

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/virtio.rst         | 13 ++++++++++++
 drivers/net/virtio/virtio_ethdev.c | 43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index ca09cd203..8922f9c0b 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -318,3 +318,16 @@ Here we use l3fwd-power as an example to show how to get started.
 
         $ l3fwd-power -l 0-1 -- -p 1 -P --config="(0,0,1)" \
                                                --no-numa --parse-ptype
+
+
+Virtio PMD arguments
+--------------------
+
+The user can specify the following argument in devargs.
+
+#.  ``vdpa``:
+
+    A virtio device can also be driven by a vDPA (vhost data path acceleration)
+    driver and work as a HW vhost backend. This argument is used to specify
+    that a virtio device needs to work in vDPA mode.
+    (Default: 0 (disabled))
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 41042cb23..5833dad73 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -28,6 +28,7 @@
 #include <rte_eal.h>
 #include <rte_dev.h>
 #include <rte_cycles.h>
+#include <rte_kvargs.h>
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -1713,9 +1714,51 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	return 0;
 }
 
+static int vdpa_check_handler(__rte_unused const char *key,
+		const char *value, __rte_unused void *opaque)
+{
+	if (strcmp(value, "1"))
+		return -1;
+
+	return 0;
+}
+
+static int
+vdpa_mode_selected(struct rte_devargs *devargs)
+{
+	struct rte_kvargs *kvlist;
+	const char *key = "vdpa";
+	int ret = 0;
+
+	if (devargs == NULL)
+		return 0;
+
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (kvlist == NULL)
+		return 0;
+
+	if (!rte_kvargs_count(kvlist, key))
+		goto exit;
+
+	/* vdpa mode selected when there's a key-value pair: vdpa=1 */
+	if (rte_kvargs_process(kvlist, key,
+				vdpa_check_handler, NULL) < 0) {
+		goto exit;
+	}
+	ret = 1;
+
+exit:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	struct rte_pci_device *pci_dev)
 {
+	/* virtio pmd skips probe if device needs to work in vdpa mode */
+	if (vdpa_mode_selected(pci_dev->device.devargs))
+		return 1;
+
 	return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
 		eth_virtio_dev_init);
 }
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v9 4/5] net/ifcvf: add ifcvf vdpa driver
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
                                                   ` (2 preceding siblings ...)
  2018-04-17  7:06                                 ` [PATCH v9 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
@ 2018-04-17  7:06                                 ` Xiao Wang
  2018-04-17  7:06                                 ` [PATCH v9 5/5] doc: add ifcvf driver document and release note Xiao Wang
  2018-04-17 11:13                                 ` [PATCH v9 0/5] add ifcvf vdpa driver Ferruh Yigit
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-17  7:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas, Xiao Wang,
	Rosen Xu

The IFCVF vDPA (vhost data path acceleration) driver provides support for
the Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible;
it works as a HW vhost backend which can send/receive packets to/from
virtio directly via DMA.

Different VF devices serve different virtio frontends, which are in
different VMs, so each VF needs its own DMA address translation
service. During the driver probe a new container is created; with this
container the vDPA driver can program the DMA remapping table with the VM's
memory region information.

Key vDPA driver ops implemented:

- ifcvf_dev_config:
  Enable the VF data path with virtio information provided by the vhost
  lib, including IOMMU programming to enable VF DMA to the VM's memory,
  VFIO interrupt setup to route HW interrupts to the virtio driver,
  creation of a notify relay thread to translate the virtio driver's kick
  into an MMIO write onto HW, and HW queue configuration.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

The live migration feature is supported by IFCVF and this driver enables
it. For dirty page logging, the VF logs writes to packet buffers, and the
driver marks the used ring pages as dirty when the device stops.

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is currently supported.
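
For instance (the PCI address below is only a placeholder), the VF is
expected to be bound to vfio-pci before this driver can probe it:

    modprobe vfio-pci
    usertools/dpdk-devbind.py --bind=vfio-pci 0000:06:00.3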

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Signed-off-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                    |   7 +
 config/common_linuxapp                |   1 +
 drivers/net/Makefile                  |   3 +
 drivers/net/ifc/Makefile              |  35 ++
 drivers/net/ifc/base/ifcvf.c          | 329 +++++++++++++
 drivers/net/ifc/base/ifcvf.h          | 160 +++++++
 drivers/net/ifc/base/ifcvf_osdep.h    |  52 +++
 drivers/net/ifc/ifcvf_vdpa.c          | 846 ++++++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map |   4 +
 mk/rte.app.mk                         |   3 +
 10 files changed, 1440 insertions(+)
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map

diff --git a/config/common_base b/config/common_base
index 9b9f79ff8..af3a706af 100644
--- a/config/common_base
+++ b/config/common_base
@@ -805,6 +805,13 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile IFCVF driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST and CONFIG_RTE_EAL_VFIO
+# should be enabled.
+#
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index d0437e5d6..14e56cb4d 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index dc5047e04..9f9da6651 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -58,6 +58,9 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+DIRS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifc
+endif
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
diff --git a/drivers/net/ifc/Makefile b/drivers/net/ifc/Makefile
new file mode 100644
index 000000000..1011995bc
--- /dev/null
+++ b/drivers/net/ifc/Makefile
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_ifcvf_vdpa.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_pci -lrte_vhost -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+#
+# Add extra flags for base driver source files to disable warnings in them
+#
+BASE_DRIVER_OBJS=$(sort $(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c))))
+
+VPATH += $(SRCDIR)/base
+
+EXPORT_MAP := rte_ifcvf_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf_vdpa.c
+SRCS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += ifcvf.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/ifc/base/ifcvf.c b/drivers/net/ifc/base/ifcvf.c
new file mode 100644
index 000000000..d312ad99f
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include "ifcvf.h"
+#include "ifcvf_osdep.h"
+
+STATIC void *
+get_cap_addr(struct ifcvf_hw *hw, struct ifcvf_pci_cap *cap)
+{
+	u8 bar = cap->bar;
+	u32 length = cap->length;
+	u32 offset = cap->offset;
+
+	if (bar > IFCVF_PCI_MAX_RESOURCE - 1) {
+		DEBUGOUT("invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		DEBUGOUT("offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > hw->mem_resource[cap->bar].len) {
+		DEBUGOUT("offset(%u) + length(%u) overflows bar length(%u)",
+			offset, length, (u32)hw->mem_resource[cap->bar].len);
+		return NULL;
+	}
+
+	return hw->mem_resource[bar].addr + offset;
+}
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev)
+{
+	int ret;
+	u8 pos;
+	struct ifcvf_pci_cap cap;
+
+	ret = PCI_READ_CONFIG_BYTE(dev, &pos, PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		DEBUGOUT("failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = PCI_READ_CONFIG_RANGE(dev, (u32 *)&cap,
+				sizeof(cap), pos);
+		if (ret < 0) {
+			DEBUGOUT("failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		DEBUGOUT("cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case IFCVF_PCI_CAP_COMMON_CFG:
+			hw->common_cfg = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_NOTIFY_CFG:
+			PCI_READ_CONFIG_DWORD(dev, &hw->notify_off_multiplier,
+					pos + sizeof(cap));
+			hw->notify_base = get_cap_addr(hw, &cap);
+			hw->notify_region = cap.bar;
+			break;
+		case IFCVF_PCI_CAP_ISR_CFG:
+			hw->isr = get_cap_addr(hw, &cap);
+			break;
+		case IFCVF_PCI_CAP_DEVICE_CFG:
+			hw->dev_cfg = get_cap_addr(hw, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	hw->lm_cfg = hw->mem_resource[4].addr;
+
+	if (hw->common_cfg == NULL || hw->notify_base == NULL ||
+			hw->isr == NULL || hw->dev_cfg == NULL) {
+		DEBUGOUT("capability incomplete\n");
+		return -1;
+	}
+
+	DEBUGOUT("capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			hw->common_cfg, hw->dev_cfg,
+			hw->isr, hw->notify_base,
+			hw->notify_off_multiplier);
+
+	return 0;
+}
+
+STATIC u8
+ifcvf_get_status(struct ifcvf_hw *hw)
+{
+	return IFCVF_READ_REG8(&hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_set_status(struct ifcvf_hw *hw, u8 status)
+{
+	IFCVF_WRITE_REG8(status, &hw->common_cfg->device_status);
+}
+
+STATIC void
+ifcvf_reset(struct ifcvf_hw *hw)
+{
+	ifcvf_set_status(hw, 0);
+
+	/* flush status write */
+	while (ifcvf_get_status(hw))
+		msec_delay(1);
+}
+
+STATIC void
+ifcvf_add_status(struct ifcvf_hw *hw, u8 status)
+{
+	if (status != 0)
+		status |= ifcvf_get_status(hw);
+
+	ifcvf_set_status(hw, status);
+	ifcvf_get_status(hw);
+}
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw)
+{
+	u32 features_lo, features_hi;
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->device_feature_select);
+	features_lo = IFCVF_READ_REG32(&cfg->device_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->device_feature_select);
+	features_hi = IFCVF_READ_REG32(&cfg->device_feature);
+
+	return ((u64)features_hi << 32) | features_lo;
+}
+
+STATIC void
+ifcvf_set_features(struct ifcvf_hw *hw, u64 features)
+{
+	struct ifcvf_pci_common_cfg *cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG32(0, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	IFCVF_WRITE_REG32(1, &cfg->guest_feature_select);
+	IFCVF_WRITE_REG32(features >> 32, &cfg->guest_feature);
+}
+
+STATIC int
+ifcvf_config_features(struct ifcvf_hw *hw)
+{
+	u64 host_features;
+
+	host_features = ifcvf_get_features(hw);
+	hw->req_features &= host_features;
+
+	ifcvf_set_features(hw, hw->req_features);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_FEATURES_OK);
+
+	if (!(ifcvf_get_status(hw) & IFCVF_CONFIG_STATUS_FEATURES_OK)) {
+		DEBUGOUT("failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+STATIC void
+io_write64_twopart(u64 val, u32 *lo, u32 *hi)
+{
+	IFCVF_WRITE_REG32(val & ((1ULL << 32) - 1), lo);
+	IFCVF_WRITE_REG32(val >> 32, hi);
+}
+
+STATIC int
+ifcvf_hw_enable(struct ifcvf_hw *hw)
+{
+	struct ifcvf_pci_common_cfg *cfg;
+	u8 *lm_cfg;
+	u32 i;
+	u16 notify_off;
+
+	cfg = hw->common_cfg;
+	lm_cfg = hw->lm_cfg;
+
+	IFCVF_WRITE_REG16(0, &cfg->msix_config);
+	if (IFCVF_READ_REG16(&cfg->msix_config) == IFCVF_MSI_NO_VECTOR) {
+		DEBUGOUT("msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		io_write64_twopart(hw->vring[i].desc, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(hw->vring[i].avail, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(hw->vring[i].used, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size);
+
+		*(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4) =
+			(u32)hw->vring[i].last_avail_idx |
+			((u32)hw->vring[i].last_used_idx << 16);
+
+		IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector);
+		if (IFCVF_READ_REG16(&cfg->queue_msix_vector) ==
+				IFCVF_MSI_NO_VECTOR) {
+			DEBUGOUT("queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = IFCVF_READ_REG16(&cfg->queue_notify_off);
+		hw->notify_addr[i] = (void *)((u8 *)hw->notify_base +
+				notify_off * hw->notify_off_multiplier);
+		IFCVF_WRITE_REG16(1, &cfg->queue_enable);
+	}
+
+	return 0;
+}
+
+STATIC void
+ifcvf_hw_disable(struct ifcvf_hw *hw)
+{
+	u32 i;
+	struct ifcvf_pci_common_cfg *cfg;
+	u32 ring_state;
+
+	cfg = hw->common_cfg;
+
+	IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < hw->nr_vring; i++) {
+		IFCVF_WRITE_REG16(i, &cfg->queue_select);
+		IFCVF_WRITE_REG16(0, &cfg->queue_enable);
+		IFCVF_WRITE_REG16(IFCVF_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+		ring_state = *(u32 *)(hw->lm_cfg + IFCVF_LM_RING_STATE_OFFSET +
+				(i / 2) * IFCVF_LM_CFG_SIZE + (i % 2) * 4);
+		hw->vring[i].last_avail_idx = (u16)ring_state;
+		hw->vring[i].last_used_idx = (u16)ring_state >> 16;
+	}
+}
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_reset(hw);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_ACK);
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER);
+
+	if (ifcvf_config_features(hw) < 0)
+		return -1;
+
+	if (ifcvf_hw_enable(hw) < 0)
+		return -1;
+
+	ifcvf_add_status(hw, IFCVF_CONFIG_STATUS_DRIVER_OK);
+	return 0;
+}
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw)
+{
+	ifcvf_hw_disable(hw);
+	ifcvf_reset(hw);
+}
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) =
+		log_base & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_HIGH) =
+		(log_base >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_LOW) =
+		(log_base + log_size) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_END_ADDR_HIGH) =
+		((log_base + log_size) >> 32) & IFCVF_32_BIT_MASK;
+
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_ENABLE_PF;
+}
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw)
+{
+	u8 *lm_cfg;
+
+	lm_cfg = hw->lm_cfg;
+	*(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE;
+}
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid)
+{
+	IFCVF_WRITE_REG16(qid, hw->notify_addr[qid]);
+}
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw)
+{
+	return hw->notify_region;
+}
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid)
+{
+	return (u8 *)hw->notify_addr[qid] -
+		(u8 *)hw->mem_resource[hw->notify_region].addr;
+}
diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
new file mode 100644
index 000000000..77a2bfa83
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_H_
+#define _IFCVF_H_
+
+#include "ifcvf_osdep.h"
+
+#define IFCVF_VENDOR_ID		0x1AF4
+#define IFCVF_DEVICE_ID		0x1041
+#define IFCVF_SUBSYS_VENDOR_ID	0x8086
+#define IFCVF_SUBSYS_DEVICE_ID	0x001A
+
+#define IFCVF_MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* Common configuration */
+#define IFCVF_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define IFCVF_PCI_CAP_NOTIFY_CFG	2
+/* ISR Status */
+#define IFCVF_PCI_CAP_ISR_CFG		3
+/* Device specific configuration */
+#define IFCVF_PCI_CAP_DEVICE_CFG	4
+/* PCI configuration access */
+#define IFCVF_PCI_CAP_PCI_CFG		5
+
+#define IFCVF_CONFIG_STATUS_RESET     0x00
+#define IFCVF_CONFIG_STATUS_ACK       0x01
+#define IFCVF_CONFIG_STATUS_DRIVER    0x02
+#define IFCVF_CONFIG_STATUS_DRIVER_OK 0x04
+#define IFCVF_CONFIG_STATUS_FEATURES_OK 0x08
+#define IFCVF_CONFIG_STATUS_FAILED    0x80
+
+#define IFCVF_MSI_NO_VECTOR	0xffff
+#define IFCVF_PCI_MAX_RESOURCE	6
+
+#define IFCVF_LM_CFG_SIZE		0x40
+#define IFCVF_LM_RING_STATE_OFFSET	0x20
+
+#define IFCVF_LM_LOGGING_CTRL		0x0
+
+#define IFCVF_LM_BASE_ADDR_LOW		0x10
+#define IFCVF_LM_BASE_ADDR_HIGH		0x14
+#define IFCVF_LM_END_ADDR_LOW		0x18
+#define IFCVF_LM_END_ADDR_HIGH		0x1c
+
+#define IFCVF_LM_DISABLE		0x0
+#define IFCVF_LM_ENABLE_VF		0x1
+#define IFCVF_LM_ENABLE_PF		0x3
+
+#define IFCVF_32_BIT_MASK		0xffffffff
+
+
+struct ifcvf_pci_cap {
+	u8 cap_vndr;            /* Generic PCI field: PCI_CAP_ID_VNDR */
+	u8 cap_next;            /* Generic PCI field: next ptr. */
+	u8 cap_len;             /* Generic PCI field: capability length */
+	u8 cfg_type;            /* Identifies the structure. */
+	u8 bar;                 /* Where to find it. */
+	u8 padding[3];          /* Pad to full dword. */
+	u32 offset;             /* Offset within bar. */
+	u32 length;             /* Length of the structure, in bytes. */
+};
+
+struct ifcvf_pci_notify_cap {
+	struct ifcvf_pci_cap cap;
+	u32 notify_off_multiplier;  /* Multiplier for queue_notify_off. */
+};
+
+struct ifcvf_pci_common_cfg {
+	/* About the whole device. */
+	u32 device_feature_select;
+	u32 device_feature;
+	u32 guest_feature_select;
+	u32 guest_feature;
+	u16 msix_config;
+	u16 num_queues;
+	u8 device_status;
+	u8 config_generation;
+
+	/* About a specific virtqueue. */
+	u16 queue_select;
+	u16 queue_size;
+	u16 queue_msix_vector;
+	u16 queue_enable;
+	u16 queue_notify_off;
+	u32 queue_desc_lo;
+	u32 queue_desc_hi;
+	u32 queue_avail_lo;
+	u32 queue_avail_hi;
+	u32 queue_used_lo;
+	u32 queue_used_hi;
+};
+
+struct ifcvf_net_config {
+	u8    mac[6];
+	u16   status;
+	u16   max_virtqueue_pairs;
+} __attribute__((packed));
+
+struct ifcvf_pci_mem_resource {
+	u64      phys_addr; /**< Physical address, 0 if not resource. */
+	u64      len;       /**< Length of the resource. */
+	u8       *addr;     /**< Virtual address, NULL when not mapped. */
+};
+
+struct vring_info {
+	u64 desc;
+	u64 avail;
+	u64 used;
+	u16 size;
+	u16 last_avail_idx;
+	u16 last_used_idx;
+};
+
+struct ifcvf_hw {
+	u64    req_features;
+	u8     notify_region;
+	u32    notify_off_multiplier;
+	struct ifcvf_pci_common_cfg *common_cfg;
+	struct ifcvf_net_device_config *dev_cfg;
+	u8     *isr;
+	u16    *notify_base;
+	u16    *notify_addr[IFCVF_MAX_QUEUES * 2];
+	u8     *lm_cfg;
+	struct vring_info vring[IFCVF_MAX_QUEUES * 2];
+	u8 nr_vring;
+	struct ifcvf_pci_mem_resource mem_resource[IFCVF_PCI_MAX_RESOURCE];
+};
+
+int
+ifcvf_init_hw(struct ifcvf_hw *hw, PCI_DEV *dev);
+
+u64
+ifcvf_get_features(struct ifcvf_hw *hw);
+
+int
+ifcvf_start_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_stop_hw(struct ifcvf_hw *hw);
+
+void
+ifcvf_enable_logging(struct ifcvf_hw *hw, u64 log_base, u64 log_size);
+
+void
+ifcvf_disable_logging(struct ifcvf_hw *hw);
+
+void
+ifcvf_notify_queue(struct ifcvf_hw *hw, u16 qid);
+
+u8
+ifcvf_get_notify_region(struct ifcvf_hw *hw);
+
+u64
+ifcvf_get_queue_notify_off(struct ifcvf_hw *hw, int qid);
+
+#endif /* _IFCVF_H_ */
diff --git a/drivers/net/ifc/base/ifcvf_osdep.h b/drivers/net/ifc/base/ifcvf_osdep.h
new file mode 100644
index 000000000..cf151ef52
--- /dev/null
+++ b/drivers/net/ifc/base/ifcvf_osdep.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _IFCVF_OSDEP_H_
+#define _IFCVF_OSDEP_H_
+
+#include <stdint.h>
+#include <linux/pci_regs.h>
+
+#include <rte_cycles.h>
+#include <rte_pci.h>
+#include <rte_bus_pci.h>
+#include <rte_log.h>
+#include <rte_io.h>
+
+#define DEBUGOUT(S, args...)    RTE_LOG(DEBUG, PMD, S, ##args)
+#define STATIC                  static
+
+#define msec_delay	rte_delay_ms
+
+#define IFCVF_READ_REG8(reg)		rte_read8(reg)
+#define IFCVF_WRITE_REG8(val, reg)	rte_write8((val), (reg))
+#define IFCVF_READ_REG16(reg)		rte_read16(reg)
+#define IFCVF_WRITE_REG16(val, reg)	rte_write16((val), (reg))
+#define IFCVF_READ_REG32(reg)		rte_read32(reg)
+#define IFCVF_WRITE_REG32(val, reg)	rte_write32((val), (reg))
+
+typedef struct rte_pci_device PCI_DEV;
+
+#define PCI_READ_CONFIG_BYTE(dev, val, where) \
+	rte_pci_read_config(dev, val, 1, where)
+
+#define PCI_READ_CONFIG_DWORD(dev, val, where) \
+	rte_pci_read_config(dev, val, 4, where)
+
+typedef uint8_t    u8;
+typedef int8_t     s8;
+typedef uint16_t   u16;
+typedef int16_t    s16;
+typedef uint32_t   u32;
+typedef int32_t    s32;
+typedef int64_t    s64;
+typedef uint64_t   u64;
+
+static inline int
+PCI_READ_CONFIG_RANGE(PCI_DEV *dev, uint32_t *val, int size, int where)
+{
+	return rte_pci_read_config(dev, val, size, where);
+}
+
+#endif /* _IFCVF_OSDEP_H_ */
diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
new file mode 100644
index 000000000..a2b3f0dbf
--- /dev/null
+++ b/drivers/net/ifc/ifcvf_vdpa.c
@@ -0,0 +1,846 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+
+#include <rte_malloc.h>
+#include <rte_memory.h>
+#include <rte_bus_pci.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_vfio.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+
+#include "base/ifcvf.h"
+
+#define DRV_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, ifcvf_vdpa_logtype, \
+		"%s(): " fmt "\n", __func__, ##args)
+
+#ifndef PAGE_SIZE
+#define PAGE_SIZE 4096
+#endif
+
+static int ifcvf_vdpa_logtype;
+
+struct ifcvf_internal {
+	struct rte_vdpa_dev_addr dev_addr;
+	struct rte_pci_device *pdev;
+	struct ifcvf_hw hw;
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+	int vid;
+	int did;
+	uint16_t max_queues;
+	uint64_t features;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct ifcvf_internal *internal;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct internal_list *
+find_internal_resource_by_did(int did)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (did == list->internal->did) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_dev(struct rte_pci_device *pdev)
+{
+	int found = 0;
+	struct internal_list *list;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		if (pdev == list->internal->pdev) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+ifcvf_vfio_setup(struct ifcvf_internal *internal)
+{
+	struct rte_pci_device *dev = internal->pdev;
+	char devname[RTE_DEV_NAME_MAX_LEN] = {0};
+	int iommu_group_num;
+	int ret = 0;
+	int i;
+
+	internal->vfio_dev_fd = -1;
+	internal->vfio_group_fd = -1;
+	internal->vfio_container_fd = -1;
+
+	rte_pci_device_name(&dev->addr, devname, RTE_DEV_NAME_MAX_LEN);
+	rte_vfio_get_group_num(rte_pci_get_sysfs_path(), devname,
+			&iommu_group_num);
+
+	internal->vfio_container_fd = rte_vfio_container_create();
+	if (internal->vfio_container_fd < 0)
+		return -1;
+
+	internal->vfio_group_fd = rte_vfio_container_group_bind(
+			internal->vfio_container_fd, iommu_group_num);
+	if (internal->vfio_group_fd < 0)
+		goto err;
+
+	if (rte_pci_map_device(dev))
+		goto err;
+
+	internal->vfio_dev_fd = dev->intr_handle.vfio_dev_fd;
+
+	for (i = 0; i < RTE_MIN(PCI_MAX_RESOURCE, IFCVF_PCI_MAX_RESOURCE);
+			i++) {
+		internal->hw.mem_resource[i].addr =
+			internal->pdev->mem_resource[i].addr;
+		internal->hw.mem_resource[i].phys_addr =
+			internal->pdev->mem_resource[i].phys_addr;
+		internal->hw.mem_resource[i].len =
+			internal->pdev->mem_resource[i].len;
+	}
+	ret = ifcvf_init_hw(&internal->hw, internal->pdev);
+
+	return ret;
+
+err:
+	rte_vfio_container_destroy(internal->vfio_container_fd);
+	return -1;
+}
+
+static int
+ifcvf_dma_map(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+
+		reg = &mem->regions[i];
+		DRV_LOG(INFO, "region %u: HVA 0x%" PRIx64 ", "
+			"GPA 0x%" PRIx64 ", size 0x%" PRIx64 ".",
+			i, reg->host_user_addr, reg->guest_phys_addr,
+			reg->size);
+
+		rte_vfio_container_dma_map(vfio_container_fd,
+				reg->host_user_addr, reg->guest_phys_addr,
+				reg->size);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+ifcvf_dma_unmap(struct ifcvf_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		DRV_LOG(ERR, "failed to get VM memory layout.");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct rte_vhost_mem_region *reg;
+
+		reg = &mem->regions[i];
+		rte_vfio_container_dma_unmap(vfio_container_fd,
+				reg->host_user_addr, reg->guest_phys_addr,
+				reg->size);
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static int
+vdpa_ifcvf_start(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	int i, nr_vring;
+	int vid;
+	struct rte_vhost_vring vq;
+
+	vid = internal->vid;
+	nr_vring = rte_vhost_get_vring_num(vid);
+	rte_vhost_get_negotiated_features(vid, &hw->req_features);
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(vid, i, &vq);
+		hw->vring[i].desc = qva_to_gpa(vid,
+				(uint64_t)(uintptr_t)vq.desc);
+		hw->vring[i].avail = qva_to_gpa(vid,
+				(uint64_t)(uintptr_t)vq.avail);
+		hw->vring[i].used = qva_to_gpa(vid,
+				(uint64_t)(uintptr_t)vq.used);
+		hw->vring[i].size = vq.size;
+		rte_vhost_get_vring_base(vid, i, &hw->vring[i].last_avail_idx,
+				&hw->vring[i].last_used_idx);
+	}
+	hw->nr_vring = i;
+
+	return ifcvf_start_hw(&internal->hw);
+}
+
+static void
+vdpa_ifcvf_stop(struct ifcvf_internal *internal)
+{
+	struct ifcvf_hw *hw = &internal->hw;
+	uint32_t i, j;
+	int vid;
+	uint64_t features, pfn;
+	uint64_t log_base, log_size;
+	uint32_t size;
+	uint8_t *log_buf;
+
+	vid = internal->vid;
+	ifcvf_stop_hw(hw);
+
+	for (i = 0; i < hw->nr_vring; i++)
+		rte_vhost_set_vring_base(vid, i, hw->vring[i].last_avail_idx,
+				hw->vring[i].last_used_idx);
+
+	rte_vhost_get_negotiated_features(vid, &features);
+	if (RTE_VHOST_NEED_LOG(features)) {
+		ifcvf_disable_logging(hw);
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		/*
+		 * IFCVF marks dirty memory pages for only packet buffer,
+		 * SW helps to mark the used ring as dirty after device stops.
+		 */
+		log_buf = (uint8_t *)(uintptr_t)log_base;
+		for (i = 0; i < hw->nr_vring; i++) {
+			size = hw->vring[i].size * 8 + 4;
+			pfn = hw->vring[i].used / PAGE_SIZE;
+			for (j = 0; j <= size / PAGE_SIZE; j++)
+				__sync_fetch_and_or_8(&log_buf[(pfn + j) / 8],
+						 1 << ((pfn + j) % 8));
+		}
+	}
+}
+
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (IFCVF_MAX_QUEUES * 2 + 1))
+static int
+vdpa_enable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct rte_vhost_vring vring;
+
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = internal->pdev->intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error enabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct ifcvf_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		DRV_LOG(ERR, "Error disabling MSI-X interrupts: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	uint32_t qid, q_num;
+	struct epoll_event events[IFCVF_MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
+	struct ifcvf_hw *hw = &internal->hw;
+
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
+	if (epfd < 0) {
+		DRV_LOG(ERR, "failed to create epoll instance.");
+		return NULL;
+	}
+	internal->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			DRV_LOG(ERR, "epoll_wait returned failure");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					DRV_LOG(INFO, "Error reading "
+						"kickfd: %s",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			ifcvf_notify_queue(hw, qid);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->tid, NULL, notify_relay,
+			(void *)internal);
+	if (ret) {
+		DRV_LOG(ERR, "failed to create notify relay pthread.");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct ifcvf_internal *internal)
+{
+	void *status;
+
+	if (internal->tid) {
+		pthread_cancel(internal->tid);
+		pthread_join(internal->tid, &status);
+	}
+	internal->tid = 0;
+
+	if (internal->epfd >= 0)
+		close(internal->epfd);
+	internal->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct ifcvf_internal *internal)
+{
+	int ret;
+
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = ifcvf_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_ifcvf_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_ifcvf_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = ifcvf_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+ifcvf_dev_config(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	internal->vid = vid;
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_dev_close(int vid)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(internal);
+
+	return 0;
+}
+
+static int
+ifcvf_set_features(int vid)
+{
+	uint64_t features;
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	uint64_t log_base, log_size;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	if (RTE_VHOST_NEED_LOG(features)) {
+		rte_vhost_get_log_base(internal->vid, &log_base, &log_size);
+		log_base = rte_mem_virt2phy((void *)(uintptr_t)log_base);
+		ifcvf_enable_logging(&internal->hw, log_base, log_size);
+	}
+
+	return 0;
+}
+
+static int
+ifcvf_get_vfio_group_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_group_fd;
+}
+
+static int
+ifcvf_get_vfio_device_fd(int vid)
+{
+	int did;
+	struct internal_list *list;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	return list->internal->vfio_dev_fd;
+}
+
+static int
+ifcvf_get_notify_area(int vid, int qid, uint64_t *offset, uint64_t *size)
+{
+	int did;
+	struct internal_list *list;
+	struct ifcvf_internal *internal;
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	int ret;
+
+	did = rte_vhost_get_vdpa_device_id(vid);
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	internal = list->internal;
+
+	reg.index = ifcvf_get_notify_region(&internal->hw);
+	ret = ioctl(internal->vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to get device region info: %s",
+				strerror(errno));
+		return -1;
+	}
+
+	*offset = ifcvf_get_queue_notify_off(&internal->hw, qid) + reg.offset;
+	*size = 0x1000;
+
+	return 0;
+}
+
+static int
+ifcvf_get_queue_num(int did, uint32_t *queue_num)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*queue_num = list->internal->max_queues;
+
+	return 0;
+}
+
+static int
+ifcvf_get_vdpa_features(int did, uint64_t *features)
+{
+	struct internal_list *list;
+
+	list = find_internal_resource_by_did(did);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device id: %d", did);
+		return -1;
+	}
+
+	*features = list->internal->features;
+
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK | \
+		 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+static int
+ifcvf_get_protocol_features(int did __rte_unused, uint64_t *features)
+{
+	*features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+	return 0;
+}
+
+struct rte_vdpa_dev_ops ifcvf_ops = {
+	.get_queue_num = ifcvf_get_queue_num,
+	.get_features = ifcvf_get_vdpa_features,
+	.get_protocol_features = ifcvf_get_protocol_features,
+	.dev_conf = ifcvf_dev_config,
+	.dev_close = ifcvf_dev_close,
+	.set_vring_state = NULL,
+	.set_features = ifcvf_set_features,
+	.migration_done = NULL,
+	.get_vfio_group_fd = ifcvf_get_vfio_group_fd,
+	.get_vfio_device_fd = ifcvf_get_vfio_device_fd,
+	.get_notify_area = ifcvf_get_notify_area,
+};
+
+static int
+ifcvf_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
+		struct rte_pci_device *pci_dev)
+{
+	uint64_t features;
+	struct ifcvf_internal *internal = NULL;
+	struct internal_list *list = NULL;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = rte_zmalloc("ifcvf", sizeof(*list), 0);
+	if (list == NULL)
+		goto error;
+
+	internal = rte_zmalloc("ifcvf", sizeof(*internal), 0);
+	if (internal == NULL)
+		goto error;
+
+	internal->pdev = pci_dev;
+	rte_spinlock_init(&internal->lock);
+	if (ifcvf_vfio_setup(internal) < 0)
+		return -1;
+
+	internal->max_queues = IFCVF_MAX_QUEUES;
+	features = ifcvf_get_features(&internal->hw);
+	internal->features = (features &
+		~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+		(1ULL << VHOST_USER_F_PROTOCOL_FEATURES) |
+		(1ULL << VHOST_F_LOG_ALL);
+
+	internal->dev_addr.pci_addr = pci_dev->addr;
+	internal->dev_addr.type = PCI_ADDR;
+	list->internal = internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	internal->did = rte_vdpa_register_device(&internal->dev_addr,
+				&ifcvf_ops);
+	if (internal->did < 0)
+		goto error;
+
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(internal);
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(internal);
+	return -1;
+}
+
+static int
+ifcvf_pci_remove(struct rte_pci_device *pci_dev)
+{
+	struct ifcvf_internal *internal;
+	struct internal_list *list;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	list = find_internal_resource_by_dev(pci_dev);
+	if (list == NULL) {
+		DRV_LOG(ERR, "Invalid device: %s", pci_dev->name);
+		return -1;
+	}
+
+	internal = list->internal;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(internal);
+
+	rte_pci_unmap_device(internal->pdev);
+	rte_vfio_container_destroy(internal->vfio_container_fd);
+	rte_vdpa_unregister_device(internal->did);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	rte_free(list);
+	rte_free(internal);
+
+	return 0;
+}
+
+/*
+ * IFCVF has the same vendor ID and device ID as virtio net PCI
+ * device, with its specific subsystem vendor ID and device ID.
+ */
+static const struct rte_pci_id pci_id_ifcvf_map[] = {
+	{ .class_id = RTE_CLASS_ANY_ID,
+	  .vendor_id = IFCVF_VENDOR_ID,
+	  .device_id = IFCVF_DEVICE_ID,
+	  .subsystem_vendor_id = IFCVF_SUBSYS_VENDOR_ID,
+	  .subsystem_device_id = IFCVF_SUBSYS_DEVICE_ID,
+	},
+
+	{ .vendor_id = 0, /* sentinel */
+	},
+};
+
+static struct rte_pci_driver rte_ifcvf_vdpa = {
+	.id_table = pci_id_ifcvf_map,
+	.drv_flags = 0,
+	.probe = ifcvf_pci_probe,
+	.remove = ifcvf_pci_remove,
+};
+
+RTE_PMD_REGISTER_PCI(net_ifcvf, rte_ifcvf_vdpa);
+RTE_PMD_REGISTER_PCI_TABLE(net_ifcvf, pci_id_ifcvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_ifcvf, "* vfio-pci");
+
+RTE_INIT(ifcvf_vdpa_init_log);
+static void
+ifcvf_vdpa_init_log(void)
+{
+	ifcvf_vdpa_logtype = rte_log_register("pmd.net.ifcvf_vdpa");
+	if (ifcvf_vdpa_logtype >= 0)
+		rte_log_set_level(ifcvf_vdpa_logtype, RTE_LOG_NOTICE);
+}
diff --git a/drivers/net/ifc/rte_ifcvf_version.map b/drivers/net/ifc/rte_ifcvf_version.map
new file mode 100644
index 000000000..9b9ab1a4c
--- /dev/null
+++ b/drivers/net/ifc/rte_ifcvf_version.map
@@ -0,0 +1,4 @@
+DPDK_18.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 8bab901fc..0e18d0fac 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -186,6 +186,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+ifeq ($(CONFIG_RTE_EAL_VFIO),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD) += -lrte_ifcvf_vdpa
+endif # $(CONFIG_RTE_EAL_VFIO)
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v9 5/5] doc: add ifcvf driver document and release note
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
                                                   ` (3 preceding siblings ...)
  2018-04-17  7:06                                 ` [PATCH v9 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
@ 2018-04-17  7:06                                 ` Xiao Wang
  2018-04-17 11:13                                 ` [PATCH v9 0/5] add ifcvf vdpa driver Ferruh Yigit
  5 siblings, 0 replies; 98+ messages in thread
From: Xiao Wang @ 2018-04-17  7:06 UTC (permalink / raw)
  To: ferruh.yigit
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas, Xiao Wang

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 doc/guides/nics/features/ifcvf.ini     |  8 +++
 doc/guides/nics/ifcvf.rst              | 98 ++++++++++++++++++++++++++++++++++
 doc/guides/nics/index.rst              |  1 +
 doc/guides/rel_notes/release_18_05.rst |  9 ++++
 4 files changed, 116 insertions(+)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst

diff --git a/doc/guides/nics/features/ifcvf.ini b/doc/guides/nics/features/ifcvf.ini
new file mode 100644
index 000000000..ef1fc4711
--- /dev/null
+++ b/doc/guides/nics/features/ifcvf.ini
@@ -0,0 +1,8 @@
+;
+; Supported features of the 'ifcvf' vDPA driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+x86-32               = Y
+x86-64               = Y
diff --git a/doc/guides/nics/ifcvf.rst b/doc/guides/nics/ifcvf.rst
new file mode 100644
index 000000000..d7e76353c
--- /dev/null
+++ b/doc/guides/nics/ifcvf.rst
@@ -0,0 +1,98 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+IFCVF vDPA driver
+=================
+
+The IFCVF vDPA (vhost data path acceleration) driver provides support for the
+Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio-ring compatible, so it
+works as a HW vhost backend that can send/receive packets to/from virtio
+directly by DMA. It also supports dirty page logging and device state
+report/restore. This driver enables the device's vDPA functionality, including
+live migration.
+
+
+Pre-Installation Configuration
+------------------------------
+
+Config File Options
+~~~~~~~~~~~~~~~~~~~
+
+The following option can be modified in the ``config`` file.
+
+- ``CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD`` (default ``y`` for linux)
+
+  Toggle compilation of the ``librte_ifcvf_vdpa`` driver.
+
+
+IFCVF vDPA Implementation
+-------------------------
+
+IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
+device, but it has its own subsystem vendor ID and device ID. To let the
+device be probed by the IFCVF driver, add the "vdpa=1" devarg to specify that
+the device is to be used in vDPA mode rather than in polling mode; the virtio
+PMD will skip the device when it detects this devarg.
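+
+A minimal, illustrative sketch (not part of this patch) of passing the
+"vdpa=1" devarg from an application; the application name and PCI address
+below are placeholders::
+
+    #include <rte_eal.h>
+
+    int
+    main(void)
+    {
+        /* The whitelist devarg makes the virtio PMD skip this VF so that
+         * the IFCVF vDPA driver can probe it instead. */
+        char *argv[] = { "vdpa_app", "-w", "0000:06:00.3,vdpa=1" };
+
+        if (rte_eal_init(3, argv) < 0)
+            return -1;
+        return 0;
+    }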
+
+Different VF devices serve different virtio frontends in different VMs, so
+each VF needs its own DMA address translation service. During driver probe a
+new VFIO container is created for the device; with this container the vDPA
+driver can program the DMA remapping table with the VM's memory region
+information, as sketched below.
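+
+A minimal sketch of the probe-time container setup, using the multi-container
+VFIO APIs added by this patch set (error handling is simplified and the IOMMU
+group number is assumed to have been parsed from sysfs already)::
+
+    #include <rte_vfio.h>
+
+    /* Sketch only: put one VF (identified by its IOMMU group) into a new,
+     * per-device VFIO container. */
+    static int
+    setup_device_container(int iommu_group_num)
+    {
+        int container_fd, group_fd;
+
+        container_fd = rte_vfio_container_create();
+        if (container_fd < 0)
+            return -1;
+
+        group_fd = rte_vfio_container_group_bind(container_fd, iommu_group_num);
+        if (group_fd < 0) {
+            rte_vfio_container_destroy(container_fd);
+            return -1;
+        }
+
+        /* Later, when the VM's memory table is known, program DMA remapping:
+         * rte_vfio_container_dma_map(container_fd, host_va, guest_iova, len);
+         */
+        return container_fd;
+    }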
+
+Key IFCVF vDPA driver ops
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ifcvf_dev_config:
+  Enable the VF data path with the virtio information provided by the vhost
+  lib: program the IOMMU to enable VF DMA to the VM's memory, set up VFIO
+  interrupts to route HW interrupts to the virtio driver, create a notify
+  relay thread to translate the virtio driver's kick into an MMIO write onto
+  the HW, and configure the HW queues.
+
+  This function gets called to set up the HW data path backend when the
+  virtio driver in the VM gets ready.
+
+- ifcvf_dev_close:
+  Revoke all the setup done in ifcvf_dev_config.
+
+  This function gets called when the virtio driver stops the device in the VM.
+
+To create a vhost port with IFC VF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Create a vhost socket and assign a VF's device ID to this socket via the
+  vhost API. When the QEMU vhost connection gets ready, the assigned VF will
+  be configured automatically, as in the sketch below.
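+
+A minimal sketch of this flow, assuming the vhost selective datapath API from
+the series this patch set depends on (the socket path below is a placeholder
+and error handling is simplified)::
+
+    #include <rte_vhost.h>
+
+    /* Sketch only: bind a vhost-user socket to the vDPA device id ("did")
+     * that rte_vdpa_register_device() returned for the chosen VF. */
+    static int
+    create_ifcvf_vhost_port(int did)
+    {
+        const char *path = "/tmp/vhost-user0";
+
+        if (rte_vhost_driver_register(path, 0) < 0)
+            return -1;
+        if (rte_vhost_driver_attach_vdpa_device(path, did) < 0)
+            return -1;
+
+        /* Once QEMU connects to this socket, the assigned VF is configured
+         * automatically through the registered vDPA ops. */
+        return rte_vhost_driver_start(path);
+    }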
+
+
+Features
+--------
+
+Features of the IFCVF driver are:
+
+- Compatibility with virtio 0.95 and 1.0.
+- Live migration.
+
+
+Prerequisites
+-------------
+
+- A platform with IOMMU support. The IFC VF needs an address translation
+  service to Rx/Tx directly with the virtio driver in the VM.
+
+
+Limitations
+-----------
+
+Dependency on vfio-pci
+~~~~~~~~~~~~~~~~~~~~~~
+
+The vDPA driver needs to set up the VF's MSI-X interrupts; each queue's
+interrupt vector is mapped to a callfd associated with a virtio ring.
+Currently only vfio-pci allows multiple interrupts, so the IFCVF driver
+depends on vfio-pci.
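+
+For illustration, a minimal sketch (not the driver's actual code; the helper
+name and parameters are placeholders) of routing each queue's MSI-X vector to
+its ring's callfd through vfio-pci::
+
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/vfio.h>
+
+    static int
+    relay_msix_to_callfds(int vfio_dev_fd, const int *callfds, int nb_rings)
+    {
+        char buf[sizeof(struct vfio_irq_set) + sizeof(int) * nb_rings];
+        struct vfio_irq_set *irq_set = (struct vfio_irq_set *)buf;
+
+        irq_set->argsz = sizeof(buf);
+        irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                         VFIO_IRQ_SET_ACTION_TRIGGER;
+        irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+        irq_set->start = 0;
+        irq_set->count = nb_rings;
+        /* One eventfd (the ring's callfd) per MSI-X vector. */
+        memcpy(irq_set->data, callfds, sizeof(int) * nb_rings);
+
+        return ioctl(vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    }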
+
+Live Migration with VIRTIO_NET_F_GUEST_ANNOUNCE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The IFC VF doesn't support RARP packet generation; a virtio frontend that
+supports the VIRTIO_NET_F_GUEST_ANNOUNCE feature can do this announcement
+instead.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index ea9110c81..9b98c620f 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -45,6 +45,7 @@ Network Interface Controller Drivers
     vmxnet3
     pcap_ring
     fail_safe
+    ifcvf
 
 **Figures**
 
diff --git a/doc/guides/rel_notes/release_18_05.rst b/doc/guides/rel_notes/release_18_05.rst
index bc9cdda6a..2f803231e 100644
--- a/doc/guides/rel_notes/release_18_05.rst
+++ b/doc/guides/rel_notes/release_18_05.rst
@@ -115,6 +115,15 @@ New Features
 
   Linux uevent is supported as backend of this device event notification framework.
 
+* **Added IFCVF vDPA driver.**
+
+  Added the IFCVF vDPA driver to support the Intel FPGA 100G VF device. IFCVF
+  works as a HW vhost data path accelerator; it supports live migration and
+  is compatible with virtio 0.95 and 1.0. The driver registers the ifcvf vDPA
+  driver with the vhost lib; when a virtio connection is established, the
+  registered vDPA driver configures the assigned VF to Rx/Tx directly to/from
+  the VM's virtio vrings.
+
 
 API Changes
 -----------
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v9 0/5] add ifcvf vdpa driver
  2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
                                                   ` (4 preceding siblings ...)
  2018-04-17  7:06                                 ` [PATCH v9 5/5] doc: add ifcvf driver document and release note Xiao Wang
@ 2018-04-17 11:13                                 ` Ferruh Yigit
  5 siblings, 0 replies; 98+ messages in thread
From: Ferruh Yigit @ 2018-04-17 11:13 UTC (permalink / raw)
  To: Xiao Wang
  Cc: anatoly.burakov, dev, maxime.coquelin, zhihong.wang, tiwei.bie,
	jianfeng.tan, cunming.liang, dan.daly, thomas

On 4/17/2018 8:06 AM, Xiao Wang wrote:
> IFCVF driver
> ============
> The IFCVF vDPA (vhost data path acceleration) driver provides support for the
> Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio ring compatible, it
> works as a HW vhost backend which can send/receive packets to/from virtio
> directly by DMA. Besides, it supports dirty page logging and device state
> report/restore. This driver enables its vDPA functionality with live migration
> feature.
> 
> vDPA mode
> =========
> IFCVF's vendor ID and device ID are same as that of virtio net pci device,
> with its specific subsystem vendor ID and device ID. To let the device be
> probed by IFCVF driver, adding "vdpa=1" parameter helps to specify that this
> device is to be used in vDPA mode, rather than polling mode, virtio pmd will
> skip when it detects this message.
> 
> Container per device
> ====================
> vDPA needs to create different containers for different devices, thus this
> patch set adds some APIs in eal/vfio to support multiple container, e.g.
> - rte_vfio_container_create
> - rte_vfio_container_destroy
> - rte_vfio_container_group_bind
> - rte_vfio_container_group_unbind
> 
> By this extension, a device can be put into a new specific container, rather
> than the previous default container.
> 
> Two APIs are added for IOMMU programming for a specified container:
> - rte_vfio_container_dma_map
> - rte_vfio_container_dma_unmap
> 
> IFCVF vDPA details
> ==================
> Key vDPA driver ops implemented:
> - ifcvf_dev_config:
>   Enable VF data path with virtio information provided by vhost lib, including
>   IOMMU programming to enable VF DMA to VM's memory, VFIO interrupt setup to
>   route HW interrupt to virtio driver, create notify relay thread to translate
>   virtio driver's kick to a MMIO write onto HW, HW queues configuration.
> 
>   This function gets called to set up HW data path backend when virtio driver
>   in VM gets ready.
> 
> - ifcvf_dev_close:
>   Revoke all the setup in ifcvf_dev_config.
> 
>   This function gets called when virtio driver stops device in VM.
> 
> Change log
> ==========
> v9:
> - Rebase on master tree's HEAD.
> - Fix compile error on 32-bit platform.
> 
> v8:
> - Rebase on HEAD.
> - Move vfio_group definition back to eal_vfio.h.
> - Return NULL when vfio group num/fd is not found, let caller handle that.
> - Fix wrong API name in commit log.
> - Rename bind/unbind function to rte_vfio_container_group_bind/unbind for
>   consistency.
> - Add note for rte_vfio_container_create and rte_vfio_dma_map and fix typo
>   in comment.
> - Extract out the shared code snip of rte_vfio_dma_map and
>   rte_vfio_container_dma_map to avoid code duplication. So do for the unmap.
> 
> v7:
> - Rebase on HEAD.
> - Split the vfio patch into 2 parts, one for data structure extension, one for
>   adding new API.
> - Use static vfio_config array instead of dynamic allocating.
> - Change rte_vfio_container_dma_map/unmap's parameters to use (va, iova, len).
> 
> v6:
> - Rebase on master branch.
> - Document "vdpa" devarg in virtio documentation.
> - Rename ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
>   consistency, and add it into driver documentation.
> - Add comments for ifcvf device ID.
> - Minor code cleaning.
> 
> v5:
> - Fix compilation in BSD, remove the rte_vfio.h including in BSD.
> 
> v4:
> - Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
> - Remove API "rte_vfio_get_group_fd", "rte_vfio_bind_group" will return the fd.
> - Align the vfio_cfg search internal APIs naming.
> 
> v3:
> - Add doc and release note for the new driver.
> - Remove the vdev concept, make the driver as a PCI driver, it will get probed
>   by PCI bus driver.
> - Rebase on the v4 vDPA lib patch, register a vDPA device instead of a engine.
> - Remove the PCI API exposure accordingly.
> - Move the MAX_VFIO_CONTAINERS definition to config file.
> - Let virtio pmd skips when a virtio device needs to work in vDPA mode.
> 
> v2:
> - Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
>   to make the API generic cross Linux and BSD, make it as EXPERIMENTAL.
> - Rebase on Zhihong's vDPA v3 patch set.
> - Minor code cleanup on vfio extension.
> 
> 
> Xiao Wang (5):
>   vfio: extend data structure for multi container
>   vfio: add multi container support
>   net/virtio: skip device probe in vdpa mode
>   net/ifcvf: add ifcvf vdpa driver
>   doc: add ifcvf driver document and release note

Series applied to dpdk-next-net/master, thanks.

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2018-04-17 11:13 UTC | newest]

Thread overview: 98+ messages
2018-03-09 23:08 [PATCH 0/3] add ifcvf driver Xiao Wang
2018-03-09 23:08 ` [PATCH 1/3] eal/vfio: add support for multiple container Xiao Wang
2018-03-14 12:08   ` Burakov, Anatoly
2018-03-15 16:49     ` Wang, Xiao W
2018-03-09 23:08 ` [PATCH 2/3] bus/pci: expose sysfs parsing API Xiao Wang
2018-03-14 11:19   ` Burakov, Anatoly
2018-03-14 13:30     ` Gaëtan Rivet
2018-03-15 16:49       ` Wang, Xiao W
2018-03-15 17:19         ` Gaëtan Rivet
2018-03-19  1:31           ` Wang, Xiao W
2018-03-21 13:21   ` [PATCH v2 0/3] add ifcvf driver Xiao Wang
2018-03-21 13:21     ` [PATCH v2 1/3] eal/vfio: add support for multiple container Xiao Wang
2018-03-21 20:32       ` Thomas Monjalon
2018-03-21 21:37         ` Gaëtan Rivet
2018-03-22  3:00           ` Wang, Xiao W
2018-03-21 13:21     ` [PATCH v2 2/3] bus/pci: expose sysfs parsing API Xiao Wang
2018-03-21 20:44       ` Thomas Monjalon
2018-03-22  2:46         ` Wang, Xiao W
2018-03-21 13:21     ` [PATCH v2 3/3] net/ifcvf: add ifcvf driver Xiao Wang
2018-03-21 20:52       ` Thomas Monjalon
2018-03-23 10:39         ` Wang, Xiao W
2018-03-21 20:57       ` Maxime Coquelin
2018-03-23 10:37         ` Wang, Xiao W
2018-03-22  8:51       ` Ferruh Yigit
2018-03-22 17:23         ` Wang, Xiao W
2018-03-31  2:29       ` [PATCH v3 0/3] add ifcvf vdpa driver Xiao Wang
2018-03-31  2:29         ` [PATCH v3 1/4] eal/vfio: add support for multiple container Xiao Wang
2018-03-31 11:06           ` Maxime Coquelin
2018-03-31  2:29         ` [PATCH v3 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-03-31 11:13           ` Maxime Coquelin
2018-03-31 13:16             ` Thomas Monjalon
2018-04-02  4:08               ` Wang, Xiao W
2018-03-31  2:29         ` [PATCH v3 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-03-31 11:26           ` Maxime Coquelin
2018-04-03  9:38             ` Wang, Xiao W
2018-04-04 14:40           ` [PATCH v4 0/4] " Xiao Wang
2018-04-04 14:40             ` [PATCH v4 1/4] eal/vfio: add multiple container support Xiao Wang
2018-04-05 18:06               ` [PATCH v5 0/4] add ifcvf vdpa driver Xiao Wang
2018-04-05 18:06                 ` [PATCH v5 1/4] eal/vfio: add multiple container support Xiao Wang
2018-04-05 18:06                 ` [PATCH v5 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-04-11 18:58                   ` Ferruh Yigit
2018-04-05 18:07                 ` [PATCH v5 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-04-11 18:58                   ` Ferruh Yigit
2018-04-12  7:19                   ` [PATCH v6 0/4] " Xiao Wang
2018-04-12  7:19                     ` [PATCH v6 1/4] eal/vfio: add multiple container support Xiao Wang
2018-04-12 14:03                       ` Burakov, Anatoly
2018-04-12 16:07                         ` Wang, Xiao W
2018-04-12 16:24                           ` Burakov, Anatoly
2018-04-13  9:18                             ` Wang, Xiao W
2018-04-15 15:33                       ` [PATCH v7 0/5] add ifcvf vdpa driver Xiao Wang
2018-04-15 15:33                         ` [PATCH v7 1/5] vfio: extend data structure for multi container Xiao Wang
2018-04-16 10:02                           ` Burakov, Anatoly
2018-04-16 12:22                             ` Wang, Xiao W
2018-04-16 15:34                           ` [PATCH v8 0/5] add ifcvf vdpa driver Xiao Wang
2018-04-16 15:34                             ` [PATCH v8 1/5] vfio: extend data structure for multi container Xiao Wang
2018-04-16 15:56                               ` Burakov, Anatoly
2018-04-16 15:34                             ` [PATCH v8 2/5] vfio: add multi container support Xiao Wang
2018-04-16 15:58                               ` Burakov, Anatoly
2018-04-17  7:06                               ` [PATCH v9 0/5] add ifcvf vdpa driver Xiao Wang
2018-04-17  7:06                                 ` [PATCH v9 1/5] vfio: extend data structure for multi container Xiao Wang
2018-04-17  7:06                                 ` [PATCH v9 2/5] vfio: add multi container support Xiao Wang
2018-04-17  7:06                                 ` [PATCH v9 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-04-17  7:06                                 ` [PATCH v9 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-04-17  7:06                                 ` [PATCH v9 5/5] doc: add ifcvf driver document and release note Xiao Wang
2018-04-17 11:13                                 ` [PATCH v9 0/5] add ifcvf vdpa driver Ferruh Yigit
2018-04-16 15:34                             ` [PATCH v8 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-04-16 15:34                             ` [PATCH v8 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-04-16 15:34                             ` [PATCH v8 5/5] doc: add ifcvf driver document and release note Xiao Wang
2018-04-16 16:36                             ` [PATCH v8 0/5] add ifcvf vdpa driver Ferruh Yigit
2018-04-16 18:07                               ` Thomas Monjalon
2018-04-17  5:36                                 ` Wang, Xiao W
2018-04-15 15:33                         ` [PATCH v7 2/5] vfio: add multi container support Xiao Wang
2018-04-16 10:03                           ` Burakov, Anatoly
2018-04-16 12:44                             ` Wang, Xiao W
2018-04-15 15:33                         ` [PATCH v7 3/5] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-04-15 15:33                         ` [PATCH v7 4/5] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-04-15 15:33                         ` [PATCH v7 5/5] doc: add ifcvf driver document and release note Xiao Wang
2018-04-12  7:19                     ` [PATCH v6 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-04-12  7:19                     ` [PATCH v6 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-04-12  7:19                     ` [PATCH v6 4/4] doc: add ifcvf driver document and release note Xiao Wang
2018-04-05 18:07                 ` [PATCH v5 " Xiao Wang
2018-04-11 18:59                 ` [PATCH v5 0/4] add ifcvf vdpa driver Ferruh Yigit
2018-04-12  5:47                   ` Wang, Xiao W
2018-04-04 14:40             ` [PATCH v4 2/4] net/virtio: skip device probe in vdpa mode Xiao Wang
2018-04-04 14:40             ` [PATCH v4 3/4] net/ifcvf: add ifcvf vdpa driver Xiao Wang
2018-04-04 14:40             ` [PATCH v4 4/4] doc: add ifcvf driver document and release note Xiao Wang
2018-03-31  2:29         ` [PATCH v3 4/4] net/ifcvf: add " Xiao Wang
2018-03-31 11:28           ` Maxime Coquelin
2018-03-09 23:08 ` [PATCH 3/3] net/ifcvf: add ifcvf driver Xiao Wang
2018-03-10 18:23 ` [PATCH 0/3] " Maxime Coquelin
2018-03-15 16:49   ` Wang, Xiao W
2018-03-21 20:47     ` Maxime Coquelin
2018-03-23 10:27       ` Wang, Xiao W
2018-03-25  9:51         ` Maxime Coquelin
2018-03-26  9:05           ` Wang, Xiao W
2018-03-26 13:29             ` Maxime Coquelin
2018-03-27  4:40               ` Wang, Xiao W
2018-03-27  5:09                 ` Maxime Coquelin
