* [PATCH 0/3] add vDPA sample driver
@ 2018-02-04 14:55 Xiao Wang
  2018-02-04 14:55 ` [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

This patch set has dependency on the vhost lib patch:
http://dpdk.org/dev/patchwork/patch/34872/

This patch set shows a reference sample of making a vDPA device driver.
The driver uses a QEMU-emulated virtio-net PCI device as the vDPA device,
and makes it serve as a backend for a virtio-net PCI device in a nested VM.

The key driver ops implemented are (a condensed registration sketch
follows the list):

* vdpa_virtio_eng_init
Map the virtio pci device into userspace with VFIO, read the device
capability and initialize internal data.

* vdpa_virtio_eng_uninit
Release the mapped device.

* vdpa_virtio_info_query
Report device capabilities, e.g. queue number and features.

* vdpa_virtio_dev_config
With the guest virtio information provided by the vhost lib, this
function configures the device and the IOMMU to set up the vhost
datapath, which includes: Rx/Tx vring, VFIO interrupt, kick relay.

* vdpa_virtio_dev_close
Tear down what was previously configured by dev_conf.
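
For orientation, the sketch below (condensed from the registration code
in patch 2; rte_vdpa.h comes from the vhost lib dependency above) shows
how these ops plug into the vhost lib:

    struct rte_vdpa_eng_driver vdpa_virtio_pci_driver = {
        .name = "vdpa_virtio_pci",
        .eng_ops = {
            .eng_init   = vdpa_virtio_eng_init,
            .eng_uninit = vdpa_virtio_eng_uninit,
            .info_query = vdpa_virtio_info_query,
        },
        .dev_ops = {
            .dev_conf  = vdpa_virtio_dev_config,
            .dev_close = vdpa_virtio_dev_close,
        },
    };
    RTE_VDPA_REGISTER_DRIVER(vdpa_virtio_pci, vdpa_virtio_pci_driver);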

This driver requires that the virtio device support
VIRTIO_F_IOMMU_PLATFORM, because the buffer addresses written in the
descriptors are IOVAs.

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is currently supported.
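
Concretely, the interrupt path hands each vring's callfd to VFIO as an
MSI-X trigger eventfd; condensed from vdpa_enable_vfio_intr() in patch 2:

    irq_set = (struct vfio_irq_set *)irq_set_buf;
    irq_set->argsz = sizeof(irq_set_buf);
    irq_set->count = nr_vring + 1;   /* vector 0 for config, one per vring */
    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
                     VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
    irq_set->start = 0;
    fd_ptr = (int *)&irq_set->data;
    /* the vring callfd (from the vhost lib) becomes an MSI-X eventfd */
    fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
    ioctl(vpci->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);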

Below are setup steps for your reference:

1. Make sure your kernel vhost module and QEMU support vIOMMU.
   - OS: CentOS 7.4
   - QEMU: 2.10.1
   - Guest OS: CentOS 7.2
   - Nested VM OS: CentOS 7.2

2. Enable the VT-x feature for the vCPU in the VM.
   modprobe kvm_intel nested=1

3. Start a VM with a virtio-net-pci device.
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu host \
   <snip>
   -machine q35 \
   -device intel-iommu \
   -netdev tap,id=mytap,ifname=vdpa,vhostforce=on \
   -device virtio-net-pci,netdev=mytap,mac=00:aa:bb:cc:dd:ee,\
   disable-modern=off,disable-legacy=on,iommu_platform=on \

4. Bind vfio-pci to the virtio-net-pci device
   a) login to VM;
   b) modprobe vfio-pci
   c) rmmod vfio_iommu_type1
   d) modprobe vfio_iommu_type1 allow_unsafe_interrupts=1
   e) ./usertools/dpdk-devbind.py -b vfio-pci 00:03.0

5. Start vdpa sample
   ./examples/vdpa/build/vdpa -c 0x2 -n 4 --socket-mem 1024 --no-pci \
    --vdev "net_vdpa_virtio_pci0,bdf=0000:00:03.0" -- --bdf 0000:00:03.0 \
    --iface /tmp/vhost-user- --devcnt 1  --queue 1

6. Start nested VM
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -cpu host -enable-kvm \
   <snip>
   -mem-prealloc \
   -chardev socket,id=char0,path=/tmp/vhost-user-0 \
   -netdev type=vhost-user,id=vdpa,chardev=char0,vhostforce \
   -device virtio-net-pci,netdev=vdpa,mac=00:aa:bb:cc:dd:ee \

7. Log in to the nested VM, and verify that the virtio device in the
   nested VM can communicate with the tap device on the host.

Xiao Wang (3):
  bus/pci: expose API for vDPA
  net/vdpa_virtio_pci: introduce vdpa sample driver
  examples/vdpa: add a new sample for vdpa

 config/common_base                                 |    6 +
 config/common_linuxapp                             |    1 +
 drivers/bus/pci/Makefile                           |    1 +
 drivers/bus/pci/linux/pci.c                        |   10 +-
 drivers/bus/pci/linux/pci_init.h                   |   22 +-
 drivers/bus/pci/linux/pci_vfio.c                   |    5 +-
 drivers/bus/pci/rte_bus_pci_version.map            |   13 +
 drivers/net/Makefile                               |    1 +
 drivers/net/vdpa_virtio_pci/Makefile               |   31 +
 .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527 ++++++++++++++++++++
 .../rte_vdpa_virtio_pci_version.map                |    4 +
 examples/vdpa/Makefile                             |   32 +
 examples/vdpa/main.c                               |  387 +++++
 mk/rte.app.mk                                      |    1 +
 14 files changed, 2030 insertions(+), 11 deletions(-)
 create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
 create mode 100644 examples/vdpa/Makefile
 create mode 100644 examples/vdpa/main.c

-- 
2.15.1


* [PATCH 1/3] bus/pci: expose API for vDPA
  2018-02-04 14:55 [PATCH 0/3] add vDPA sample driver Xiao Wang
@ 2018-02-04 14:55 ` Xiao Wang
  2018-02-04 14:55 ` [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
  2018-02-04 14:55 ` [PATCH 3/3] examples/vdpa: add a new sample for vdpa Xiao Wang
  2 siblings, 0 replies; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

Some existing PCI APIs are helpful for vDPA device setup; expose them
for the driver patch that follows.
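
For example, the driver in the next patch consumes them like this when
validating its device (condensed from its check_pci_dev()):

    char driver[PATH_MAX];

    /* "<sysfs dev dir>/resource" -> fill dev->mem_resource[] */
    if (pci_parse_sysfs_resource(filename, dev) < 0)
        return -1;

    /* "<sysfs dev dir>/driver" -> kernel driver name */
    if (pci_get_kernel_driver_by_path(filename, driver) != 0)
        return -1;

    if (strcmp(driver, "vfio-pci") != 0)
        return -1;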

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 drivers/bus/pci/Makefile                |  1 +
 drivers/bus/pci/linux/pci.c             | 10 +++-------
 drivers/bus/pci/linux/pci_init.h        | 22 +++++++++++++++++++++-
 drivers/bus/pci/linux/pci_vfio.c        |  5 ++---
 drivers/bus/pci/rte_bus_pci_version.map | 13 +++++++++++++
 5 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
index f3df1c4ce..e45bee024 100644
--- a/drivers/bus/pci/Makefile
+++ b/drivers/bus/pci/Makefile
@@ -45,6 +45,7 @@ ifneq ($(CONFIG_RTE_EXEC_ENV_BSDAPP),)
 SYSTEM := bsd
 endif
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/$(SYSTEM)
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/$(SYSTEM)app/eal
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index abde64119..06b811d5e 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -32,7 +32,7 @@
 
 extern struct rte_pci_bus rte_pci_bus;
 
-static int
+int
 pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
 {
 	int count;
@@ -168,8 +168,7 @@ pci_parse_one_sysfs_resource(char *line, size_t len, uint64_t *phys_addr,
 	return 0;
 }
 
-/* parse the "resource" sysfs file */
-static int
+int
 pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev)
 {
 	FILE *f;
@@ -372,10 +371,7 @@ pci_update_device(const struct rte_pci_addr *addr)
 	return pci_scan_one(filename, addr);
 }
 
-/*
- * split up a pci address into its constituent parts.
- */
-static int
+int
 parse_pci_addr_format(const char *buf, int bufsize, struct rte_pci_addr *addr)
 {
 	/* first split on ':' */
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index c2e603a37..9e06cb57d 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -6,6 +6,7 @@
 #define EAL_PCI_INIT_H_
 
 #include <rte_vfio.h>
+#include <stdbool.h>
 
 /** IO resource type: */
 #define IORESOURCE_IO         0x00000100
@@ -15,7 +16,7 @@
  * Helper function to map PCI resources right after hugepages in virtual memory
  */
 extern void *pci_map_addr;
-void *pci_find_max_end_va(void);
+void *__rte_experimental pci_find_max_end_va(void);
 
 /* parse one line of the "resource" sysfs file (note that the 'line'
  * string is modified)
@@ -83,6 +84,25 @@ int pci_vfio_unmap_resource(struct rte_pci_device *dev);
 
 int pci_vfio_is_enabled(void);
 
+/* parse sysfs file path */
+int __rte_experimental
+pci_get_kernel_driver_by_path(const char *filename, char *dri_name);
+
+/* parse the "resource" sysfs file */
+int __rte_experimental
+pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev);
+
+/* split up a pci address into its constituent parts */
+int __rte_experimental
+parse_pci_addr_format(const char *buf, int bufsize, struct rte_pci_addr *addr);
+
+/* get PCI BAR info for MSI-X interrupts */
+int __rte_experimental
+pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table);
+
+/* enable DMA and reset device */
+int __rte_experimental
+pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd);
 #endif
 
 #endif /* EAL_PCI_INIT_H_ */
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..6d0486a7d 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -59,8 +59,7 @@ pci_vfio_write_config(const struct rte_intr_handle *intr_handle,
 	       VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + offs);
 }
 
-/* get PCI BAR number where MSI-X interrupts are */
-static int
+int
 pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 {
 	int ret;
@@ -295,7 +294,7 @@ pci_vfio_is_ioport_bar(int vfio_dev_fd, int bar_index)
 	return (ioport_bar & PCI_BASE_ADDRESS_SPACE_IO) != 0;
 }
 
-static int
+int
 pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd)
 {
 	if (pci_vfio_setup_interrupts(dev, vfio_dev_fd) != 0) {
diff --git a/drivers/bus/pci/rte_bus_pci_version.map b/drivers/bus/pci/rte_bus_pci_version.map
index 27e9c4f10..fd806ad33 100644
--- a/drivers/bus/pci/rte_bus_pci_version.map
+++ b/drivers/bus/pci/rte_bus_pci_version.map
@@ -16,3 +16,16 @@ DPDK_17.11 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	global:
+
+	pci_map_addr;
+	pci_find_max_end_va;
+	pci_get_kernel_driver_by_path;
+	pci_parse_sysfs_resource;
+	pci_vfio_get_msix_bar;
+	pci_rte_vfio_setup_device;
+	parse_pci_addr_format;
+
+} DPDK_17.11;
-- 
2.15.1


* [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-04 14:55 [PATCH 0/3] add vDPA sample driver Xiao Wang
  2018-02-04 14:55 ` [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
@ 2018-02-04 14:55 ` Xiao Wang
  2018-02-06 14:24   ` Maxime Coquelin
  2018-02-04 14:55 ` [PATCH 3/3] examples/vdpa: add a new sample for vdpa Xiao Wang
  2 siblings, 1 reply; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

This driver is a reference sample of making a vDPA device driver based
on the vhost lib. It uses a standard virtio-net PCI device as the vDPA
device, so it can serve as a backend for a virtio-net PCI device in a
nested VM.

The key driver ops implemented are:

* vdpa_virtio_eng_init
Map the virtio pci device into userspace with VFIO, read the device
capability and initialize internal data.

* vdpa_virtio_eng_uninit
Release the mapped device.

* vdpa_virtio_info_query
Report device capabilities, e.g. queue number and features.

* vdpa_virtio_dev_config
With the guest virtio information provided by the vhost lib, this
function configures the device and the IOMMU to set up the vhost
datapath, which includes: Rx/Tx vring, VFIO interrupt, kick relay.

* vdpa_virtio_dev_close
Tear down what was previously configured by dev_conf.

This driver requires that the virtio device support
VIRTIO_F_IOMMU_PLATFORM, because the buffer addresses written in the
descriptors are IOVAs.
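
Since the guest driver writes GPAs (IOVAs) into the descriptors,
dev_conf programs the IOMMU with GPA -> HVA mappings for every guest
memory region; condensed from virtio_pci_dma_map() below:

    struct vfio_iommu_type1_dma_map dma_map;

    memset(&dma_map, 0, sizeof(dma_map));
    dma_map.argsz = sizeof(dma_map);
    dma_map.vaddr = reg->host_user_addr;    /* HVA backing the region */
    dma_map.iova  = reg->guest_phys_addr;   /* GPA, as used in the vrings */
    dma_map.size  = reg->size;
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);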

Because the vDPA driver needs to set up MSI-X vectors to interrupt the
guest, only vfio-pci is currently supported.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 config/common_base                                 |    6 +
 config/common_linuxapp                             |    1 +
 drivers/net/Makefile                               |    1 +
 drivers/net/vdpa_virtio_pci/Makefile               |   31 +
 .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527 ++++++++++++++++++++
 .../rte_vdpa_virtio_pci_version.map                |    4 +
 mk/rte.app.mk                                      |    1 +
 7 files changed, 1571 insertions(+)
 create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map

diff --git a/config/common_base b/config/common_base
index ad03cf433..aaa775129 100644
--- a/config/common_base
+++ b/config/common_base
@@ -791,6 +791,12 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile VDPA VIRTIO PCI driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST should be enabled.
+#
+CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..83446090c 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..0a45ef603 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -53,6 +53,7 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+DIRS-$(CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI) += vdpa_virtio_pci
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
diff --git a/drivers/net/vdpa_virtio_pci/Makefile b/drivers/net/vdpa_virtio_pci/Makefile
new file mode 100644
index 000000000..147d7a7a3
--- /dev/null
+++ b/drivers/net/vdpa_virtio_pci/Makefile
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_vdpa_virtio_pci.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_mempool -lrte_pci
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost
+LDLIBS += -lrte_bus_vdev -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/linux
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+EXPORT_MAP := rte_vdpa_virtio_pci_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI) += rte_eth_vdpa_virtio_pci.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c b/drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
new file mode 100644
index 000000000..5e63b15e6
--- /dev/null
+++ b/drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
@@ -0,0 +1,1527 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <linux/pci_regs.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_pci.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+#include <sys/mman.h>
+
+#include <rte_vfio.h>
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_ethdev_vdev.h>
+#include <rte_malloc.h>
+#include <rte_memcpy.h>
+#include <rte_bus_pci.h>
+#include <rte_bus_vdev.h>
+#include <rte_kvargs.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_io.h>
+#include <rte_cycles.h>
+#include <rte_spinlock.h>
+#include <eal_vfio.h>
+#include <pci_init.h>
+
+#define MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM	33
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (MAX_QUEUES * 2 + 1))
+
+#define ETH_VDPA_VIRTIO_PCI_BDF_ARG	"bdf"
+
+static const char *const valid_arguments[] = {
+	ETH_VDPA_VIRTIO_PCI_BDF_ARG,
+	NULL
+};
+
+static struct ether_addr base_eth_addr = {
+	.addr_bytes = {
+		0x56 /* V */,
+		0x44 /* D */,
+		0x50 /* P */,
+		0x41 /* A */,
+		0x00,
+		0x00
+	}
+};
+
+struct virtio_pci_info {
+	struct rte_pci_device pdev;
+	uint64_t    req_features;
+	uint32_t    notify_off_multiplier;
+	struct virtio_pci_common_cfg *common_cfg;
+	uint8_t     *isr;
+	uint16_t    *notify_base;
+	struct virtio_net_device_config *dev_cfg;
+	uint16_t    *notify_addr[MAX_QUEUES * 2];
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+};
+
+struct vdpa_virtio_pci_internal {
+	char *dev_name;
+	uint16_t max_queues;
+	uint16_t max_devices;
+	uint64_t features;
+	struct rte_vdpa_eng_addr eng_addr;
+	int eid;
+	int vid;
+	struct virtio_pci_info vpci;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct rte_eth_dev *eth_dev;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct rte_eth_link vdpa_link = {
+		.link_speed = 10000,
+		.link_duplex = ETH_LINK_FULL_DUPLEX,
+		.link_status = ETH_LINK_DOWN
+};
+
+static struct internal_list *
+find_internal_resource_by_eid(int eid)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (eid == internal->eid) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_eng_addr(struct rte_vdpa_eng_addr *addr)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (addr == &internal->eng_addr) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+check_pci_dev(struct rte_pci_device *dev)
+{
+	char filename[PATH_MAX];
+	char dev_dir[PATH_MAX];
+	char driver[PATH_MAX];
+	int ret;
+
+	snprintf(dev_dir, sizeof(dev_dir), "%s/" PCI_PRI_FMT,
+			rte_pci_get_sysfs_path(),
+			dev->addr.domain, dev->addr.bus,
+			dev->addr.devid, dev->addr.function);
+	if (access(dev_dir, R_OK) != 0) {
+		RTE_LOG(ERR, PMD, "%s not exist\n", dev_dir);
+		return -1;
+	}
+
+	/* parse resources */
+	snprintf(filename, sizeof(filename), "%s/resource", dev_dir);
+	if (pci_parse_sysfs_resource(filename, dev) < 0) {
+		RTE_LOG(ERR, PMD, "cannot parse resource: %s\n", filename);
+		return -1;
+	}
+
+	/* parse driver */
+	snprintf(filename, sizeof(filename), "%s/driver", dev_dir);
+	ret = pci_get_kernel_driver_by_path(filename, driver);
+	if (ret != 0) {
+		RTE_LOG(ERR, PMD, "Fail to get kernel driver: %s\n", filename);
+		return -1;
+	}
+
+	if (strcmp(driver, "vfio-pci") != 0) {
+		RTE_LOG(ERR, PMD, "kernel driver %s is not vfio-pci\n", driver);
+		return -1;
+	}
+	return 0;
+}
+
+static int
+vdpa_vfio_get_group_fd(int iommu_group_no)
+{
+	char filename[PATH_MAX];
+	int vfio_group_fd;
+
+	snprintf(filename, sizeof(filename), VFIO_GROUP_FMT, iommu_group_no);
+	vfio_group_fd = open(filename, O_RDWR);
+	if (vfio_group_fd < 0) {
+		if (errno != ENOENT) {
+			RTE_LOG(ERR, PMD, "cannot open %s: %s\n", filename,
+				strerror(errno));
+			return -1;
+		}
+		return 0;
+	}
+
+	return vfio_group_fd;
+}
+
+static int
+vfio_setup_device(const char *sysfs_base, const char *dev_addr,
+		  int *vfio_dev_fd, struct vfio_device_info *device_info,
+		  struct virtio_pci_info *vpci)
+{
+	struct vfio_group_status group_status = {
+		.argsz = sizeof(group_status)
+	};
+	int vfio_container_fd = -1;
+	int vfio_group_fd = -1;
+	int iommu_group_no;
+	int ret;
+
+	vfio_container_fd = vfio_get_container_fd();
+
+	/* check if we have VFIO driver enabled */
+	if (vfio_container_fd < 0) {
+		RTE_LOG(ERR, PMD, "failed to open VFIO container\n");
+		return -1;
+	}
+
+	ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "VFIO_TYPE1_IOMMU not supported\n");
+		goto err;
+	}
+
+	/* get group number */
+	ret = vfio_get_group_no(sysfs_base, dev_addr, &iommu_group_no);
+	if (ret <= 0) {
+		RTE_LOG(ERR, PMD, "%s: unable to find IOMMU group\n", dev_addr);
+		goto err;
+	}
+
+	/* get the actual group fd */
+	vfio_group_fd = vdpa_vfio_get_group_fd(iommu_group_no);
+	RTE_LOG(INFO, PMD, "\n%s group no %d group fd %d\n",
+			dev_addr, iommu_group_no, vfio_group_fd);
+	if (vfio_group_fd <= 0)
+		goto err;
+
+	/* check if the group is viable */
+	ret = ioctl(vfio_group_fd, VFIO_GROUP_GET_STATUS, &group_status);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "%s cannot get group status, error: %s\n",
+				dev_addr, strerror(errno));
+		goto err;
+	} else if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		RTE_LOG(ERR, PMD, "%s VFIO group is not viable\n", dev_addr);
+		goto err;
+	}
+
+	/* check if group does not have a container yet */
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
+		/* add group to a container */
+		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
+				&vfio_container_fd);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "cannot add VFIO group to container, "
+					"error: %s\n", strerror(errno));
+			goto err;
+		}
+		RTE_LOG(INFO, PMD, "vfio_group_fd %d set container_fd %d\n",
+				vfio_group_fd, vfio_container_fd);
+	} else {
+		RTE_LOG(ERR, PMD, "%s has a container already\n", dev_addr);
+		goto err;
+	}
+
+	ret = ioctl(vfio_container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "%s set IOMMU type failed, error: %s\n",
+				dev_addr, strerror(errno));
+		goto err;
+	}
+
+	/* get a file descriptor for the device */
+	*vfio_dev_fd = ioctl(vfio_group_fd, VFIO_GROUP_GET_DEVICE_FD, dev_addr);
+	if (*vfio_dev_fd < 0) {
+		RTE_LOG(ERR, PMD, "%s cannot get vfio_dev_fd, error: %s\n",
+				dev_addr, strerror(errno));
+		goto err;
+	}
+
+	ret = ioctl(*vfio_dev_fd, VFIO_DEVICE_GET_INFO, device_info);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "%s cannot get device info, error: %s\n",
+				dev_addr, strerror(errno));
+		close(*vfio_dev_fd);
+		goto err;
+	}
+
+	vpci->vfio_container_fd = vfio_container_fd;
+	vpci->vfio_group_fd = vfio_group_fd;
+	return 0;
+
+err:
+	if (vfio_container_fd >= 0)
+		close(vfio_container_fd);
+	if (vfio_group_fd > 0)
+		close(vfio_group_fd);
+	return -1;
+}
+
+static int
+virtio_pci_vfio_map_resource(struct virtio_pci_info *vpci)
+{
+	struct rte_pci_device *pdev = &vpci->pdev;
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+	char pci_addr[PATH_MAX] = {0};
+	struct rte_pci_addr *loc = &pdev->addr;
+	int i, ret, nb_maps;
+	int vfio_dev_fd;
+	uint32_t ioport_bar;
+	struct pci_msix_table msix_table;
+
+	/* store PCI address string */
+	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+			loc->domain, loc->bus, loc->devid, loc->function);
+
+	ret = vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
+			&vfio_dev_fd, &device_info, vpci);
+	if (ret)
+		return ret;
+
+	ret = pci_vfio_get_msix_bar(vfio_dev_fd, &msix_table);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "%s cannot get MSI-X BAR number\n", pci_addr);
+		goto fail;
+	}
+
+	/* get number of regions (up to BAR5) */
+	nb_maps = RTE_MIN((int)device_info.num_regions,
+				VFIO_PCI_BAR5_REGION_INDEX + 1);
+
+	/* map BARs */
+	for (i = 0; i < nb_maps; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+		void *bar_addr;
+
+		reg.index = i;
+		ret = ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "%s cannot get region info, "
+					"error: %s\n",
+					pci_addr, strerror(errno));
+			goto fail;
+		}
+
+		ret = pread(vfio_dev_fd, &ioport_bar, sizeof(ioport_bar),
+			    VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+			    PCI_BASE_ADDRESS_0 + i * 4);
+		if (ret != sizeof(ioport_bar)) {
+			RTE_LOG(ERR, PMD, "cannot read command (%x) from "
+				"config space\n", PCI_BASE_ADDRESS_0 + i * 4);
+			goto fail;
+		}
+
+		/* check for io port region */
+		if (ioport_bar & PCI_BASE_ADDRESS_SPACE_IO)
+			continue;
+
+		/* skip non-mmapable BARs */
+		if ((reg.flags & VFIO_REGION_INFO_FLAG_MMAP) == 0)
+			continue;
+
+		if (i == msix_table.bar_index)
+			continue;
+
+		/* try mapping somewhere close to the end of hugepages */
+		if (pci_map_addr == NULL)
+			pci_map_addr = pci_find_max_end_va();
+
+		bar_addr = pci_map_addr;
+		pci_map_addr = RTE_PTR_ADD(bar_addr, (size_t)reg.size);
+
+		/* reserve the address using an inaccessible mapping */
+		bar_addr = mmap(bar_addr, reg.size, 0, MAP_PRIVATE |
+				MAP_ANONYMOUS, -1, 0);
+		if (bar_addr != MAP_FAILED) {
+			void *map_addr = NULL;
+			if (reg.size)
+				map_addr = pci_map_resource(bar_addr,
+						vfio_dev_fd,
+						reg.offset, reg.size,
+						MAP_FIXED);
+
+			if (map_addr == MAP_FAILED || !map_addr) {
+				munmap(bar_addr, reg.size);
+				bar_addr = MAP_FAILED;
+			}
+		}
+
+		if (bar_addr == MAP_FAILED) {
+			RTE_LOG(ERR, PMD, "%s mapping BAR%d failed: %s\n",
+					pci_addr, i, strerror(errno));
+			goto fail;
+		}
+		pdev->mem_resource[i].addr = bar_addr;
+	}
+
+	if (pci_rte_vfio_setup_device(pdev, vfio_dev_fd) < 0) {
+		RTE_LOG(ERR, PMD, "%s failed to set up device\n", pci_addr);
+		goto fail;
+	}
+
+	vpci->vfio_dev_fd = vfio_dev_fd;
+	return 0;
+
+fail:
+	close(vfio_dev_fd);
+	return -1;
+}
+
+static void *
+get_cap_addr(struct rte_pci_device *dev, struct virtio_pci_cap *cap)
+{
+	uint8_t bar = cap->bar;
+	uint32_t length = cap->length;
+	uint32_t offset = cap->offset;
+	uint8_t *base;
+
+	if (bar > 5) {
+		RTE_LOG(ERR, PMD, "invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		RTE_LOG(ERR, PMD, "offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > dev->mem_resource[bar].len) {
+		RTE_LOG(ERR, PMD, "invalid cap: overflows bar size: %u > %lu\n",
+			offset + length, dev->mem_resource[bar].len);
+		return NULL;
+	}
+
+	base = dev->mem_resource[bar].addr;
+	if (base == NULL) {
+		RTE_LOG(ERR, PMD, "bar %u base addr is NULL", bar);
+		return NULL;
+	}
+
+	return base + offset;
+}
+
+static int
+virtio_pci_map(struct virtio_pci_info *vpci)
+{
+	uint8_t pos;
+	struct virtio_pci_cap cap;
+	struct rte_pci_device *dev = &vpci->pdev;
+	int ret;
+
+	if (virtio_pci_vfio_map_resource(vpci)) {
+		RTE_LOG(ERR, PMD, "failed to map pci device\n");
+		return -1;
+	}
+
+	ret = rte_pci_read_config(dev, &pos, sizeof(pos), PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = rte_pci_read_config(dev, &cap, sizeof(cap), pos);
+		if (ret < 0) {
+			RTE_LOG(ERR, PMD, "failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		RTE_LOG(INFO, PMD, "cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case VIRTIO_PCI_CAP_COMMON_CFG:
+			vpci->common_cfg = get_cap_addr(dev, &cap);
+			break;
+		case VIRTIO_PCI_CAP_NOTIFY_CFG:
+			rte_pci_read_config(dev, &vpci->notify_off_multiplier,
+						4, pos + sizeof(cap));
+			vpci->notify_base = get_cap_addr(dev, &cap);
+			break;
+		case VIRTIO_PCI_CAP_ISR_CFG:
+			vpci->isr = get_cap_addr(dev, &cap);
+			break;
+		case VIRTIO_PCI_CAP_DEVICE_CFG:
+			vpci->dev_cfg = get_cap_addr(dev, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	if (vpci->common_cfg == NULL || vpci->notify_base == NULL ||
+			vpci->isr == NULL || vpci->dev_cfg == NULL) {
+		RTE_LOG(ERR, PMD, "capability incomplete\n");
+		return -1;
+	}
+
+	RTE_LOG(INFO, PMD, "capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			vpci->common_cfg, vpci->notify_base,
+			vpci->isr, vpci->dev_cfg,
+			vpci->notify_off_multiplier);
+
+	return 0;
+}
+
+static int
+virtio_pci_dma_map(struct vdpa_virtio_pci_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vpci.vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct vfio_iommu_type1_dma_map dma_map;
+		struct rte_vhost_mem_region *reg;
+		reg = &mem->regions[i];
+
+		RTE_LOG(INFO, PMD, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx\n", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		memset(&dma_map, 0, sizeof(dma_map));
+		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+		dma_map.vaddr = reg->host_user_addr;
+		dma_map.size = reg->size;
+		dma_map.iova = reg->guest_phys_addr;
+		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
+				VFIO_DMA_MAP_FLAG_WRITE;
+
+		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "cannot set up DMA remapping, "
+				"error: %s\n", strerror(errno));
+			goto exit;
+		}
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+virtio_pci_dma_unmap(struct vdpa_virtio_pci_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vpci.vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct vfio_iommu_type1_dma_unmap dma_unmap;
+		struct rte_vhost_mem_region *reg;
+		reg = &mem->regions[i];
+
+		memset(&dma_unmap, 0, sizeof(dma_unmap));
+		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+		dma_unmap.size = reg->size;
+		dma_unmap.iova = reg->guest_phys_addr;
+		dma_unmap.flags = 0;
+
+		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
+				&dma_unmap);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "cannot unset DMA remapping, "
+				"error: %s\n", strerror(errno));
+			goto exit;
+		}
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint8_t
+virtio_get_status(struct virtio_pci_info *vpci)
+{
+	return rte_read8(&vpci->common_cfg->device_status);
+}
+
+static void
+virtio_set_status(struct virtio_pci_info *vpci, uint8_t status)
+{
+	rte_write8(status, &vpci->common_cfg->device_status);
+}
+
+static void
+vdpa_virtio_reset(struct virtio_pci_info *vpci)
+{
+	/* 0 means reset */
+	virtio_set_status(vpci, 0);
+
+	/* flush status write */
+	while (virtio_get_status(vpci))
+		rte_delay_ms(1);
+}
+
+static void
+vdpa_virtio_set_status(struct virtio_pci_info *vpci, uint8_t status)
+{
+	if (status != 0)
+		status |= virtio_get_status(vpci);
+
+	virtio_set_status(vpci, status);
+	virtio_get_status(vpci);
+}
+
+static uint64_t
+virtio_get_features(struct virtio_pci_info *vpci)
+{
+	uint32_t features_lo, features_hi;
+	struct virtio_pci_common_cfg *cfg = vpci->common_cfg;
+
+	rte_write32(0, &cfg->device_feature_select);
+	features_lo = rte_read32(&cfg->device_feature);
+
+	rte_write32(1, &cfg->device_feature_select);
+	features_hi = rte_read32(&cfg->device_feature);
+
+	return ((uint64_t)features_hi << 32) | features_lo;
+}
+
+static void
+vdpa_set_features(struct virtio_pci_info *vpci, uint64_t features)
+{
+	struct virtio_pci_common_cfg *cfg = vpci->common_cfg;
+
+	/* enable device DMA with iova */
+	features |= (1ULL << VIRTIO_F_IOMMU_PLATFORM);
+
+	rte_write32(0, &cfg->guest_feature_select);
+	rte_write32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	rte_write32(1, &cfg->guest_feature_select);
+	rte_write32(features >> 32, &cfg->guest_feature);
+}
+
+static int
+vdpa_virtio_config_features(struct virtio_pci_info *vpci, uint64_t req_features)
+{
+	uint64_t host_features;
+
+	host_features = virtio_get_features(vpci);
+	vpci->req_features = req_features & host_features;
+
+	vdpa_set_features(vpci, vpci->req_features);
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_FEATURES_OK);
+
+	if (!(virtio_get_status(vpci) & VIRTIO_CONFIG_S_FEATURES_OK)) {
+		RTE_LOG(ERR, PMD, "failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static void
+io_write64_twopart(uint64_t val, uint32_t *lo, uint32_t *hi)
+{
+	rte_write32(val & ((1ULL << 32) - 1), lo);
+	rte_write32(val >> 32, hi);
+}
+
+static int
+vdpa_virtio_dev_enable(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+	struct virtio_pci_common_cfg *cfg;
+	uint64_t desc_addr, avail_addr, used_addr;
+	uint32_t i, nr_vring;
+	uint16_t notify_off;
+	struct rte_vhost_vring vq;
+
+	vpci = &internal->vpci;
+	cfg = vpci->common_cfg;
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	rte_write16(0, &cfg->msix_config);
+	if (rte_read16(&cfg->msix_config) == VIRTIO_MSI_NO_VECTOR) {
+		RTE_LOG(ERR, PMD, "msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vq);
+		desc_addr = qva_to_gpa(internal->vid, (uint64_t)vq.desc);
+		avail_addr = qva_to_gpa(internal->vid, (uint64_t)vq.avail);
+		used_addr = qva_to_gpa(internal->vid, (uint64_t)vq.used);
+
+		rte_write16(i, &cfg->queue_select);
+		io_write64_twopart(desc_addr, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(avail_addr, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(used_addr, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		rte_write16(vq.size, &cfg->queue_size);
+
+		rte_write16(i + 1, &cfg->queue_msix_vector);
+		if (rte_read16(&cfg->queue_msix_vector) ==
+				VIRTIO_MSI_NO_VECTOR) {
+			RTE_LOG(ERR, PMD, "queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = rte_read16(&cfg->queue_notify_off);
+		vpci->notify_addr[i] = (void *)((uint8_t *)vpci->notify_base +
+				notify_off * vpci->notify_off_multiplier);
+		rte_write16(1, &cfg->queue_enable);
+
+		RTE_LOG(INFO, PMD, "queue %u addresses:\n"
+				"desc_addr: 0x%lx\tavail_addr: 0x%lx\tused_addr: 0x%lx\n"
+				"queue size: %u\t\tnotify addr: %p\tnotify offset: %u\n",
+				i, desc_addr, avail_addr, used_addr,
+				vq.size, vpci->notify_addr[i], notify_off);
+	}
+
+	return 0;
+}
+
+static void
+vdpa_virtio_dev_disable(struct vdpa_virtio_pci_internal *internal)
+{
+	uint32_t i, nr_vring;
+	struct virtio_pci_info *vpci;
+	struct virtio_pci_common_cfg *cfg;
+
+	vpci = &internal->vpci;
+	cfg = vpci->common_cfg;
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	rte_write16(VIRTIO_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < nr_vring; i++) {
+		rte_write16(i, &cfg->queue_select);
+		rte_write16(0, &cfg->queue_enable);
+		rte_write16(VIRTIO_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+	}
+}
+
+static int
+vdpa_virtio_pci_start(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+	uint64_t features;
+
+	vpci = &internal->vpci;
+
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	/* Reset the device although not necessary at startup. */
+	vdpa_virtio_reset(vpci);
+
+	/* Tell the host we've noticed this device. */
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_ACKNOWLEDGE);
+
+	/* Tell the host we've known how to drive the device. */
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_DRIVER);
+
+	if (vdpa_virtio_config_features(vpci, features) < 0)
+		return -1;
+
+	if (vdpa_virtio_dev_enable(internal) < 0)
+		return -1;
+
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_DRIVER_OK);
+	return 0;
+}
+
+static void
+vdpa_virtio_pci_stop(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+
+	vpci = &internal->vpci;
+	vdpa_virtio_dev_disable(internal);
+	vdpa_virtio_reset(vpci);
+}
+
+static int
+vdpa_enable_vfio_intr(struct vdpa_virtio_pci_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct virtio_pci_info *vpci;
+	struct rte_vhost_vring vring;
+
+	vpci = &internal->vpci;
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = vpci->pdev.intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(vpci->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error enabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct vdpa_virtio_pci_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	struct virtio_pci_info *vpci;
+
+	vpci = &internal->vpci;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(vpci->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error disabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	struct virtio_pci_info *vpci;
+	uint32_t qid, q_num;
+	struct epoll_event events[MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct vdpa_virtio_pci_internal *internal = arg;
+
+	vpci = &internal->vpci;
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(MAX_QUEUES * 2);
+	if (epfd < 0) {
+		RTE_LOG(ERR, PMD, "failed to create epoll instance\n");
+		return NULL;
+	}
+	vpci->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			RTE_LOG(ERR, PMD, "epoll add error, %s\n",
+					strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			RTE_LOG(ERR, PMD, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					RTE_LOG(INFO, PMD, "Error reading "
+						"kickfd: %s\n",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			rte_write16(qid, vpci->notify_addr[qid]);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct vdpa_virtio_pci_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->vpci.tid, NULL, notify_relay,
+			     (void *)internal);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "failed to create notify relay pthread\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+	void *status;
+
+	vpci = &internal->vpci;
+	if (vpci->tid) {
+		pthread_cancel(vpci->tid);
+		pthread_join(vpci->tid, &status);
+	}
+	vpci->tid = 0;
+
+	if (vpci->epfd >= 0)
+		close(vpci->epfd);
+	vpci->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct rte_eth_dev *eth_dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+	int ret;
+
+	internal = eth_dev->data->dev_private;
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = virtio_pci_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_virtio_pci_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_virtio_pci_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = virtio_pci_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+vdpa_virtio_dev_config(int vid)
+{
+	int eid;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct vdpa_virtio_pci_internal *internal;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_UP;
+
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(eth_dev);
+
+	return 0;
+}
+
+static int
+vdpa_virtio_dev_close(int vid)
+{
+	int eid;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct vdpa_virtio_pci_internal *internal;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_DOWN;
+
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(eth_dev);
+
+	return 0;
+}
+
+static void
+vfio_close_fds(struct virtio_pci_info *vpci)
+{
+	if (vpci->vfio_dev_fd >= 0)
+		close(vpci->vfio_dev_fd);
+	if (vpci->vfio_group_fd >= 0)
+		close(vpci->vfio_group_fd);
+	if (vpci->vfio_container_fd >= 0)
+		close(vpci->vfio_container_fd);
+
+	vpci->vfio_dev_fd = -1;
+	vpci->vfio_group_fd = -1;
+	vpci->vfio_container_fd = -1;
+}
+
+static int
+vdpa_virtio_eng_init(int eid, struct rte_vdpa_eng_addr *addr)
+{
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+	struct virtio_pci_info *vpci;
+	uint64_t features;
+
+	list = find_internal_resource_by_eng_addr(addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	vpci = &internal->vpci;
+
+	vpci->vfio_dev_fd = -1;
+	vpci->vfio_group_fd = -1;
+	vpci->vfio_container_fd = -1;
+
+	if (check_pci_dev(&vpci->pdev) < 0)
+		return -1;
+
+	if (virtio_pci_map(vpci) < 0)
+		goto err;
+
+	internal->eid = eid;
+	internal->max_devices = 1;
+	internal->max_queues = MAX_QUEUES;
+	features = virtio_get_features(&internal->vpci);
+	if ((features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) == 0) {
+		RTE_LOG(ERR, PMD, "VIRTIO_F_IOMMU_PLATFORM feature is required "
+				"to support DMA with IOVA");
+		goto err;
+	}
+
+	/* We need the nested VM's driver to use GPA */
+	internal->features = (features & ~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+			  (1ULL << RTE_VHOST_USER_F_PROTOCOL_FEATURES);
+	return 0;
+
+err:
+	vfio_close_fds(vpci);
+	return -1;
+}
+
+static int
+vdpa_virtio_eng_uninit(int eid)
+{
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	vfio_close_fds(&internal->vpci);
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << RTE_VHOST_USER_PROTOCOL_F_REPLY_ACK)
+static int
+vdpa_virtio_info_query(int eid, struct rte_vdpa_eng_attr *attr)
+{
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	attr->dev_num = internal->max_devices;
+	attr->queue_num = internal->max_queues;
+	attr->features = internal->features;
+	attr->protocol_features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+
+	return 0;
+}
+
+struct rte_vdpa_eng_driver vdpa_virtio_pci_driver = {
+	.name = "vdpa_virtio_pci",
+	.eng_ops = {
+		.eng_init = vdpa_virtio_eng_init,
+		.eng_uninit = vdpa_virtio_eng_uninit,
+		.info_query = vdpa_virtio_info_query,
+	},
+	.dev_ops = {
+		.dev_conf = vdpa_virtio_dev_config,
+		.dev_close = vdpa_virtio_dev_close,
+		.vring_state_set = NULL,
+		.feature_set = NULL,
+		.migration_done = NULL,
+	},
+};
+
+RTE_VDPA_REGISTER_DRIVER(vdpa_virtio_pci, vdpa_virtio_pci_driver);
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+
+	internal = dev->data->dev_private;
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(dev);
+
+	return 0;
+}
+
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+
+	internal = dev->data->dev_private;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(dev);
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+	struct internal_list *list;
+
+	internal = dev->data->dev_private;
+	eth_dev_stop(dev);
+
+	list = find_internal_resource_by_eng_addr(&internal->eng_addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return;
+	}
+
+	rte_vdpa_unregister_engine(internal->eid);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+	rte_free(list);
+
+	rte_free(dev->data->mac_addrs);
+	free(internal->dev_name);
+	rte_free(internal);
+
+	dev->data->dev_private = NULL;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct vdpa_virtio_pci_internal *internal;
+
+	internal = dev->data->dev_private;
+	if (internal == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid device specified\n");
+		return;
+	}
+
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)-1;
+	dev_info->max_rx_queues = internal->max_queues;
+	dev_info->max_tx_queues = internal->max_queues;
+	dev_info->min_rx_bufsize = 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t rx_queue_id __rte_unused,
+		   uint16_t nb_rx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t tx_queue_id __rte_unused,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static uint16_t
+eth_vdpa_virtio_pci_rx(void *q __rte_unused,
+		       struct rte_mbuf **bufs __rte_unused,
+		       uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static uint16_t
+eth_vdpa_virtio_pci_tx(void *q __rte_unused,
+		       struct rte_mbuf **bufs __rte_unused,
+		       uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+};
+
+static int
+eth_dev_vdpa_virtio_pci_create(struct rte_vdev_device *dev,
+		struct rte_pci_addr *pci_addr)
+{
+	const char *name = rte_vdev_device_name(dev);
+	struct rte_eth_dev *eth_dev = NULL;
+	struct ether_addr *eth_addr = NULL;
+	struct vdpa_virtio_pci_internal *internal = NULL;
+	struct internal_list *list = NULL;
+	struct rte_eth_dev_data *data = NULL;
+
+	list = rte_zmalloc_socket(name, sizeof(*list), 0,
+			dev->device.numa_node);
+	if (list == NULL)
+		goto error;
+
+	/* reserve an ethdev entry */
+	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
+	if (eth_dev == NULL)
+		goto error;
+
+	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
+			dev->device.numa_node);
+	if (eth_addr == NULL)
+		goto error;
+
+	*eth_addr = base_eth_addr;
+	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
+
+	internal = eth_dev->data->dev_private;
+	internal->dev_name = strdup(name);
+	if (internal->dev_name == NULL)
+		goto error;
+
+	internal->eng_addr.pci_addr = *pci_addr;
+	internal->vpci.pdev.addr = *pci_addr;
+	rte_spinlock_init(&internal->lock);
+
+	list->eth_dev = eth_dev;
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	data = eth_dev->data;
+	data->nb_rx_queues = MAX_QUEUES;
+	data->nb_tx_queues = MAX_QUEUES;
+	data->dev_link = vdpa_link;
+	data->mac_addrs = eth_addr;
+	data->dev_flags = RTE_ETH_DEV_INTR_LSC;
+	eth_dev->dev_ops = &ops;
+
+	/* assign rx and tx ops, could be used as vDPA fallback */
+	eth_dev->rx_pkt_burst = eth_vdpa_virtio_pci_rx;
+	eth_dev->tx_pkt_burst = eth_vdpa_virtio_pci_tx;
+
+	if (rte_vdpa_register_engine(vdpa_virtio_pci_driver.name,
+				&internal->eng_addr) < 0)
+		goto error;
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(eth_addr);
+	if (internal && internal->dev_name)
+		free(internal->dev_name);
+	rte_free(internal);
+	if (eth_dev)
+		rte_eth_dev_release_port(eth_dev);
+
+	return -1;
+}
+
+static int
+get_pci_addr(const char *key __rte_unused, const char *value, void *extra_args)
+{
+	if (value == NULL || extra_args == NULL)
+		return -1;
+
+	return parse_pci_addr_format(value, strlen(value), extra_args);
+}
+
+static int
+rte_vdpa_virtio_pci_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist = NULL;
+	int ret = 0;
+	struct rte_pci_addr pci_addr;
+
+	RTE_LOG(INFO, PMD, "Initializing vdpa_virtio_pci for %s\n",
+		rte_vdev_device_name(dev));
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	if (rte_kvargs_count(kvlist, ETH_VDPA_VIRTIO_PCI_BDF_ARG) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_VDPA_VIRTIO_PCI_BDF_ARG,
+				&get_pci_addr, &pci_addr);
+		if (ret < 0)
+			goto out_free;
+	} else {
+		ret = -1;
+		goto out_free;
+	}
+
+	eth_dev_vdpa_virtio_pci_create(dev, &pci_addr);
+
+out_free:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
+static int
+rte_vdpa_virtio_pci_remove(struct rte_vdev_device *dev)
+{
+	const char *name;
+	struct rte_eth_dev *eth_dev = NULL;
+
+	name = rte_vdev_device_name(dev);
+	RTE_LOG(INFO, PMD, "Un-Initializing vdpa_virtio_pci for %s\n", name);
+
+	/* find an ethdev entry */
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	eth_dev_close(eth_dev);
+	rte_free(eth_dev->data);
+	rte_eth_dev_release_port(eth_dev);
+
+	return 0;
+}
+
+static struct rte_vdev_driver vdpa_virtio_pci_drv = {
+	.probe = rte_vdpa_virtio_pci_probe,
+	.remove = rte_vdpa_virtio_pci_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_vdpa_virtio_pci, vdpa_virtio_pci_drv);
+RTE_PMD_REGISTER_ALIAS(net_vdpa_virtio_pci, eth_vdpa_virtio_pci);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdpa_virtio_pci,
+	"bdf=<bdf>");
diff --git a/drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map b/drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
new file mode 100644
index 000000000..33d237913
--- /dev/null
+++ b/drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
@@ -0,0 +1,4 @@
+EXPERIMENTAL {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..44e87f4d9 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -171,6 +171,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI)      += -lrte_vdpa_virtio_pci
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1


* [PATCH 3/3] examples/vdpa: add a new sample for vdpa
  2018-02-04 14:55 [PATCH 0/3] add vDPA sample driver Xiao Wang
  2018-02-04 14:55 ` [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
  2018-02-04 14:55 ` [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
@ 2018-02-04 14:55 ` Xiao Wang
  2 siblings, 0 replies; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

This patch adds a sample which creates vhost-user sockets based on the
vdpa driver. The vdpa driver helps set up the vhost datapath, so this
app doesn't need to spend a dedicated worker thread on vhost
enqueue/dequeue operations.
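
In outline, for each of the --devcnt sockets the app does roughly the
following (a minimal sketch; the step that binds a socket to the vDPA
engine/device id comes from the dependency vhost lib patch, so it is
only indicated by a comment here):

    char path[MAX_PATH_LEN];

    snprintf(path, sizeof(path), "%s%d", iface, i); /* e.g. /tmp/vhost-user-0 */
    rte_vhost_driver_register(path, 0);
    /* ... attach the socket to the vDPA engine id (eid) and device id
     * (did) via the new vhost lib API from the dependency patch ... */
    rte_vhost_driver_start(path);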

Below are setup steps for your reference:

1. Make sure your kernel vhost module and QEMU support vIOMMU.
   - OS: CentOS 7.4
   - QEMU: 2.10.1
   - Guest OS: CentOS 7.2
   - Nested VM OS: CentOS 7.2

2. Enable the VT-x feature for the vCPU in the VM.
   modprobe kvm_intel nested=1

3. Start a VM with a virtio-net-pci device.
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu host \
   <snip>
   -machine q35 \
   -device intel-iommu \
   -netdev tap,id=mytap,ifname=vdpa,vhostforce=on \
   -device virtio-net-pci,netdev=mytap,mac=00:aa:bb:cc:dd:ee,\
   disable-modern=off,disable-legacy=on,iommu_platform=on \

4. Bind vfio-pci to the virtio-net-pci device
   a) login to VM;
   b) modprobe vfio-pci
   c) rmmod vfio_iommu_type1
   d) modprobe vfio_iommu_type1 allow_unsafe_interrupts=1
   e) ./usertools/dpdk-devbind.py -b vfio-pci 00:03.0

5. Start vdpa sample
   ./examples/vdpa/build/vdpa -c 0x2 -n 4 --socket-mem 1024 --no-pci \
    --vdev "net_vdpa_virtio_pci0,bdf=0000:00:03.0" -- --bdf 0000:00:03.0 \
    --iface /tmp/vhost-user- --devcnt 1  --queue 1

6. Start nested VM
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -cpu host -enable-kvm \
   <snip>
   -mem-prealloc \
   -chardev socket,id=char0,path=/tmp/vhost-user-0 \
   -netdev type=vhost-user,id=vdpa,chardev=char0,vhostforce \
   -device virtio-net-pci,netdev=vdpa,mac=00:aa:bb:cc:dd:ee \

7. Log in to the nested VM, and verify that the virtio device in the
   nested VM can communicate with the tap device on the host.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 examples/vdpa/Makefile |  32 ++++
 examples/vdpa/main.c   | 387 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 419 insertions(+)
 create mode 100644 examples/vdpa/Makefile
 create mode 100644 examples/vdpa/main.c

diff --git a/examples/vdpa/Makefile b/examples/vdpa/Makefile
new file mode 100644
index 000000000..42672a2bc
--- /dev/null
+++ b/examples/vdpa/Makefile
@@ -0,0 +1,32 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+ifneq ($(CONFIG_RTE_EXEC_ENV),"linuxapp")
+$(info This application can only operate in a linuxapp environment, \
+please change the definition of the RTE_TARGET environment variable)
+all:
+else
+
+# binary name
+APP = vdpa
+
+# all source are stored in SRCS-y
+SRCS-y := main.c
+
+CFLAGS += -O2 -D_FILE_OFFSET_BITS=64
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -D_GNU_SOURCE
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+include $(RTE_SDK)/mk/rte.extapp.mk
+
+endif
diff --git a/examples/vdpa/main.c b/examples/vdpa/main.c
new file mode 100644
index 000000000..1c9143469
--- /dev/null
+++ b/examples/vdpa/main.c
@@ -0,0 +1,390 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <errno.h>
+#include <getopt.h>
+#include <signal.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <rte_ethdev.h>
+#include <rte_malloc.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+
+#define NUM_MBUFS 8191
+#define MBUF_CACHE_SIZE 250
+
+#define RX_RING_SIZE 128
+#define TX_RING_SIZE 128
+
+#define MAX_PATH_LEN 128
+#define MAX_VDPA_SAMPLE_PORTS 1024
+
+struct vdpa_port {
+	char ifname[MAX_PATH_LEN];
+	int eid;
+	int did;
+	int vid;
+};
+
+struct vdpa_port vports[MAX_VDPA_SAMPLE_PORTS];
+
+struct rte_vdpa_eng_attr attr;
+struct rte_vdpa_eng_addr dev_id;
+char iface[MAX_PATH_LEN];
+int queue;
+int devcnt;
+
+/* display usage */
+static void
+vdpa_usage(const char *prgname)
+{
+	printf("%s [EAL options]"
+		" -- --bdf DDDD:BB:DD.F --iface <path> --devcnt ND --queue NQ\n"
+		" --bdf DDDD:BB:DD.F: the PCI device used for vdpa, e.g. 0000:00:03.0\n"
+		" --iface <path>: the path prefix of the vhost-user socket files\n"
+		" --devcnt ND: number of vhost sockets to be created, default 1\n"
+		" --queue NQ: number of queue pairs to be configured, default 1\n",
+		prgname);
+}
+
+static int
+get_unsigned(const char *str, int base)
+{
+	unsigned long num;
+	char *end = NULL;
+
+	errno = 0;
+	num = strtoul(str, &end, base);
+	if (str[0] == '\0' || end == NULL || *end != '\0' || errno != 0)
+		return -1;
+
+	return num;
+}
+
+static int
+parse_args(int argc, char **argv)
+{
+	static const char *short_option = "";
+	static struct option long_option[] = {
+		{"bdf", required_argument, NULL, 0},
+		{"queue", required_argument, NULL, 0},
+		{"devcnt", required_argument, NULL, 0},
+		{"iface", required_argument, NULL, 0},
+		{NULL, 0, 0, 0},
+	};
+	char str[MAX_PATH_LEN];
+	int opt, idx;
+	int num[4] = {0};
+	int i, j;
+	char *prgname = argv[0];
+
+	while ((opt = getopt_long(argc, argv, short_option, long_option, &idx))
+			!= EOF) {
+		switch (opt) {
+		case 0:
+			if (strncmp(long_option[idx].name, "bdf",
+						MAX_PATH_LEN) == 0) {
+				snprintf(str, sizeof(str), "%s", optarg);
+				i = strlen(str) - 1;
+				j = 3;
+				/* parse BDF (e.g. 0000:00:03.0), right to left */
+				while (i > 0 && j >= 0) {
+					while (i > 0 && str[i - 1] != ':'
+							&& str[i - 1] != '.')
+						i--;
+					num[j--] = get_unsigned(&str[i], 16);
+					i--;
+					if (i >= 0)
+						str[i] = '\0';
+				}
+				dev_id.pci_addr.domain = num[0];
+				dev_id.pci_addr.bus = num[1];
+				dev_id.pci_addr.devid = num[2];
+				dev_id.pci_addr.function = num[3];
+				printf("bdf %04x:%02x:%02x.%02x\n",
+						dev_id.pci_addr.domain,
+						dev_id.pci_addr.bus,
+						dev_id.pci_addr.devid,
+						dev_id.pci_addr.function);
+			} else if (strncmp(long_option[idx].name, "queue",
+						MAX_PATH_LEN) == 0) {
+				queue = get_unsigned(optarg, 10);
+				printf("queue %d\n", queue);
+			} else if (strncmp(long_option[idx].name, "devcnt",
+						MAX_PATH_LEN) == 0) {
+				devcnt = get_unsigned(optarg, 10);
+				printf("devcnt %d\n", devcnt);
+			} else if (strncmp(long_option[idx].name, "iface",
+						MAX_PATH_LEN) == 0) {
+				strncpy(iface, optarg, MAX_PATH_LEN - 1);
+				printf("iface %s\n", iface);
+			}
+
+			break;
+
+		default:
+			vdpa_usage(prgname);
+			return -1;
+		}
+	}
+
+	if (queue <= 0 || devcnt <= 0 || *iface == '\0') {
+		vdpa_usage(prgname);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+data_init(void)
+{
+	devcnt = 1;
+	queue = 1;
+	memset(&dev_id, 0, sizeof(dev_id));
+	memset(iface, 0, MAX_PATH_LEN * sizeof(iface[0]));
+	memset(vports, 0, MAX_VDPA_SAMPLE_PORTS * sizeof(vports[0]));
+
+	return;
+}
+
+static void
+signal_handler(int signum)
+{
+	uint16_t portid, nb_ports;
+
+	if (signum == SIGINT || signum == SIGTERM) {
+		printf("\nSignal %d received, preparing to exit...\n",
+				signum);
+		nb_ports = rte_eth_dev_count();
+		for (portid = 0; portid < nb_ports; portid++) {
+			printf("Closing port %d...\n", portid);
+			rte_eth_dev_stop(portid);
+			rte_eth_dev_close(portid);
+		}
+		exit(0);
+	}
+}
+
+static int
+new_device(int vid)
+{
+	char ifname[MAX_PATH_LEN];
+	int i;
+
+	rte_vhost_get_ifname(vid, ifname, sizeof(ifname));
+	for (i = 0; i < MAX_VDPA_SAMPLE_PORTS; i++) {
+		if (strcmp(ifname, vports[i].ifname) == 0) {
+			printf("\nport %s connected, eid: %d, did %d\n",
+					ifname, vports[i].eid, vports[i].did);
+			vports[i].vid = vid;
+			break;
+		}
+	}
+
+	if (i >= MAX_VDPA_SAMPLE_PORTS)
+		return -1;
+
+	return 0;
+}
+
+static void
+destroy_device(int vid)
+{
+	char ifname[MAX_PATH_LEN];
+	int i;
+
+	rte_vhost_get_ifname(vid, ifname, sizeof(ifname));
+	for (i = 0; i < MAX_VDPA_SAMPLE_PORTS; i++) {
+		if (strcmp(ifname, vports[i].ifname) == 0) {
+			printf("\nport %s disconnected, eid: %d, did %d\n",
+					ifname, vports[i].eid, vports[i].did);
+			vports[i].vid = vid;
+			break;
+		}
+	}
+
+	return;
+}
+
+static const struct vhost_device_ops vdpa_sample_devops = {
+	.new_device = new_device,
+	.destroy_device = destroy_device,
+	.vring_state_changed = NULL,
+	.features_changed = NULL,
+	.new_connection = NULL,
+	.destroy_connection = NULL,
+};
+
+static const struct rte_eth_conf port_conf_default = {
+	.rxmode = {
+		.max_rx_pkt_len = ETHER_MAX_LEN,
+		.ignore_offload_bitfield = 1,
+	},
+};
+
+static inline int
+port_init(uint16_t port, struct rte_mempool *mbuf_pool)
+{
+	uint16_t rx_rings = 1, tx_rings = 1;
+	uint16_t nb_rxd = RX_RING_SIZE;
+	uint16_t nb_txd = TX_RING_SIZE;
+	int retval;
+	uint16_t q;
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_txconf txconf;
+	struct ether_addr addr;
+
+	if (port >= rte_eth_dev_count())
+		return -1;
+
+	rte_eth_dev_info_get(port, &dev_info);
+
+	/* Configure the Ethernet device. */
+	retval = rte_eth_dev_configure(port, rx_rings, tx_rings,
+			&port_conf_default);
+	if (retval < 0)
+		return retval;
+
+	/* Allocate and set up 1 Rx queue per Ethernet port. */
+	for (q = 0; q < rx_rings; q++) {
+		retval = rte_eth_rx_queue_setup(port, q, nb_rxd,
+				rte_eth_dev_socket_id(port), NULL, mbuf_pool);
+		if (retval < 0)
+			return retval;
+	}
+
+	txconf = dev_info.default_txconf;
+	/* Allocate and set up 1 Tx queue per Ethernet port. */
+	for (q = 0; q < tx_rings; q++) {
+		retval = rte_eth_tx_queue_setup(port, q, nb_txd,
+				rte_eth_dev_socket_id(port), &txconf);
+		if (retval < 0)
+			return retval;
+	}
+
+	/* Start the Ethernet port. */
+	retval = rte_eth_dev_start(port);
+	if (retval < 0)
+		return retval;
+
+	/* Display the port MAC address. */
+	rte_eth_macaddr_get(port, &addr);
+	printf("Port %u MAC: %02" PRIx8 " %02" PRIx8 " %02" PRIx8
+			   " %02" PRIx8 " %02" PRIx8 " %02" PRIx8 "\n",
+			port,
+			addr.addr_bytes[0], addr.addr_bytes[1],
+			addr.addr_bytes[2], addr.addr_bytes[3],
+			addr.addr_bytes[4], addr.addr_bytes[5]);
+
+	return 0;
+}
+
+int
+main(int argc, char *argv[])
+{
+	char ifname[MAX_PATH_LEN];
+	uint16_t nb_ports, portid;
+	struct rte_mempool *mbuf_pool;
+	char ch;
+	int i, eid, did;
+	int ret;
+	uint64_t flags = 0;
+
+	ret = rte_eal_init(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "eal init failed\n");
+	argc -= ret;
+	argv += ret;
+
+	signal(SIGINT, signal_handler);
+	signal(SIGTERM, signal_handler);
+
+	nb_ports = rte_eth_dev_count();
+	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS * nb_ports,
+		MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
+
+	/* Initialize all ports. */
+	for (portid = 0; portid < nb_ports; portid++)
+		if (port_init(portid, mbuf_pool) != 0)
+			rte_exit(EXIT_FAILURE, "Cannot init port %d\n",
+					portid);
+
+	data_init();
+
+	ret = parse_args(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "invalid argument\n");
+
+	eid = rte_vdpa_find_engine_id(&dev_id);
+	if (eid < 0)
+		rte_exit(EXIT_FAILURE, "no vDPA engine found\n");
+
+	printf("\nuse engine %d to create vhost socket\n", eid);
+	rte_vdpa_info_query(eid, &attr);
+	if (devcnt > (int)attr.dev_num)
+		rte_exit(EXIT_FAILURE, "not enough devices in engine\n");
+
+	if (queue > (int)attr.queue_num)
+		rte_exit(EXIT_FAILURE, "not enough queues in engine\n");
+
+	for (i = 0; i < RTE_MIN(MAX_VDPA_SAMPLE_PORTS, devcnt); i++) {
+		snprintf(ifname, sizeof(ifname), "%s%d", iface, i);
+		did = i;
+		vports[i].eid = eid;
+		vports[i].did = did;
+		strcpy(vports[i].ifname, ifname);
+
+		ret = rte_vhost_driver_register(ifname, flags);
+		if (ret != 0)
+			rte_exit(EXIT_FAILURE,
+					"register driver failed: %s\n",
+					ifname);
+
+		ret = rte_vhost_driver_callback_register(ifname,
+				&vdpa_sample_devops);
+		if (ret != 0)
+			rte_exit(EXIT_FAILURE,
+					"register driver ops failed: %s\n",
+					ifname);
+
+		rte_vhost_driver_set_vdpa_eid(ifname, eid);
+		rte_vhost_driver_set_vdpa_did(ifname, did);
+		/*
+		 * Configure vhost port with vDPA device's maximum capability.
+		 * App has the flexibility to change the features, queue num.
+		 */
+		rte_vhost_driver_set_queue_num(ifname, attr.queue_num);
+		rte_vhost_driver_set_features(ifname, attr.features);
+		rte_vhost_driver_set_protocol_features(ifname,
+				attr.protocol_features);
+
+		if (rte_vhost_driver_start(ifname) < 0)
+			rte_exit(EXIT_FAILURE,
+					"start vhost driver failed: %s\n",
+					ifname);
+	}
+
+	printf("enter \'q\' to quit\n");
+	while (scanf("%c", &ch) == 1) {
+		if (ch == 'q')
+			break;
+		while (ch != '\n' && scanf("%c", &ch) == 1)
+			;
+		printf("enter \'q\' to quit\n");
+	}
+
+	for (portid = 0; portid < nb_ports; portid++) {
+		printf("Closing port %d...\n", portid);
+		rte_eth_dev_stop(portid);
+		rte_eth_dev_close(portid);
+	}
+
+	return 0;
+}
-- 
2.15.1


* Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-04 14:55 ` [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
@ 2018-02-06 14:24   ` Maxime Coquelin
  2018-02-08  2:23     ` Wang, Xiao W
  0 siblings, 1 reply; 8+ messages in thread
From: Maxime Coquelin @ 2018-02-06 14:24 UTC (permalink / raw)
  To: Xiao Wang, dev
  Cc: jianfeng.tan, tiwei.bie, yliu, cunming.liang, dan.daly, zhihong.wang

Hi Xiao,

On 02/04/2018 03:55 PM, Xiao Wang wrote:
> This driver is a reference sample of making a vDPA device driver based
> on the vhost lib. The driver uses a standard virtio-net PCI device as
> the vDPA device, which can serve as a backend for a virtio-net pci
> device in a nested VM.
>
> The key driver ops implemented are:
>
> * vdpa_virtio_eng_init
> Map the virtio pci device into userspace with VFIO, read the device
> capability and initialize internal data.
>
> * vdpa_virtio_eng_uninit
> Release the mapped device.
>
> * vdpa_virtio_info_query
> Device capability reporting, e.g. queue number, features.
>
> * vdpa_virtio_dev_config
> With the guest virtio information provided by vhost lib, this
> function configures device and IOMMU to set up vhost datapath,
> which includes: Rx/Tx vring, VFIO interrupt, kick relay.
>
> * vdpa_virtio_dev_close
> Unset the stuff that was configured previously by dev_conf.
>
> This driver requires the virtio device to support
> VIRTIO_F_IOMMU_PLATFORM, because the buffer address written in desc
> is an IOVA.
>
> Because the vDPA driver needs to set up an MSI-X vector to interrupt
> the guest, only vfio-pci is supported currently.
> 
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>   config/common_base                                 |    6 +
>   config/common_linuxapp                             |    1 +
>   drivers/net/Makefile                               |    1 +
>   drivers/net/vdpa_virtio_pci/Makefile               |   31 +
>   .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527 ++++++++++++++++++++
>   .../rte_vdpa_virtio_pci_version.map                |    4 +
>   mk/rte.app.mk                                      |    1 +
>   7 files changed, 1571 insertions(+)
>   create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
>   create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
>   create mode 100644 drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map

Is there a specific constraint that makes you expose PCI functions and
duplicate a lot of vfio code into the driver?

Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co. 
to benefit from all the existing infrastructure?

Maxime


* Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-06 14:24   ` Maxime Coquelin
@ 2018-02-08  2:23     ` Wang, Xiao W
  2018-02-08  9:08       ` Maxime Coquelin
  0 siblings, 1 reply; 8+ messages in thread
From: Wang, Xiao W @ 2018-02-08  2:23 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Tan, Jianfeng, Bie, Tiwei, yliu, Liang, Cunming, Daly, Dan, Wang,
	Zhihong

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Tuesday, February 6, 2018 10:24 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
> 
> Hi Xiao,
> 
> On 02/04/2018 03:55 PM, Xiao Wang wrote:
> > This driver is a reference sample of making a vDPA device driver based
> > on the vhost lib. The driver uses a standard virtio-net PCI device as
> > the vDPA device, which can serve as a backend for a virtio-net pci
> > device in a nested VM.
> >
> > The key driver ops implemented are:
> >
> > * vdpa_virtio_eng_init
> > Map the virtio pci device into userspace with VFIO, read the device
> > capability and initialize internal data.
> >
> > * vdpa_virtio_eng_uninit
> > Release the mapped device.
> >
> > * vdpa_virtio_info_query
> > Device capability reporting, e.g. queue number, features.
> >
> > * vdpa_virtio_dev_config
> > With the guest virtio information provided by vhost lib, this
> > function configures device and IOMMU to set up vhost datapath,
> > which includes: Rx/Tx vring, VFIO interrupt, kick relay.
> >
> > * vdpa_virtio_dev_close
> > Unset the stuff that was configured previously by dev_conf.
> >
> > This driver requires the virtio device to support
> > VIRTIO_F_IOMMU_PLATFORM, because the buffer address written in desc
> > is an IOVA.
> >
> > Because the vDPA driver needs to set up an MSI-X vector to interrupt
> > the guest, only vfio-pci is supported currently.
> >
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > ---
> >   config/common_base                                 |    6 +
> >   config/common_linuxapp                             |    1 +
> >   drivers/net/Makefile                               |    1 +
> >   drivers/net/vdpa_virtio_pci/Makefile               |   31 +
> >   .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527
> ++++++++++++++++++++
> >   .../rte_vdpa_virtio_pci_version.map                |    4 +
> >   mk/rte.app.mk                                      |    1 +
> >   7 files changed, 1571 insertions(+)
> >   create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
> >   create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
> >   create mode 100644
> drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
> 
> Is there a specific constraint that makes you expose PCI functions and
> duplicate a lot of vfio code into the driver?

The existing vfio code doesn't fit vDPA well: this vDPA driver needs to program the IOMMU for a vDPA device with a VM's memory table, while eal/vfio uses a struct vfio_cfg that takes all regular devices, adds them to a single vfio_container, and programs the IOMMU with the DPDK process's memory table.

By doing the PCI VFIO initialization itself, this driver avoids affecting the global vfio_cfg structure.
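
For illustration, here is a minimal sketch (not the actual driver code) of
what mapping a VM's memory table through a dedicated VFIO container looks
like at the ioctl level, assuming the region layout returned by
rte_vhost_get_mem_table() and a container fd opened by the driver itself
(error handling omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <rte_vhost.h>

/* Map every guest memory region into a per-device container, using
 * guest-physical addresses as IOVAs, so the device can DMA with the
 * addresses the guest writes into desc.
 */
static int
dma_map_vm_mem_table(int container_fd, struct rte_vhost_memory *mem)
{
	struct vfio_iommu_type1_dma_map dma_map;
	uint32_t i;

	for (i = 0; i < mem->nregions; i++) {
		struct rte_vhost_mem_region *reg = &mem->regions[i];

		memset(&dma_map, 0, sizeof(dma_map));
		dma_map.argsz = sizeof(dma_map);
		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
				VFIO_DMA_MAP_FLAG_WRITE;
		dma_map.vaddr = reg->host_user_addr; /* our mmap of guest mem */
		dma_map.iova = reg->guest_phys_addr; /* GPA used in desc */
		dma_map.size = reg->size;

		if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map) < 0)
			return -1;
	}
	return 0;
}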

> 
> Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co.
> to benefit from all the existing infrastructure?

RTE_PMD_REGISTER_PCI() & co. will make this driver a PCI driver (for a physical device), and then it will conflict with the virtio PMD.
So I make the vDPA device driver a vdev driver.

> 
> Maxime

Thanks for the comments,
Xiao


* Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-08  2:23     ` Wang, Xiao W
@ 2018-02-08  9:08       ` Maxime Coquelin
  2018-02-12 15:36         ` Wang, Xiao W
  0 siblings, 1 reply; 8+ messages in thread
From: Maxime Coquelin @ 2018-02-08  9:08 UTC (permalink / raw)
  To: Wang, Xiao W, dev
  Cc: Tan, Jianfeng, Bie, Tiwei, yliu, Liang, Cunming, Daly, Dan, Wang,
	Zhihong

Hi Xiao,

On 02/08/2018 03:23 AM, Wang, Xiao W wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Tuesday, February 6, 2018 10:24 PM
>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
>> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
>> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
>> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
>>
>> Hi Xiao,
>>
>> On 02/04/2018 03:55 PM, Xiao Wang wrote:
>>> This driver is a reference sample of making a vDPA device driver based
>>> on the vhost lib. The driver uses a standard virtio-net PCI device as
>>> the vDPA device, which can serve as a backend for a virtio-net pci
>>> device in a nested VM.
>>>
>>> The key driver ops implemented are:
>>>
>>> * vdpa_virtio_eng_init
>>> Map the virtio pci device into userspace with VFIO, read the device
>>> capability and initialize internal data.
>>>
>>> * vdpa_virtio_eng_uninit
>>> Release the mapped device.
>>>
>>> * vdpa_virtio_info_query
>>> Device capability reporting, e.g. queue number, features.
>>>
>>> * vdpa_virtio_dev_config
>>> With the guest virtio information provided by vhost lib, this
>>> function configures device and IOMMU to set up vhost datapath,
>>> which includes: Rx/Tx vring, VFIO interrupt, kick relay.
>>>
>>> * vdpa_virtio_dev_close
>>> Unset the stuff that was configured previously by dev_conf.
>>>
>>> This driver requires the virtio device to support
>>> VIRTIO_F_IOMMU_PLATFORM, because the buffer address written in desc
>>> is an IOVA.
>>>
>>> Because the vDPA driver needs to set up an MSI-X vector to interrupt
>>> the guest, only vfio-pci is supported currently.
>>>
>>> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
>>> ---
>>>    config/common_base                                 |    6 +
>>>    config/common_linuxapp                             |    1 +
>>>    drivers/net/Makefile                               |    1 +
>>>    drivers/net/vdpa_virtio_pci/Makefile               |   31 +
>>>    .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527
>> ++++++++++++++++++++
>>>    .../rte_vdpa_virtio_pci_version.map                |    4 +
>>>    mk/rte.app.mk                                      |    1 +
>>>    7 files changed, 1571 insertions(+)
>>>    create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
>>>    create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
>>>    create mode 100644
>> drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
>>
>> Is there a specific constraint that makes you expose PCI functions and
>> duplicate a lot of vfio code into the driver?
> 
> The existing vfio code doesn't fit vDPA well: this vDPA driver needs to program the IOMMU for a vDPA device with a VM's memory table, while eal/vfio uses a struct vfio_cfg that takes all regular devices, adds them to a single vfio_container, and programs the IOMMU with the DPDK process's memory table.
>
> By doing the PCI VFIO initialization itself, this driver avoids affecting the global vfio_cfg structure.

Ok, I get it.
So I think what you have to do is to extend eal/vfio for this case.
Or at least, have a vdpa layer to perform this, else every offload
driver will have to duplicate the code.

>>
>> Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co.
>> to benefit from all the existing infrastructure?
> 
> RTE_PMD_REGISTER_PCI() & co. will make this driver a PCI driver (for a physical device), and then it will conflict with the virtio PMD.
> So I make the vDPA device driver a vdev driver.

Yes, but it is a PCI device, not a virtual device. You have to extend
the EAL to support this new class of devices/drivers. Think of it as in
the kernel, where a NIC device can be bound to its NIC driver, VFIO or
UIO.

If I look at patch 3, you have to set --no-pci, or at least, I think,
blacklist the Virtio device (for instance, as sketched below).
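
A hypothetical invocation using the EAL -b (PCI blacklist) option instead
of --no-pci could look like:

    ./examples/vdpa/build/vdpa -c 0x2 -n 4 -b 0000:00:03.0 --socket-mem 1024 \
     --vdev "net_vdpa_virtio_pci0,bdf=0000:00:03.0" -- --bdf 0000:00:03.0 \
     --iface /tmp/vhost-user- --devcnt 1 --queue 1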

I wonder if real vDPA cards will support either vDPA mode or behave
like a regular NIC, like the Virtio case in your example.
If this is the case, maybe the vDPA code for a NIC could be in the same
driver as the "NIC" mode.
A new struct rte_pci_driver driver flag could be introduced to specify
that the driver supports vDPA.
Then, in EAL arguments, if a vhost vdev specifies it wants the Virtio device
at PCI addr 00:01:00 as offload, the PCI layer could probe this device
in "vdpa" mode.

Also, I don't know if this will be possible with real vDPA cards, but we
could have the application doing packet switching between vhost-user
vdev and the Virtio device. And at some point, at runtime, switch into
vDPA mode. This use-case would be much easier to implement if vDPA
relied on the existing PCI layer.

I may not be very clear; don't hesitate to ask questions.
But generally, I think vDPA has to fit in the existing DPDK architecture,
and not try to live outside of it.

Thanks,
Maxime
>>
>> Maxime
> 
> Thanks for the comments,
> Xiao
> 


* Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-08  9:08       ` Maxime Coquelin
@ 2018-02-12 15:36         ` Wang, Xiao W
  0 siblings, 0 replies; 8+ messages in thread
From: Wang, Xiao W @ 2018-02-12 15:36 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Tan, Jianfeng, Bie, Tiwei, yliu, Liang, Cunming, Daly, Dan, Wang,
	Zhihong

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Thursday, February 8, 2018 5:09 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
> 
> Hi Xiao,
> 
> On 02/08/2018 03:23 AM, Wang, Xiao W wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Tuesday, February 6, 2018 10:24 PM
> >> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei
> <tiwei.bie@intel.com>;
> >> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> >> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> >> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
> >>
> >> Hi Xiao,
> >>
> >> On 02/04/2018 03:55 PM, Xiao Wang wrote:
> >>> This driver is a reference sample of making a vDPA device driver based
> >>> on the vhost lib. The driver uses a standard virtio-net PCI device as
> >>> the vDPA device, which can serve as a backend for a virtio-net pci
> >>> device in a nested VM.
> >>>
> >>> The key driver ops implemented are:
> >>>
> >>> * vdpa_virtio_eng_init
> >>> Map the virtio pci device into userspace with VFIO, read the device
> >>> capability and initialize internal data.
> >>>
> >>> * vdpa_virtio_eng_uninit
> >>> Release the mapped device.
> >>>
> >>> * vdpa_virtio_info_query
> >>> Device capability reporting, e.g. queue number, features.
> >>>
> >>> * vdpa_virtio_dev_config
> >>> With the guest virtio information provided by vhost lib, this
> >>> function configures device and IOMMU to set up vhost datapath,
> >>> which includes: Rx/Tx vring, VFIO interrupt, kick relay.
> >>>
> >>> * vdpa_virtio_dev_close
> >>> Unset the stuff that was configured previously by dev_conf.
> >>>
> >>> This driver requires the virtio device to support
> >>> VIRTIO_F_IOMMU_PLATFORM, because the buffer address written in desc
> >>> is an IOVA.
> >>>
> >>> Because the vDPA driver needs to set up an MSI-X vector to interrupt
> >>> the guest, only vfio-pci is supported currently.
> >>>
> >>> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> >>> ---
> >>>    config/common_base                                 |    6 +
> >>>    config/common_linuxapp                             |    1 +
> >>>    drivers/net/Makefile                               |    1 +
> >>>    drivers/net/vdpa_virtio_pci/Makefile               |   31 +
> >>>    .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527
> >> ++++++++++++++++++++
> >>>    .../rte_vdpa_virtio_pci_version.map                |    4 +
> >>>    mk/rte.app.mk                                      |    1 +
> >>>    7 files changed, 1571 insertions(+)
> >>>    create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
> >>>    create mode 100644
> drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
> >>>    create mode 100644
> >> drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
> >>
> >> Is there a specific constraint that makes you expose PCI functions and
> >> duplicate a lot of vfio code into the driver?
> >
> > The existing vfio code doesn't fit vDPA well: this vDPA driver needs to
> > program the IOMMU for a vDPA device with a VM's memory table, while
> > eal/vfio uses a struct vfio_cfg that takes all regular devices, adds
> > them to a single vfio_container, and programs the IOMMU with the DPDK
> > process's memory table.
> >
> > By doing the PCI VFIO initialization itself, this driver avoids
> > affecting the global vfio_cfg structure.
> 
> Ok, I get it.
> So I think what you have to do is to extend eal/vfio for this case.
> Or at least, have a vdpa layer to perform this, else every offload
> driver will have to duplicate the code.

I think I need to extend eal/vfio to provide container-based APIs, such as
creating a container, binding a vfio group fd to a container, DMAR
programming, etc.
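
As a rough sketch, the API shape could be something like this (all names
are tentative; this is not an existing DPDK API yet):

/* Tentative container-based VFIO API, kept separate from the default
 * global container that eal/vfio manages today.
 */
int rte_vfio_create_container(void);
int rte_vfio_destroy_container(int container_fd);
int rte_vfio_bind_group(int container_fd, int iommu_group_num);
int rte_vfio_unbind_group(int container_fd, int iommu_group_num);
int rte_vfio_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
		     uint64_t len);
int rte_vfio_dma_unmap(int container_fd, uint64_t iova, uint64_t len);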

> 
> >>
> >> Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co.
> >> to benefit from all the existing infrastructure?
> >
> > RTE_PMD_REGISTER_PCI() & co. will make this driver a PCI driver (for a
> > physical device), and then it will conflict with the virtio PMD.
> > So I make the vDPA device driver a vdev driver.
> 
> Yes, but it is a PCI device, not a virtual device. You have to extend
> the EAL to support this new class of devices/drivers. Think of it as in
> the kernel, where a NIC device can be bound to its NIC driver, VFIO or
> UIO.
> 
> If I look at patch 3, you have to set --no-pci, or at least, I think,
> blacklist the Virtio device.
> 
> I wonder if real vDPA cards will support either vDPA mode or behave
> like a regular NIC, like the Virtio case in your example.
> If this is the case, maybe the vDPA code for a NIC could be in the same
> driver as the "NIC" mode.
> A new struct rte_pci_driver driver flag could be introduced to specify
> that the driver supports vDPA.
> Then, in EAL arguments, if a vhost vdev specifies it wants the Virtio device
> at PCI addr 00:01:00 as offload, the PCI layer could probe this device
> in "vdpa" mode.

Considering that we could have a pool of vDPA devices, we need a port
supporting port representors; it defines the control domain to which these
vDPA devices belong. We can have a vdev port for this purpose, and this
vdev helps to register vDPA ports via the port-representor library (patch
submitted).

                                   +------+
                                   | vdev |
+---+                              |------|
|app|--register representor port-->|broker|-->add port with vDPA device 0/1/2...
+---+                              +------+

I plan to submit a vdpa driver patch for a real vDPA card; that card will
have a different sub device_id/vendor_id, so we won't have a conflict
issue with that driver.

> 
> Also, I don't know if this will be possible with real vDPA cards, but we
> could have the application doing packet switching between vhost-user
> vdev and the Virtio device. And at some point, at runtime, switch into
> vDPA mode. This use-case would be much easier to implement if vDPA
> relied on the existing PCI layer.

In vDPA mode, each vhost-user datapath is performed by a vDPA device. If
we switch over to normal SW packet switching, there will typically be many
vhost-user ports and one uplink port.

Thanks,
Xiao

> 
> I may not be very clear; don't hesitate to ask questions.
> But generally, I think vDPA has to fit in the existing DPDK architecture,
> and not try to live outside of it.
> 
> Thanks,
> Maxime
> >>
> >> Maxime
> >
> > Thanks for the comments,
> > Xiao
> >

