linux-rdma.vger.kernel.org archive mirror
* [RFC 0/5] VirtIO RDMA
@ 2021-09-02 13:06 Junji Wei
  2021-09-02 13:06 ` [RFC 1/5] RDMA/virtio-rdma Introduce a new core cap prot Junji Wei
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Junji Wei @ 2021-09-02 13:06 UTC (permalink / raw)
  To: dledford, jgg, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare
  Cc: xieyongji, chaiwen.cc, weijunji, linux-rdma, virtualization, qemu-devel

Hi all,

This RFC aims to reopen the discussion of Virtio RDMA.
It is based on Yuval Shaia's RFC "VirtIO RDMA",
which implemented a framework for Virtio RDMA and a simple
control path (we are not sure whether Yuval Shaia has any
further plans for it).

We extend this work and implement a simple data path
and a complete control path. It currently works with
SEND, RECV and REG_MR in the kernel. There is a simple
test module in this patchset that can communicate
with ibv_rc_pingpong from rdma-core.

While doing this work, we found some problems and
would like to ask the community for suggestions:
1. Each QP needs two VQs, but QEMU only supports 1024 VQs by default.
   I think it is possible to multiplex the VQs, since
   cmd_post_send carries the qpn in the request.

2. The virtio-rdma device's GID should be equal to the host RDMA
   device's GID. This means that we cannot use the GID cache in
   the RDMA subsystem. Theoretically, the GID should also be equal
   to the IP address of the device's netdev. How can we deal with
   this conflict?

3. How can we support DMA MRs? The host verbs cannot support them,
   and it seems hard to pin the whole guest physical memory in QEMU.

4. The FRMR API needs to set the key of an MR through IB_WR_REG_MR,
   but it is impossible to change the key of an MR using uverbs.
   In our implementation, we change the key in the WR during post_send,
   but this means the MR can only work with SEND and RECV, since we
   cannot change the key on the remote side. The final solution may be
   to implement a urdma device based on rxe in QEMU, which would give
   us full control of the MR.

5. GSI is not supported yet. We also think it is a problem that
   when the host receives a GSI packet, it does not know which
   device the packet belongs to.
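
For problem 1, the VQ budget and one possible multiplexing scheme can
be sketched as below. This is only an illustration: the formula
1 + max_cq + 2 * max_qp follows init_device() in patch 2, while
QEMU_MAX_VQS, the helper names and the qpn-modulo mapping are
assumptions made here and are not part of the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limit, taken from the cover letter: QEMU's default VQ cap. */
#define QEMU_MAX_VQS 1024u

/* VQs needed without multiplexing: 1 control VQ + 1 per CQ + 2 per QP. */
static inline uint32_t vqs_needed(uint32_t max_cq, uint32_t max_qp)
{
	return 1u + max_cq + 2u * max_qp;
}

/* Largest QP count that fits once the control VQ and the CQ VQs are reserved. */
static inline uint32_t max_qp_without_mux(uint32_t max_cq)
{
	assert(max_cq < QEMU_MAX_VQS);
	return (QEMU_MAX_VQS - 1u - max_cq) / 2u;
}

/*
 * Hypothetical multiplexed layout: a fixed pool of shared VQs, where the
 * qpn carried in each request selects the queue. The device side would
 * then route each request by the qpn field instead of by the VQ index.
 */
static inline uint32_t shared_vq_index(uint32_t qpn, uint32_t num_shared_vqs)
{
	assert(num_shared_vqs > 0);
	return qpn % num_shared_vqs;
}
```

With the 64 CQs and 64 QPs hard-coded in init_device(), vqs_needed()
gives 193 VQs; with 64 CQs, at most 479 QPs fit under a 1024-VQ limit
unless the VQs are shared.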

Any further thoughts are very welcome. We also noticed that
there seems to be no existing work on a virtio-rdma spec; we would
be happy to start it from this RFC.

How to test with the test module:

1. Set the test module's SERVER_ADDR and SERVER_PORT
2. Build the kernel and QEMU
3. Build rdmacm-mux in qemu/contrib and run it in the backend
4. Boot the kernel with QEMU via libvirt, using the following arguments:
<interface type='bridge'>
  <mac address='00:16:3e:5d:aa:a8'/>
  <source bridge='virbr0'/>
  <target dev='vnet1'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02'
   function='0x0' multifunction='on'/>
</interface>

<qemu:commandline>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='virtio-rdma-pci,disable-legacy=on,addr=2.1,
   ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
</qemu:commandline>

Note that virtio-net and virtio-rdma should be function 0 and
function 1 of the same slot.

5. Run "ibv_rc_pingpong -g 1 -n 500 -s 20480" as the server
6. Run "insmod virtio_rdma_rc_pingping_client.ko" in the guest

One note regarding the patchset:
We know it is not standard to collapse patches from two repos, but we
did so in order to show the whole Virtio RDMA work in one place.

Thanks.

patch1: RDMA/virtio-rdma Introduce a new core cap prot (linux)
patch2: RDMA/virtio-rdma: VirtIO RDMA driver (linux)
        The main patch of virtio-rdma driver in linux kernel
patch3: RDMA/virtio-rdma: VirtIO RDMA test module (linux)
        A test module
patch4: virtio-net: Move some virtio-net-pci decl to include/hw/virtio (qemu)
        Patch from Yuval Shaia
patch5: hw/virtio-rdma: VirtIO rdma device (qemu)
        The main patch of virtio-rdma device in QEMU
-- 
2.11.0



* [RFC 1/5] RDMA/virtio-rdma Introduce a new core cap prot
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
@ 2021-09-02 13:06 ` Junji Wei
  2021-09-02 13:06 ` [RFC 2/5] RDMA/virtio-rdma: VirtIO RDMA driver Junji Wei
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Junji Wei @ 2021-09-02 13:06 UTC (permalink / raw)
  To: dledford, jgg, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare
  Cc: xieyongji, chaiwen.cc, weijunji, linux-rdma, virtualization, qemu-devel

Introduce a new core cap prot RDMA_CORE_CAP_PROT_VIRTIO
to support virtio-rdma

Currently RDMA_CORE_CAP_PROT_VIRTIO is the same as
RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP except in rdma_query_gid,
where we need to get the GID from the host device.
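
The behaviour described above can be modelled outside the kernel as
follows. This is a user-space sketch, not the kernel code: the CAP_*
names, the gid_source enum and query_gid_source() are made up for
illustration. The VIRTIO and ROCE_UDP_ENCAP values are the ones that
appear in the ib_verbs.h hunk below; the plain ROCE value is assumed
to match mainline and is not shown in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as they appear in the ib_verbs.h hunk of this patch. */
#define CAP_PROT_ROCE_UDP_ENCAP 0x00800000u
#define CAP_PROT_VIRTIO         0x04000000u
/* Assumption: matches mainline RDMA_CORE_CAP_PROT_ROCE (not in the hunk). */
#define CAP_PROT_ROCE           0x00200000u

enum gid_source { GID_FROM_CACHE, GID_FROM_DEVICE };

/* Same test as rdma_protocol_virtio(), applied to a plain flags word. */
static inline bool is_virtio(uint32_t core_cap_flags)
{
	return (core_cap_flags & CAP_PROT_VIRTIO) != 0;
}

/* Same test as rdma_protocol_virtio_or_roce(): any of the three bits. */
static inline bool is_virtio_or_roce(uint32_t core_cap_flags)
{
	return (core_cap_flags & (CAP_PROT_VIRTIO | CAP_PROT_ROCE |
				  CAP_PROT_ROCE_UDP_ENCAP)) != 0;
}

/* Mirrors the rdma_query_gid() hunk: virtio ports bypass the GID cache. */
static inline enum gid_source query_gid_source(uint32_t core_cap_flags)
{
	return is_virtio(core_cap_flags) ? GID_FROM_DEVICE : GID_FROM_CACHE;
}
```

In this model a virtio port follows every RoCE code path that checks
is_virtio_or_roce(), and differs only in where the GID is read from.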

Signed-off-by: Junji Wei <weijunji@bytedance.com>
---
 drivers/infiniband/core/cache.c         |  9 ++++++---
 drivers/infiniband/core/cm.c            |  4 ++--
 drivers/infiniband/core/cma.c           | 20 ++++++++++----------
 drivers/infiniband/core/device.c        |  4 ++--
 drivers/infiniband/core/multicast.c     |  2 +-
 drivers/infiniband/core/nldev.c         |  2 ++
 drivers/infiniband/core/roce_gid_mgmt.c |  3 ++-
 drivers/infiniband/core/ucma.c          |  2 +-
 drivers/infiniband/core/verbs.c         |  2 +-
 include/rdma/ib_verbs.h                 | 28 +++++++++++++++++++++++++---
 10 files changed, 52 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index c9e9fc81447e..3c0a0c9896b4 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -396,7 +396,7 @@ static void del_gid(struct ib_device *ib_dev, u32 port,
 	/*
 	 * For non RoCE protocol, GID entry slot is ready to use.
 	 */
-	if (!rdma_protocol_roce(ib_dev, port))
+	if (!rdma_protocol_virtio_or_roce(ib_dev, port))
 		table->data_vec[ix] = NULL;
 	write_unlock_irq(&table->rwlock);
 
@@ -448,7 +448,7 @@ static int add_modify_gid(struct ib_gid_table *table,
 	if (!entry)
 		return -ENOMEM;
 
-	if (rdma_protocol_roce(attr->device, attr->port_num)) {
+	if (rdma_protocol_virtio_or_roce(attr->device, attr->port_num)) {
 		ret = add_roce_gid(entry);
 		if (ret)
 			goto done;
@@ -960,6 +960,9 @@ int rdma_query_gid(struct ib_device *device, u32 port_num,
 	if (!rdma_is_port_valid(device, port_num))
 		return -EINVAL;
 
+	if (rdma_protocol_virtio(device, port_num))
+		return device->ops.query_gid(device, port_num, index, gid);
+
 	table = rdma_gid_table(device, port_num);
 	read_lock_irqsave(&table->rwlock, flags);
 
@@ -1482,7 +1485,7 @@ ib_cache_update(struct ib_device *device, u32 port, bool update_gids,
 		goto err;
 	}
 
-	if (!rdma_protocol_roce(device, port) && update_gids) {
+	if (!rdma_protocol_virtio_or_roce(device, port) && update_gids) {
 		ret = config_non_roce_gid_cache(device, port,
 						tprops->gid_tbl_len);
 		if (ret)
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index c903b74f46a4..a707f5de1c2e 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3288,7 +3288,7 @@ static int cm_lap_handler(struct cm_work *work)
 	/* Currently Alternate path messages are not supported for
 	 * RoCE link layer.
 	 */
-	if (rdma_protocol_roce(work->port->cm_dev->ib_device,
+	if (rdma_protocol_virtio_or_roce(work->port->cm_dev->ib_device,
 			       work->port->port_num))
 		return -EINVAL;
 
@@ -3381,7 +3381,7 @@ static int cm_apr_handler(struct cm_work *work)
 	/* Currently Alternate path messages are not supported for
 	 * RoCE link layer.
 	 */
-	if (rdma_protocol_roce(work->port->cm_dev->ib_device,
+	if (rdma_protocol_virtio_or_roce(work->port->cm_dev->ib_device,
 			       work->port->port_num))
 		return -EINVAL;
 
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 5d3b8b8d163d..5d29de352ed8 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -573,7 +573,7 @@ cma_validate_port(struct ib_device *device, u32 port,
 	if ((dev_type != ARPHRD_INFINIBAND) && rdma_protocol_ib(device, port))
 		return ERR_PTR(-ENODEV);
 
-	if (dev_type == ARPHRD_ETHER && rdma_protocol_roce(device, port)) {
+	if (dev_type == ARPHRD_ETHER && rdma_protocol_virtio_or_roce(device, port)) {
 		ndev = dev_get_by_index(dev_addr->net, bound_if_index);
 		if (!ndev)
 			return ERR_PTR(-ENODEV);
@@ -626,7 +626,7 @@ static int cma_acquire_dev_by_src_ip(struct rdma_id_private *id_priv)
 	mutex_lock(&lock);
 	list_for_each_entry(cma_dev, &dev_list, list) {
 		rdma_for_each_port (cma_dev->device, port) {
-			gidp = rdma_protocol_roce(cma_dev->device, port) ?
+			gidp = rdma_protocol_virtio_or_roce(cma_dev->device, port) ?
 			       &iboe_gid : &gid;
 			gid_type = cma_dev->default_gid_type[port - 1];
 			sgid_attr = cma_validate_port(cma_dev->device, port,
@@ -669,7 +669,7 @@ static int cma_ib_acquire_dev(struct rdma_id_private *id_priv,
 	    id_priv->id.ps == RDMA_PS_IPOIB)
 		return -EINVAL;
 
-	if (rdma_protocol_roce(req->device, req->port))
+	if (rdma_protocol_virtio_or_roce(req->device, req->port))
 		rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
 			    &gid);
 	else
@@ -1525,7 +1525,7 @@ static struct net_device *cma_get_net_dev(const struct ib_cm_event *ib_event,
 	if (err)
 		return ERR_PTR(err);
 
-	if (rdma_protocol_roce(req->device, req->port))
+	if (rdma_protocol_virtio_or_roce(req->device, req->port))
 		net_dev = roce_get_net_dev_by_cm_event(ib_event);
 	else
 		net_dev = ib_get_net_dev_by_params(req->device, req->port,
@@ -1583,7 +1583,7 @@ static bool cma_protocol_roce(const struct rdma_cm_id *id)
 	struct ib_device *device = id->device;
 	const u32 port_num = id->port_num ?: rdma_start_port(device);
 
-	return rdma_protocol_roce(device, port_num);
+	return rdma_protocol_virtio_or_roce(device, port_num);
 }
 
 static bool cma_is_req_ipv6_ll(const struct cma_req_info *req)
@@ -1813,7 +1813,7 @@ static void destroy_mc(struct rdma_id_private *id_priv,
 	if (rdma_cap_ib_mcast(id_priv->id.device, id_priv->id.port_num))
 		ib_sa_free_multicast(mc->sa_mc);
 
-	if (rdma_protocol_roce(id_priv->id.device, id_priv->id.port_num)) {
+	if (rdma_protocol_virtio_or_roce(id_priv->id.device, id_priv->id.port_num)) {
 		struct rdma_dev_addr *dev_addr =
 			&id_priv->id.route.addr.dev_addr;
 		struct net_device *ndev = NULL;
@@ -2296,7 +2296,7 @@ void rdma_read_gids(struct rdma_cm_id *cm_id, union ib_gid *sgid,
 		return;
 	}
 
-	if (rdma_protocol_roce(cm_id->device, cm_id->port_num)) {
+	if (rdma_protocol_virtio_or_roce(cm_id->device, cm_id->port_num)) {
 		if (sgid)
 			rdma_ip2gid((struct sockaddr *)&addr->src_addr, sgid);
 		if (dgid)
@@ -2919,7 +2919,7 @@ int rdma_set_ib_path(struct rdma_cm_id *id,
 		goto err;
 	}
 
-	if (rdma_protocol_roce(id->device, id->port_num)) {
+	if (rdma_protocol_virtio_or_roce(id->device, id->port_num)) {
 		ndev = cma_iboe_set_path_rec_l2_fields(id_priv);
 		if (!ndev) {
 			ret = -ENODEV;
@@ -3139,7 +3139,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, unsigned long timeout_ms)
 	cma_id_get(id_priv);
 	if (rdma_cap_ib_sa(id->device, id->port_num))
 		ret = cma_resolve_ib_route(id_priv, timeout_ms);
-	else if (rdma_protocol_roce(id->device, id->port_num))
+	else if (rdma_protocol_virtio_or_roce(id->device, id->port_num))
 		ret = cma_resolve_iboe_route(id_priv);
 	else if (rdma_protocol_iwarp(id->device, id->port_num))
 		ret = cma_resolve_iw_route(id_priv);
@@ -4766,7 +4766,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	mc->id_priv = id_priv;
 	mc->join_state = join_state;
 
-	if (rdma_protocol_roce(id->device, id->port_num)) {
+	if (rdma_protocol_virtio_or_roce(id->device, id->port_num)) {
 		ret = cma_iboe_join_multicast(id_priv, mc);
 		if (ret)
 			goto out_err;
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index fa20b1824fb8..fadf17246574 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2297,7 +2297,7 @@ void ib_enum_roce_netdev(struct ib_device *ib_dev,
 	u32 port;
 
 	rdma_for_each_port (ib_dev, port)
-		if (rdma_protocol_roce(ib_dev, port)) {
+		if (rdma_protocol_virtio_or_roce(ib_dev, port)) {
 			struct net_device *idev =
 				ib_device_get_netdev(ib_dev, port);
 
@@ -2429,7 +2429,7 @@ int ib_modify_port(struct ib_device *device,
 		rc = device->ops.modify_port(device, port_num,
 					     port_modify_mask,
 					     port_modify);
-	else if (rdma_protocol_roce(device, port_num) &&
+	else if (rdma_protocol_virtio_or_roce(device, port_num) &&
 		 ((port_modify->set_port_cap_mask & ~IB_PORT_CM_SUP) == 0 ||
 		  (port_modify->clr_port_cap_mask & ~IB_PORT_CM_SUP) == 0))
 		rc = 0;
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index a236532a9026..eaeea1002177 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -745,7 +745,7 @@ int ib_init_ah_from_mcmember(struct ib_device *device, u32 port_num,
 	 */
 	if (rdma_protocol_ib(device, port_num))
 		ndev = NULL;
-	else if (!rdma_protocol_roce(device, port_num))
+	else if (!rdma_protocol_virtio_or_roce(device, port_num))
 		return -EINVAL;
 
 	sgid_attr = rdma_find_gid_by_port(device, &rec->port_gid,
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index e9b4b2cccaa0..e41cbf6bef0b 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -296,6 +296,8 @@ static int fill_dev_info(struct sk_buff *msg, struct ib_device *device)
 		ret = nla_put_string(msg, RDMA_NLDEV_ATTR_DEV_PROTOCOL, "iw");
 	else if (rdma_protocol_roce(device, port))
 		ret = nla_put_string(msg, RDMA_NLDEV_ATTR_DEV_PROTOCOL, "roce");
+	else if (rdma_protocol_virtio(device, port))
+		ret = nla_put_string(msg, RDMA_NLDEV_ATTR_DEV_PROTOCOL, "virtio");
 	else if (rdma_protocol_usnic(device, port))
 		ret = nla_put_string(msg, RDMA_NLDEV_ATTR_DEV_PROTOCOL,
 				     "usnic");
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
index 68197e576433..5ea87b89dae6 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -75,6 +75,7 @@ static const struct {
 } PORT_CAP_TO_GID_TYPE[] = {
 	{rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE},
 	{rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP},
+	{rdma_protocol_virtio, IB_GID_TYPE_ROCE_UDP_ENCAP},
 };
 
 #define CAP_TO_GID_TABLE_SIZE	ARRAY_SIZE(PORT_CAP_TO_GID_TYPE)
@@ -84,7 +85,7 @@ unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u32 port)
 	int i;
 	unsigned int ret_flags = 0;
 
-	if (!rdma_protocol_roce(ib_dev, port))
+	if (!rdma_protocol_virtio_or_roce(ib_dev, port))
 		return 1UL << IB_GID_TYPE_IB;
 
 	for (i = 0; i < CAP_TO_GID_TABLE_SIZE; i++)
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 2b72c4fa9550..f748db3f0414 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -849,7 +849,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 
 	if (rdma_cap_ib_sa(ctx->cm_id->device, ctx->cm_id->port_num))
 		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
-	else if (rdma_protocol_roce(ctx->cm_id->device, ctx->cm_id->port_num))
+	else if (rdma_protocol_virtio_or_roce(ctx->cm_id->device, ctx->cm_id->port_num))
 		ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
 	else if (rdma_protocol_iwarp(ctx->cm_id->device, ctx->cm_id->port_num))
 		ucma_copy_iw_route(&resp, &ctx->cm_id->route);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 7036967e4c0b..f5037ff0c2e5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -822,7 +822,7 @@ int ib_init_ah_attr_from_wc(struct ib_device *device, u32 port_num,
 	rdma_ah_set_sl(ah_attr, wc->sl);
 	rdma_ah_set_port_num(ah_attr, port_num);
 
-	if (rdma_protocol_roce(device, port_num)) {
+	if (rdma_protocol_virtio_or_roce(device, port_num)) {
 		u16 vlan_id = wc->wc_flags & IB_WC_WITH_VLAN ?
 				wc->vlan_id : 0xffff;
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 371df1c80aeb..779d4d09aec1 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -623,6 +623,7 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 #define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x00800000
 #define RDMA_CORE_CAP_PROT_RAW_PACKET   0x01000000
 #define RDMA_CORE_CAP_PROT_USNIC        0x02000000
+#define RDMA_CORE_CAP_PROT_VIRTIO		0x04000000
 
 #define RDMA_CORE_PORT_IB_GRH_REQUIRED (RDMA_CORE_CAP_IB_GRH_REQUIRED \
 					| RDMA_CORE_CAP_PROT_ROCE     \
@@ -654,6 +655,14 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 
 #define RDMA_CORE_PORT_USNIC		(RDMA_CORE_CAP_PROT_USNIC)
 
+/* in most time, RDMA_CORE_PORT_VIRTIO is same as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP */
+#define RDMA_CORE_PORT_VIRTIO    \
+                    (RDMA_CORE_CAP_PROT_VIRTIO \
+					| RDMA_CORE_CAP_IB_MAD  \
+					| RDMA_CORE_CAP_IB_CM   \
+					| RDMA_CORE_CAP_AF_IB   \
+					| RDMA_CORE_CAP_ETH_AH)
+
 struct ib_port_attr {
 	u64			subnet_prefix;
 	enum ib_port_state	state;
@@ -3031,6 +3040,18 @@ static inline bool rdma_protocol_ib(const struct ib_device *device,
 	       RDMA_CORE_CAP_PROT_IB;
 }
 
+static inline bool rdma_protocol_virtio(const struct ib_device *device, u8 port_num)
+{
+	return device->port_data[port_num].immutable.core_cap_flags &
+	       RDMA_CORE_CAP_PROT_VIRTIO;
+}
+
+static inline bool rdma_protocol_virtio_or_roce(const struct ib_device *device, u8 port_num)
+{
+	return device->port_data[port_num].immutable.core_cap_flags &
+	       (RDMA_CORE_CAP_PROT_VIRTIO | RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP);
+}
+
 static inline bool rdma_protocol_roce(const struct ib_device *device,
 				      u32 port_num)
 {
@@ -3063,7 +3084,8 @@ static inline bool rdma_ib_or_roce(const struct ib_device *device,
 				   u32 port_num)
 {
 	return rdma_protocol_ib(device, port_num) ||
-		rdma_protocol_roce(device, port_num);
+		rdma_protocol_roce(device, port_num) ||
+		rdma_protocol_virtio(device, port_num);
 }
 
 static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
@@ -3322,7 +3344,7 @@ static inline size_t rdma_max_mad_size(const struct ib_device *device,
 static inline bool rdma_cap_roce_gid_table(const struct ib_device *device,
 					   u32 port_num)
 {
-	return rdma_protocol_roce(device, port_num) &&
+	return rdma_protocol_virtio_or_roce(device, port_num) &&
 		device->ops.add_gid && device->ops.del_gid;
 }
 
@@ -4502,7 +4524,7 @@ void rdma_move_ah_attr(struct rdma_ah_attr *dest, struct rdma_ah_attr *src);
 static inline enum rdma_ah_attr_type rdma_ah_find_type(struct ib_device *dev,
 						       u32 port_num)
 {
-	if (rdma_protocol_roce(dev, port_num))
+	if (rdma_protocol_virtio_or_roce(dev, port_num))
 		return RDMA_AH_ATTR_TYPE_ROCE;
 	if (rdma_protocol_ib(dev, port_num)) {
 		if (rdma_cap_opa_ah(dev, port_num))
-- 
2.11.0



* [RFC 2/5] RDMA/virtio-rdma: VirtIO RDMA driver
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
  2021-09-02 13:06 ` [RFC 1/5] RDMA/virtio-rdma Introduce a new core cap prot Junji Wei
@ 2021-09-02 13:06 ` Junji Wei
  2021-09-02 13:06 ` [RFC 3/5] RDMA/virtio-rdma: VirtIO RDMA test module Junji Wei
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Junji Wei @ 2021-09-02 13:06 UTC (permalink / raw)
  To: dledford, jgg, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare
  Cc: xieyongji, chaiwen.cc, weijunji, linux-rdma, virtualization, qemu-devel

This is based on Yuval Shaia's [RFC 3/3]

[ Junji Wei: Implement simple data path and complete control path. ]

Signed-off-by: Yuval Shaia <yuval.shaia.ml@gmail.com>
Signed-off-by: Junji Wei <weijunji@bytedance.com>
---
 drivers/infiniband/Kconfig                         |    1 +
 drivers/infiniband/hw/Makefile                     |    1 +
 drivers/infiniband/hw/virtio/Kconfig               |    6 +
 drivers/infiniband/hw/virtio/Makefile              |    4 +
 drivers/infiniband/hw/virtio/virtio_rdma.h         |   67 +
 drivers/infiniband/hw/virtio/virtio_rdma_dev_api.h |  285 ++++
 drivers/infiniband/hw/virtio/virtio_rdma_device.c  |  144 ++
 drivers/infiniband/hw/virtio/virtio_rdma_device.h  |   32 +
 drivers/infiniband/hw/virtio/virtio_rdma_ib.c      | 1695 ++++++++++++++++++++
 drivers/infiniband/hw/virtio/virtio_rdma_ib.h      |  237 +++
 drivers/infiniband/hw/virtio/virtio_rdma_main.c    |  152 ++
 drivers/infiniband/hw/virtio/virtio_rdma_netdev.c  |   68 +
 drivers/infiniband/hw/virtio/virtio_rdma_netdev.h  |   29 +
 include/uapi/linux/virtio_ids.h                    |    1 +
 14 files changed, 2722 insertions(+)
 create mode 100644 drivers/infiniband/hw/virtio/Kconfig
 create mode 100644 drivers/infiniband/hw/virtio/Makefile
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_dev_api.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_device.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_device.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_ib.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_ib.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_main.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_netdev.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_netdev.h

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 33d3ce9c888e..ca201ed6a350 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -92,6 +92,7 @@ source "drivers/infiniband/hw/hns/Kconfig"
 source "drivers/infiniband/hw/bnxt_re/Kconfig"
 source "drivers/infiniband/hw/hfi1/Kconfig"
 source "drivers/infiniband/hw/qedr/Kconfig"
+source "drivers/infiniband/hw/virtio/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 source "drivers/infiniband/sw/siw/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index fba0b3be903e..e2290bd9808c 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -13,3 +13,4 @@ obj-$(CONFIG_INFINIBAND_HFI1)		+= hfi1/
 obj-$(CONFIG_INFINIBAND_HNS)		+= hns/
 obj-$(CONFIG_INFINIBAND_QEDR)		+= qedr/
 obj-$(CONFIG_INFINIBAND_BNXT_RE)	+= bnxt_re/
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA)	+= virtio/
diff --git a/drivers/infiniband/hw/virtio/Kconfig b/drivers/infiniband/hw/virtio/Kconfig
new file mode 100644
index 000000000000..116620d49851
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/Kconfig
@@ -0,0 +1,6 @@
+config INFINIBAND_VIRTIO_RDMA
+	tristate "VirtIO Paravirtualized RDMA Driver"
+	depends on NETDEVICES && ETHERNET && PCI && INET && VIRTIO
+	help
+	  This driver provides low-level support for VirtIO Paravirtual
+	  RDMA adapter.
diff --git a/drivers/infiniband/hw/virtio/Makefile b/drivers/infiniband/hw/virtio/Makefile
new file mode 100644
index 000000000000..fb637e467167
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA) += virtio_rdma.o
+
+virtio_rdma-y := virtio_rdma_main.o virtio_rdma_device.o virtio_rdma_ib.o \
+		 virtio_rdma_netdev.o
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma.h b/drivers/infiniband/hw/virtio/virtio_rdma.h
new file mode 100644
index 000000000000..e637f879e069
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma.h
@@ -0,0 +1,67 @@
+/*
+ * Virtio RDMA device: Driver main data types
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef __VIRTIO_RDMA__
+#define __VIRTIO_RDMA__
+
+#include <linux/spinlock.h>
+#include <linux/virtio.h>
+#include <rdma/ib_verbs.h>
+
+#include "virtio_rdma_ib.h"
+
+struct virtio_rdma_dev {
+	struct ib_device ib_dev;
+	struct virtio_device *vdev;
+	struct virtqueue *ctrl_vq;
+
+	/* To protect the vq operations for the controlq */
+	spinlock_t ctrl_lock;
+
+	// wait_queue_head_t acked; /* arm on send to host, release on recv */
+	struct net_device *netdev;
+
+	struct virtio_rdma_vq* cq_vqs;
+	struct virtio_rdma_cq** cqs;
+
+	struct virtio_rdma_vq* qp_vqs;
+	int *qp_vq_using;
+	spinlock_t qp_using_lock;
+
+	atomic_t num_qp;
+	atomic_t num_cq;
+	atomic_t num_ah;
+
+	// only for modify_port ?
+	struct mutex port_mutex;
+	u32 port_cap_mask;
+	// TODO: check ib_active before operations
+	bool ib_active;
+};
+
+static inline struct virtio_rdma_dev *to_vdev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct virtio_rdma_dev, ib_dev);
+}
+
+#define virtio_rdma_dbg(ibdev, fmt, ...)                                               \
+	ibdev_dbg(ibdev, "%s: " fmt, __func__, ##__VA_ARGS__)
+
+#endif
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_dev_api.h b/drivers/infiniband/hw/virtio/virtio_rdma_dev_api.h
new file mode 100644
index 000000000000..4a668ddfcd64
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_dev_api.h
@@ -0,0 +1,285 @@
+/*
+ * Virtio RDMA device: Virtio communication message
+ *
+ * Copyright (C) 2019 Junji Wei Bytedance Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+#ifndef __VIRTIO_RDMA_DEV_API__
+#define __VIRTIO_RDMA_DEV_API__
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <rdma/ib_verbs.h>
+
+#define VIRTIO_RDMA_CTRL_OK	0
+#define VIRTIO_RDMA_CTRL_ERR	1
+
+struct control_buf {
+	__u8 cmd;
+	__u8 status;
+};
+
+enum {
+	VIRTIO_CMD_QUERY_DEVICE = 10,
+	VIRTIO_CMD_QUERY_PORT,
+	VIRTIO_CMD_CREATE_CQ,
+	VIRTIO_CMD_DESTROY_CQ,
+	VIRTIO_CMD_CREATE_PD,
+	VIRTIO_CMD_DESTROY_PD,
+	VIRTIO_CMD_GET_DMA_MR,
+	VIRTIO_CMD_CREATE_MR,
+	VIRTIO_CMD_MAP_MR_SG,
+	VIRTIO_CMD_REG_USER_MR,
+	VIRTIO_CMD_DEREG_MR,
+	VIRTIO_CMD_CREATE_QP,
+    VIRTIO_CMD_MODIFY_QP,
+	VIRTIO_CMD_QUERY_QP,
+    VIRTIO_CMD_DESTROY_QP,
+	VIRTIO_CMD_QUERY_GID,
+	VIRTIO_CMD_CREATE_UC,
+	VIRTIO_CMD_DEALLOC_UC,
+	VIRTIO_CMD_QUERY_PKEY,
+};
+
+const char* cmd_name[] = {
+	[VIRTIO_CMD_QUERY_DEVICE] = "VIRTIO_CMD_QUERY_DEVICE",
+	[VIRTIO_CMD_QUERY_PORT] = "VIRTIO_CMD_QUERY_PORT",
+	[VIRTIO_CMD_CREATE_CQ] = "VIRTIO_CMD_CREATE_CQ",
+	[VIRTIO_CMD_DESTROY_CQ] = "VIRTIO_CMD_DESTROY_CQ",
+	[VIRTIO_CMD_CREATE_PD] = "VIRTIO_CMD_CREATE_PD",
+	[VIRTIO_CMD_DESTROY_PD] = "VIRTIO_CMD_DESTROY_PD",
+	[VIRTIO_CMD_GET_DMA_MR] = "VIRTIO_CMD_GET_DMA_MR",
+	[VIRTIO_CMD_CREATE_MR] = "VIRTIO_CMD_CREATE_MR",
+	[VIRTIO_CMD_MAP_MR_SG] = "VIRTIO_CMD_MAP_MR_SG",
+	[VIRTIO_CMD_REG_USER_MR] = "VIRTIO_CMD_REG_USER_MR",
+	[VIRTIO_CMD_DEREG_MR] = "VIRTIO_CMD_DEREG_MR",
+	[VIRTIO_CMD_CREATE_QP] = "VIRTIO_CMD_CREATE_QP",
+    [VIRTIO_CMD_MODIFY_QP] = "VIRTIO_CMD_MODIFY_QP",
+    [VIRTIO_CMD_DESTROY_QP] = "VIRTIO_CMD_DESTROY_QP",
+	[VIRTIO_CMD_QUERY_GID] = "VIRTIO_CMD_QUERY_GID",
+	[VIRTIO_CMD_CREATE_UC] = "VIRTIO_CMD_CREATE_UC",
+	[VIRTIO_CMD_DEALLOC_UC] = "VIRTIO_CMD_DEALLOC_UC",
+	[VIRTIO_CMD_QUERY_PKEY] = "VIRTIO_CMD_QUERY_PKEY",
+};
+
+struct cmd_query_port {
+	__u8 port;
+};
+
+struct cmd_create_cq {
+	__u32 cqe;
+};
+
+struct rsp_create_cq {
+	__u32 cqn;
+};
+
+struct cmd_destroy_cq {
+	__u32 cqn;
+};
+
+struct rsp_destroy_cq {
+	__u32 cqn;
+};
+
+struct cmd_create_pd {
+	__u32 ctx_handle;
+};
+
+struct rsp_create_pd {
+	__u32 pdn;
+};
+
+struct cmd_destroy_pd {
+	__u32 pdn;
+};
+
+struct rsp_destroy_pd {
+	__u32 pdn;
+};
+
+struct cmd_create_mr {
+	__u32 pdn;
+	__u32 access_flags;
+
+	__u32 max_num_sg;
+};
+
+struct rsp_create_mr {
+	__u32 mrn;
+	__u32 lkey;
+	__u32 rkey;
+};
+
+struct cmd_map_mr_sg {
+	__u32 mrn;
+	__u64 start;
+	__u32 npages;
+
+	__u64 pages;
+};
+
+struct rsp_map_mr_sg {
+	__u32 npages;
+};
+
+struct cmd_reg_user_mr {
+	__u32 pdn;
+	__u32 access_flags;
+	__u64 start;
+	__u64 length;
+
+	__u64 pages;
+	__u32 npages;
+};
+
+struct rsp_reg_user_mr {
+	__u32 mrn;
+	__u32 lkey;
+	__u32 rkey;
+};
+
+struct cmd_dereg_mr {
+    __u32 mrn;
+
+	__u8 is_user_mr;
+};
+
+struct rsp_dereg_mr {
+    __u32 mrn;
+};
+
+struct cmd_create_qp {
+    __u32 pdn;
+    __u8 qp_type;
+    __u32 max_send_wr;
+    __u32 max_send_sge;
+    __u32 send_cqn;
+    __u32 max_recv_wr;
+    __u32 max_recv_sge;
+    __u32 recv_cqn;
+    __u8 is_srq;
+    __u32 srq_handle;
+};
+
+struct rsp_create_qp {
+	__u32 qpn;
+};
+
+struct cmd_modify_qp {
+    __u32 qpn;
+    __u32 attr_mask;
+    struct virtio_rdma_qp_attr attrs;
+};
+
+struct rsp_modify_qp {
+    __u32 qpn;
+};
+
+struct cmd_destroy_qp {
+    __u32 qpn;
+};
+
+struct rsp_destroy_qp {
+    __u32 qpn;
+};
+
+struct cmd_query_qp {
+	__u32 qpn;
+	__u32 attr_mask;
+};
+
+struct rsp_query_qp {
+	struct virtio_rdma_qp_attr attr;
+};
+
+struct cmd_query_gid {
+    __u8 port;
+	__u32 index;
+};
+
+struct cmd_create_uc {
+	__u64 pfn;
+};
+
+struct rsp_create_uc {
+	__u32 ctx_handle;
+};
+
+struct cmd_dealloc_uc {
+	__u32 ctx_handle;
+};
+
+struct rsp_dealloc_uc {
+	__u32 ctx_handle;
+};
+
+struct cmd_query_pkey {
+	__u8 port;
+	__u16 index;
+};
+
+struct rsp_query_pkey {
+	__u16 pkey;
+};
+
+struct cmd_post_send {
+	__u32 qpn;
+	__u32 is_kernel;
+	__u32 num_sge;
+
+	int send_flags;
+	enum ib_wr_opcode opcode;
+	__u64 wr_id;
+
+	union {
+		__be32 imm_data;
+		__u32 invalidate_rkey;
+	} ex;
+	
+	union {
+		struct {
+			__u64 remote_addr;
+			__u32 rkey;
+		} rdma;
+		struct {
+			__u64 remote_addr;
+			__u64 compare_add;
+			__u64 swap;
+			__u32 rkey;
+		} atomic;
+		struct {
+			__u32 remote_qpn;
+			__u32 remote_qkey;
+			__u32 ahn;
+		} ud;
+		struct {
+			__u32 mrn;
+			__u32 key;
+			int access;
+		} reg;
+	} wr;
+};
+
+struct cmd_post_recv {
+	__u32 qpn;
+	__u32 is_kernel;
+
+	__u32 num_sge;
+	__u64 wr_id;
+};
+
+#endif
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_device.c b/drivers/infiniband/hw/virtio/virtio_rdma_device.c
new file mode 100644
index 000000000000..89b636a32140
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_device.c
@@ -0,0 +1,144 @@
+/*
+ * Virtio RDMA device: Device related functions and data
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include <linux/virtio_config.h>
+
+#include "virtio_rdma.h"
+/*
+static void rdma_ctrl_ack(struct virtqueue *vq)
+{
+	struct virtio_rdma_dev *dev = vq->vdev->priv;
+
+	wake_up(&dev->acked);
+
+	printk("%s\n", __func__);
+}
+*/
+struct virtio_rdma_config {
+    int32_t max_cq;
+};
+
+int init_device(struct virtio_rdma_dev *dev)
+{
+	int rc = -ENOMEM, i, cur_vq = 1, total_vqs = 1; // first for ctrl_vq
+	struct virtqueue **vqs;
+	vq_callback_t **cbs;
+	const char **names;
+	int max_cq, max_qp;
+
+	// init cq virtqueue
+	virtio_cread(dev->vdev, struct virtio_rdma_config, max_cq, &max_cq);
+	max_cq = 64; // TODO: remove this, qemu only supports 1024 virtqueues
+	dev->ib_dev.attrs.max_cq = max_cq;
+	dev->ib_dev.attrs.max_qp = 64; // TODO: read from host
+	dev->ib_dev.attrs.max_ah = 64; // TODO: read from host
+	dev->ib_dev.attrs.max_cqe = 64; // TODO: read from host, size of virtqueue
+	pr_info("Device max cq %d\n", dev->ib_dev.attrs.max_cq);
+	total_vqs += max_cq;
+
+	dev->cq_vqs = kcalloc(max_cq, sizeof(*dev->cq_vqs), GFP_ATOMIC);
+	dev->cqs = kcalloc(max_cq, sizeof(*dev->cqs), GFP_ATOMIC);
+
+	// init qp virtqueue
+	max_qp = 64; // TODO: read max qp from device
+	dev->ib_dev.attrs.max_qp = max_qp;
+	total_vqs += max_qp * 2;
+
+	dev->qp_vqs = kcalloc(max_qp * 2, sizeof(*dev->qp_vqs), GFP_ATOMIC);
+
+	dev->qp_vq_using = kzalloc(max_qp * sizeof(*dev->qp_vq_using), GFP_ATOMIC);
+	for (i = 0; i < max_qp; i++) {
+		dev->qp_vq_using[i] = -1;
+	}
+	spin_lock_init(&dev->qp_using_lock);
+
+	vqs = kmalloc_array(total_vqs, sizeof(*vqs), GFP_ATOMIC);
+	if (!vqs)
+		goto err_vq;
+		
+	cbs = kmalloc_array(total_vqs, sizeof(*cbs), GFP_ATOMIC);
+	if (!cbs)
+		goto err_callback;
+
+	names = kmalloc_array(total_vqs, sizeof(*names), GFP_ATOMIC);
+	if (!names)
+		goto err_names;
+
+	names[0] = "ctrl";
+	// cbs[0] = rdma_ctrl_ack;
+	cbs[0] = NULL;
+
+	for (i = 0; i < max_cq; i++, cur_vq++) {
+		sprintf(dev->cq_vqs[i].name, "cq.%d", i);
+		names[cur_vq] = dev->cq_vqs[i].name;
+		cbs[cur_vq] = virtio_rdma_cq_ack;
+	}
+
+	for (i = 0; i < max_qp * 2; i += 2, cur_vq += 2) {
+		sprintf(dev->qp_vqs[i].name, "wqp.%d", i);
+		sprintf(dev->qp_vqs[i+1].name, "rqp.%d", i);
+		names[cur_vq] = dev->qp_vqs[i].name;
+		names[cur_vq+1] = dev->qp_vqs[i+1].name;
+		cbs[cur_vq] = NULL;
+		cbs[cur_vq+1] = NULL;
+	}
+
+	rc = virtio_find_vqs(dev->vdev, total_vqs, vqs, cbs, names, NULL);
+	if (rc) {
+		pr_info("error: %d\n", rc);
+		goto err;
+	}
+
+	dev->ctrl_vq = vqs[0];
+	cur_vq = 1;
+	for (i = 0; i < max_cq; i++, cur_vq++) {
+		dev->cq_vqs[i].vq = vqs[cur_vq];
+		dev->cq_vqs[i].idx = i;
+		spin_lock_init(&dev->cq_vqs[i].lock);
+	}
+
+	for (i = 0; i < max_qp * 2; i += 2, cur_vq += 2) {
+		dev->qp_vqs[i].vq = vqs[cur_vq];
+		dev->qp_vqs[i+1].vq = vqs[cur_vq+1];
+		dev->qp_vqs[i].idx = i / 2;
+		dev->qp_vqs[i+1].idx = i / 2;
+		spin_lock_init(&dev->qp_vqs[i].lock);
+		spin_lock_init(&dev->qp_vqs[i+1].lock);
+	}
+	pr_info("VIRTIO-RDMA INIT qp_vqs %d\n", dev->qp_vqs[max_qp * 2 - 1].vq->index);
+
+	mutex_init(&dev->port_mutex);
+	dev->ib_active = true;
+
+err:
+	kfree(names);
+err_names:
+	kfree(cbs);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return rc;
+}
+
+void fini_device(struct virtio_rdma_dev *dev)
+{
+	dev->vdev->config->reset(dev->vdev);
+	dev->vdev->config->del_vqs(dev->vdev);
+}
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_device.h b/drivers/infiniband/hw/virtio/virtio_rdma_device.h
new file mode 100644
index 000000000000..ca2be23128c7
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_device.h
@@ -0,0 +1,32 @@
+/*
+ * Virtio RDMA device: Device related functions and data
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef __VIRTIO_RDMA_DEVICE__
+#define __VIRTIO_RDMA_DEVICE__
+
+#define VIRTIO_RDMA_BOARD_ID	1
+#define VIRTIO_RDMA_HW_NAME	"virtio-rdma"
+#define VIRTIO_RDMA_HW_REV	1
+#define VIRTIO_RDMA_DRIVER_VER	"1.0"
+
+int init_device(struct virtio_rdma_dev *dev);
+void fini_device(struct virtio_rdma_dev *dev);
+
+#endif
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_ib.c b/drivers/infiniband/hw/virtio/virtio_rdma_ib.c
new file mode 100644
index 000000000000..27ba8990baf9
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_ib.c
@@ -0,0 +1,1695 @@
+/*
+ * Virtio RDMA device: IB related functions and data
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include <linux/scatterlist.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <rdma/ib_mad.h>
+#include <rdma/uverbs_ioctl.h>
+#include <rdma/ib_umem.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_addr.h>
+
+#include "virtio_rdma.h"
+#include "virtio_rdma_device.h"
+#include "virtio_rdma_ib.h"
+#include "virtio_rdma_dev_api.h"
+
+#include "../../core/core_priv.h"
+
+static void ib_qp_cap_to_virtio_rdma(struct virtio_rdma_qp_cap *dst, const struct ib_qp_cap *src)
+{
+	dst->max_send_wr = src->max_send_wr;
+	dst->max_recv_wr = src->max_recv_wr;
+	dst->max_send_sge = src->max_send_sge;
+	dst->max_recv_sge = src->max_recv_sge;
+	dst->max_inline_data = src->max_inline_data;
+}
+
+static void virtio_rdma_to_ib_qp_cap(struct ib_qp_cap *dst, const struct virtio_rdma_qp_cap *src)
+{
+	dst->max_send_wr = src->max_send_wr;
+	dst->max_recv_wr = src->max_recv_wr;
+	dst->max_send_sge = src->max_send_sge;
+	dst->max_recv_sge = src->max_recv_sge;
+	dst->max_inline_data = src->max_inline_data;
+}
+
+void ib_global_route_to_virtio_rdma(struct virtio_rdma_global_route *dst,
+			       const struct ib_global_route *src)
+{
+	dst->dgid = src->dgid;
+	dst->flow_label = src->flow_label;
+	dst->sgid_index = src->sgid_index;
+	dst->hop_limit = src->hop_limit;
+	dst->traffic_class = src->traffic_class;
+}
+
+void virtio_rdma_to_ib_global_route(struct ib_global_route *dst,
+			       const struct virtio_rdma_global_route *src)
+{
+	dst->dgid = src->dgid;
+	dst->flow_label = src->flow_label;
+	dst->sgid_index = src->sgid_index;
+	dst->hop_limit = src->hop_limit;
+	dst->traffic_class = src->traffic_class;
+}
+
+void rdma_ah_attr_to_virtio_rdma(struct virtio_rdma_ah_attr *dst,
+			    const struct rdma_ah_attr *src)
+{
+	ib_global_route_to_virtio_rdma(&dst->grh, rdma_ah_read_grh(src));
+	// FIXME: this should be roce->dmac
+	dst->dlid = rdma_ah_get_dlid(src);
+	dst->sl = rdma_ah_get_sl(src);
+	dst->src_path_bits = rdma_ah_get_path_bits(src);
+	dst->static_rate = rdma_ah_get_static_rate(src);
+	dst->port_num = rdma_ah_get_port_num(src);
+}
+
+void virtio_rdma_to_rdma_ah_attr(struct rdma_ah_attr *dst,
+			    const struct virtio_rdma_ah_attr *src)
+{
+	virtio_rdma_to_ib_global_route(rdma_ah_retrieve_grh(dst), &src->grh);
+	rdma_ah_set_dlid(dst, src->dlid);
+	rdma_ah_set_sl(dst, src->sl);
+	rdma_ah_set_path_bits(dst, src->src_path_bits);
+	rdma_ah_set_static_rate(dst, src->static_rate);
+	rdma_ah_set_port_num(dst, src->port_num);
+}
+
+/* TODO: For the scope of the RFC I'm utilizing ib*_*_attr structures */
+
+static int virtio_rdma_exec_cmd(struct virtio_rdma_dev *di, int cmd,
+				struct scatterlist *in, struct scatterlist *out)
+{
+	struct scatterlist *sgs[4], hdr, status;
+	struct control_buf *ctrl;
+	unsigned tmp;
+	int rc;
+	unsigned long flags;
+
+	pr_info("%s: cmd %d %s\n", __func__, cmd, cmd_name[cmd]);
+	spin_lock_irqsave(&di->ctrl_lock, flags);
+
+	ctrl = kmalloc(sizeof(*ctrl), GFP_ATOMIC);
+	ctrl->cmd = cmd;
+	ctrl->status = ~0;
+
+	sg_init_one(&hdr, &ctrl->cmd, sizeof(ctrl->cmd));
+	sgs[0] = &hdr;
+	sgs[1] = in;
+	sgs[2] = out;
+	sg_init_one(&status, &ctrl->status, sizeof(ctrl->status));
+	sgs[3] = &status;
+
+	rc = virtqueue_add_sgs(di->ctrl_vq, sgs, 2, 2, di, GFP_ATOMIC);
+	if (rc)
+		goto out;
+
+	if (unlikely(!virtqueue_kick(di->ctrl_vq))) {
+		goto out_with_status;
+	}
+
+	while (!virtqueue_get_buf(di->ctrl_vq, &tmp) &&
+	       !virtqueue_is_broken(di->ctrl_vq))
+		cpu_relax();
+
+out_with_status:
+	pr_info("EXEC cmd %d %s, status %d\n", ctrl->cmd, cmd_name[ctrl->cmd], ctrl->status);
+	rc = ctrl->status == VIRTIO_RDMA_CTRL_OK ? 0 : 1;
+
+out:
+	spin_unlock_irqrestore(&di->ctrl_lock, flags);
+	kfree(ctrl);
+	return rc;
+}
+
+static struct scatterlist* init_sg(void* buf, unsigned long nbytes) {
+	struct scatterlist* sg;
+
+	if (is_vmalloc_addr(buf)) {
+		int num_page = 1;
+		int i, off;
+		unsigned int len = nbytes;
+		// pr_info("vmalloc address %px\n", buf);
+
+		off = offset_in_page(buf);
+		if (off + nbytes > (int)PAGE_SIZE) {
+			num_page += DIV_ROUND_UP(nbytes + off - PAGE_SIZE, PAGE_SIZE);
+			len = PAGE_SIZE - off;
+		}
+
+		sg = kmalloc(sizeof(*sg) * num_page, GFP_ATOMIC);
+		if (!sg)
+			return NULL;
+
+		sg_init_table(sg, num_page);
+
+		for (i = 0; i < num_page; i++)	{
+			sg_set_page(sg + i, vmalloc_to_page(buf), len, off);
+			// pr_info("sg_set_page: addr %px len %d off %d\n", vmalloc_to_page(buf), len, off);
+
+			nbytes -= len;
+			buf += len;
+			off = 0;
+			len = min(nbytes, PAGE_SIZE);
+		}
+	} else {
+		sg = kmalloc(sizeof(*sg), GFP_ATOMIC);
+		if (!sg)
+			return NULL;
+		sg_init_one(sg, buf, nbytes);
+	}
+
+	return sg;
+}
+
+static int virtio_rdma_port_immutable(struct ib_device *ibdev, u8 port_num,
+				      struct ib_port_immutable *immutable)
+{
+	struct ib_port_attr attr;
+	int rc;
+
+	rc = ib_query_port(ibdev, port_num, &attr);
+	if (rc)
+		return rc;
+
+	immutable->core_cap_flags = RDMA_CORE_PORT_VIRTIO;
+	immutable->pkey_tbl_len = attr.pkey_tbl_len;
+	immutable->gid_tbl_len = attr.gid_tbl_len;
+	immutable->max_mad_size = IB_MGMT_MAD_SIZE;
+
+	return 0;
+}
+
+static int virtio_rdma_query_device(struct ib_device *ibdev,
+				    struct ib_device_attr *props,
+				    struct ib_udata *uhw)
+{
+	struct scatterlist* data;
+	int offs;
+	int rc;
+
+	if (uhw->inlen || uhw->outlen)
+		return -EINVAL;
+
+	/* We start with sys_image_guid because of inconsistency between ib_
+	 * and ibv_ */
+	offs = offsetof(struct ib_device_attr, sys_image_guid);
+
+	data = init_sg((void *)props + offs, sizeof(*props) - offs);
+	if (!data)
+		return -ENOMEM;
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibdev), VIRTIO_CMD_QUERY_DEVICE, NULL,
+				  data);
+
+	// TODO: more attrs
+	props->max_cq = ibdev->attrs.max_cq;
+	props->max_cqe = ibdev->attrs.max_cqe;
+
+	kfree(data);
+	return rc;
+}
+
+static int virtio_rdma_query_port(struct ib_device *ibdev, u8 port,
+				  struct ib_port_attr *props)
+{
+	struct scatterlist in, *out;
+	struct cmd_query_port *cmd;
+	int offs;
+	int rc;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	offs = offsetof(struct ib_port_attr, state);
+
+	out = init_sg((void *)props + offs, sizeof(*props) - offs);
+	if (!out) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->port = port;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibdev), VIRTIO_CMD_QUERY_PORT, &in,
+				  out);
+
+	kfree(out);
+	kfree(cmd);
+
+	return rc;
+}
+
+static struct net_device *virtio_rdma_get_netdev(struct ib_device *ibdev,
+						 u8 port_num)
+{
+	struct virtio_rdma_dev *ri = to_vdev(ibdev);
+	return ri->netdev;
+}
+
+static bool virtio_rdma_cq_notify_now(struct virtio_rdma_cq *cq, uint32_t flags)
+{
+	uint32_t cq_notify;
+
+	if (!cq->ibcq.comp_handler)
+		return false;
+
+	/* Read application shared notification state */
+	cq_notify = READ_ONCE(cq->notify_flags);
+
+	if ((cq_notify & VIRTIO_RDMA_NOTIFY_NEXT_COMPLETION) ||
+	    ((cq_notify & VIRTIO_RDMA_NOTIFY_SOLICITED) &&
+	     (flags & IB_SEND_SOLICITED))) {
+		/*
+		 * CQ notification is one-shot: Since the
+		 * current CQE causes user notification,
+		 * the CQ gets disarmed and must be re-armed
+		 * by the user for a new notification.
+		 */
+		WRITE_ONCE(cq->notify_flags, VIRTIO_RDMA_NOTIFY_NOT);
+
+		return true;
+	}
+	return false;
+}
+
+void virtio_rdma_cq_ack(struct virtqueue *vq)
+{
+	unsigned tmp;
+	struct virtio_rdma_cq *vcq;
+	struct scatterlist sg;
+	bool notify;
+
+	virtqueue_disable_cb(vq);
+	while ((vcq = virtqueue_get_buf(vq, &tmp))) {
+		atomic_inc(&vcq->cqe_cnt);
+		vcq->cqe_put++;
+
+		notify = virtio_rdma_cq_notify_now(vcq, vcq->queue[vcq->cqe_put % vcq->num_cqe].wc_flags);
+
+		sg_init_one(&sg, &vcq->queue[vcq->cqe_enqueue % vcq->num_cqe], sizeof(*vcq->queue));
+		virtqueue_add_inbuf(vcq->vq->vq, &sg, 1, vcq, GFP_ATOMIC);
+		vcq->cqe_enqueue++;
+
+		if (notify) {
+			vcq->ibcq.comp_handler(&vcq->ibcq,
+					vcq->ibcq.cq_context);
+		}
+	}
+	virtqueue_enable_cb(vq);
+}
+
+static int virtio_rdma_create_cq(struct ib_cq *ibcq,
+				    const struct ib_cq_init_attr *attr,
+				    struct ib_udata *udata)
+{
+	struct scatterlist in, out;
+	struct virtio_rdma_cq *vcq = to_vcq(ibcq);
+	struct virtio_rdma_dev *vdev = to_vdev(ibcq->device);
+	struct cmd_create_cq *cmd;
+	struct rsp_create_cq *rsp;
+	struct scatterlist sg;
+	int rc, i, fill;
+	int entries = attr->cqe;
+
+	if (!atomic_add_unless(&vdev->num_cq, 1, ibcq->device->attrs.max_cq))
+		return -ENOMEM;
+
+	// size must be a power of 2 so that index wraparound never yields an invalid index
+	entries = roundup_pow_of_two(entries);
+	vcq->queue = kcalloc(entries, sizeof(*vcq->queue), GFP_KERNEL);
+	if (!vcq->queue)
+		return -ENOMEM;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd) {
+		kfree(vcq->queue);
+		return -ENOMEM;
+	}
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(vcq->queue);
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->cqe = attr->cqe;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(vdev, VIRTIO_CMD_CREATE_CQ, &in,
+				  &out);
+	if (rc) {
+		kfree(vcq->queue);
+		goto out;
+	}
+
+	vcq->cq_handle = rsp->cqn;
+	vcq->ibcq.cqe = entries;
+	vcq->vq = &vdev->cq_vqs[rsp->cqn];
+	vcq->num_cqe = entries;
+	vcq->cqe_enqueue = 0;
+	vcq->cqe_put = 0;
+	vcq->cqe_get = 0;
+	atomic_set(&vcq->cqe_cnt, 0);
+
+	vdev->cqs[rsp->cqn] = vcq;
+
+	fill = min(entries, vdev->ib_dev.attrs.max_cqe);
+	for(i = 0; i < fill; i++) {
+		sg_init_one(&sg, vcq->queue + i, sizeof(*vcq->queue));
+		virtqueue_add_inbuf(vcq->vq->vq, &sg, 1, vcq, GFP_KERNEL);
+		vcq->cqe_enqueue++;
+	}
+
+	spin_lock_init(&vcq->lock);
+
+out:
+	kfree(rsp);
+	kfree(cmd);
+	return rc;
+}
+
+void virtio_rdma_destroy_cq(struct ib_cq *cq, struct ib_udata *udata)
+{
+	struct virtio_rdma_cq *vcq;
+	struct scatterlist in, out;
+	struct cmd_destroy_cq *cmd;
+	struct rsp_destroy_cq *rsp;
+	unsigned tmp;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return;
+	}
+
+	vcq = to_vcq(cq);
+
+	cmd->cqn = vcq->cq_handle;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	virtqueue_disable_cb(vcq->vq->vq);
+
+	virtio_rdma_exec_cmd(to_vdev(cq->device), VIRTIO_CMD_DESTROY_CQ,
+				  &in, &out);
+
+	/* Pop all buffers from the virtqueue after the host calls
+	 * virtqueue_drop_all, so that the queue is ready for reuse.
+	 */
+	while(virtqueue_get_buf(vcq->vq->vq, &tmp));
+
+	atomic_dec(&to_vdev(cq->device)->num_cq);
+	virtqueue_enable_cb(vcq->vq->vq);
+
+	pr_debug("cqp_cnt %d %u %u %u\n", atomic_read(&vcq->cqe_cnt), vcq->cqe_enqueue, vcq->cqe_get, vcq->cqe_put);
+
+	kfree(cmd);
+	kfree(rsp);
+}
+
+int virtio_rdma_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+	struct virtio_rdma_pd *pd = to_vpd(ibpd);
+	struct ib_device *ibdev = ibpd->device;
+	struct cmd_create_pd *cmd;
+	struct rsp_create_pd *rsp;
+	struct scatterlist out, in;
+	int rc;
+	struct virtio_rdma_ucontext *context = rdma_udata_to_drv_context(
+		udata, struct virtio_rdma_ucontext, ibucontext);
+
+	// TODO: Check MAX_PD
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->ctx_handle = context ? context->ctx_handle : 0;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibdev), VIRTIO_CMD_CREATE_PD, &in,
+				  &out);
+	if (rc)
+		goto out;
+
+	pd->pd_handle = rsp->pdn;
+
+	printk("%s: pd_handle=%d\n", __func__, pd->pd_handle);
+
+out:
+	kfree(rsp);
+	kfree(cmd);
+
+	printk("%s: rc=%d\n", __func__, rc);
+	return rc;
+}
+
+void virtio_rdma_dealloc_pd(struct ib_pd *pd, struct ib_udata *udata)
+{
+	struct virtio_rdma_pd *vpd = to_vpd(pd);
+	struct ib_device *ibdev = pd->device;
+	struct cmd_destroy_pd *cmd;
+	struct rsp_destroy_pd *rsp;
+	struct scatterlist in, out;
+
+	pr_debug("%s:\n", __func__);
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return;
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return;
+	}
+
+	cmd->pdn = vpd->pd_handle;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	virtio_rdma_exec_cmd(to_vdev(ibdev), VIRTIO_CMD_DESTROY_PD, &in, &out);
+
+	kfree(cmd);
+	kfree(rsp);
+}
+
+struct ib_mr *virtio_rdma_get_dma_mr(struct ib_pd *pd, int flags)
+{
+	struct virtio_rdma_mr *mr;
+	struct scatterlist in, out;
+	struct cmd_create_mr *cmd;
+	struct rsp_create_mr *rsp;
+	int rc;
+
+	mr = kzalloc(sizeof(*mr), GFP_ATOMIC);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd) {
+		kfree(mr);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(mr);
+		kfree(cmd);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	cmd->pdn = to_vpd(pd)->pd_handle;
+	cmd->access_flags = flags;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	pr_warn("DMA MR is not supported yet\n");
+
+	rc = virtio_rdma_exec_cmd(to_vdev(pd->device), VIRTIO_CMD_GET_DMA_MR,
+				  &in, &out);
+	pr_info("%s: mr_handle=0x%x\n", __func__, rsp->mrn);
+	if (rc) {
+		kfree(rsp);
+		kfree(mr);
+		kfree(cmd);
+		return ERR_PTR(rc);
+	}
+
+	mr->mr_handle = rsp->mrn;
+	mr->ibmr.lkey = rsp->lkey;
+	mr->ibmr.rkey = rsp->rkey;
+	mr->type = VIRTIO_RDMA_TYPE_KERNEL;
+	to_vpd(pd)->type = VIRTIO_RDMA_TYPE_KERNEL;
+
+	kfree(cmd);
+	kfree(rsp);
+
+	return &mr->ibmr;
+}
+
+struct ib_mr *virtio_rdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
+				   u32 max_num_sg, struct ib_udata *udata)
+{
+	struct virtio_rdma_dev *dev = to_vdev(pd->device);
+	struct virtio_rdma_pd *vpd = to_vpd(pd);
+	struct virtio_rdma_mr *mr;
+	struct scatterlist in, out;
+	struct cmd_create_mr *cmd;
+	struct rsp_create_mr *rsp;
+	struct ib_mr *ret = ERR_PTR(-ENOMEM);
+	int rc;
+
+	pr_info("%s: mr_type %d, max_num_sg %d\n", __func__, mr_type,
+	       max_num_sg);
+
+	if (mr_type != IB_MR_TYPE_MEM_REG)
+		return ERR_PTR(-EINVAL);
+
+	mr = kzalloc(sizeof(*mr), GFP_ATOMIC);
+	if (!mr)
+		goto err;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		goto err_cmd;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp)
+		goto err_rsp;
+
+	 // FIXME: only support PAGE_SIZE/8 sg;
+	mr->pages = dma_alloc_coherent(dev->vdev->dev.parent, PAGE_SIZE, &mr->dma_pages, GFP_KERNEL);
+	if (!mr->pages) {
+		pr_err("dma alloc pages failed\n");
+		goto err_pages;
+	}
+	mr->max_pages = max_num_sg;
+	mr->npages = 0;
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->pdn = to_vpd(pd)->pd_handle;
+	cmd->max_num_sg = max_num_sg;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(pd->device), VIRTIO_CMD_CREATE_MR,
+				  &in, &out);
+
+	if (rc) {
+		kfree(rsp);
+		kfree(mr);
+		kfree(cmd);
+		return ERR_PTR(rc);
+	}
+
+	mr->mr_handle = rsp->mrn;
+	mr->ibmr.lkey = rsp->lkey;
+	mr->ibmr.rkey = rsp->rkey;
+	mr->type = VIRTIO_RDMA_TYPE_KERNEL;
+	vpd->type = VIRTIO_RDMA_TYPE_KERNEL;
+
+	pr_info("%s: mr_handle=0x%x\n", __func__, mr->mr_handle);
+
+	kfree(cmd);
+	kfree(rsp);
+
+	return &mr->ibmr;
+
+err_pages:
+	kfree(rsp);
+err_rsp:
+	kfree(cmd);
+err_cmd:
+	kfree(mr);
+err:
+	return ret;
+}
+
+static int virtio_rdma_set_page(struct ib_mr *ibmr, u64 addr)
+{
+	struct virtio_rdma_mr *mr = to_vmr(ibmr);
+
+	if (mr->npages == mr->max_pages)
+		return -ENOMEM;
+
+	if (is_vmalloc_addr((void*)addr)) {
+		pr_err("vmalloc addr is not supported\n");
+		return -EINVAL;
+	}
+	mr->pages[mr->npages++] = virt_to_phys((void*)addr);
+	return 0;
+}
+
+int virtio_rdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+			  int sg_nents, unsigned int *sg_offset)
+{
+	struct virtio_rdma_mr *mr = to_vmr(ibmr);
+	struct cmd_map_mr_sg *cmd;
+	struct rsp_map_mr_sg *rsp;
+	struct scatterlist in, out;
+	int rc;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+	rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
+	if (!rsp) {
+		rc = -ENOMEM;
+		goto out_rsp;
+	}
+
+	mr->npages = 0;
+
+	rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, virtio_rdma_set_page);
+	if (rc < 0) {
+		pr_err("could not map sg to pages\n");
+		rc = -EINVAL;
+		goto out;
+	}
+
+	pr_info("%s: start %llx npages %d\n", __func__, sg[0].dma_address, mr->npages);
+
+	cmd->mrn = mr->mr_handle;
+	cmd->start = (uint64_t)phys_to_virt(mr->pages[0]);
+	cmd->npages = mr->npages;
+	cmd->pages = mr->dma_pages;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibmr->device), VIRTIO_CMD_MAP_MR_SG,
+				  &in, &out);
+
+	if (rc)
+		rc = -EIO;
+
+out:
+	kfree(rsp);
+out_rsp:
+	kfree(cmd);
+	return rc;
+}
+
+struct ib_mr *virtio_rdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+				      u64 virt_addr, int access_flags,
+				      struct ib_udata *udata)
+{
+	struct virtio_rdma_dev *dev = to_vdev(pd->device);
+	struct virtio_rdma_pd *vpd = to_vpd(pd);
+	struct virtio_rdma_mr *mr;
+	struct ib_umem *umem;
+	struct ib_mr *ret = ERR_PTR(-ENOMEM);
+	struct sg_dma_page_iter sg_iter;
+	struct scatterlist in, out;
+	struct cmd_reg_user_mr *cmd;
+	struct rsp_reg_user_mr *rsp;
+	int rc;
+	uint32_t npages;
+
+	pr_info("%s: start %llu, len %llu, addr %llu\n", __func__, start, length, virt_addr);
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd) 
+		return ERR_PTR(-ENOMEM);
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp)
+		goto err_rsp;
+
+	umem = ib_umem_get(udata, start, length, access_flags, 0);
+	if (IS_ERR(umem)) {
+		pr_err("could not get umem for mem region\n");
+		ret = ERR_CAST(umem);
+		goto err;
+	}
+
+	npages = ib_umem_num_pages(umem);
+	if (npages < 0) {
+		pr_err("npages < 0");
+		ret = ERR_PTR(-EINVAL);
+		goto err;
+	}
+
+	mr = kzalloc(sizeof(*mr), GFP_ATOMIC);
+	if (!mr) {
+		ret = ERR_PTR(-ENOMEM);
+		goto err;
+	}
+
+	// TODO: change page size to needed
+	mr->pages = dma_alloc_coherent(dev->vdev->dev.parent, PAGE_SIZE, &mr->dma_pages, GFP_KERNEL);
+	if (!mr->pages) {
+		pr_err("dma alloc pages failed\n");
+		goto err;
+	}
+
+	mr->max_pages = npages;
+	mr->iova = virt_addr;
+	mr->size = length;
+	mr->umem = umem;
+
+	// TODO: test pages
+	mr->npages = 0;
+	for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
+		dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
+		mr->pages[mr->npages] = addr;
+		mr->npages++;
+	}
+
+	cmd->pdn = to_vpd(pd)->pd_handle;
+	cmd->access_flags = access_flags;
+	cmd->start = start;
+	cmd->length = length;
+	cmd->pages = mr->dma_pages;
+	cmd->npages = npages;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(pd->device), VIRTIO_CMD_REG_USER_MR,
+				  &in, &out);
+
+	if (rc) {
+		ib_umem_release(umem);
+		kfree(rsp);
+		kfree(mr);
+		kfree(cmd);
+		return ERR_PTR(rc);
+	}
+
+	mr->mr_handle = rsp->mrn;
+	mr->ibmr.lkey = rsp->lkey;
+	mr->ibmr.rkey = rsp->rkey;
+	mr->type = VIRTIO_RDMA_TYPE_USER;
+	vpd->type = VIRTIO_RDMA_TYPE_USER;
+
+	printk("%s: mr_handle=0x%x\n", __func__, mr->mr_handle);
+
+	ret = &mr->ibmr;
+
+err:
+	kfree(rsp);
+err_rsp:
+	kfree(cmd);
+	return ret;
+}
+
+int virtio_rdma_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
+{
+	struct virtio_rdma_mr *mr = to_vmr(ibmr);
+	struct scatterlist in, out;
+	struct cmd_dereg_mr *cmd;
+	struct rsp_dereg_mr *rsp;
+	int rc = -ENOMEM;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp)
+		goto out_rsp;
+
+	cmd->mrn = mr->mr_handle;
+	cmd->is_user_mr = mr->type == VIRTIO_RDMA_TYPE_USER;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibmr->device), VIRTIO_CMD_DEREG_MR,
+	                          &in, &out);
+	if (rc) {
+		rc = -EIO;
+		goto out;
+	}
+
+	dma_free_coherent(to_vdev(ibmr->device)->vdev->dev.parent, PAGE_SIZE, mr->pages, mr->dma_pages);
+	if (mr->type == VIRTIO_RDMA_TYPE_USER)
+		ib_umem_release(mr->umem);
+out:
+	kfree(rsp);
+out_rsp:
+	kfree(cmd);
+	return rc;
+}
+
+static int find_qp_vq(struct virtio_rdma_dev *dev, uint32_t qpn) {
+	int rc = -1, i;
+	unsigned long flags;
+	uint32_t max_qp = dev->ib_dev.attrs.max_qp;
+
+	spin_lock_irqsave(&dev->qp_using_lock, flags);
+	for(i = 0; i < max_qp; i++) {
+		if (dev->qp_vq_using[i] == -1) {
+			rc = i;
+			dev->qp_vq_using[i] = qpn;
+			goto found;
+		}
+	}
+found:
+	spin_unlock_irqrestore(&dev->qp_using_lock, flags);
+	return rc;
+}
+
+struct ib_qp *virtio_rdma_create_qp(struct ib_pd *ibpd,
+				    struct ib_qp_init_attr *attr,
+				    struct ib_udata *udata)
+{
+	struct scatterlist in, out;
+	struct virtio_rdma_dev *vdev = to_vdev(ibpd->device);
+	struct virtio_rdma_pd *vpd = to_vpd(ibpd);
+	struct cmd_create_qp *cmd;
+	struct rsp_create_qp *rsp;
+	struct virtio_rdma_qp *vqp;
+	int rc, vqn;
+
+	if (!atomic_add_unless(&vdev->num_qp, 1, vdev->ib_dev.attrs.max_qp))
+		return ERR_PTR(-ENOMEM);
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return ERR_PTR(-ENOMEM);
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	vqp = kzalloc(sizeof(*vqp), GFP_ATOMIC);
+	if (!vqp) {
+		kfree(cmd);
+		kfree(rsp);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	cmd->pdn = to_vpd(ibpd)->pd_handle;
+	cmd->qp_type = attr->qp_type;
+	cmd->max_send_wr = attr->cap.max_send_wr;
+	cmd->max_send_sge = attr->cap.max_send_sge;
+	cmd->send_cqn = to_vcq(attr->send_cq)->cq_handle;
+	cmd->max_recv_wr = attr->cap.max_recv_wr;
+	cmd->max_recv_sge = attr->cap.max_recv_sge;
+	cmd->recv_cqn = to_vcq(attr->recv_cq)->cq_handle;
+	cmd->is_srq = !!attr->srq;
+	cmd->srq_handle = 0; // SRQ is not supported yet
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	printk("%s: pdn %d\n", __func__, cmd->pdn);
+
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(vdev, VIRTIO_CMD_CREATE_QP, &in,
+				  &out);
+	if (rc) {
+		kfree(vqp);
+		kfree(rsp);
+		kfree(cmd);
+		return ERR_PTR(-EIO);
+	}
+
+	vqp->type = vpd->type;
+	vqp->port = attr->port_num;
+	vqp->qp_handle = rsp->qpn;
+	vqp->ibqp.qp_num = rsp->qpn;
+	
+	vqn = find_qp_vq(vdev, vqp->qp_handle);
+	vqp->sq = &vdev->qp_vqs[vqn * 2];
+	vqp->rq = &vdev->qp_vqs[vqn * 2 + 1];
+	vqp->s_cmd = kmalloc(sizeof(*vqp->s_cmd), GFP_ATOMIC);
+	vqp->r_cmd = kmalloc(sizeof(*vqp->r_cmd), GFP_ATOMIC);
+
+	pr_info("%s: qpn 0x%x wq %d rq %d\n", __func__, rsp->qpn,
+	        vqp->sq->vq->index, vqp->rq->vq->index);
+	
+	kfree(rsp);
+	kfree(cmd);
+	return &vqp->ibqp;
+}
+
+int virtio_rdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
+{
+	struct virtio_rdma_dev *vdev = to_vdev(ibqp->device);
+	struct virtio_rdma_qp *vqp = to_vqp(ibqp);
+	struct scatterlist in, out;
+	struct cmd_destroy_qp *cmd;
+	struct rsp_destroy_qp *rsp;
+	int rc;
+
+	pr_info("%s: qpn %d\n", __func__, vqp->qp_handle);
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->qpn = vqp->qp_handle;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(vdev, VIRTIO_CMD_DESTROY_QP,
+	                          &in, &out);
+	
+	atomic_dec(&vdev->num_qp);
+	// FIXME: need lock ?
+	smp_store_mb(vdev->qp_vq_using[vqp->sq->idx], -1);
+
+	kfree(vqp->s_cmd);
+	kfree(vqp->r_cmd);
+
+	kfree(rsp);
+	kfree(cmd);
+	return rc;
+}
+
+int virtio_rdma_query_gid(struct ib_device *ibdev, u8 port, int index,
+			  union ib_gid *gid)
+{
+	struct scatterlist in, *data;
+	struct cmd_query_gid *cmd;
+	struct ib_gid_attr gid_attr;
+	int rc;
+
+	printk("%s: port %d, index %d\n", __func__, port, index);
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	data = init_sg(gid, sizeof(*gid));
+	if (!data) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->port = port;
+	cmd->index = index;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibdev), VIRTIO_CMD_QUERY_GID, &in,
+				  data);
+
+	if (!rc) {
+		gid_attr.ndev = to_vdev(ibdev)->netdev;
+		gid_attr.gid_type = IB_GID_TYPE_ROCE;
+		ib_cache_gid_add(ibdev, port, gid, &gid_attr);
+	}
+
+	kfree(data);
+	kfree(cmd);
+	return rc;
+}
+
+static int virtio_rdma_add_gid(const struct ib_gid_attr *attr, void **context)
+{
+	printk("%s: gid index %d\n", __func__, attr->index);
+
+	return 0;
+}
+
+static int virtio_rdma_del_gid(const struct ib_gid_attr *attr, void **context)
+{
+	printk("%s:\n", __func__);
+
+	return 0;
+}
+
+int virtio_rdma_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata)
+{
+	struct scatterlist in, out;
+	struct cmd_create_uc *cmd;
+	struct rsp_create_uc *rsp;
+	struct virtio_rdma_ucontext *vuc = to_vucontext(uctx);
+	int rc;
+	
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	// TODO: init uar & set cmd->pfn
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(uctx->device), VIRTIO_CMD_CREATE_UC, &in,
+				  &out);
+
+	if (rc) {
+		rc = -EIO;
+		goto out;
+	}
+
+	vuc->ctx_handle = rsp->ctx_handle;
+
+out:
+	kfree(rsp);
+	kfree(cmd);
+	return rc;
+}
+
+void virtio_rdma_dealloc_ucontext(struct ib_ucontext *ibcontext)
+{
+	struct scatterlist in, out;
+	struct cmd_dealloc_uc *cmd;
+	struct rsp_dealloc_uc *rsp;
+	struct virtio_rdma_ucontext *vuc = to_vucontext(ibcontext);
+	
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return;
+	}
+
+	cmd->ctx_handle = vuc->ctx_handle;
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	virtio_rdma_exec_cmd(to_vdev(ibcontext->device), VIRTIO_CMD_DEALLOC_UC,
+			     &in, &out);
+
+	kfree(rsp);
+	kfree(cmd);
+}
+
+int virtio_rdma_create_ah(struct ib_ah *ibah,
+				    struct rdma_ah_attr *ah_attr, u32 flags,
+				    struct ib_udata *udata)
+{
+	struct virtio_rdma_dev *vdev = to_vdev(ibah->device);
+	struct virtio_rdma_ah *ah = to_vah(ibah);
+	const struct ib_global_route *grh;
+	u8 port_num = rdma_ah_get_port_num(ah_attr);
+
+	if (!(rdma_ah_get_ah_flags(ah_attr) & IB_AH_GRH))
+		return -EINVAL;
+
+	grh = rdma_ah_read_grh(ah_attr);
+	if ((ah_attr->type != RDMA_AH_ATTR_TYPE_ROCE) ||
+	    rdma_is_multicast_addr((struct in6_addr *)grh->dgid.raw))
+		return -EINVAL;
+
+	if (!atomic_add_unless(&vdev->num_ah, 1, vdev->ib_dev.attrs.max_ah))
+		return -ENOMEM;
+
+	ah->av.port_pd = to_vpd(ibah->pd)->pd_handle | (port_num << 24);
+	ah->av.src_path_bits = rdma_ah_get_path_bits(ah_attr);
+	ah->av.src_path_bits |= 0x80;
+	ah->av.gid_index = grh->sgid_index;
+	ah->av.hop_limit = grh->hop_limit;
+	ah->av.sl_tclass_flowlabel = (grh->traffic_class << 20) |
+				      grh->flow_label;
+	memcpy(ah->av.dgid, grh->dgid.raw, 16);
+	memcpy(ah->av.dmac, ah_attr->roce.dmac, ETH_ALEN);
+
+	return 0;
+}
+
+void virtio_rdma_destroy_ah(struct ib_ah *ah, u32 flags)
+{
+	struct virtio_rdma_dev *vdev = to_vdev(ah->device);
+
+	pr_debug("%s:\n", __func__);
+	atomic_dec(&vdev->num_ah);
+}
+
+static void virtio_rdma_get_fw_ver_str(struct ib_device *device, char *str)
+{
+	snprintf(str, IB_FW_VERSION_NAME_MAX, "%d.%d.%d\n", 1, 0, 0);
+}
+
+enum rdma_link_layer virtio_rdma_port_link_layer(struct ib_device *ibdev,
+						 u8 port)
+{
+	return IB_LINK_LAYER_ETHERNET;
+}
+
+int virtio_rdma_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vma)
+{
+	pr_debug("%s:\n", __func__);
+
+	return 0;
+}
+
+int virtio_rdma_modify_port(struct ib_device *ibdev, u8 port, int mask,
+			    struct ib_port_modify *props)
+{
+	struct ib_port_attr attr;
+	struct virtio_rdma_dev *vdev = to_vdev(ibdev);
+	int ret;
+
+	if (mask & ~IB_PORT_SHUTDOWN) {
+		pr_warn("unsupported port modify mask %#x\n", mask);
+		return -EOPNOTSUPP;
+	}
+
+	mutex_lock(&vdev->port_mutex);
+	ret = ib_query_port(ibdev, port, &attr);
+	if (ret)
+		goto out;
+
+	vdev->port_cap_mask |= props->set_port_cap_mask;
+	vdev->port_cap_mask &= ~props->clr_port_cap_mask;
+
+	if (mask & IB_PORT_SHUTDOWN)
+		vdev->ib_active = false;
+
+out:
+	mutex_unlock(&vdev->port_mutex);
+	return ret;
+}
+
+int virtio_rdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+			  int attr_mask, struct ib_udata *udata)
+{
+	struct scatterlist in, out;
+	struct cmd_modify_qp *cmd;
+	struct rsp_modify_qp *rsp;
+	int rc;
+
+	pr_info("%s: qpn %d\n", __func__, to_vqp(ibqp)->qp_handle);
+
+	cmd = kzalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->qpn = to_vqp(ibqp)->qp_handle;
+	cmd->attr_mask = attr_mask & ((1 << 21) - 1);
+
+	// TODO: copy based on attr_mask
+	cmd->attrs.qp_state = attr->qp_state;
+	cmd->attrs.cur_qp_state = attr->cur_qp_state;
+	cmd->attrs.path_mtu = attr->path_mtu;
+	cmd->attrs.path_mig_state = attr->path_mig_state;
+	cmd->attrs.qkey = attr->qkey;
+	cmd->attrs.rq_psn = attr->rq_psn;
+	cmd->attrs.sq_psn = attr->sq_psn;
+	cmd->attrs.dest_qp_num = attr->dest_qp_num;
+	cmd->attrs.qp_access_flags = attr->qp_access_flags;
+	cmd->attrs.pkey_index = attr->pkey_index;
+	cmd->attrs.alt_pkey_index = attr->alt_pkey_index;
+	cmd->attrs.en_sqd_async_notify = attr->en_sqd_async_notify;
+	cmd->attrs.sq_draining = attr->sq_draining;
+	cmd->attrs.max_rd_atomic = attr->max_rd_atomic;
+	cmd->attrs.max_dest_rd_atomic = attr->max_dest_rd_atomic;
+	cmd->attrs.min_rnr_timer = attr->min_rnr_timer;
+	cmd->attrs.port_num = attr->port_num;
+	cmd->attrs.timeout = attr->timeout;
+	cmd->attrs.retry_cnt = attr->retry_cnt;
+	cmd->attrs.rnr_retry = attr->rnr_retry;
+	cmd->attrs.alt_port_num = attr->alt_port_num;
+	cmd->attrs.alt_timeout = attr->alt_timeout;
+	cmd->attrs.rate_limit = attr->rate_limit;
+	ib_qp_cap_to_virtio_rdma(&cmd->attrs.cap, &attr->cap);
+	rdma_ah_attr_to_virtio_rdma(&cmd->attrs.ah_attr, &attr->ah_attr);
+	rdma_ah_attr_to_virtio_rdma(&cmd->attrs.alt_ah_attr, &attr->alt_ah_attr);
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibqp->device), VIRTIO_CMD_MODIFY_QP,
+				  &in, &out);
+
+	kfree(rsp);
+	kfree(cmd);
+	return rc;
+}
+
+int virtio_rdma_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+			 int attr_mask, struct ib_qp_init_attr *init_attr)
+{
+	struct scatterlist in, out;
+	struct virtio_rdma_qp *vqp = to_vqp(ibqp);
+	struct cmd_query_qp *cmd;
+	struct rsp_query_qp *rsp;
+	int rc;
+
+	cmd = kzalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->qpn = vqp->qp_handle;
+	cmd->attr_mask = attr_mask;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+	rc = virtio_rdma_exec_cmd(to_vdev(ibqp->device), VIRTIO_CMD_QUERY_QP,
+				  &in, &out);
+
+	if (rc)
+		goto out;
+
+	attr->qp_state = rsp->attr.qp_state;
+	attr->cur_qp_state = rsp->attr.cur_qp_state;
+	attr->path_mtu = rsp->attr.path_mtu;
+	attr->path_mig_state = rsp->attr.path_mig_state;
+	attr->qkey = rsp->attr.qkey;
+	attr->rq_psn = rsp->attr.rq_psn;
+	attr->sq_psn = rsp->attr.sq_psn;
+	attr->dest_qp_num = rsp->attr.dest_qp_num;
+	attr->qp_access_flags = rsp->attr.qp_access_flags;
+	attr->pkey_index = rsp->attr.pkey_index;
+	attr->alt_pkey_index = rsp->attr.alt_pkey_index;
+	attr->en_sqd_async_notify = rsp->attr.en_sqd_async_notify;
+	attr->sq_draining = rsp->attr.sq_draining;
+	attr->max_rd_atomic = rsp->attr.max_rd_atomic;
+	attr->max_dest_rd_atomic = rsp->attr.max_dest_rd_atomic;
+	attr->min_rnr_timer = rsp->attr.min_rnr_timer;
+	attr->port_num = rsp->attr.port_num;
+	attr->timeout = rsp->attr.timeout;
+	attr->retry_cnt = rsp->attr.retry_cnt;
+	attr->rnr_retry = rsp->attr.rnr_retry;
+	attr->alt_port_num = rsp->attr.alt_port_num;
+	attr->alt_timeout = rsp->attr.alt_timeout;
+	attr->rate_limit = rsp->attr.rate_limit;
+	virtio_rdma_to_ib_qp_cap(&attr->cap, &rsp->attr.cap);
+	virtio_rdma_to_rdma_ah_attr(&attr->ah_attr, &rsp->attr.ah_attr);
+	virtio_rdma_to_rdma_ah_attr(&attr->alt_ah_attr, &rsp->attr.alt_ah_attr);
+
+out:
+	init_attr->event_handler = vqp->ibqp.event_handler;
+	init_attr->qp_context = vqp->ibqp.qp_context;
+	init_attr->send_cq = vqp->ibqp.send_cq;
+	init_attr->recv_cq = vqp->ibqp.recv_cq;
+	init_attr->srq = vqp->ibqp.srq;
+	init_attr->xrcd = NULL;
+	init_attr->cap = attr->cap;
+	init_attr->sq_sig_type = 0;
+	init_attr->qp_type = vqp->ibqp.qp_type;
+	init_attr->create_flags = 0;
+	init_attr->port_num = vqp->port;
+
+	kfree(cmd);
+	kfree(rsp);
+	return rc;
+}
+
+/* This verb is relevant only for InfiniBand */
+int virtio_rdma_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
+			   u16 *pkey)
+{
+	struct scatterlist in, out;
+	struct cmd_query_pkey *cmd;
+	struct rsp_query_pkey *rsp;
+	int rc;
+
+	cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+	if (!cmd)
+		return -ENOMEM;
+
+	rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
+	if (!rsp) {
+		kfree(cmd);
+		return -ENOMEM;
+	}
+
+	cmd->port = port;
+	cmd->index = index;
+
+	sg_init_one(&in, cmd, sizeof(*cmd));
+	sg_init_one(&out, rsp, sizeof(*rsp));
+
+	rc = virtio_rdma_exec_cmd(to_vdev(ibdev), VIRTIO_CMD_QUERY_PKEY,
+				  &in, &out);
+	if (!rc)
+		*pkey = rsp->pkey;
+
+	kfree(cmd);
+	kfree(rsp);
+	return rc;
+}
+
+int virtio_rdma_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+	struct virtio_rdma_cq *vcq = to_vcq(ibcq);
+	struct virtio_rdma_cqe *cqe;
+	int i = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vcq->lock, flags);
+	while (i < num_entries && vcq->cqe_get < vcq->cqe_put) {
+		cqe = &vcq->queue[vcq->cqe_get];
+
+		wc[i].wr_id = cqe->wr_id;
+		wc[i].status = cqe->status;
+		wc[i].opcode = cqe->opcode;
+		wc[i].vendor_err = cqe->vendor_err;
+		wc[i].byte_len = cqe->byte_len;
+		// TODO: wc[i].qp
+		wc[i].ex.imm_data = cqe->imm_data;
+		wc[i].src_qp = cqe->src_qp;
+		wc[i].slid = cqe->slid;
+		wc[i].wc_flags = cqe->wc_flags;
+		wc[i].pkey_index = cqe->pkey_index;
+		wc[i].sl = cqe->sl;
+		wc[i].dlid_path_bits = cqe->dlid_path_bits;
+
+		vcq->cqe_get++;
+		i++;
+	}
+	spin_unlock_irqrestore(&vcq->lock, flags);
+	return i;
+}
+
+int virtio_rdma_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
+			  const struct ib_recv_wr **bad_wr)
+{
+	struct scatterlist *sgs[3], hdr, status_sg, *sge_sg;
+	struct virtio_rdma_qp *vqp = to_vqp(ibqp);
+	struct cmd_post_recv *cmd;
+	int *status, rc = 0;
+	unsigned tmp;
+
+	// TODO: mad support
+	if (vqp->ibqp.qp_type == IB_QPT_GSI || vqp->ibqp.qp_type == IB_QPT_SMI)
+		return 0;
+
+	// TODO: more than one wr
+	// TODO: check bad wr
+	spin_lock(&vqp->rq->lock);
+	status = &vqp->r_status;
+	cmd = vqp->r_cmd;
+
+	cmd->qpn = to_vqp(ibqp)->qp_handle;
+	cmd->is_kernel = vqp->type == VIRTIO_RDMA_TYPE_KERNEL;
+	cmd->num_sge = wr->num_sge;
+	cmd->wr_id = wr->wr_id;
+
+	sg_init_one(&hdr, cmd, sizeof(*cmd));
+	sgs[0] = &hdr;
+	// TODO: num_sge is zero
+	sge_sg = init_sg(wr->sg_list, sizeof(*wr->sg_list) * wr->num_sge);
+	sgs[1] = sge_sg;
+	sg_init_one(&status_sg, status, sizeof(*status));
+	sgs[2] = &status_sg;
+
+	rc = virtqueue_add_sgs(vqp->rq->vq, sgs, 2, 1, vqp, GFP_ATOMIC);
+	if (rc)
+		goto out;
+
+	rc = virtqueue_kick(vqp->rq->vq) ? 0 : -EIO;
+	if (unlikely(rc))
+		goto out;
+
+	while (!virtqueue_get_buf(vqp->rq->vq, &tmp) &&
+	       !virtqueue_is_broken(vqp->rq->vq))
+		cpu_relax();
+
+out:
+	spin_unlock(&vqp->rq->lock);
+	kfree(sge_sg);
+	return rc;
+}
+
+int virtio_rdma_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
+			  const struct ib_send_wr **bad_wr)
+{
+	struct scatterlist *sgs[3], hdr, status_sg, *sge_sg = NULL;
+	struct virtio_rdma_qp *vqp = to_vqp(ibqp);
+	struct cmd_post_send *cmd;
+	struct ib_sge dummy_sge;
+	int *status, rc = 0;
+	unsigned tmp;
+
+	// TODO: support more than one wr
+	// TODO: check bad wr
+	if (vqp->type == VIRTIO_RDMA_TYPE_KERNEL &&
+	    wr->opcode != IB_WR_SEND && wr->opcode != IB_WR_SEND_WITH_IMM &&
+	    wr->opcode != IB_WR_REG_MR &&
+	    wr->opcode != IB_WR_LOCAL_INV && wr->opcode != IB_WR_SEND_WITH_INV) {
+		pr_warn("Unsupported opcode %d for kernel QP\n", wr->opcode);
+		return -EINVAL;
+	}
+
+	spin_lock(&vqp->sq->lock);
+	cmd = vqp->s_cmd;
+	status = &vqp->s_status;
+
+	cmd->qpn = vqp->qp_handle;
+	cmd->is_kernel = vqp->type == VIRTIO_RDMA_TYPE_KERNEL;
+	cmd->num_sge = wr->num_sge;
+	cmd->send_flags = wr->send_flags;
+	cmd->opcode = wr->opcode;
+	cmd->wr_id = wr->wr_id;
+	cmd->ex.imm_data = wr->ex.imm_data;
+	cmd->ex.invalidate_rkey = wr->ex.invalidate_rkey;
+
+	switch (ibqp->qp_type) {
+	case IB_QPT_GSI:
+	case IB_QPT_UD:
+		pr_err("UD QPs are not supported yet\n");
+		rc = -EINVAL;
+		goto out;
+		break;
+	case IB_QPT_RC:
+		switch (wr->opcode) {
+		case IB_WR_RDMA_READ:
+		case IB_WR_RDMA_WRITE:
+		case IB_WR_RDMA_WRITE_WITH_IMM:
+			cmd->wr.rdma.remote_addr =
+				rdma_wr(wr)->remote_addr;
+			cmd->wr.rdma.rkey = rdma_wr(wr)->rkey;
+			break;
+		case IB_WR_LOCAL_INV:
+		case IB_WR_SEND_WITH_INV:
+			cmd->ex.invalidate_rkey =
+				wr->ex.invalidate_rkey;
+			break;
+		case IB_WR_ATOMIC_CMP_AND_SWP:
+		case IB_WR_ATOMIC_FETCH_AND_ADD:
+			cmd->wr.atomic.remote_addr =
+				atomic_wr(wr)->remote_addr;
+			cmd->wr.atomic.rkey = atomic_wr(wr)->rkey;
+			cmd->wr.atomic.compare_add =
+				atomic_wr(wr)->compare_add;
+			if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP)
+				cmd->wr.atomic.swap =
+					atomic_wr(wr)->swap;
+			break;
+		case IB_WR_REG_MR:
+			cmd->wr.reg.mrn = to_vmr(reg_wr(wr)->mr)->mr_handle;
+			cmd->wr.reg.key = reg_wr(wr)->key;
+			cmd->wr.reg.access = reg_wr(wr)->access;
+			break;
+		default:
+			break;
+		}
+		break;
+	default:
+		pr_err("Bad qp type\n");
+		rc = -EINVAL;
+		*bad_wr = wr;
+		goto out;
+	}
+
+	sg_init_one(&hdr, cmd, sizeof(*cmd));
+	sgs[0] = &hdr;
+	/* When sg_list is NULL, use a dummy sge to avoid the
+	 * "zero sized buffers are not allowed" error.
+	 */
+	if (wr->sg_list)
+		sge_sg = init_sg(wr->sg_list, sizeof(*wr->sg_list) * wr->num_sge);
+	else
+		sge_sg = init_sg(&dummy_sge, sizeof(dummy_sge));
+	sgs[1] = sge_sg;
+	sg_init_one(&status_sg, status, sizeof(*status));
+	sgs[2] = &status_sg;
+
+	rc = virtqueue_add_sgs(vqp->sq->vq, sgs, 2, 1, vqp, GFP_ATOMIC);
+	if (rc)
+		goto out;
+
+	rc = virtqueue_kick(vqp->sq->vq) ? 0 : -EIO;
+	if (unlikely(rc))
+		goto out;
+
+	while (!virtqueue_get_buf(vqp->sq->vq, &tmp) &&
+	       !virtqueue_is_broken(vqp->sq->vq))
+		cpu_relax();
+
+out:
+	spin_unlock(&vqp->sq->lock);
+	kfree(sge_sg);
+	return rc;
+}
+
+int virtio_rdma_req_notify_cq(struct ib_cq *ibcq,
+			      enum ib_cq_notify_flags flags)
+{
+	struct virtio_rdma_cq *vcq = to_vcq(ibcq);
+
+	if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED)
+		/*
+		 * Enable CQ event for next solicited completion.
+		 * and make it visible to all associated producers.
+		 */
+		smp_store_mb(vcq->notify_flags, VIRTIO_RDMA_NOTIFY_SOLICITED);
+	else
+		/*
+		 * Enable CQ event for any signalled completion.
+		 * and make it visible to all associated producers.
+		 */
+		smp_store_mb(vcq->notify_flags, VIRTIO_RDMA_NOTIFY_ALL);
+
+	if (flags & IB_CQ_REPORT_MISSED_EVENTS)
+		return vcq->cqe_put - vcq->cqe_get;
+
+	return 0;
+}
+
+static const struct ib_device_ops virtio_rdma_dev_ops = {
+	.owner = THIS_MODULE,
+	.driver_id = RDMA_DRIVER_VIRTIO,
+
+	.get_port_immutable = virtio_rdma_port_immutable,
+	.query_device = virtio_rdma_query_device,
+	.query_port = virtio_rdma_query_port,
+	.get_netdev = virtio_rdma_get_netdev,
+	.create_cq = virtio_rdma_create_cq,
+	.destroy_cq = virtio_rdma_destroy_cq,
+	.alloc_pd = virtio_rdma_alloc_pd,
+	.dealloc_pd = virtio_rdma_dealloc_pd,
+	.get_dma_mr = virtio_rdma_get_dma_mr,
+	.create_qp = virtio_rdma_create_qp,
+	.query_gid = virtio_rdma_query_gid,
+	.add_gid = virtio_rdma_add_gid,
+	.alloc_mr = virtio_rdma_alloc_mr,
+	.alloc_ucontext = virtio_rdma_alloc_ucontext,
+	.create_ah = virtio_rdma_create_ah,
+	.dealloc_ucontext = virtio_rdma_dealloc_ucontext,
+	.del_gid = virtio_rdma_del_gid,
+	.dereg_mr = virtio_rdma_dereg_mr,
+	.destroy_ah = virtio_rdma_destroy_ah,
+	.destroy_qp = virtio_rdma_destroy_qp,
+	.get_dev_fw_str = virtio_rdma_get_fw_ver_str,
+	.get_link_layer = virtio_rdma_port_link_layer,
+	.map_mr_sg = virtio_rdma_map_mr_sg,
+	.mmap = virtio_rdma_mmap,
+	.modify_port = virtio_rdma_modify_port,
+	.modify_qp = virtio_rdma_modify_qp,
+	.poll_cq = virtio_rdma_poll_cq,
+	.post_recv = virtio_rdma_post_recv,
+	.post_send = virtio_rdma_post_send,
+	.query_device = virtio_rdma_query_device,
+	.query_pkey = virtio_rdma_query_pkey,
+	.query_qp = virtio_rdma_query_qp,
+	.reg_user_mr = virtio_rdma_reg_user_mr,
+	.req_notify_cq = virtio_rdma_req_notify_cq,
+
+	INIT_RDMA_OBJ_SIZE(ib_ah, virtio_rdma_ah, ibah),
+	INIT_RDMA_OBJ_SIZE(ib_cq, virtio_rdma_cq, ibcq),
+	INIT_RDMA_OBJ_SIZE(ib_pd, virtio_rdma_pd, ibpd),
+	// INIT_RDMA_OBJ_SIZE(ib_srq, virtio_rdma_srq, base_srq),
+	INIT_RDMA_OBJ_SIZE(ib_ucontext, virtio_rdma_ucontext, ibucontext),
+};
+
+static ssize_t hca_type_show(struct device *device,
+			     struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "VRDMA-%s\n", VIRTIO_RDMA_DRIVER_VER);
+}
+static DEVICE_ATTR_RO(hca_type);
+
+static ssize_t hw_rev_show(struct device *device,
+			   struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", VIRTIO_RDMA_HW_REV);
+}
+static DEVICE_ATTR_RO(hw_rev);
+
+static ssize_t board_id_show(struct device *device,
+			     struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", VIRTIO_RDMA_BOARD_ID);
+}
+static DEVICE_ATTR_RO(board_id);
+
+static struct attribute *virtio_rdma_class_attributes[] = {
+	&dev_attr_hw_rev.attr,
+	&dev_attr_hca_type.attr,
+	&dev_attr_board_id.attr,
+	NULL,
+};
+
+static const struct attribute_group virtio_rdma_attr_group = {
+	.attrs = virtio_rdma_class_attributes,
+};
+
+int virtio_rdma_register_ib_device(struct virtio_rdma_dev *ri)
+{
+	int rc;
+	struct ib_device *dev = &ri->ib_dev;
+
+	strlcpy(dev->node_desc, "VirtIO RDMA", sizeof(dev->node_desc));
+
+	dev->dev.dma_ops = &dma_virt_ops;
+
+	dev->num_comp_vectors = 1;
+	dev->dev.parent = ri->vdev->dev.parent;
+	dev->node_type = RDMA_NODE_IB_CA;
+	dev->phys_port_cnt = 1;
+	dev->uverbs_cmd_mask =
+		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
+		(1ull << IB_USER_VERBS_CMD_QUERY_PORT)		|
+		(1ull << IB_USER_VERBS_CMD_CREATE_CQ)		|
+		(1ull << IB_USER_VERBS_CMD_DESTROY_CQ)		|
+		(1ull << IB_USER_VERBS_CMD_ALLOC_PD)		|
+		(1ull << IB_USER_VERBS_CMD_DEALLOC_PD);
+
+	ib_set_device_ops(dev, &virtio_rdma_dev_ops);
+	ib_device_set_netdev(dev, ri->netdev, 1);
+	rdma_set_device_sysfs_group(dev, &virtio_rdma_attr_group);
+
+	rc = ib_register_device(dev, "virtio_rdma%d");
+
+	memcpy(&dev->node_guid, dev->name, 6);
+	return rc;
+}
+
+void fini_ib(struct virtio_rdma_dev *ri)
+{
+	ib_unregister_device(&ri->ib_dev);
+}
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_ib.h b/drivers/infiniband/hw/virtio/virtio_rdma_ib.h
new file mode 100644
index 000000000000..ff5d6a41db4d
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_ib.h
@@ -0,0 +1,237 @@
+/*
+ * Virtio RDMA device: IB related functions and data
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef __VIRTIO_RDMA_IB__
+#define __VIRTIO_RDMA_IB__
+
+#include <linux/types.h>
+
+#include <rdma/ib_verbs.h>
+
+enum virtio_rdma_type {
+	VIRTIO_RDMA_TYPE_USER,
+	VIRTIO_RDMA_TYPE_KERNEL
+};
+
+struct virtio_rdma_pd {
+	struct ib_pd ibpd;
+	u32 pd_handle;
+	enum virtio_rdma_type type;
+};
+
+struct virtio_rdma_mr {
+	struct ib_mr ibmr;
+	struct ib_umem *umem;
+
+	u32 mr_handle;
+	enum virtio_rdma_type type;
+	u64 iova;
+	u64 size;
+
+	u64 *pages;
+	dma_addr_t dma_pages;
+	u32 npages;
+	u32 max_pages;
+};
+
+struct virtio_rdma_vq {
+	struct virtqueue *vq;
+	spinlock_t lock;
+	char name[16];
+	int idx;
+};
+
+struct virtio_rdma_cqe {
+	uint64_t		wr_id;
+	enum ib_wc_status status;
+	enum ib_wc_opcode opcode;
+	uint32_t vendor_err;
+	uint32_t byte_len;
+	uint32_t imm_data;
+	uint32_t qp_num;
+	uint32_t src_qp;
+	int	 wc_flags;
+	uint16_t pkey_index;
+	uint16_t slid;
+	uint8_t sl;
+	uint8_t dlid_path_bits;
+};
+
+enum {
+	VIRTIO_RDMA_NOTIFY_NOT = (0),
+	VIRTIO_RDMA_NOTIFY_SOLICITED = (1 << 0),
+	VIRTIO_RDMA_NOTIFY_NEXT_COMPLETION = (1 << 1),
+	VIRTIO_RDMA_NOTIFY_MISSED_EVENTS = (1 << 2),
+	VIRTIO_RDMA_NOTIFY_ALL = VIRTIO_RDMA_NOTIFY_SOLICITED | VIRTIO_RDMA_NOTIFY_NEXT_COMPLETION |
+			                 VIRTIO_RDMA_NOTIFY_MISSED_EVENTS
+};
+
+struct virtio_rdma_cq {
+	struct ib_cq ibcq;
+	u32 cq_handle;
+
+	struct virtio_rdma_vq *vq;
+
+	spinlock_t lock;
+	struct virtio_rdma_cqe *queue;
+	u32 cqe_enqueue;
+	u32 cqe_put;
+	u32 cqe_get;
+	u32 num_cqe;
+
+	u32 notify_flags;
+	atomic_t cqe_cnt;
+};
+
+struct virtio_rdma_qp {
+	struct ib_qp ibqp;
+	u32 qp_handle;
+	enum virtio_rdma_type type;
+	u8 port;
+
+	struct virtio_rdma_vq *sq;
+	int s_status;
+	struct cmd_post_send *s_cmd;
+
+	struct virtio_rdma_vq *rq;
+	int r_status;
+	struct cmd_post_recv *r_cmd;
+};
+
+struct virtio_rdma_global_route {
+	union ib_gid		dgid;
+	uint32_t		flow_label;
+	uint8_t			sgid_index;
+	uint8_t			hop_limit;
+	uint8_t			traffic_class;
+};
+
+struct virtio_rdma_ah_attr {
+	struct virtio_rdma_global_route	grh;
+	uint16_t			dlid;
+	uint8_t				sl;
+	uint8_t				src_path_bits;
+	uint8_t				static_rate;
+	uint8_t				port_num;
+};
+
+struct virtio_rdma_qp_cap {
+	uint32_t		max_send_wr;
+	uint32_t		max_recv_wr;
+	uint32_t		max_send_sge;
+	uint32_t		max_recv_sge;
+	uint32_t		max_inline_data;
+};
+
+struct virtio_rdma_qp_attr {
+	enum ib_qp_state	qp_state;
+	enum ib_qp_state	cur_qp_state;
+	enum ib_mtu		path_mtu;
+	enum ib_mig_state	path_mig_state;
+	uint32_t			qkey;
+	uint32_t			rq_psn;
+	uint32_t			sq_psn;
+	uint32_t			dest_qp_num;
+	uint32_t			qp_access_flags;
+	uint16_t			pkey_index;
+	uint16_t			alt_pkey_index;
+	uint8_t			en_sqd_async_notify;
+	uint8_t			sq_draining;
+	uint8_t			max_rd_atomic;
+	uint8_t			max_dest_rd_atomic;
+	uint8_t			min_rnr_timer;
+	uint8_t			port_num;
+	uint8_t			timeout;
+	uint8_t			retry_cnt;
+	uint8_t			rnr_retry;
+	uint8_t			alt_port_num;
+	uint8_t			alt_timeout;
+	uint32_t			rate_limit;
+	struct virtio_rdma_qp_cap	cap;
+	struct virtio_rdma_ah_attr	ah_attr;
+	struct virtio_rdma_ah_attr	alt_ah_attr;
+};
+
+struct virtio_rdma_uar_map {
+	unsigned long pfn;
+	void __iomem *map;
+	int index;
+};
+
+struct virtio_rdma_ucontext {
+	struct ib_ucontext ibucontext;
+	struct virtio_rdma_dev *dev;
+	struct virtio_rdma_uar_map uar;
+	__u64 ctx_handle;
+};
+
+struct virtio_rdma_av {
+	__u32 port_pd;
+	__u32 sl_tclass_flowlabel;
+	__u8 dgid[16];
+	__u8 src_path_bits;
+	__u8 gid_index;
+	__u8 stat_rate;
+	__u8 hop_limit;
+	__u8 dmac[6];
+	__u8 reserved[6];
+};
+
+struct virtio_rdma_ah {
+	struct ib_ah ibah;
+	struct virtio_rdma_av av;
+};
+
+void virtio_rdma_cq_ack(struct virtqueue *vq);
+
+static inline struct virtio_rdma_ah *to_vah(struct ib_ah *ibah)
+{
+	return container_of(ibah, struct virtio_rdma_ah, ibah);
+}
+
+static inline struct virtio_rdma_pd *to_vpd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct virtio_rdma_pd, ibpd);
+}
+
+static inline struct virtio_rdma_cq *to_vcq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct virtio_rdma_cq, ibcq);
+}
+
+static inline struct virtio_rdma_qp *to_vqp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct virtio_rdma_qp, ibqp);
+}
+
+static inline struct virtio_rdma_mr *to_vmr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct virtio_rdma_mr, ibmr);
+}
+
+static inline struct virtio_rdma_ucontext *to_vucontext(struct ib_ucontext *ibucontext)
+{
+	return container_of(ibucontext, struct virtio_rdma_ucontext, ibucontext);
+}
+
+int virtio_rdma_register_ib_device(struct virtio_rdma_dev *ri);
+void fini_ib(struct virtio_rdma_dev *ri);
+
+#endif
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_main.c b/drivers/infiniband/hw/virtio/virtio_rdma_main.c
new file mode 100644
index 000000000000..8f467ee62cf2
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_main.c
@@ -0,0 +1,152 @@
+/*
+ * Virtio RDMA device
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include <linux/err.h>
+#include <linux/scatterlist.h>
+#include <linux/spinlock.h>
+#include <linux/virtio.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <uapi/linux/virtio_ids.h>
+
+#include "virtio_rdma.h"
+#include "virtio_rdma_device.h"
+#include "virtio_rdma_ib.h"
+#include "virtio_rdma_netdev.h"
+
+/* TODO:
+ * - How to hook driver unload: we need to undo everything we did
+ *   for all the devices that were probed
+ * -
+ */
+
+static int virtio_rdma_probe(struct virtio_device *vdev)
+{
+	struct virtio_rdma_dev *ri;
+	int rc = -EIO;
+
+	ri = ib_alloc_device(virtio_rdma_dev, ib_dev);
+	if (!ri) {
+		pr_err("Fail to allocate IB device\n");
+		rc = -ENOMEM;
+		goto out;
+	}
+	vdev->priv = ri;
+
+	ri->vdev = vdev;
+
+	spin_lock_init(&ri->ctrl_lock);
+
+	rc = init_device(ri);
+	if (rc) {
+		pr_err("Fail to connect to device\n");
+		goto out_dealloc_ib_device;
+	}
+
+	rc = init_netdev(ri);
+	if (rc) {
+		pr_err("Fail to connect to NetDev layer\n");
+		goto out_fini_device;
+	}
+
+	rc = virtio_rdma_register_ib_device(ri);
+	if (rc) {
+		pr_err("Fail to connect to IB layer\n");
+		goto out_fini_netdev;
+	}
+
+	pr_info("VirtIO RDMA device %d probed\n", vdev->index);
+
+	goto out;
+
+out_fini_netdev:
+	fini_netdev(ri);
+
+out_fini_device:
+	fini_device(ri);
+
+out_dealloc_ib_device:
+	ib_dealloc_device(&ri->ib_dev);
+
+	vdev->priv = NULL;
+
+out:
+	return rc;
+}
+
+static void virtio_rdma_remove(struct virtio_device *vdev)
+{
+	struct virtio_rdma_dev *ri = vdev->priv;
+
+	if (!ri)
+		return;
+
+	vdev->priv = NULL;
+
+	fini_ib(ri);
+
+	fini_netdev(ri);
+
+	fini_device(ri);
+
+	ib_dealloc_device(&ri->ib_dev);
+
+	pr_info("VirtIO RDMA device %d removed\n", vdev->index);
+}
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_RDMA, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct virtio_driver virtio_rdma_driver = {
+	.driver.name	= KBUILD_MODNAME,
+	.driver.owner	= THIS_MODULE,
+	.id_table	= id_table,
+	.probe		= virtio_rdma_probe,
+	.remove		= virtio_rdma_remove,
+};
+
+static int __init virtio_rdma_init(void)
+{
+	int rc;
+
+	rc = register_virtio_driver(&virtio_rdma_driver);
+	if (rc) {
+		pr_err("%s: Fail to register virtio driver (%d)\n", __func__,
+		       rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+static void __exit virtio_rdma_fini(void)
+{
+	unregister_virtio_driver(&virtio_rdma_driver);
+}
+
+module_init(virtio_rdma_init);
+module_exit(virtio_rdma_fini);
+
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_AUTHOR("Yuval Shaia, Junji Wei");
+MODULE_DESCRIPTION("Virtio RDMA driver");
+MODULE_LICENSE("Dual BSD/GPL");
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_netdev.c b/drivers/infiniband/hw/virtio/virtio_rdma_netdev.c
new file mode 100644
index 000000000000..641a07b630bd
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_netdev.c
@@ -0,0 +1,68 @@
+/*
+ * Virtio RDMA device
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include <linux/virtio_pci.h>
+#include <linux/pci_ids.h>
+#include <linux/virtio_ids.h>
+
+#include "../../../virtio/virtio_pci_common.h"
+#include "virtio_rdma_netdev.h"
+
+int init_netdev(struct virtio_rdma_dev *ri)
+{
+	struct pci_dev *pdev_net;
+	struct virtio_pci_device *vp_dev = to_vp_device(ri->vdev);
+	struct virtio_pci_device *vnet_pdev;
+	void *priv;
+
+	pdev_net = pci_get_slot(vp_dev->pci_dev->bus, PCI_DEVFN(PCI_SLOT(vp_dev->pci_dev->devfn), 0));
+	if (!pdev_net) {
+		pr_err("failed to find paired net device\n");
+		return -ENODEV;
+	}
+
+	if (pdev_net->vendor != PCI_VENDOR_ID_REDHAT_QUMRANET ||
+	    pdev_net->subsystem_device != VIRTIO_ID_NET) {
+		pr_err("failed to find paired virtio-net device\n");
+		pci_dev_put(pdev_net);
+		return -ENODEV;
+	}
+
+	vnet_pdev = pci_get_drvdata(pdev_net);
+	pci_dev_put(pdev_net);
+
+	priv = vnet_pdev ? vnet_pdev->vdev.priv : NULL;
+	if (!priv) {
+		pr_err("failed to get backend net device\n");
+		return -ENODEV;
+	}
+	/* get netdev from virtnet_info, which is netdev->priv */
+	ri->netdev = priv - ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
+	dev_hold(ri->netdev);
+	return 0;
+}
+
+void fini_netdev(struct virtio_rdma_dev *ri)
+{
+	if (ri->netdev) {
+		dev_put(ri->netdev);
+		ri->netdev = NULL;
+	}
+}
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_netdev.h b/drivers/infiniband/hw/virtio/virtio_rdma_netdev.h
new file mode 100644
index 000000000000..d9ca263f8bff
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_netdev.h
@@ -0,0 +1,29 @@
+/*
+ * Virtio RDMA device: Netdev related functions and data
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef __VIRTIO_RDMA_NETDEV__
+#define __VIRTIO_RDMA_NETDEV__
+
+#include "virtio_rdma.h"
+
+int init_netdev(struct virtio_rdma_dev *ri);
+void fini_netdev(struct virtio_rdma_dev *ri);
+
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 70a8057ad4bb..7dba3cd48e72 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -55,6 +55,7 @@
 #define VIRTIO_ID_FS			26 /* virtio filesystem */
 #define VIRTIO_ID_PMEM			27 /* virtio pmem */
 #define VIRTIO_ID_MAC80211_HWSIM	29 /* virtio mac80211-hwsim */
+#define VIRTIO_ID_RDMA			30 /* virtio rdma */
 #define VIRTIO_ID_BT			40 /* virtio bluetooth */
 
 /*
-- 
2.11.0


* [RFC 3/5] RDMA/virtio-rdma: VirtIO RDMA test module
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
  2021-09-02 13:06 ` [RFC 1/5] RDMA/virtio-rdma Introduce a new core cap prot Junji Wei
  2021-09-02 13:06 ` [RFC 2/5] RDMA/virtio-rdma: VirtIO RDMA driver Junji Wei
@ 2021-09-02 13:06 ` Junji Wei
  2021-09-02 13:06 ` [RFC 4/5] virtio-net: Move some virtio-net-pci decl to include/hw/virtio Junji Wei
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Junji Wei @ 2021-09-02 13:06 UTC (permalink / raw)
  To: dledford, jgg, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare
  Cc: xieyongji, chaiwen.cc, weijunji, linux-rdma, virtualization, qemu-devel

This is a test module for virtio-rdma. It works with the
ibv_rc_pingpong server included in rdma-core.

Signed-off-by: Junji Wei <weijunji@bytedance.com>
---
 drivers/infiniband/hw/virtio/Makefile              |   1 +
 .../hw/virtio/virtio_rdma_rc_pingpong_client.c     | 477 +++++++++++++++++++++
 2 files changed, 478 insertions(+)
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_rc_pingpong_client.c
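
Note for reviewers (not part of the commit message): before any verbs
traffic, the client and the ibv_rc_pingpong server exchange a short text
record over TCP. The sketch below is a user-space illustration only. It
assumes the "lid:qpn:psn:gid" layout that the rdma-core example appears to
use (4, 6 and 6 hex digits, then 32 hex digits of GID). The names pp_dest,
pp_format and pp_parse are hypothetical and do not come from this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the handshake record; field names are assumptions. */
struct pp_dest {
	unsigned int lid;
	unsigned int qpn;
	unsigned int psn;
	uint8_t gid[16];
};

/* Assumed wire layout: "LLLL:QQQQQQ:PPPPPP:" followed by 32 hex GID digits. */
#define PP_MSG_LEN \
	(sizeof("0000:000000:000000:00000000000000000000000000000000") - 1)

/* Format d into buf; returns the message length, or -1 if buf is too small. */
static int pp_format(const struct pp_dest *d, char *buf, size_t buflen)
{
	int n, i;

	if (buflen < PP_MSG_LEN + 1)
		return -1;

	n = snprintf(buf, buflen, "%04x:%06x:%06x:", d->lid & 0xffff,
		     d->qpn & 0xffffff, d->psn & 0xffffff);
	for (i = 0; i < 16; i++)
		n += snprintf(buf + n, buflen - n, "%02x", d->gid[i]);

	assert((size_t)n == PP_MSG_LEN);
	return n;
}

/* Parse a NUL-terminated message into d; returns 0 on success, -1 on error. */
static int pp_parse(const char *msg, struct pp_dest *d)
{
	char gid_hex[33];
	unsigned int byte;
	int i;

	if (strlen(msg) != PP_MSG_LEN)
		return -1;
	if (sscanf(msg, "%x:%x:%x:%32s", &d->lid, &d->qpn, &d->psn,
		   gid_hex) != 4)
		return -1;
	if (strlen(gid_hex) != 32)
		return -1;

	for (i = 0; i < 16; i++) {
		if (sscanf(gid_hex + 2 * i, "%2x", &byte) != 1)
			return -1;
		d->gid[i] = (uint8_t)byte;
	}
	return 0;
}
```

A caller would fill a struct pp_dest, call pp_format() into a buffer of at
least PP_MSG_LEN + 1 bytes, send that buffer over the TCP socket, and run
pp_parse() on the peer's reply. A kernel client would need the equivalent
logic built on kernel helpers rather than stdio.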

diff --git a/drivers/infiniband/hw/virtio/Makefile b/drivers/infiniband/hw/virtio/Makefile
index fb637e467167..eb72a0aa48f3 100644
--- a/drivers/infiniband/hw/virtio/Makefile
+++ b/drivers/infiniband/hw/virtio/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA) += virtio_rdma.o
+obj-m += virtio_rdma_rc_pingpong_client.o
 
 virtio_rdma-y := virtio_rdma_main.o virtio_rdma_device.o virtio_rdma_ib.o \
 		 virtio_rdma_netdev.o
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_rc_pingpong_client.c b/drivers/infiniband/hw/virtio/virtio_rdma_rc_pingpong_client.c
new file mode 100644
index 000000000000..d1a38fe8f8cd
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma_rc_pingpong_client.c
@@ -0,0 +1,477 @@
+/*
+ * Virtio RDMA device: Test client
+ *
+ * Copyright (C) 2021 Junji Wei Bytedance Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+
+#include <asm/dma.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_cache.h>
+#include "../../core/uverbs.h"
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Junji Wei");
+MODULE_DESCRIPTION("Virtio rdma test module");
+MODULE_VERSION("0.01");
+
+#define SERVER_ADDR "10.131.251.125"
+#define SERVER_PORT 18515
+
+#define RX_DEPTH 500
+#define ITER 500
+#define PAGES 5
+
+struct pingpong_dest {
+    int 				lid;
+    int 				out_reads;
+    int 				qpn;
+    int 				psn;
+    unsigned			rkey;
+    unsigned long long		vaddr;
+    union ib_gid			gid;
+    unsigned			srqn;
+    int				gid_index;
+};
+
+static struct ib_device* open_dev(char* path)
+{
+	struct ib_device *ib_dev;
+    struct ib_uverbs_file *file;
+    struct file* filp;
+    struct ib_port_attr port_attr;
+    int rc;
+
+    filp = filp_open(path, O_RDWR | O_CLOEXEC, 0);
+    if (IS_ERR(filp)) {
+        pr_err("Open failed\n");
+        return NULL;
+    }
+    file = filp->private_data;
+    ib_dev = file->device->ib_dev;
+    if (!ib_dev)
+        pr_err("Get ib_dev failed\n");
+    pr_info("Open ib_device %s\n", ib_dev->node_desc);
+
+    /* test query_port */
+    rc = ib_query_port(ib_dev, 1, &port_attr);
+    if (rc)
+        pr_err("Query port failed\n");
+    pr_info("Port gid_tbl_len %d\n", port_attr.gid_tbl_len);
+
+	return ib_dev;
+}
+
+static struct socket* ethernet_client_connect(void)
+{
+	struct socket *sock;
+    struct sockaddr_in s_addr;
+    int ret;
+
+    memset(&s_addr,0,sizeof(s_addr));
+    s_addr.sin_family=AF_INET;
+    s_addr.sin_port=htons(SERVER_PORT);
+    
+    s_addr.sin_addr.s_addr = in_aton(SERVER_ADDR);
+
+    /* create a socket; sock_create_kern() allocates it for us */
+    ret = sock_create_kern(&init_net, AF_INET, SOCK_STREAM, 0, &sock);
+    if (ret < 0) {
+        pr_err("client: socket create error\n");
+        return NULL;
+    }
+    pr_info("client: socket create ok\n");
+
+    /*connect server*/
+    ret = sock->ops->connect(sock, (struct sockaddr *)&s_addr, sizeof(s_addr), 0);
+    if (ret) {
+        pr_err("client: connect error\n");
+        return NULL;
+    }
+    pr_info("client: connect ok\n");
+
+    return sock;
+}
+
+static int ethernet_read_data(struct socket *sock, char* buf, int size) {
+    struct kvec vec;
+    struct msghdr msg;
+    int ret;
+
+    memset(&vec,0,sizeof(vec));
+    memset(&msg,0,sizeof(msg));
+    vec.iov_base = buf;
+    vec.iov_len = size;
+
+    ret = kernel_recvmsg(sock, &msg, &vec, 1, size, 0);
+    if (ret < 0) {
+        pr_err("read failed\n");
+        return ret;
+    }
+    return ret;
+}
+
+static int ethernet_write_data(struct socket *sock, char* buf, int size) {  
+    struct kvec vec;
+    struct msghdr msg;
+    int ret;
+
+    vec.iov_base = buf;
+    vec.iov_len = size;
+
+    memset(&msg,0,sizeof(msg));
+
+    ret = kernel_sendmsg(sock, &msg, &vec, 1, size);
+    if (ret < 0) {
+        pr_err("kernel_sendmsg error\n");
+        return ret;
+    } else if (ret != size) {
+        pr_info("write ret != size\n");
+    }
+
+    pr_info("send success\n");
+    return ret;
+}
+
+static void gid_to_wire_gid(const union ib_gid *gid, char wgid[])
+{
+	uint32_t tmp_gid[4];
+	int i;
+
+	memcpy(tmp_gid, gid, sizeof(tmp_gid));
+	for (i = 0; i < 4; ++i)
+		sprintf(&wgid[i * 8], "%08x", cpu_to_be32(tmp_gid[i]));
+}
+
+static void wire_gid_to_gid(const char *wgid, union ib_gid *gid)
+{
+	char tmp[9];
+	__be32 v32;
+	int i;
+	uint32_t tmp_gid[4];
+
+	for (tmp[8] = 0, i = 0; i < 4; ++i) {
+		memcpy(tmp, wgid + i * 8, 8);
+		sscanf(tmp, "%x", &v32);
+		tmp_gid[i] = be32_to_cpu(v32);
+	}
+	memcpy(gid, tmp_gid, sizeof(*gid));
+}
+
+static struct pingpong_dest *pp_client_exch_dest(const struct pingpong_dest *my_dest)
+{
+    struct socket* sock;
+	char msg[sizeof "0000:000000:000000:00000000000000000000000000000000"];
+	struct pingpong_dest *rem_dest = NULL;
+	char gid[33];
+
+    sock = ethernet_client_connect();
+    if (!sock) {
+        return NULL;
+    }
+
+	gid_to_wire_gid(&my_dest->gid, gid);
+	sprintf(msg, "%04x:%06x:%06x:%s", my_dest->lid, my_dest->qpn,
+							my_dest->psn, gid);
+    pr_info("Local %s\n", msg);
+	if (ethernet_write_data(sock, msg, sizeof msg) != sizeof msg) {
+		pr_err("Couldn't send local address\n");
+		goto out;
+	}
+
+	if (ethernet_read_data(sock, msg, sizeof msg) != sizeof msg ||
+	    ethernet_write_data(sock, "done", sizeof "done") != sizeof "done") {
+		pr_err("Couldn't read/write remote address\n");
+		goto out;
+	}
+
+	rem_dest = kzalloc(sizeof *rem_dest, GFP_KERNEL);
+	if (!rem_dest)
+		goto out;
+
+    pr_info("Remote %s\n", msg);
+	sscanf(msg, "%x:%x:%x:%s", &rem_dest->lid, &rem_dest->qpn,
+						&rem_dest->psn, gid);
+	wire_gid_to_gid(gid, &rem_dest->gid);
+out:
+	sock_release(sock);
+	return rem_dest;
+}
+
+static int __init rdma_test_init(void) {
+    struct ib_device* ib_dev;
+    struct ib_pd* pd;
+    struct ib_mr *mr, *mr_recv;
+    uint64_t dma_addr, dma_addr_recv;
+    struct scatterlist sg;
+    struct scatterlist sgr;
+    const struct ib_cq_init_attr cq_attr = { 64, 0, 0 };
+    struct ib_cq *cq;
+    struct ib_qp *qp;
+    struct ib_qp_init_attr qp_init_attr = {
+        .event_handler = NULL,
+        .qp_context = NULL,
+        .srq = NULL,
+        .xrcd = NULL,
+        .cap = {
+            RX_DEPTH, RX_DEPTH, 1, 1, -1, 0
+        },
+        .sq_sig_type = IB_SIGNAL_ALL_WR,
+        .qp_type = IB_QPT_RC,
+        .create_flags = 0,
+        .port_num = 0,
+        .rwq_ind_tbl = NULL,
+        .source_qpn = 0
+    };
+    struct ib_qp_attr qp_attr = {};
+    struct ib_port_attr port_attr;
+    struct pingpong_dest my_dest;
+    struct pingpong_dest *rem_dest;
+    int mask, rand_num, iter;
+    struct ib_rdma_wr swr;
+    const struct ib_send_wr *bad_swr;
+    struct ib_recv_wr rwr;
+    const struct ib_recv_wr *bad_rwr;
+    struct ib_sge wsge[1], rsge[1];
+    uint64_t *addr_send, *addr_recv;
+    int i, wc_got;
+    struct ib_wc wc[2];
+    struct ib_reg_wr reg_wr;
+
+    ktime_t t0;
+    uint64_t rt;
+    int wc_total = 0;
+
+    pr_info("Start rdma test\n");
+    pr_info("Normal address: 0x%lx -- 0x%px\n", MAX_DMA_ADDRESS, high_memory);
+    ib_dev = open_dev("/dev/infiniband/uverbs0");
+    if (!ib_dev)
+        return -ENODEV;
+    pd = ib_alloc_pd(ib_dev, 0);
+    if (IS_ERR(pd)) {
+        pr_err("alloc_pd failed\n");
+        return PTR_ERR(pd);
+    }
+
+    mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, PAGES);
+    mr_recv = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, PAGES);
+    if (IS_ERR(mr) || IS_ERR(mr_recv)) {
+        pr_err("alloc_mr failed\n");
+        return -EIO;
+    }
+
+    addr_send = ib_dma_alloc_coherent(ib_dev, PAGE_SIZE * PAGES, &dma_addr, GFP_KERNEL);
+    memset((char*)addr_send, '?', 4096 * PAGES);
+    sg_dma_address(&sg) = dma_addr;
+	sg_dma_len(&sg) = PAGE_SIZE * PAGES;
+    ib_map_mr_sg(mr, &sg, 1, NULL, PAGE_SIZE);
+
+    addr_recv = ib_dma_alloc_coherent(ib_dev, PAGE_SIZE * PAGES, &dma_addr_recv, GFP_KERNEL);
+    sg_dma_address(&sgr) = dma_addr_recv;
+	sg_dma_len(&sgr) = PAGE_SIZE * PAGES;
+    ib_map_mr_sg(mr_recv, &sgr, 1, NULL, PAGE_SIZE);
+
+    memset((char*)addr_recv, 'x', 4096 * PAGES);
+    strcpy((char*)addr_recv, "hello world");
+    pr_info("Before %s\n", (char*)addr_send);
+    pr_info("Before %s\n", (char*)addr_recv);
+
+    cq = ib_create_cq(ib_dev, NULL, NULL, NULL, &cq_attr);
+    if (IS_ERR(cq)) {
+        pr_err("create_cq failed\n");
+        return PTR_ERR(cq);
+    }
+    qp_init_attr.send_cq = cq;
+    qp_init_attr.recv_cq = cq;
+    pr_info("qp type: %d\n", qp_init_attr.qp_type);
+    qp = ib_create_qp(pd, &qp_init_attr);
+    if (IS_ERR(qp)) {
+        pr_err("create_qp failed\n");
+        return PTR_ERR(qp);
+    }
+    // modify to init
+    memset(&qp_attr, 0, sizeof(qp_attr));
+    mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT;
+    qp_attr.qp_state = IB_QPS_INIT;
+    qp_attr.port_num = 1;
+    qp_attr.pkey_index = 0;
+    qp_attr.qp_access_flags = 0;
+    ib_modify_qp(qp, &qp_attr, mask);
+
+    memset(&reg_wr, 0, sizeof(reg_wr));
+	reg_wr.wr.opcode = IB_WR_REG_MR;
+	reg_wr.wr.num_sge = 0;
+	reg_wr.mr = mr;
+	reg_wr.key = mr->lkey;
+	reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+    ib_post_send(qp, &reg_wr.wr, &bad_swr);
+
+    memset(&reg_wr, 0, sizeof(reg_wr));
+	reg_wr.wr.opcode = IB_WR_REG_MR;
+	reg_wr.wr.num_sge = 0;
+	reg_wr.mr = mr_recv;
+	reg_wr.key = mr_recv->lkey;
+	reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+    ib_post_send(qp, &reg_wr.wr, &bad_swr);
+
+    // post recv
+    rsge[0].addr = dma_addr_recv;
+    rsge[0].length = 4096 * PAGES;
+    rsge[0].lkey = mr_recv->lkey;
+
+    rwr.next = NULL;
+    rwr.wr_id = 1;
+    rwr.sg_list = rsge;
+    rwr.num_sge = 1;
+    for (i = 0; i < ITER; i++) {
+        if (ib_post_recv(qp, &rwr, &bad_rwr)) {
+            pr_err("post recv failed\n");
+            return -EIO;
+        }
+    }
+
+    // exchange info
+    if (ib_query_port(ib_dev, 1, &port_attr))
+        pr_err("query port failed\n");
+    my_dest.lid = port_attr.lid;
+
+    // TODO: fix rdma_query_gid
+    if (rdma_query_gid(ib_dev, 1, 1, &my_dest.gid))
+        pr_err("query gid failed\n");
+
+    get_random_bytes(&rand_num, sizeof(rand_num));
+    my_dest.gid_index = 1;
+    my_dest.qpn = qp->qp_num;
+    my_dest.psn = rand_num & 0xffffff;
+
+    pr_info("  local address:  LID 0x%04x, QPN 0x%06x, PSN 0x%06x, GID %pI6\n",
+	         my_dest.lid, my_dest.qpn, my_dest.psn, &my_dest.gid);
+
+    rem_dest = pp_client_exch_dest(&my_dest);
+    if (!rem_dest) {
+        return -EIO;
+    }
+
+    pr_info("  remote address: LID 0x%04x, QPN 0x%06x, PSN 0x%06x, GID %pI6\n",
+	       rem_dest->lid, rem_dest->qpn, rem_dest->psn, &rem_dest->gid);
+
+    my_dest.rkey = mr->rkey;
+    my_dest.out_reads = 1;
+    my_dest.vaddr = dma_addr;
+    my_dest.srqn = 0;
+
+    // modify to rtr
+    memset(&qp_attr, 0, sizeof(qp_attr));
+    mask = IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER | IB_QP_MAX_DEST_RD_ATOMIC;
+	qp_attr.qp_state		= IB_QPS_RTR;
+	qp_attr.path_mtu		= IB_MTU_1024;
+	qp_attr.dest_qp_num		= rem_dest->qpn;
+	qp_attr.rq_psn			= rem_dest->psn;
+	qp_attr.max_dest_rd_atomic	= 1;
+	qp_attr.min_rnr_timer		= 12;
+    qp_attr.ah_attr.ah_flags = IB_AH_GRH;
+	qp_attr.ah_attr.ib.dlid = rem_dest->lid; // is_global  lid
+    qp_attr.ah_attr.ib.src_path_bits = 0;
+	qp_attr.ah_attr.sl		= 0;
+	qp_attr.ah_attr.port_num	= 1;
+
+	if (rem_dest->gid.global.interface_id) {
+		qp_attr.ah_attr.grh.hop_limit = 1;
+		qp_attr.ah_attr.grh.dgid = rem_dest->gid;
+		qp_attr.ah_attr.grh.sgid_index = my_dest.gid_index;
+	}
+
+    if (ib_modify_qp(qp, &qp_attr, mask)) {
+        pr_err("Failed to modify to RTR\n");
+        return -EIO;
+    }
+
+    // modify to rts
+    memset(&qp_attr, 0, sizeof(qp_attr));
+    mask = IB_QP_STATE | IB_QP_SQ_PSN | IB_QP_TIMEOUT | IB_QP_RETRY_CNT | IB_QP_RNR_RETRY | IB_QP_MAX_QP_RD_ATOMIC;
+    qp_attr.qp_state = IB_QPS_RTS;
+	qp_attr.sq_psn = my_dest.psn;
+    qp_attr.timeout   = 14;
+    qp_attr.retry_cnt = 7;
+    qp_attr.rnr_retry = 7;
+    qp_attr.max_rd_atomic  = 1;
+    if (ib_modify_qp(qp, &qp_attr, mask)) {
+        pr_err("Failed to modify to RTS\n");
+        return -EIO;
+    }
+    wsge[0].addr = dma_addr;
+    wsge[0].length = 4096 * PAGES;
+    wsge[0].lkey = mr->lkey;
+
+    swr.wr.next = NULL;
+    swr.wr.wr_id = 2;
+    swr.wr.sg_list = wsge;
+    swr.wr.num_sge = 1;
+    swr.wr.opcode = IB_WR_SEND;
+    swr.wr.send_flags = IB_SEND_SIGNALED;
+    swr.remote_addr = rem_dest->vaddr;
+    swr.rkey = rem_dest->rkey;
+
+    t0 = ktime_get();
+
+    for (iter = 0; iter < ITER; iter++) {
+        if (ib_post_send(qp, &swr.wr, &bad_swr)) {
+            pr_err("post send failed\n");
+            return -EIO;
+        }
+
+        do {
+            wc_got = ib_poll_cq(cq, 2, wc);
+        } while(wc_got < 1);
+        wc_total += wc_got;
+    }
+
+    pr_info("Total wc %d\n", wc_total);
+    do {
+        wc_total += ib_poll_cq(cq, 2, wc);
+    }while(wc_total < ITER * 2);
+
+    rt = ktime_to_us(ktime_sub(ktime_get(), t0));
+    pr_info("%d iters in %lld us = %lld usec/iter\n", ITER, rt, rt / ITER);
+    pr_info("%d bytes in %lld us = %lld Mbit/sec\n", ITER * 4096 * PAGES * 2, rt, (uint64_t)ITER * 4096 * PAGES * 2 * 8 / rt);
+
+    pr_info("After %s\n", (char*)addr_send);
+    pr_info("After %s\n", (char*)addr_recv);
+
+    ib_destroy_qp(qp);
+    ib_destroy_cq(cq);
+    ib_dereg_mr(mr);
+    ib_dereg_mr(mr_recv);
+    ib_dealloc_pd(pd);
+    return 0;
+}
+
+static void __exit rdma_test_exit(void) {
+    pr_info("Exit rdma test\n");
+}
+
+module_init(rdma_test_init);
+module_exit(rdma_test_exit);
-- 
2.11.0



* [RFC 4/5] virtio-net: Move some virtio-net-pci decl to include/hw/virtio
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
                   ` (2 preceding siblings ...)
  2021-09-02 13:06 ` [RFC 3/5] RDMA/virtio-rdma: VirtIO RDMA test module Junji Wei
@ 2021-09-02 13:06 ` Junji Wei
  2021-09-02 13:06 ` [RFC 5/5] hw/virtio-rdma: VirtIO rdma device Junji Wei
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Junji Wei @ 2021-09-02 13:06 UTC (permalink / raw)
  To: dledford, jgg, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare
  Cc: xieyongji, chaiwen.cc, weijunji, linux-rdma, virtualization, qemu-devel

From: Yuval Shaia <yuval.shaia.ml@gmail.com>

This patch is from Yuval Shaia's [RFC 1/3]

Signed-off-by: Yuval Shaia <yuval.shaia.ml@gmail.com>
---
 hw/virtio/virtio-net-pci.c         | 18 ++----------------
 include/hw/virtio/virtio-net-pci.h | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 16 deletions(-)
 create mode 100644 include/hw/virtio/virtio-net-pci.h

diff --git a/hw/virtio/virtio-net-pci.c b/hw/virtio/virtio-net-pci.c
index 292d13d278..6cea7e0441 100644
--- a/hw/virtio/virtio-net-pci.c
+++ b/hw/virtio/virtio-net-pci.c
@@ -18,26 +18,12 @@
 #include "qemu/osdep.h"
 
 #include "hw/qdev-properties.h"
-#include "hw/virtio/virtio-net.h"
+#include "hw/virtio/virtio-net-pci.h"
 #include "virtio-pci.h"
 #include "qapi/error.h"
 #include "qemu/module.h"
 #include "qom/object.h"
 
-typedef struct VirtIONetPCI VirtIONetPCI;
-
-/*
- * virtio-net-pci: This extends VirtioPCIProxy.
- */
-#define TYPE_VIRTIO_NET_PCI "virtio-net-pci-base"
-DECLARE_INSTANCE_CHECKER(VirtIONetPCI, VIRTIO_NET_PCI,
-                         TYPE_VIRTIO_NET_PCI)
-
-struct VirtIONetPCI {
-    VirtIOPCIProxy parent_obj;
-    VirtIONet vdev;
-};
-
 static Property virtio_net_properties[] = {
     DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
                     VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
@@ -84,7 +70,7 @@ static void virtio_net_pci_instance_init(Object *obj)
 
 static const VirtioPCIDeviceTypeInfo virtio_net_pci_info = {
     .base_name             = TYPE_VIRTIO_NET_PCI,
-    .generic_name          = "virtio-net-pci",
+    .generic_name          = TYPE_VIRTIO_NET_PCI_GENERIC,
     .transitional_name     = "virtio-net-pci-transitional",
     .non_transitional_name = "virtio-net-pci-non-transitional",
     .instance_size = sizeof(VirtIONetPCI),
diff --git a/include/hw/virtio/virtio-net-pci.h b/include/hw/virtio/virtio-net-pci.h
new file mode 100644
index 0000000000..c1915cd54f
--- /dev/null
+++ b/include/hw/virtio/virtio-net-pci.h
@@ -0,0 +1,35 @@
+/*
+ * PCI Virtio Network Device
+ *
+ * Copyright IBM, Corp. 2007
+ *
+ * Authors:
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_VIRTIO_NET_PCI_H
+#define QEMU_VIRTIO_NET_PCI_H
+
+#include "hw/virtio/virtio-net.h"
+#include "hw/virtio/virtio-pci.h"
+
+typedef struct VirtIONetPCI VirtIONetPCI;
+
+/*
+ * virtio-net-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_NET_PCI_GENERIC "virtio-net-pci"
+#define TYPE_VIRTIO_NET_PCI "virtio-net-pci-base"
+#define VIRTIO_NET_PCI(obj) \
+        OBJECT_CHECK(VirtIONetPCI, (obj), TYPE_VIRTIO_NET_PCI)
+
+struct VirtIONetPCI {
+    VirtIOPCIProxy parent_obj;
+    VirtIONet vdev;
+};
+
+#endif
-- 
2.11.0



* [RFC 5/5] hw/virtio-rdma: VirtIO rdma device
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
                   ` (3 preceding siblings ...)
  2021-09-02 13:06 ` [RFC 4/5] virtio-net: Move some virtio-net-pci decl to include/hw/virtio Junji Wei
@ 2021-09-02 13:06 ` Junji Wei
  2021-09-02 15:16   ` Michael S. Tsirkin
  2021-09-03  0:57 ` [RFC 0/5] VirtIO RDMA Jason Wang
  2021-09-15 13:43 ` Jason Gunthorpe
  6 siblings, 1 reply; 14+ messages in thread
From: Junji Wei @ 2021-09-02 13:06 UTC (permalink / raw)
  To: dledford, jgg, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare
  Cc: xieyongji, chaiwen.cc, weijunji, linux-rdma, virtualization, qemu-devel

This is based on Yuval Shaia's [RFC 2/3].

[ Junji Wei: Implement a simple data path and a complete control path. ]
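
The control-path handlers in virtio-rdma-ib.c share one shape: copy the
request out of the guest buffer, validate it, do the work, copy the response
back, and report VIRTIO_RDMA_CTRL_OK only if every copy was complete. The
standalone sketch below illustrates that shape for create_cq. It is an
illustration under assumptions, not the device code: `struct ctrl_buf`,
`ctrl_buf_read()`, `ctrl_buf_write()` and `sketch_create_cq()` are
hypothetical stand-ins for the iovec helpers and the resource manager, and
the CQ number is passed in rather than allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_RDMA_CTRL_OK   0
#define VIRTIO_RDMA_CTRL_ERR  1
#define MAX_CQE               1024

struct cmd_create_cq { uint32_t cqe; };
struct rsp_create_cq { uint32_t cqn; };

/* Hypothetical stand-in for a single guest-provided buffer. */
struct ctrl_buf {
    void *base;
    size_t len;
};

/* Copy up to n bytes out of the buffer; returns the number copied. */
static size_t ctrl_buf_read(const struct ctrl_buf *b, void *dst, size_t n)
{
    size_t c = n < b->len ? n : b->len;

    memcpy(dst, b->base, c);
    return c;
}

/* Copy up to n bytes into the buffer; returns the number copied. */
static size_t ctrl_buf_write(struct ctrl_buf *b, const void *src, size_t n)
{
    size_t c = n < b->len ? n : b->len;

    memcpy(b->base, src, c);
    return c;
}

/* Handler shape: read request, validate, fill response, write it back. */
static int sketch_create_cq(const struct ctrl_buf *in, struct ctrl_buf *out,
                            uint32_t next_cqn)
{
    struct cmd_create_cq cmd = { 0 };
    struct rsp_create_cq rsp = { 0 };

    assert(in != NULL && out != NULL);

    if (ctrl_buf_read(in, &cmd, sizeof(cmd)) != sizeof(cmd))
        return VIRTIO_RDMA_CTRL_ERR;    /* short request */
    if (cmd.cqe > MAX_CQE)
        return VIRTIO_RDMA_CTRL_ERR;    /* request exceeds the device limit */

    rsp.cqn = next_cqn;                 /* real code allocates the CQ here */

    return ctrl_buf_write(out, &rsp, sizeof(rsp)) == sizeof(rsp) ?
           VIRTIO_RDMA_CTRL_OK : VIRTIO_RDMA_CTRL_ERR;
}
```

A request with cqe = 64 and a full-size response buffer returns
VIRTIO_RDMA_CTRL_OK, while an oversized cqe or a truncated request buffer
returns VIRTIO_RDMA_CTRL_ERR.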

Signed-off-by: Yuval Shaia <yuval.shaia.ml@gmail.com>
Signed-off-by: Junji Wei <weijunji@bytedance.com>
---
 hw/rdma/Kconfig                             |   5 +
 hw/rdma/meson.build                         |  10 +
 hw/rdma/virtio/virtio-rdma-dev-api.h        | 269 ++++++++++
 hw/rdma/virtio/virtio-rdma-ib.c             | 764 ++++++++++++++++++++++++++++
 hw/rdma/virtio/virtio-rdma-ib.h             | 176 +++++++
 hw/rdma/virtio/virtio-rdma-main.c           | 231 +++++++++
 hw/rdma/virtio/virtio-rdma-qp.c             | 241 +++++++++
 hw/rdma/virtio/virtio-rdma-qp.h             |  29 ++
 hw/virtio/meson.build                       |   1 +
 hw/virtio/virtio-rdma-pci.c                 | 110 ++++
 include/hw/pci/pci.h                        |   1 +
 include/hw/virtio/virtio-rdma.h             |  58 +++
 include/standard-headers/linux/virtio_ids.h |   1 +
 13 files changed, 1896 insertions(+)
 create mode 100644 hw/rdma/virtio/virtio-rdma-dev-api.h
 create mode 100644 hw/rdma/virtio/virtio-rdma-ib.c
 create mode 100644 hw/rdma/virtio/virtio-rdma-ib.h
 create mode 100644 hw/rdma/virtio/virtio-rdma-main.c
 create mode 100644 hw/rdma/virtio/virtio-rdma-qp.c
 create mode 100644 hw/rdma/virtio/virtio-rdma-qp.h
 create mode 100644 hw/virtio/virtio-rdma-pci.c
 create mode 100644 include/hw/virtio/virtio-rdma.h

diff --git a/hw/rdma/Kconfig b/hw/rdma/Kconfig
index 8e2211288f..245b5b4d11 100644
--- a/hw/rdma/Kconfig
+++ b/hw/rdma/Kconfig
@@ -1,3 +1,8 @@
 config VMW_PVRDMA
     default y if PCI_DEVICES
     depends on PVRDMA && PCI && MSI_NONBROKEN
+
+config VIRTIO_RDMA
+    bool
+    default y
+    depends on VIRTIO
diff --git a/hw/rdma/meson.build b/hw/rdma/meson.build
index 7325f40c32..da9c3aaaf4 100644
--- a/hw/rdma/meson.build
+++ b/hw/rdma/meson.build
@@ -8,3 +8,13 @@ specific_ss.add(when: 'CONFIG_VMW_PVRDMA', if_true: files(
   'vmw/pvrdma_main.c',
   'vmw/pvrdma_qp_ops.c',
 ))
+
+specific_ss.add(when: 'CONFIG_VIRTIO_RDMA', if_true: files(
+  'rdma.c',
+  'rdma_backend.c',
+  'rdma_rm.c',
+  'rdma_utils.c',
+  'virtio/virtio-rdma-main.c',
+  'virtio/virtio-rdma-ib.c',
+  'virtio/virtio-rdma-qp.c',
+))
diff --git a/hw/rdma/virtio/virtio-rdma-dev-api.h b/hw/rdma/virtio/virtio-rdma-dev-api.h
new file mode 100644
index 0000000000..d4d8f2acc2
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-dev-api.h
@@ -0,0 +1,269 @@
+/*
+ * Virtio RDMA Device - QP ops
+ *
+ * Copyright (C) 2021 Bytedance Inc.
+ *
+ * Authors:
+ *  Junji Wei <weijunji@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VIRTIO_RDMA_DEV_API_H
+#define VIRTIO_RDMA_DEV_API_H
+
+#include "virtio-rdma-ib.h"
+
+#define VIRTIO_RDMA_CTRL_OK    0
+#define VIRTIO_RDMA_CTRL_ERR   1
+
+enum {
+    VIRTIO_CMD_QUERY_DEVICE = 10,
+    VIRTIO_CMD_QUERY_PORT,
+    VIRTIO_CMD_CREATE_CQ,
+    VIRTIO_CMD_DESTROY_CQ,
+    VIRTIO_CMD_CREATE_PD,
+    VIRTIO_CMD_DESTROY_PD,
+    VIRTIO_CMD_GET_DMA_MR,
+    VIRTIO_CMD_CREATE_MR,
+    VIRTIO_CMD_MAP_MR_SG,
+    VIRTIO_CMD_REG_USER_MR,
+    VIRTIO_CMD_DEREG_MR,
+    VIRTIO_CMD_CREATE_QP,
+    VIRTIO_CMD_MODIFY_QP,
+    VIRTIO_CMD_QUERY_QP,
+    VIRTIO_CMD_DESTROY_QP,
+    VIRTIO_CMD_QUERY_GID,
+    VIRTIO_CMD_CREATE_UC,
+    VIRTIO_CMD_DEALLOC_UC,
+    VIRTIO_CMD_QUERY_PKEY,
+    VIRTIO_MAX_CMD_NUM,
+};
+
+struct control_buf {
+    uint8_t cmd;
+    uint8_t status;
+};
+
+struct cmd_query_port {
+    uint8_t port;
+};
+
+struct virtio_rdma_port_attr {
+	enum ibv_port_state	state;
+	enum ibv_mtu		max_mtu;
+	enum ibv_mtu		active_mtu;
+	int			        gid_tbl_len;
+	unsigned int		ip_gids:1;
+	uint32_t			port_cap_flags;
+	uint32_t			max_msg_sz;
+	uint32_t			bad_pkey_cntr;
+	uint32_t			qkey_viol_cntr;
+	uint16_t			pkey_tbl_len;
+	uint32_t			sm_lid;
+	uint32_t			lid;
+	uint8_t			    lmc;
+	uint8_t         	max_vl_num;
+	uint8_t             sm_sl;
+	uint8_t             subnet_timeout;
+	uint8_t			    init_type_reply;
+	uint8_t			    active_width;
+	uint8_t			    active_speed;
+	uint8_t             phys_state;
+	uint16_t			port_cap_flags2;
+};
+
+struct cmd_create_cq {
+    uint32_t cqe;
+};
+
+struct rsp_create_cq {
+    uint32_t cqn;
+};
+
+struct cmd_destroy_cq {
+    uint32_t cqn;
+};
+
+struct cmd_create_pd {
+	uint32_t ctx_handle;
+};
+
+struct rsp_create_pd {
+    uint32_t pdn;
+};
+
+struct cmd_destroy_pd {
+    uint32_t pdn;
+};
+
+struct cmd_create_mr {
+    uint32_t pdn;
+    uint32_t access_flags;
+
+	uint32_t max_num_sg;
+};
+
+struct rsp_create_mr {
+    uint32_t mrn;
+    uint32_t lkey;
+    uint32_t rkey;
+};
+
+struct cmd_map_mr_sg {
+	uint32_t mrn;
+	uint64_t start;
+	uint32_t npages;
+
+	uint64_t pages;
+};
+
+struct rsp_map_mr_sg {
+	uint32_t npages;
+};
+
+struct cmd_reg_user_mr {
+	uint32_t pdn;
+	uint32_t access_flags;
+	uint64_t start;
+	uint64_t length;
+
+	uint64_t pages;
+	uint32_t npages;
+};
+
+struct rsp_reg_user_mr {
+	uint32_t mrn;
+	uint32_t lkey;
+	uint32_t rkey;
+};
+
+struct cmd_dereg_mr {
+    uint32_t mrn;
+
+	uint8_t is_user_mr;
+};
+
+struct rsp_dereg_mr {
+    uint32_t mrn;
+};
+
+struct cmd_create_qp {
+    uint32_t pdn;
+    uint8_t qp_type;
+    uint32_t max_send_wr;
+    uint32_t max_send_sge;
+    uint32_t send_cqn;
+    uint32_t max_recv_wr;
+    uint32_t max_recv_sge;
+    uint32_t recv_cqn;
+    uint8_t is_srq;
+    uint32_t srq_handle;
+};
+
+struct rsp_create_qp {
+    uint32_t qpn;
+};
+
+struct cmd_modify_qp {
+    uint32_t qpn;
+    uint32_t attr_mask;
+    struct virtio_rdma_qp_attr attr;
+};
+
+struct cmd_destroy_qp {
+    uint32_t qpn;
+};
+
+struct rsp_destroy_qp {
+    uint32_t qpn;
+};
+
+struct cmd_query_qp {
+	uint32_t qpn;
+	uint32_t attr_mask;
+};
+
+struct rsp_query_qp {
+	struct virtio_rdma_qp_attr attr;
+};
+
+struct cmd_query_gid {
+    uint8_t port;
+    uint32_t index;
+};
+
+struct cmd_create_uc {
+	uint64_t pfn;
+};
+
+struct rsp_create_uc {
+	uint32_t ctx_handle;
+};
+
+struct cmd_dealloc_uc {
+	uint32_t ctx_handle;
+};
+
+struct rsp_dealloc_uc {
+	uint32_t ctx_handle;
+};
+
+struct cmd_query_pkey {
+	__u8 port;
+	__u16 index;
+};
+
+struct rsp_query_pkey {
+	__u16 pkey;
+};
+
+struct cmd_post_send {
+	__u32 qpn;
+	__u32 is_kernel;
+	__u32 num_sge;
+
+	int send_flags;
+	enum virtio_rdma_wr_opcode opcode;
+	__u64 wr_id;
+
+	union {
+		__be32 imm_data;
+		__u32 invalidate_rkey;
+	} ex;
+	
+	union {
+		struct {
+			__u64 remote_addr;
+			__u32 rkey;
+		} rdma;
+		struct {
+			__u64 remote_addr;
+			__u64 compare_add;
+			__u64 swap;
+			__u32 rkey;
+		} atomic;
+		struct {
+			__u32 remote_qpn;
+			__u32 remote_qkey;
+			__u32 ahn;
+		} ud;
+        struct {
+			__u32 mrn;
+			__u32 key;
+			int access;
+		} reg;
+	} wr;
+};
+
+struct cmd_post_recv {
+	__u32 qpn;
+	__u32 is_kernel;
+
+	__u32 num_sge;
+	__u64 wr_id;
+};
+
+#endif
diff --git a/hw/rdma/virtio/virtio-rdma-ib.c b/hw/rdma/virtio/virtio-rdma-ib.c
new file mode 100644
index 0000000000..54831ec787
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-ib.c
@@ -0,0 +1,764 @@
+/*
+ * Virtio RDMA Device - IB verbs
+ *
+ * Copyright (C) 2019 Oracle
+ *
+ * Authors:
+ *  Yuval Shaia <yuval.shaia@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+
+#include <infiniband/verbs.h>
+#include "qemu/atomic.h"
+#include "cpu.h"
+
+#include "virtio-rdma-ib.h"
+#include "virtio-rdma-qp.h"
+#include "virtio-rdma-dev-api.h"
+
+#include "../rdma_utils.h"
+#include "../rdma_rm.h"
+#include "../rdma_backend.h"
+
+#include <malloc.h>
+
+int virtio_rdma_query_device(VirtIORdma *rdev, struct iovec *in,
+                             struct iovec *out)
+{
+    int offs;
+    size_t s;
+
+    addrconf_addr_eui48((unsigned char *)&rdev->dev_attr.sys_image_guid,
+                        (const char *)&rdev->netdev->mac);
+
+    offs = offsetof(struct ibv_device_attr, sys_image_guid);
+    s = iov_from_buf(out, 1, 0, (void *)&rdev->dev_attr + offs, sizeof(rdev->dev_attr) - offs);
+
+    return s == sizeof(rdev->dev_attr) - offs ? VIRTIO_RDMA_CTRL_OK :
+                                                VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_query_port(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out)
+{
+    struct virtio_rdma_port_attr attr = {};
+    struct ibv_port_attr vattr = {};
+    struct cmd_query_port cmd = {};
+    int offs;
+    size_t s;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    if (cmd.port != 1) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    if (rdma_backend_query_port(rdev->backend_dev, &vattr))
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    attr.state = vattr.state;
+    attr.max_mtu = vattr.max_mtu;
+    attr.active_mtu = vattr.active_mtu;
+    attr.gid_tbl_len = vattr.gid_tbl_len;
+    attr.port_cap_flags = vattr.port_cap_flags;
+    attr.max_msg_sz = vattr.max_msg_sz;
+    attr.bad_pkey_cntr = vattr.bad_pkey_cntr;
+    attr.qkey_viol_cntr = vattr.qkey_viol_cntr;
+    attr.pkey_tbl_len = vattr.pkey_tbl_len;
+    attr.lid = vattr.lid;
+    attr.sm_lid = vattr.sm_lid;
+    attr.lmc = vattr.lmc;
+    attr.max_vl_num = vattr.max_vl_num;
+    attr.sm_sl = vattr.sm_sl;
+    attr.subnet_timeout = vattr.subnet_timeout;
+    attr.init_type_reply = vattr.init_type_reply;
+    attr.active_width = vattr.active_width;
+    attr.active_speed = vattr.active_speed;
+    attr.phys_state = vattr.phys_state;
+    attr.port_cap_flags2 = vattr.port_cap_flags2;
+
+    offs = offsetof(struct virtio_rdma_port_attr, state);
+
+    s = iov_from_buf(out, 1, 0, (void *)&attr + offs, sizeof(attr) - offs);
+
+    return s == sizeof(attr) - offs ? VIRTIO_RDMA_CTRL_OK :
+                                      VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_create_cq(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_create_cq cmd = {};
+    struct rsp_create_cq rsp = {};
+    size_t s;
+    int rc;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    /* TODO: Define MAX_CQE */
+#define MAX_CQE 1024
+    /* TODO: Check MAX_CQ */
+    if (cmd.cqe > MAX_CQE) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    printf("%s: %d\n", __func__, cmd.cqe);
+
+    rc = rdma_rm_alloc_cq(rdev->rdma_dev_res, rdev->backend_dev, cmd.cqe,
+                          &rsp.cqn, NULL);
+    if (rc)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    printf("%s: %d\n", __func__, rsp.cqn);
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_destroy_cq(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_destroy_cq cmd = {};
+    size_t s;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    printf("%s: %d\n", __func__, cmd.cqn);
+
+    virtqueue_drop_all(rdev->cq_vqs[cmd.cqn]);
+    rdma_rm_dealloc_cq(rdev->rdma_dev_res, cmd.cqn);
+
+    return VIRTIO_RDMA_CTRL_OK;
+}
+
+int virtio_rdma_create_pd(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_create_pd cmd = {};
+    struct rsp_create_pd rsp = {};
+    size_t s;
+    int rc;
+
+    if (qatomic_inc_fetch(&rdev->num_pd) > rdev->dev_attr.max_pd)
+        goto err;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd))
+        goto err;
+
+    /* TODO: Check MAX_PD */
+
+    rc = rdma_rm_alloc_pd(rdev->rdma_dev_res, rdev->backend_dev, &rsp.pdn,
+                          cmd.ctx_handle);
+    if (rc)
+        goto err;
+
+    printf("%s: pdn %d  num_pd %d\n", __func__, rsp.pdn, qatomic_read(&rdev->num_pd));
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    if (s == sizeof(rsp))
+        return VIRTIO_RDMA_CTRL_OK;
+
+err:
+    qatomic_dec(&rdev->num_pd);
+    return VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_destroy_pd(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_destroy_pd cmd = {};
+    size_t s;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    printf("%s: %d\n", __func__, cmd.pdn);
+
+    rdma_rm_dealloc_pd(rdev->rdma_dev_res, cmd.pdn);
+
+    return VIRTIO_RDMA_CTRL_OK;
+}
+
+int virtio_rdma_get_dma_mr(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out)
+{
+    struct cmd_create_mr cmd = {};
+    struct rsp_create_mr rsp = {};
+    size_t s;
+    uint32_t *htbl_key;
+    struct virtio_rdma_kernel_mr *kernel_mr;
+
+    // FIXME: how to support dma mr
+    rdma_warn_report("DMA mr is not supported now");
+
+    htbl_key = g_malloc0(sizeof(*htbl_key));
+    if (htbl_key == NULL)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    kernel_mr = g_malloc0(sizeof(*kernel_mr));
+    if (kernel_mr == NULL) {
+        g_free(htbl_key);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        g_free(kernel_mr);
+        g_free(htbl_key);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    rdma_rm_alloc_mr(rdev->rdma_dev_res, cmd.pdn, 0, 0, NULL, cmd.access_flags, &rsp.mrn, &rsp.lkey, &rsp.rkey);
+
+    *htbl_key = rsp.lkey;
+    kernel_mr->dummy_mr = rdma_rm_get_mr(rdev->rdma_dev_res, rsp.mrn);
+    kernel_mr->max_num_sg = cmd.max_num_sg;
+    kernel_mr->real_mr = NULL;
+    kernel_mr->dma_mr = true;
+    g_hash_table_insert(rdev->lkey_mr_tbl, htbl_key, kernel_mr);
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_create_mr(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_create_mr cmd = {};
+    struct rsp_create_mr rsp = {};
+    size_t s;
+    void* map_addr;
+    // uint64_t length;
+    uint32_t *htbl_key;
+    struct virtio_rdma_kernel_mr *kernel_mr;
+    RdmaRmMR *mr;
+
+    htbl_key = g_malloc0(sizeof(*htbl_key));
+    if (htbl_key == NULL)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    kernel_mr = g_malloc0(sizeof(*kernel_mr));
+    if (kernel_mr == NULL) {
+        g_free(htbl_key);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        g_free(kernel_mr);
+        g_free(htbl_key);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    // when length is zero, the same lkey is returned, so register one dummy page
+    map_addr = mmap(0, TARGET_PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+    if (map_addr == MAP_FAILED) {
+        g_free(kernel_mr);
+        g_free(htbl_key);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+    rdma_rm_alloc_mr(rdev->rdma_dev_res, cmd.pdn, (uint64_t)map_addr, TARGET_PAGE_SIZE, map_addr, cmd.access_flags, &rsp.mrn, &rsp.lkey, &rsp.rkey);
+    // rkey is -1 because a kernel-mode MR cannot be accessed remotely
+
+    /* we need to build a lkey to MR map, in order to set the local address
+     * in post_send and post_recv.
+     */
+    *htbl_key = rsp.lkey;
+    mr = rdma_rm_get_mr(rdev->rdma_dev_res, rsp.mrn);
+    mr->lkey = rsp.lkey;
+    kernel_mr->dummy_mr = mr;
+    kernel_mr->max_num_sg = cmd.max_num_sg;
+    kernel_mr->real_mr = NULL;
+    kernel_mr->dma_mr = false;
+    g_hash_table_insert(rdev->lkey_mr_tbl, htbl_key, kernel_mr);
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+static int remap_pages(AddressSpace *as, uint64_t *pages, void* remap_start, int npages)
+{
+    int i;
+    void* addr;
+    void* curr_page;
+    dma_addr_t len = TARGET_PAGE_SIZE;
+
+    for (i = 0; i < npages; i++) {
+        rdma_info_report("remap page %" PRIx64 " to %p", pages[i], remap_start + TARGET_PAGE_SIZE * i);
+        curr_page = dma_memory_map(as, pages[i], &len, DMA_DIRECTION_TO_DEVICE);
+        addr = mremap(curr_page, 0, TARGET_PAGE_SIZE, MREMAP_MAYMOVE | MREMAP_FIXED,
+                     remap_start + TARGET_PAGE_SIZE * i);
+        dma_memory_unmap(as, curr_page, TARGET_PAGE_SIZE, DMA_DIRECTION_TO_DEVICE, 0);
+        if (addr == MAP_FAILED)
+            break;
+    }
+    return i;
+}
+
+int virtio_rdma_map_mr_sg(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_map_mr_sg cmd = {};
+    struct rsp_map_mr_sg rsp = {};
+    size_t s;
+    uint64_t *pages;
+    dma_addr_t len = TARGET_PAGE_SIZE;
+    RdmaRmMR *mr;
+    void *remap_addr;
+    AddressSpace *dma_as = VIRTIO_DEVICE(rdev)->dma_as;
+    struct virtio_rdma_kernel_mr *kmr;
+    uint32_t num_pages;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    mr = rdma_rm_get_mr(rdev->rdma_dev_res, cmd.mrn);
+    if (!mr) {
+        rdma_error_report("get mr failed");
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
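+    /* Look up the kernel MR first so a failed lookup cannot leak the mapping */
+    kmr = g_hash_table_lookup(rdev->lkey_mr_tbl, &mr->lkey);
+    if (!kmr) {
+        rdma_error_report("Get kmr failed");
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    pages = dma_memory_map(dma_as, cmd.pages, &len, DMA_DIRECTION_TO_DEVICE);
+    if (!pages) {
+        rdma_error_report("map page list failed");
+        return VIRTIO_RDMA_CTRL_ERR;
+    }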
+
+    num_pages = kmr->max_num_sg > cmd.npages ? cmd.npages : kmr->max_num_sg;
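+    remap_addr = mmap(0, num_pages * TARGET_PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+    if (remap_addr == MAP_FAILED) {
+        dma_memory_unmap(dma_as, pages, len, DMA_DIRECTION_TO_DEVICE, 0);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }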
+
+    rsp.npages = remap_pages(dma_as, pages, remap_addr, num_pages);
+    dma_memory_unmap(dma_as, pages, len, DMA_DIRECTION_TO_DEVICE, 0);
+
+    // rdma_rm_alloc_mr(rdev->rdma_dev_res, mr->pd_handle, (uint64_t)remap_addr, num_pages * TARGET_PAGE_SIZE,
+    //                  remap_addr, IBV_ACCESS_LOCAL_WRITE, &kmr->mrn, &kmr->lkey, &kmr->rkey);
+
+    kmr->virt = remap_addr;
+    kmr->length = num_pages * TARGET_PAGE_SIZE;
+    kmr->start = cmd.start;
+    // kmr->real_mr = rdma_rm_get_mr(rdev->rdma_dev_res, kmr->mrn);
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_reg_user_mr(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_reg_user_mr cmd = {};
+    struct rsp_reg_user_mr rsp = {};
+    size_t s;
+    uint64_t *pages;
+    dma_addr_t len = TARGET_PAGE_SIZE;
+    void *remap_addr, *curr_page;
+    AddressSpace *dma_as = VIRTIO_DEVICE(rdev)->dma_as;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    pages = dma_memory_map(dma_as, cmd.pages, &len, DMA_DIRECTION_TO_DEVICE);
+
+    curr_page = dma_memory_map(dma_as, pages[0], &len, DMA_DIRECTION_TO_DEVICE);
+    remap_addr = mremap(curr_page, 0, TARGET_PAGE_SIZE * cmd.npages, MREMAP_MAYMOVE);
+    dma_memory_unmap(dma_as, curr_page, TARGET_PAGE_SIZE, DMA_DIRECTION_TO_DEVICE, 0);
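+    if (remap_addr == MAP_FAILED) {
+        rdma_error_report("mremap failed");
+        /* do not leak the mapping of the guest page list */
+        dma_memory_unmap(dma_as, pages, len, DMA_DIRECTION_TO_DEVICE, 0);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }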
+
+    remap_pages(dma_as, pages + 1, remap_addr + TARGET_PAGE_SIZE, cmd.npages - 1);
+    dma_memory_unmap(dma_as, pages, len, DMA_DIRECTION_TO_DEVICE, 0);
+
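+    /* bail out on failure; otherwise the MR lookup below returns NULL */
+    if (rdma_rm_alloc_mr(rdev->rdma_dev_res, cmd.pdn, cmd.start,
+                         TARGET_PAGE_SIZE * cmd.npages, remap_addr,
+                         cmd.access_flags, &rsp.mrn, &rsp.lkey, &rsp.rkey)) {
+        rdma_error_report("alloc mr failed");
+        munmap(remap_addr, TARGET_PAGE_SIZE * cmd.npages);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }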
+    rsp.rkey = rdma_backend_mr_rkey(&rdma_rm_get_mr(rdev->rdma_dev_res, rsp.mrn)->backend_mr);
+    rdma_info_report("%s: 0x%x\n", __func__, rsp.mrn);
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_dereg_mr(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_dereg_mr cmd = {};
+    struct RdmaRmMR *mr;
+    struct virtio_rdma_kernel_mr *kmr = NULL; /* stays NULL for user MRs */
+    size_t s;
+    uint32_t lkey;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    mr = rdma_rm_get_mr(rdev->rdma_dev_res, cmd.mrn);
+    if (!mr)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    if (!cmd.is_user_mr) {
+        lkey = mr->lkey;
+        kmr = g_hash_table_lookup(rdev->lkey_mr_tbl, &lkey);
+        if (!kmr)
+            return VIRTIO_RDMA_CTRL_ERR;
+        rdma_backend_destroy_mr(&kmr->dummy_mr->backend_mr);
+        mr = kmr->real_mr;
+        g_hash_table_remove(rdev->lkey_mr_tbl, &lkey);
+        if (!mr) {
+            /* the table has no value destructor, so free the entry here */
+            g_free(kmr);
+            return VIRTIO_RDMA_CTRL_OK;
+        }
+    }
+
+    munmap(mr->virt, mr->length);
+    rdma_backend_destroy_mr(&mr->backend_mr);
+    g_free(kmr);
+    return VIRTIO_RDMA_CTRL_OK;
+}
+
+int virtio_rdma_create_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_create_qp cmd = {};
+    struct rsp_create_qp rsp = {};
+    size_t s;
+    int rc;
+    //uint32_t recv_cqn;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    // TODO: check max qp
+
+    printf("%s: %d qp type %d\n", __func__, cmd.pdn, cmd.qp_type);
+
+    // store recv_cqn in opaque
+    rc = rdma_rm_alloc_qp(rdev->rdma_dev_res, cmd.pdn, cmd.qp_type, cmd.max_send_wr,
+                          cmd.max_send_sge, cmd.send_cqn, cmd.max_recv_wr,
+                          cmd.max_recv_sge, cmd.recv_cqn, NULL, &rsp.qpn,
+                          cmd.is_srq, cmd.srq_handle);
+
+    if (rc)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    printf("%s: %d\n", __func__, rsp.qpn);
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+static void virtio_rdma_ah_attr_to_ibv(struct virtio_rdma_ah_attr *ah_attr,
+                                       struct ibv_ah_attr *ibv_attr)
+{
+    ibv_attr->grh.dgid = ah_attr->grh.dgid;
+    ibv_attr->grh.flow_label = ah_attr->grh.flow_label;
+    ibv_attr->grh.sgid_index = ah_attr->grh.sgid_index;
+    ibv_attr->grh.hop_limit = ah_attr->grh.hop_limit;
+    ibv_attr->grh.traffic_class = ah_attr->grh.traffic_class;
+
+    ibv_attr->dlid = ah_attr->dlid;
+    ibv_attr->sl = ah_attr->sl;
+    ibv_attr->src_path_bits = ah_attr->src_path_bits;
+    ibv_attr->static_rate = ah_attr->static_rate;
+    ibv_attr->port_num = ah_attr->port_num;
+}
+
+int virtio_rdma_modify_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_modify_qp cmd = {};
+    size_t s;
+    int rc;
+
+    RdmaRmQP *rqp;
+    struct ibv_qp_attr attr = {};
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    printf("%s: %d %d\n", __func__, cmd.qpn, cmd.attr.qp_state);
+
+    rqp = rdma_rm_get_qp(rdev->rdma_dev_res, cmd.qpn);
+    if (!rqp) {
+        rdma_error_report("Get qp failed");
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    if (rqp->qp_type == IBV_QPT_GSI) {
+        return VIRTIO_RDMA_CTRL_OK;
+    }
+
+    // TODO: assign attr based on cmd.attr_mask
+    attr.qp_state = cmd.attr.qp_state;
+    attr.cur_qp_state = cmd.attr.cur_qp_state;
+    attr.path_mtu = cmd.attr.path_mtu;
+    attr.path_mig_state = cmd.attr.path_mig_state;
+    attr.qkey = cmd.attr.qkey;
+    attr.rq_psn = cmd.attr.rq_psn;
+    attr.sq_psn = cmd.attr.sq_psn;
+    attr.dest_qp_num = cmd.attr.dest_qp_num;
+    attr.qp_access_flags = cmd.attr.qp_access_flags;
+    attr.pkey_index = cmd.attr.pkey_index;
+    attr.en_sqd_async_notify = cmd.attr.en_sqd_async_notify;
+    attr.sq_draining = cmd.attr.sq_draining;
+    attr.max_rd_atomic = cmd.attr.max_rd_atomic;
+    attr.max_dest_rd_atomic = cmd.attr.max_dest_rd_atomic;
+    attr.min_rnr_timer = cmd.attr.min_rnr_timer;
+    attr.port_num = cmd.attr.port_num;
+    attr.timeout = cmd.attr.timeout;
+    attr.retry_cnt = cmd.attr.retry_cnt;
+    attr.rnr_retry = cmd.attr.rnr_retry;
+    attr.alt_port_num = cmd.attr.alt_port_num;
+    attr.alt_timeout = cmd.attr.alt_timeout;
+    attr.rate_limit = cmd.attr.rate_limit;
+    attr.cap.max_inline_data = cmd.attr.cap.max_inline_data;
+    attr.cap.max_recv_sge = cmd.attr.cap.max_recv_sge;
+    attr.cap.max_recv_wr = cmd.attr.cap.max_recv_wr;
+    attr.cap.max_send_sge = cmd.attr.cap.max_send_sge;
+    attr.cap.max_send_wr = cmd.attr.cap.max_send_wr;
+    virtio_rdma_ah_attr_to_ibv(&cmd.attr.ah_attr, &attr.ah_attr);
+    virtio_rdma_ah_attr_to_ibv(&cmd.attr.alt_ah_attr, &attr.alt_ah_attr);
+
+    rqp->qp_state = cmd.attr.qp_state;
+
+    if (rqp->qp_state == IBV_QPS_RTR) {
+        rqp->backend_qp.sgid_idx = cmd.attr.ah_attr.grh.sgid_index;
+        attr.ah_attr.grh.sgid_index = cmd.attr.ah_attr.grh.sgid_index;
+        attr.ah_attr.is_global  = 1;
+    }
+
+    printf("modify_qp_debug %d %d %d %d %d %d %d %d\n", cmd.qpn, cmd.attr_mask, cmd.attr.ah_attr.grh.sgid_index,
+           cmd.attr.dest_qp_num, cmd.attr.qp_state, cmd.attr.qkey, cmd.attr.rq_psn, cmd.attr.sq_psn);
+
+    rc = ibv_modify_qp(rqp->backend_qp.ibqp, &attr, cmd.attr_mask);
+    /*
+    rc = rdma_rm_modify_qp(rdev->rdma_dev_res, rdev->backend_dev,
+                           cmd.qpn, cmd.attr_mask,
+                           cmd.attr.ah_attr.grh.sgid_index,
+                           &cmd.attr.ah_attr.grh.dgid,
+                           cmd.attr.dest_qp_num,
+                           (enum ibv_qp_state)cmd.attr.qp_state,
+                           cmd.attr.qkey, cmd.attr.rq_psn,
+                           cmd.attr.sq_psn);*/
+
+    if (rc) {
+        rdma_error_report("ibv_modify_qp fail, rc=%d, errno=%d", rc, errno);
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+    return VIRTIO_RDMA_CTRL_OK;
+}
+
+int virtio_rdma_query_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_query_qp cmd = {};
+    struct rsp_query_qp rsp = {};
+    struct ibv_qp_init_attr init_attr;
+    size_t s;
+    int rc;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    memset(&rsp, 0, sizeof(rsp));
+
+    rc = rdma_rm_query_qp(rdev->rdma_dev_res, rdev->backend_dev, cmd.qpn,
+                          (struct ibv_qp_attr *)&rsp.attr, cmd.attr_mask,
+                          &init_attr);
+    if (rc)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_destroy_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out)
+{
+    struct cmd_destroy_qp cmd = {};
+    size_t s;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    rdma_info_report("%s: %d", __func__, cmd.qpn);
+
+    rdma_rm_dealloc_qp(rdev->rdma_dev_res, cmd.qpn);
+
+    return VIRTIO_RDMA_CTRL_OK;
+}
+
+int virtio_rdma_query_gid(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out)
+{
+    struct cmd_query_gid cmd = {};
+    union ibv_gid gid = {};
+    size_t s;
+    int rc;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    rc = ibv_query_gid(rdev->backend_dev->context, cmd.port, cmd.index,
+                       &gid);
+    if (rc)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    s = iov_from_buf(out, 1, 0, &gid, sizeof(gid));
+
+    return s == sizeof(gid) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_create_uc(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out)
+{
+    struct cmd_create_uc cmd = {};
+    struct rsp_create_uc rsp = {};
+    size_t s;
+    int rc;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    rc = rdma_rm_alloc_uc(rdev->rdma_dev_res, cmd.pfn, &rsp.ctx_handle);
+
+    if (rc)
+        return VIRTIO_RDMA_CTRL_ERR;
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_dealloc_uc(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out)
+{
+    struct cmd_dealloc_uc cmd = {};
+    size_t s;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    rdma_rm_dealloc_uc(rdev->rdma_dev_res, cmd.ctx_handle);
+
+    return VIRTIO_RDMA_CTRL_OK;
+}
+
+int virtio_rdma_query_pkey(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out)
+{
+    struct cmd_query_pkey cmd = {};
+    struct rsp_query_pkey rsp = {};
+    size_t s;
+
+    s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+    if (s != sizeof(cmd)) {
+        return VIRTIO_RDMA_CTRL_ERR;
+    }
+
+    rsp.pkey = 0xFFFF;
+
+    s = iov_from_buf(out, 1, 0, &rsp, sizeof(rsp));
+
+    return s == sizeof(rsp) ? VIRTIO_RDMA_CTRL_OK :
+                              VIRTIO_RDMA_CTRL_ERR;
+}
+
+static void virtio_rdma_init_dev_caps(VirtIORdma *rdev)
+{
+    rdev->dev_attr.max_qp_wr = 1024;
+}
+
+int virtio_rdma_init_ib(VirtIORdma *rdev)
+{
+    int rc;
+
+    virtio_rdma_init_dev_caps(rdev);
+
+    rdev->rdma_dev_res = g_malloc0(sizeof(RdmaDeviceResources));
+    rdev->backend_dev = g_malloc0(sizeof(RdmaBackendDev));
+
+    rc = rdma_backend_init(rdev->backend_dev, NULL, rdev->rdma_dev_res,
+                           rdev->backend_device_name,
+                           rdev->backend_port_num, &rdev->dev_attr,
+                           &rdev->mad_chr);
+    if (rc) {
+        rdma_error_report("Fail to initialize backend device");
+        return rc;
+    }
+
+    rdev->dev_attr.max_mr_size = 4096;
+    rdev->dev_attr.page_size_cap = 4096;
+    rdev->dev_attr.vendor_id = 1;
+    rdev->dev_attr.vendor_part_id = 1;
+    rdev->dev_attr.hw_ver = VIRTIO_RDMA_HW_VER;
+    rdev->dev_attr.atomic_cap = IBV_ATOMIC_NONE;
+    rdev->dev_attr.max_pkeys = 1;
+    rdev->dev_attr.phys_port_cnt = VIRTIO_RDMA_PORT_CNT;
+
+    rc = rdma_rm_init(rdev->rdma_dev_res, &rdev->dev_attr);
+    if (rc) {
+        rdma_error_report("Fail to initialize resource manager");
+        return rc;
+    }
+
+    virtio_rdma_qp_ops_init();
+
+    rdma_backend_start(rdev->backend_dev);
+
+    return 0;
+}
+
+void virtio_rdma_fini_ib(VirtIORdma *rdev)
+{
+    rdma_backend_stop(rdev->backend_dev);
+    virtio_rdma_qp_ops_fini();
+    rdma_rm_fini(rdev->rdma_dev_res, rdev->backend_dev,
+                 rdev->backend_eth_device_name);
+    rdma_backend_fini(rdev->backend_dev);
+    g_free(rdev->rdma_dev_res);
+    g_free(rdev->backend_dev);
+}
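
The control handlers in virtio-rdma-ib.c all follow one shape: copy a
fixed-size request out of the first iovec, reject short requests, do the
work, then copy the response back. Below is a minimal standalone sketch of
that shape. Everything named demo_* or DEMO_* is hypothetical, and the two
copy helpers are simplified single-element stand-ins for QEMU's
iov_to_buf()/iov_from_buf(), not the real functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical status codes, standing in for VIRTIO_RDMA_CTRL_OK/ERR. */
enum { DEMO_IOV_OK = 0, DEMO_IOV_ERR = 1 };

/* Copy at most len bytes out of one iovec; returns the bytes copied. */
static size_t demo_iov_to_buf(const struct iovec *iov, void *buf, size_t len)
{
    size_t n = len < iov->iov_len ? len : iov->iov_len;

    memcpy(buf, iov->iov_base, n);
    return n;
}

/* Copy at most len bytes into one iovec; returns the bytes copied. */
static size_t demo_iov_from_buf(struct iovec *iov, const void *buf, size_t len)
{
    size_t n = len < iov->iov_len ? len : iov->iov_len;

    memcpy(iov->iov_base, buf, n);
    return n;
}

struct demo_cmd { uint32_t handle; };
struct demo_rsp { uint32_t value; };

static int demo_handler(struct iovec *in, struct iovec *out)
{
    struct demo_cmd cmd = { 0 };
    struct demo_rsp rsp = { 0 };

    /* A short request is rejected before any field of cmd is used. */
    if (demo_iov_to_buf(in, &cmd, sizeof(cmd)) != sizeof(cmd)) {
        return DEMO_IOV_ERR;
    }

    rsp.value = cmd.handle + 1; /* placeholder for the real work */

    /* A response buffer that is too small is also an error. */
    if (demo_iov_from_buf(out, &rsp, sizeof(rsp)) != sizeof(rsp)) {
        return DEMO_IOV_ERR;
    }
    return DEMO_IOV_OK;
}
```

A handler written this way reports an error for a truncated request or a
short response buffer rather than acting on partially filled structures.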
diff --git a/hw/rdma/virtio/virtio-rdma-ib.h b/hw/rdma/virtio/virtio-rdma-ib.h
new file mode 100644
index 0000000000..457b25f998
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-ib.h
@@ -0,0 +1,176 @@
+/*
+ * Virtio RDMA Device - IB verbs
+ *
+ * Copyright (C) 2019 Oracle
+ *
+ * Authors:
+ *  Yuval Shaia <yuval.shaia@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VIRTIO_RDMA_IB_H
+#define VIRTIO_RDMA_IB_H
+
+#include "qemu/osdep.h"
+#include "qemu/iov.h"
+#include "hw/virtio/virtio-rdma.h"
+
+#include "../rdma_rm.h"
+
+enum virtio_rdma_wr_opcode {
+	VIRTIO_RDMA_WR_RDMA_WRITE,
+	VIRTIO_RDMA_WR_RDMA_WRITE_WITH_IMM,
+	VIRTIO_RDMA_WR_SEND,
+	VIRTIO_RDMA_WR_SEND_WITH_IMM,
+	VIRTIO_RDMA_WR_RDMA_READ,
+	VIRTIO_RDMA_WR_ATOMIC_CMP_AND_SWP,
+	VIRTIO_RDMA_WR_ATOMIC_FETCH_AND_ADD,
+	VIRTIO_RDMA_WR_LOCAL_INV,
+	VIRTIO_RDMA_WR_BIND_MW,
+	VIRTIO_RDMA_WR_SEND_WITH_INV,
+	VIRTIO_RDMA_WR_TSO,
+	VIRTIO_RDMA_WR_DRIVER1,
+
+	VIRTIO_RDMA_WR_REG_MR = 0x20,
+};
+
+struct virtio_rdma_cqe {
+	uint64_t		wr_id;
+	enum ibv_wc_status status;
+	enum ibv_wc_opcode opcode;
+	uint32_t vendor_err;
+	uint32_t byte_len;
+	uint32_t imm_data;
+	uint32_t qp_num;
+	uint32_t src_qp;
+	int	 wc_flags;
+	uint16_t pkey_index;
+	uint16_t slid;
+	uint8_t sl;
+	uint8_t dlid_path_bits;
+};
+
+struct CompHandlerCtx {
+	VirtIORdma *dev;
+	uint32_t cq_handle;
+	struct virtio_rdma_cqe cqe;
+};
+
+struct virtio_rdma_kernel_mr {
+	RdmaRmMR *dummy_mr; // created by create_mr
+	RdmaRmMR *real_mr; // real mr created by map_mr_sg
+
+	void* virt;
+	uint64_t length;
+	uint64_t start;
+	uint32_t mrn;
+	uint32_t lkey;
+	uint32_t rkey;
+
+	uint32_t max_num_sg;
+	uint8_t dma_mr;
+};
+
+struct virtio_rdma_global_route {
+	union ibv_gid		dgid;
+	uint32_t		flow_label;
+	uint8_t			sgid_index;
+	uint8_t			hop_limit;
+	uint8_t			traffic_class;
+};
+
+struct virtio_rdma_ah_attr {
+	struct virtio_rdma_global_route	grh;
+	uint16_t			dlid;
+	uint8_t				sl;
+	uint8_t				src_path_bits;
+	uint8_t				static_rate;
+	uint8_t				port_num;
+};
+
+struct virtio_rdma_qp_cap {
+	uint32_t		max_send_wr;
+	uint32_t		max_recv_wr;
+	uint32_t		max_send_sge;
+	uint32_t		max_recv_sge;
+	uint32_t		max_inline_data;
+};
+
+struct virtio_rdma_qp_attr {
+	enum ibv_qp_state	qp_state;
+	enum ibv_qp_state	cur_qp_state;
+	enum ibv_mtu		path_mtu;
+	enum ibv_mig_state	path_mig_state;
+	uint32_t			qkey;
+	uint32_t			rq_psn;
+	uint32_t			sq_psn;
+	uint32_t			dest_qp_num;
+	uint32_t			qp_access_flags;
+	uint16_t			pkey_index;
+	uint16_t			alt_pkey_index;
+	uint8_t			en_sqd_async_notify;
+	uint8_t			sq_draining;
+	uint8_t			max_rd_atomic;
+	uint8_t			max_dest_rd_atomic;
+	uint8_t			min_rnr_timer;
+	uint8_t			port_num;
+	uint8_t			timeout;
+	uint8_t			retry_cnt;
+	uint8_t			rnr_retry;
+	uint8_t			alt_port_num;
+	uint8_t			alt_timeout;
+	uint32_t			rate_limit;
+	struct virtio_rdma_qp_cap	cap;
+	struct virtio_rdma_ah_attr	ah_attr;
+	struct virtio_rdma_ah_attr	alt_ah_attr;
+};
+
+#define VIRTIO_RDMA_PORT_CNT    1
+#define VIRTIO_RDMA_HW_VER      1
+
+int virtio_rdma_init_ib(VirtIORdma *rdev);
+void virtio_rdma_fini_ib(VirtIORdma *rdev);
+
+int virtio_rdma_query_device(VirtIORdma *rdev, struct iovec *in,
+                             struct iovec *out);
+int virtio_rdma_query_port(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out);
+int virtio_rdma_create_cq(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_destroy_cq(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_create_pd(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_destroy_pd(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_get_dma_mr(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out);
+int virtio_rdma_create_mr(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_reg_user_mr(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_create_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_modify_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_query_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_query_gid(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out);
+int virtio_rdma_destroy_qp(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_map_mr_sg(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_dereg_mr(VirtIORdma *rdev, struct iovec *in,
+                          struct iovec *out);
+int virtio_rdma_create_uc(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out);
+int virtio_rdma_query_pkey(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out);
+int virtio_rdma_dealloc_uc(VirtIORdma *rdev, struct iovec *in,
+                           struct iovec *out);
+
+#endif
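
struct virtio_rdma_qp_attr mirrors ibv_qp_attr, and virtio_rdma_modify_qp
currently copies every field no matter what cmd.attr_mask says (see the
TODO there). One possible way to honour the mask is sketched below. The
struct is cut down to three fields and the DEMO_QP_* bit values are
hypothetical; real code would test the bits of enum ibv_qp_attr_mask
from libibverbs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask bits, for illustration only. */
enum {
    DEMO_QP_STATE  = 1 << 0,
    DEMO_QP_QKEY   = 1 << 1,
    DEMO_QP_RQ_PSN = 1 << 2,
};

/* Cut-down stand-in for the QP attribute structure. */
struct demo_qp_attr {
    uint32_t qp_state;
    uint32_t qkey;
    uint32_t rq_psn;
};

/* Copy only the fields whose mask bit is set; leave the rest untouched. */
static void demo_apply_qp_attr(struct demo_qp_attr *dst,
                               const struct demo_qp_attr *src,
                               uint32_t mask)
{
    if (mask & DEMO_QP_STATE) {
        dst->qp_state = src->qp_state;
    }
    if (mask & DEMO_QP_QKEY) {
        dst->qkey = src->qkey;
    }
    if (mask & DEMO_QP_RQ_PSN) {
        dst->rq_psn = src->rq_psn;
    }
}
```

Copying only the masked fields keeps values the guest did not set from
reaching the backend.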
diff --git a/hw/rdma/virtio/virtio-rdma-main.c b/hw/rdma/virtio/virtio-rdma-main.c
new file mode 100644
index 0000000000..a69f0eb054
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-main.c
@@ -0,0 +1,231 @@
+/*
+ * Virtio RDMA Device
+ *
+ * Copyright (C) 2019 Oracle
+ *
+ * Authors:
+ *  Yuval Shaia <yuval.shaia@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <infiniband/verbs.h>
+#include <unistd.h>
+
+#include "hw/virtio/virtio.h"
+#include "qemu/error-report.h"
+#include "hw/virtio/virtio-bus.h"
+#include "hw/virtio/virtio-rdma.h"
+#include "hw/qdev-properties.h"
+#include "include/standard-headers/linux/virtio_ids.h"
+
+#include "virtio-rdma-ib.h"
+#include "virtio-rdma-qp.h"
+#include "virtio-rdma-dev-api.h"
+
+#include "../rdma_rm_defs.h"
+#include "../rdma_utils.h"
+
+#define DEFINE_VIRTIO_RDMA_CMD(cmd, handler) [cmd] = {handler, #cmd},
+
+struct {
+    int (*handler)(VirtIORdma *rdev, struct iovec *in, struct iovec *out);
+    const char* name;
+} cmd_tbl[] = {
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_QUERY_DEVICE, virtio_rdma_query_device)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_QUERY_PORT, virtio_rdma_query_port)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_CREATE_CQ, virtio_rdma_create_cq)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_DESTROY_CQ, virtio_rdma_destroy_cq)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_CREATE_PD, virtio_rdma_create_pd)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_DESTROY_PD, virtio_rdma_destroy_pd)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_GET_DMA_MR, virtio_rdma_get_dma_mr)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_CREATE_MR, virtio_rdma_create_mr)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_MAP_MR_SG, virtio_rdma_map_mr_sg)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_REG_USER_MR, virtio_rdma_reg_user_mr)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_DEREG_MR, virtio_rdma_dereg_mr)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_CREATE_QP, virtio_rdma_create_qp)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_MODIFY_QP, virtio_rdma_modify_qp)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_QUERY_QP, virtio_rdma_query_qp)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_DESTROY_QP, virtio_rdma_destroy_qp)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_QUERY_GID, virtio_rdma_query_gid)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_CREATE_UC, virtio_rdma_create_uc)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_DEALLOC_UC, virtio_rdma_dealloc_uc)
+    DEFINE_VIRTIO_RDMA_CMD(VIRTIO_CMD_QUERY_PKEY, virtio_rdma_query_pkey)
+};
+
+static void virtio_rdma_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtIORdma *r = VIRTIO_RDMA(vdev);
+    struct control_buf cb;
+    VirtQueueElement *e;
+    size_t s;
+
+    virtio_queue_set_notification(vq, 0);
+
+    for (;;) {
+        e = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!e) {
+            break;
+        }
+
+        if (iov_size(e->in_sg, e->in_num) < sizeof(cb.status) ||
+            iov_size(e->out_sg, e->out_num) < sizeof(cb.cmd)) {
+            virtio_error(vdev, "Got invalid message size");
+            virtqueue_detach_element(vq, e, 0);
+            g_free(e);
+            break;
+        }
+
+        s = iov_to_buf(&e->out_sg[0], 1, 0, &cb.cmd, sizeof(cb.cmd));
+        if (s != sizeof(cb.cmd)) {
+            cb.status = VIRTIO_RDMA_CTRL_ERR;
+        } else {
+            if (cb.cmd >= VIRTIO_MAX_CMD_NUM) {
+                rdma_warn_report("unknown cmd %d\n", cb.cmd);
+                cb.status = VIRTIO_RDMA_CTRL_ERR;
+            } else {
+                printf("cmd=%d %s\n", cb.cmd, cmd_tbl[cb.cmd].name);
+                if (cmd_tbl[cb.cmd].handler) {
+                    cb.status = cmd_tbl[cb.cmd].handler(r, &e->out_sg[1],
+                                                        &e->in_sg[0]);
+                } else {
+                    rdma_warn_report("no handler for cmd %d\n", cb.cmd);
+                    cb.status = VIRTIO_RDMA_CTRL_ERR;
+                }
+            }
+        }
+        printf("status=%d\n", cb.status);
+        s = iov_from_buf(&e->in_sg[1], 1, 0, &cb.status, sizeof(cb.status));
+        assert(s == sizeof(cb.status));
+
+        virtqueue_push(vq, e, sizeof(cb.status));
+        g_free(e);
+        virtio_notify(vdev, vq);
+    }
+
+    virtio_queue_set_notification(vq, 1);
+}
+
+static void g_free_destroy(gpointer data) {
+    g_free(data);
+}
+
+static void virtio_rdma_device_realize(DeviceState *dev, Error **errp)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIORdma *r = VIRTIO_RDMA(dev);
+    int rc, i;
+
+    rc = virtio_rdma_init_ib(r);
+    if (rc) {
+        rdma_error_report("Fail to initialize IB layer");
+        return;
+    }
+
+    virtio_init(vdev, "virtio-rdma", VIRTIO_ID_RDMA, 1024);
+
+    r->lkey_mr_tbl = g_hash_table_new_full(g_int_hash, g_int_equal, g_free_destroy, NULL);
+
+    r->ctrl_vq = virtio_add_queue(vdev, 64, virtio_rdma_handle_ctrl);
+
+    r->cq_vqs = g_malloc0_n(64, sizeof(*r->cq_vqs));
+    for (i = 0; i < 64; i++) {
+        r->cq_vqs[i] = virtio_add_queue(vdev, 64, NULL);
+    }
+
+    r->qp_vqs = g_malloc0_n(64 * 2, sizeof(*r->qp_vqs));
+    for (i = 0; i < 64 * 2; i += 2) {
+        r->qp_vqs[i] = virtio_add_queue(vdev, 64, virtio_rdma_handle_sq);
+        r->qp_vqs[i+1] = virtio_add_queue(vdev, 64, virtio_rdma_handle_rq);
+    }
+}
+
+static void virtio_rdma_device_unrealize(DeviceState *dev)
+{
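+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIORdma *r = VIRTIO_RDMA(dev);
+    int i;
+
+    /* realize added 1 ctrl queue, 64 CQ queues and 64 * 2 QP queues */
+    for (i = 0; i < 1 + 64 + 64 * 2; i++) {
+        virtio_del_queue(vdev, i);
+    }
+    g_free(r->cq_vqs);
+    g_free(r->qp_vqs);
+    g_hash_table_destroy(r->lkey_mr_tbl);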
+
+    virtio_cleanup(vdev);
+
+    virtio_rdma_fini_ib(r);
+}
+
+static uint64_t virtio_rdma_get_features(VirtIODevice *vdev, uint64_t features,
+                                        Error **errp)
+{
+    /* virtio_add_feature(&features, VIRTIO_NET_F_MAC); */
+
+    vdev->backend_features = features;
+
+    return features;
+}
+
+
+static Property virtio_rdma_dev_properties[] = {
+    DEFINE_PROP_STRING("netdev", VirtIORdma, backend_eth_device_name),
+    DEFINE_PROP_STRING("ibdev",VirtIORdma, backend_device_name),
+    DEFINE_PROP_UINT8("ibport", VirtIORdma, backend_port_num, 1),
+    DEFINE_PROP_UINT64("dev-caps-max-mr-size", VirtIORdma, dev_attr.max_mr_size,
+                       MAX_MR_SIZE),
+    DEFINE_PROP_INT32("dev-caps-max-qp", VirtIORdma, dev_attr.max_qp, MAX_QP),
+    DEFINE_PROP_INT32("dev-caps-max-cq", VirtIORdma, dev_attr.max_cq, MAX_CQ),
+    DEFINE_PROP_INT32("dev-caps-max-mr", VirtIORdma, dev_attr.max_mr, MAX_MR),
+    DEFINE_PROP_INT32("dev-caps-max-pd", VirtIORdma, dev_attr.max_pd, MAX_PD),
+    DEFINE_PROP_INT32("dev-caps-qp-rd-atom", VirtIORdma,
+                       dev_attr.max_qp_rd_atom, MAX_QP_RD_ATOM),
+    DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", VirtIORdma,
+                      dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
+    DEFINE_PROP_INT32("dev-caps-max-ah", VirtIORdma, dev_attr.max_ah, MAX_AH),
+    DEFINE_PROP_INT32("dev-caps-max-srq", VirtIORdma, dev_attr.max_srq, MAX_SRQ),
+    DEFINE_PROP_CHR("mad-chardev", VirtIORdma, mad_chr),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+struct virtio_rdma_config {
+    int32_t max_cq;
+};
+
+static void virtio_rdma_get_config(VirtIODevice *vdev, uint8_t *config)
+{
+    VirtIORdma *r = VIRTIO_RDMA(vdev);
+    struct virtio_rdma_config cfg;
+
+    cfg.max_cq = r->dev_attr.max_cq;
+
+    memcpy(config, &cfg, sizeof(cfg));
+}
+
+static void virtio_rdma_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+
+    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+    vdc->realize = virtio_rdma_device_realize;
+    vdc->unrealize = virtio_rdma_device_unrealize;
+    vdc->get_features = virtio_rdma_get_features;
+    vdc->get_config = virtio_rdma_get_config;
+
+    dc->desc = "Virtio RDMA Device";
+    device_class_set_props(dc, virtio_rdma_dev_properties);
+    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+}
+
+static const TypeInfo virtio_rdma_info = {
+    .name = TYPE_VIRTIO_RDMA,
+    .parent = TYPE_VIRTIO_DEVICE,
+    .instance_size = sizeof(VirtIORdma),
+    .class_init = virtio_rdma_class_init,
+};
+
+static void virtio_register_types(void)
+{
+    type_register_static(&virtio_rdma_info);
+}
+
+type_init(virtio_register_types)
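
cmd_tbl above is indexed directly by a guest-supplied command number, so
the index has to be range-checked before any table access, including the
name lookup used for logging. A minimal standalone sketch of that
dispatch pattern follows. All demo_* and DEMO_* names are hypothetical
and the handlers are trivial placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command ids and status codes, for illustration only. */
enum { DEMO_CMD_PING, DEMO_CMD_ECHO, DEMO_CMD_UNUSED, DEMO_CMD_MAX };
enum { DEMO_CTRL_OK = 0, DEMO_CTRL_ERR = 1 };

typedef int (*demo_handler_fn)(uint32_t arg, uint32_t *result);

static int demo_ping(uint32_t arg, uint32_t *result)
{
    (void)arg;
    *result = 1;
    return DEMO_CTRL_OK;
}

static int demo_echo(uint32_t arg, uint32_t *result)
{
    *result = arg;
    return DEMO_CTRL_OK;
}

/* Designated initializers keep each entry at the index of its command. */
#define DEMO_CMD(cmd, fn) [cmd] = { fn, #cmd }

static const struct {
    demo_handler_fn handler;
    const char *name;
} demo_tbl[DEMO_CMD_MAX] = {
    DEMO_CMD(DEMO_CMD_PING, demo_ping),
    DEMO_CMD(DEMO_CMD_ECHO, demo_echo),
    /* DEMO_CMD_UNUSED has no entry, so its handler and name are NULL. */
};

/* Check the index and the handler before calling through the table. */
static int demo_dispatch(uint32_t cmd, uint32_t arg, uint32_t *result)
{
    if (cmd >= DEMO_CMD_MAX || !demo_tbl[cmd].handler) {
        return DEMO_CTRL_ERR;
    }
    return demo_tbl[cmd].handler(arg, result);
}

/* Name lookup with the same guard, safe to call from log statements. */
static const char *demo_cmd_name(uint32_t cmd)
{
    if (cmd >= DEMO_CMD_MAX || !demo_tbl[cmd].name) {
        return "unknown";
    }
    return demo_tbl[cmd].name;
}
```

Sizing the table with DEMO_CMD_MAX ties the bound check to the array
length, so adding a command cannot leave the two out of step.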
diff --git a/hw/rdma/virtio/virtio-rdma-qp.c b/hw/rdma/virtio/virtio-rdma-qp.c
new file mode 100644
index 0000000000..8b95c115cb
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-qp.c
@@ -0,0 +1,241 @@
+/*
+ * Virtio RDMA Device - QP ops
+ *
+ * Copyright (C) 2021 Bytedance Inc.
+ *
+ * Authors:
+ *  Junji Wei <weijunji@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <infiniband/verbs.h>
+#include <malloc.h>
+
+#include "qemu/atomic.h"
+#include "cpu.h"
+
+#include "virtio-rdma-ib.h"
+#include "virtio-rdma-qp.h"
+#include "virtio-rdma-dev-api.h"
+
+#include "../rdma_utils.h"
+#include "../rdma_rm.h"
+#include "../rdma_backend.h"
+
+void virtio_rdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
+{
+    VirtQueueElement *e;
+    VirtQueue *vq;
+    struct CompHandlerCtx *comp_ctx = (struct CompHandlerCtx *)ctx;
+    size_t s;
+    struct virtio_rdma_cqe* cqe;
+
+    vq = comp_ctx->dev->cq_vqs[comp_ctx->cq_handle];
+    e = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!e) {
+        rdma_error_report("pop cq vq failed");
+        goto out; /* no guest buffer available: drop the CQE, still free ctx */
+    }
+
+    cqe = &comp_ctx->cqe;
+    cqe->status = wc->status;
+    cqe->opcode = wc->opcode;
+    cqe->vendor_err = wc->vendor_err;
+    cqe->byte_len = wc->byte_len;
+    cqe->imm_data = wc->imm_data;
+    cqe->src_qp = wc->src_qp;
+    cqe->wc_flags = wc->wc_flags;
+    cqe->pkey_index = wc->pkey_index;
+    cqe->slid = wc->slid;
+    cqe->sl = wc->sl;
+    cqe->dlid_path_bits = wc->dlid_path_bits;
+
+    s = iov_from_buf(&e->in_sg[0], 1, 0, &comp_ctx->cqe, sizeof(comp_ctx->cqe));
+    assert(s == sizeof(comp_ctx->cqe));
+    virtqueue_push(vq, e, sizeof(comp_ctx->cqe));
+    virtio_notify(&comp_ctx->dev->parent_obj, vq);
+out:
+    g_free(e);
+    g_free(comp_ctx);
+}
+
+void virtio_rdma_qp_ops_fini(void)
+{
+    rdma_backend_unregister_comp_handler();
+}
+
+int virtio_rdma_qp_ops_init(void)
+{
+    rdma_backend_register_comp_handler(virtio_rdma_qp_ops_comp_handler);
+
+    return 0;
+}
+
+void virtio_rdma_handle_sq(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtIORdma *dev = VIRTIO_RDMA(vdev);
+    VirtQueueElement *e;
+    struct cmd_post_send cmd;
+    struct ibv_sge *sge;
+    RdmaRmQP *qp;
+    struct virtio_rdma_kernel_mr *kmr;
+    size_t s;
+    int status = 0, i;
+    struct CompHandlerCtx *comp_ctx;
+
+    RdmaRmMR *mr;
+    uint32_t lkey;
+    uint32_t *htbl_key;
+
+    for (;;) {
+        e = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!e) {
+            break;
+        }
+
+        s = iov_to_buf(&e->out_sg[0], 1, 0, &cmd, sizeof(cmd));
+        if (s != sizeof(cmd)) {
+            rdma_error_report("bad cmd");
+            break;
+        }
+
+        qp = rdma_rm_get_qp(dev->rdma_dev_res, cmd.qpn);
+
+        sge = g_malloc0_n(cmd.num_sge, sizeof(*sge));
+        s = iov_to_buf(&e->out_sg[1], 1, 0, sge, cmd.num_sge * sizeof(*sge));
+        if (s != cmd.num_sge * sizeof(*sge)) {
+            rdma_error_report("bad sge");
+            break;
+        }
+
+        if (cmd.is_kernel) {
+            if (cmd.opcode == VIRTIO_RDMA_WR_REG_MR) {
+                mr = rdma_rm_get_mr(dev->rdma_dev_res, cmd.wr.reg.mrn);
+                lkey = mr->lkey;
+                kmr = g_hash_table_lookup(dev->lkey_mr_tbl, &lkey);
+                rdma_rm_alloc_mr(dev->rdma_dev_res, mr->pd_handle, (uint64_t)kmr->virt, kmr->length,
+                     kmr->virt, cmd.wr.reg.access, &kmr->mrn, &kmr->lkey, &kmr->rkey);
+                kmr->real_mr = rdma_rm_get_mr(dev->rdma_dev_res, kmr->mrn);
+                if (cmd.wr.reg.key != mr->lkey) {
+                    // rebuild lkey -> kmr
+                    g_hash_table_remove(dev->lkey_mr_tbl, &lkey);
+
+                    htbl_key = g_malloc0(sizeof(*htbl_key));
+                    *htbl_key = cmd.wr.reg.key;
+
+                    g_hash_table_insert(dev->lkey_mr_tbl, htbl_key, kmr);
+                }
+                goto fin;
+            }
+            /* In kernel mode, map each guest address to its remapped host address */
+            for (i = 0; i < cmd.num_sge; i++) {
+                kmr = g_hash_table_lookup(dev->lkey_mr_tbl, &sge[i].lkey);
+                if (!kmr) {
+                    rdma_error_report("Cannot find MR with lkey %u", sge[i].lkey);
+                    // TODO: handle this error
+                }
+                sge[i].addr = (uint64_t) kmr->virt + (sge[i].addr - kmr->start);
+                sge[i].lkey = kmr->lkey;
+            }
+        }
+        // TODO: copy depend on opcode
+
+        /* Prepare CQE */
+        comp_ctx = g_malloc(sizeof(*comp_ctx));
+        comp_ctx->dev = dev;
+        comp_ctx->cq_handle = qp->send_cq_handle;
+        comp_ctx->cqe.wr_id = cmd.wr_id;
+        comp_ctx->cqe.qp_num = cmd.qpn;
+        comp_ctx->cqe.opcode = IBV_WC_SEND;
+
+        rdma_backend_post_send(dev->backend_dev, &qp->backend_qp, qp->qp_type, sge, 1, 0, NULL, NULL, 0, 0, comp_ctx);
+
+fin:
+        s = iov_from_buf(&e->in_sg[0], 1, 0, &status, sizeof(status));
+        if (s != sizeof(status))
+            break;
+
+        virtqueue_push(vq, e, sizeof(status));
+        g_free(e);
+        g_free(sge);
+        virtio_notify(vdev, vq);
+    }
+}
+
+void virtio_rdma_handle_rq(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtIORdma *dev = VIRTIO_RDMA(vdev);
+    VirtQueueElement *e;
+    struct cmd_post_recv cmd;
+    struct ibv_sge *sge;
+    RdmaRmQP *qp;
+    struct virtio_rdma_kernel_mr *kmr;
+    size_t s;
+    int i, status = 0;
+    struct CompHandlerCtx *comp_ctx;
+
+    for (;;) {
+        e = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!e)
+            break;
+
+        s = iov_to_buf(&e->out_sg[0], 1, 0, &cmd, sizeof(cmd));
+        if (s != sizeof(cmd)) {
+            rdma_error_report("bad cmd");
+            break;
+        }
+
+        qp = rdma_rm_get_qp(dev->rdma_dev_res, cmd.qpn);
+
+        sge = NULL; /* nothing allocated yet; keeps g_free(sge) at end safe */
+        if (!qp->backend_qp.ibqp) {
+            if (qp->qp_type == IBV_QPT_SMI)
+                rdma_error_report("SMI is not supported");
+            if (qp->qp_type == IBV_QPT_GSI)
+                rdma_warn_report("GSI is not supported yet");
+            goto end;
+        }
+        sge = g_malloc0_n(cmd.num_sge, sizeof(*sge));
+        s = iov_to_buf(&e->out_sg[1], 1, 0, sge, cmd.num_sge * sizeof(*sge));
+        if (s != cmd.num_sge * sizeof(*sge)) {
+            rdma_error_report("bad sge");
+            break;
+        }
+
+        if (cmd.is_kernel) {
+            /* In kernel mode, map each guest address to its remapped host address */
+            for (i = 0; i < cmd.num_sge; i++) {
+                kmr = g_hash_table_lookup(dev->lkey_mr_tbl, &sge[i].lkey);
+                if (!kmr) {
+                    rdma_error_report("Cannot find MR with lkey %u", sge[i].lkey);
+                    // TODO: handle this error
+                }
+                sge[i].addr = (uint64_t) kmr->virt + (sge[i].addr - kmr->start);
+                sge[i].lkey = kmr->lkey;
+            }
+        }
+
+        comp_ctx = g_malloc(sizeof(*comp_ctx));
+        comp_ctx->dev = dev;
+        comp_ctx->cq_handle = qp->recv_cq_handle;
+        comp_ctx->cqe.wr_id = cmd.wr_id;
+        comp_ctx->cqe.qp_num = cmd.qpn;
+        comp_ctx->cqe.opcode = IBV_WC_RECV;
+
+        rdma_backend_post_recv(dev->backend_dev, &qp->backend_qp, qp->qp_type, sge, 1, comp_ctx);
+
+end:
+        s = iov_from_buf(&e->in_sg[0], 1, 0, &status, sizeof(status));
+        if (s != sizeof(status))
+            break;
+
+        virtqueue_push(vq, e, sizeof(status));
+        g_free(e);
+        g_free(sge);
+        virtio_notify(vdev, vq);
+    }
+}
diff --git a/hw/rdma/virtio/virtio-rdma-qp.h b/hw/rdma/virtio/virtio-rdma-qp.h
new file mode 100644
index 0000000000..f4d9c755f3
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-qp.h
@@ -0,0 +1,29 @@
+/*
+ * Virtio RDMA Device - QP ops
+ *
+ * Copyright (C) 2021 Bytedance Inc.
+ *
+ * Authors:
+ *  Junji Wei <weijunji@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VIRTIO_RDMA_QP_H
+#define VIRTIO_RDMA_QP_H
+
+#include "qemu/osdep.h"
+#include "qemu/iov.h"
+#include "hw/virtio/virtio-rdma.h"
+
+#include "../rdma_rm.h"
+
+void virtio_rdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc);
+void virtio_rdma_qp_ops_fini(void);
+int virtio_rdma_qp_ops_init(void);
+void virtio_rdma_handle_sq(VirtIODevice *vdev, VirtQueue *vq);
+void virtio_rdma_handle_rq(VirtIODevice *vdev, VirtQueue *vq);
+
+#endif
\ No newline at end of file
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index fbff9bc9d4..4de3d4e985 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -41,6 +41,7 @@ virtio_pci_ss.add(when: 'CONFIG_VIRTIO_9P', if_true: files('virtio-9p-pci.c'))
 virtio_pci_ss.add(when: 'CONFIG_VIRTIO_SCSI', if_true: files('virtio-scsi-pci.c'))
 virtio_pci_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk-pci.c'))
 virtio_pci_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('virtio-net-pci.c'))
+virtio_pci_ss.add(when: 'CONFIG_VIRTIO_RDMA', if_true: files('virtio-rdma-pci.c'))
 virtio_pci_ss.add(when: 'CONFIG_VIRTIO_SERIAL', if_true: files('virtio-serial-pci.c'))
 virtio_pci_ss.add(when: 'CONFIG_VIRTIO_PMEM', if_true: files('virtio-pmem-pci.c'))
 virtio_pci_ss.add(when: 'CONFIG_VIRTIO_IOMMU', if_true: files('virtio-iommu-pci.c'))
diff --git a/hw/virtio/virtio-rdma-pci.c b/hw/virtio/virtio-rdma-pci.c
new file mode 100644
index 0000000000..c4de92c88a
--- /dev/null
+++ b/hw/virtio/virtio-rdma-pci.c
@@ -0,0 +1,110 @@
+/*
+ * Virtio rdma PCI Bindings
+ *
+ * Copyright (C) 2019 Oracle
+ *
+ * Authors:
+ *  Yuval Shaia <yuval.shaia@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+
+#include "hw/virtio/virtio-net-pci.h"
+#include "hw/virtio/virtio-rdma.h"
+#include "virtio-pci.h"
+#include "qapi/error.h"
+#include "hw/qdev-properties.h"
+
+typedef struct VirtIORdmaPCI VirtIORdmaPCI;
+
+/*
+ * virtio-rdma-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_RDMA_PCI "virtio-rdma-pci-base"
+#define VIRTIO_RDMA_PCI(obj) \
+        OBJECT_CHECK(VirtIORdmaPCI, (obj), TYPE_VIRTIO_RDMA_PCI)
+
+struct VirtIORdmaPCI {
+    VirtIOPCIProxy parent_obj;
+    VirtIORdma vdev;
+};
+
+static Property virtio_rdma_properties[] = {
+    DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
+                    VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
+    DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void virtio_rdma_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
+{
+    VirtIORdmaPCI *dev = VIRTIO_RDMA_PCI(vpci_dev);
+    DeviceState *vdev = DEVICE(&dev->vdev);
+    VirtIONetPCI *vnet_pci;
+    PCIDevice *func0;
+
+    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus), errp);
+    object_property_set_bool(OBJECT(vdev), "realized", true, errp);
+
+    func0 = pci_get_function_0(&vpci_dev->pci_dev);
+    /* Fail unless function 0 of this slot is a virtio-net device */
+    if (strcmp(object_get_typename(OBJECT(func0)),
+               TYPE_VIRTIO_NET_PCI_GENERIC)) {
+        error_setg(errp, "Device on %x.0 is type %s but must be %s",
+                   PCI_SLOT(vpci_dev->pci_dev.devfn),
+                   object_get_typename(OBJECT(func0)),
+                   TYPE_VIRTIO_NET_PCI_GENERIC);
+        return;
+    }
+    vnet_pci = VIRTIO_NET_PCI(func0);
+    dev->vdev.netdev = &vnet_pci->vdev;
+}
+
+static void virtio_rdma_pci_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+    VirtioPCIClass *vpciklass = VIRTIO_PCI_CLASS(klass);
+
+    k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
+    k->device_id = PCI_DEVICE_ID_VIRTIO_RDMA;
+    k->revision = VIRTIO_PCI_ABI_VERSION;
+    k->class_id = PCI_CLASS_NETWORK_OTHER;
+    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+    // dc->props_ = virtio_rdma_properties;
+    device_class_set_props(dc, virtio_rdma_properties);
+    vpciklass->realize = virtio_rdma_pci_realize;
+}
+
+static void virtio_rdma_pci_instance_init(Object *obj)
+{
+    VirtIORdmaPCI *dev = VIRTIO_RDMA_PCI(obj);
+
+    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+                                TYPE_VIRTIO_RDMA);
+    /*
+    object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev),
+                              "bootindex", &error_abort);
+    */
+}
+
+static const VirtioPCIDeviceTypeInfo virtio_rdma_pci_info = {
+    .base_name             = TYPE_VIRTIO_RDMA_PCI,
+    .generic_name          = "virtio-rdma-pci",
+    .transitional_name     = "virtio-rdma-pci-transitional",
+    .non_transitional_name = "virtio-rdma-pci-non-transitional",
+    .instance_size = sizeof(VirtIORdmaPCI),
+    .instance_init = virtio_rdma_pci_instance_init,
+    .class_init    = virtio_rdma_pci_class_init,
+};
+
+static void virtio_rdma_pci_register(void)
+{
+    virtio_pci_types_register(&virtio_rdma_pci_info);
+}
+
+type_init(virtio_rdma_pci_register)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 72ce649eee..f976ea9db7 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -89,6 +89,7 @@ extern bool pci_available;
 #define PCI_DEVICE_ID_VIRTIO_PMEM        0x1013
 #define PCI_DEVICE_ID_VIRTIO_IOMMU       0x1014
 #define PCI_DEVICE_ID_VIRTIO_MEM         0x1015
+#define PCI_DEVICE_ID_VIRTIO_RDMA        0x1016
 
 #define PCI_VENDOR_ID_REDHAT             0x1b36
 #define PCI_DEVICE_ID_REDHAT_BRIDGE      0x0001
diff --git a/include/hw/virtio/virtio-rdma.h b/include/hw/virtio/virtio-rdma.h
new file mode 100644
index 0000000000..1ae10deb6a
--- /dev/null
+++ b/include/hw/virtio/virtio-rdma.h
@@ -0,0 +1,58 @@
+/*
+ * Virtio RDMA Device
+ *
+ * Copyright (C) 2019 Oracle
+ *
+ * Authors:
+ *  Yuval Shaia <yuval.shaia@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_VIRTIO_RDMA_H
+#define QEMU_VIRTIO_RDMA_H
+
+#include <glib.h>
+#include <infiniband/verbs.h>
+
+#include "chardev/char-fe.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-net.h"
+
+#define TYPE_VIRTIO_RDMA "virtio-rdma-device"
+#define VIRTIO_RDMA(obj) \
+        OBJECT_CHECK(VirtIORdma, (obj), TYPE_VIRTIO_RDMA)
+
+typedef struct RdmaBackendDev RdmaBackendDev;
+typedef struct RdmaDeviceResources RdmaDeviceResources;
+struct ibv_device_attr;
+
+typedef struct VirtIORdma {
+    VirtIODevice parent_obj;
+    VirtQueue *ctrl_vq;
+    VirtIONet *netdev;
+    RdmaBackendDev *backend_dev;
+    RdmaDeviceResources *rdma_dev_res;
+    CharBackend mad_chr;
+    char *backend_eth_device_name;
+    char *backend_device_name;
+    uint8_t backend_port_num;
+    struct ibv_device_attr dev_attr;
+
+    VirtQueue **cq_vqs;
+    VirtQueue **qp_vqs;
+
+    GHashTable *lkey_mr_tbl;
+
+    /* Active object counts, used to enforce limits; update with qatomic ops */
+    int num_qp;
+    int num_cq;
+    int num_pd;
+    int num_mr;
+    int num_srq;
+    int num_ctx;
+} VirtIORdma;
+
+#endif
diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
index b052355ac7..4c2151bffb 100644
--- a/include/standard-headers/linux/virtio_ids.h
+++ b/include/standard-headers/linux/virtio_ids.h
@@ -48,5 +48,6 @@
 #define VIRTIO_ID_FS           26 /* virtio filesystem */
 #define VIRTIO_ID_PMEM         27 /* virtio pmem */
 #define VIRTIO_ID_MAC80211_HWSIM 29 /* virtio mac80211-hwsim */
+#define VIRTIO_ID_RDMA         30 /* virtio rdma */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
-- 
2.11.0
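
The kernel-mode path in virtio_rdma_handle_sq() and virtio_rdma_handle_rq()
translates each guest SGE address as kmr->virt + (addr - kmr->start) and
leaves the missing-MR case as a TODO. Below is a minimal standalone sketch of
that translation with the failed lookup and the address range checked.
struct kmr_sketch, struct sge_sketch and remap_sge() are hypothetical names
made up for illustration; they are not part of the patch, and the range check
is an assumption about what the handler should enforce.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct virtio_rdma_kernel_mr in the patch. */
struct kmr_sketch {
    uint64_t start;   /* guest address the MR was registered with */
    uint64_t length;  /* length of the MR in bytes */
    uint8_t *virt;    /* host mapping of the same memory */
    uint32_t lkey;    /* lkey of the backing host MR */
};

/* Hypothetical stand-in for struct ibv_sge. */
struct sge_sketch {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/*
 * Translate one guest SGE in place.
 * Returns 0 on success, -1 if the MR lookup failed (kmr == NULL) or the
 * SGE does not lie entirely inside the MR.
 */
static int remap_sge(const struct kmr_sketch *kmr, struct sge_sketch *sge)
{
    uint64_t off;

    if (!kmr || sge->addr < kmr->start) {
        return -1;
    }
    off = sge->addr - kmr->start;
    if (off > kmr->length || sge->length > kmr->length - off) {
        return -1;
    }
    sge->addr = (uint64_t)(uintptr_t)kmr->virt + off;
    sge->lkey = kmr->lkey;
    return 0;
}
```

A caller would presumably report a non-zero status to the guest instead of
posting the work request when remap_sge() fails.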



* Re: [RFC 5/5] hw/virtio-rdma: VirtIO rdma device
  2021-09-02 13:06 ` [RFC 5/5] hw/virtio-rdma: VirtIO rdma device Junji Wei
@ 2021-09-02 15:16   ` Michael S. Tsirkin
  0 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2021-09-02 15:16 UTC (permalink / raw)
  To: Junji Wei
  Cc: dledford, jgg, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare, xieyongji, chaiwen.cc, linux-rdma, virtualization,
	qemu-devel

On Thu, Sep 02, 2021 at 09:06:25PM +0800, Junji Wei wrote:
> diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
> index b052355ac7..4c2151bffb 100644
> --- a/include/standard-headers/linux/virtio_ids.h
> +++ b/include/standard-headers/linux/virtio_ids.h
> @@ -48,5 +48,6 @@
>  #define VIRTIO_ID_FS           26 /* virtio filesystem */
>  #define VIRTIO_ID_PMEM         27 /* virtio pmem */
>  #define VIRTIO_ID_MAC80211_HWSIM 29 /* virtio mac80211-hwsim */
> +#define VIRTIO_ID_RDMA         30 /* virtio rdma */

You can start by registering this with the virtio TC.

>  #endif /* _LINUX_VIRTIO_IDS_H */
> -- 
> 2.11.0



* Re: [RFC 0/5] VirtIO RDMA
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
                   ` (4 preceding siblings ...)
  2021-09-02 13:06 ` [RFC 5/5] hw/virtio-rdma: VirtIO rdma device Junji Wei
@ 2021-09-03  0:57 ` Jason Wang
  2021-09-03  7:41   ` 魏俊吉
  2021-09-15 13:43 ` Jason Gunthorpe
  6 siblings, 1 reply; 14+ messages in thread
From: Jason Wang @ 2021-09-03  0:57 UTC (permalink / raw)
  To: Junji Wei
  Cc: dledford, jgg, mst, yuval.shaia.ml, marcel.apfelbaum,
	Cornelia Huck, Hannes Reinecke, Yongji Xie, 柴稳,
	linux-rdma, virtualization, qemu-devel

On Thu, Sep 2, 2021 at 9:07 PM Junji Wei <weijunji@bytedance.com> wrote:
>
> Hi all,
>
> This RFC aims to reopen the discussion of Virtio RDMA.
> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
> which implemented a frame for Virtio RDMA and a simple
> control path (Not sure if Yuval Shaia has any further
> plan for it).
>
> We try to extend this work and implement a simple
> data-path and a completed control path. Now this can
> work with SEND, RECV and REG_MR in kernel. There is a
> simple test module in this patch that can communicate
> with ibv_rc_pingpong in rdma-core.
>
> During doing this work, we have found some problems and
> would like to ask for some suggestions from community:

I think it would be beneficial if you can post a spec patch.

Thanks



* Re: [RFC 0/5] VirtIO RDMA
  2021-09-03  0:57 ` [RFC 0/5] VirtIO RDMA Jason Wang
@ 2021-09-03  7:41   ` 魏俊吉
  0 siblings, 0 replies; 14+ messages in thread
From: 魏俊吉 @ 2021-09-03  7:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: dledford, jgg, mst, yuval.shaia.ml, marcel.apfelbaum,
	Cornelia Huck, Hannes Reinecke, Yongji Xie, 柴稳,
	linux-rdma, virtualization, qemu-devel


> On Sep 3, 2021, at 8:57 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> On Thu, Sep 2, 2021 at 9:07 PM Junji Wei <weijunji@bytedance.com> wrote:
>> 
>> Hi all,
>> 
>> This RFC aims to reopen the discussion of Virtio RDMA.
>> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
>> which implemented a frame for Virtio RDMA and a simple
>> control path (Not sure if Yuval Shaia has any further
>> plan for it).
>> 
>> We try to extend this work and implement a simple
>> data-path and a completed control path. Now this can
>> work with SEND, RECV and REG_MR in kernel. There is a
>> simple test module in this patch that can communicate
>> with ibv_rc_pingpong in rdma-core.
>> 
>> During doing this work, we have found some problems and
>> would like to ask for some suggestions from community:
> 
> I think it would be beneficial if you can post a spec patch.

Ok, I will do it.

Thanks



* Re: [RFC 0/5] VirtIO RDMA
  2021-09-02 13:06 [RFC 0/5] VirtIO RDMA Junji Wei
                   ` (5 preceding siblings ...)
  2021-09-03  0:57 ` [RFC 0/5] VirtIO RDMA Jason Wang
@ 2021-09-15 13:43 ` Jason Gunthorpe
  2021-09-22 12:08   ` Junji Wei
  6 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2021-09-15 13:43 UTC (permalink / raw)
  To: Junji Wei
  Cc: dledford, mst, jasowang, yuval.shaia.ml, marcel.apfelbaum,
	cohuck, hare, xieyongji, chaiwen.cc, linux-rdma, virtualization,
	qemu-devel

On Thu, Sep 02, 2021 at 09:06:20PM +0800, Junji Wei wrote:
> Hi all,
> 
> This RFC aims to reopen the discussion of Virtio RDMA.
> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
> which implemented a frame for Virtio RDMA and a simple
> control path (Not sure if Yuval Shaia has any further
> plan for it).
> 
> We try to extend this work and implement a simple
> data-path and a completed control path. Now this can
> work with SEND, RECV and REG_MR in kernel. There is a
> simple test module in this patch that can communicate
> with ibv_rc_pingpong in rdma-core.
> 
> During doing this work, we have found some problems and
> would like to ask for some suggestions from community:

These seem like serious problems! Shouldn't these be solved before
sending patches?

> 1. Each qp need two VQ, but qemu default only support 1024 VQ.
>    I think it is possible to multiplex the VQ, since the
>    cmd_post_send carry the qpn in request.

QPs and CQs need to have predictable fixed WQE sizes, I don't know how
you can reasonably expect to map them to a shared queue.

> 2. The virtio-rdma device's gid should equal to host rdma
>    device's gid. This means that we cannot use gid cache in
>    rdma subsystem. And theoretically the gid should also equal
>    to the device's netdev's ip address, how can we deal with
>    this conflict.

You have to follow the correct semantics, the GID flows from the guest
into the host and updates the host's GID table, not the other way
around.
 
> 3. How to support DMA mr? The verbs in host cannot support it.
>    And it seems hard to pin whole guest physical memory in qemu.

Either you have to trap the FRWR in the hypervisor and pin the memory,
remap the MR, etc or you have to pin the entire guest and rely on
something like memory windows to emulate FRWR.
 
> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
>    But it is impossible to change a key of mr using uverbs.

FRMR is more like memory windows in user space, you can't support it
using just regular MRs.

>    In our implementation, we change the key of WR while post_send,
>    but this means the MR can only work with SEND and RECV since we
>    cannot change the key in the remote.

Yes, this is not a realistic solution

> 5. The GSI is not supported now. And we think it's a problem that
>    when the host receive a GSI package, it doesn't know which
>    device it belongs to.

Of course, GSI packets are not virtualized. You need to somehow
capture GSI messages for the entire GID that the guest is using. We
don't have any API to do this in userspace.

Jason


* Re: [RFC 0/5] VirtIO RDMA
  2021-09-15 13:43 ` Jason Gunthorpe
@ 2021-09-22 12:08   ` Junji Wei
  2021-09-22 13:06     ` Leon Romanovsky
  0 siblings, 1 reply; 14+ messages in thread
From: Junji Wei @ 2021-09-22 12:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, mst, Jason Wang, yuval.shaia.ml, marcel.apfelbaum,
	Cornelia Huck, Hannes Reinecke, Yongji Xie, 柴稳,
	RDMA mailing list, virtualization, qemu-devel

> On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Thu, Sep 02, 2021 at 09:06:20PM +0800, Junji Wei wrote:
>> Hi all,
>> 
>> This RFC aims to reopen the discussion of Virtio RDMA.
>> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
>> which implemented a frame for Virtio RDMA and a simple
>> control path (Not sure if Yuval Shaia has any further
>> plan for it).
>> 
>> We try to extend this work and implement a simple
>> data-path and a completed control path. Now this can
>> work with SEND, RECV and REG_MR in kernel. There is a
>> simple test module in this patch that can communicate
>> with ibv_rc_pingpong in rdma-core.
>> 
>> During doing this work, we have found some problems and
>> would like to ask for some suggestions from community:
> 
> These seem like serious problems! Shouldn't these be solved before
> sending patches?
> 
>> 1. Each qp need two VQ, but qemu default only support 1024 VQ.
>>   I think it is possible to multiplex the VQ, since the
>>   cmd_post_send carry the qpn in request.
> 
> QPs and CQs need to have predictable fixed WQE sizes, I don't know how
> you can reasonably expect to map them to a shared queue.

Yes, it is a bad idea to multiplex the VQ. If we need more VQs,
we can extend QEMU and the virtio spec.
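
To put a rough number on the current limit, here is a small sketch of the
arithmetic. It assumes one control VQ, one VQ per CQ and two VQs (send and
receive) per QP; that layout and the names VQ_LIMIT_SKETCH and
max_qps_for_budget() are assumptions made for illustration, not taken from
the patch. The value 1024 is QEMU's VIRTIO_QUEUE_MAX.

```c
#include <stdint.h>

#define VQ_LIMIT_SKETCH 1024u   /* QEMU's VIRTIO_QUEUE_MAX */

/*
 * How many QPs fit in a VQ budget, assuming 1 control VQ, 1 VQ per CQ
 * and 2 VQs per QP. Returns 0 if the fixed VQs already use the budget.
 */
static uint32_t max_qps_for_budget(uint32_t vq_limit, uint32_t num_cq)
{
    uint32_t fixed = 1u + num_cq;

    if (fixed >= vq_limit) {
        return 0;
    }
    return (vq_limit - fixed) / 2u;
}
```

Under these assumptions, 64 CQs leave room for 479 QPs, which is small for an
RDMA device and is why more VQs would be needed.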

>> 2. The virtio-rdma device's gid should equal to host rdma
>>   device's gid. This means that we cannot use gid cache in
>>   rdma subsystem. And theoretically the gid should also equal
>>   to the device's netdev's ip address, how can we deal with
>>   this conflict.
> 
> You have to follow the correct semantics, the GID flows from the guest
> into the host and updates the host's GID table, not the other way
> around.

Sure, this was my misunderstanding.

>> 3. How to support DMA mr? The verbs in host cannot support it.
>>   And it seems hard to pin whole guest physical memory in qemu.
> 
> Either you have to trap the FRWR in the hypervisor and pin the memory,
> remap the MR, etc or you have to pin the entire guest and rely on
> something like memory windows to emulate FRWR.

We want to implement an emulated RDMA device in userspace. Since
QEMU can directly access the guest's physical memory, it will be
easy to support a DMA MR.

>> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
>>   But it is impossible to change a key of mr using uverbs.
> 
> FRMR is more like memory windows in user space, you can't support it
> using just regular MRs.

It is hard to support this using uverbs, but it is easy to support
with uRDMA, where we have full control of the MRs.

>> 5. The GSI is not supported now. And we think it's a problem that
>>   when the host receive a GSI package, it doesn't know which
>>   device it belongs to.
> 
> Of course, GSI packets are not virtualized. You need to somehow
> capture GSI messages for the entire GID that the guest is using. We
> don't have any API to do this in userspace.

If we implement a uRDMA device in QEMU, there is no need to work out
which device a packet belongs to, because there is only one device.

Thanks.

Junji


* Re: [RFC 0/5] VirtIO RDMA
  2021-09-22 12:08   ` Junji Wei
@ 2021-09-22 13:06     ` Leon Romanovsky
  2021-09-22 13:37       ` 魏俊吉
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2021-09-22 13:06 UTC (permalink / raw)
  To: Junji Wei
  Cc: Jason Gunthorpe, Doug Ledford, mst, Jason Wang, yuval.shaia.ml,
	marcel.apfelbaum, Cornelia Huck, Hannes Reinecke, Yongji Xie,
	柴稳,
	RDMA mailing list, virtualization, qemu-devel

On Wed, Sep 22, 2021 at 08:08:44PM +0800, Junji Wei wrote:
> > On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:

<...>

> >> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
> >>   But it is impossible to change a key of mr using uverbs.
> > 
> > FRMR is more like memory windows in user space, you can't support it
> > using just regular MRs.
> 
> It is hard to support this using uverbs, but it is easy to support
> with uRDMA, where we have full control of the MRs.

What is uRDMA?

Thanks


* Re: Re: [RFC 0/5] VirtIO RDMA
  2021-09-22 13:06     ` Leon Romanovsky
@ 2021-09-22 13:37       ` 魏俊吉
  2021-09-22 13:59         ` Leon Romanovsky
  0 siblings, 1 reply; 14+ messages in thread
From: 魏俊吉 @ 2021-09-22 13:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Doug Ledford, mst, Jason Wang, yuval.shaia.ml,
	marcel.apfelbaum, Cornelia Huck, Hannes Reinecke, Yongji Xie,
	柴稳,
	RDMA mailing list, virtualization, qemu-devel

On Wed, Sep 22, 2021 at 9:06 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Sep 22, 2021 at 08:08:44PM +0800, Junji Wei wrote:
> > > On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> <...>
>
> > >> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
> > >>   But it is impossible to change a key of mr using uverbs.
> > >
> > > FRMR is more like memory windows in user space, you can't support it
> > > using just regular MRs.
> >
> > It is hard to support this using uverbs, but it is easy to support
> > with uRDMA, where we have full control of the MRs.
>
> What is uRDMA?

uRDMA is a software implementation of the RoCEv2 protocol like rxe.
We will implement it in QEMU with VFIO or DPDK.

Thanks.
Junji


* Re: Re: [RFC 0/5] VirtIO RDMA
  2021-09-22 13:37       ` 魏俊吉
@ 2021-09-22 13:59         ` Leon Romanovsky
  0 siblings, 0 replies; 14+ messages in thread
From: Leon Romanovsky @ 2021-09-22 13:59 UTC (permalink / raw)
  To: 魏俊吉
  Cc: Jason Gunthorpe, Doug Ledford, mst, Jason Wang, yuval.shaia.ml,
	marcel.apfelbaum, Cornelia Huck, Hannes Reinecke, Yongji Xie,
	柴稳,
	RDMA mailing list, virtualization, qemu-devel

On Wed, Sep 22, 2021 at 09:37:37PM +0800, 魏俊吉 wrote:
> On Wed, Sep 22, 2021 at 9:06 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Sep 22, 2021 at 08:08:44PM +0800, Junji Wei wrote:
> > > > On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > <...>
> >
> > > >> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
> > > >>   But it is impossible to change a key of mr using uverbs.
> > > >
> > > > FRMR is more like memory windows in user space, you can't support it
> > > > using just regular MRs.
> > >
> > > It is hard to support this using uverbs, but it is easy to support
> > > with uRDMA, where we have full control of the MRs.
> >
> > What is uRDMA?
> 
> uRDMA is a software implementation of the RoCEv2 protocol like rxe.
> We will implement it in QEMU with VFIO or DPDK.

ok, thanks

> 
> Thanks.
> Junji

