* [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation
@ 2018-02-14 19:22 Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 01/10] mem: add share parameter to memory-backend-ram Marcel Apfelbaum
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

V10 -> V11:
 - Addressed Michael S. Tsirkin comments:
    - Split the standard-headers patch in two, one dealing with the
      update-linux-headers script while the other adds the imported headers.
    - Add comments to the update-linux-headers script explaining
      the sed transformations.
 - Added Zhu Yanjun's R-B tags (rdma patches review -- Thanks!)
 - Added Gal Hammer's R-B tag (update-linux-headers patch review -- Thanks!)
 - Rebased on latest master

V9 -> V10:
 - Addressed Peter Maydell's comments:
   - Modified license to "version 2 or any later version"
   - Added license comment on top of code files
   - Move the kernel headers to "standard-headers" and modified
     the update-linux-headers script to import them and update
     the types for QEMU. (this patch has no R-B tag, maybe someone can
     have a look, I am not sure who can review it)
   - Got an R-B from Eduardo on memory-ram-backend patch (thanks!)
   - Split the pvrdma implementation patch into 6 patches,
     preserving Dotan Barak R-B since no semantic changes were made,
     only a mechanical split.
 - Rebased on latest master

V8 -> V9:
 - Addressed Dotan Barak's (offline) comments:
   - use g_malloc instead of malloc
   - re-arrange structs for better padding
   - some cosmetic changes
   - do not try to fetch CQE when CQ is going down
   - init state of QP changed to RESET
   - modify poll_cq
   - add fix to qkey handling so now qkey=0 is also supported
   - add validation to gid_index
   - fix memory leak with ah_key ref
 - Addressed Eduardo Habkost comments:
   - Add the mem-backed-ram "share" option to qemu-options.hx.
 - Rebased on latest master

V7 -> V8:
 - Addressed Michael S. Tsirkin comments:
   - fail to init the pvrdma device if the target page size is different
     from the host page size, or if the guest RAM is not backed by a
     memory backend marked as shared.
   - Update documentation to include a note on huge memory region
     registration and remove unneeded info.
 - Removed "pci/shpc: Move function to generic header file" since it
   appears in latest maintainer pull request
 - Rebased on master

V6 -> V7:
 - Addressed Philippe Mathieu-Daudé comments:
   - modified pow2roundup32 signature
   - added his R-B tag (thanks)
 - Addressed Cornelia Huck comments:
   - Compiled the pvrdma for all archs and not only x86/arm (thanks)
   - Fixed typo in documentation
 - Rebased on latest master

V5 -> V6:
 - Found a ppc machine and solved ppc compilation issues
 - Tried to fix the s390x issue (still looking for a machine)

V4 -> V5:
 - Fixed (at least tried to) compilation issues

V3 -> V4:
 - Fixed documentation (added more impl details)
 - Fixed compilation errors discovered by patchew.
 - Addressed Michael S. Tsirkin comments:
   - Removed unnecessary typedefs and replaced them with
     macros in VMware header files, together with explanations.
   - Moved more code from vmw specific to rdma generic code.
   - Added page size limitations to the documentation.

V2 -> V3:
 - Addressed Michael S. Tsirkin and Philippe Mathieu-Daudé comments:
   - Moved the device to hw/rdma
 - Addressed Michael S. Tsirkin comments:
   - Split the code into generic (hw/rdma) and VMWare
     specific (hw/rdma/vmw)
   - Added more details to documentation - VMware guest-host protocol.
   - Remove mad processing
   - limited the memory the Guest can pin.
 - Addressed Philippe Mathieu-Daudé comment:
   - s/roundup_pow_of_two/pow2roundup32 and move it to qemu/host-utils.h 
 - Added Shamir Rabinovitch's review to documentation
 - Rebased to latest master 

RFC -> V2:
 - Full implementation of the pvrdma device
 - Backend is an ibdevice interface, no need for the KDBR module


General description
===================
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux Kernel driver AS IS, no need for any special guest
modifications.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines. It does not require an RDMA HCA in the host and
can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit. Migration support, although not implemented yet, will be
possible with some HW assistance.


 Design
 ======
 - Follows the behavior of VMware's pvrdma device, however it is not tightly
   coupled with it and most of the code can be reused if we decide to
   move on to a Virtio-based RDMA device.

 - It exposes 3 BARs:
    BAR 0 - MSI-X, utilizes 3 vectors for the command ring, async events and
            completions
    BAR 1 - Configuration registers
    BAR 2 - UAR, used to pass HW commands from the driver.

 - The device performs internal management of the RDMA
   resources (PDs, CQs, QPs, ...), meaning the objects
   are not directly coupled to physical RDMA device resources.

The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices,
each one requiring a separate instance (rxe or SR-IOV VF).
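
For example, a backend/device pairing can look like the sketch below,
condensed from the docs/pvrdma.txt added later in this series (the <slot>,
MAC, GID index and port values are placeholders to be filled per setup):

   # Host: expose an ibdevice by associating an Ethernet interface with rxe;
   # this creates an rxe0 interface usable as the pvrdma backend.
   rxe_cfg add eth0

   # Guest RAM must come from a memory backend marked as shared (see patch 01):
   qemu-system-x86_64 ... \
      -m 1G -object memory-backend-ram,id=mb1,size=1G,share=on \
      -numa node,memdev=mb1 \
      -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC> \
      -device pvrdma,addr=<slot>.1,backend-dev=rxe0,backend-gid-idx=<gid>,backend-port=<port>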


Tests and performance
=====================
Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
and Mellanox ConnectX4 HCAs with:
  - VMs in the same host
  - VMs in different hosts 
  - VMs to bare metal.

The best performance was achieved with the ConnectX HCAs and buffer sizes
bigger than 1MB, reaching the line rate of ~50Gb/s.
The conclusion is that with the PVRDMA device there are no
actual performance penalties compared to bare metal for big enough
buffers (which is quite common when using RDMA), while allowing
memory overcommit.
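
The numbers above can be reproduced with a standard RDMA bandwidth test
between the two endpoints; the tool below (ib_write_bw from the perftest
suite) is only an assumption, the exact benchmark used is not named here:

   # receiver side
   ib_write_bw -d <ibdevice>
   # sender side, 1MB messages as discussed above
   ib_write_bw -d <ibdevice> -s 1048576 <receiver address>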

Marcel Apfelbaum (5):
  mem: add share parameter to memory-backend-ram
  docs: add pvrdma device documentation.
  scripts/update-linux-headers: import pvrdma headers
  include/standard-headers: add pvrdma related headers
  MAINTAINERS: add entry for hw/rdma

Yuval Shaia (5):
  hw/rdma: Add wrappers and macros
  hw/rdma: Definitions for rdma device and rdma resource manager
  hw/rdma: Implementation of generic rdma device layers
  hw/rdma: PVRDMA commands and data-path ops
  hw/rdma: Implementation of PVRDMA device

 MAINTAINERS                                        |   8 +
 Makefile.objs                                      |   2 +
 backends/hostmem-file.c                            |  25 +-
 backends/hostmem-ram.c                             |   4 +-
 backends/hostmem.c                                 |  21 +
 configure                                          |   9 +-
 docs/pvrdma.txt                                    | 255 +++++++
 exec.c                                             |  26 +-
 hw/Makefile.objs                                   |   1 +
 hw/rdma/Makefile.objs                              |   5 +
 hw/rdma/rdma_backend.c                             | 818 +++++++++++++++++++++
 hw/rdma/rdma_backend.h                             |  98 +++
 hw/rdma/rdma_backend_defs.h                        |  62 ++
 hw/rdma/rdma_rm.c                                  | 544 ++++++++++++++
 hw/rdma/rdma_rm.h                                  |  69 ++
 hw/rdma/rdma_rm_defs.h                             | 104 +++
 hw/rdma/rdma_utils.c                               |  51 ++
 hw/rdma/rdma_utils.h                               |  43 ++
 hw/rdma/trace-events                               |   5 +
 hw/rdma/vmw/pvrdma.h                               | 122 +++
 hw/rdma/vmw/pvrdma_cmd.c                           | 673 +++++++++++++++++
 hw/rdma/vmw/pvrdma_dev_ring.c                      | 155 ++++
 hw/rdma/vmw/pvrdma_dev_ring.h                      |  42 ++
 hw/rdma/vmw/pvrdma_main.c                          | 670 +++++++++++++++++
 hw/rdma/vmw/pvrdma_qp_ops.c                        | 222 ++++++
 hw/rdma/vmw/pvrdma_qp_ops.h                        |  27 +
 hw/rdma/vmw/trace-events                           |   5 +
 include/exec/memory.h                              |  23 +
 include/exec/ram_addr.h                            |   3 +-
 include/hw/pci/pci_ids.h                           |   3 +
 include/qemu/osdep.h                               |   2 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h      | 667 +++++++++++++++++
 .../drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h | 114 +++
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h        | 383 ++++++++++
 include/standard-headers/rdma/vmw_pvrdma-abi.h     | 293 ++++++++
 include/sysemu/hostmem.h                           |   2 +-
 include/sysemu/kvm.h                               |   2 +-
 memory.c                                           |  16 +-
 qemu-options.hx                                    |  10 +-
 scripts/update-linux-headers.sh                    |  30 +
 target/s390x/kvm.c                                 |   4 +-
 util/oslib-posix.c                                 |   4 +-
 util/oslib-win32.c                                 |   2 +-
 43 files changed, 5570 insertions(+), 54 deletions(-)
 create mode 100644 docs/pvrdma.txt
 create mode 100644 hw/rdma/Makefile.objs
 create mode 100644 hw/rdma/rdma_backend.c
 create mode 100644 hw/rdma/rdma_backend.h
 create mode 100644 hw/rdma/rdma_backend_defs.h
 create mode 100644 hw/rdma/rdma_rm.c
 create mode 100644 hw/rdma/rdma_rm.h
 create mode 100644 hw/rdma/rdma_rm_defs.h
 create mode 100644 hw/rdma/rdma_utils.c
 create mode 100644 hw/rdma/rdma_utils.h
 create mode 100644 hw/rdma/trace-events
 create mode 100644 hw/rdma/vmw/pvrdma.h
 create mode 100644 hw/rdma/vmw/pvrdma_cmd.c
 create mode 100644 hw/rdma/vmw/pvrdma_dev_ring.c
 create mode 100644 hw/rdma/vmw/pvrdma_dev_ring.h
 create mode 100644 hw/rdma/vmw/pvrdma_main.c
 create mode 100644 hw/rdma/vmw/pvrdma_qp_ops.c
 create mode 100644 hw/rdma/vmw/pvrdma_qp_ops.h
 create mode 100644 hw/rdma/vmw/trace-events
 create mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
 create mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h
 create mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
 create mode 100644 include/standard-headers/rdma/vmw_pvrdma-abi.h

-- 
2.13.5


* [Qemu-devel] [PATCH V11 01/10] mem: add share parameter to memory-backend-ram
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 02/10] docs: add pvrdma device documentation Marcel Apfelbaum
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

Currently only the file-backed memory backend can
be created with a "share" flag in order to allow
sharing guest RAM with other processes in the host.

Add the "share" flag also to RAM Memory Backend
in order to allow remapping parts of the guest RAM
to different host virtual addresses. This is needed
by the RDMA devices in order to remap non-contiguous
QEMU virtual addresses to a contiguous virtual address range.

Moved the "share" flag to the Host Memory base class,
modified phys_mem_alloc to include the new parameter
and a new interface memory_region_init_ram_shared_nomigrate.

There are no functional changes if the new flag is not used.
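
For illustration, a minimal command-line sketch of the new option (matching
the pvrdma usage example in the documentation patch later in this series):

   -object memory-backend-ram,id=mb1,size=1G,share=on \
   -numa node,memdev=mb1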

Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 backends/hostmem-file.c  | 25 +------------------------
 backends/hostmem-ram.c   |  4 ++--
 backends/hostmem.c       | 21 +++++++++++++++++++++
 exec.c                   | 26 +++++++++++++++-----------
 include/exec/memory.h    | 23 +++++++++++++++++++++++
 include/exec/ram_addr.h  |  3 ++-
 include/qemu/osdep.h     |  2 +-
 include/sysemu/hostmem.h |  2 +-
 include/sysemu/kvm.h     |  2 +-
 memory.c                 | 16 +++++++++++++---
 qemu-options.hx          | 10 +++++++++-
 target/s390x/kvm.c       |  4 ++--
 util/oslib-posix.c       |  4 ++--
 util/oslib-win32.c       |  2 +-
 14 files changed, 94 insertions(+), 50 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index e319ec1ad8..134b08d63a 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -31,7 +31,6 @@ typedef struct HostMemoryBackendFile HostMemoryBackendFile;
 struct HostMemoryBackendFile {
     HostMemoryBackend parent_obj;
 
-    bool share;
     bool discard_data;
     char *mem_path;
     uint64_t align;
@@ -59,7 +58,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         path = object_get_canonical_path(OBJECT(backend));
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
-                                 backend->size, fb->align, fb->share,
+                                 backend->size, fb->align, backend->share,
                                  fb->mem_path, errp);
         g_free(path);
     }
@@ -86,25 +85,6 @@ static void set_mem_path(Object *o, const char *str, Error **errp)
     fb->mem_path = g_strdup(str);
 }
 
-static bool file_memory_backend_get_share(Object *o, Error **errp)
-{
-    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
-
-    return fb->share;
-}
-
-static void file_memory_backend_set_share(Object *o, bool value, Error **errp)
-{
-    HostMemoryBackend *backend = MEMORY_BACKEND(o);
-    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
-
-    if (host_memory_backend_mr_inited(backend)) {
-        error_setg(errp, "cannot change property value");
-        return;
-    }
-    fb->share = value;
-}
-
 static bool file_memory_backend_get_discard_data(Object *o, Error **errp)
 {
     return MEMORY_BACKEND_FILE(o)->discard_data;
@@ -171,9 +151,6 @@ file_backend_class_init(ObjectClass *oc, void *data)
     bc->alloc = file_backend_memory_alloc;
     oc->unparent = file_backend_unparent;
 
-    object_class_property_add_bool(oc, "share",
-        file_memory_backend_get_share, file_memory_backend_set_share,
-        &error_abort);
     object_class_property_add_bool(oc, "discard-data",
         file_memory_backend_get_discard_data, file_memory_backend_set_discard_data,
         &error_abort);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 38977be73e..7ddd08d370 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -28,8 +28,8 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     path = object_get_canonical_path_component(OBJECT(backend));
-    memory_region_init_ram_nomigrate(&backend->mr, OBJECT(backend), path,
-                           backend->size, errp);
+    memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), path,
+                           backend->size, backend->share, errp);
     g_free(path);
 }
 
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 81d14554a7..8aa0412032 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -368,6 +368,24 @@ static void set_id(Object *o, const char *str, Error **errp)
     backend->id = g_strdup(str);
 }
 
+static bool host_memory_backend_get_share(Object *o, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    return backend->share;
+}
+
+static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(errp, "cannot change property value");
+        return;
+    }
+    backend->share = value;
+}
+
 static void
 host_memory_backend_class_init(ObjectClass *oc, void *data)
 {
@@ -398,6 +416,9 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_policy,
         host_memory_backend_set_policy, &error_abort);
     object_class_property_add_str(oc, "id", get_id, set_id, &error_abort);
+    object_class_property_add_bool(oc, "share",
+        host_memory_backend_get_share, host_memory_backend_set_share,
+        &error_abort);
 }
 
 static void host_memory_backend_finalize(Object *o)
diff --git a/exec.c b/exec.c
index e8d7b335b6..4d8addb263 100644
--- a/exec.c
+++ b/exec.c
@@ -1285,7 +1285,7 @@ static int subpage_register (subpage_t *mmio, uint32_t start, uint32_t end,
                              uint16_t section);
 static subpage_t *subpage_init(FlatView *fv, hwaddr base);
 
-static void *(*phys_mem_alloc)(size_t size, uint64_t *align) =
+static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared) =
                                qemu_anon_ram_alloc;
 
 /*
@@ -1293,7 +1293,7 @@ static void *(*phys_mem_alloc)(size_t size, uint64_t *align) =
  * Accelerators with unusual needs may need this.  Hopefully, we can
  * get rid of it eventually.
  */
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align))
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared))
 {
     phys_mem_alloc = alloc;
 }
@@ -1921,7 +1921,7 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
-static void ram_block_add(RAMBlock *new_block, Error **errp)
+static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
 {
     RAMBlock *block;
     RAMBlock *last_block = NULL;
@@ -1944,7 +1944,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             }
         } else {
             new_block->host = phys_mem_alloc(new_block->max_length,
-                                             &new_block->mr->align);
+                                             &new_block->mr->align, shared);
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2049,7 +2049,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    ram_block_add(new_block, &local_err);
+    ram_block_add(new_block, &local_err, share);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2091,7 +2091,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
                                                   uint64_t length,
                                                   void *host),
-                                  void *host, bool resizeable,
+                                  void *host, bool resizeable, bool share,
                                   MemoryRegion *mr, Error **errp)
 {
     RAMBlock *new_block;
@@ -2114,7 +2114,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     if (resizeable) {
         new_block->flags |= RAM_RESIZEABLE;
     }
-    ram_block_add(new_block, &local_err);
+    ram_block_add(new_block, &local_err, share);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2126,12 +2126,15 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                    MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, host, false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, host, false,
+                                   false, mr, errp);
 }
 
-RAMBlock *qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp)
+RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share,
+                         MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, NULL, false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, NULL, false,
+                                   share, mr, errp);
 }
 
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
@@ -2140,7 +2143,8 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true, mr, errp);
+    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true,
+                                   false, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index fff9b1d871..15e81113ba 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -436,6 +436,29 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       Error **errp);
 
 /**
+ * memory_region_init_ram_shared_nomigrate:  Initialize RAM memory region.
+ *                                           Accesses into the region will
+ *                                           modify memory directly.
+ *
+ * @mr: the #MemoryRegion to be initialized.
+ * @owner: the object that tracks the region's reference count
+ * @name: Region name, becomes part of RAMBlock name used in migration stream
+ *        must be unique within any device
+ * @size: size of the region.
+ * @share: allow remapping RAM to different addresses
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Note that this function is similar to memory_region_init_ram_nomigrate.
+ * The only difference is part of the RAM region can be remapped.
+ */
+void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
+                                             struct Object *owner,
+                                             const char *name,
+                                             uint64_t size,
+                                             bool share,
+                                             Error **errp);
+
+/**
  * memory_region_init_resizeable_ram:  Initialize memory region with resizeable
  *                                     RAM.  Accesses into the region will
  *                                     modify memory directly.  Only an initial
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 7633ef6342..cf2446a176 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -80,7 +80,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  Error **errp);
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
-RAMBlock *qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp);
+RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
+                         Error **errp);
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t max_size,
                                     void (*resized)(const char*,
                                                     uint64_t length,
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index adb3758275..41658060a7 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -255,7 +255,7 @@ extern int daemon(int, int);
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 621a3f9d42..d5ab0b99c6 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -54,7 +54,7 @@ struct HostMemoryBackend {
     char *id;
     uint64_t size;
     bool merge, dump;
-    bool prealloc, force_prealloc, is_mapped;
+    bool prealloc, force_prealloc, is_mapped, share;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index bbf12a1723..85002ac49a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -248,7 +248,7 @@ int kvm_on_sigbus(int code, void *addr);
 
 /* interface with exec.c */
 
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align));
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared));
 
 /* internal API */
 
diff --git a/memory.c b/memory.c
index c7f6588452..6515131ac2 100644
--- a/memory.c
+++ b/memory.c
@@ -1539,11 +1539,21 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
+    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+}
+
+void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
+                                             Object *owner,
+                                             const char *name,
+                                             uint64_t size,
+                                             bool share,
+                                             Error **errp)
+{
     memory_region_init(mr, owner, name, size);
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, mr, errp);
+    mr->ram_block = qemu_ram_alloc(size, share, mr, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
@@ -1654,7 +1664,7 @@ void memory_region_init_rom_nomigrate(MemoryRegion *mr,
     mr->readonly = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, mr, errp);
+    mr->ram_block = qemu_ram_alloc(size, false, mr, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
@@ -1673,7 +1683,7 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, mr, errp);
+    mr->ram_block = qemu_ram_alloc(size, false,  mr, errp);
 }
 
 void memory_region_init_iommu(void *_iommu_mr,
diff --git a/qemu-options.hx b/qemu-options.hx
index 5050a49a5e..8ccd5dcaa6 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3975,6 +3975,14 @@ The @option{share} boolean option determines whether the memory
 region is marked as private to QEMU, or shared. The latter allows
 a co-operating external process to access the QEMU memory region.
 
+The @option{share} is also required for pvrdma devices due to
+limitations in the RDMA API provided by Linux.
+
+Setting share=on might affect the ability to configure NUMA
+bindings for the memory backend under some circumstances, see
+Documentation/vm/numa_memory_policy.txt on the Linux kernel
+source tree for additional details.
+
 Setting the @option{discard-data} boolean option to @var{on}
 indicates that file contents can be destroyed when QEMU exits,
 to avoid unnecessarily flushing data to the backing file.  Note
@@ -4017,7 +4025,7 @@ requires an alignment different than the default one used by QEMU, eg
 the device DAX /dev/dax0.0 requires 2M alignment rather than 4K. In
 such cases, users can specify the required alignment via this option.
 
-@item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
+@item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
 
 Creates a memory backend object, which can be used to back the guest RAM.
 Memory backend objects offer more control than the @option{-m} option that is
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index 0301e9d519..e13c8907df 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -144,7 +144,7 @@ static int cap_gs;
 
 static int active_cmma;
 
-static void *legacy_s390_alloc(size_t size, uint64_t *align);
+static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared);
 
 static int kvm_s390_query_mem_limit(uint64_t *memory_limit)
 {
@@ -752,7 +752,7 @@ int kvm_s390_mem_op(S390CPU *cpu, vaddr addr, uint8_t ar, void *hostbuf,
  * to grow. We also have to use MAP parameters that avoid
  * read-only mapping of guest pages.
  */
-static void *legacy_s390_alloc(size_t size, uint64_t *align)
+static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
 {
     void *mem;
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 4655bc1f89..13b6f8d776 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -127,10 +127,10 @@ void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
 {
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false);
+    void *ptr = qemu_ram_mmap(-1, size, align, shared);
 
     if (ptr == MAP_FAILED) {
         return NULL;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 69a6286d50..bb5ad28bd3 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -67,7 +67,7 @@ void *qemu_memalign(size_t alignment, size_t size)
     return qemu_oom_check(qemu_try_memalign(alignment, size));
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
 {
     void *ptr;
 
-- 
2.13.5


* [Qemu-devel] [PATCH V11 02/10] docs: add pvrdma device documentation.
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 01/10] mem: add share parameter to memory-backend-ram Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 03/10] scripts/update-linux-headers: import pvrdma headers Marcel Apfelbaum
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
---
 docs/pvrdma.txt | 255 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 255 insertions(+)
 create mode 100644 docs/pvrdma.txt

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
new file mode 100644
index 0000000000..5599318159
--- /dev/null
+++ b/docs/pvrdma.txt
@@ -0,0 +1,255 @@
+Paravirtualized RDMA Device (PVRDMA)
+====================================
+
+
+1. Description
+===============
+PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
+It works with its Linux Kernel driver AS IS, no need for any special guest
+modifications.
+
+While it complies with the VMware device, it can also communicate with bare
+metal RDMA-enabled machines. It does not require an RDMA HCA in the host and
+can work with Soft-RoCE (rxe).
+
+It does not require the whole guest RAM to be pinned, allowing memory
+over-commit. Migration support, although not implemented yet, will be
+possible with some HW assistance.
+
+A project presentation accompanies this document:
+- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
+
+
+
+2. Setup
+========
+
+
+2.1 Guest setup
+===============
+Fedora 27+ kernels work out of the box; older distributions
+require updating the kernel to 4.14 to include the pvrdma driver.
+
+However the libpvrdma library needed by User Level Software is still
+not available as part of the distributions, so the rdma-core library
+needs to be compiled and optionally installed.
+
+Please follow the instructions at:
+  https://github.com/linux-rdma/rdma-core.git
+
+
+2.2 Host Setup
+==============
+The pvrdma backend is an ibdevice interface that can be exposed
+either by a Soft-RoCE(rxe) device on machines with no RDMA device,
+or an HCA SRIOV function(VF/PF).
+Note that ibdevice interfaces can't be shared between pvrdma devices,
+each one requiring a separate instance (rxe or SRIOV VF).
+
+
+2.2.1 Soft-RoCE backend(rxe)
+===========================
+A stable version of rxe is required, Fedora 27+ or a Linux
+Kernel 4.14+ is preferred.
+
+The rdma_rxe module is part of the Linux Kernel but not loaded by default.
+Install the User Level library (librxe) following the instructions from:
+https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
+
+Associate an ETH interface with rxe by running:
+   rxe_cfg add eth0
+An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
+
+
+2.2.2 RDMA device Virtual Function backend
+==========================================
+Nothing special is required, the pvrdma device can work not only with
+Ethernet links, but also with Infiniband links.
+All that is needed is an ibdevice with an active port; for Mellanox cards
+it will be something like mlx5_6, which can be used as the backend.
+
+
+2.2.3 QEMU setup
+================
+Configure QEMU with --enable-rdma flag, installing
+the required RDMA libraries.
+
+
+
+3. Usage
+========
+Currently the device works only with memory-backed RAM
+and it must be marked as "shared":
+   -m 1G \
+   -object memory-backend-ram,id=mb1,size=1G,share \
+   -numa node,memdev=mb1 \
+
+The pvrdma device is composed of two functions:
+ - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
+   but is required to pass the ibdevice GID using its MAC.
+   Examples:
+     For an rxe backend using eth0 interface it will use its mac:
+       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
+     For an SRIOV VF, we take the Ethernet Interface exposed by it:
+       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
+ - Function 1 is the actual device:
+       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
+   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
+ Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
+ The rules of conversion are part of the RoCE spec, but since manual conversion
+ is not required, spotting problems is not hard:
+    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
+             MAC: 7c:fe:90:cb:74:3a
+    Note the difference between the first byte of the MAC and the GID.
+
+
+
+4. Implementation details
+=========================
+
+
+4.1 Overview
+============
+The device acts like a proxy between the Guest Driver and the host
+ibdevice interface.
+On configuration path:
+ - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
+   a resource from the backend interface, maintaining a 1-1 mapping
+   between the guest and host.
+On data path:
+ - Every post_send/receive received from the guest will be converted into
+   a post_send/receive for the backend. The buffers data will not be touched
+   or copied resulting in near bare-metal performance for large enough buffers.
+ - Completions from the backend interface will result in completions for
+   the pvrdma device.
+
+
+4.2 PCI BARs
+============
+PCI Bars:
+	BAR 0 - MSI-X
+        MSI-X vectors:
+		(0) Command - used when execution of a command is completed.
+		(1) Async - not in use.
+		(2) Completion - used when a completion event is placed in
+		  device's CQ ring.
+	BAR 1 - Registers
+        --------------------------------------------------------
+        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
+        --------------------------------------------------------
+		DSR - Address of driver/device shared memory used
+              for the command channel, used for passing:
+			    - General info such as driver version
+			    - Address of 'command' and 'response'
+			    - Address of async ring
+			    - Address of device's CQ ring
+			    - Device capabilities
+		CTL - Device control operations (activate, reset etc)
+		IMR - Set interrupt mask
+		REQ - Command execution register
+		ERR - Operation status
+
+	BAR 2 - UAR
+        ---------------------------------------------------------
+        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
+        ---------------------------------------------------------
+		- Offset 0 used for QP operations (send and recv)
+		- Offset 4 used for CQ operations (arm and poll)
+
+
+4.3 Major flows
+===============
+
+4.3.1 Create CQ
+===============
+    - Guest driver
+        - Allocates pages for CQ ring
+        - Creates page directory (pdir) to hold CQ ring's pages
+        - Initializes CQ ring
+        - Initializes 'Create CQ' command object (cqe, pdir etc)
+        - Copies the command to 'command' address
+        - Writes 0 into REQ register
+    - Device
+        - Reads the request object from the 'command' address
+        - Allocates a CQ object and initializes the CQ ring based on pdir
+        - Creates the backend CQ
+        - Writes operation status to ERR register
+        - Posts command-interrupt to guest
+    - Guest driver
+        - Reads the HW response code from ERR register
+
+4.3.2 Create QP
+===============
+    - Guest driver
+        - Allocates pages for send and receive rings
+        - Creates page directory(pdir) to hold the ring's pages
+        - Initializes 'Create QP' command object (max_send_wr,
+          send_cq_handle, recv_cq_handle, pdir etc)
+        - Copies the object to 'command' address
+        - Writes 0 into REQ register
+    - Device
+        - Reads the request object from 'command' address
+        - Allocates the QP object and initializes
+            - Send and recv rings based on pdir
+            - Send and recv ring state
+        - Creates the backend QP
+        - Writes the operation status to ERR register
+        - Posts command-interrupt to guest
+    - Guest driver
+        - Reads the HW response code from ERR register
+
+4.3.3 Post receive
+==================
+    - Guest driver
+        - Initializes a wqe and places it on the recv ring
+        - Writes qpn|qp_recv_bit (31) to the QP offset in UAR
+    - Device
+        - Extracts qpn from UAR
+        - Walks through the ring and does the following for each wqe
+            - Prepares the backend CQE context to be used when
+              receiving completion from backend (wr_id, op_code, emu_cq_num)
+            - For each sge prepares backend sge
+            - Calls backend's post_recv
+
+4.3.4 Process backend events
+============================
+    - Done by a dedicated thread used to process backend events;
+      at initialization it is attached to the device and creates
+      the communication channel.
+    - Thread main loop:
+        - Polls for completions
+        - Extracts QEMU _cq_num, wr_id and op_code from context
+        - Writes CQE to CQ ring
+        - Writes CQ number to device CQ
+        - Sends completion-interrupt to guest
+        - Deallocates context
+        - Acks the event to backend
+
+
+
+5. Limitations
+==============
+- The device is obviously limited by the Guest Linux Driver's implementation
+  of the VMware device API features.
+- Memory registration mechanism requires mremap for every page in the buffer in order
+  to map it to a contiguous virtual address range. Since this is not the data path
+  it should not matter much. If the default max mr size is increased, be aware that
+  memory registration can take up to 0.5 seconds for 1GB of memory.
+- The device requires target page size to be the same as the host page size,
+  otherwise it will fail to init.
+- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
+  so it can't work with huge pages. The limitation will be addressed in the future,
+  however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough huge
+  pages available, QEMU will use them. QEMU will fail to init if the requirements
+  are not met.
+
+
+
+6. Performance
+==============
+By design the pvrdma device exits on each post-send/receive, so for small buffers
+the performance is affected; however for medium buffers it becomes close to
+bare metal, and from 1MB buffers and up it reaches bare metal performance.
+(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
+
+All the above assumes no memory registration is done on data path.
-- 
2.13.5


* [Qemu-devel] [PATCH V11 03/10] scripts/update-linux-headers: import pvrdma headers
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 01/10] mem: add share parameter to memory-backend-ram Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 02/10] docs: add pvrdma device documentation Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 04/10] include/standard-headers: add pvrdma related headers Marcel Apfelbaum
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

Modify the script to import the headers used by the pvrdma device.
Some of them are interfaces between the guest driver and the device;
import them under include/standard-headers/drivers/infiniband/... .

Remove the unused functions from pvrdma_verbs.h avoiding the
unnecessary import of several infiniband/networking/other headers.
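
The cp_portable() conversion also gains a rule for the kernel-internal uN
fixed-width types used by these headers. As an illustration, the rule
rewrites a driver-header field into the QEMU standard-headers form (the
input line below is an example, taken from the pvrdma event queue element):

   $ echo 'u32 type;   /* Event type. */' | sed -e 's/u\([0-9][0-9]*\)/uint\1_t/g'
   uint32_t type;   /* Event type. */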

Reviewed-by: Gal Hammer <ghammer@redhat.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 scripts/update-linux-headers.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 135a10d96a..be065704df 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -38,6 +38,7 @@ cp_portable() {
                                      -e 'linux/if_ether' \
                                      -e 'input-event-codes' \
                                      -e 'sys/' \
+                                     -e 'pvrdma_verbs' \
                                      > /dev/null
     then
         echo "Unexpected #include in input file $f".
@@ -46,6 +47,7 @@ cp_portable() {
 
     header=$(basename "$f");
     sed -e 's/__u\([0-9][0-9]*\)/uint\1_t/g' \
+        -e 's/u\([0-9][0-9]*\)/uint\1_t/g' \
         -e 's/__s\([0-9][0-9]*\)/int\1_t/g' \
         -e 's/__le\([0-9][0-9]*\)/uint\1_t/g' \
         -e 's/__be\([0-9][0-9]*\)/uint\1_t/g' \
@@ -56,6 +58,7 @@ cp_portable() {
         -e 's/__inline__/inline/' \
         -e '/sys\/ioctl.h/d' \
         -e 's/SW_MAX/SW_MAX_/' \
+        -e 's/atomic_t/int/' \
         "$f" > "$to/$header";
 }
 
@@ -147,6 +150,33 @@ for i in "$tmpdir"/include/linux/*virtio*.h "$tmpdir/include/linux/input.h" \
     cp_portable "$i" "$output/include/standard-headers/linux"
 done
 
+rm -rf "$output/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma"
+mkdir -p "$output/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma"
+
+# Remove the unused functions from pvrdma_verbs.h avoiding the unnecessary
+# import of several infiniband/networking/other headers
+tmp_pvrdma_verbs="$tmpdir/pvrdma_verbs.h"
+# Parse the entire file instead of single lines to match
+# function declarations expanding over multiple lines
+# and strip the declarations starting with pvrdma prefix.
+sed  -e '1h;2,$H;$!d;g'  -e 's/[^};]*pvrdma[^(| ]*([^)]*);//g' \
+    "$linux/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h" > \
+    "$tmp_pvrdma_verbs";
+
+for i in "$linux/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h" \
+         "$linux/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h" \
+         "$tmp_pvrdma_verbs"; do \
+    cp_portable "$i" \
+         "$output/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/"
+done
+
+rm -rf "$output/include/standard-headers/rdma/"
+mkdir -p "$output/include/standard-headers/rdma/"
+for i in "$tmpdir/include/rdma/vmw_pvrdma-abi.h"; do
+    cp_portable "$i" \
+         "$output/include/standard-headers/rdma/"
+done
+
 cat <<EOF >$output/include/standard-headers/linux/types.h
 /* For QEMU all types are already defined via osdep.h, so this
  * header does not need to do anything.
-- 
2.13.5


* [Qemu-devel] [PATCH V11 04/10] include/standard-headers: add pvrdma related headers
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (2 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 03/10] scripts/update-linux-headers: import pvrdma headers Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 05/10] hw/rdma: Add wrappers and macros Marcel Apfelbaum
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

Import the headers used by the pvrdma device.
Some of them are interfaces between the guest driver and the device,
imported under include/standard-headers/drivers/infiniband/... .

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 .../infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h      | 667 +++++++++++++++++++++
 .../drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h | 114 ++++
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h        | 383 ++++++++++++
 include/standard-headers/rdma/vmw_pvrdma-abi.h     | 293 +++++++++
 4 files changed, 1457 insertions(+)
 create mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
 create mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h
 create mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
 create mode 100644 include/standard-headers/rdma/vmw_pvrdma-abi.h

diff --git a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
new file mode 100644
index 0000000000..422eb3f4c1
--- /dev/null
+++ b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __PVRDMA_DEV_API_H__
+#define __PVRDMA_DEV_API_H__
+
+#include "standard-headers/linux/types.h"
+
+#include "pvrdma_verbs.h"
+
+/*
+ * PVRDMA version macros. Some new features require updates to PVRDMA_VERSION.
+ * These macros allow us to check for different features if necessary.
+ */
+
+#define PVRDMA_ROCEV1_VERSION		17
+#define PVRDMA_ROCEV2_VERSION		18
+#define PVRDMA_VERSION			PVRDMA_ROCEV2_VERSION
+
+#define PVRDMA_BOARD_ID			1
+#define PVRDMA_REV_ID			1
+
+/*
+ * Masks and accessors for page directory, which is a two-level lookup:
+ * page directory -> page table -> page. Only one directory for now, but we
+ * could expand that easily. 9 bits for tables, 9 bits for pages, gives one
+ * gigabyte for memory regions and so forth.
+ */
+
+#define PVRDMA_PDIR_SHIFT		18
+#define PVRDMA_PTABLE_SHIFT		9
+#define PVRDMA_PAGE_DIR_DIR(x)		(((x) >> PVRDMA_PDIR_SHIFT) & 0x1)
+#define PVRDMA_PAGE_DIR_TABLE(x)	(((x) >> PVRDMA_PTABLE_SHIFT) & 0x1ff)
+#define PVRDMA_PAGE_DIR_PAGE(x)		((x) & 0x1ff)
+#define PVRDMA_PAGE_DIR_MAX_PAGES	(1 * 512 * 512)
+#define PVRDMA_MAX_FAST_REG_PAGES	128
+
+/*
+ * Max MSI-X vectors.
+ */
+
+#define PVRDMA_MAX_INTERRUPTS	3
+
+/* Register offsets within PCI resource on BAR1. */
+#define PVRDMA_REG_VERSION	0x00	/* R: Version of device. */
+#define PVRDMA_REG_DSRLOW	0x04	/* W: Device shared region low PA. */
+#define PVRDMA_REG_DSRHIGH	0x08	/* W: Device shared region high PA. */
+#define PVRDMA_REG_CTL		0x0c	/* W: PVRDMA_DEVICE_CTL */
+#define PVRDMA_REG_REQUEST	0x10	/* W: Indicate device request. */
+#define PVRDMA_REG_ERR		0x14	/* R: Device error. */
+#define PVRDMA_REG_ICR		0x18	/* R: Interrupt cause. */
+#define PVRDMA_REG_IMR		0x1c	/* R/W: Interrupt mask. */
+#define PVRDMA_REG_MACL		0x20	/* R/W: MAC address low. */
+#define PVRDMA_REG_MACH		0x24	/* R/W: MAC address high. */
+
+/* Object flags. */
+#define PVRDMA_CQ_FLAG_ARMED_SOL	BIT(0)	/* Armed for solicited-only. */
+#define PVRDMA_CQ_FLAG_ARMED		BIT(1)	/* Armed. */
+#define PVRDMA_MR_FLAG_DMA		BIT(0)	/* DMA region. */
+#define PVRDMA_MR_FLAG_FRMR		BIT(1)	/* Fast reg memory region. */
+
+/*
+ * Atomic operation capability (masked versions are extended atomic
+ * operations.
+ */
+
+#define PVRDMA_ATOMIC_OP_COMP_SWAP	BIT(0)	/* Compare and swap. */
+#define PVRDMA_ATOMIC_OP_FETCH_ADD	BIT(1)	/* Fetch and add. */
+#define PVRDMA_ATOMIC_OP_MASK_COMP_SWAP	BIT(2)	/* Masked compare and swap. */
+#define PVRDMA_ATOMIC_OP_MASK_FETCH_ADD	BIT(3)	/* Masked fetch and add. */
+
+/*
+ * Base Memory Management Extension flags to support Fast Reg Memory Regions
+ * and Fast Reg Work Requests. Each flag represents a verb operation and we
+ * must support all of them to qualify for the BMME device cap.
+ */
+
+#define PVRDMA_BMME_FLAG_LOCAL_INV	BIT(0)	/* Local Invalidate. */
+#define PVRDMA_BMME_FLAG_REMOTE_INV	BIT(1)	/* Remote Invalidate. */
+#define PVRDMA_BMME_FLAG_FAST_REG_WR	BIT(2)	/* Fast Reg Work Request. */
+
+/*
+ * GID types. The interpretation of the gid_types bit field in the device
+ * capabilities will depend on the device mode. For now, the device only
+ * supports RoCE as mode, so only the different GID types for RoCE are
+ * defined.
+ */
+
+#define PVRDMA_GID_TYPE_FLAG_ROCE_V1	BIT(0)
+#define PVRDMA_GID_TYPE_FLAG_ROCE_V2	BIT(1)
+
+/*
+ * Version checks. This checks whether each version supports specific
+ * capabilities from the device.
+ */
+
+#define PVRDMA_IS_VERSION17(_dev)					\
+	(_dev->dsr_version == PVRDMA_ROCEV1_VERSION &&			\
+	 _dev->dsr->caps.gid_types == PVRDMA_GID_TYPE_FLAG_ROCE_V1)
+
+#define PVRDMA_IS_VERSION18(_dev)					\
+	(_dev->dsr_version >= PVRDMA_ROCEV2_VERSION &&			\
+	 (_dev->dsr->caps.gid_types == PVRDMA_GID_TYPE_FLAG_ROCE_V1 ||  \
+	  _dev->dsr->caps.gid_types == PVRDMA_GID_TYPE_FLAG_ROCE_V2))	\
+
+#define PVRDMA_SUPPORTED(_dev)						\
+	((_dev->dsr->caps.mode == PVRDMA_DEVICE_MODE_ROCE) &&		\
+	 (PVRDMA_IS_VERSION17(_dev) || PVRDMA_IS_VERSION18(_dev)))
+
+/*
+ * Get capability values based on device version.
+ */
+
+#define PVRDMA_GET_CAP(_dev, _old_val, _val) \
+	((PVRDMA_IS_VERSION18(_dev)) ? _val : _old_val)
+
+enum pvrdma_pci_resource {
+	PVRDMA_PCI_RESOURCE_MSIX,	/* BAR0: MSI-X, MMIO. */
+	PVRDMA_PCI_RESOURCE_REG,	/* BAR1: Registers, MMIO. */
+	PVRDMA_PCI_RESOURCE_UAR,	/* BAR2: UAR pages, MMIO, 64-bit. */
+	PVRDMA_PCI_RESOURCE_LAST,	/* Last. */
+};
+
+enum pvrdma_device_ctl {
+	PVRDMA_DEVICE_CTL_ACTIVATE,	/* Activate device. */
+	PVRDMA_DEVICE_CTL_UNQUIESCE,	/* Unquiesce device. */
+	PVRDMA_DEVICE_CTL_RESET,	/* Reset device. */
+};
+
+enum pvrdma_intr_vector {
+	PVRDMA_INTR_VECTOR_RESPONSE,	/* Command response. */
+	PVRDMA_INTR_VECTOR_ASYNC,	/* Async events. */
+	PVRDMA_INTR_VECTOR_CQ,		/* CQ notification. */
+	/* Additional CQ notification vectors. */
+};
+
+enum pvrdma_intr_cause {
+	PVRDMA_INTR_CAUSE_RESPONSE	= (1 << PVRDMA_INTR_VECTOR_RESPONSE),
+	PVRDMA_INTR_CAUSE_ASYNC		= (1 << PVRDMA_INTR_VECTOR_ASYNC),
+	PVRDMA_INTR_CAUSE_CQ		= (1 << PVRDMA_INTR_VECTOR_CQ),
+};
+
+enum pvrdma_gos_bits {
+	PVRDMA_GOS_BITS_UNK,		/* Unknown. */
+	PVRDMA_GOS_BITS_32,		/* 32-bit. */
+	PVRDMA_GOS_BITS_64,		/* 64-bit. */
+};
+
+enum pvrdma_gos_type {
+	PVRDMA_GOS_TYPE_UNK,		/* Unknown. */
+	PVRDMA_GOS_TYPE_LINUX,		/* Linux. */
+};
+
+enum pvrdma_device_mode {
+	PVRDMA_DEVICE_MODE_ROCE,	/* RoCE. */
+	PVRDMA_DEVICE_MODE_IWARP,	/* iWarp. */
+	PVRDMA_DEVICE_MODE_IB,		/* InfiniBand. */
+};
+
+struct pvrdma_gos_info {
+	uint32_t gos_bits:2;			/* W: PVRDMA_GOS_BITS_ */
+	uint32_t gos_type:4;			/* W: PVRDMA_GOS_TYPE_ */
+	uint32_t gos_ver:16;			/* W: Guest OS version. */
+	uint32_t gos_misc:10;		/* W: Other. */
+	uint32_t pad;			/* Pad to 8-byte alignment. */
+};
+
+struct pvrdma_device_caps {
+	uint64_t fw_ver;				/* R: Query device. */
+	uint64_t node_guid;
+	uint64_t sys_image_guid;
+	uint64_t max_mr_size;
+	uint64_t page_size_cap;
+	uint64_t atomic_arg_sizes;			/* EX verbs. */
+	uint32_t ex_comp_mask;			/* EX verbs. */
+	uint32_t device_cap_flags2;			/* EX verbs. */
+	uint32_t max_fa_bit_boundary;		/* EX verbs. */
+	uint32_t log_max_atomic_inline_arg;		/* EX verbs. */
+	uint32_t vendor_id;
+	uint32_t vendor_part_id;
+	uint32_t hw_ver;
+	uint32_t max_qp;
+	uint32_t max_qp_wr;
+	uint32_t device_cap_flags;
+	uint32_t max_sge;
+	uint32_t max_sge_rd;
+	uint32_t max_cq;
+	uint32_t max_cqe;
+	uint32_t max_mr;
+	uint32_t max_pd;
+	uint32_t max_qp_rd_atom;
+	uint32_t max_ee_rd_atom;
+	uint32_t max_res_rd_atom;
+	uint32_t max_qp_init_rd_atom;
+	uint32_t max_ee_init_rd_atom;
+	uint32_t max_ee;
+	uint32_t max_rdd;
+	uint32_t max_mw;
+	uint32_t max_raw_ipv6_qp;
+	uint32_t max_raw_ethy_qp;
+	uint32_t max_mcast_grp;
+	uint32_t max_mcast_qp_attach;
+	uint32_t max_total_mcast_qp_attach;
+	uint32_t max_ah;
+	uint32_t max_fmr;
+	uint32_t max_map_per_fmr;
+	uint32_t max_srq;
+	uint32_t max_srq_wr;
+	uint32_t max_srq_sge;
+	uint32_t max_uar;
+	uint32_t gid_tbl_len;
+	uint16_t max_pkeys;
+	uint8_t  local_ca_ack_delay;
+	uint8_t  phys_port_cnt;
+	uint8_t  mode;				/* PVRDMA_DEVICE_MODE_ */
+	uint8_t  atomic_ops;				/* PVRDMA_ATOMIC_OP_* bits */
+	uint8_t  bmme_flags;				/* FRWR Mem Mgmt Extensions */
+	uint8_t  gid_types;				/* PVRDMA_GID_TYPE_FLAG_ */
+	uint32_t max_fast_reg_page_list_len;
+};
+
+struct pvrdma_ring_page_info {
+	uint32_t num_pages;				/* Num pages incl. header. */
+	uint32_t reserved;				/* Reserved. */
+	uint64_t pdir_dma;				/* Page directory PA. */
+};
+
+#pragma pack(push, 1)
+
+struct pvrdma_device_shared_region {
+	uint32_t driver_version;			/* W: Driver version. */
+	uint32_t pad;				/* Pad to 8-byte align. */
+	struct pvrdma_gos_info gos_info;	/* W: Guest OS information. */
+	uint64_t cmd_slot_dma;			/* W: Command slot address. */
+	uint64_t resp_slot_dma;			/* W: Response slot address. */
+	struct pvrdma_ring_page_info async_ring_pages;
+						/* W: Async ring page info. */
+	struct pvrdma_ring_page_info cq_ring_pages;
+						/* W: CQ ring page info. */
+	uint32_t uar_pfn;				/* W: UAR pageframe. */
+	uint32_t pad2;				/* Pad to 8-byte align. */
+	struct pvrdma_device_caps caps;		/* R: Device capabilities. */
+};
+
+#pragma pack(pop)
+
+/* Event types. Currently a 1:1 mapping with enum ib_event. */
+enum pvrdma_eqe_type {
+	PVRDMA_EVENT_CQ_ERR,
+	PVRDMA_EVENT_QP_FATAL,
+	PVRDMA_EVENT_QP_REQ_ERR,
+	PVRDMA_EVENT_QP_ACCESS_ERR,
+	PVRDMA_EVENT_COMM_EST,
+	PVRDMA_EVENT_SQ_DRAINED,
+	PVRDMA_EVENT_PATH_MIG,
+	PVRDMA_EVENT_PATH_MIG_ERR,
+	PVRDMA_EVENT_DEVICE_FATAL,
+	PVRDMA_EVENT_PORT_ACTIVE,
+	PVRDMA_EVENT_PORT_ERR,
+	PVRDMA_EVENT_LID_CHANGE,
+	PVRDMA_EVENT_PKEY_CHANGE,
+	PVRDMA_EVENT_SM_CHANGE,
+	PVRDMA_EVENT_SRQ_ERR,
+	PVRDMA_EVENT_SRQ_LIMIT_REACHED,
+	PVRDMA_EVENT_QP_LAST_WQE_REACHED,
+	PVRDMA_EVENT_CLIENT_REREGISTER,
+	PVRDMA_EVENT_GID_CHANGE,
+};
+
+/* Event queue element. */
+struct pvrdma_eqe {
+	uint32_t type;	/* Event type. */
+	uint32_t info;	/* Handle, other. */
+};
+
+/* CQ notification queue element. */
+struct pvrdma_cqne {
+	uint32_t info;	/* Handle */
+};
+
+enum {
+	PVRDMA_CMD_FIRST,
+	PVRDMA_CMD_QUERY_PORT = PVRDMA_CMD_FIRST,
+	PVRDMA_CMD_QUERY_PKEY,
+	PVRDMA_CMD_CREATE_PD,
+	PVRDMA_CMD_DESTROY_PD,
+	PVRDMA_CMD_CREATE_MR,
+	PVRDMA_CMD_DESTROY_MR,
+	PVRDMA_CMD_CREATE_CQ,
+	PVRDMA_CMD_RESIZE_CQ,
+	PVRDMA_CMD_DESTROY_CQ,
+	PVRDMA_CMD_CREATE_QP,
+	PVRDMA_CMD_MODIFY_QP,
+	PVRDMA_CMD_QUERY_QP,
+	PVRDMA_CMD_DESTROY_QP,
+	PVRDMA_CMD_CREATE_UC,
+	PVRDMA_CMD_DESTROY_UC,
+	PVRDMA_CMD_CREATE_BIND,
+	PVRDMA_CMD_DESTROY_BIND,
+	PVRDMA_CMD_CREATE_SRQ,
+	PVRDMA_CMD_MODIFY_SRQ,
+	PVRDMA_CMD_QUERY_SRQ,
+	PVRDMA_CMD_DESTROY_SRQ,
+	PVRDMA_CMD_MAX,
+};
+
+enum {
+	PVRDMA_CMD_FIRST_RESP = (1 << 31),
+	PVRDMA_CMD_QUERY_PORT_RESP = PVRDMA_CMD_FIRST_RESP,
+	PVRDMA_CMD_QUERY_PKEY_RESP,
+	PVRDMA_CMD_CREATE_PD_RESP,
+	PVRDMA_CMD_DESTROY_PD_RESP_NOOP,
+	PVRDMA_CMD_CREATE_MR_RESP,
+	PVRDMA_CMD_DESTROY_MR_RESP_NOOP,
+	PVRDMA_CMD_CREATE_CQ_RESP,
+	PVRDMA_CMD_RESIZE_CQ_RESP,
+	PVRDMA_CMD_DESTROY_CQ_RESP_NOOP,
+	PVRDMA_CMD_CREATE_QP_RESP,
+	PVRDMA_CMD_MODIFY_QP_RESP,
+	PVRDMA_CMD_QUERY_QP_RESP,
+	PVRDMA_CMD_DESTROY_QP_RESP,
+	PVRDMA_CMD_CREATE_UC_RESP,
+	PVRDMA_CMD_DESTROY_UC_RESP_NOOP,
+	PVRDMA_CMD_CREATE_BIND_RESP_NOOP,
+	PVRDMA_CMD_DESTROY_BIND_RESP_NOOP,
+	PVRDMA_CMD_CREATE_SRQ_RESP,
+	PVRDMA_CMD_MODIFY_SRQ_RESP,
+	PVRDMA_CMD_QUERY_SRQ_RESP,
+	PVRDMA_CMD_DESTROY_SRQ_RESP,
+	PVRDMA_CMD_MAX_RESP,
+};
+
+struct pvrdma_cmd_hdr {
+	uint64_t response;		/* Key for response lookup. */
+	uint32_t cmd;		/* PVRDMA_CMD_ */
+	uint32_t reserved;		/* Reserved. */
+};
+
+struct pvrdma_cmd_resp_hdr {
+	uint64_t response;		/* From cmd hdr. */
+	uint32_t ack;		/* PVRDMA_CMD_XXX_RESP */
+	uint8_t err;			/* Error. */
+	uint8_t reserved[3];		/* Reserved. */
+};
+
+struct pvrdma_cmd_query_port {
+	struct pvrdma_cmd_hdr hdr;
+	uint8_t port_num;
+	uint8_t reserved[7];
+};
+
+struct pvrdma_cmd_query_port_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	struct pvrdma_port_attr attrs;
+};
+
+struct pvrdma_cmd_query_pkey {
+	struct pvrdma_cmd_hdr hdr;
+	uint8_t port_num;
+	uint8_t index;
+	uint8_t reserved[6];
+};
+
+struct pvrdma_cmd_query_pkey_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint16_t pkey;
+	uint8_t reserved[6];
+};
+
+struct pvrdma_cmd_create_uc {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t pfn; /* UAR page frame number */
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_uc_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t ctx_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_destroy_uc {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t ctx_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_pd {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t ctx_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_pd_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t pd_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_destroy_pd {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t pd_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_mr {
+	struct pvrdma_cmd_hdr hdr;
+	uint64_t start;
+	uint64_t length;
+	uint64_t pdir_dma;
+	uint32_t pd_handle;
+	uint32_t access_flags;
+	uint32_t flags;
+	uint32_t nchunks;
+};
+
+struct pvrdma_cmd_create_mr_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t mr_handle;
+	uint32_t lkey;
+	uint32_t rkey;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_destroy_mr {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t mr_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_cq {
+	struct pvrdma_cmd_hdr hdr;
+	uint64_t pdir_dma;
+	uint32_t ctx_handle;
+	uint32_t cqe;
+	uint32_t nchunks;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_cq_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t cq_handle;
+	uint32_t cqe;
+};
+
+struct pvrdma_cmd_resize_cq {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t cq_handle;
+	uint32_t cqe;
+};
+
+struct pvrdma_cmd_resize_cq_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t cqe;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_destroy_cq {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t cq_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_srq {
+	struct pvrdma_cmd_hdr hdr;
+	uint64_t pdir_dma;
+	uint32_t pd_handle;
+	uint32_t nchunks;
+	struct pvrdma_srq_attr attrs;
+	uint8_t srq_type;
+	uint8_t reserved[7];
+};
+
+struct pvrdma_cmd_create_srq_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t srqn;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_modify_srq {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t srq_handle;
+	uint32_t attr_mask;
+	struct pvrdma_srq_attr attrs;
+};
+
+struct pvrdma_cmd_query_srq {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t srq_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_query_srq_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	struct pvrdma_srq_attr attrs;
+};
+
+struct pvrdma_cmd_destroy_srq {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t srq_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_qp {
+	struct pvrdma_cmd_hdr hdr;
+	uint64_t pdir_dma;
+	uint32_t pd_handle;
+	uint32_t send_cq_handle;
+	uint32_t recv_cq_handle;
+	uint32_t srq_handle;
+	uint32_t max_send_wr;
+	uint32_t max_recv_wr;
+	uint32_t max_send_sge;
+	uint32_t max_recv_sge;
+	uint32_t max_inline_data;
+	uint32_t lkey;
+	uint32_t access_flags;
+	uint16_t total_chunks;
+	uint16_t send_chunks;
+	uint16_t max_atomic_arg;
+	uint8_t sq_sig_all;
+	uint8_t qp_type;
+	uint8_t is_srq;
+	uint8_t reserved[3];
+};
+
+struct pvrdma_cmd_create_qp_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t qpn;
+	uint32_t max_send_wr;
+	uint32_t max_recv_wr;
+	uint32_t max_send_sge;
+	uint32_t max_recv_sge;
+	uint32_t max_inline_data;
+};
+
+struct pvrdma_cmd_modify_qp {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t qp_handle;
+	uint32_t attr_mask;
+	struct pvrdma_qp_attr attrs;
+};
+
+struct pvrdma_cmd_query_qp {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t qp_handle;
+	uint32_t attr_mask;
+};
+
+struct pvrdma_cmd_query_qp_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	struct pvrdma_qp_attr attrs;
+};
+
+struct pvrdma_cmd_destroy_qp {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t qp_handle;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_destroy_qp_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	uint32_t events_reported;
+	uint8_t reserved[4];
+};
+
+struct pvrdma_cmd_create_bind {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t mtu;
+	uint32_t vlan;
+	uint32_t index;
+	uint8_t new_gid[16];
+	uint8_t gid_type;
+	uint8_t reserved[3];
+};
+
+struct pvrdma_cmd_destroy_bind {
+	struct pvrdma_cmd_hdr hdr;
+	uint32_t index;
+	uint8_t dest_gid[16];
+	uint8_t reserved[4];
+};
+
+union pvrdma_cmd_req {
+	struct pvrdma_cmd_hdr hdr;
+	struct pvrdma_cmd_query_port query_port;
+	struct pvrdma_cmd_query_pkey query_pkey;
+	struct pvrdma_cmd_create_uc create_uc;
+	struct pvrdma_cmd_destroy_uc destroy_uc;
+	struct pvrdma_cmd_create_pd create_pd;
+	struct pvrdma_cmd_destroy_pd destroy_pd;
+	struct pvrdma_cmd_create_mr create_mr;
+	struct pvrdma_cmd_destroy_mr destroy_mr;
+	struct pvrdma_cmd_create_cq create_cq;
+	struct pvrdma_cmd_resize_cq resize_cq;
+	struct pvrdma_cmd_destroy_cq destroy_cq;
+	struct pvrdma_cmd_create_qp create_qp;
+	struct pvrdma_cmd_modify_qp modify_qp;
+	struct pvrdma_cmd_query_qp query_qp;
+	struct pvrdma_cmd_destroy_qp destroy_qp;
+	struct pvrdma_cmd_create_bind create_bind;
+	struct pvrdma_cmd_destroy_bind destroy_bind;
+	struct pvrdma_cmd_create_srq create_srq;
+	struct pvrdma_cmd_modify_srq modify_srq;
+	struct pvrdma_cmd_query_srq query_srq;
+	struct pvrdma_cmd_destroy_srq destroy_srq;
+};
+
+union pvrdma_cmd_resp {
+	struct pvrdma_cmd_resp_hdr hdr;
+	struct pvrdma_cmd_query_port_resp query_port_resp;
+	struct pvrdma_cmd_query_pkey_resp query_pkey_resp;
+	struct pvrdma_cmd_create_uc_resp create_uc_resp;
+	struct pvrdma_cmd_create_pd_resp create_pd_resp;
+	struct pvrdma_cmd_create_mr_resp create_mr_resp;
+	struct pvrdma_cmd_create_cq_resp create_cq_resp;
+	struct pvrdma_cmd_resize_cq_resp resize_cq_resp;
+	struct pvrdma_cmd_create_qp_resp create_qp_resp;
+	struct pvrdma_cmd_query_qp_resp query_qp_resp;
+	struct pvrdma_cmd_destroy_qp_resp destroy_qp_resp;
+	struct pvrdma_cmd_create_srq_resp create_srq_resp;
+	struct pvrdma_cmd_query_srq_resp query_srq_resp;
+};
+
+#endif /* __PVRDMA_DEV_API_H__ */
diff --git a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h
new file mode 100644
index 0000000000..acd4c8346d
--- /dev/null
+++ b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h
@@ -0,0 +1,114 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __PVRDMA_RING_H__
+#define __PVRDMA_RING_H__
+
+#include "standard-headers/linux/types.h"
+
+#define PVRDMA_INVALID_IDX	-1	/* Invalid index. */
+
+struct pvrdma_ring {
+	int prod_tail;	/* Producer tail. */
+	int cons_head;	/* Consumer head. */
+};
+
+struct pvrdma_ring_state {
+	struct pvrdma_ring tx;	/* Tx ring. */
+	struct pvrdma_ring rx;	/* Rx ring. */
+};
+
+static inline int pvrdma_idx_valid(uint32_t idx, uint32_t max_elems)
+{
+	/* Generates fewer instructions than a less-than. */
+	return (idx & ~((max_elems << 1) - 1)) == 0;
+}
+
+static inline int32_t pvrdma_idx(int *var, uint32_t max_elems)
+{
+	const unsigned int idx = atomic_read(var);
+
+	if (pvrdma_idx_valid(idx, max_elems))
+		return idx & (max_elems - 1);
+	return PVRDMA_INVALID_IDX;
+}
+
+static inline void pvrdma_idx_ring_inc(int *var, uint32_t max_elems)
+{
+	uint32_t idx = atomic_read(var) + 1;	/* Increment. */
+
+	idx &= (max_elems << 1) - 1;		/* Modulo size, flip gen. */
+	atomic_set(var, idx);
+}
+
+static inline int32_t pvrdma_idx_ring_has_space(const struct pvrdma_ring *r,
+					      uint32_t max_elems, uint32_t *out_tail)
+{
+	const uint32_t tail = atomic_read(&r->prod_tail);
+	const uint32_t head = atomic_read(&r->cons_head);
+
+	if (pvrdma_idx_valid(tail, max_elems) &&
+	    pvrdma_idx_valid(head, max_elems)) {
+		*out_tail = tail & (max_elems - 1);
+		return tail != (head ^ max_elems);
+	}
+	return PVRDMA_INVALID_IDX;
+}
+
+static inline int32_t pvrdma_idx_ring_has_data(const struct pvrdma_ring *r,
+					     uint32_t max_elems, uint32_t *out_head)
+{
+	const uint32_t tail = atomic_read(&r->prod_tail);
+	const uint32_t head = atomic_read(&r->cons_head);
+
+	if (pvrdma_idx_valid(tail, max_elems) &&
+	    pvrdma_idx_valid(head, max_elems)) {
+		*out_head = head & (max_elems - 1);
+		return tail != head;
+	}
+	return PVRDMA_INVALID_IDX;
+}
+
+#endif /* __PVRDMA_RING_H__ */
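
The index helpers above keep prod_tail and cons_head running modulo
2 * max_elems, so the extra high bit acts as a wrap (generation) flag:
the ring is empty when tail == head and full when tail == head ^ max_elems.
A minimal standalone sketch of the same arithmetic, assuming max_elems is
a power of two and ignoring the atomics:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_ELEMS 4u                    /* must be a power of two */

    static uint32_t ring_inc(uint32_t idx)
    {
        /* Same arithmetic as pvrdma_idx_ring_inc(): modulo 2 * MAX_ELEMS. */
        return (idx + 1) & ((MAX_ELEMS << 1) - 1);
    }

    int main(void)
    {
        uint32_t head = 0, tail = 0;        /* consumer head, producer tail */

        for (unsigned i = 0; i < MAX_ELEMS; i++) {
            /* Same test as pvrdma_idx_ring_has_space(). */
            printf("slot %u free: %s\n", tail & (MAX_ELEMS - 1),
                   tail != (head ^ MAX_ELEMS) ? "yes" : "no");
            tail = ring_inc(tail);
        }
        /* tail is now head ^ MAX_ELEMS: the ring is full but has data. */
        printf("full: %d, has data: %d\n",
               tail == (head ^ MAX_ELEMS), tail != head);
        return 0;
    }
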
diff --git a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
new file mode 100644
index 0000000000..1677208a41
--- /dev/null
+++ b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
@@ -0,0 +1,383 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __PVRDMA_VERBS_H__
+#define __PVRDMA_VERBS_H__
+
+#include "standard-headers/linux/types.h"
+
+union pvrdma_gid {
+	uint8_t	raw[16];
+	struct {
+		uint64_t	subnet_prefix;
+		uint64_t	interface_id;
+	} global;
+};
+
+enum pvrdma_link_layer {
+	PVRDMA_LINK_LAYER_UNSPECIFIED,
+	PVRDMA_LINK_LAYER_INFINIBAND,
+	PVRDMA_LINK_LAYER_ETHERNET,
+};
+
+enum pvrdma_mtu {
+	PVRDMA_MTU_256  = 1,
+	PVRDMA_MTU_512  = 2,
+	PVRDMA_MTU_1024 = 3,
+	PVRDMA_MTU_2048 = 4,
+	PVRDMA_MTU_4096 = 5,
+};
+
+static inline int pvrdma_mtu_enum_to_int(enum pvrdma_mtu mtu)
+{
+	switch (mtu) {
+	case PVRDMA_MTU_256:	return  256;
+	case PVRDMA_MTU_512:	return  512;
+	case PVRDMA_MTU_1024:	return 1024;
+	case PVRDMA_MTU_2048:	return 2048;
+	case PVRDMA_MTU_4096:	return 4096;
+	default:		return   -1;
+	}
+}
+
+static inline enum pvrdma_mtu pvrdma_mtu_int_to_enum(int mtu)
+{
+	switch (mtu) {
+	case 256:	return PVRDMA_MTU_256;
+	case 512:	return PVRDMA_MTU_512;
+	case 1024:	return PVRDMA_MTU_1024;
+	case 2048:	return PVRDMA_MTU_2048;
+	case 4096:
+	default:	return PVRDMA_MTU_4096;
+	}
+}
+
+enum pvrdma_port_state {
+	PVRDMA_PORT_NOP			= 0,
+	PVRDMA_PORT_DOWN		= 1,
+	PVRDMA_PORT_INIT		= 2,
+	PVRDMA_PORT_ARMED		= 3,
+	PVRDMA_PORT_ACTIVE		= 4,
+	PVRDMA_PORT_ACTIVE_DEFER	= 5,
+};
+
+enum pvrdma_port_cap_flags {
+	PVRDMA_PORT_SM				= 1 <<  1,
+	PVRDMA_PORT_NOTICE_SUP			= 1 <<  2,
+	PVRDMA_PORT_TRAP_SUP			= 1 <<  3,
+	PVRDMA_PORT_OPT_IPD_SUP			= 1 <<  4,
+	PVRDMA_PORT_AUTO_MIGR_SUP		= 1 <<  5,
+	PVRDMA_PORT_SL_MAP_SUP			= 1 <<  6,
+	PVRDMA_PORT_MKEY_NVRAM			= 1 <<  7,
+	PVRDMA_PORT_PKEY_NVRAM			= 1 <<  8,
+	PVRDMA_PORT_LED_INFO_SUP		= 1 <<  9,
+	PVRDMA_PORT_SM_DISABLED			= 1 << 10,
+	PVRDMA_PORT_SYS_IMAGE_GUID_SUP		= 1 << 11,
+	PVRDMA_PORT_PKEY_SW_EXT_PORT_TRAP_SUP	= 1 << 12,
+	PVRDMA_PORT_EXTENDED_SPEEDS_SUP		= 1 << 14,
+	PVRDMA_PORT_CM_SUP			= 1 << 16,
+	PVRDMA_PORT_SNMP_TUNNEL_SUP		= 1 << 17,
+	PVRDMA_PORT_REINIT_SUP			= 1 << 18,
+	PVRDMA_PORT_DEVICE_MGMT_SUP		= 1 << 19,
+	PVRDMA_PORT_VENDOR_CLASS_SUP		= 1 << 20,
+	PVRDMA_PORT_DR_NOTICE_SUP		= 1 << 21,
+	PVRDMA_PORT_CAP_MASK_NOTICE_SUP		= 1 << 22,
+	PVRDMA_PORT_BOOT_MGMT_SUP		= 1 << 23,
+	PVRDMA_PORT_LINK_LATENCY_SUP		= 1 << 24,
+	PVRDMA_PORT_CLIENT_REG_SUP		= 1 << 25,
+	PVRDMA_PORT_IP_BASED_GIDS		= 1 << 26,
+	PVRDMA_PORT_CAP_FLAGS_MAX		= PVRDMA_PORT_IP_BASED_GIDS,
+};
+
+enum pvrdma_port_width {
+	PVRDMA_WIDTH_1X		= 1,
+	PVRDMA_WIDTH_4X		= 2,
+	PVRDMA_WIDTH_8X		= 4,
+	PVRDMA_WIDTH_12X	= 8,
+};
+
+static inline int pvrdma_width_enum_to_int(enum pvrdma_port_width width)
+{
+	switch (width) {
+	case PVRDMA_WIDTH_1X:	return  1;
+	case PVRDMA_WIDTH_4X:	return  4;
+	case PVRDMA_WIDTH_8X:	return  8;
+	case PVRDMA_WIDTH_12X:	return 12;
+	default:		return -1;
+	}
+}
+
+enum pvrdma_port_speed {
+	PVRDMA_SPEED_SDR	= 1,
+	PVRDMA_SPEED_DDR	= 2,
+	PVRDMA_SPEED_QDR	= 4,
+	PVRDMA_SPEED_FDR10	= 8,
+	PVRDMA_SPEED_FDR	= 16,
+	PVRDMA_SPEED_EDR	= 32,
+};
+
+struct pvrdma_port_attr {
+	enum pvrdma_port_state	state;
+	enum pvrdma_mtu		max_mtu;
+	enum pvrdma_mtu		active_mtu;
+	uint32_t			gid_tbl_len;
+	uint32_t			port_cap_flags;
+	uint32_t			max_msg_sz;
+	uint32_t			bad_pkey_cntr;
+	uint32_t			qkey_viol_cntr;
+	uint16_t			pkey_tbl_len;
+	uint16_t			lid;
+	uint16_t			sm_lid;
+	uint8_t			lmc;
+	uint8_t			max_vl_num;
+	uint8_t			sm_sl;
+	uint8_t			subnet_timeout;
+	uint8_t			init_type_reply;
+	uint8_t			active_width;
+	uint8_t			active_speed;
+	uint8_t			phys_state;
+	uint8_t			reserved[2];
+};
+
+struct pvrdma_global_route {
+	union pvrdma_gid	dgid;
+	uint32_t			flow_label;
+	uint8_t			sgid_index;
+	uint8_t			hop_limit;
+	uint8_t			traffic_class;
+	uint8_t			reserved;
+};
+
+struct pvrdma_grh {
+	uint32_t			version_tclass_flow;
+	uint16_t			paylen;
+	uint8_t			next_hdr;
+	uint8_t			hop_limit;
+	union pvrdma_gid	sgid;
+	union pvrdma_gid	dgid;
+};
+
+enum pvrdma_ah_flags {
+	PVRDMA_AH_GRH = 1,
+};
+
+enum pvrdma_rate {
+	PVRDMA_RATE_PORT_CURRENT	= 0,
+	PVRDMA_RATE_2_5_GBPS		= 2,
+	PVRDMA_RATE_5_GBPS		= 5,
+	PVRDMA_RATE_10_GBPS		= 3,
+	PVRDMA_RATE_20_GBPS		= 6,
+	PVRDMA_RATE_30_GBPS		= 4,
+	PVRDMA_RATE_40_GBPS		= 7,
+	PVRDMA_RATE_60_GBPS		= 8,
+	PVRDMA_RATE_80_GBPS		= 9,
+	PVRDMA_RATE_120_GBPS		= 10,
+	PVRDMA_RATE_14_GBPS		= 11,
+	PVRDMA_RATE_56_GBPS		= 12,
+	PVRDMA_RATE_112_GBPS		= 13,
+	PVRDMA_RATE_168_GBPS		= 14,
+	PVRDMA_RATE_25_GBPS		= 15,
+	PVRDMA_RATE_100_GBPS		= 16,
+	PVRDMA_RATE_200_GBPS		= 17,
+	PVRDMA_RATE_300_GBPS		= 18,
+};
+
+struct pvrdma_ah_attr {
+	struct pvrdma_global_route	grh;
+	uint16_t				dlid;
+	uint16_t				vlan_id;
+	uint8_t				sl;
+	uint8_t				src_path_bits;
+	uint8_t				static_rate;
+	uint8_t				ah_flags;
+	uint8_t				port_num;
+	uint8_t				dmac[6];
+	uint8_t				reserved;
+};
+
+enum pvrdma_cq_notify_flags {
+	PVRDMA_CQ_SOLICITED		= 1 << 0,
+	PVRDMA_CQ_NEXT_COMP		= 1 << 1,
+	PVRDMA_CQ_SOLICITED_MASK	= PVRDMA_CQ_SOLICITED |
+					  PVRDMA_CQ_NEXT_COMP,
+	PVRDMA_CQ_REPORT_MISSED_EVENTS	= 1 << 2,
+};
+
+struct pvrdma_qp_cap {
+	uint32_t	max_send_wr;
+	uint32_t	max_recv_wr;
+	uint32_t	max_send_sge;
+	uint32_t	max_recv_sge;
+	uint32_t	max_inline_data;
+	uint32_t	reserved;
+};
+
+enum pvrdma_sig_type {
+	PVRDMA_SIGNAL_ALL_WR,
+	PVRDMA_SIGNAL_REQ_WR,
+};
+
+enum pvrdma_qp_type {
+	PVRDMA_QPT_SMI,
+	PVRDMA_QPT_GSI,
+	PVRDMA_QPT_RC,
+	PVRDMA_QPT_UC,
+	PVRDMA_QPT_UD,
+	PVRDMA_QPT_RAW_IPV6,
+	PVRDMA_QPT_RAW_ETHERTYPE,
+	PVRDMA_QPT_RAW_PACKET = 8,
+	PVRDMA_QPT_XRC_INI = 9,
+	PVRDMA_QPT_XRC_TGT,
+	PVRDMA_QPT_MAX,
+};
+
+enum pvrdma_qp_create_flags {
+	PVRDMA_QP_CREATE_IPOPVRDMA_UD_LSO		= 1 << 0,
+	PVRDMA_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1,
+};
+
+enum pvrdma_qp_attr_mask {
+	PVRDMA_QP_STATE			= 1 << 0,
+	PVRDMA_QP_CUR_STATE		= 1 << 1,
+	PVRDMA_QP_EN_SQD_ASYNC_NOTIFY	= 1 << 2,
+	PVRDMA_QP_ACCESS_FLAGS		= 1 << 3,
+	PVRDMA_QP_PKEY_INDEX		= 1 << 4,
+	PVRDMA_QP_PORT			= 1 << 5,
+	PVRDMA_QP_QKEY			= 1 << 6,
+	PVRDMA_QP_AV			= 1 << 7,
+	PVRDMA_QP_PATH_MTU		= 1 << 8,
+	PVRDMA_QP_TIMEOUT		= 1 << 9,
+	PVRDMA_QP_RETRY_CNT		= 1 << 10,
+	PVRDMA_QP_RNR_RETRY		= 1 << 11,
+	PVRDMA_QP_RQ_PSN		= 1 << 12,
+	PVRDMA_QP_MAX_QP_RD_ATOMIC	= 1 << 13,
+	PVRDMA_QP_ALT_PATH		= 1 << 14,
+	PVRDMA_QP_MIN_RNR_TIMER		= 1 << 15,
+	PVRDMA_QP_SQ_PSN		= 1 << 16,
+	PVRDMA_QP_MAX_DEST_RD_ATOMIC	= 1 << 17,
+	PVRDMA_QP_PATH_MIG_STATE	= 1 << 18,
+	PVRDMA_QP_CAP			= 1 << 19,
+	PVRDMA_QP_DEST_QPN		= 1 << 20,
+	PVRDMA_QP_ATTR_MASK_MAX		= PVRDMA_QP_DEST_QPN,
+};
+
+enum pvrdma_qp_state {
+	PVRDMA_QPS_RESET,
+	PVRDMA_QPS_INIT,
+	PVRDMA_QPS_RTR,
+	PVRDMA_QPS_RTS,
+	PVRDMA_QPS_SQD,
+	PVRDMA_QPS_SQE,
+	PVRDMA_QPS_ERR,
+};
+
+enum pvrdma_mig_state {
+	PVRDMA_MIG_MIGRATED,
+	PVRDMA_MIG_REARM,
+	PVRDMA_MIG_ARMED,
+};
+
+enum pvrdma_mw_type {
+	PVRDMA_MW_TYPE_1 = 1,
+	PVRDMA_MW_TYPE_2 = 2,
+};
+
+struct pvrdma_srq_attr {
+	uint32_t			max_wr;
+	uint32_t			max_sge;
+	uint32_t			srq_limit;
+	uint32_t			reserved;
+};
+
+struct pvrdma_qp_attr {
+	enum pvrdma_qp_state	qp_state;
+	enum pvrdma_qp_state	cur_qp_state;
+	enum pvrdma_mtu		path_mtu;
+	enum pvrdma_mig_state	path_mig_state;
+	uint32_t			qkey;
+	uint32_t			rq_psn;
+	uint32_t			sq_psn;
+	uint32_t			dest_qp_num;
+	uint32_t			qp_access_flags;
+	uint16_t			pkey_index;
+	uint16_t			alt_pkey_index;
+	uint8_t			en_sqd_async_notify;
+	uint8_t			sq_draining;
+	uint8_t			max_rd_atomic;
+	uint8_t			max_dest_rd_atomic;
+	uint8_t			min_rnr_timer;
+	uint8_t			port_num;
+	uint8_t			timeout;
+	uint8_t			retry_cnt;
+	uint8_t			rnr_retry;
+	uint8_t			alt_port_num;
+	uint8_t			alt_timeout;
+	uint8_t			reserved[5];
+	struct pvrdma_qp_cap	cap;
+	struct pvrdma_ah_attr	ah_attr;
+	struct pvrdma_ah_attr	alt_ah_attr;
+};
+
+enum pvrdma_send_flags {
+	PVRDMA_SEND_FENCE	= 1 << 0,
+	PVRDMA_SEND_SIGNALED	= 1 << 1,
+	PVRDMA_SEND_SOLICITED	= 1 << 2,
+	PVRDMA_SEND_INLINE	= 1 << 3,
+	PVRDMA_SEND_IP_CSUM	= 1 << 4,
+	PVRDMA_SEND_FLAGS_MAX	= PVRDMA_SEND_IP_CSUM,
+};
+
+enum pvrdma_access_flags {
+	PVRDMA_ACCESS_LOCAL_WRITE	= 1 << 0,
+	PVRDMA_ACCESS_REMOTE_WRITE	= 1 << 1,
+	PVRDMA_ACCESS_REMOTE_READ	= 1 << 2,
+	PVRDMA_ACCESS_REMOTE_ATOMIC	= 1 << 3,
+	PVRDMA_ACCESS_MW_BIND		= 1 << 4,
+	PVRDMA_ZERO_BASED		= 1 << 5,
+	PVRDMA_ACCESS_ON_DEMAND		= 1 << 6,
+	PVRDMA_ACCESS_FLAGS_MAX		= PVRDMA_ACCESS_ON_DEMAND,
+};
+
+#endif /* __PVRDMA_VERBS_H__ */
diff --git a/include/standard-headers/rdma/vmw_pvrdma-abi.h b/include/standard-headers/rdma/vmw_pvrdma-abi.h
new file mode 100644
index 0000000000..0d0f7a8aca
--- /dev/null
+++ b/include/standard-headers/rdma/vmw_pvrdma-abi.h
@@ -0,0 +1,293 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __VMW_PVRDMA_ABI_H__
+#define __VMW_PVRDMA_ABI_H__
+
+#include "standard-headers/linux/types.h"
+
+#define PVRDMA_UVERBS_ABI_VERSION	3		/* ABI Version. */
+#define PVRDMA_UAR_HANDLE_MASK		0x00FFFFFF	/* Bottom 24 bits. */
+#define PVRDMA_UAR_QP_OFFSET		0		/* QP doorbell. */
+#define PVRDMA_UAR_QP_SEND		BIT(30)		/* Send bit. */
+#define PVRDMA_UAR_QP_RECV		BIT(31)		/* Recv bit. */
+#define PVRDMA_UAR_CQ_OFFSET		4		/* CQ doorbell. */
+#define PVRDMA_UAR_CQ_ARM_SOL		BIT(29)		/* Arm solicited bit. */
+#define PVRDMA_UAR_CQ_ARM		BIT(30)		/* Arm bit. */
+#define PVRDMA_UAR_CQ_POLL		BIT(31)		/* Poll bit. */
+
+enum pvrdma_wr_opcode {
+	PVRDMA_WR_RDMA_WRITE,
+	PVRDMA_WR_RDMA_WRITE_WITH_IMM,
+	PVRDMA_WR_SEND,
+	PVRDMA_WR_SEND_WITH_IMM,
+	PVRDMA_WR_RDMA_READ,
+	PVRDMA_WR_ATOMIC_CMP_AND_SWP,
+	PVRDMA_WR_ATOMIC_FETCH_AND_ADD,
+	PVRDMA_WR_LSO,
+	PVRDMA_WR_SEND_WITH_INV,
+	PVRDMA_WR_RDMA_READ_WITH_INV,
+	PVRDMA_WR_LOCAL_INV,
+	PVRDMA_WR_FAST_REG_MR,
+	PVRDMA_WR_MASKED_ATOMIC_CMP_AND_SWP,
+	PVRDMA_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+	PVRDMA_WR_BIND_MW,
+	PVRDMA_WR_REG_SIG_MR,
+};
+
+enum pvrdma_wc_status {
+	PVRDMA_WC_SUCCESS,
+	PVRDMA_WC_LOC_LEN_ERR,
+	PVRDMA_WC_LOC_QP_OP_ERR,
+	PVRDMA_WC_LOC_EEC_OP_ERR,
+	PVRDMA_WC_LOC_PROT_ERR,
+	PVRDMA_WC_WR_FLUSH_ERR,
+	PVRDMA_WC_MW_BIND_ERR,
+	PVRDMA_WC_BAD_RESP_ERR,
+	PVRDMA_WC_LOC_ACCESS_ERR,
+	PVRDMA_WC_REM_INV_REQ_ERR,
+	PVRDMA_WC_REM_ACCESS_ERR,
+	PVRDMA_WC_REM_OP_ERR,
+	PVRDMA_WC_RETRY_EXC_ERR,
+	PVRDMA_WC_RNR_RETRY_EXC_ERR,
+	PVRDMA_WC_LOC_RDD_VIOL_ERR,
+	PVRDMA_WC_REM_INV_RD_REQ_ERR,
+	PVRDMA_WC_REM_ABORT_ERR,
+	PVRDMA_WC_INV_EECN_ERR,
+	PVRDMA_WC_INV_EEC_STATE_ERR,
+	PVRDMA_WC_FATAL_ERR,
+	PVRDMA_WC_RESP_TIMEOUT_ERR,
+	PVRDMA_WC_GENERAL_ERR,
+};
+
+enum pvrdma_wc_opcode {
+	PVRDMA_WC_SEND,
+	PVRDMA_WC_RDMA_WRITE,
+	PVRDMA_WC_RDMA_READ,
+	PVRDMA_WC_COMP_SWAP,
+	PVRDMA_WC_FETCH_ADD,
+	PVRDMA_WC_BIND_MW,
+	PVRDMA_WC_LSO,
+	PVRDMA_WC_LOCAL_INV,
+	PVRDMA_WC_FAST_REG_MR,
+	PVRDMA_WC_MASKED_COMP_SWAP,
+	PVRDMA_WC_MASKED_FETCH_ADD,
+	PVRDMA_WC_RECV = 1 << 7,
+	PVRDMA_WC_RECV_RDMA_WITH_IMM,
+};
+
+enum pvrdma_wc_flags {
+	PVRDMA_WC_GRH			= 1 << 0,
+	PVRDMA_WC_WITH_IMM		= 1 << 1,
+	PVRDMA_WC_WITH_INVALIDATE	= 1 << 2,
+	PVRDMA_WC_IP_CSUM_OK		= 1 << 3,
+	PVRDMA_WC_WITH_SMAC		= 1 << 4,
+	PVRDMA_WC_WITH_VLAN		= 1 << 5,
+	PVRDMA_WC_WITH_NETWORK_HDR_TYPE	= 1 << 6,
+	PVRDMA_WC_FLAGS_MAX		= PVRDMA_WC_WITH_NETWORK_HDR_TYPE,
+};
+
+struct pvrdma_alloc_ucontext_resp {
+	uint32_t qp_tab_size;
+	uint32_t reserved;
+};
+
+struct pvrdma_alloc_pd_resp {
+	uint32_t pdn;
+	uint32_t reserved;
+};
+
+struct pvrdma_create_cq {
+	uint64_t buf_addr;
+	uint32_t buf_size;
+	uint32_t reserved;
+};
+
+struct pvrdma_create_cq_resp {
+	uint32_t cqn;
+	uint32_t reserved;
+};
+
+struct pvrdma_resize_cq {
+	uint64_t buf_addr;
+	uint32_t buf_size;
+	uint32_t reserved;
+};
+
+struct pvrdma_create_srq {
+	uint64_t buf_addr;
+	uint32_t buf_size;
+	uint32_t reserved;
+};
+
+struct pvrdma_create_srq_resp {
+	uint32_t srqn;
+	uint32_t reserved;
+};
+
+struct pvrdma_create_qp {
+	uint64_t rbuf_addr;
+	uint64_t sbuf_addr;
+	uint32_t rbuf_size;
+	uint32_t sbuf_size;
+	uint64_t qp_addr;
+};
+
+/* PVRDMA masked atomic compare and swap */
+struct pvrdma_ex_cmp_swap {
+	uint64_t swap_val;
+	uint64_t compare_val;
+	uint64_t swap_mask;
+	uint64_t compare_mask;
+};
+
+/* PVRDMA masked atomic fetch and add */
+struct pvrdma_ex_fetch_add {
+	uint64_t add_val;
+	uint64_t field_boundary;
+};
+
+/* PVRDMA address vector. */
+struct pvrdma_av {
+	uint32_t port_pd;
+	uint32_t sl_tclass_flowlabel;
+	uint8_t dgid[16];
+	uint8_t src_path_bits;
+	uint8_t gid_index;
+	uint8_t stat_rate;
+	uint8_t hop_limit;
+	uint8_t dmac[6];
+	uint8_t reserved[6];
+};
+
+/* PVRDMA scatter/gather entry */
+struct pvrdma_sge {
+	uint64_t   addr;
+	uint32_t   length;
+	uint32_t   lkey;
+};
+
+/* PVRDMA receive queue work request */
+struct pvrdma_rq_wqe_hdr {
+	uint64_t wr_id;		/* wr id */
+	uint32_t num_sge;		/* size of s/g array */
+	uint32_t total_len;	/* reserved */
+};
+/* Use pvrdma_sge (ib_sge) for receive queue s/g array elements. */
+
+/* PVRDMA send queue work request */
+struct pvrdma_sq_wqe_hdr {
+	uint64_t wr_id;		/* wr id */
+	uint32_t num_sge;		/* size of s/g array */
+	uint32_t total_len;	/* reserved */
+	uint32_t opcode;		/* operation type */
+	uint32_t send_flags;	/* wr flags */
+	union {
+		uint32_t imm_data;
+		uint32_t invalidate_rkey;
+	} ex;
+	uint32_t reserved;
+	union {
+		struct {
+			uint64_t remote_addr;
+			uint32_t rkey;
+			uint8_t reserved[4];
+		} rdma;
+		struct {
+			uint64_t remote_addr;
+			uint64_t compare_add;
+			uint64_t swap;
+			uint32_t rkey;
+			uint32_t reserved;
+		} atomic;
+		struct {
+			uint64_t remote_addr;
+			uint32_t log_arg_sz;
+			uint32_t rkey;
+			union {
+				struct pvrdma_ex_cmp_swap  cmp_swap;
+				struct pvrdma_ex_fetch_add fetch_add;
+			} wr_data;
+		} masked_atomics;
+		struct {
+			uint64_t iova_start;
+			uint64_t pl_pdir_dma;
+			uint32_t page_shift;
+			uint32_t page_list_len;
+			uint32_t length;
+			uint32_t access_flags;
+			uint32_t rkey;
+		} fast_reg;
+		struct {
+			uint32_t remote_qpn;
+			uint32_t remote_qkey;
+			struct pvrdma_av av;
+		} ud;
+	} wr;
+};
+/* Use pvrdma_sge (ib_sge) for send queue s/g array elements. */
+
+/* Completion queue element. */
+struct pvrdma_cqe {
+	uint64_t wr_id;
+	uint64_t qp;
+	uint32_t opcode;
+	uint32_t status;
+	uint32_t byte_len;
+	uint32_t imm_data;
+	uint32_t src_qp;
+	uint32_t wc_flags;
+	uint32_t vendor_err;
+	uint16_t pkey_index;
+	uint16_t slid;
+	uint8_t sl;
+	uint8_t dlid_path_bits;
+	uint8_t port_num;
+	uint8_t smac[6];
+	uint8_t network_hdr_type;
+	uint8_t reserved2[6]; /* Pad to next power of 2 (64). */
+};
+
+#endif /* __VMW_PVRDMA_ABI_H__ */
-- 
2.13.5


* [Qemu-devel]  [PATCH V11 05/10] hw/rdma: Add wrappers and macros
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (3 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 04/10] include/standard-headers: add pvrdma related headers Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 06/10] hw/rdma: Definitions for rdma device and rdma resource manager Marcel Apfelbaum
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

From: Yuval Shaia <yuval.shaia@oracle.com>

As all mappings for this device are from driver to device,
declare wrappers on top of the pci_dma_*map functions.

In addition, declare macros to be used for debug messages.
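
A typical caller is expected to pair the two wrappers around a
guest-provided address, roughly as in this sketch (pci_dev, guest_addr
and len are assumed to come from the surrounding device code):

    void *buf = rdma_pci_dma_map(pci_dev, guest_addr, len);
    if (!buf) {
        pr_err("Failed to map guest buffer at 0x%llx\n",
               (long long unsigned int)guest_addr);
        return;
    }
    /* ... access the guest memory through buf ... */
    rdma_pci_dma_unmap(pci_dev, buf, len);

Note that pr_dbg() expands to nothing unless PVRDMA_DEBUG is defined, so
the debug prints cost nothing in normal builds.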

Reviewed-by: Dotan Barak <dotanb@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 hw/Makefile.objs      |  1 +
 hw/rdma/Makefile.objs |  3 +++
 hw/rdma/rdma_utils.c  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/rdma/rdma_utils.h  | 43 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+)
 create mode 100644 hw/rdma/Makefile.objs
 create mode 100644 hw/rdma/rdma_utils.c
 create mode 100644 hw/rdma/rdma_utils.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index cf4cb2010b..6a0ffe0afd 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -18,6 +18,7 @@ devices-dirs-$(CONFIG_IPMI) += ipmi/
 devices-dirs-$(CONFIG_SOFTMMU) += isa/
 devices-dirs-$(CONFIG_SOFTMMU) += misc/
 devices-dirs-$(CONFIG_SOFTMMU) += net/
+devices-dirs-$(CONFIG_SOFTMMU) += rdma/
 devices-dirs-$(CONFIG_SOFTMMU) += nvram/
 devices-dirs-$(CONFIG_SOFTMMU) += pci/
 devices-dirs-$(CONFIG_PCI) += pci-bridge/ pci-host/
diff --git a/hw/rdma/Makefile.objs b/hw/rdma/Makefile.objs
new file mode 100644
index 0000000000..cdffe4a9a3
--- /dev/null
+++ b/hw/rdma/Makefile.objs
@@ -0,0 +1,3 @@
+ifeq ($(CONFIG_RDMA),y)
+obj-$(CONFIG_PCI) += rdma_utils.o
+endif
diff --git a/hw/rdma/rdma_utils.c b/hw/rdma/rdma_utils.c
new file mode 100644
index 0000000000..0e5caffd40
--- /dev/null
+++ b/hw/rdma/rdma_utils.c
@@ -0,0 +1,51 @@
+/*
+ * QEMU paravirtual RDMA - Generic RDMA backend
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "rdma_utils.h"
+
+void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen)
+{
+    void *p;
+    hwaddr len = plen;
+
+    if (!addr) {
+        pr_dbg("addr is NULL\n");
+        return NULL;
+    }
+
+    p = pci_dma_map(dev, addr, &len, DMA_DIRECTION_TO_DEVICE);
+    if (!p) {
+        pr_dbg("Fail in pci_dma_map, addr=0x%llx, len=%ld\n",
+               (long long unsigned int)addr, len);
+        return NULL;
+    }
+
+    if (len != plen) {
+        rdma_pci_dma_unmap(dev, p, len);
+        return NULL;
+    }
+
+    pr_dbg("0x%llx -> %p (len=%ld)\n", (long long unsigned int)addr, p, len);
+
+    return p;
+}
+
+void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len)
+{
+    pr_dbg("%p\n", buffer);
+    if (buffer) {
+        pci_dma_unmap(dev, buffer, len, DMA_DIRECTION_TO_DEVICE, 0);
+    }
+}
diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
new file mode 100644
index 0000000000..cdac910e24
--- /dev/null
+++ b/hw/rdma/rdma_utils.h
@@ -0,0 +1,43 @@
+/*
+ * RDMA device: Debug utilities
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMA_UTILS_H
+#define RDMA_UTILS_H
+
+#include <qemu/osdep.h>
+#include <include/hw/pci/pci.h>
+#include <include/sysemu/dma.h>
+
+#define pr_info(fmt, ...) \
+    fprintf(stdout, "%s: %-20s (%3d): " fmt, "pvrdma",  __func__, __LINE__,\
+           ## __VA_ARGS__)
+
+#define pr_err(fmt, ...) \
+    fprintf(stderr, "%s: Error at %-20s (%3d): " fmt, "pvrdma", __func__, \
+        __LINE__, ## __VA_ARGS__)
+
+#ifdef PVRDMA_DEBUG
+#define pr_dbg(fmt, ...) \
+    fprintf(stdout, "%s: %-20s (%3d): " fmt, "pvrdma", __func__, __LINE__,\
+           ## __VA_ARGS__)
+#else
+#define pr_dbg(fmt, ...)
+#endif
+
+void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
+void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
+
+#endif
-- 
2.13.5


* [Qemu-devel] [PATCH V11 06/10] hw/rdma: Definitions for rdma device and rdma resource manager
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (4 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 05/10] hw/rdma: Add wrappers and macros Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 07/10] hw/rdma: Implementation of generic rdma device layers Marcel Apfelbaum
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

From: Yuval Shaia <yuval.shaia@oracle.com>

Definitions of the various structures and constants used by the backend
and resource manager modules.
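
As a rough illustration (an assumption about the idea, not the actual
rdma_rm.c code, which comes later in the series), the RdmaRmResTbl
defined below is a bitmap-backed handle table: a set bit marks a slot as
allocated and the handle is the slot index:

    #include <stdint.h>
    #include <stddef.h>

    #define TBL_SZ 64

    struct toy_tbl {
        uint64_t bitmap;                /* one bit per slot, 1 = in use */
        size_t res_sz;                  /* size of a single resource */
        uint8_t storage[TBL_SZ * 64];   /* backing store, res_sz <= 64 here */
    };

    static void *toy_alloc(struct toy_tbl *t, uint32_t *handle)
    {
        for (uint32_t i = 0; i < TBL_SZ; i++) {
            if (!(t->bitmap & (UINT64_C(1) << i))) {
                t->bitmap |= UINT64_C(1) << i;
                *handle = i;            /* handle == slot index */
                return &t->storage[i * t->res_sz];
            }
        }
        return NULL;                    /* table exhausted */
    }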

Reviewed-by: Dotan Barak <dotanb@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 hw/rdma/rdma_backend_defs.h |  62 ++++++++++++++++++++++++++
 hw/rdma/rdma_rm_defs.h      | 104 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 166 insertions(+)
 create mode 100644 hw/rdma/rdma_backend_defs.h
 create mode 100644 hw/rdma/rdma_rm_defs.h

diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
new file mode 100644
index 0000000000..837e32419c
--- /dev/null
+++ b/hw/rdma/rdma_backend_defs.h
@@ -0,0 +1,62 @@
+/*
+ *  RDMA device: Definitions of Backend Device structures
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMA_BACKEND_DEFS_H
+#define RDMA_BACKEND_DEFS_H
+
+#include <infiniband/verbs.h>
+#include <qemu/thread.h>
+
+typedef struct RdmaDeviceResources RdmaDeviceResources;
+
+typedef struct RdmaBackendThread {
+    QemuThread thread;
+    QemuMutex mutex;
+    bool run;
+} RdmaBackendThread;
+
+typedef struct RdmaBackendDev {
+    struct ibv_device_attr dev_attr;
+    RdmaBackendThread comp_thread;
+    union ibv_gid gid;
+    PCIDevice *dev;
+    RdmaDeviceResources *rdma_dev_res;
+    struct ibv_device *ib_dev;
+    struct ibv_context *context;
+    struct ibv_comp_channel *channel;
+    uint8_t port_num;
+    uint8_t backend_gid_idx;
+} RdmaBackendDev;
+
+typedef struct RdmaBackendPD {
+    struct ibv_pd *ibpd;
+} RdmaBackendPD;
+
+typedef struct RdmaBackendMR {
+    struct ibv_pd *ibpd;
+    struct ibv_mr *ibmr;
+} RdmaBackendMR;
+
+typedef struct RdmaBackendCQ {
+    RdmaBackendDev *backend_dev;
+    struct ibv_cq *ibcq;
+} RdmaBackendCQ;
+
+typedef struct RdmaBackendQP {
+    struct ibv_pd *ibpd;
+    struct ibv_qp *ibqp;
+} RdmaBackendQP;
+
+#endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
new file mode 100644
index 0000000000..6522dca68f
--- /dev/null
+++ b/hw/rdma/rdma_rm_defs.h
@@ -0,0 +1,104 @@
+/*
+ * RDMA device: Definitions of Resource Manager structures
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMA_RM_DEFS_H
+#define RDMA_RM_DEFS_H
+
+#include "rdma_backend_defs.h"
+
+#define MAX_PORTS             1
+#define MAX_PORT_GIDS         1
+#define MAX_PORT_PKEYS        1
+#define MAX_PKEYS             1
+#define MAX_GIDS              2048
+#define MAX_UCS               512
+#define MAX_MR_SIZE           (1UL << 27)
+#define MAX_QP                1024
+#define MAX_SGE               4
+#define MAX_CQ                2048
+#define MAX_MR                1024
+#define MAX_PD                1024
+#define MAX_QP_RD_ATOM        16
+#define MAX_QP_INIT_RD_ATOM   16
+#define MAX_AH                64
+
+#define MAX_RMRESTBL_NAME_SZ 16
+typedef struct RdmaRmResTbl {
+    char name[MAX_RMRESTBL_NAME_SZ];
+    QemuMutex lock;
+    unsigned long *bitmap;
+    size_t tbl_sz;
+    size_t res_sz;
+    void *tbl;
+} RdmaRmResTbl;
+
+typedef struct RdmaRmPD {
+    RdmaBackendPD backend_pd;
+    uint32_t ctx_handle;
+} RdmaRmPD;
+
+typedef struct RdmaRmCQ {
+    RdmaBackendCQ backend_cq;
+    void *opaque;
+    bool notify;
+} RdmaRmCQ;
+
+typedef struct RdmaRmUserMR {
+    uint64_t host_virt;
+    uint64_t guest_start;
+    size_t length;
+} RdmaRmUserMR;
+
+/* MR (DMA region) */
+typedef struct RdmaRmMR {
+    RdmaBackendMR backend_mr;
+    RdmaRmUserMR user_mr;
+    uint32_t pd_handle;
+    uint32_t lkey;
+    uint32_t rkey;
+} RdmaRmMR;
+
+typedef struct RdmaRmUC {
+    uint64_t uc_handle;
+} RdmaRmUC;
+
+typedef struct RdmaRmQP {
+    RdmaBackendQP backend_qp;
+    void *opaque;
+    uint32_t qp_type;
+    uint32_t qpn;
+    uint32_t send_cq_handle;
+    uint32_t recv_cq_handle;
+    enum ibv_qp_state qp_state;
+} RdmaRmQP;
+
+typedef struct RdmaRmPort {
+    union ibv_gid gid_tbl[MAX_PORT_GIDS];
+    enum ibv_port_state state;
+    int *pkey_tbl; /* TODO: Not yet supported */
+} RdmaRmPort;
+
+typedef struct RdmaDeviceResources {
+    RdmaRmPort ports[MAX_PORTS];
+    RdmaRmResTbl pd_tbl;
+    RdmaRmResTbl mr_tbl;
+    RdmaRmResTbl uc_tbl;
+    RdmaRmResTbl qp_tbl;
+    RdmaRmResTbl cq_tbl;
+    RdmaRmResTbl cqe_ctx_tbl;
+    GHashTable *qp_hash; /* Keeps mapping between real and emulated */
+} RdmaDeviceResources;
+
+#endif
-- 
2.13.5


* [Qemu-devel] [PATCH V11 07/10] hw/rdma: Implementation of generic rdma device layers
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (5 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 06/10] hw/rdma: Definitions for rdma device and rdma resource manager Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 08/10] hw/rdma: PVRDMA commands and data-path ops Marcel Apfelbaum
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

From: Yuval Shaia <yuval.shaia@oracle.com>

This layer is composed of two sub-modules: the backend and the resource
manager. The backend sub-module is responsible for all the interaction
with IB layers such as ibverbs and umad (external libraries). The
resource manager is a collection of functions and structures used to
manage RDMA resources such as QPs, CQs and MRs.
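
The two halves meet through a completion callback: the device-specific
code (added later in the series) is expected to register a handler with
the backend, roughly as below (pvrdma_comp_handler is only a placeholder
name here):

    /* Assumed handler: turns a backend completion into a guest-visible CQE. */
    static void pvrdma_comp_handler(int status, unsigned int vendor_err,
                                    void *ctx)
    {
        /* ... build a CQE for the request described by ctx ... */
    }

    /* During device realization (assumption): */
    rdma_backend_register_comp_handler(pvrdma_comp_handler);

Work requests then flow down through rdma_backend_post_send() and
rdma_backend_post_recv(), and completions flow back up via this handler
from the backend's completion-channel thread.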

Reviewed-by: Dotan Barak <dotanb@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 Makefile.objs          |   1 +
 configure              |   9 +-
 hw/rdma/Makefile.objs  |   2 +-
 hw/rdma/rdma_backend.c | 818 +++++++++++++++++++++++++++++++++++++++++++++++++
 hw/rdma/rdma_backend.h |  98 ++++++
 hw/rdma/rdma_rm.c      | 544 ++++++++++++++++++++++++++++++++
 hw/rdma/rdma_rm.h      |  69 +++++
 hw/rdma/trace-events   |   5 +
 8 files changed, 1541 insertions(+), 5 deletions(-)
 create mode 100644 hw/rdma/rdma_backend.c
 create mode 100644 hw/rdma/rdma_backend.h
 create mode 100644 hw/rdma/rdma_rm.c
 create mode 100644 hw/rdma/rdma_rm.h
 create mode 100644 hw/rdma/trace-events

diff --git a/Makefile.objs b/Makefile.objs
index 2efba6d768..f3a3d28304 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -130,6 +130,7 @@ trace-events-subdirs += hw/block/dataplane
 trace-events-subdirs += hw/char
 trace-events-subdirs += hw/intc
 trace-events-subdirs += hw/net
+trace-events-subdirs += hw/rdma
 trace-events-subdirs += hw/virtio
 trace-events-subdirs += hw/audio
 trace-events-subdirs += hw/misc
diff --git a/configure b/configure
index 913e14839d..ed45a3c4dd 100755
--- a/configure
+++ b/configure
@@ -1572,7 +1572,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   hax             HAX acceleration support
   hvf             Hypervisor.framework acceleration support
   whpx            Windows Hypervisor Platform acceleration support
-  rdma            RDMA-based migration support
+  rdma            Enable RDMA-based migration and PVRDMA support
   vde             support for vde network
   netmap          support for netmap network
   linux-aio       Linux AIO support
@@ -2923,15 +2923,16 @@ if test "$rdma" != "no" ; then
 #include <rdma/rdma_cma.h>
 int main(void) { return 0; }
 EOF
-  rdma_libs="-lrdmacm -libverbs"
+  rdma_libs="-lrdmacm -libverbs -libumad"
   if compile_prog "" "$rdma_libs" ; then
     rdma="yes"
+    libs_softmmu="$libs_softmmu $rdma_libs"
   else
     if test "$rdma" = "yes" ; then
         error_exit \
-            " OpenFabrics librdmacm/libibverbs not present." \
+            " OpenFabrics librdmacm/libibverbs/libibumad not present." \
             " Your options:" \
-            "  (1) Fast: Install infiniband packages from your distro." \
+            "  (1) Fast: Install infiniband packages (devel) from your distro." \
             "  (2) Cleanest: Install libraries from www.openfabrics.org" \
             "  (3) Also: Install softiwarp if you don't have RDMA hardware"
     fi
diff --git a/hw/rdma/Makefile.objs b/hw/rdma/Makefile.objs
index cdffe4a9a3..6a59bf0d5b 100644
--- a/hw/rdma/Makefile.objs
+++ b/hw/rdma/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_RDMA),y)
-obj-$(CONFIG_PCI) += rdma_utils.o
+obj-$(CONFIG_PCI) += rdma_utils.o rdma_backend.o rdma_rm.o
 endif
diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
new file mode 100644
index 0000000000..e306fba534
--- /dev/null
+++ b/hw/rdma/rdma_backend.c
@@ -0,0 +1,818 @@
+/*
+ * QEMU paravirtual RDMA - Generic RDMA backend
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <qemu/osdep.h>
+#include <qemu/error-report.h>
+#include <qapi/error.h>
+
+#include <infiniband/verbs.h>
+
+#include "trace.h"
+#include "rdma_utils.h"
+#include "rdma_rm.h"
+#include "rdma_backend.h"
+
+/* Vendor Errors */
+#define VENDOR_ERR_FAIL_BACKEND     0x201
+#define VENDOR_ERR_TOO_MANY_SGES    0x202
+#define VENDOR_ERR_NOMEM            0x203
+#define VENDOR_ERR_QP0              0x204
+#define VENDOR_ERR_NO_SGE           0x205
+#define VENDOR_ERR_MAD_SEND         0x206
+#define VENDOR_ERR_INVLKEY          0x207
+#define VENDOR_ERR_MR_SMALL         0x208
+
+#define THR_NAME_LEN 16
+
+typedef struct BackendCtx {
+    uint64_t req_id;
+    void *up_ctx;
+    bool is_tx_req;
+} BackendCtx;
+
+static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
+
+static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+{
+    pr_err("No completion handler is registered\n");
+}
+
+static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
+{
+    int i, ne;
+    BackendCtx *bctx;
+    struct ibv_wc wc[2];
+
+    pr_dbg("Entering poll_cq loop on cq %p\n", ibcq);
+    do {
+        ne = ibv_poll_cq(ibcq, ARRAY_SIZE(wc), wc);
+
+        pr_dbg("Got %d completion(s) from cq %p\n", ne, ibcq);
+
+        for (i = 0; i < ne; i++) {
+            pr_dbg("wr_id=0x%lx\n", wc[i].wr_id);
+            pr_dbg("status=%d\n", wc[i].status);
+
+            bctx = rdma_rm_get_cqe_ctx(rdma_dev_res, wc[i].wr_id);
+            if (unlikely(!bctx)) {
+                pr_dbg("Error: Failed to find ctx for req %ld\n", wc[i].wr_id);
+                continue;
+            }
+            pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
+
+            comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
+
+            rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
+            g_free(bctx);
+        }
+    } while (ne > 0);
+
+    if (ne < 0) {
+        pr_dbg("Got error %d from ibv_poll_cq\n", ne);
+    }
+}
+
+static void *comp_handler_thread(void *arg)
+{
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)arg;
+    int rc;
+    struct ibv_cq *ev_cq;
+    void *ev_ctx;
+
+    pr_dbg("Starting\n");
+
+    while (backend_dev->comp_thread.run) {
+        pr_dbg("Waiting for completion on channel %p\n", backend_dev->channel);
+        rc = ibv_get_cq_event(backend_dev->channel, &ev_cq, &ev_ctx);
+        pr_dbg("ibv_get_cq_event=%d\n", rc);
+        if (unlikely(rc)) {
+            pr_dbg("---> ibv_get_cq_event (%d)\n", rc);
+            continue;
+        }
+
+        rc = ibv_req_notify_cq(ev_cq, 0);
+        if (unlikely(rc)) {
+            pr_dbg("Error %d from ibv_req_notify_cq\n", rc);
+        }
+
+        poll_cq(backend_dev->rdma_dev_res, ev_cq);
+
+        ibv_ack_cq_events(ev_cq, 1);
+    }
+
+    pr_dbg("Going down\n");
+
+    /* TODO: Post cqe for all remaining buffs that were posted */
+
+    return NULL;
+}
+
+void rdma_backend_register_comp_handler(void (*handler)(int status,
+                                        unsigned int vendor_err, void *ctx))
+{
+    comp_handler = handler;
+}
+
+void rdma_backend_unregister_comp_handler(void)
+{
+    rdma_backend_register_comp_handler(dummy_comp_handler);
+}
+
+int rdma_backend_query_port(RdmaBackendDev *backend_dev,
+                            struct ibv_port_attr *port_attr)
+{
+    int rc;
+
+    rc = ibv_query_port(backend_dev->context, backend_dev->port_num, port_attr);
+    if (rc) {
+        pr_dbg("Error %d from ibv_query_port\n", rc);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+void rdma_backend_poll_cq(RdmaDeviceResources *rdma_dev_res, RdmaBackendCQ *cq)
+{
+    poll_cq(rdma_dev_res, cq->ibcq);
+}
+
+static GHashTable *ah_hash;
+
+static struct ibv_ah *create_ah(RdmaBackendDev *backend_dev, struct ibv_pd *pd,
+                                uint8_t sgid_idx, union ibv_gid *dgid)
+{
+    GBytes *ah_key = g_bytes_new(dgid, sizeof(*dgid));
+    struct ibv_ah *ah = g_hash_table_lookup(ah_hash, ah_key);
+
+    if (ah) {
+        trace_create_ah_cache_hit(be64_to_cpu(dgid->global.subnet_prefix),
+                                  be64_to_cpu(dgid->global.interface_id));
+        g_bytes_unref(ah_key);
+    } else {
+        struct ibv_ah_attr ah_attr = {
+            .is_global     = 1,
+            .port_num      = backend_dev->port_num,
+            .grh.hop_limit = 1,
+        };
+
+        ah_attr.grh.dgid = *dgid;
+        ah_attr.grh.sgid_index = sgid_idx;
+
+        ah = ibv_create_ah(pd, &ah_attr);
+        if (ah) {
+            g_hash_table_insert(ah_hash, ah_key, ah);
+        } else {
+            g_bytes_unref(ah_key);
+            pr_dbg("ibv_create_ah failed for gid <%lx %lx>\n",
+                    be64_to_cpu(dgid->global.subnet_prefix),
+                    be64_to_cpu(dgid->global.interface_id));
+        }
+
+        trace_create_ah_cache_miss(be64_to_cpu(dgid->global.subnet_prefix),
+                                   be64_to_cpu(dgid->global.interface_id));
+    }
+
+    return ah;
+}
+
+static void destroy_ah_hash_key(gpointer data)
+{
+    g_bytes_unref(data);
+}
+
+static void destroy_ah_hast_data(gpointer data)
+{
+    struct ibv_ah *ah = data;
+
+    ibv_destroy_ah(ah);
+}
+
+static void ah_cache_init(void)
+{
+    ah_hash = g_hash_table_new_full(g_bytes_hash, g_bytes_equal,
+                                    destroy_ah_hash_key, destroy_ah_hast_data);
+}
+
+static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
+                                struct ibv_sge *dsge, struct ibv_sge *ssge,
+                                uint8_t num_sge)
+{
+    RdmaRmMR *mr;
+    int ssge_idx;
+
+    pr_dbg("num_sge=%d\n", num_sge);
+
+    for (ssge_idx = 0; ssge_idx < num_sge; ssge_idx++) {
+        mr = rdma_rm_get_mr(rdma_dev_res, ssge[ssge_idx].lkey);
+        if (unlikely(!mr)) {
+            pr_dbg("Invalid lkey 0x%x\n", ssge[ssge_idx].lkey);
+            return VENDOR_ERR_INVLKEY | ssge[ssge_idx].lkey;
+        }
+
+        dsge->addr = mr->user_mr.host_virt + ssge[ssge_idx].addr -
+                     mr->user_mr.guest_start;
+        dsge->length = ssge[ssge_idx].length;
+        dsge->lkey = rdma_backend_mr_lkey(&mr->backend_mr);
+
+        pr_dbg("ssge->addr=0x%lx\n", (uint64_t)ssge[ssge_idx].addr);
+        pr_dbg("dsge->addr=0x%lx\n", dsge->addr);
+        pr_dbg("dsge->length=%d\n", dsge->length);
+        pr_dbg("dsge->lkey=0x%x\n", dsge->lkey);
+
+        dsge++;
+    }
+
+    return 0;
+}
+
+void rdma_backend_post_send(RdmaBackendDev *backend_dev,
+                            RdmaBackendQP *qp, uint8_t qp_type,
+                            struct ibv_sge *sge, uint32_t num_sge,
+                            union ibv_gid *dgid, uint32_t dqpn,
+                            uint32_t dqkey, void *ctx)
+{
+    BackendCtx *bctx;
+    struct ibv_sge new_sge[MAX_SGE];
+    uint32_t bctx_id;
+    int rc;
+    struct ibv_send_wr wr = {0}, *bad_wr;
+
+    if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
+        if (qp_type == IBV_QPT_SMI) {
+            pr_dbg("QP0 unsupported\n");
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+        } else if (qp_type == IBV_QPT_GSI) {
+            pr_dbg("QP1\n");
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+        }
+        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
+        return;
+    }
+
+    pr_dbg("num_sge=%d\n", num_sge);
+    if (!num_sge) {
+        pr_dbg("num_sge=0\n");
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        return;
+    }
+
+    bctx = g_malloc0(sizeof(*bctx));
+    bctx->up_ctx = ctx;
+    bctx->is_tx_req = 1;
+
+    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
+    if (unlikely(rc)) {
+        pr_dbg("Failed to allocate cqe_ctx\n");
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        goto out_free_bctx;
+    }
+
+    rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, num_sge);
+    if (rc) {
+        pr_dbg("Error: Failed to build host SGE array\n");
+        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        goto out_dealloc_cqe_ctx;
+    }
+
+    if (qp_type == IBV_QPT_UD) {
+        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
+                                backend_dev->backend_gid_idx, dgid);
+        wr.wr.ud.remote_qpn = dqpn;
+        wr.wr.ud.remote_qkey = dqkey;
+    }
+
+    wr.num_sge = num_sge;
+    wr.opcode = IBV_WR_SEND;
+    wr.send_flags = IBV_SEND_SIGNALED;
+    wr.sg_list = new_sge;
+    wr.wr_id = bctx_id;
+
+    rc = ibv_post_send(qp->ibqp, &wr, &bad_wr);
+    pr_dbg("ibv_post_send=%d\n", rc);
+    if (rc) {
+        pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
+                qp->ibqp->qp_num);
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        goto out_dealloc_cqe_ctx;
+    }
+
+    return;
+
+out_dealloc_cqe_ctx:
+    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, bctx_id);
+
+out_free_bctx:
+    g_free(bctx);
+}
+
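+/* Post a receive WR on the backend QP; wr_id carries the cqe_ctx id so the
+ * completion handler can find the caller's context. */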
+void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
+                            RdmaDeviceResources *rdma_dev_res,
+                            RdmaBackendQP *qp, uint8_t qp_type,
+                            struct ibv_sge *sge, uint32_t num_sge, void *ctx)
+{
+    BackendCtx *bctx;
+    struct ibv_sge new_sge[MAX_SGE];
+    uint32_t bctx_id;
+    int rc;
+    struct ibv_recv_wr wr = {0}, *bad_wr;
+
+    if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
+        if (qp_type == IBV_QPT_SMI) {
+            pr_dbg("QP0 unsupported\n");
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+        }
+        if (qp_type == IBV_QPT_GSI) {
+            pr_dbg("QP1\n");
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+        }
+        return;
+    }
+
+    pr_dbg("num_sge=%d\n", num_sge);
+    if (!num_sge) {
+        pr_dbg("num_sge=0\n");
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        return;
+    }
+
+    bctx = g_malloc0(sizeof(*bctx));
+    bctx->up_ctx = ctx;
+    bctx->is_tx_req = 0;
+
+    rc = rdma_rm_alloc_cqe_ctx(rdma_dev_res, &bctx_id, bctx);
+    if (unlikely(rc)) {
+        pr_dbg("Failed to allocate cqe_ctx\n");
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        goto out_free_bctx;
+    }
+
+    rc = build_host_sge_array(rdma_dev_res, new_sge, sge, num_sge);
+    if (rc) {
+        pr_dbg("Error: Failed to build host SGE array\n");
+        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        goto out_dealloc_cqe_ctx;
+    }
+
+    wr.num_sge = num_sge;
+    wr.sg_list = new_sge;
+    wr.wr_id = bctx_id;
+    rc = ibv_post_recv(qp->ibqp, &wr, &bad_wr);
+    pr_dbg("ibv_post_recv=%d\n", rc);
+    if (rc) {
+        pr_dbg("Fail (%d, %d) to post recv WQE to qpn %d\n", rc, errno,
+                qp->ibqp->qp_num);
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        goto out_dealloc_cqe_ctx;
+    }
+
+    return;
+
+out_dealloc_cqe_ctx:
+    rdma_rm_dealloc_cqe_ctx(rdma_dev_res, bctx_id);
+
+out_free_bctx:
+    g_free(bctx);
+}
+
+int rdma_backend_create_pd(RdmaBackendDev *backend_dev, RdmaBackendPD *pd)
+{
+    pd->ibpd = ibv_alloc_pd(backend_dev->context);
+
+    return pd->ibpd ? 0 : -EIO;
+}
+
+void rdma_backend_destroy_pd(RdmaBackendPD *pd)
+{
+    if (pd->ibpd) {
+        ibv_dealloc_pd(pd->ibpd);
+    }
+}
+
+int rdma_backend_create_mr(RdmaBackendMR *mr, RdmaBackendPD *pd, uint64_t addr,
+                           size_t length, int access)
+{
+    pr_dbg("addr=0x%lx\n", addr);
+    pr_dbg("len=%ld\n", length);
+    mr->ibmr = ibv_reg_mr(pd->ibpd, (void *)addr, length, access);
+    if (mr->ibmr) {
+        pr_dbg("lkey=0x%x\n", mr->ibmr->lkey);
+        pr_dbg("rkey=0x%x\n", mr->ibmr->rkey);
+        mr->ibpd = pd->ibpd;
+    }
+
+    return mr->ibmr ? 0 : -EIO;
+}
+
+void rdma_backend_destroy_mr(RdmaBackendMR *mr)
+{
+    if (mr->ibmr) {
+        ibv_dereg_mr(mr->ibmr);
+    }
+}
+
+int rdma_backend_create_cq(RdmaBackendDev *backend_dev, RdmaBackendCQ *cq,
+                           int cqe)
+{
+    int rc;
+
+    pr_dbg("cqe=%d\n", cqe);
+
+    pr_dbg("dev->channel=%p\n", backend_dev->channel);
+    cq->ibcq = ibv_create_cq(backend_dev->context, cqe + 1, NULL,
+                             backend_dev->channel, 0);
+
+    if (cq->ibcq) {
+        rc = ibv_req_notify_cq(cq->ibcq, 0);
+        if (rc) {
+            pr_dbg("Error %d from ibv_req_notify_cq\n", rc);
+        }
+        cq->backend_dev = backend_dev;
+    }
+
+    return cq->ibcq ? 0 : -EIO;
+}
+
+void rdma_backend_destroy_cq(RdmaBackendCQ *cq)
+{
+    if (cq->ibcq) {
+        ibv_destroy_cq(cq->ibcq);
+    }
+}
+
+int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
+                           RdmaBackendPD *pd, RdmaBackendCQ *scq,
+                           RdmaBackendCQ *rcq, uint32_t max_send_wr,
+                           uint32_t max_recv_wr, uint32_t max_send_sge,
+                           uint32_t max_recv_sge)
+{
+    struct ibv_qp_init_attr attr = {0};
+
+    qp->ibqp = NULL;
+    pr_dbg("qp_type=%d\n", qp_type);
+
+    switch (qp_type) {
+    case IBV_QPT_GSI:
+        pr_dbg("QP1 unsupported\n");
+        return 0;
+
+    case IBV_QPT_RC:
+        /* fall through */
+    case IBV_QPT_UD:
+        /* do nothing */
+        break;
+
+    default:
+        pr_dbg("Unsupported QP type %d\n", qp_type);
+        return -EIO;
+    }
+
+    attr.qp_type = qp_type;
+    attr.send_cq = scq->ibcq;
+    attr.recv_cq = rcq->ibcq;
+    attr.cap.max_send_wr = max_send_wr;
+    attr.cap.max_recv_wr = max_recv_wr;
+    attr.cap.max_send_sge = max_send_sge;
+    attr.cap.max_recv_sge = max_recv_sge;
+
+    pr_dbg("max_send_wr=%d\n", max_send_wr);
+    pr_dbg("max_recv_wr=%d\n", max_recv_wr);
+    pr_dbg("max_send_sge=%d\n", max_send_sge);
+    pr_dbg("max_recv_sge=%d\n", max_recv_sge);
+
+    qp->ibqp = ibv_create_qp(pd->ibpd, &attr);
+    if (unlikely(!qp->ibqp)) {
+        pr_dbg("Error from ibv_create_qp\n");
+        return -EIO;
+    }
+
+    qp->ibpd = pd->ibpd;
+
+    /* TODO: Query QP to get max_inline_data and save it to be used in send */
+
+    pr_dbg("qpn=0x%x\n", qp->ibqp->qp_num);
+
+    return 0;
+}
+
+int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
+                               uint8_t qp_type, uint32_t qkey)
+{
+    struct ibv_qp_attr attr = {0};
+    int rc, attr_mask;
+
+    pr_dbg("qpn=0x%x\n", qp->ibqp->qp_num);
+    pr_dbg("sport_num=%d\n", backend_dev->port_num);
+
+    attr_mask = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT;
+    attr.qp_state        = IBV_QPS_INIT;
+    attr.pkey_index      = 0;
+    attr.port_num        = backend_dev->port_num;
+
+    switch (qp_type) {
+    case IBV_QPT_RC:
+        attr_mask |= IBV_QP_ACCESS_FLAGS;
+        break;
+
+    case IBV_QPT_UD:
+        attr.qkey = qkey;
+        attr_mask |= IBV_QP_QKEY;
+        break;
+
+    default:
+        pr_dbg("Unsupported QP type %d\n", qp_type);
+        return -EIO;
+    }
+
+    rc = ibv_modify_qp(qp->ibqp, &attr, attr_mask);
+    if (rc) {
+        pr_dbg("Error %d from ibv_modify_qp\n", rc);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
+                              uint8_t qp_type, union ibv_gid *dgid,
+                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
+                              bool use_qkey)
+{
+    struct ibv_qp_attr attr = {0};
+    union ibv_gid ibv_gid = {
+        .global.interface_id = dgid->global.interface_id,
+        .global.subnet_prefix = dgid->global.subnet_prefix
+    };
+    int rc, attr_mask;
+
+    attr.qp_state = IBV_QPS_RTR;
+    attr_mask = IBV_QP_STATE;
+
+    switch (qp_type) {
+    case IBV_QPT_RC:
+        pr_dbg("dgid=0x%lx,%lx\n",
+               be64_to_cpu(ibv_gid.global.subnet_prefix),
+               be64_to_cpu(ibv_gid.global.interface_id));
+        pr_dbg("dqpn=0x%x\n", dqpn);
+        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
+        pr_dbg("sport_num=%d\n", backend_dev->port_num);
+        pr_dbg("rq_psn=0x%x\n", rq_psn);
+
+        attr.path_mtu               = IBV_MTU_1024;
+        attr.dest_qp_num            = dqpn;
+        attr.max_dest_rd_atomic     = 1;
+        attr.min_rnr_timer          = 12;
+        attr.ah_attr.port_num       = backend_dev->port_num;
+        attr.ah_attr.is_global      = 1;
+        attr.ah_attr.grh.hop_limit  = 1;
+        attr.ah_attr.grh.dgid       = ibv_gid;
+        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
+        attr.rq_psn                 = rq_psn;
+
+        attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
+                     IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC |
+                     IBV_QP_MIN_RNR_TIMER;
+        break;
+
+    case IBV_QPT_UD:
+        if (use_qkey) {
+            pr_dbg("qkey=0x%x\n", qkey);
+            attr.qkey = qkey;
+            attr_mask |= IBV_QP_QKEY;
+        }
+        break;
+    }
+
+    rc = ibv_modify_qp(qp->ibqp, &attr, attr_mask);
+    if (rc) {
+        pr_dbg("Error %d from ibv_modify_qp\n", rc);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
+                              uint32_t sq_psn, uint32_t qkey, bool use_qkey)
+{
+    struct ibv_qp_attr attr = {0};
+    int rc, attr_mask;
+
+    pr_dbg("qpn=0x%x\n", qp->ibqp->qp_num);
+    pr_dbg("sq_psn=0x%x\n", sq_psn);
+
+    attr.qp_state = IBV_QPS_RTS;
+    attr.sq_psn = sq_psn;
+    attr_mask = IBV_QP_STATE | IBV_QP_SQ_PSN;
+
+    switch (qp_type) {
+    case IBV_QPT_RC:
+        attr.timeout       = 14;
+        attr.retry_cnt     = 7;
+        attr.rnr_retry     = 7;
+        attr.max_rd_atomic = 1;
+
+        attr_mask |= IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
+                     IBV_QP_MAX_QP_RD_ATOMIC;
+        break;
+
+    case IBV_QPT_UD:
+        if (use_qkey) {
+            pr_dbg("qkey=0x%x\n", qkey);
+            attr.qkey = qkey;
+            attr_mask |= IBV_QP_QKEY;
+        }
+        break;
+    }
+
+    rc = ibv_modify_qp(qp->ibqp, &attr, attr_mask);
+    if (rc) {
+        pr_dbg("Error %d from ibv_modify_qp\n", rc);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+void rdma_backend_destroy_qp(RdmaBackendQP *qp)
+{
+    if (qp->ibqp) {
+        ibv_destroy_qp(qp->ibqp);
+    }
+}
+
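+/*
+ * Clamp a requested device attribute to the host device capability and warn
+ * when the requested value had to be lowered.
+ */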
+#define CHK_ATTR(req, dev, member, fmt) ({ \
+    pr_dbg("%s="fmt","fmt"\n", #member, dev.member, req->member); \
+    if (req->member > dev.member) { \
+        warn_report("%s = 0x%" PRIx64 " is higher than host device" \
+                    " capability 0x%" PRIx64, #member, \
+                    (uint64_t)req->member, (uint64_t)dev.member); \
+        req->member = dev.member; \
+    } \
+    pr_dbg("%s="fmt"\n", #member, req->member); })
+
+static int init_device_caps(RdmaBackendDev *backend_dev,
+                            struct ibv_device_attr *dev_attr)
+{
+    if (ibv_query_device(backend_dev->context, &backend_dev->dev_attr)) {
+        return -EIO;
+    }
+
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_mr_size, "%ld");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_sge, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp_wr, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_cq, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_cqe, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_mr, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_pd, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp_rd_atom, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp_init_rd_atom, "%d");
+    CHK_ATTR(dev_attr, backend_dev->dev_attr, max_ah, "%d");
+
+    return 0;
+}
+
+int rdma_backend_init(RdmaBackendDev *backend_dev,
+                      RdmaDeviceResources *rdma_dev_res,
+                      const char *backend_device_name, uint8_t port_num,
+                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
+                      Error **errp)
+{
+    int i;
+    int ret = 0;
+    int num_ibv_devices;
+    char thread_name[THR_NAME_LEN] = {0};
+    struct ibv_device **dev_list;
+    struct ibv_port_attr port_attr;
+
+    backend_dev->backend_gid_idx = backend_gid_idx;
+    backend_dev->port_num = port_num;
+    backend_dev->rdma_dev_res = rdma_dev_res;
+
+    rdma_backend_register_comp_handler(dummy_comp_handler);
+
+    dev_list = ibv_get_device_list(&num_ibv_devices);
+    if (!dev_list) {
+        error_setg(errp, "Failed to get IB devices list");
+        return -EIO;
+    }
+
+    if (num_ibv_devices == 0) {
+        error_setg(errp, "No IB devices were found");
+        ret = -ENXIO;
+        goto out_free_dev_list;
+    }
+
+    if (backend_device_name) {
+        for (i = 0; dev_list[i]; ++i) {
+            if (!strcmp(ibv_get_device_name(dev_list[i]),
+                        backend_device_name)) {
+                break;
+            }
+        }
+
+        backend_dev->ib_dev = dev_list[i];
+        if (!backend_dev->ib_dev) {
+            error_setg(errp, "Failed to find IB device %s",
+                       backend_device_name);
+            ret = -EIO;
+            goto out_free_dev_list;
+        }
+    } else {
+        backend_dev->ib_dev = *dev_list;
+    }
+
+    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
+           ibv_get_device_name(backend_dev->ib_dev),
+           backend_dev->port_num, backend_dev->backend_gid_idx);
+
+    backend_dev->context = ibv_open_device(backend_dev->ib_dev);
+    if (!backend_dev->context) {
+        error_setg(errp, "Failed to open IB device");
+        ret = -EIO;
+        goto out;
+    }
+
+    backend_dev->channel = ibv_create_comp_channel(backend_dev->context);
+    if (!backend_dev->channel) {
+        error_setg(errp, "Failed to create IB communication channel");
+        ret = -EIO;
+        goto out_close_device;
+    }
+    pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
+
+    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
+                         &port_attr);
+    if (ret) {
+        error_setg(errp, "Error %d from ibv_query_port", ret);
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+
+    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
+        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
+                   port_attr.gid_tbl_len);
+        ret = -EINVAL;
+        goto out_destroy_comm_channel;
+    }
+
+    ret = init_device_caps(backend_dev, dev_attr);
+    if (ret) {
+        error_setg(errp, "Failed to initialize device capabilities");
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+
+    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
+                         backend_dev->backend_gid_idx, &backend_dev->gid);
+    if (ret) {
+        error_setg(errp, "Failed to query gid %d",
+                   backend_dev->backend_gid_idx);
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+    pr_dbg("subnet_prefix=0x%lx\n",
+           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
+    pr_dbg("interface_id=0x%lx\n",
+           be64_to_cpu(backend_dev->gid.global.interface_id));
+
+    snprintf(thread_name, sizeof(thread_name), "rdma_comp_%s",
+             ibv_get_device_name(backend_dev->ib_dev));
+    backend_dev->comp_thread.run = true;
+    qemu_thread_create(&backend_dev->comp_thread.thread, thread_name,
+                       comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
+
+    ah_cache_init();
+
+    goto out_free_dev_list;
+
+out_destroy_comm_channel:
+    ibv_destroy_comp_channel(backend_dev->channel);
+
+out_close_device:
+    ibv_close_device(backend_dev->context);
+
+out_free_dev_list:
+    ibv_free_device_list(dev_list);
+
+out:
+    return ret;
+}
+
+void rdma_backend_fini(RdmaBackendDev *backend_dev)
+{
+    g_hash_table_destroy(ah_hash);
+    ibv_destroy_comp_channel(backend_dev->channel);
+    ibv_close_device(backend_dev->context);
+}
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
new file mode 100644
index 0000000000..68f2b05ca7
--- /dev/null
+++ b/hw/rdma/rdma_backend.h
@@ -0,0 +1,98 @@
+/*
+ *  RDMA device: Definitions of Backend Device functions
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMA_BACKEND_H
+#define RDMA_BACKEND_H
+
+#include <qapi/error.h>
+#include "rdma_rm_defs.h"
+#include "rdma_backend_defs.h"
+
+/* Add definitions for QP0 and QP1 as there are no userspace enums for them */
+enum ibv_special_qp_type {
+    IBV_QPT_SMI = 0,
+    IBV_QPT_GSI = 1,
+};
+
+static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
+{
+    return &dev->gid;
+}
+
+static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
+{
+    return qp->ibqp ? qp->ibqp->qp_num : 0;
+}
+
+static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
+{
+    return mr->ibmr ? mr->ibmr->lkey : 0;
+}
+
+static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
+{
+    return mr->ibmr ? mr->ibmr->rkey : 0;
+}
+
+int rdma_backend_init(RdmaBackendDev *backend_dev,
+                      RdmaDeviceResources *rdma_dev_res,
+                      const char *backend_device_name, uint8_t port_num,
+                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
+                      Error **errp);
+void rdma_backend_fini(RdmaBackendDev *backend_dev);
+void rdma_backend_register_comp_handler(void (*handler)(int status,
+                                        unsigned int vendor_err, void *ctx));
+void rdma_backend_unregister_comp_handler(void);
+
+int rdma_backend_query_port(RdmaBackendDev *backend_dev,
+                            struct ibv_port_attr *port_attr);
+int rdma_backend_create_pd(RdmaBackendDev *backend_dev, RdmaBackendPD *pd);
+void rdma_backend_destroy_pd(RdmaBackendPD *pd);
+
+int rdma_backend_create_mr(RdmaBackendMR *mr, RdmaBackendPD *pd, uint64_t addr,
+                           size_t length, int access);
+void rdma_backend_destroy_mr(RdmaBackendMR *mr);
+
+int rdma_backend_create_cq(RdmaBackendDev *backend_dev, RdmaBackendCQ *cq,
+                           int cqe);
+void rdma_backend_destroy_cq(RdmaBackendCQ *cq);
+void rdma_backend_poll_cq(RdmaDeviceResources *rdma_dev_res, RdmaBackendCQ *cq);
+
+int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
+                           RdmaBackendPD *pd, RdmaBackendCQ *scq,
+                           RdmaBackendCQ *rcq, uint32_t max_send_wr,
+                           uint32_t max_recv_wr, uint32_t max_send_sge,
+                           uint32_t max_recv_sge);
+int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
+                               uint8_t qp_type, uint32_t qkey);
+int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
+                              uint8_t qp_type, union ibv_gid *dgid,
+                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
+                              bool use_qkey);
+int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
+                              uint32_t sq_psn, uint32_t qkey, bool use_qkey);
+void rdma_backend_destroy_qp(RdmaBackendQP *qp);
+
+void rdma_backend_post_send(RdmaBackendDev *backend_dev,
+                            RdmaBackendQP *qp, uint8_t qp_type,
+                            struct ibv_sge *sge, uint32_t num_sge,
+                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
+                            void *ctx);
+void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
+                            RdmaDeviceResources *rdma_dev_res,
+                            RdmaBackendQP *qp, uint8_t qp_type,
+                            struct ibv_sge *sge, uint32_t num_sge, void *ctx);
+
+#endif
diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
new file mode 100644
index 0000000000..b5fc45ddab
--- /dev/null
+++ b/hw/rdma/rdma_rm.c
@@ -0,0 +1,544 @@
+/*
+ * QEMU paravirtual RDMA - Resource Manager Implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <qemu/osdep.h>
+#include <qapi/error.h>
+#include <cpu.h>
+
+#include "rdma_utils.h"
+#include "rdma_backend.h"
+#include "rdma_rm.h"
+
+#define MAX_RM_TBL_NAME 16
+
+/* Page directory and page tables */
+#define PG_DIR_SZ (TARGET_PAGE_SIZE / sizeof(__u64))
+#define PG_TBL_SZ (TARGET_PAGE_SIZE / sizeof(__u64))
+
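+/*
+ * A resource table is a flat array of fixed-size entries plus a bitmap of
+ * allocated handles, protected by a mutex.
+ */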
+static inline void res_tbl_init(const char *name, RdmaRmResTbl *tbl,
+                                uint32_t tbl_sz, uint32_t res_sz)
+{
+    tbl->tbl = g_malloc(tbl_sz * res_sz);
+
+    strncpy(tbl->name, name, MAX_RM_TBL_NAME);
+    tbl->name[MAX_RM_TBL_NAME - 1] = 0;
+
+    tbl->bitmap = bitmap_new(tbl_sz);
+    tbl->tbl_sz = tbl_sz;
+    tbl->res_sz = res_sz;
+    qemu_mutex_init(&tbl->lock);
+}
+
+static inline void res_tbl_free(RdmaRmResTbl *tbl)
+{
+    qemu_mutex_destroy(&tbl->lock);
+    g_free(tbl->tbl);
+    g_free(tbl->bitmap);
+}
+
+static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
+{
+    pr_dbg("%s, handle=%d\n", tbl->name, handle);
+
+    if ((handle < tbl->tbl_sz) && (test_bit(handle, tbl->bitmap))) {
+        return tbl->tbl + handle * tbl->res_sz;
+    } else {
+        pr_dbg("Invalid handle %d\n", handle);
+        return NULL;
+    }
+}
+
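+/* Allocate the first free handle in the table and return its zeroed entry. */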
+static inline void *res_tbl_alloc(RdmaRmResTbl *tbl, uint32_t *handle)
+{
+    qemu_mutex_lock(&tbl->lock);
+
+    *handle = find_first_zero_bit(tbl->bitmap, tbl->tbl_sz);
+    if (*handle >= tbl->tbl_sz) {
+        pr_dbg("Failed to alloc, bitmap is full\n");
+        qemu_mutex_unlock(&tbl->lock);
+        return NULL;
+    }
+
+    set_bit(*handle, tbl->bitmap);
+
+    qemu_mutex_unlock(&tbl->lock);
+
+    memset(tbl->tbl + *handle * tbl->res_sz, 0, tbl->res_sz);
+
+    pr_dbg("%s, handle=%d\n", tbl->name, *handle);
+
+    return tbl->tbl + *handle * tbl->res_sz;
+}
+
+static inline void res_tbl_dealloc(RdmaRmResTbl *tbl, uint32_t handle)
+{
+    pr_dbg("%s, handle=%d\n", tbl->name, handle);
+
+    qemu_mutex_lock(&tbl->lock);
+
+    if (handle < tbl->tbl_sz) {
+        clear_bit(handle, tbl->bitmap);
+    }
+
+    qemu_mutex_unlock(&tbl->lock);
+}
+
+int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                     uint32_t *pd_handle, uint32_t ctx_handle)
+{
+    RdmaRmPD *pd;
+    int ret = -ENOMEM;
+
+    pd = res_tbl_alloc(&dev_res->pd_tbl, pd_handle);
+    if (!pd) {
+        goto out;
+    }
+
+    ret = rdma_backend_create_pd(backend_dev, &pd->backend_pd);
+    if (ret) {
+        ret = -EIO;
+        goto out_tbl_dealloc;
+    }
+
+    pd->ctx_handle = ctx_handle;
+
+    return 0;
+
+out_tbl_dealloc:
+    res_tbl_dealloc(&dev_res->pd_tbl, *pd_handle);
+
+out:
+    return ret;
+}
+
+RdmaRmPD *rdma_rm_get_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle)
+{
+    return res_tbl_get(&dev_res->pd_tbl, pd_handle);
+}
+
+void rdma_rm_dealloc_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle)
+{
+    RdmaRmPD *pd = rdma_rm_get_pd(dev_res, pd_handle);
+
+    if (pd) {
+        rdma_backend_destroy_pd(&pd->backend_pd);
+        res_tbl_dealloc(&dev_res->pd_tbl, pd_handle);
+    }
+}
+
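+/*
+ * Register an MR. For user memory, host_virt is the host mapping of the
+ * guest range and the MR handle is returned as the lkey so the data path
+ * can translate guest SGEs back to this MR.
+ */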
+int rdma_rm_alloc_mr(RdmaDeviceResources *dev_res, uint32_t pd_handle,
+                     uint64_t guest_start, size_t guest_length, void *host_virt,
+                     int access_flags, uint32_t *mr_handle, uint32_t *lkey,
+                     uint32_t *rkey)
+{
+    RdmaRmMR *mr;
+    int ret = 0;
+    RdmaRmPD *pd;
+    uint64_t addr;
+    size_t length;
+
+    pd = rdma_rm_get_pd(dev_res, pd_handle);
+    if (!pd) {
+        pr_dbg("Invalid PD\n");
+        return -EINVAL;
+    }
+
+    mr = res_tbl_alloc(&dev_res->mr_tbl, mr_handle);
+    if (!mr) {
+        pr_dbg("Failed to allocate obj in table\n");
+        return -ENOMEM;
+    }
+
+    if (!host_virt) {
+        /* TODO: This is a guess; it is not clear this allocation is
+         * really needed. */
+        length = TARGET_PAGE_SIZE;
+        addr = (uint64_t)g_malloc(length);
+    } else {
+        mr->user_mr.host_virt = (uint64_t) host_virt;
+        pr_dbg("host_virt=0x%lx\n", mr->user_mr.host_virt);
+        mr->user_mr.length = guest_length;
+        pr_dbg("length=0x%lx\n", guest_length);
+        mr->user_mr.guest_start = guest_start;
+        pr_dbg("guest_start=0x%lx\n", mr->user_mr.guest_start);
+
+        length = mr->user_mr.length;
+        addr = mr->user_mr.host_virt;
+    }
+
+    ret = rdma_backend_create_mr(&mr->backend_mr, &pd->backend_pd, addr, length,
+                                 access_flags);
+    if (ret) {
+        pr_dbg("Fail in rdma_backend_create_mr, err=%d\n", ret);
+        ret = -EIO;
+        goto out_dealloc_mr;
+    }
+
+    if (!host_virt) {
+        *lkey = mr->lkey = rdma_backend_mr_lkey(&mr->backend_mr);
+        *rkey = mr->rkey = rdma_backend_mr_rkey(&mr->backend_mr);
+    } else {
+        /* We keep mr_handle in lkey so send and recv can get the MR ptr */
+        *lkey = *mr_handle;
+        *rkey = -1;
+    }
+
+    mr->pd_handle = pd_handle;
+
+    return 0;
+
+out_dealloc_mr:
+    res_tbl_dealloc(&dev_res->mr_tbl, *mr_handle);
+
+    return ret;
+}
+
+RdmaRmMR *rdma_rm_get_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle)
+{
+    return res_tbl_get(&dev_res->mr_tbl, mr_handle);
+}
+
+void rdma_rm_dealloc_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle)
+{
+    RdmaRmMR *mr = rdma_rm_get_mr(dev_res, mr_handle);
+
+    if (mr) {
+        rdma_backend_destroy_mr(&mr->backend_mr);
+        munmap((void *)mr->user_mr.host_virt, mr->user_mr.length);
+        res_tbl_dealloc(&dev_res->mr_tbl, mr_handle);
+    }
+}
+
+int rdma_rm_alloc_uc(RdmaDeviceResources *dev_res, uint32_t pfn,
+                     uint32_t *uc_handle)
+{
+    RdmaRmUC *uc;
+
+    /* TODO: Need to make sure pfn is between the BAR start address and
+     * bar start + RDMA_BAR2_UAR_SIZE
+    if (pfn > RDMA_BAR2_UAR_SIZE) {
+        pr_err("pfn out of range (%d > %d)\n", pfn, RDMA_BAR2_UAR_SIZE);
+        return -ENOMEM;
+    }
+    */
+
+    uc = res_tbl_alloc(&dev_res->uc_tbl, uc_handle);
+    if (!uc) {
+        return -ENOMEM;
+    }
+
+    return 0;
+}
+
+RdmaRmUC *rdma_rm_get_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle)
+{
+    return res_tbl_get(&dev_res->uc_tbl, uc_handle);
+}
+
+void rdma_rm_dealloc_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle)
+{
+    RdmaRmUC *uc = rdma_rm_get_uc(dev_res, uc_handle);
+
+    if (uc) {
+        res_tbl_dealloc(&dev_res->uc_tbl, uc_handle);
+    }
+}
+
+RdmaRmCQ *rdma_rm_get_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle)
+{
+    return res_tbl_get(&dev_res->cq_tbl, cq_handle);
+}
+
+int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                     uint32_t cqe, uint32_t *cq_handle, void *opaque)
+{
+    int rc;
+    RdmaRmCQ *cq;
+
+    cq = res_tbl_alloc(&dev_res->cq_tbl, cq_handle);
+    if (!cq) {
+        return -ENOMEM;
+    }
+
+    cq->opaque = opaque;
+    cq->notify = false;
+
+    rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
+    if (rc) {
+        rc = -EIO;
+        goto out_dealloc_cq;
+    }
+
+    return 0;
+
+out_dealloc_cq:
+    rdma_rm_dealloc_cq(dev_res, *cq_handle);
+
+    return rc;
+}
+
+void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
+                           bool notify)
+{
+    RdmaRmCQ *cq;
+
+    pr_dbg("cq_handle=%d, notify=0x%x\n", cq_handle, notify);
+
+    cq = rdma_rm_get_cq(dev_res, cq_handle);
+    if (!cq) {
+        return;
+    }
+
+    cq->notify = notify;
+    pr_dbg("notify=%d\n", cq->notify);
+}
+
+void rdma_rm_dealloc_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle)
+{
+    RdmaRmCQ *cq;
+
+    cq = rdma_rm_get_cq(dev_res, cq_handle);
+    if (!cq) {
+        return;
+    }
+
+    rdma_backend_destroy_cq(&cq->backend_cq);
+
+    res_tbl_dealloc(&dev_res->cq_tbl, cq_handle);
+}
+
+RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn)
+{
+    GBytes *key = g_bytes_new(&qpn, sizeof(qpn));
+
+    RdmaRmQP *qp = g_hash_table_lookup(dev_res->qp_hash, key);
+
+    g_bytes_unref(key);
+
+    return qp;
+}
+
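+/*
+ * Allocate a QP in the resource table and in the backend; the QP is then
+ * inserted into qp_hash keyed by its backend qpn, which is the number
+ * reported back to the caller.
+ */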
+int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
+                     uint8_t qp_type, uint32_t max_send_wr,
+                     uint32_t max_send_sge, uint32_t send_cq_handle,
+                     uint32_t max_recv_wr, uint32_t max_recv_sge,
+                     uint32_t recv_cq_handle, void *opaque, uint32_t *qpn)
+{
+    int rc;
+    RdmaRmQP *qp;
+    RdmaRmCQ *scq, *rcq;
+    RdmaRmPD *pd;
+    uint32_t rm_qpn;
+
+    pr_dbg("qp_type=%d\n", qp_type);
+
+    pd = rdma_rm_get_pd(dev_res, pd_handle);
+    if (!pd) {
+        pr_err("Invalid pd handle (%d)\n", pd_handle);
+        return -EINVAL;
+    }
+
+    scq = rdma_rm_get_cq(dev_res, send_cq_handle);
+    rcq = rdma_rm_get_cq(dev_res, recv_cq_handle);
+
+    if (!scq || !rcq) {
+        pr_err("Invalid send_cqn or recv_cqn (%d, %d)\n",
+               send_cq_handle, recv_cq_handle);
+        return -EINVAL;
+    }
+
+    qp = res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
+    if (!qp) {
+        return -ENOMEM;
+    }
+    pr_dbg("rm_qpn=%d\n", rm_qpn);
+
+    qp->qpn = rm_qpn;
+    qp->qp_state = IBV_QPS_RESET;
+    qp->qp_type = qp_type;
+    qp->send_cq_handle = send_cq_handle;
+    qp->recv_cq_handle = recv_cq_handle;
+    qp->opaque = opaque;
+
+    rc = rdma_backend_create_qp(&qp->backend_qp, qp_type, &pd->backend_pd,
+                                &scq->backend_cq, &rcq->backend_cq, max_send_wr,
+                                max_recv_wr, max_send_sge, max_recv_sge);
+    if (rc) {
+        rc = -EIO;
+        goto out_dealloc_qp;
+    }
+
+    *qpn = rdma_backend_qpn(&qp->backend_qp);
+    pr_dbg("rm_qpn=%d, backend_qpn=0x%x\n", rm_qpn, *qpn);
+    g_hash_table_insert(dev_res->qp_hash, g_bytes_new(qpn, sizeof(*qpn)), qp);
+
+    return 0;
+
+out_dealloc_qp:
+    res_tbl_dealloc(&dev_res->qp_tbl, qp->qpn);
+
+    return rc;
+}
+
+int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                      uint32_t qp_handle, uint32_t attr_mask,
+                      union ibv_gid *dgid, uint32_t dqpn,
+                      enum ibv_qp_state qp_state, uint32_t qkey,
+                      uint32_t rq_psn, uint32_t sq_psn)
+{
+    RdmaRmQP *qp;
+    int ret;
+
+    pr_dbg("qpn=%d\n", qp_handle);
+
+    qp = rdma_rm_get_qp(dev_res, qp_handle);
+    if (!qp) {
+        return -EINVAL;
+    }
+
+    pr_dbg("qp_type=%d\n", qp->qp_type);
+    pr_dbg("attr_mask=0x%x\n", attr_mask);
+
+    if (qp->qp_type == IBV_QPT_SMI) {
+        pr_dbg("QP0 unsupported\n");
+        return -EPERM;
+    } else if (qp->qp_type == IBV_QPT_GSI) {
+        pr_dbg("QP1\n");
+        return 0;
+    }
+
+    if (attr_mask & IBV_QP_STATE) {
+        qp->qp_state = qp_state;
+        pr_dbg("qp_state=%d\n", qp->qp_state);
+
+        if (qp->qp_state == IBV_QPS_INIT) {
+            ret = rdma_backend_qp_state_init(backend_dev, &qp->backend_qp,
+                                             qp->qp_type, qkey);
+            if (ret) {
+                return -EIO;
+            }
+        }
+
+        if (qp->qp_state == IBV_QPS_RTR) {
+            ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
+                                            qp->qp_type, dgid, dqpn, rq_psn,
+                                            qkey, attr_mask & IBV_QP_QKEY);
+            if (ret) {
+                return -EIO;
+            }
+        }
+
+        if (qp->qp_state == IBV_QPS_RTS) {
+            ret = rdma_backend_qp_state_rts(&qp->backend_qp, qp->qp_type,
+                                            sq_psn, qkey,
+                                            attr_mask & IBV_QP_QKEY);
+            if (ret) {
+                return -EIO;
+            }
+        }
+    }
+
+    return 0;
+}
+
+void rdma_rm_dealloc_qp(RdmaDeviceResources *dev_res, uint32_t qp_handle)
+{
+    RdmaRmQP *qp;
+    GBytes *key;
+
+    key = g_bytes_new(&qp_handle, sizeof(qp_handle));
+    qp = g_hash_table_lookup(dev_res->qp_hash, key);
+    g_hash_table_remove(dev_res->qp_hash, key);
+    g_bytes_unref(key);
+
+    if (!qp) {
+        return;
+    }
+
+    rdma_backend_destroy_qp(&qp->backend_qp);
+
+    res_tbl_dealloc(&dev_res->qp_tbl, qp->qpn);
+}
+
+void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
+{
+    void **cqe_ctx;
+
+    cqe_ctx = res_tbl_get(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
+    if (!cqe_ctx) {
+        return NULL;
+    }
+
+    pr_dbg("ctx=%p\n", *cqe_ctx);
+
+    return *cqe_ctx;
+}
+
+int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
+                          void *ctx)
+{
+    void **cqe_ctx;
+
+    cqe_ctx = res_tbl_alloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
+    if (!cqe_ctx) {
+        return -ENOMEM;
+    }
+
+    pr_dbg("ctx=%p\n", ctx);
+    *cqe_ctx = ctx;
+
+    return 0;
+}
+
+void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
+{
+    res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
+}
+
+static void destroy_qp_hash_key(gpointer data)
+{
+    g_bytes_unref(data);
+}
+
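+/* Initialize the resource tables; most are sized from the backend device
+ * capabilities. */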
+int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
+                 Error **errp)
+{
+    dev_res->qp_hash = g_hash_table_new_full(g_bytes_hash, g_bytes_equal,
+                                             destroy_qp_hash_key, NULL);
+    if (!dev_res->qp_hash) {
+        return -ENOMEM;
+    }
+
+    res_tbl_init("PD", &dev_res->pd_tbl, dev_attr->max_pd, sizeof(RdmaRmPD));
+    res_tbl_init("CQ", &dev_res->cq_tbl, dev_attr->max_cq, sizeof(RdmaRmCQ));
+    res_tbl_init("MR", &dev_res->mr_tbl, dev_attr->max_mr, sizeof(RdmaRmMR));
+    res_tbl_init("QP", &dev_res->qp_tbl, dev_attr->max_qp, sizeof(RdmaRmQP));
+    res_tbl_init("CQE_CTX", &dev_res->cqe_ctx_tbl, dev_attr->max_qp *
+                       dev_attr->max_qp_wr, sizeof(void *));
+    res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
+
+    return 0;
+}
+
+void rdma_rm_fini(RdmaDeviceResources *dev_res)
+{
+    res_tbl_free(&dev_res->uc_tbl);
+    res_tbl_free(&dev_res->cqe_ctx_tbl);
+    res_tbl_free(&dev_res->qp_tbl);
+    res_tbl_free(&dev_res->cq_tbl);
+    res_tbl_free(&dev_res->mr_tbl);
+    res_tbl_free(&dev_res->pd_tbl);
+    g_hash_table_destroy(dev_res->qp_hash);
+}
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
new file mode 100644
index 0000000000..be95c1b0f4
--- /dev/null
+++ b/hw/rdma/rdma_rm.h
@@ -0,0 +1,69 @@
+/*
+ * RDMA device: Definitions of Resource Manager functions
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMA_RM_H
+#define RDMA_RM_H
+
+#include <qapi/error.h>
+#include "rdma_backend_defs.h"
+#include "rdma_rm_defs.h"
+
+int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
+                 Error **errp);
+void rdma_rm_fini(RdmaDeviceResources *dev_res);
+
+int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                     uint32_t *pd_handle, uint32_t ctx_handle);
+RdmaRmPD *rdma_rm_get_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle);
+void rdma_rm_dealloc_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle);
+
+int rdma_rm_alloc_mr(RdmaDeviceResources *dev_res, uint32_t pd_handle,
+                     uint64_t guest_start, size_t guest_length, void *host_virt,
+                     int access_flags, uint32_t *mr_handle, uint32_t *lkey,
+                     uint32_t *rkey);
+RdmaRmMR *rdma_rm_get_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle);
+void rdma_rm_dealloc_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle);
+
+int rdma_rm_alloc_uc(RdmaDeviceResources *dev_res, uint32_t pfn,
+                     uint32_t *uc_handle);
+RdmaRmUC *rdma_rm_get_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle);
+void rdma_rm_dealloc_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle);
+
+int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                     uint32_t cqe, uint32_t *cq_handle, void *opaque);
+RdmaRmCQ *rdma_rm_get_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle);
+void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
+                           bool notify);
+void rdma_rm_dealloc_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle);
+
+int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
+                     uint8_t qp_type, uint32_t max_send_wr,
+                     uint32_t max_send_sge, uint32_t send_cq_handle,
+                     uint32_t max_recv_wr, uint32_t max_recv_sge,
+                     uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
+RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
+int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                      uint32_t qp_handle, uint32_t attr_mask,
+                      union ibv_gid *dgid, uint32_t dqpn,
+                      enum ibv_qp_state qp_state, uint32_t qkey,
+                      uint32_t rq_psn, uint32_t sq_psn);
+void rdma_rm_dealloc_qp(RdmaDeviceResources *dev_res, uint32_t qp_handle);
+
+int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
+                          void *ctx);
+void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
+void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
+
+#endif
diff --git a/hw/rdma/trace-events b/hw/rdma/trace-events
new file mode 100644
index 0000000000..c4c202e647
--- /dev/null
+++ b/hw/rdma/trace-events
@@ -0,0 +1,5 @@
+# See docs/tracing.txt for syntax documentation.
+
+# hw/rdma/rdma_backend.c
+create_ah_cache_hit(uint64_t subnet, uint64_t net_id) "subnet = 0x%"PRIx64" net_id = 0x%"PRIx64
+create_ah_cache_miss(uint64_t subnet, uint64_t net_id) "subnet = 0x%"PRIx64" net_id = 0x%"PRIx64
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH V11 08/10] hw/rdma: PVRDMA commands and data-path ops
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (6 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 07/10] hw/rdma: Implementation of generic rdma device layers Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 09/10] hw/rdma: Implementation of PVRDMA device Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 10/10] MAINTAINERS: add entry for hw/rdma Marcel Apfelbaum
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

From: Yuval Shaia <yuval.shaia@oracle.com>

First PVRDMA sub-modules - implementation of the PVRDMA device operations:
- PVRDMA commands such as create CQ and create MR.
- Data path QP operations - post_send and post_recv.
- Completion handler.

Reviewed-by: Dotan Barak <dotanb@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 hw/rdma/Makefile.objs         |   2 +
 hw/rdma/vmw/pvrdma.h          | 122 ++++++++
 hw/rdma/vmw/pvrdma_cmd.c      | 673 ++++++++++++++++++++++++++++++++++++++++++
 hw/rdma/vmw/pvrdma_dev_ring.c | 155 ++++++++++
 hw/rdma/vmw/pvrdma_dev_ring.h |  42 +++
 hw/rdma/vmw/pvrdma_qp_ops.c   | 222 ++++++++++++++
 hw/rdma/vmw/pvrdma_qp_ops.h   |  27 ++
 7 files changed, 1243 insertions(+)
 create mode 100644 hw/rdma/vmw/pvrdma.h
 create mode 100644 hw/rdma/vmw/pvrdma_cmd.c
 create mode 100644 hw/rdma/vmw/pvrdma_dev_ring.c
 create mode 100644 hw/rdma/vmw/pvrdma_dev_ring.h
 create mode 100644 hw/rdma/vmw/pvrdma_qp_ops.c
 create mode 100644 hw/rdma/vmw/pvrdma_qp_ops.h

diff --git a/hw/rdma/Makefile.objs b/hw/rdma/Makefile.objs
index 6a59bf0d5b..44a85f687d 100644
--- a/hw/rdma/Makefile.objs
+++ b/hw/rdma/Makefile.objs
@@ -1,3 +1,5 @@
 ifeq ($(CONFIG_RDMA),y)
 obj-$(CONFIG_PCI) += rdma_utils.o rdma_backend.o rdma_rm.o
+obj-$(CONFIG_PCI) += vmw/pvrdma_dev_ring.o vmw/pvrdma_cmd.o \
+                     vmw/pvrdma_qp_ops.o
 endif
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
new file mode 100644
index 0000000000..b05f94a473
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma.h
@@ -0,0 +1,122 @@
+/*
+ * QEMU VMWARE paravirtual RDMA device definitions
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_PVRDMA_H
+#define PVRDMA_PVRDMA_H
+
+#include <hw/pci/pci.h>
+#include <hw/pci/msix.h>
+
+#include "../rdma_backend_defs.h"
+#include "../rdma_rm_defs.h"
+
+#include <standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h>
+#include <standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h>
+#include "pvrdma_dev_ring.h"
+
+/* BARs */
+#define RDMA_MSIX_BAR_IDX    0
+#define RDMA_REG_BAR_IDX     1
+#define RDMA_UAR_BAR_IDX     2
+#define RDMA_BAR0_MSIX_SIZE  (16 * 1024)
+#define RDMA_BAR1_REGS_SIZE  256
+#define RDMA_BAR2_UAR_SIZE   (0x1000 * MAX_UCS) /* each uc gets page */
+
+/* MSIX */
+#define RDMA_MAX_INTRS       3
+#define RDMA_MSIX_TABLE      0x0000
+#define RDMA_MSIX_PBA        0x2000
+
+/* Interrupts Vectors */
+#define INTR_VEC_CMD_RING            0
+#define INTR_VEC_CMD_ASYNC_EVENTS    1
+#define INTR_VEC_CMD_COMPLETION_Q    2
+
+/* HW attributes */
+#define PVRDMA_HW_NAME       "pvrdma"
+#define PVRDMA_HW_VERSION    17
+#define PVRDMA_FW_VERSION    14
+
+typedef struct DSRInfo {
+    dma_addr_t dma;
+    struct pvrdma_device_shared_region *dsr;
+
+    union pvrdma_cmd_req *req;
+    union pvrdma_cmd_resp *rsp;
+
+    struct pvrdma_ring *async_ring_state;
+    PvrdmaRing async;
+
+    struct pvrdma_ring *cq_ring_state;
+    PvrdmaRing cq;
+} DSRInfo;
+
+typedef struct PVRDMADev {
+    PCIDevice parent_obj;
+    MemoryRegion msix;
+    MemoryRegion regs;
+    uint32_t regs_data[RDMA_BAR1_REGS_SIZE];
+    MemoryRegion uar;
+    uint32_t uar_data[RDMA_BAR2_UAR_SIZE];
+    DSRInfo dsr_info;
+    int interrupt_mask;
+    struct ibv_device_attr dev_attr;
+    uint64_t node_guid;
+    char *backend_device_name;
+    uint8_t backend_gid_idx;
+    uint8_t backend_port_num;
+    RdmaBackendDev backend_dev;
+    RdmaDeviceResources rdma_dev_res;
+} PVRDMADev;
+#define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
+
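+/* Access the BAR1 register file as an array of 32-bit words. */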
+static inline int get_reg_val(PVRDMADev *dev, hwaddr addr, uint32_t *val)
+{
+    int idx = addr >> 2;
+
+    if (idx >= RDMA_BAR1_REGS_SIZE) {
+        return -EINVAL;
+    }
+
+    *val = dev->regs_data[idx];
+
+    return 0;
+}
+
+static inline int set_reg_val(PVRDMADev *dev, hwaddr addr, uint32_t val)
+{
+    int idx = addr >> 2;
+
+    if (idx >= RDMA_BAR1_REGS_SIZE) {
+        return -EINVAL;
+    }
+
+    dev->regs_data[idx] = val;
+
+    return 0;
+}
+
+static inline void post_interrupt(PVRDMADev *dev, unsigned vector)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    if (likely(!dev->interrupt_mask)) {
+        msix_notify(pci_dev, vector);
+    }
+}
+
+int execute_command(PVRDMADev *dev);
+
+#endif
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
new file mode 100644
index 0000000000..293dfed29f
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -0,0 +1,673 @@
+/*
+ * QEMU paravirtual RDMA - Command channel
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <qemu/osdep.h>
+#include <qemu/error-report.h>
+#include <cpu.h>
+#include <linux/types.h>
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_ids.h"
+
+#include "../rdma_backend.h"
+#include "../rdma_rm.h"
+#include "../rdma_utils.h"
+
+#include "pvrdma.h"
+#include <standard-headers/rdma/vmw_pvrdma-abi.h>
+
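+/*
+ * Walk the guest page directory and page tables and stitch the referenced
+ * pages into one contiguous host virtual mapping using mremap(); the first
+ * page reserves the full range, the rest are remapped at fixed offsets.
+ */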
+static void *pvrdma_map_to_pdir(PCIDevice *pdev, uint64_t pdir_dma,
+                                uint32_t nchunks, size_t length)
+{
+    uint64_t *dir, *tbl;
+    int tbl_idx, dir_idx, addr_idx;
+    void *host_virt = NULL, *curr_page;
+
+    if (!nchunks) {
+        pr_dbg("nchunks=0\n");
+        return NULL;
+    }
+
+    dir = rdma_pci_dma_map(pdev, pdir_dma, TARGET_PAGE_SIZE);
+    if (!dir) {
+        error_report("PVRDMA: Failed to map to page directory");
+        return NULL;
+    }
+
+    tbl = rdma_pci_dma_map(pdev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        error_report("PVRDMA: Failed to map to page table 0");
+        goto out_unmap_dir;
+    }
+
+    curr_page = rdma_pci_dma_map(pdev, (dma_addr_t)tbl[0], TARGET_PAGE_SIZE);
+    if (!curr_page) {
+        error_report("PVRDMA: Failed to map the first page");
+        goto out_unmap_tbl;
+    }
+
+    host_virt = mremap(curr_page, 0, length, MREMAP_MAYMOVE);
+    if (host_virt == MAP_FAILED) {
+        host_virt = NULL;
+        error_report("PVRDMA: Failed to remap memory for host_virt");
+        goto out_unmap_tbl;
+    }
+
+    rdma_pci_dma_unmap(pdev, curr_page, TARGET_PAGE_SIZE);
+
+    pr_dbg("host_virt=%p\n", host_virt);
+
+    dir_idx = 0;
+    tbl_idx = 1;
+    addr_idx = 1;
+    while (addr_idx < nchunks) {
+        if ((tbl_idx == (TARGET_PAGE_SIZE / sizeof(uint64_t)))) {
+            tbl_idx = 0;
+            dir_idx++;
+            pr_dbg("Mapping to table %d\n", dir_idx);
+            rdma_pci_dma_unmap(pdev, tbl, TARGET_PAGE_SIZE);
+            tbl = rdma_pci_dma_map(pdev, dir[dir_idx], TARGET_PAGE_SIZE);
+            if (!tbl) {
+                error_report("PVRDMA: Failed to map to page table %d", dir_idx);
+                goto out_unmap_host_virt;
+            }
+        }
+
+        pr_dbg("guest_dma[%d]=0x%lx\n", addr_idx, tbl[tbl_idx]);
+
+        curr_page = rdma_pci_dma_map(pdev, (dma_addr_t)tbl[tbl_idx],
+                                     TARGET_PAGE_SIZE);
+        if (!curr_page) {
+            error_report("PVRDMA: Failed to map to page %d, dir %d", tbl_idx,
+                         dir_idx);
+            goto out_unmap_host_virt;
+        }
+
+        mremap(curr_page, 0, TARGET_PAGE_SIZE, MREMAP_MAYMOVE | MREMAP_FIXED,
+               host_virt + TARGET_PAGE_SIZE * addr_idx);
+
+        rdma_pci_dma_unmap(pdev, curr_page, TARGET_PAGE_SIZE);
+
+        addr_idx++;
+
+        tbl_idx++;
+    }
+
+    goto out_unmap_tbl;
+
+out_unmap_host_virt:
+    munmap(host_virt, length);
+    host_virt = NULL;
+
+out_unmap_tbl:
+    rdma_pci_dma_unmap(pdev, tbl, TARGET_PAGE_SIZE);
+
+out_unmap_dir:
+    rdma_pci_dma_unmap(pdev, dir, TARGET_PAGE_SIZE);
+
+    return host_virt;
+}
+
+static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_query_port *cmd = &req->query_port;
+    struct pvrdma_cmd_query_port_resp *resp = &rsp->query_port_resp;
+    struct pvrdma_port_attr attrs = {0};
+
+    pr_dbg("port=%d\n", cmd->port_num);
+
+    if (rdma_backend_query_port(&dev->backend_dev,
+                                (struct ibv_port_attr *)&attrs)) {
+        return -ENOMEM;
+    }
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
+    resp->hdr.err = 0;
+
+    resp->attrs.state = attrs.state;
+    resp->attrs.max_mtu = attrs.max_mtu;
+    resp->attrs.active_mtu = attrs.active_mtu;
+    resp->attrs.phys_state = attrs.phys_state;
+    resp->attrs.gid_tbl_len = MIN(MAX_PORT_GIDS, attrs.gid_tbl_len);
+    resp->attrs.max_msg_sz = 1024;
+    resp->attrs.pkey_tbl_len = MIN(MAX_PORT_PKEYS, attrs.pkey_tbl_len);
+    resp->attrs.active_width = 1;
+    resp->attrs.active_speed = 1;
+
+    return 0;
+}
+
+static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_query_pkey *cmd = &req->query_pkey;
+    struct pvrdma_cmd_query_pkey_resp *resp = &rsp->query_pkey_resp;
+
+    pr_dbg("port=%d\n", cmd->port_num);
+    pr_dbg("index=%d\n", cmd->index);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_QUERY_PKEY_RESP;
+    resp->hdr.err = 0;
+
+    resp->pkey = 0x7FFF;
+    pr_dbg("pkey=0x%x\n", resp->pkey);
+
+    return 0;
+}
+
+static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_pd *cmd = &req->create_pd;
+    struct pvrdma_cmd_create_pd_resp *resp = &rsp->create_pd_resp;
+
+    pr_dbg("context=0x%x\n", cmd->ctx_handle ? cmd->ctx_handle : 0);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_PD_RESP;
+    resp->hdr.err = rdma_rm_alloc_pd(&dev->rdma_dev_res, &dev->backend_dev,
+                                     &resp->pd_handle, cmd->ctx_handle);
+
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_pd *cmd = &req->destroy_pd;
+
+    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
+
+    rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
+
+    return 0;
+}
+
+static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_mr *cmd = &req->create_mr;
+    struct pvrdma_cmd_create_mr_resp *resp = &rsp->create_mr_resp;
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    void *host_virt = NULL;
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_MR_RESP;
+
+    pr_dbg("pd_handle=%d\n", cmd->pd_handle);
+    pr_dbg("access_flags=0x%x\n", cmd->access_flags);
+    pr_dbg("flags=0x%x\n", cmd->flags);
+
+    if (!(cmd->flags & PVRDMA_MR_FLAG_DMA)) {
+        host_virt = pvrdma_map_to_pdir(pci_dev, cmd->pdir_dma, cmd->nchunks,
+                                       cmd->length);
+        if (!host_virt) {
+            pr_dbg("Failed to map to pdir\n");
+            resp->hdr.err = -EINVAL;
+            goto out;
+        }
+    }
+
+    resp->hdr.err = rdma_rm_alloc_mr(&dev->rdma_dev_res, cmd->pd_handle,
+                                     cmd->start, cmd->length, host_virt,
+                                     cmd->access_flags, &resp->mr_handle,
+                                     &resp->lkey, &resp->rkey);
+    if (!resp->hdr.err) {
+        munmap(host_virt, cmd->length);
+    }
+
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_mr *cmd = &req->destroy_mr;
+
+    pr_dbg("mr_handle=%d\n", cmd->mr_handle);
+
+    rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
+
+    return 0;
+}
+
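+/*
+ * Map the guest CQ ring: the first page referenced by the page table holds
+ * the ring state, the remaining nchunks - 1 pages hold the CQE buffer.
+ */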
+static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
+                          uint64_t pdir_dma, uint32_t nchunks, uint32_t cqe)
+{
+    uint64_t *dir = NULL, *tbl = NULL;
+    PvrdmaRing *r;
+    int rc = -EINVAL;
+    char ring_name[MAX_RING_NAME_SZ];
+
+    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)pdir_dma);
+    dir = rdma_pci_dma_map(pci_dev, pdir_dma, TARGET_PAGE_SIZE);
+    if (!dir) {
+        pr_dbg("Failed to map to CQ page directory\n");
+        goto out;
+    }
+
+    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        pr_dbg("Failed to map to CQ page table\n");
+        goto out;
+    }
+
+    r = g_malloc(sizeof(*r));
+    *ring = r;
+
+    r->ring_state = (struct pvrdma_ring *)
+        rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
+
+    if (!r->ring_state) {
+        pr_dbg("Failed to map to CQ ring state\n");
+        goto out_free_ring;
+    }
+
+    sprintf(ring_name, "cq_ring_%lx", pdir_dma);
+    rc = pvrdma_ring_init(r, ring_name, pci_dev, &r->ring_state[1],
+                          cqe, sizeof(struct pvrdma_cqe),
+                          /* first page is ring state */
+                          (dma_addr_t *)&tbl[1], nchunks - 1);
+    if (rc) {
+        goto out_unmap_ring_state;
+    }
+
+    goto out;
+
+out_unmap_ring_state:
+    /* ring_state was in slot 1, not 0 so need to jump back */
+    rdma_pci_dma_unmap(pci_dev, --r->ring_state, TARGET_PAGE_SIZE);
+
+out_free_ring:
+    g_free(r);
+
+out:
+    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
+    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
+
+    return rc;
+}
+
+static int create_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_cq *cmd = &req->create_cq;
+    struct pvrdma_cmd_create_cq_resp *resp = &rsp->create_cq_resp;
+    PvrdmaRing *ring = NULL;
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_CQ_RESP;
+
+    resp->cqe = cmd->cqe;
+
+    resp->hdr.err = create_cq_ring(PCI_DEVICE(dev), &ring, cmd->pdir_dma,
+                                   cmd->nchunks, cmd->cqe);
+    if (resp->hdr.err) {
+        goto out;
+    }
+
+    pr_dbg("ring=%p\n", ring);
+
+    resp->hdr.err = rdma_rm_alloc_cq(&dev->rdma_dev_res, &dev->backend_dev,
+                                     cmd->cqe, &resp->cq_handle, ring);
+    resp->cqe = cmd->cqe;
+
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_cq *cmd = &req->destroy_cq;
+    RdmaRmCQ *cq;
+    PvrdmaRing *ring;
+
+    pr_dbg("cq_handle=%d\n", cmd->cq_handle);
+
+    cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
+    if (!cq) {
+        pr_dbg("Invalid CQ handle\n");
+        return -EINVAL;
+    }
+
+    ring = (PvrdmaRing *)cq->opaque;
+    pvrdma_ring_free(ring);
+    /* ring_state was in slot 1, not 0 so need to jump back */
+    rdma_pci_dma_unmap(PCI_DEVICE(dev), --ring->ring_state, TARGET_PAGE_SIZE);
+    g_free(ring);
+
+    rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
+
+    return 0;
+}
+
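+/*
+ * Map the guest QP rings: a single state page is shared by the send and
+ * recv rings, followed by 'spages' send-ring pages and 'rpages' recv-ring
+ * pages.
+ */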
+static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
+                           PvrdmaRing **rings, uint32_t scqe, uint32_t smax_sge,
+                           uint32_t spages, uint32_t rcqe, uint32_t rmax_sge,
+                           uint32_t rpages)
+{
+    uint64_t *dir = NULL, *tbl = NULL;
+    PvrdmaRing *sr, *rr;
+    int rc = -EINVAL;
+    char ring_name[MAX_RING_NAME_SZ];
+    uint32_t wqe_sz;
+
+    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)pdir_dma);
+    dir = rdma_pci_dma_map(pci_dev, pdir_dma, TARGET_PAGE_SIZE);
+    if (!dir) {
+        pr_dbg("Failed to map to CQ page directory\n");
+        goto out;
+    }
+
+    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        pr_dbg("Failed to map to CQ page table\n");
+        goto out;
+    }
+
+    sr = g_malloc(2 * sizeof(*rr));
+    rr = &sr[1];
+    pr_dbg("sring=%p\n", sr);
+    pr_dbg("rring=%p\n", rr);
+
+    *rings = sr;
+
+    pr_dbg("scqe=%d\n", scqe);
+    pr_dbg("smax_sge=%d\n", smax_sge);
+    pr_dbg("spages=%d\n", spages);
+    pr_dbg("rcqe=%d\n", rcqe);
+    pr_dbg("rmax_sge=%d\n", rmax_sge);
+    pr_dbg("rpages=%d\n", rpages);
+
+    /* Create send ring */
+    sr->ring_state = (struct pvrdma_ring *)
+        rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
+    if (!sr->ring_state) {
+        pr_dbg("Failed to map to CQ ring state\n");
+        goto out_free_sr_mem;
+    }
+
+    wqe_sz = pow2ceil(sizeof(struct pvrdma_sq_wqe_hdr) +
+                      sizeof(struct pvrdma_sge) * smax_sge - 1);
+
+    sprintf(ring_name, "qp_sring_%lx", pdir_dma);
+    rc = pvrdma_ring_init(sr, ring_name, pci_dev, sr->ring_state,
+                          scqe, wqe_sz, (dma_addr_t *)&tbl[1], spages);
+    if (rc) {
+        goto out_unmap_ring_state;
+    }
+
+    /* Create recv ring */
+    rr->ring_state = &sr->ring_state[1];
+    wqe_sz = pow2ceil(sizeof(struct pvrdma_rq_wqe_hdr) +
+                      sizeof(struct pvrdma_sge) * rmax_sge - 1);
+    sprintf(ring_name, "qp_rring_%lx", pdir_dma);
+    rc = pvrdma_ring_init(rr, ring_name, pci_dev, rr->ring_state,
+                          rcqe, wqe_sz, (dma_addr_t *)&tbl[1 + spages], rpages);
+    if (rc) {
+        goto out_free_sr;
+    }
+
+    goto out;
+
+out_free_sr:
+    pvrdma_ring_free(sr);
+
+out_unmap_ring_state:
+    rdma_pci_dma_unmap(pci_dev, sr->ring_state, TARGET_PAGE_SIZE);
+
+out_free_sr_mem:
+    g_free(sr);
+
+out:
+    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
+    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
+
+    return rc;
+}
+
+static int create_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_qp *cmd = &req->create_qp;
+    struct pvrdma_cmd_create_qp_resp *resp = &rsp->create_qp_resp;
+    PvrdmaRing *rings = NULL;
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_QP_RESP;
+
+    pr_dbg("total_chunks=%d\n", cmd->total_chunks);
+    pr_dbg("send_chunks=%d\n", cmd->send_chunks);
+
+    resp->hdr.err = create_qp_rings(PCI_DEVICE(dev), cmd->pdir_dma, &rings,
+                                    cmd->max_send_wr, cmd->max_send_sge,
+                                    cmd->send_chunks, cmd->max_recv_wr,
+                                    cmd->max_recv_sge, cmd->total_chunks -
+                                    cmd->send_chunks - 1);
+    if (resp->hdr.err) {
+        goto out;
+    }
+
+    pr_dbg("rings=%p\n", rings);
+
+    resp->hdr.err = rdma_rm_alloc_qp(&dev->rdma_dev_res, cmd->pd_handle,
+                                     cmd->qp_type, cmd->max_send_wr,
+                                     cmd->max_send_sge, cmd->send_cq_handle,
+                                     cmd->max_recv_wr, cmd->max_recv_sge,
+                                     cmd->recv_cq_handle, rings, &resp->qpn);
+
+    resp->max_send_wr = cmd->max_send_wr;
+    resp->max_recv_wr = cmd->max_recv_wr;
+    resp->max_send_sge = cmd->max_send_sge;
+    resp->max_recv_sge = cmd->max_recv_sge;
+    resp->max_inline_data = cmd->max_inline_data;
+
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
+}
+
+static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_modify_qp *cmd = &req->modify_qp;
+
+    pr_dbg("qp_handle=%d\n", cmd->qp_handle);
+
+    memset(rsp, 0, sizeof(*rsp));
+    rsp->hdr.response = cmd->hdr.response;
+    rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
+
+    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
+                                 cmd->qp_handle, cmd->attr_mask,
+                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
+                                 cmd->attrs.dest_qp_num, cmd->attrs.qp_state,
+                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
+                                 cmd->attrs.sq_psn);
+
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
+}
+
+static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_qp *cmd = &req->destroy_qp;
+    RdmaRmQP *qp;
+    PvrdmaRing *ring;
+
+    qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
+    if (!qp) {
+        pr_dbg("Invalid QP handle\n");
+        return -EINVAL;
+    }
+
+    rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
+
+    ring = (PvrdmaRing *)qp->opaque;
+    pr_dbg("sring=%p\n", &ring[0]);
+    pvrdma_ring_free(&ring[0]);
+    pr_dbg("rring=%p\n", &ring[1]);
+    pvrdma_ring_free(&ring[1]);
+
+    rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
+    g_free(ring);
+
+    return 0;
+}
+
+static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                       union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
+#ifdef PVRDMA_DEBUG
+    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
+    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
+#endif
+
+    pr_dbg("index=%d\n", cmd->index);
+
+    if (cmd->index >= MAX_PORT_GIDS) {
+        return -EINVAL;
+    }
+
+    pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
+           (long long unsigned int)be64_to_cpu(*subnet),
+           (long long unsigned int)be64_to_cpu(*if_id));
+
+    /* Driver forces to one port only */
+    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
+           sizeof(cmd->new_gid));
+
+    /* TODO: Since the driver stores node_guid at load_dsr phase, this
+     * assignment is not relevant; we need to figure out a way to
+     * retrieve the MAC of our netdev */
+    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
+    pr_dbg("dev->node_guid=0x%llx\n",
+           (long long unsigned int)be64_to_cpu(dev->node_guid));
+
+    return 0;
+}
+
+static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                        union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
+
+    pr_dbg("clear index %d\n", cmd->index);
+
+    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
+           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
+
+    return 0;
+}
+
+static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                     union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_create_uc *cmd = &req->create_uc;
+    struct pvrdma_cmd_create_uc_resp *resp = &rsp->create_uc_resp;
+
+    pr_dbg("pfn=%d\n", cmd->pfn);
+
+    memset(resp, 0, sizeof(*resp));
+    resp->hdr.response = cmd->hdr.response;
+    resp->hdr.ack = PVRDMA_CMD_CREATE_UC_RESP;
+    resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
+                                     &resp->ctx_handle);
+
+    pr_dbg("ret=%d\n", resp->hdr.err);
+
+    return 0;
+}
+
+static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
+                      union pvrdma_cmd_resp *rsp)
+{
+    struct pvrdma_cmd_destroy_uc *cmd = &req->destroy_uc;
+
+    pr_dbg("ctx_handle=%d\n", cmd->ctx_handle);
+
+    rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
+
+    return 0;
+}
+
+struct cmd_handler {
+    uint32_t cmd;
+    int (*exec)(PVRDMADev *dev, union pvrdma_cmd_req *req,
+            union pvrdma_cmd_resp *rsp);
+};
+
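+/* Command handlers, indexed directly by the command code taken from the
+ * request slot */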
+static struct cmd_handler cmd_handlers[] = {
+    {PVRDMA_CMD_QUERY_PORT, query_port},
+    {PVRDMA_CMD_QUERY_PKEY, query_pkey},
+    {PVRDMA_CMD_CREATE_PD, create_pd},
+    {PVRDMA_CMD_DESTROY_PD, destroy_pd},
+    {PVRDMA_CMD_CREATE_MR, create_mr},
+    {PVRDMA_CMD_DESTROY_MR, destroy_mr},
+    {PVRDMA_CMD_CREATE_CQ, create_cq},
+    {PVRDMA_CMD_RESIZE_CQ, NULL},
+    {PVRDMA_CMD_DESTROY_CQ, destroy_cq},
+    {PVRDMA_CMD_CREATE_QP, create_qp},
+    {PVRDMA_CMD_MODIFY_QP, modify_qp},
+    {PVRDMA_CMD_QUERY_QP, NULL},
+    {PVRDMA_CMD_DESTROY_QP, destroy_qp},
+    {PVRDMA_CMD_CREATE_UC, create_uc},
+    {PVRDMA_CMD_DESTROY_UC, destroy_uc},
+    {PVRDMA_CMD_CREATE_BIND, create_bind},
+    {PVRDMA_CMD_DESTROY_BIND, destroy_bind},
+};
+
+int execute_command(PVRDMADev *dev)
+{
+    int err = 0xFFFF;
+    DSRInfo *dsr_info;
+
+    dsr_info = &dev->dsr_info;
+
+    pr_dbg("cmd=%d\n", dsr_info->req->hdr.cmd);
+    if (dsr_info->req->hdr.cmd >= ARRAY_SIZE(cmd_handlers)) {
+        pr_dbg("Unsupported command\n");
+        goto out;
+    }
+
+    if (!cmd_handlers[dsr_info->req->hdr.cmd].exec) {
+        pr_dbg("Unsupported command (not implemented yet)\n");
+        goto out;
+    }
+
+    err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
+                            dsr_info->rsp);
+out:
+    set_reg_val(dev, PVRDMA_REG_ERR, err);
+    post_interrupt(dev, INTR_VEC_CMD_RING);
+
+    return (err == 0) ? 0 : -EINVAL;
+}
diff --git a/hw/rdma/vmw/pvrdma_dev_ring.c b/hw/rdma/vmw/pvrdma_dev_ring.c
new file mode 100644
index 0000000000..ec309dad55
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma_dev_ring.c
@@ -0,0 +1,155 @@
+/*
+ * QEMU paravirtual RDMA - Device rings
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <qemu/osdep.h>
+#include <hw/pci/pci.h>
+#include <cpu.h>
+
+#include "../rdma_utils.h"
+#include <standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_ring.h>
+#include "pvrdma_dev_ring.h"
+
+int pvrdma_ring_init(PvrdmaRing *ring, const char *name, PCIDevice *dev,
+                     struct pvrdma_ring *ring_state, uint32_t max_elems,
+                     size_t elem_sz, dma_addr_t *tbl, dma_addr_t npages)
+{
+    int i;
+    int rc = 0;
+
+    strncpy(ring->name, name, MAX_RING_NAME_SZ);
+    ring->name[MAX_RING_NAME_SZ - 1] = 0;
+    pr_dbg("Initializing %s ring\n", ring->name);
+    ring->dev = dev;
+    ring->ring_state = ring_state;
+    ring->max_elems = max_elems;
+    ring->elem_sz = elem_sz;
+    pr_dbg("ring->elem_sz=%ld\n", ring->elem_sz);
+    pr_dbg("npages=%ld\n", npages);
+    /* TODO: Decide whether we want to reset the driver-initialized ring state
+    atomic_set(&ring->ring_state->prod_tail, 0);
+    atomic_set(&ring->ring_state->cons_head, 0);
+    */
+    ring->npages = npages;
+    ring->pages = g_malloc(npages * sizeof(void *));
+
+    for (i = 0; i < npages; i++) {
+        if (!tbl[i]) {
+            pr_err("npages=%ld but tbl[%d] is NULL\n", (long)npages, i);
+            continue;
+        }
+
+        ring->pages[i] = rdma_pci_dma_map(dev, tbl[i], TARGET_PAGE_SIZE);
+        if (!ring->pages[i]) {
+            rc = -ENOMEM;
+            pr_dbg("Failed to map to page %d\n", i);
+            goto out_free;
+        }
+        memset(ring->pages[i], 0, TARGET_PAGE_SIZE);
+    }
+
+    goto out;
+
+out_free:
+    while (i--) {
+        rdma_pci_dma_unmap(dev, ring->pages[i], TARGET_PAGE_SIZE);
+    }
+    g_free(ring->pages);
+
+out:
+    return rc;
+}
+
+void *pvrdma_ring_next_elem_read(PvrdmaRing *ring)
+{
+    unsigned int idx = 0, offset;
+
+    /*
+    pr_dbg("%s: t=%d, h=%d\n", ring->name, ring->ring_state->prod_tail,
+           ring->ring_state->cons_head);
+    */
+
+    if (!pvrdma_idx_ring_has_data(ring->ring_state, ring->max_elems, &idx)) {
+        pr_dbg("No more data in ring\n");
+        return NULL;
+    }
+
+    offset = idx * ring->elem_sz;
+    /*
+    pr_dbg("idx=%d\n", idx);
+    pr_dbg("offset=%d\n", offset);
+    */
+    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
+}
+
+void pvrdma_ring_read_inc(PvrdmaRing *ring)
+{
+    pvrdma_idx_ring_inc(&ring->ring_state->cons_head, ring->max_elems);
+    /*
+    pr_dbg("%s: t=%d, h=%d, m=%ld\n", ring->name,
+           ring->ring_state->prod_tail, ring->ring_state->cons_head,
+           ring->max_elems);
+    */
+}
+
+void *pvrdma_ring_next_elem_write(PvrdmaRing *ring)
+{
+    unsigned int idx, offset, tail;
+
+    /*
+    pr_dbg("%s: t=%d, h=%d\n", ring->name, ring->ring_state->prod_tail,
+           ring->ring_state->cons_head);
+    */
+
+    if (!pvrdma_idx_ring_has_space(ring->ring_state, ring->max_elems, &tail)) {
+        pr_dbg("CQ is full\n");
+        return NULL;
+    }
+
+    idx = pvrdma_idx(&ring->ring_state->prod_tail, ring->max_elems);
+    /* TODO: tail == idx */
+
+    offset = idx * ring->elem_sz;
+    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
+}
+
+void pvrdma_ring_write_inc(PvrdmaRing *ring)
+{
+    pvrdma_idx_ring_inc(&ring->ring_state->prod_tail, ring->max_elems);
+    /*
+    pr_dbg("%s: t=%d, h=%d, m=%ld\n", ring->name,
+           ring->ring_state->prod_tail, ring->ring_state->cons_head,
+           ring->max_elems);
+    */
+}
+
+void pvrdma_ring_free(PvrdmaRing *ring)
+{
+    if (!ring) {
+        return;
+    }
+
+    if (!ring->pages) {
+        return;
+    }
+
+    pr_dbg("ring->npages=%d\n", ring->npages);
+    while (ring->npages--) {
+        rdma_pci_dma_unmap(ring->dev, ring->pages[ring->npages],
+                           TARGET_PAGE_SIZE);
+    }
+
+    g_free(ring->pages);
+    ring->pages = NULL;
+}
diff --git a/hw/rdma/vmw/pvrdma_dev_ring.h b/hw/rdma/vmw/pvrdma_dev_ring.h
new file mode 100644
index 0000000000..02a590b86d
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma_dev_ring.h
@@ -0,0 +1,42 @@
+/*
+ * QEMU VMWARE paravirtual RDMA ring utilities
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_DEV_RING_H
+#define PVRDMA_DEV_RING_H
+
+#include <qemu/typedefs.h>
+
+#define MAX_RING_NAME_SZ 32
+
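+/*
+ * Device-side state of one guest ring: pages[] holds the npages guest pages
+ * that back the ring elements, elem_sz bytes per element, max_elems in all.
+ */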
+typedef struct PvrdmaRing {
+    char name[MAX_RING_NAME_SZ];
+    PCIDevice *dev;
+    uint32_t max_elems;
+    size_t elem_sz;
+    struct pvrdma_ring *ring_state; /* used only for unmap */
+    int npages;
+    void **pages;
+} PvrdmaRing;
+
+int pvrdma_ring_init(PvrdmaRing *ring, const char *name, PCIDevice *dev,
+                     struct pvrdma_ring *ring_state, uint32_t max_elems,
+                     size_t elem_sz, dma_addr_t *tbl, dma_addr_t npages);
+void *pvrdma_ring_next_elem_read(PvrdmaRing *ring);
+void pvrdma_ring_read_inc(PvrdmaRing *ring);
+void *pvrdma_ring_next_elem_write(PvrdmaRing *ring);
+void pvrdma_ring_write_inc(PvrdmaRing *ring);
+void pvrdma_ring_free(PvrdmaRing *ring);
+
+#endif
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
new file mode 100644
index 0000000000..f0a1f9eb02
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -0,0 +1,222 @@
+/*
+ * QEMU paravirtual RDMA - QP implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <qemu/osdep.h>
+
+#include "../rdma_utils.h"
+#include "../rdma_rm.h"
+#include "../rdma_backend.h"
+
+#include "pvrdma.h"
+#include <standard-headers/rdma/vmw_pvrdma-abi.h>
+#include "pvrdma_qp_ops.h"
+
+typedef struct CompHandlerCtx {
+    PVRDMADev *dev;
+    uint32_t cq_handle;
+    struct pvrdma_cqe cqe;
+} CompHandlerCtx;
+
+/* Send Queue WQE */
+typedef struct PvrdmaSqWqe {
+    struct pvrdma_sq_wqe_hdr hdr;
+    struct pvrdma_sge sge[0];
+} PvrdmaSqWqe;
+
+/* Recv Queue WQE */
+typedef struct PvrdmaRqWqe {
+    struct pvrdma_rq_wqe_hdr hdr;
+    struct pvrdma_sge sge[0];
+} PvrdmaRqWqe;
+
+/*
+ * 1. Put CQE on send CQ ring
+ * 2. Put CQ number on dsr completion ring
+ * 3. Interrupt host
+ */
+static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
+                           struct pvrdma_cqe *cqe)
+{
+    struct pvrdma_cqe *cqe1;
+    struct pvrdma_cqne *cqne;
+    PvrdmaRing *ring;
+    RdmaRmCQ *cq = rdma_rm_get_cq(&dev->rdma_dev_res, cq_handle);
+
+    if (unlikely(!cq)) {
+        pr_dbg("Invalid cqn %d\n", cq_handle);
+        return -EINVAL;
+    }
+
+    ring = (PvrdmaRing *)cq->opaque;
+    pr_dbg("ring=%p\n", ring);
+
+    /* Step #1: Put CQE on CQ ring */
+    pr_dbg("Writing CQE\n");
+    cqe1 = pvrdma_ring_next_elem_write(ring);
+    if (unlikely(!cqe1)) {
+        return -EINVAL;
+    }
+
+    cqe1->wr_id = cqe->wr_id;
+    cqe1->qp = cqe->qp;
+    cqe1->opcode = cqe->opcode;
+    cqe1->status = cqe->status;
+    cqe1->vendor_err = cqe->vendor_err;
+
+    pvrdma_ring_write_inc(ring);
+
+    /* Step #2: Put CQ number on dsr completion ring */
+    pr_dbg("Writing CQNE\n");
+    cqne = pvrdma_ring_next_elem_write(&dev->dsr_info.cq);
+    if (unlikely(!cqne)) {
+        return -EINVAL;
+    }
+
+    cqne->info = cq_handle;
+    pvrdma_ring_write_inc(&dev->dsr_info.cq);
+
+    pr_dbg("cq->notify=%d\n", cq->notify);
+    if (cq->notify) {
+        cq->notify = false;
+        post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
+    }
+
+    return 0;
+}
+
+static void pvrdma_qp_ops_comp_handler(int status, unsigned int vendor_err,
+                                       void *ctx)
+{
+    CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
+
+    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
+    pr_dbg("wr_id=%ld\n", comp_ctx->cqe.wr_id);
+    pr_dbg("status=%d\n", status);
+    pr_dbg("vendor_err=0x%x\n", vendor_err);
+    comp_ctx->cqe.status = status;
+    comp_ctx->cqe.vendor_err = vendor_err;
+    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
+    g_free(ctx);
+}
+
+void pvrdma_qp_ops_fini(void)
+{
+    rdma_backend_unregister_comp_handler();
+}
+
+int pvrdma_qp_ops_init(void)
+{
+    rdma_backend_register_comp_handler(pvrdma_qp_ops_comp_handler);
+
+    return 0;
+}
+
+int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
+{
+    RdmaRmQP *qp;
+    PvrdmaSqWqe *wqe;
+    PvrdmaRing *ring;
+
+    pr_dbg("qp_handle=%d\n", qp_handle);
+
+    qp = rdma_rm_get_qp(&dev->rdma_dev_res, qp_handle);
+    if (unlikely(!qp)) {
+        return -EINVAL;
+    }
+
+    ring = (PvrdmaRing *)qp->opaque;
+    pr_dbg("sring=%p\n", ring);
+
+    wqe = (struct PvrdmaSqWqe *)pvrdma_ring_next_elem_read(ring);
+    while (wqe) {
+        CompHandlerCtx *comp_ctx;
+
+        pr_dbg("wr_id=%ld\n", wqe->hdr.wr_id);
+
+        /* Prepare CQE */
+        comp_ctx = g_malloc(sizeof(CompHandlerCtx));
+        comp_ctx->dev = dev;
+        comp_ctx->cq_handle = qp->send_cq_handle;
+        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cqe.opcode = wqe->hdr.opcode;
+
+        rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
+                               (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
+                               (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
+                               wqe->hdr.wr.ud.remote_qpn,
+                               wqe->hdr.wr.ud.remote_qkey, comp_ctx);
+
+        pvrdma_ring_read_inc(ring);
+
+        wqe = pvrdma_ring_next_elem_read(ring);
+    }
+
+    return 0;
+}
+
+int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
+{
+    RdmaRmQP *qp;
+    PvrdmaRqWqe *wqe;
+    PvrdmaRing *ring;
+
+    pr_dbg("qp_handle=%d\n", qp_handle);
+
+    qp = rdma_rm_get_qp(&dev->rdma_dev_res, qp_handle);
+    if (unlikely(!qp)) {
+        return -EINVAL;
+    }
+
+    ring = &((PvrdmaRing *)qp->opaque)[1];
+    pr_dbg("rring=%p\n", ring);
+
+    wqe = (struct PvrdmaRqWqe *)pvrdma_ring_next_elem_read(ring);
+    while (wqe) {
+        CompHandlerCtx *comp_ctx;
+
+        pr_dbg("wr_id=%ld\n", wqe->hdr.wr_id);
+
+        /* Prepare CQE */
+        comp_ctx = g_malloc(sizeof(CompHandlerCtx));
+        comp_ctx->dev = dev;
+        comp_ctx->cq_handle = qp->recv_cq_handle;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+
+        rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
+                               &qp->backend_qp, qp->qp_type,
+                               (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
+                               comp_ctx);
+
+        pvrdma_ring_read_inc(ring);
+
+        wqe = pvrdma_ring_next_elem_read(ring);
+    }
+
+    return 0;
+}
+
+void pvrdma_cq_poll(RdmaDeviceResources *dev_res, uint32_t cq_handle)
+{
+    RdmaRmCQ *cq;
+
+    cq = rdma_rm_get_cq(dev_res, cq_handle);
+    if (!cq) {
+        pr_dbg("Invalid CQ# %d\n", cq_handle);
+        return;
+    }
+
+    rdma_backend_poll_cq(dev_res, &cq->backend_cq);
+}
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.h b/hw/rdma/vmw/pvrdma_qp_ops.h
new file mode 100644
index 0000000000..ac46bf7fdf
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma_qp_ops.h
@@ -0,0 +1,27 @@
+/*
+ * QEMU VMWARE paravirtual RDMA QP Operations
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PVRDMA_QP_H
+#define PVRDMA_QP_H
+
+#include "pvrdma.h"
+
+int pvrdma_qp_ops_init(void);
+void pvrdma_qp_ops_fini(void);
+int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle);
+int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle);
+void pvrdma_cq_poll(RdmaDeviceResources *dev_res, uint32_t cq_handle);
+
+#endif
-- 
2.13.5


* [Qemu-devel] [PATCH V11 09/10] hw/rdma: Implementation of PVRDMA device
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (7 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 08/10] hw/rdma: PVRDMA commands and data-path ops Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 10/10] MAINTAINERS: add entry for hw/rdma Marcel Apfelbaum
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

From: Yuval Shaia <yuval.shaia@oracle.com>

PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver AS IS, with no need for any
special guest modifications.

While it is compliant with the VMware device, it can also communicate
with bare-metal RDMA-enabled machines; it does not require an RDMA HCA
in the host and can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit; migration support, while not yet implemented, will be
possible with some HW assistance.
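
A rough command-line sketch of such a setup (the "pvrdma" device type
name and the memdev wiring are shown only for illustration; the
backend-* properties are the ones added by this patch, and "rxe0"
stands for whatever Soft-RoCE/RDMA device exists on the host):

    qemu-system-x86_64 ... \
        -object memory-backend-ram,id=mb1,size=1G,share=on \
        -numa node,memdev=mb1 \
        -device pvrdma,backend-dev=rxe0,backend-port=1,backend-gid-idx=0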

The implementation is divided into two components: general rdma
functions and structures, and pvRDMA-specific ones.

The second PVRDMA sub-module covers the interaction with the PCI layer:
- Device configuration and setup (MSIX, BARs etc).
- Setup of DSR (Device Shared Resources)
- Setup of device ring.
- Device management.

Reviewed-by: Dotan Barak <dotanb@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 Makefile.objs             |   1 +
 hw/rdma/Makefile.objs     |   2 +-
 hw/rdma/vmw/pvrdma_main.c | 670 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/rdma/vmw/trace-events  |   5 +
 include/hw/pci/pci_ids.h  |   3 +
 5 files changed, 680 insertions(+), 1 deletion(-)
 create mode 100644 hw/rdma/vmw/pvrdma_main.c
 create mode 100644 hw/rdma/vmw/trace-events

diff --git a/Makefile.objs b/Makefile.objs
index f3a3d28304..0b3c630719 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -131,6 +131,7 @@ trace-events-subdirs += hw/char
 trace-events-subdirs += hw/intc
 trace-events-subdirs += hw/net
 trace-events-subdirs += hw/rdma
+trace-events-subdirs += hw/rdma/vmw
 trace-events-subdirs += hw/virtio
 trace-events-subdirs += hw/audio
 trace-events-subdirs += hw/misc
diff --git a/hw/rdma/Makefile.objs b/hw/rdma/Makefile.objs
index 44a85f687d..3504c39d21 100644
--- a/hw/rdma/Makefile.objs
+++ b/hw/rdma/Makefile.objs
@@ -1,5 +1,5 @@
 ifeq ($(CONFIG_RDMA),y)
 obj-$(CONFIG_PCI) += rdma_utils.o rdma_backend.o rdma_rm.o
 obj-$(CONFIG_PCI) += vmw/pvrdma_dev_ring.o vmw/pvrdma_cmd.o \
-                     vmw/pvrdma_qp_ops.o
+                     vmw/pvrdma_qp_ops.o vmw/pvrdma_main.o
 endif
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
new file mode 100644
index 0000000000..99787812ba
--- /dev/null
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -0,0 +1,670 @@
+/*
+ * QEMU paravirtual RDMA
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <qemu/osdep.h>
+#include <qapi/error.h>
+#include <hw/hw.h>
+#include <hw/pci/pci.h>
+#include <hw/pci/pci_ids.h>
+#include <hw/pci/msi.h>
+#include <hw/pci/msix.h>
+#include <hw/qdev-core.h>
+#include <hw/qdev-properties.h>
+#include <cpu.h>
+#include "trace.h"
+
+#include "../rdma_rm.h"
+#include "../rdma_backend.h"
+#include "../rdma_utils.h"
+
+#include <infiniband/verbs.h>
+#include "pvrdma.h"
+#include <standard-headers/rdma/vmw_pvrdma-abi.h>
+#include <standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h>
+#include "pvrdma_qp_ops.h"
+
+static Property pvrdma_dev_properties[] = {
+    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
+    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
+    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
+    DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
+                       MAX_MR_SIZE),
+    DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
+    DEFINE_PROP_INT32("dev-caps-max-sge", PVRDMADev, dev_attr.max_sge, MAX_SGE),
+    DEFINE_PROP_INT32("dev-caps-max-cq", PVRDMADev, dev_attr.max_cq, MAX_CQ),
+    DEFINE_PROP_INT32("dev-caps-max-mr", PVRDMADev, dev_attr.max_mr, MAX_MR),
+    DEFINE_PROP_INT32("dev-caps-max-pd", PVRDMADev, dev_attr.max_pd, MAX_PD),
+    DEFINE_PROP_INT32("dev-caps-qp-rd-atom", PVRDMADev, dev_attr.max_qp_rd_atom,
+                      MAX_QP_RD_ATOM),
+    DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
+                      dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
+    DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void free_dev_ring(PCIDevice *pci_dev, PvrdmaRing *ring,
+                          void *ring_state)
+{
+    pvrdma_ring_free(ring);
+    rdma_pci_dma_unmap(pci_dev, ring_state, TARGET_PAGE_SIZE);
+}
+
+static int init_dev_ring(PvrdmaRing *ring, struct pvrdma_ring **ring_state,
+                         const char *name, PCIDevice *pci_dev,
+                         dma_addr_t dir_addr, uint32_t num_pages)
+{
+    uint64_t *dir, *tbl;
+    int rc = 0;
+
+    pr_dbg("Initializing device ring %s\n", name);
+    pr_dbg("pdir_dma=0x%llx\n", (long long unsigned int)dir_addr);
+    pr_dbg("num_pages=%d\n", num_pages);
+    dir = rdma_pci_dma_map(pci_dev, dir_addr, TARGET_PAGE_SIZE);
+    if (!dir) {
+        pr_err("Failed to map to page directory\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
+    if (!tbl) {
+        pr_err("Failed to map to page table\n");
+        rc = -ENOMEM;
+        goto out_free_dir;
+    }
+
+    *ring_state = rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
+    if (!*ring_state) {
+        pr_err("Failed to map to ring state\n");
+        rc = -ENOMEM;
+        goto out_free_tbl;
+    }
+    /* RX ring is the second */
+    (*ring_state)++;
+    rc = pvrdma_ring_init(ring, name, pci_dev,
+                          (struct pvrdma_ring *)*ring_state,
+                          (num_pages - 1) * TARGET_PAGE_SIZE /
+                          sizeof(struct pvrdma_cqne),
+                          sizeof(struct pvrdma_cqne),
+                          (dma_addr_t *)&tbl[1], (dma_addr_t)num_pages - 1);
+    if (rc) {
+        pr_err("Failed to initialize ring\n");
+        rc = -ENOMEM;
+        goto out_free_ring_state;
+    }
+
+    goto out_free_tbl;
+
+out_free_ring_state:
+    rdma_pci_dma_unmap(pci_dev, *ring_state, TARGET_PAGE_SIZE);
+
+out_free_tbl:
+    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
+
+out_free_dir:
+    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
+
+out:
+    return rc;
+}
+
+static void free_dsr(PVRDMADev *dev)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    if (!dev->dsr_info.dsr) {
+        return;
+    }
+
+    free_dev_ring(pci_dev, &dev->dsr_info.async,
+                  dev->dsr_info.async_ring_state);
+
+    free_dev_ring(pci_dev, &dev->dsr_info.cq, dev->dsr_info.cq_ring_state);
+
+    rdma_pci_dma_unmap(pci_dev, dev->dsr_info.req,
+                         sizeof(union pvrdma_cmd_req));
+
+    rdma_pci_dma_unmap(pci_dev, dev->dsr_info.rsp,
+                         sizeof(union pvrdma_cmd_resp));
+
+    rdma_pci_dma_unmap(pci_dev, dev->dsr_info.dsr,
+                         sizeof(struct pvrdma_device_shared_region));
+
+    dev->dsr_info.dsr = NULL;
+}
+
+static int load_dsr(PVRDMADev *dev)
+{
+    int rc = 0;
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    DSRInfo *dsr_info;
+    struct pvrdma_device_shared_region *dsr;
+
+    free_dsr(dev);
+
+    /* Map to DSR */
+    pr_dbg("dsr_dma=0x%llx\n", (long long unsigned int)dev->dsr_info.dma);
+    dev->dsr_info.dsr = rdma_pci_dma_map(pci_dev, dev->dsr_info.dma,
+                              sizeof(struct pvrdma_device_shared_region));
+    if (!dev->dsr_info.dsr) {
+        pr_err("Failed to map to DSR\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+
+    /* Shortcuts */
+    dsr_info = &dev->dsr_info;
+    dsr = dsr_info->dsr;
+
+    /* Map to command slot */
+    pr_dbg("cmd_dma=0x%llx\n", (long long unsigned int)dsr->cmd_slot_dma);
+    dsr_info->req = rdma_pci_dma_map(pci_dev, dsr->cmd_slot_dma,
+                                     sizeof(union pvrdma_cmd_req));
+    if (!dsr_info->req) {
+        pr_err("Failed to map to command slot address\n");
+        rc = -ENOMEM;
+        goto out_free_dsr;
+    }
+
+    /* Map to response slot */
+    pr_dbg("rsp_dma=0x%llx\n", (long long unsigned int)dsr->resp_slot_dma);
+    dsr_info->rsp = rdma_pci_dma_map(pci_dev, dsr->resp_slot_dma,
+                                     sizeof(union pvrdma_cmd_resp));
+    if (!dsr_info->rsp) {
+        pr_err("Failed to map to response slot address\n");
+        rc = -ENOMEM;
+        goto out_free_req;
+    }
+
+    /* Map to CQ notification ring */
+    rc = init_dev_ring(&dsr_info->cq, &dsr_info->cq_ring_state, "dev_cq",
+                       pci_dev, dsr->cq_ring_pages.pdir_dma,
+                       dsr->cq_ring_pages.num_pages);
+    if (rc) {
+        pr_err("Failed to map to initialize CQ ring\n");
+        rc = -ENOMEM;
+        goto out_free_rsp;
+    }
+
+    /* Map to event notification ring */
+    rc = init_dev_ring(&dsr_info->async, &dsr_info->async_ring_state,
+                       "dev_async", pci_dev, dsr->async_ring_pages.pdir_dma,
+                       dsr->async_ring_pages.num_pages);
+    if (rc) {
+        pr_err("Failed to map to initialize event ring\n");
+        rc = -ENOMEM;
+        goto out_free_rsp;
+    }
+
+    goto out;
+
+out_free_rsp:
+    rdma_pci_dma_unmap(pci_dev, dsr_info->rsp, sizeof(union pvrdma_cmd_resp));
+
+out_free_req:
+    rdma_pci_dma_unmap(pci_dev, dsr_info->req, sizeof(union pvrdma_cmd_req));
+
+out_free_dsr:
+    rdma_pci_dma_unmap(pci_dev, dsr_info->dsr,
+                       sizeof(struct pvrdma_device_shared_region));
+    dsr_info->dsr = NULL;
+
+out:
+    return rc;
+}
+
+static void init_dsr_dev_caps(PVRDMADev *dev)
+{
+    struct pvrdma_device_shared_region *dsr;
+
+    if (dev->dsr_info.dsr == NULL) {
+        pr_err("Can't initialized DSR\n");
+        return;
+    }
+
+    dsr = dev->dsr_info.dsr;
+
+    dsr->caps.fw_ver = PVRDMA_FW_VERSION;
+    pr_dbg("fw_ver=0x%lx\n", dsr->caps.fw_ver);
+
+    dsr->caps.mode = PVRDMA_DEVICE_MODE_ROCE;
+    pr_dbg("mode=%d\n", dsr->caps.mode);
+
+    dsr->caps.gid_types |= PVRDMA_GID_TYPE_FLAG_ROCE_V1;
+    pr_dbg("gid_types=0x%x\n", dsr->caps.gid_types);
+
+    dsr->caps.max_uar = RDMA_BAR2_UAR_SIZE;
+    pr_dbg("max_uar=%d\n", dsr->caps.max_uar);
+
+    dsr->caps.max_mr_size = dev->dev_attr.max_mr_size;
+    dsr->caps.max_qp = dev->dev_attr.max_qp;
+    dsr->caps.max_qp_wr = dev->dev_attr.max_qp_wr;
+    dsr->caps.max_sge = dev->dev_attr.max_sge;
+    dsr->caps.max_cq = dev->dev_attr.max_cq;
+    dsr->caps.max_cqe = dev->dev_attr.max_cqe;
+    dsr->caps.max_mr = dev->dev_attr.max_mr;
+    dsr->caps.max_pd = dev->dev_attr.max_pd;
+    dsr->caps.max_ah = dev->dev_attr.max_ah;
+
+    dsr->caps.gid_tbl_len = MAX_GIDS;
+    pr_dbg("gid_tbl_len=%d\n", dsr->caps.gid_tbl_len);
+
+    dsr->caps.sys_image_guid = 0;
+    pr_dbg("sys_image_guid=%lx\n", dsr->caps.sys_image_guid);
+
+    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+    pr_dbg("node_guid=%llx\n",
+           (long long unsigned int)be64_to_cpu(dsr->caps.node_guid));
+
+    dsr->caps.phys_port_cnt = MAX_PORTS;
+    pr_dbg("phys_port_cnt=%d\n", dsr->caps.phys_port_cnt);
+
+    dsr->caps.max_pkeys = MAX_PKEYS;
+    pr_dbg("max_pkeys=%d\n", dsr->caps.max_pkeys);
+
+    pr_dbg("Initialized\n");
+}
+
+static void free_ports(PVRDMADev *dev)
+{
+    int i;
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        g_free(dev->rdma_dev_res.ports[i].gid_tbl);
+    }
+}
+
+static void init_ports(PVRDMADev *dev, Error **errp)
+{
+    int i;
+
+    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        dev->rdma_dev_res.ports[i].state = PVRDMA_PORT_DOWN;
+
+        dev->rdma_dev_res.ports[i].pkey_tbl =
+            g_malloc0(sizeof(*dev->rdma_dev_res.ports[i].pkey_tbl) *
+                      MAX_PORT_PKEYS);
+    }
+}
+
+static void activate_device(PVRDMADev *dev)
+{
+    set_reg_val(dev, PVRDMA_REG_ERR, 0);
+    pr_dbg("Device activated\n");
+}
+
+static int unquiesce_device(PVRDMADev *dev)
+{
+    pr_dbg("Device unquiesced\n");
+    return 0;
+}
+
+static int reset_device(PVRDMADev *dev)
+{
+    pr_dbg("Device reset complete\n");
+    return 0;
+}
+
+static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+    uint32_t val;
+
+    /* pr_dbg("addr=0x%lx, size=%d\n", addr, size); */
+
+    if (get_reg_val(dev, addr, &val)) {
+        pr_dbg("Error trying to read REG value from address 0x%x\n",
+               (uint32_t)addr);
+        return -EINVAL;
+    }
+
+    trace_pvrdma_regs_read(addr, val);
+
+    return val;
+}
+
+static void regs_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+
+    /* pr_dbg("addr=0x%lx, val=0x%x, size=%d\n", addr, (uint32_t)val, size); */
+
+    if (set_reg_val(dev, addr, val)) {
+        pr_err("Error trying to set REG value, addr=0x%lx, val=0x%lx\n",
+               (uint64_t)addr, val);
+        return;
+    }
+
+    trace_pvrdma_regs_write(addr, val);
+
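+    /*
+     * The guest programs the 64-bit DSR address as two 32-bit halves:
+     * DSRLOW first, then DSRHIGH, which triggers mapping of the shared
+     * region and publishing of the device caps.
+     */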
+    switch (addr) {
+    case PVRDMA_REG_DSRLOW:
+        dev->dsr_info.dma = val;
+        break;
+    case PVRDMA_REG_DSRHIGH:
+        dev->dsr_info.dma |= val << 32;
+        load_dsr(dev);
+        init_dsr_dev_caps(dev);
+        break;
+    case PVRDMA_REG_CTL:
+        switch (val) {
+        case PVRDMA_DEVICE_CTL_ACTIVATE:
+            activate_device(dev);
+            break;
+        case PVRDMA_DEVICE_CTL_UNQUIESCE:
+            unquiesce_device(dev);
+            break;
+        case PVRDMA_DEVICE_CTL_RESET:
+            reset_device(dev);
+            break;
+        }
+        break;
+    case PVRDMA_REG_IMR:
+        pr_dbg("Interrupt mask=0x%lx\n", val);
+        dev->interrupt_mask = val;
+        break;
+    case PVRDMA_REG_REQUEST:
+        if (val == 0) {
+            execute_command(dev);
+        }
+        break;
+    default:
+        break;
+    }
+}
+
+static const MemoryRegionOps regs_ops = {
+    .read = regs_read,
+    .write = regs_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = sizeof(uint32_t),
+        .max_access_size = sizeof(uint32_t),
+    },
+};
+
+static void uar_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
+{
+    PVRDMADev *dev = opaque;
+
+    /* pr_dbg("addr=0x%lx, val=0x%x, size=%d\n", addr, (uint32_t)val, size); */
+
+    switch (addr & 0xFFF) { /* Mask with 0xFFF as each UC gets page */
+    case PVRDMA_UAR_QP_OFFSET:
+        pr_dbg("UAR QP command, addr=0x%x, val=0x%lx\n", (uint32_t)addr, val);
+        if (val & PVRDMA_UAR_QP_SEND) {
+            pvrdma_qp_send(dev, val & PVRDMA_UAR_HANDLE_MASK);
+        }
+        if (val & PVRDMA_UAR_QP_RECV) {
+            pvrdma_qp_recv(dev, val & PVRDMA_UAR_HANDLE_MASK);
+        }
+        break;
+    case PVRDMA_UAR_CQ_OFFSET:
+        /* pr_dbg("UAR CQ cmd, addr=0x%x, val=0x%lx\n", (uint32_t)addr, val); */
+        if (val & PVRDMA_UAR_CQ_ARM) {
+            rdma_rm_req_notify_cq(&dev->rdma_dev_res,
+                                  val & PVRDMA_UAR_HANDLE_MASK,
+                                  !!(val & PVRDMA_UAR_CQ_ARM_SOL));
+        }
+        if (val & PVRDMA_UAR_CQ_ARM_SOL) {
+            pr_dbg("UAR_CQ_ARM_SOL (%ld)\n", val & PVRDMA_UAR_HANDLE_MASK);
+        }
+        if (val & PVRDMA_UAR_CQ_POLL) {
+            pr_dbg("UAR_CQ_POLL (%ld)\n", val & PVRDMA_UAR_HANDLE_MASK);
+            pvrdma_cq_poll(&dev->rdma_dev_res, val & PVRDMA_UAR_HANDLE_MASK);
+        }
+        break;
+    default:
+        pr_err("Unsupported command, addr=0x%lx, val=0x%lx\n",
+               (uint64_t)addr, val);
+        break;
+    }
+}
+
+static const MemoryRegionOps uar_ops = {
+    .write = uar_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = sizeof(uint32_t),
+        .max_access_size = sizeof(uint32_t),
+    },
+};
+
+static void init_pci_config(PCIDevice *pdev)
+{
+    pdev->config[PCI_INTERRUPT_PIN] = 1;
+}
+
+static void init_bars(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    /* BAR 0 - MSI-X */
+    memory_region_init(&dev->msix, OBJECT(dev), "pvrdma-msix",
+                       RDMA_BAR0_MSIX_SIZE);
+    pci_register_bar(pdev, RDMA_MSIX_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &dev->msix);
+
+    /* BAR 1 - Registers */
+    memset(&dev->regs_data, 0, sizeof(dev->regs_data));
+    memory_region_init_io(&dev->regs, OBJECT(dev), &regs_ops, dev,
+                          "pvrdma-regs", RDMA_BAR1_REGS_SIZE);
+    pci_register_bar(pdev, RDMA_REG_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &dev->regs);
+
+    /* BAR 2 - UAR */
+    memset(&dev->uar_data, 0, sizeof(dev->uar_data));
+    memory_region_init_io(&dev->uar, OBJECT(dev), &uar_ops, dev, "rdma-uar",
+                          RDMA_BAR2_UAR_SIZE);
+    pci_register_bar(pdev, RDMA_UAR_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &dev->uar);
+}
+
+static void init_regs(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    set_reg_val(dev, PVRDMA_REG_VERSION, PVRDMA_HW_VERSION);
+    set_reg_val(dev, PVRDMA_REG_ERR, 0xFFFF);
+}
+
+static void uninit_msix(PCIDevice *pdev, int used_vectors)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+    int i;
+
+    for (i = 0; i < used_vectors; i++) {
+        msix_vector_unuse(pdev, i);
+    }
+
+    msix_uninit(pdev, &dev->msix, &dev->msix);
+}
+
+static int init_msix(PCIDevice *pdev, Error **errp)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+    int i;
+    int rc;
+
+    rc = msix_init(pdev, RDMA_MAX_INTRS, &dev->msix, RDMA_MSIX_BAR_IDX,
+                   RDMA_MSIX_TABLE, &dev->msix, RDMA_MSIX_BAR_IDX,
+                   RDMA_MSIX_PBA, 0, NULL);
+
+    if (rc < 0) {
+        error_setg(errp, "Failed to initialize MSI-X");
+        return rc;
+    }
+
+    for (i = 0; i < RDMA_MAX_INTRS; i++) {
+        rc = msix_vector_use(PCI_DEVICE(dev), i);
+        if (rc < 0) {
+            error_setg(errp, "Fail mark MSI-X vercor %d", i);
+            uninit_msix(pdev, i);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+static void init_dev_caps(PVRDMADev *dev)
+{
+    size_t pg_tbl_bytes = TARGET_PAGE_SIZE *
+                          (TARGET_PAGE_SIZE / sizeof(uint64_t));
+    size_t wr_sz = MAX(sizeof(struct pvrdma_sq_wqe_hdr),
+                       sizeof(struct pvrdma_rq_wqe_hdr));
+
+    dev->dev_attr.max_qp_wr = pg_tbl_bytes /
+                              (wr_sz + sizeof(struct pvrdma_sge) * MAX_SGE) -
+                              TARGET_PAGE_SIZE; /* First page is ring state */
+    pr_dbg("max_qp_wr=%d\n", dev->dev_attr.max_qp_wr);
+
+    dev->dev_attr.max_cqe = pg_tbl_bytes / sizeof(struct pvrdma_cqe) -
+                            TARGET_PAGE_SIZE; /* First page is ring state */
+    pr_dbg("max_cqe=%d\n", dev->dev_attr.max_cqe);
+}
+
+static int pvrdma_check_ram_shared(Object *obj, void *opaque)
+{
+    bool *shared = opaque;
+
+    if (object_dynamic_cast(obj, "memory-backend-ram")) {
+        *shared = object_property_get_bool(obj, "share", NULL);
+    }
+
+    return 0;
+}
+
+static void pvrdma_realize(PCIDevice *pdev, Error **errp)
+{
+    int rc;
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+    Object *memdev_root;
+    bool ram_shared = false;
+
+    pr_dbg("Initializing device %s %x.%x\n", pdev->name,
+           PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+    if (TARGET_PAGE_SIZE != getpagesize()) {
+        error_setg(errp, "Target page size must be the same as host page size");
+        return;
+    }
+
+    memdev_root = object_resolve_path("/objects", NULL);
+    if (memdev_root) {
+        object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
+    }
+    if (!ram_shared) {
+        error_setg(errp, "Only shared memory backed ram is supported");
+        return;
+    }
+
+    dev->dsr_info.dsr = NULL;
+
+    init_pci_config(pdev);
+
+    init_bars(pdev);
+
+    init_regs(pdev);
+
+    init_dev_caps(dev);
+
+    rc = init_msix(pdev, errp);
+    if (rc) {
+        goto out;
+    }
+
+    rc = rdma_backend_init(&dev->backend_dev, &dev->rdma_dev_res,
+                           dev->backend_device_name, dev->backend_port_num,
+                           dev->backend_gid_idx, &dev->dev_attr, errp);
+    if (rc) {
+        goto out;
+    }
+
+    rc = rdma_rm_init(&dev->rdma_dev_res, &dev->dev_attr, errp);
+    if (rc) {
+        goto out;
+    }
+
+    init_ports(dev, errp);
+
+    rc = pvrdma_qp_ops_init();
+    if (rc) {
+        goto out;
+    }
+
+out:
+    if (rc) {
+        error_append_hint(errp, "Device fail to load\n");
+    }
+}
+
+static void pvrdma_exit(PCIDevice *pdev)
+{
+    PVRDMADev *dev = PVRDMA_DEV(pdev);
+
+    pr_dbg("Closing device %s %x.%x\n", pdev->name, PCI_SLOT(pdev->devfn),
+           PCI_FUNC(pdev->devfn));
+
+    pvrdma_qp_ops_fini();
+
+    free_ports(dev);
+
+    rdma_rm_fini(&dev->rdma_dev_res);
+
+    rdma_backend_fini(&dev->backend_dev);
+
+    free_dsr(dev);
+
+    if (msix_enabled(pdev)) {
+        uninit_msix(pdev, RDMA_MAX_INTRS);
+    }
+}
+
+static void pvrdma_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+    k->realize = pvrdma_realize;
+    k->exit = pvrdma_exit;
+    k->vendor_id = PCI_VENDOR_ID_VMWARE;
+    k->device_id = PCI_DEVICE_ID_VMWARE_PVRDMA;
+    k->revision = 0x00;
+    k->class_id = PCI_CLASS_NETWORK_OTHER;
+
+    dc->desc = "RDMA Device";
+    dc->props = pvrdma_dev_properties;
+    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+}
+
+static const TypeInfo pvrdma_info = {
+    .name = PVRDMA_HW_NAME,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(PVRDMADev),
+    .class_init = pvrdma_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
+        { }
+    }
+};
+
+static void register_types(void)
+{
+    type_register_static(&pvrdma_info);
+}
+
+type_init(register_types)
diff --git a/hw/rdma/vmw/trace-events b/hw/rdma/vmw/trace-events
new file mode 100644
index 0000000000..b3f9e2b19f
--- /dev/null
+++ b/hw/rdma/vmw/trace-events
@@ -0,0 +1,5 @@
+# See docs/tracing.txt for syntax documentation.
+
+# hw/rdma/vmw/pvrdma_main.c
+pvrdma_regs_read(uint64_t addr, uint64_t val) "regs[0x%"PRIx64"] = 0x%"PRIx64
+pvrdma_regs_write(uint64_t addr, uint64_t val) "regs[0x%"PRIx64"] = 0x%"PRIx64
diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index 35df1874a9..1dbf53627c 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -266,4 +266,7 @@
 #define PCI_VENDOR_ID_TEWS               0x1498
 #define PCI_DEVICE_ID_TEWS_TPCI200       0x30C8
 
+#define PCI_VENDOR_ID_VMWARE             0x15ad
+#define PCI_DEVICE_ID_VMWARE_PVRDMA      0x0820
+
 #endif
-- 
2.13.5


* [Qemu-devel] [PATCH V11 10/10] MAINTAINERS: add entry for hw/rdma
  2018-02-14 19:22 [Qemu-devel] [PATCH V11 00/10] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
                   ` (8 preceding siblings ...)
  2018-02-14 19:22 ` [Qemu-devel] [PATCH V11 09/10] hw/rdma: Implementation of PVRDMA device Marcel Apfelbaum
@ 2018-02-14 19:22 ` Marcel Apfelbaum
  9 siblings, 0 replies; 11+ messages in thread
From: Marcel Apfelbaum @ 2018-02-14 19:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: peter.maydell, ehabkost, yuval.shaia, marcel, mst, dotanb,
	yanjun.zhu, ghammer

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 57358a08e2..6e7adad1df 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2034,6 +2034,14 @@ F: block/replication.c
 F: tests/test-replication.c
 F: docs/block-replication.txt
 
+PVRDMA
+M: Yuval Shaia <yuval.shaia@oracle.com>
+M: Marcel Apfelbaum <marcel@redhat.com>
+S: Maintained
+F: hw/rdma/*
+F: hw/rdma/vmw/*
+F: docs/pvrdma.txt
+
 Build and test automation
 -------------------------
 Build and test automation
-- 
2.13.5

