* [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD
@ 2018-11-13  7:12 Yuval Shaia
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Hi all.

This is a major enhancement to the pvrdma device that allows it to work with
state-of-the-art applications such as MPI.

As described in patch #5, MAD packets are management packets that are used
for many purposes, including but not limited to implementing a communication
layer above the IB verbs API.

Patch 1 adds a new external executable (under contrib) that aims to
address a specific limitation in the RDMA userspace MAD stack.

This patch-set mainly presents the MAD enhancement, but while working on it I
came across some bugs and enhancements that needed to be implemented before
doing any MAD coding. This is the role of patches 2 to 4, 7 to 9 and 15 to 17.

Patches 6 and 18 are cosmetic changes that, while not strictly relevant to
this patchset, are included with it since (at least for patch 6) they are
hard to decouple.

Patches 12 to 15 couple the pvrdma device with the vmxnet3 device, as this is
the configuration enforced by the pvrdma driver in the guest - a vmxnet3
device in function 0 and a pvrdma device in function 1 of the same PCI slot.
Patch 12 moves the needed code from the vmxnet3 device to a new header file
that can be used by pvrdma code, while patches 13 to 15 make use of it.
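The function-0/function-1 constraint above boils down to PCI devfn arithmetic: two functions share a slot when their devfn values differ only in the low three bits. A minimal standalone sketch of such a check (helper names here are illustrative; QEMU's actual code uses its own PCI_SLOT()/PCI_FUNC() macros):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PCI encodes slot and function into one devfn byte: slot in bits 7..3,
 * function in bits 2..0 (same layout as Linux's PCI_SLOT/PCI_FUNC). */
static inline uint8_t pci_slot(uint8_t devfn) { return devfn >> 3; }
static inline uint8_t pci_func(uint8_t devfn) { return devfn & 0x7; }

/* Hypothetical check mirroring the constraint the driver enforces:
 * the paired Ethernet device must live in function 0 of our slot,
 * and the pvrdma device itself in function 1. */
static bool valid_func0_pairing(uint8_t pvrdma_devfn, uint8_t eth_devfn)
{
    return pci_func(pvrdma_devfn) == 1 &&
           pci_func(eth_devfn) == 0 &&
           pci_slot(pvrdma_devfn) == pci_slot(eth_devfn);
}
```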

Along with this patch-set there is a parallel patch posted to libvirt to
apply the change needed there as part of the process implemented in patches
10 and 11. This change is needed so that the guest is able to configure
any IP on the Ethernet function of the pvrdma device.
https://www.redhat.com/archives/libvir-list/2018-November/msg00135.html

Since we maintain external resources such as GIDs in the host GID table, we
need to do some cleanup before going down. This is the job of patches 19 and
20. Patches 21 and 22 contain fixes for bugs detected while working on the
cleanup code that runs during shutdown.
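The shutdown cleanup in patches 19 and 20 follows the notifier-list pattern: subsystems register a callback, and the core walks the list once at shutdown. A simplified standalone sketch of that pattern (type and function names here are illustrative, not QEMU's actual Notifier API):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative notifier chain: a singly linked list of callbacks that
 * the core invokes once when the machine goes down. */
typedef struct ShutdownNotifier {
    void (*notify)(struct ShutdownNotifier *n);
    struct ShutdownNotifier *next;
} ShutdownNotifier;

static ShutdownNotifier *shutdown_list;

static void shutdown_notifier_add(ShutdownNotifier *n)
{
    n->next = shutdown_list;
    shutdown_list = n;
}

static void run_shutdown_notifiers(void)
{
    for (ShutdownNotifier *n = shutdown_list; n; n = n->next) {
        n->notify(n);
    }
}

/* Example subscriber, standing in for pvrdma's GID-table cleanup. */
static int cleanups_run;
static void pvrdma_cleanup(ShutdownNotifier *n) { cleanups_run++; }

static ShutdownNotifier a = { pvrdma_cleanup, NULL };
static ShutdownNotifier b = { pvrdma_cleanup, NULL };

/* Register two subscribers, run the chain once, report how many fired. */
static int demo_shutdown(void)
{
    shutdown_notifier_add(&a);
    shutdown_notifier_add(&b);
    run_shutdown_notifiers();
    return cleanups_run;
}
```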

v1 -> v2:
    * Fix compilation issue detected when compiling for mingw
    * Address comment from Eric Blake re version of QEMU in json
      message
    * Fix example from QMP message in json file
    * Fix case where a VM tries to remove an invalid GID from GID table
    * rdmacm-mux: Cleanup entries in socket-gids table when socket is
      closed
    * Cleanup resources (GIDs, QPs etc) when VM goes down

v2 -> v3:
    * Address comment from Cornelia Huck for patch #19
    * Add some R-Bs from Marcel Apfelbaum and Dmitry Fleytman
    * Update docs/pvrdma.txt with the changes made by this patchset
    * Address comments from Shamir Rabinovitch for UMAD multiplexer

Thanks,
Yuval

Yuval Shaia (23):
  contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  hw/rdma: Add ability to force notification without re-arm
  hw/rdma: Return qpn 1 if ibqp is NULL
  hw/rdma: Abort send-op if fail to create addr handler
  hw/rdma: Add support for MAD packets
  hw/pvrdma: Make function reset_device return void
  hw/pvrdma: Make default pkey 0xFFFF
  hw/pvrdma: Set the correct opcode for recv completion
  hw/pvrdma: Set the correct opcode for send completion
  json: Define new QMP message for pvrdma
  hw/pvrdma: Add support to allow guest to configure GID table
  vmxnet3: Move some definitions to header file
  hw/pvrdma: Make sure PCI function 0 is vmxnet3
  hw/rdma: Initialize node_guid from vmxnet3 mac address
  hw/pvrdma: Make device state depend on Ethernet function state
  hw/pvrdma: Fill all CQE fields
  hw/pvrdma: Fill error code in command's response
  hw/rdma: Remove unneeded code that handles more that one port
  vl: Introduce shutdown_notifiers
  hw/pvrdma: Clean device's resource when system is shutdown
  hw/rdma: Do not use bitmap_zero_extend to free bitmap
  hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  docs: Update pvrdma device documentation

 MAINTAINERS                      |   2 +
 Makefile                         |   6 +-
 Makefile.objs                    |   5 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 771 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 docs/pvrdma.txt                  | 103 ++++-
 hw/net/vmxnet3.c                 | 116 +----
 hw/net/vmxnet3_defs.h            | 133 ++++++
 hw/rdma/rdma_backend.c           | 461 +++++++++++++++---
 hw/rdma/rdma_backend.h           |  28 +-
 hw/rdma/rdma_backend_defs.h      |  13 +-
 hw/rdma/rdma_rm.c                | 120 ++++-
 hw/rdma/rdma_rm.h                |  17 +-
 hw/rdma/rdma_rm_defs.h           |  21 +-
 hw/rdma/rdma_utils.h             |  24 +
 hw/rdma/vmw/pvrdma.h             |  10 +-
 hw/rdma/vmw/pvrdma_cmd.c         | 119 +++--
 hw/rdma/vmw/pvrdma_main.c        |  49 +-
 hw/rdma/vmw/pvrdma_qp_ops.c      |  62 ++-
 include/sysemu/sysemu.h          |   1 +
 qapi/qapi-schema.json            |   1 +
 qapi/rdma.json                   |  38 ++
 vl.c                             |  15 +-
 24 files changed, 1868 insertions(+), 307 deletions(-)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
 create mode 100644 hw/net/vmxnet3_defs.h
 create mode 100644 qapi/rdma.json

-- 
2.17.2


* [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for a
given MAD class.
This does not go hand in hand with the QEMU pvrdma device's requirements,
where each VM is a MAD agent.
Fix it by adding an implementation of an RDMA MAD multiplexer service which,
on one hand, registers as the sole MAD agent with the kernel module and, on
the other hand, gives service to more than one VM.

Design Overview:
----------------
A server process registers with the UMAD framework (for this to work the
rdma_cm kernel module needs to be unloaded) and creates a unix socket on
which it listens for incoming requests from clients.
A client process (such as QEMU) connects to this unix socket and
registers with its own GID.
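The registration step can be exercised without a real UMAD stack, since the wire format is just a fixed-size message struct written to the unix socket. The snippet below uses a socketpair to stand in for the client/server connection, with simplified local stand-ins for the message types (the real rdmacm-mux.h embeds ibv_gid and umad structures):

```c
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Simplified stand-ins for the types declared in rdmacm-mux.h. */
enum { MUX_MSG_REG = 0 };

typedef struct {
    int msg_type;
    uint64_t gid_ifid;   /* interface id the client registers */
    int err_code;
} MuxMsgDemo;

/* "Client" side: send a REG message carrying our GID's interface id. */
static int demo_register(int sock, uint64_t ifid)
{
    MuxMsgDemo msg = { MUX_MSG_REG, ifid, 0 };
    return send(sock, &msg, sizeof(msg), 0) == sizeof(msg) ? 0 : -1;
}

/* "Server" side: read one message and return the registered id. */
static uint64_t demo_serve_one(int sock)
{
    MuxMsgDemo msg;
    recv(sock, &msg, sizeof(msg), 0);
    return msg.msg_type == MUX_MSG_REG ? msg.gid_ifid : 0;
}

/* Round-trip one registration over a socketpair. */
static uint64_t demo_roundtrip(uint64_t ifid)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return 0;
    }
    demo_register(sv[0], ifid);
    uint64_t got = demo_serve_one(sv[1]);
    close(sv[0]);
    close(sv[1]);
    return got;
}
```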

TX:
---
When a client needs to send an rdma_cm MAD message it constructs it the same
way as without this multiplexer, i.e. it creates a umad packet, but this
time it writes its content to the socket instead of calling umad_send().
The server, upon receiving such a message, fetches the local_comm_id from it
so a context for this session can be maintained, and relays the message to
the UMAD layer by calling umad_send().
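In CM messages the communication id used for the context lookup sits at a small fixed offset after the 24-byte MAD header: right after it for SIDR_REP, and one 32-bit field later for REP/REJ/DREQ/DREP/RTU (see get_fd() in the patch below). A hedged sketch of that extraction on a fabricated buffer (header layout simplified; the real struct umad_hdr comes from infiniband/umad_types.h):

```c
#include <stdint.h>
#include <string.h>

#define MAD_HDR_LEN 24  /* stands in for sizeof(struct umad_hdr) */

/* Extract the 32-bit comm_id found `skip` 32-bit fields after the MAD
 * header, mimicking the attr_id-dependent branches of get_fd(). */
static uint32_t extract_comm_id(const uint8_t *mad, int skip)
{
    uint32_t comm_id;
    memcpy(&comm_id, mad + MAD_HDR_LEN + skip * sizeof(comm_id),
           sizeof(comm_id));
    return comm_id;
}

/* Build a fake MAD: header bytes, then two 32-bit ids, and read back
 * the second one (REP-style: skip one field). */
static uint32_t demo_extract(void)
{
    uint8_t mad[64] = {0};
    uint32_t first = 0x11111111, second = 0x22222222;
    memcpy(mad + MAD_HDR_LEN, &first, sizeof(first));
    memcpy(mad + MAD_HDR_LEN + 4, &second, sizeof(second));
    return extract_comm_id(mad, 1);
}
```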

RX:
---
The server creates a worker thread to process incoming rdma_cm MAD
messages. When an incoming message arrives (umad_recv()), the server,
depending on the message type (attr_id), looks for the target client by
searching either the gid->fd table or the local_comm_id->fd table. With
the extracted fd the server relays the incoming message to the client.
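One subtlety in the gid->fd lookup is IPv4: a client registered with an IPv4-mapped GID will not match a lookup keyed on the raw interface id, so the lookup retries with the mapping bits OR-ed in (see _hash_tbl_search_fd_by_ifid() in the patch below). A simplified, array-backed sketch of that two-step lookup (the real code uses a GHashTable under a rwlock):

```c
#include <stdint.h>

/* Toy map standing in for the gid2fd GHashTable. */
#define MAP_SIZE 8
static uint64_t map_ifid[MAP_SIZE];
static int map_fd[MAP_SIZE];
static int map_len;

static void map_add(uint64_t ifid, int fd)
{
    map_ifid[map_len] = ifid;
    map_fd[map_len++] = fd;
}

static int map_get(uint64_t ifid)
{
    for (int i = 0; i < map_len; i++) {
        if (map_ifid[i] == ifid) {
            return map_fd[i];
        }
    }
    return 0; /* 0 means "not found", as in the real code */
}

/* Two-step lookup mirroring _hash_tbl_search_fd_by_ifid(): try the raw
 * interface id first, then retry with the IPv4-mapped bits OR-ed in. */
static int lookup_fd(uint64_t ifid)
{
    int fd = map_get(ifid);
    if (!fd) {
        fd = map_get(ifid | 0x00000000ffff0000ULL);
    }
    return fd;
}

/* Register one native GID and one IPv4-mapped GID, then look both up. */
static int demo_lookup(void)
{
    map_add(0xfe80000000001234ULL, 5);
    map_add(0x00000000ffff0a01ULL, 7);
    return lookup_fd(0xfe80000000001234ULL) == 5 &&
           lookup_fd(0x0a01ULL) == 7 &&
           lookup_fd(0xdeadULL) == 0;
}
```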

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 MAINTAINERS                      |   1 +
 Makefile                         |   3 +
 Makefile.objs                    |   1 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 771 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 6 files changed, 836 insertions(+)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 98a1856afc..e087d58ac6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2231,6 +2231,7 @@ S: Maintained
 F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
+F: contrib/rdmacm-mux/*
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index f2947186a4..94072776ff 100644
--- a/Makefile
+++ b/Makefile
@@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
                 elf2dmp-obj-y \
                 ivshmem-client-obj-y \
                 ivshmem-server-obj-y \
+                rdmacm-mux-obj-y \
                 libvhost-user-obj-y \
                 vhost-user-scsi-obj-y \
                 vhost-user-blk-obj-y \
@@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
 	$(call LINK, $^)
 vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
 	$(call LINK, $^)
+rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
+	$(call LINK, $^)
 
 module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
 	$(call quiet-command,$(PYTHON) $< $@ \
diff --git a/Makefile.objs b/Makefile.objs
index 1e1ff387d7..cc7df3ad80 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
 vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
 vhost-user-blk-obj-y = contrib/vhost-user-blk/
+rdmacm-mux-obj-y = contrib/rdmacm-mux/
 
 ######################################################################
 trace-events-subdirs =
diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
new file mode 100644
index 0000000000..be3eacb6f7
--- /dev/null
+++ b/contrib/rdmacm-mux/Makefile.objs
@@ -0,0 +1,4 @@
+ifdef CONFIG_PVRDMA
+CFLAGS += -libumad -Wno-format-truncation
+rdmacm-mux-obj-y = main.o
+endif
diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
new file mode 100644
index 0000000000..47cf0ac7bc
--- /dev/null
+++ b/contrib/rdmacm-mux/main.c
@@ -0,0 +1,771 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "sys/poll.h"
+#include "sys/ioctl.h"
+#include "pthread.h"
+#include "syslog.h"
+
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "infiniband/umad_types.h"
+#include "infiniband/umad_sa.h"
+#include "infiniband/umad_cm.h"
+
+#include "rdmacm-mux.h"
+
+#define SCALE_US 1000
+#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
+#define SLEEP_SECS 5 /* This is used both in poll() and thread */
+#define SERVER_LISTEN_BACKLOG 10
+#define MAX_CLIENTS 4096
+#define MAD_RMPP_VERSION 0
+#define MAD_METHOD_MASK0 0x8
+
+#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
+
+#define CM_REQ_DGID_POS      80
+#define CM_SIDR_REQ_DGID_POS 44
+
+/* The below can be overridden by command line parameters */
+#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
+#define RDMA_DEVICE "rxe0"
+#define RDMA_PORT_NUM 1
+
+typedef struct RdmaCmServerArgs {
+    char unix_socket_path[PATH_MAX];
+    char rdma_dev_name[NAME_MAX];
+    int rdma_port_num;
+} RdmaCMServerArgs;
+
+typedef struct CommId2FdEntry {
+    int fd;
+    int ttl; /* Initialized to 2, decremented on each timeout, entry deleted when 0 */
+    __be64 gid_ifid;
+} CommId2FdEntry;
+
+typedef struct RdmaCmUMadAgent {
+    int port_id;
+    int agent_id;
+    GHashTable *gid2fd; /* Used to find fd of a given gid */
+    GHashTable *commid2fd; /* Used to find fd of a given comm_id */
+} RdmaCmUMadAgent;
+
+typedef struct RdmaCmServer {
+    bool run;
+    RdmaCMServerArgs args;
+    struct pollfd fds[MAX_CLIENTS];
+    int nfds;
+    RdmaCmUMadAgent umad_agent;
+    pthread_t umad_recv_thread;
+    pthread_rwlock_t lock;
+} RdmaCMServer;
+
+static RdmaCMServer server = {0};
+
+static void usage(const char *progname)
+{
+    printf("Usage: %s [OPTION]...\n"
+           "Start a RDMA-CM multiplexer\n"
+           "\n"
+           "\t-h                    Show this help\n"
+           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
+           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
+           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
+           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
+}
+
+static void help(const char *progname)
+{
+    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    int c;
+    char unix_socket_path[PATH_MAX];
+
+    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
+    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
+    server.args.rdma_port_num = RDMA_PORT_NUM;
+
+    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
+        switch (c) {
+        case 'h':
+            usage(argv[0]);
+            exit(0);
+
+        case 's':
+            /* This is temporary, the final name is built below */
+            snprintf(unix_socket_path, PATH_MAX, "%s", optarg);
+            break;
+
+        case 'd':
+            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
+            break;
+
+        case 'p':
+            server.args.rdma_port_num = atoi(optarg);
+            break;
+
+        default:
+            help(argv[0]);
+            exit(1);
+        }
+    }
+
+    /* Build unique unix-socket file name */
+    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
+             unix_socket_path, server.args.rdma_dev_name,
+             server.args.rdma_port_num);
+
+    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
+    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
+    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
+}
+
+static void hash_tbl_alloc(void)
+{
+
+    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
+                                                     g_int64_equal,
+                                                     g_free, g_free);
+    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
+                                                        g_int_equal,
+                                                        g_free, g_free);
+}
+
+static void hash_tbl_free(void)
+{
+    if (server.umad_agent.commid2fd) {
+        g_hash_table_destroy(server.umad_agent.commid2fd);
+    }
+    if (server.umad_agent.gid2fd) {
+        g_hash_table_destroy(server.umad_agent.gid2fd);
+    }
+}
+
+
+static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
+{
+    int *fd;
+
+    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    if (!fd) {
+        /* Let's try IPv4 */
+        *gid_ifid |= 0x00000000ffff0000;
+        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    }
+
+    return fd ? *fd : 0;
+}
+
+static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
+{
+    pthread_rwlock_rdlock(&server.lock);
+    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!*fd) {
+        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
+        return -ENOENT;
+    }
+
+    return 0;
+}
+
+static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
+                                         __be64 *gid_idid)
+{
+    CommId2FdEntry *fde;
+
+    pthread_rwlock_rdlock(&server.lock);
+    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!fde) {
+        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
+        return -ENOENT;
+    }
+
+    *fd = fde->fd;
+    *gid_idid = fde->gid_ifid;
+
+    return 0;
+}
+
+static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (fd1) { /* record already exists - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
+                           RDMACM_MUX_ERR_CODE_EACCES;
+    }
+
+    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
+
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx registered on socket %d", (uint64_t)gid_ifid, fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (!fd1) { /* record does not exist - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
+    }
+
+    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)));
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx unregistered on socket %d", (uint64_t)gid_ifid, fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
+                                          uint64_t gid_ifid)
+{
+    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
+
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_insert(server.umad_agent.commid2fd,
+                        g_memdup(&comm_id, sizeof(comm_id)),
+                        g_memdup(&fde, sizeof(fde)));
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static gboolean remove_old_comm_ids(gpointer key, gpointer value,
+                                    gpointer user_data)
+{
+    CommId2FdEntry *fde = (CommId2FdEntry *)value;
+
+    return !fde->ttl--;
+}
+
+static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    if (*(int *)value == *(int *)user_data) {
+        syslog(LOG_INFO, "0x%lx unregistered on socket %d", *(uint64_t *)key,
+               *(int *)value);
+        return true;
+    }
+
+    return false;
+}
+
+static void hash_tbl_remove_fd_ifid_pair(int fd)
+{
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
+                                remove_entry_from_gid2fd, (gpointer)&fd);
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
+{
+    struct umad_hdr *hdr = (struct umad_hdr *)mad;
+    char *data = (char *)hdr + sizeof(*hdr);
+    int32_t comm_id;
+    uint16_t attr_id = be16toh(hdr->attr_id);
+    int rc = 0;
+
+    switch (attr_id) {
+    case UMAD_CM_ATTR_REQ:
+        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_SIDR_REQ:
+        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_REP:
+        /* Fall through */
+    case UMAD_CM_ATTR_REJ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREQ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREP:
+        /* Fall through */
+    case UMAD_CM_ATTR_RTU:
+        data += sizeof(comm_id);
+        /* Fall through */
+    case UMAD_CM_ATTR_SIDR_REP:
+        memcpy(&comm_id, data, sizeof(comm_id));
+        if (comm_id) {
+            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
+        }
+        break;
+
+    default:
+        rc = -EINVAL;
+        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
+    }
+
+    return rc;
+}
+
+static void *umad_recv_thread_func(void *args)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    int fd = -2;
+
+    while (server.run) {
+        do {
+            msg.umad_len = sizeof(msg.umad.mad);
+            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
+                           SLEEP_SECS * SCALE_US);
+            if ((rc == -EIO) || (rc == -EINVAL)) {
+                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
+            }
+
+            if (rc == -ETIMEDOUT) {
+                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
+                                            remove_old_comm_ids, NULL);
+            }
+        } while (rc && server.run);
+
+        if (server.run) {
+            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
+            if (rc) {
+                continue;
+            }
+
+            send(fd, &msg, sizeof(msg), 0);
+        }
+    }
+
+    return NULL;
+}
+
+static int read_and_process(int fd)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    struct umad_hdr *hdr;
+    uint32_t *comm_id;
+    uint16_t attr_id;
+
+    rc = recv(fd, &msg, sizeof(msg), 0);
+
+    if (rc < 0 && errno != EWOULDBLOCK) {
+        return -EIO;
+    }
+
+    if (!rc) {
+        return -EPIPE;
+    }
+
+    switch (msg.hdr.msg_type) {
+    case RDMACM_MUX_MSG_TYPE_REG:
+        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_MSG_TYPE_UNREG:
+        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_MSG_TYPE_MAD:
+        /* If this is REQ or REP then store the pair comm_id,fd to be later
+         * used for other messages where gid is unknown */
+        hdr = (struct umad_hdr *)msg.umad.mad;
+        attr_id = be16toh(hdr->attr_id);
+        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
+            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
+            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
+            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
+            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
+                                          msg.hdr.sgid.global.interface_id);
+        }
+
+        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
+                       &msg.umad, msg.umad_len, 1, 0);
+        if (rc) {
+            syslog(LOG_WARNING, "Fail to send MAD message, err=%d", rc);
+        }
+        break;
+
+    default:
+        syslog(LOG_WARNING, "Got invalid message (%d) from %d",
+               msg.hdr.msg_type, fd);
+        rc = RDMACM_MUX_ERR_CODE_EINVAL;
+    }
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
+    msg.hdr.err_code = rc;
+    rc = send(fd, &msg, sizeof(msg), 0);
+
+    return rc == sizeof(msg) ? 0 : -EPIPE;
+}
+
+static int accept_all(void)
+{
+    int fd, rc = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    do {
+        if ((server.nfds + 1) > MAX_CLIENTS) {
+            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
+            rc = -EIO;
+            goto out;
+        }
+
+        fd = accept(server.fds[0].fd, NULL, NULL);
+        if (fd < 0) {
+            if (errno != EWOULDBLOCK) {
+                syslog(LOG_WARNING, "accept() failed");
+                rc = -EIO;
+                goto out;
+            }
+            break;
+        }
+
+        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
+        server.fds[server.nfds].fd = fd;
+        server.fds[server.nfds].events = POLLIN;
+        server.nfds++;
+    } while (fd != -1);
+
+out:
+    pthread_rwlock_unlock(&server.lock);
+    return rc;
+}
+
+static void compress_fds(void)
+{
+    int i, j;
+    int closed = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    for (i = 1; i < server.nfds; i++) {
+        if (!server.fds[i].fd) {
+            closed++;
+            for (j = i; j < server.nfds; j++) {
+                server.fds[j].fd = server.fds[j + 1].fd;
+            }
+        }
+    }
+
+    server.nfds -= closed;
+
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static void close_fd(int idx)
+{
+    close(server.fds[idx].fd);
+    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
+    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
+    server.fds[idx].fd = 0;
+}
+
+static void run(void)
+{
+    int rc, nfds, i;
+    bool compress = false;
+
+    syslog(LOG_INFO, "Service started");
+
+    while (server.run) {
+        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
+        if (rc < 0) {
+            if (errno != EINTR) {
+                syslog(LOG_WARNING, "poll() failed");
+            }
+            continue;
+        }
+
+        if (rc == 0) {
+            continue;
+        }
+
+        nfds = server.nfds;
+        for (i = 0; i < nfds; i++) {
+            if (server.fds[i].revents == 0) {
+                continue;
+            }
+
+            if (server.fds[i].revents != POLLIN) {
+                if (i == 0) {
+                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
+                           server.fds[i].revents);
+                } else {
+                    close_fd(i);
+                    compress = true;
+                }
+                continue;
+            }
+
+            if (i == 0) {
+                rc = accept_all();
+                if (rc) {
+                    continue;
+                }
+            } else {
+                rc = read_and_process(server.fds[i].fd);
+                if (rc) {
+                    close_fd(i);
+                    compress = true;
+                }
+            }
+        }
+
+        if (compress) {
+            compress = false;
+            compress_fds();
+        }
+    }
+}
+
+static void fini_listener(void)
+{
+    int i;
+
+    if (server.fds[0].fd <= 0) {
+        return;
+    }
+
+    for (i = server.nfds - 1; i >= 0; i--) {
+        if (server.fds[i].fd) {
+            close(server.fds[i].fd);
+        }
+    }
+
+    unlink(server.args.unix_socket_path);
+}
+
+static void fini_umad(void)
+{
+    if (server.umad_agent.agent_id) {
+        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
+    }
+
+    if (server.umad_agent.port_id) {
+        umad_close_port(server.umad_agent.port_id);
+    }
+
+    hash_tbl_free();
+}
+
+static void fini(void)
+{
+    if (server.umad_recv_thread) {
+        pthread_join(server.umad_recv_thread, NULL);
+        server.umad_recv_thread = 0;
+    }
+    fini_umad();
+    fini_listener();
+    pthread_rwlock_destroy(&server.lock);
+
+    syslog(LOG_INFO, "Service going down");
+}
+
+static int init_listener(void)
+{
+    struct sockaddr_un sun;
+    int rc, on = 1;
+
+    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (server.fds[0].fd < 0) {
+        syslog(LOG_ALERT, "socket() failed");
+        return -EIO;
+    }
+
+    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
+                    sizeof(on));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "setsockopt() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "ioctl() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT,
+               "Invalid unix_socket_path, size must be less than %ld\n",
+               sizeof(sun.sun_path));
+        rc = -EINVAL;
+        goto err;
+    }
+
+    sun.sun_family = AF_UNIX;
+    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
+                  server.args.unix_socket_path);
+    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT, "Could not copy unix socket path\n");
+        rc = -EINVAL;
+        goto err;
+    }
+
+    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "bind() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "listen() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    server.fds[0].events = POLLIN;
+    server.nfds = 1;
+    server.run = true;
+
+    return 0;
+
+err:
+    close(server.fds[0].fd);
+    return rc;
+}
+
+static int init_umad(void)
+{
+    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
+
+    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
+                                               server.args.rdma_port_num);
+
+    if (server.umad_agent.port_id < 0) {
+        syslog(LOG_WARNING, "umad_open_port() failed");
+        return -EIO;
+    }
+
+    memset(&method_mask, 0, sizeof(method_mask));
+    method_mask[0] = MAD_METHOD_MASK0;
+    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
+                                               UMAD_CLASS_CM,
+                                               UMAD_SA_CLASS_VERSION,
+                                               MAD_RMPP_VERSION, method_mask);
+    if (server.umad_agent.agent_id < 0) {
+        syslog(LOG_WARNING, "umad_register() failed");
+        return -EIO;
+    }
+
+    hash_tbl_alloc();
+
+    return 0;
+}
+
+static void signal_handler(int sig, siginfo_t *siginfo, void *context)
+{
+    static bool warned;
+
+    /* Prevent stop if clients are connected */
+    if (server.nfds != 1) {
+        if (!warned) {
+            syslog(LOG_WARNING,
+                   "Can't stop while active clients exist, resend SIGINT to override");
+            warned = true;
+            return;
+        }
+    }
+
+    if (sig == SIGINT) {
+        server.run = false;
+        fini();
+    }
+
+    exit(0);
+}
+
+static int init(void)
+{
+    int rc;
+    struct sigaction sig = {0};
+
+    rc = init_listener();
+    if (rc) {
+        return rc;
+    }
+
+    rc = init_umad();
+    if (rc) {
+        return rc;
+    }
+
+    pthread_rwlock_init(&server.lock, 0);
+
+    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
+                        NULL);
+    if (rc) {
+        syslog(LOG_ERR, "Fail to create UMAD receiver thread (%d)\n", rc);
+        return rc;
+    }
+
+    sig.sa_sigaction = &signal_handler;
+    sig.sa_flags = SA_SIGINFO;
+    rc = sigaction(SIGINT, &sig, NULL);
+    if (rc < 0) {
+        syslog(LOG_ERR, "Fail to install SIGINT handler (%d)\n", errno);
+        return rc;
+    }
+
+    return 0;
+}
+
+int main(int argc, char *argv[])
+{
+    int rc;
+
+    memset(&server, 0, sizeof(server));
+
+    parse_args(argc, argv);
+
+    rc = init();
+    if (rc) {
+        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
+        rc = -EAGAIN;
+        goto out;
+    }
+
+    run();
+
+out:
+    fini();
+
+    return rc;
+}
diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
new file mode 100644
index 0000000000..03508d52b2
--- /dev/null
+++ b/contrib/rdmacm-mux/rdmacm-mux.h
@@ -0,0 +1,56 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux declarations
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMACM_MUX_H
+#define RDMACM_MUX_H
+
+#include "linux/if.h"
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "rdma/rdma_user_cm.h"
+
+typedef enum RdmaCmMuxMsgType {
+    RDMACM_MUX_MSG_TYPE_REG   = 0,
+    RDMACM_MUX_MSG_TYPE_UNREG = 1,
+    RDMACM_MUX_MSG_TYPE_MAD   = 2,
+    RDMACM_MUX_MSG_TYPE_RESP  = 3,
+} RdmaCmMuxMsgType;
+
+typedef enum RdmaCmMuxErrCode {
+    RDMACM_MUX_ERR_CODE_OK        = 0,
+    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
+    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
+    RDMACM_MUX_ERR_CODE_EACCES    = 3,
+    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
+} RdmaCmMuxErrCode;
+
+typedef struct RdmaCmMuxHdr {
+    RdmaCmMuxMsgType msg_type;
+    union ibv_gid sgid;
+    RdmaCmMuxErrCode err_code;
+} RdmaCmUHdr;
+
+typedef struct RdmaCmUMad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+} RdmaCmUMad;
+
+typedef struct RdmaCmMuxMsg {
+    RdmaCmUHdr hdr;
+    int umad_len;
+    RdmaCmUMad umad;
+} RdmaCmMuxMsg;
+
+#endif
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
                   ` (44 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Upon completion of an incoming packet the device pushes a CQE to the
driver's RX ring and notifies the driver (MSI-X).
While for data-path packets the driver needs the ability to control
whether it wishes to receive interrupts, for control-path packets such
as an incoming MAD the driver must be notified anyway; it does not even
need to re-arm the notification bit.

Enhance the notification field to support this.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_rm.c           | 12 ++++++++++--
 hw/rdma/rdma_rm_defs.h      |  8 +++++++-
 hw/rdma/vmw/pvrdma_qp_ops.c |  6 ++++--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 8d59a42cd1..4f10fcabcc 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -263,7 +263,7 @@ int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     }
 
     cq->opaque = opaque;
-    cq->notify = false;
+    cq->notify = CNT_CLEAR;
 
     rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
     if (rc) {
@@ -291,7 +291,10 @@ void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
         return;
     }
 
-    cq->notify = notify;
+    if (cq->notify != CNT_SET) {
+        cq->notify = notify ? CNT_ARM : CNT_CLEAR;
+    }
+
     pr_dbg("notify=%d\n", cq->notify);
 }
 
@@ -349,6 +352,11 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
         return -EINVAL;
     }
 
+    if (qp_type == IBV_QPT_GSI) {
+        scq->notify = CNT_SET;
+        rcq->notify = CNT_SET;
+    }
+
     qp = res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
     if (!qp) {
         return -ENOMEM;
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7228151239..9b399063d3 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -49,10 +49,16 @@ typedef struct RdmaRmPD {
     uint32_t ctx_handle;
 } RdmaRmPD;
 
+typedef enum CQNotificationType {
+    CNT_CLEAR,
+    CNT_ARM,
+    CNT_SET,
+} CQNotificationType;
+
 typedef struct RdmaRmCQ {
     RdmaBackendCQ backend_cq;
     void *opaque;
-    bool notify;
+    CQNotificationType notify;
 } RdmaRmCQ;
 
 /* MR (DMA region) */
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index c668afd0ed..762700a205 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -89,8 +89,10 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pvrdma_ring_write_inc(&dev->dsr_info.cq);
 
     pr_dbg("cq->notify=%d\n", cq->notify);
-    if (cq->notify) {
-        cq->notify = false;
+    if (cq->notify != CNT_CLEAR) {
+        if (cq->notify == CNT_ARM) {
+            cq->notify = CNT_CLEAR;
+        }
         post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
     }
 
-- 
2.17.2


* [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-17 11:42   ` Marcel Apfelbaum
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
                   ` (43 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The device does not support QP0, only QP1.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 86e8fe8ab6..3ccc9a2494 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
 
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
-    return qp->ibqp ? qp->ibqp->qp_num : 0;
+    return qp->ibqp ? qp->ibqp->qp_num : 1;
 }
 
 static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
-- 
2.17.2


* [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (2 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
                   ` (42 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The function create_ah might return NULL; in that case, abort the
send operation with an error completion.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_backend.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index d7a4bbd91f..1e148398a2 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -338,6 +338,10 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
                                 backend_dev->backend_gid_idx, dgid);
+        if (!wr.wr.ud.ah) {
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            goto out_dealloc_cqe_ctx;
+        }
         wr.wr.ud.remote_qpn = dqpn;
         wr.wr.ud.remote_qkey = dqkey;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (3 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-17 12:06   ` Marcel Apfelbaum
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void Yuval Shaia
                   ` (41 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

MAD (Management Datagram) packets are widely used by various modules
both in kernel and in user space; for example the rdma_* API, which is
used to create and maintain a "connection" layer on top of RDMA, uses
several types of MAD packets.
To support MAD packets the device uses an external utility
(contrib/rdmacm-mux) to relay packets from and to the guest driver.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
 hw/rdma/rdma_backend.h      |   4 +-
 hw/rdma/rdma_backend_defs.h |  10 +-
 hw/rdma/vmw/pvrdma.h        |   2 +
 hw/rdma/vmw/pvrdma_main.c   |   4 +-
 5 files changed, 273 insertions(+), 10 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 1e148398a2..3eb0099f8d 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -16,8 +16,13 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qlist.h"
+#include "qapi/qmp/qnum.h"
 
 #include <infiniband/verbs.h>
+#include <infiniband/umad_types.h>
+#include <infiniband/umad.h>
+#include <rdma/rdma_user_cm.h>
 
 #include "trace.h"
 #include "rdma_utils.h"
@@ -33,16 +38,25 @@
 #define VENDOR_ERR_MAD_SEND         0x206
 #define VENDOR_ERR_INVLKEY          0x207
 #define VENDOR_ERR_MR_SMALL         0x208
+#define VENDOR_ERR_INV_MAD_BUFF     0x209
+#define VENDOR_ERR_INV_NUM_SGE      0x210
 
 #define THR_NAME_LEN 16
 #define THR_POLL_TO  5000
 
+#define MAD_HDR_SIZE sizeof(struct ibv_grh)
+
 typedef struct BackendCtx {
-    uint64_t req_id;
     void *up_ctx;
     bool is_tx_req;
+    struct ibv_sge sge; /* Used to save MAD recv buffer */
 } BackendCtx;
 
+struct backend_umad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+};
+
 static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
 
 static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
@@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
+static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
+                    uint32_t num_sge)
+{
+    struct backend_umad umad = {0};
+    char *hdr, *msg;
+    int ret;
+
+    pr_dbg("num_sge=%d\n", num_sge);
+
+    if (num_sge != 2) {
+        return -EINVAL;
+    }
+
+    umad.hdr.length = sge[0].length + sge[1].length;
+    pr_dbg("msg_len=%d\n", umad.hdr.length);
+
+    if (umad.hdr.length > sizeof(umad.mad)) {
+        return -ENOMEM;
+    }
+
+    umad.hdr.addr.qpn = htobe32(1);
+    umad.hdr.addr.grh_present = 1;
+    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    umad.hdr.addr.hop_limit = 1;
+
+    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
+    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+    memcpy(&umad.mad[0], hdr, sge[0].length);
+    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+
+    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
+
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad));
+
+    pr_dbg("qemu_chr_fe_write=%d\n", ret);
+
+    return (ret != sizeof(umad));
+}
+
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
@@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = mad_send(backend_dev, sge, num_sge);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            } else {
+                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+            }
         }
-        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
         return;
     }
 
@@ -370,6 +431,48 @@ out_free_bctx:
     g_free(bctx);
 }
 
+static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
+                                         struct ibv_sge *sge, uint32_t num_sge,
+                                         void *ctx)
+{
+    BackendCtx *bctx;
+    int rc;
+    uint32_t bctx_id;
+
+    if (num_sge != 1) {
+        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
+        return VENDOR_ERR_INV_NUM_SGE;
+    }
+
+    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
+        pr_dbg("Too small buffer for MAD\n");
+        return VENDOR_ERR_INV_MAD_BUFF;
+    }
+
+    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
+    pr_dbg("length=%d\n", sge[0].length);
+    pr_dbg("lkey=%d\n", sge[0].lkey);
+
+    bctx = g_malloc0(sizeof(*bctx));
+
+    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
+    if (unlikely(rc)) {
+        g_free(bctx);
+        pr_dbg("Fail to allocate cqe_ctx\n");
+        return VENDOR_ERR_NOMEM;
+    }
+
+    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
+    bctx->up_ctx = ctx;
+    bctx->sge = *sge;
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+
+    return 0;
+}
+
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
                             RdmaDeviceResources *rdma_dev_res,
                             RdmaBackendQP *qp, uint8_t qp_type,
@@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+            }
         }
         return;
     }
@@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 
     switch (qp_type) {
     case IBV_QPT_GSI:
-        pr_dbg("QP1 unsupported\n");
         return 0;
 
     case IBV_QPT_RC:
@@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
     return 0;
 }
 
+static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
+                                 union ibv_gid *my_gid, int paylen)
+{
+    grh->paylen = htons(paylen);
+    grh->sgid = *sgid;
+    grh->dgid = *my_gid;
+
+    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
+    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+}
+
+static inline int mad_can_receieve(void *opaque)
+{
+    return sizeof(struct backend_umad);
+}
+
+static void mad_read(void *opaque, const uint8_t *buf, int size)
+{
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+    char *mad;
+    struct backend_umad *umad;
+
+    assert(size != sizeof(umad));
+    umad = (struct backend_umad *)buf;
+
+    pr_dbg("Got %d bytes\n", size);
+    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+
+#ifdef PVRDMA_DEBUG
+    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
+    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
+           hdr->base_version, hdr->mgmt_class, hdr->class_version,
+           hdr->method, hdr->status, be64toh(hdr->tid),
+           hdr->attr_id, hdr->attr_mod);
+#endif
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+    if (!o_ctx_id) {
+        pr_dbg("No more free MADs buffers, waiting for a while\n");
+        sleep(THR_POLL_TO);
+        return;
+    }
+
+    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+    if (unlikely(!bctx)) {
+        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
+        return;
+    }
+
+    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
+
+    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
+                           bctx->sge.length);
+    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                     bctx->up_ctx);
+    } else {
+        memset(mad, 0, bctx->sge.length);
+        build_mad_hdr((struct ibv_grh *)mad,
+                      (union ibv_gid *)&umad->hdr.addr.gid,
+                      &backend_dev->gid, umad->hdr.length);
+        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
+
+        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+    }
+
+    g_free(bctx);
+    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+}
+
+static int mad_init(RdmaBackendDev *backend_dev)
+{
+    struct backend_umad umad = {0};
+    int ret;
+
+    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+        pr_dbg("Missing chardev for MAD multiplexer\n");
+        return -EIO;
+    }
+
+    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
+                             mad_read, NULL, NULL, backend_dev, NULL, true);
+
+    /* Register ourself */
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad.hdr));
+    if (ret != sizeof(umad.hdr)) {
+        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
+    }
+
+    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
+    backend_dev->recv_mads_list.list = qlist_new();
+
+    return 0;
+}
+
+static void mad_stop(RdmaBackendDev *backend_dev)
+{
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+
+    pr_dbg("Closing MAD\n");
+
+    /* Clear MAD buffers list */
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    do {
+        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+        if (o_ctx_id) {
+            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+            if (bctx) {
+                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+                g_free(bctx);
+            }
+        }
+    } while (o_ctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+}
+
+static void mad_fini(RdmaBackendDev *backend_dev)
+{
+    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
+    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp)
+                      CharBackend *mad_chr_be, Error **errp)
 {
     int i;
     int ret = 0;
@@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
-
+    backend_dev->mad_chr_be = mad_chr_be;
     backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
@@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     pr_dbg("interface_id=0x%" PRIx64 "\n",
            be64_to_cpu(backend_dev->gid.global.interface_id));
 
+    ret = mad_init(backend_dev);
+    if (ret) {
+        error_setg(errp, "Fail to initialize mad");
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+
     backend_dev->comp_thread.run = false;
     backend_dev->comp_thread.is_running = false;
 
@@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
 {
     pr_dbg("Stopping rdma_backend\n");
     stop_backend_thread(&backend_dev->comp_thread);
+    mad_stop(backend_dev);
 }
 
 void rdma_backend_fini(RdmaBackendDev *backend_dev)
 {
     rdma_backend_stop(backend_dev);
+    mad_fini(backend_dev);
     g_hash_table_destroy(ah_hash);
     ibv_destroy_comp_channel(backend_dev->channel);
     ibv_close_device(backend_dev->context);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 3ccc9a2494..fc83330251 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -17,6 +17,8 @@
 #define RDMA_BACKEND_H
 
 #include "qapi/error.h"
+#include "chardev/char-fe.h"
+
 #include "rdma_rm_defs.h"
 #include "rdma_backend_defs.h"
 
@@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp);
+                      CharBackend *mad_chr_be, Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 7404f64002..2a7e667075 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -16,8 +16,9 @@
 #ifndef RDMA_BACKEND_DEFS_H
 #define RDMA_BACKEND_DEFS_H
 
-#include <infiniband/verbs.h>
 #include "qemu/thread.h"
+#include "chardev/char-fe.h"
+#include <infiniband/verbs.h>
 
 typedef struct RdmaDeviceResources RdmaDeviceResources;
 
@@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
     bool is_running; /* Set by the thread to report its status */
 } RdmaBackendThread;
 
+typedef struct RecvMadList {
+    QemuMutex lock;
+    QList *list;
+} RecvMadList;
+
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
@@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
     struct ibv_comp_channel *channel;
     uint8_t port_num;
     uint8_t backend_gid_idx;
+    RecvMadList recv_mads_list;
+    CharBackend *mad_chr_be;
 } RdmaBackendDev;
 
 typedef struct RdmaBackendPD {
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e2d9f93cdf..e3742d893a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -19,6 +19,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
+#include "chardev/char-fe.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -83,6 +84,7 @@ typedef struct PVRDMADev {
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
+    CharBackend mad_chr;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ca5fa8d981..6c8c0154fa 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
     DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
                       dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
     DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
+    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, errp);
+                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
+                           errp);
     if (rc) {
         goto out;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (4 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
                   ` (40 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

This function cannot fail, so change it to return void.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 6c8c0154fa..fc2abd34af 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -369,13 +369,11 @@ static int unquiesce_device(PVRDMADev *dev)
     return 0;
 }
 
-static int reset_device(PVRDMADev *dev)
+static void reset_device(PVRDMADev *dev)
 {
     pvrdma_stop(dev);
 
     pr_dbg("Device reset complete\n");
-
-    return 0;
 }
 
 static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
-- 
2.17.2


* [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (5 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
                   ` (39 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Commit 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF") exports the
default pkey as an external definition but omits the change from 0x7FFF
to 0xFFFF.

Fixes: 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF")

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e3742d893a..15c3f28b86 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -52,7 +52,7 @@
 #define PVRDMA_FW_VERSION    14
 
 /* Some defaults */
-#define PVRDMA_PKEY          0x7FFF
+#define PVRDMA_PKEY          0xFFFF
 
 typedef struct DSRInfo {
     dma_addr_t dma;
-- 
2.17.2


* [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (6 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
                   ` (38 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The function pvrdma_post_cqe populates the CQE entry with the opcode
from the given completion element. For receive operations the value was
not set. Fix this by setting it to IBV_WC_RECV.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 762700a205..7b0f440fda 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx = g_malloc(sizeof(CompHandlerCtx));
         comp_ctx->dev = dev;
         comp_ctx->cq_handle = qp->recv_cq_handle;
-        comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cqe.opcode = IBV_WC_RECV;
 
         rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
                                &qp->backend_qp, qp->qp_type,
-- 
2.17.2


* [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (7 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-17 12:07   ` Marcel Apfelbaum
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma Yuval Shaia
                   ` (37 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The opcode for a WC should be set by the device and not taken from the
work element.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 7b0f440fda..3388be1926 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cq_handle = qp->send_cq_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
         comp_ctx->cqe.qp = qp_handle;
-        comp_ctx->cqe.opcode = wqe->hdr.opcode;
+        comp_ctx->cqe.opcode = IBV_WC_SEND;
 
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-- 
2.17.2


* [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (8 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
@ 2018-11-13  7:12 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
                   ` (36 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:12 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

pvrdma requires that any GID attached to it is also attached to the
backend device in the host.

A new QMP message is defined so that the pvrdma device can broadcast any
change made to its GID table. This event is captured by libvirt, which
in turn updates the GID table in the backend device.
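
The subnet-prefix and interface-id fields carry the two raw in-memory
uint64 halves of the 128-bit GID. A hypothetical sketch (not libvirt's
actual code) of how a consumer on a little-endian host could reassemble
the GID, using the values from the example event in qapi/rdma.json
below:

```python
import socket
import struct

def gid_from_event(subnet_prefix: int, interface_id: int) -> str:
    """Reassemble the 128-bit GID from the two uint64 event fields.

    Assumes a little-endian host: the event carries the GID's raw
    wire-order bytes reinterpreted as host-order integers, so packing
    with '<Q' restores the original byte order.
    """
    raw = struct.pack('<Q', subnet_prefix) + struct.pack('<Q', interface_id)
    return socket.inet_ntop(socket.AF_INET6, raw)
```

With the example event's subnet-prefix of 33022 (0x80fe as a host-order
integer) this yields a link-local fe80:: GID, as expected for an entry
derived from a MAC or IPv6 link-local address.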

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 MAINTAINERS           |  1 +
 Makefile              |  3 ++-
 Makefile.objs         |  4 ++++
 qapi/qapi-schema.json |  1 +
 qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 qapi/rdma.json

diff --git a/MAINTAINERS b/MAINTAINERS
index e087d58ac6..a149f68a8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2232,6 +2232,7 @@ F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
 F: contrib/rdmacm-mux/*
+F: qapi/rdma.json
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index 94072776ff..db4ce60ee5 100644
--- a/Makefile
+++ b/Makefile
@@ -599,7 +599,8 @@ qapi-modules = $(SRC_PATH)/qapi/qapi-schema.json $(SRC_PATH)/qapi/common.json \
                $(SRC_PATH)/qapi/tpm.json \
                $(SRC_PATH)/qapi/trace.json \
                $(SRC_PATH)/qapi/transaction.json \
-               $(SRC_PATH)/qapi/ui.json
+               $(SRC_PATH)/qapi/ui.json \
+               $(SRC_PATH)/qapi/rdma.json
 
 qapi/qapi-builtin-types.c qapi/qapi-builtin-types.h \
 qapi/qapi-types.c qapi/qapi-types.h \
diff --git a/Makefile.objs b/Makefile.objs
index cc7df3ad80..76d8028f2f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -21,6 +21,7 @@ util-obj-y += qapi/qapi-types-tpm.o
 util-obj-y += qapi/qapi-types-trace.o
 util-obj-y += qapi/qapi-types-transaction.o
 util-obj-y += qapi/qapi-types-ui.o
+util-obj-y += qapi/qapi-types-rdma.o
 util-obj-y += qapi/qapi-builtin-visit.o
 util-obj-y += qapi/qapi-visit.o
 util-obj-y += qapi/qapi-visit-block-core.o
@@ -40,6 +41,7 @@ util-obj-y += qapi/qapi-visit-tpm.o
 util-obj-y += qapi/qapi-visit-trace.o
 util-obj-y += qapi/qapi-visit-transaction.o
 util-obj-y += qapi/qapi-visit-ui.o
+util-obj-y += qapi/qapi-visit-rdma.o
 util-obj-y += qapi/qapi-events.o
 util-obj-y += qapi/qapi-events-block-core.o
 util-obj-y += qapi/qapi-events-block.o
@@ -58,6 +60,7 @@ util-obj-y += qapi/qapi-events-tpm.o
 util-obj-y += qapi/qapi-events-trace.o
 util-obj-y += qapi/qapi-events-transaction.o
 util-obj-y += qapi/qapi-events-ui.o
+util-obj-y += qapi/qapi-events-rdma.o
 util-obj-y += qapi/qapi-introspect.o
 
 chardev-obj-y = chardev/
@@ -155,6 +158,7 @@ common-obj-y += qapi/qapi-commands-tpm.o
 common-obj-y += qapi/qapi-commands-trace.o
 common-obj-y += qapi/qapi-commands-transaction.o
 common-obj-y += qapi/qapi-commands-ui.o
+common-obj-y += qapi/qapi-commands-rdma.o
 common-obj-y += qapi/qapi-introspect.o
 common-obj-y += qmp.o hmp.o
 endif
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 65b6dc2f6f..a650d80f83 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -94,3 +94,4 @@
 { 'include': 'trace.json' }
 { 'include': 'introspect.json' }
 { 'include': 'misc.json' }
+{ 'include': 'rdma.json' }
diff --git a/qapi/rdma.json b/qapi/rdma.json
new file mode 100644
index 0000000000..804c68ab36
--- /dev/null
+++ b/qapi/rdma.json
@@ -0,0 +1,38 @@
+# -*- Mode: Python -*-
+#
+
+##
+# = RDMA device
+##
+
+##
+# @RDMA_GID_STATUS_CHANGED:
+#
+# Emitted when guest driver adds/deletes GID to/from device
+#
+# @netdev: RoCE Network Device name - char *
+#
+# @gid-status: Add or delete indication - bool
+#
+# @subnet-prefix: Subnet Prefix - uint64
+#
+# @interface-id: Interface ID - uint64
+#
+# Since: 3.2
+#
+# Example:
+#
+# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
+#     "event": "RDMA_GID_STATUS_CHANGED",
+#     "data":
+#         {"netdev": "bridge0",
+#         "interface-id": 15880512517475447892,
+#         "gid-status": true,
+#         "subnet-prefix": 33022}}
+#
+##
+{ 'event': 'RDMA_GID_STATUS_CHANGED',
+  'data': { 'netdev'        : 'str',
+            'gid-status'    : 'bool',
+            'subnet-prefix' : 'uint64',
+            'interface-id'  : 'uint64' } }
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (9 preceding siblings ...)
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:48   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file Yuval Shaia
                   ` (35 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second
by the first IPv6 address and the third by the IPv4 address. Other
entries can be added by adding more IP addresses. The reverse also
holds, i.e. whenever an address is removed, the corresponding GID entry
is removed.

The process is driven by the network and RDMA stacks. Whenever an
address is added, the ib_core driver is notified and calls the device
driver's add_gid function, which in turn updates the device.

To support this in the pvrdma device we need to hook into the
create_bind and destroy_bind HW commands triggered by the pvrdma driver
in the guest. Whenever a change is made to the pvrdma device's GID
table, a special QMP message is sent to be processed by libvirt, which
updates the address of the backend Ethernet device.
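
The MAC-derived GID entry mentioned above is the standard link-local
GID built from the modified EUI-64 form of the MAC address (the usual
RoCE/IPv6 convention). A hedged sketch of that derivation, for
illustration only (not code from this series):

```python
import socket

def mac_to_link_local_gid(mac: str) -> str:
    """Derive the link-local GID (fe80::/64 + modified EUI-64) of a MAC.

    Standard convention: flip the universal/local bit of the first
    octet and insert 0xff, 0xfe between the two halves of the MAC.
    """
    b = bytes(int(octet, 16) for octet in mac.split(':'))
    eui64 = bytes([b[0] ^ 0x02]) + b[1:3] + b'\xff\xfe' + b[3:6]
    return socket.inet_ntop(socket.AF_INET6, b'\xfe\x80' + b'\x00' * 6 + eui64)
```

For example, a guest NIC with MAC 52:54:00:12:34:56 maps to the GID
fe80::5054:ff:fe12:3456, which is the entry the guest driver would then
push down through the create_bind command.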

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 243 +++++++++++++++++++++++-------------
 hw/rdma/rdma_backend.h      |  22 ++--
 hw/rdma/rdma_backend_defs.h |   3 +-
 hw/rdma/rdma_rm.c           | 104 ++++++++++++++-
 hw/rdma/rdma_rm.h           |  17 ++-
 hw/rdma/rdma_rm_defs.h      |   9 +-
 hw/rdma/rdma_utils.h        |  15 +++
 hw/rdma/vmw/pvrdma.h        |   2 +-
 hw/rdma/vmw/pvrdma_cmd.c    |  55 ++++----
 hw/rdma/vmw/pvrdma_main.c   |  25 +---
 hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
 11 files changed, 370 insertions(+), 145 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 3eb0099f8d..5675504165 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -18,12 +18,14 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qnum.h"
+#include "qapi/qapi-events-rdma.h"
 
 #include <infiniband/verbs.h>
 #include <infiniband/umad_types.h>
 #include <infiniband/umad.h>
 #include <rdma/rdma_user_cm.h>
 
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 #include "trace.h"
 #include "rdma_utils.h"
 #include "rdma_rm.h"
@@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
-static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
-                    uint32_t num_sge)
+static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
+                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
 {
-    struct backend_umad umad = {0};
-    char *hdr, *msg;
+    RdmaCmMuxMsg msg = {0};
+    char *hdr, *data;
     int ret;
 
     pr_dbg("num_sge=%d\n", num_sge);
@@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
         return -EINVAL;
     }
 
-    umad.hdr.length = sge[0].length + sge[1].length;
-    pr_dbg("msg_len=%d\n", umad.hdr.length);
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
+    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
 
-    if (umad.hdr.length > sizeof(umad.mad)) {
+    msg.umad_len = sge[0].length + sge[1].length;
+    pr_dbg("umad_len=%d\n", msg.umad_len);
+
+    if (msg.umad_len > sizeof(msg.umad.mad)) {
         return -ENOMEM;
     }
 
-    umad.hdr.addr.qpn = htobe32(1);
-    umad.hdr.addr.grh_present = 1;
-    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    umad.hdr.addr.hop_limit = 1;
+    msg.umad.hdr.addr.qpn = htobe32(1);
+    msg.umad.hdr.addr.grh_present = 1;
+    pr_dbg("sgid_idx=%d\n", sgid_idx);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
+    msg.umad.hdr.addr.gid_index = sgid_idx;
+    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
+    msg.umad.hdr.addr.hop_limit = 1;
 
     hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
-    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
+    pr_dbg_buf("mad_data", data, sge[1].length);
 
-    memcpy(&umad.mad[0], hdr, sge[0].length);
-    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
+    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
 
-    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
     rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
 
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
 
     pr_dbg("qemu_chr_fe_write=%d\n", ret);
 
-    return (ret != sizeof(umad));
+    return (ret != sizeof(msg));
 }
 
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
-                            union ibv_gid *dgid, uint32_t dqpn,
-                            uint32_t dqkey, void *ctx)
+                            uint8_t sgid_idx, union ibv_gid *sgid,
+                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
+                            void *ctx)
 {
     BackendCtx *bctx;
     struct ibv_sge new_sge[MAX_SGE];
@@ -361,7 +372,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            rc = mad_send(backend_dev, sge, num_sge);
+            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
                 comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
@@ -397,8 +408,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     }
 
     if (qp_type == IBV_QPT_UD) {
-        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
-                                backend_dev->backend_gid_idx, dgid);
+        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
@@ -703,9 +713,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
 }
 
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey)
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
 {
     struct ibv_qp_attr attr = {0};
     union ibv_gid ibv_gid = {
@@ -717,13 +727,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
     attr.qp_state = IBV_QPS_RTR;
     attr_mask = IBV_QP_STATE;
 
+    qp->sgid_idx = sgid_idx;
+
     switch (qp_type) {
     case IBV_QPT_RC:
         pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
                be64_to_cpu(ibv_gid.global.subnet_prefix),
                be64_to_cpu(ibv_gid.global.interface_id));
         pr_dbg("dqpn=0x%x\n", dqpn);
-        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
+        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
         pr_dbg("sport_num=%d\n", backend_dev->port_num);
         pr_dbg("rq_psn=0x%x\n", rq_psn);
 
@@ -735,7 +747,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         attr.ah_attr.is_global      = 1;
         attr.ah_attr.grh.hop_limit  = 1;
         attr.ah_attr.grh.dgid       = ibv_gid;
-        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
+        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
         attr.rq_psn                 = rq_psn;
 
         attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
@@ -744,8 +756,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         break;
 
     case IBV_QPT_UD:
+        pr_dbg("qkey=0x%x\n", qkey);
         if (use_qkey) {
-            pr_dbg("qkey=0x%x\n", qkey);
             attr.qkey = qkey;
             attr_mask |= IBV_QP_QKEY;
         }
@@ -861,13 +873,13 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
     grh->dgid = *my_gid;
 
     pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
-    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
-    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
 }
 
 static inline int mad_can_receieve(void *opaque)
 {
-    return sizeof(struct backend_umad);
+    return sizeof(RdmaCmMuxMsg);
 }
 
 static void mad_read(void *opaque, const uint8_t *buf, int size)
@@ -877,13 +889,13 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     unsigned long cqe_ctx_id;
     BackendCtx *bctx;
     char *mad;
-    struct backend_umad *umad;
+    RdmaCmMuxMsg *msg;
 
-    assert(size != sizeof(umad));
-    umad = (struct backend_umad *)buf;
+    assert(size != sizeof(msg));
+    msg = (RdmaCmMuxMsg *)buf;
 
     pr_dbg("Got %d bytes\n", size);
-    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+    pr_dbg("umad_len=%d\n", msg->umad_len);
 
 #ifdef PVRDMA_DEBUG
     struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
@@ -913,15 +925,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
-    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
         comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
                      bctx->up_ctx);
     } else {
+        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
-                      (union ibv_gid *)&umad->hdr.addr.gid,
-                      &backend_dev->gid, umad->hdr.length);
-        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
+                      msg->umad_len);
+        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
         comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
@@ -933,10 +946,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
 static int mad_init(RdmaBackendDev *backend_dev)
 {
-    struct backend_umad umad = {0};
     int ret;
 
-    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+    ret = qemu_chr_fe_backend_connected(backend_dev->mad_chr_be);
+    if (!ret) {
         pr_dbg("Missing chardev for MAD multiplexer\n");
         return -EIO;
     }
@@ -944,14 +957,6 @@ static int mad_init(RdmaBackendDev *backend_dev)
     qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
                              mad_read, NULL, NULL, backend_dev, NULL, true);
 
-    /* Register ourself */
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad.hdr));
-    if (ret != sizeof(umad.hdr)) {
-        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
-    }
-
     qemu_mutex_init(&backend_dev->recv_mads_list.lock);
     backend_dev->recv_mads_list.list = qlist_new();
 
@@ -988,23 +993,120 @@ static void mad_fini(RdmaBackendDev *backend_dev)
     qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
 }
 
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid)
+{
+    union ibv_gid sgid;
+    int ret;
+    int i = 0;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    do {
+        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
+                            &sgid);
+        i++;
+    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
+
+    pr_dbg("gid_index=%d\n", i - 1);
+
+    return ret ? ret : i - 1;
+}
+
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", msg.hdr.err_code);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, true,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return ret;
+}
+
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_UNREG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n",
+               msg.hdr.err_code);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, false,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return 0;
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp)
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp)
 {
     int i;
     int ret = 0;
     int num_ibv_devices;
     struct ibv_device **dev_list;
-    struct ibv_port_attr port_attr;
 
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
     backend_dev->mad_chr_be = mad_chr_be;
-    backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
 
@@ -1041,9 +1143,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         backend_dev->ib_dev = *dev_list;
     }
 
-    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
-           ibv_get_device_name(backend_dev->ib_dev),
-           backend_dev->port_num, backend_dev->backend_gid_idx);
+    pr_dbg("Using backend device %s, port %d\n",
+           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
 
     backend_dev->context = ibv_open_device(backend_dev->ib_dev);
     if (!backend_dev->context) {
@@ -1060,20 +1161,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     }
     pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
 
-    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
-                         &port_attr);
-    if (ret) {
-        error_setg(errp, "Error %d from ibv_query_port", ret);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-
-    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
-        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
-                   port_attr.gid_tbl_len);
-        goto out_destroy_comm_channel;
-    }
-
     ret = init_device_caps(backend_dev, dev_attr);
     if (ret) {
         error_setg(errp, "Failed to initialize device capabilities");
@@ -1081,18 +1168,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         goto out_destroy_comm_channel;
     }
 
-    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
-                         backend_dev->backend_gid_idx, &backend_dev->gid);
-    if (ret) {
-        error_setg(errp, "Failed to query gid %d",
-                   backend_dev->backend_gid_idx);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
-    pr_dbg("interface_id=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.interface_id));
 
     ret = mad_init(backend_dev);
     if (ret) {
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index fc83330251..59ad2b874b 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -28,11 +28,6 @@ enum ibv_special_qp_type {
     IBV_QPT_GSI = 1,
 };
 
-static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
-{
-    return &dev->gid;
-}
-
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
     return qp->ibqp ? qp->ibqp->qp_num : 1;
@@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp);
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
 void rdma_backend_register_comp_handler(void (*handler)(int status,
@@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
                                uint8_t qp_type, uint32_t qkey);
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey);
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
                               uint32_t sq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
@@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
+                            uint8_t sgid_idx, union ibv_gid *sgid,
                             union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
                             void *ctx);
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 2a7e667075..ff8b2426a0 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -37,14 +37,12 @@ typedef struct RecvMadList {
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
-    union ibv_gid gid;
     PCIDevice *dev;
     RdmaDeviceResources *rdma_dev_res;
     struct ibv_device *ib_dev;
     struct ibv_context *context;
     struct ibv_comp_channel *channel;
     uint8_t port_num;
-    uint8_t backend_gid_idx;
     RecvMadList recv_mads_list;
     CharBackend *mad_chr_be;
 } RdmaBackendDev;
@@ -66,6 +64,7 @@ typedef struct RdmaBackendCQ {
 typedef struct RdmaBackendQP {
     struct ibv_pd *ibpd;
     struct ibv_qp *ibqp;
+    uint8_t sgid_idx;
 } RdmaBackendQP;
 
 #endif
diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 4f10fcabcc..fe0979415d 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -391,7 +391,7 @@ out_dealloc_qp:
 }
 
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn)
@@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int ret;
 
     pr_dbg("qpn=0x%x\n", qp_handle);
+    pr_dbg("qkey=0x%x\n", qkey);
 
     qp = rdma_rm_get_qp(dev_res, qp_handle);
     if (!qp) {
@@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         }
 
         if (qp->qp_state == IBV_QPS_RTR) {
+            /* Get backend gid index */
+            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
+            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
+                                                     sgid_idx);
+            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
+                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
+                return -EIO;
+            }
+
             ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
-                                            qp->qp_type, dgid, dqpn, rq_psn,
-                                            qkey, attr_mask & IBV_QP_QKEY);
+                                            qp->qp_type, sgid_idx, dgid, dqpn,
+                                            rq_psn, qkey,
+                                            attr_mask & IBV_QP_QKEY);
             if (ret) {
                 return -EIO;
             }
@@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
     res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
 }
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
+    if (rc <= 0) {
+        pr_dbg("Fail to add gid\n");
+        return -EINVAL;
+    }
+
+    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+
+    return 0;
+}
+
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_del_gid(backend_dev, ifname,
+                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+    if (rc < 0) {
+        pr_dbg("Fail to delete gid\n");
+        return -EINVAL;
+    }
+
+    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
+    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+
+    return 0;
+}
+
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx)
+{
+    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
+        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
+        return -EINVAL;
+    }
+
+    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+        rdma_backend_get_gid_index(backend_dev,
+                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+    }
+
+    pr_dbg("backend_gid_index=%d\n",
+           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+
+    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+}
+
 static void destroy_qp_hash_key(gpointer data)
 {
     g_bytes_unref(data);
 }
 
+static void init_ports(RdmaDeviceResources *dev_res)
+{
+    int i, j;
+
+    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        dev_res->ports[i].state = IBV_PORT_DOWN;
+        for (j = 0; j < MAX_PORT_GIDS; j++) {
+            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
+        }
+    }
+}
+
+static void fini_ports(RdmaDeviceResources *dev_res,
+                       RdmaBackendDev *backend_dev, const char *ifname)
+{
+    int i;
+
+    dev_res->ports[0].state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
+    }
+}
+
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp)
 {
@@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                        dev_attr->max_qp_wr, sizeof(void *));
     res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
 
+    init_ports(dev_res);
+
     return 0;
 }
 
-void rdma_rm_fini(RdmaDeviceResources *dev_res)
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname)
 {
+    fini_ports(dev_res, backend_dev, ifname);
+
     res_tbl_free(&dev_res->uc_tbl);
     res_tbl_free(&dev_res->cqe_ctx_tbl);
     res_tbl_free(&dev_res->qp_tbl);
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index b4e04cc7b4..a7169b4e89 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -22,7 +22,8 @@
 
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp);
-void rdma_rm_fini(RdmaDeviceResources *dev_res);
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname);
 
 int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
                      uint32_t *pd_handle, uint32_t ctx_handle);
@@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
                      uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
 RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn);
@@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
 void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx);
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx);
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx);
+static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
+                                             int sgid_idx)
+{
+    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+}
+
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 9b399063d3..7b3435f991 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -19,7 +19,7 @@
 #include "rdma_backend_defs.h"
 
 #define MAX_PORTS             1
-#define MAX_PORT_GIDS         1
+#define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
 #define MAX_PKEYS             MAX_PORT_PKEYS
@@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
     enum ibv_qp_state qp_state;
 } RdmaRmQP;
 
+typedef struct RdmaRmGid {
+    union ibv_gid gid;
+    int backend_gid_index;
+} RdmaRmGid;
+
 typedef struct RdmaRmPort {
-    union ibv_gid gid_tbl[MAX_PORT_GIDS];
+    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
     enum ibv_port_state state;
 } RdmaRmPort;
 
diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 04c7c2ef5b..989db249ef 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "hw/pci/pci.h"
 #include "sysemu/dma.h"
+#include <stdio.h>
 
 #define pr_info(fmt, ...) \
     fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
@@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
 #define pr_dbg(fmt, ...) \
     fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
             __func__, __LINE__, ## __VA_ARGS__)
+
+#define pr_dbg_buf(title, buf, len) \
+{ \
+    char *b = g_malloc0(len * 3 + 1); \
+    char b1[4]; \
+    for (int i = 0; i < len; i++) { \
+        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
+        strcat(b, b1); \
+    } \
+    pr_dbg("%s (%d): %s\n", title, len, b); \
+    g_free(b); \
+}
+
 #else
 #define init_pr_dbg(void)
 #define pr_dbg(fmt, ...)
+#define pr_dbg_buf(title, buf, len)
 #endif
 
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 15c3f28b86..b019cb843a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -79,8 +79,8 @@ typedef struct PVRDMADev {
     int interrupt_mask;
     struct ibv_device_attr dev_attr;
     uint64_t node_guid;
+    char *backend_eth_device_name;
     char *backend_device_name;
-    uint8_t backend_gid_idx;
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 57d6f41ae6..a334f6205e 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rsp->hdr.response = cmd->hdr.response;
     rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
 
-    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
-                                 cmd->qp_handle, cmd->attr_mask,
-                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
-                                 cmd->attrs.dest_qp_num,
-                                 (enum ibv_qp_state)cmd->attrs.qp_state,
-                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
-                                 cmd->attrs.sq_psn);
+    /* No need to verify sgid_index since it is u8 */
+
+    rsp->hdr.err =
+        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
+                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
+                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
+                          cmd->attrs.dest_qp_num,
+                          (enum ibv_qp_state)cmd->attrs.qp_state,
+                          cmd->attrs.qkey, cmd->attrs.rq_psn,
+                          cmd->attrs.sq_psn);
 
     pr_dbg("ret=%d\n", rsp->hdr.err);
     return rsp->hdr.err;
@@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                        union pvrdma_cmd_resp *rsp)
 {
     struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
-#ifdef PVRDMA_DEBUG
-    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
-    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
-#endif
+    int rc;
+    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
 
     pr_dbg("index=%d\n", cmd->index);
 
@@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
-           (long long unsigned int)be64_to_cpu(*subnet),
-           (long long unsigned int)be64_to_cpu(*if_id));
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
 
-    /* Driver forces to one port only */
-    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
-           sizeof(cmd->new_gid));
+    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                         dev->backend_eth_device_name, gid, cmd->index);
+    if (rc < 0) {
+        return -EINVAL;
+    }
 
     /* TODO: Since drivers stores node_guid at load_dsr phase then this
      * assignment is not relevant, i need to figure out a way how to
      * retrieve MAC of our netdev */
-    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
-    pr_dbg("dev->node_guid=0x%llx\n",
-           (long long unsigned int)be64_to_cpu(dev->node_guid));
+    if (!cmd->index) {
+        dev->node_guid =
+            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
+        pr_dbg("dev->node_guid=0x%llx\n",
+               (long long unsigned int)be64_to_cpu(dev->node_guid));
+    }
 
     return 0;
 }
@@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                         union pvrdma_cmd_resp *rsp)
 {
+    int rc;
+
     struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
 
     pr_dbg("index=%d\n", cmd->index);
@@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
-           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
+    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                        dev->backend_eth_device_name, cmd->index);
+
+    if (rc < 0) {
+        rsp->hdr.err = rc;
+        goto out;
+    }
 
     return 0;
 }
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fc2abd34af..ac8c092db0 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -36,9 +36,9 @@
 #include "pvrdma_qp_ops.h"
 
 static Property pvrdma_dev_properties[] = {
-    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
-    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
-    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
+    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
+    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
+    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
     DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
                        MAX_MR_SIZE),
     DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
@@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     pr_dbg("Initialized\n");
 }
 
-static void init_ports(PVRDMADev *dev, Error **errp)
-{
-    int i;
-
-    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
-
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
-    }
-}
-
 static void uninit_msix(PCIDevice *pdev, int used_vectors)
 {
     PVRDMADev *dev = PVRDMA_DEV(pdev);
@@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
 
     pvrdma_qp_ops_fini();
 
-    rdma_rm_fini(&dev->rdma_dev_res);
+    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
+                 dev->backend_eth_device_name);
 
     rdma_backend_fini(&dev->backend_dev);
 
@@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
-                           errp);
+                           &dev->dev_attr, &dev->mad_chr, errp);
     if (rc) {
         goto out;
     }
@@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
-    init_ports(dev, errp);
-
     rc = pvrdma_qp_ops_init();
     if (rc) {
         goto out;
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 3388be1926..2130824098 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
     RdmaRmQP *qp;
     PvrdmaSqWqe *wqe;
     PvrdmaRing *ring;
+    int sgid_idx;
+    union ibv_gid *sgid;
 
     pr_dbg("qp_handle=0x%x\n", qp_handle);
 
@@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.opcode = IBV_WC_SEND;
 
+        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
+        if (!sgid) {
+            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
+               sgid->global.interface_id);
+
+        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
+                                                 &dev->backend_dev,
+                                                 wqe->hdr.wr.ud.av.gid_index);
+        if (sgid_idx <= 0) {
+            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
+                   wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
+                               sgid_idx, sgid,
                                (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
                                wqe->hdr.wr.ud.remote_qpn,
                                wqe->hdr.wr.ud.remote_qkey, comp_ctx);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (10 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
                   ` (34 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

pvrdma setup requires a vmxnet3 device on PCI function 0 and a pvrdma
device on PCI function 1.
The pvrdma device needs to access the vmxnet3 device object for several
reasons:
1. To make sure PCI function 0 is a vmxnet3 device.
2. To monitor the vmxnet3 device state.
3. To configure node_guid according to the vmxnet3 device's MAC address.

To make this access possible, the definition of VMXNET3State is moved to
a new header file.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Dmitry Fleytman <dmitry.fleytman@gmail.com>
---
 hw/net/vmxnet3.c      | 116 +-----------------------------------
 hw/net/vmxnet3_defs.h | 133 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 115 deletions(-)
 create mode 100644 hw/net/vmxnet3_defs.h

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 3648630386..54746a4030 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -18,7 +18,6 @@
 #include "qemu/osdep.h"
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
-#include "net/net.h"
 #include "net/tap.h"
 #include "net/checksum.h"
 #include "sysemu/sysemu.h"
@@ -29,6 +28,7 @@
 #include "migration/register.h"
 
 #include "vmxnet3.h"
+#include "vmxnet3_defs.h"
 #include "vmxnet_debug.h"
 #include "vmware_utils.h"
 #include "net_tx_pkt.h"
@@ -131,23 +131,11 @@ typedef struct VMXNET3Class {
     DeviceRealize parent_dc_realize;
 } VMXNET3Class;
 
-#define TYPE_VMXNET3 "vmxnet3"
-#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
-
 #define VMXNET3_DEVICE_CLASS(klass) \
     OBJECT_CLASS_CHECK(VMXNET3Class, (klass), TYPE_VMXNET3)
 #define VMXNET3_DEVICE_GET_CLASS(obj) \
     OBJECT_GET_CLASS(VMXNET3Class, (obj), TYPE_VMXNET3)
 
-/* Cyclic ring abstraction */
-typedef struct {
-    hwaddr pa;
-    uint32_t size;
-    uint32_t cell_size;
-    uint32_t next;
-    uint8_t gen;
-} Vmxnet3Ring;
-
 static inline void vmxnet3_ring_init(PCIDevice *d,
 				     Vmxnet3Ring *ring,
                                      hwaddr pa,
@@ -245,108 +233,6 @@ vmxnet3_dump_rx_descr(struct Vmxnet3_RxDesc *descr)
               descr->rsvd, descr->dtype, descr->ext1, descr->btype);
 }
 
-/* Device state and helper functions */
-#define VMXNET3_RX_RINGS_PER_QUEUE (2)
-
-typedef struct {
-    Vmxnet3Ring tx_ring;
-    Vmxnet3Ring comp_ring;
-
-    uint8_t intr_idx;
-    hwaddr tx_stats_pa;
-    struct UPT1_TxStats txq_stats;
-} Vmxnet3TxqDescr;
-
-typedef struct {
-    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
-    Vmxnet3Ring comp_ring;
-    uint8_t intr_idx;
-    hwaddr rx_stats_pa;
-    struct UPT1_RxStats rxq_stats;
-} Vmxnet3RxqDescr;
-
-typedef struct {
-    bool is_masked;
-    bool is_pending;
-    bool is_asserted;
-} Vmxnet3IntState;
-
-typedef struct {
-        PCIDevice parent_obj;
-        NICState *nic;
-        NICConf conf;
-        MemoryRegion bar0;
-        MemoryRegion bar1;
-        MemoryRegion msix_bar;
-
-        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
-        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
-
-        /* Whether MSI-X support was installed successfully */
-        bool msix_used;
-        hwaddr drv_shmem;
-        hwaddr temp_shared_guest_driver_memory;
-
-        uint8_t txq_num;
-
-        /* This boolean tells whether RX packet being indicated has to */
-        /* be split into head and body chunks from different RX rings  */
-        bool rx_packets_compound;
-
-        bool rx_vlan_stripping;
-        bool lro_supported;
-
-        uint8_t rxq_num;
-
-        /* Network MTU */
-        uint32_t mtu;
-
-        /* Maximum number of fragments for indicated TX packets */
-        uint32_t max_tx_frags;
-
-        /* Maximum number of fragments for indicated RX packets */
-        uint16_t max_rx_frags;
-
-        /* Index for events interrupt */
-        uint8_t event_int_idx;
-
-        /* Whether automatic interrupts masking enabled */
-        bool auto_int_masking;
-
-        bool peer_has_vhdr;
-
-        /* TX packets to QEMU interface */
-        struct NetTxPkt *tx_pkt;
-        uint32_t offload_mode;
-        uint32_t cso_or_gso_size;
-        uint16_t tci;
-        bool needs_vlan;
-
-        struct NetRxPkt *rx_pkt;
-
-        bool tx_sop;
-        bool skip_current_tx_pkt;
-
-        uint32_t device_active;
-        uint32_t last_command;
-
-        uint32_t link_status_and_speed;
-
-        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
-
-        uint32_t temp_mac;   /* To store the low part first */
-
-        MACAddr perm_mac;
-        uint32_t vlan_table[VMXNET3_VFT_SIZE];
-        uint32_t rx_mode;
-        MACAddr *mcast_list;
-        uint32_t mcast_list_len;
-        uint32_t mcast_list_buff_size; /* needed for live migration. */
-
-        /* Compatibility flags for migration */
-        uint32_t compat_flags;
-} VMXNET3State;
-
 /* Interrupt management */
 
 /*
diff --git a/hw/net/vmxnet3_defs.h b/hw/net/vmxnet3_defs.h
new file mode 100644
index 0000000000..6c19d29b12
--- /dev/null
+++ b/hw/net/vmxnet3_defs.h
@@ -0,0 +1,133 @@
+/*
+ * QEMU VMWARE VMXNET3 paravirtual NIC
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "net/net.h"
+#include "hw/net/vmxnet3.h"
+
+#define TYPE_VMXNET3 "vmxnet3"
+#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
+
+/* Device state and helper functions */
+#define VMXNET3_RX_RINGS_PER_QUEUE (2)
+
+/* Cyclic ring abstraction */
+typedef struct {
+    hwaddr pa;
+    uint32_t size;
+    uint32_t cell_size;
+    uint32_t next;
+    uint8_t gen;
+} Vmxnet3Ring;
+
+typedef struct {
+    Vmxnet3Ring tx_ring;
+    Vmxnet3Ring comp_ring;
+
+    uint8_t intr_idx;
+    hwaddr tx_stats_pa;
+    struct UPT1_TxStats txq_stats;
+} Vmxnet3TxqDescr;
+
+typedef struct {
+    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
+    Vmxnet3Ring comp_ring;
+    uint8_t intr_idx;
+    hwaddr rx_stats_pa;
+    struct UPT1_RxStats rxq_stats;
+} Vmxnet3RxqDescr;
+
+typedef struct {
+    bool is_masked;
+    bool is_pending;
+    bool is_asserted;
+} Vmxnet3IntState;
+
+typedef struct {
+        PCIDevice parent_obj;
+        NICState *nic;
+        NICConf conf;
+        MemoryRegion bar0;
+        MemoryRegion bar1;
+        MemoryRegion msix_bar;
+
+        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
+        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
+
+        /* Whether MSI-X support was installed successfully */
+        bool msix_used;
+        hwaddr drv_shmem;
+        hwaddr temp_shared_guest_driver_memory;
+
+        uint8_t txq_num;
+
+        /* This boolean tells whether RX packet being indicated has to */
+        /* be split into head and body chunks from different RX rings  */
+        bool rx_packets_compound;
+
+        bool rx_vlan_stripping;
+        bool lro_supported;
+
+        uint8_t rxq_num;
+
+        /* Network MTU */
+        uint32_t mtu;
+
+        /* Maximum number of fragments for indicated TX packets */
+        uint32_t max_tx_frags;
+
+        /* Maximum number of fragments for indicated RX packets */
+        uint16_t max_rx_frags;
+
+        /* Index for events interrupt */
+        uint8_t event_int_idx;
+
+        /* Whether automatic interrupts masking enabled */
+        bool auto_int_masking;
+
+        bool peer_has_vhdr;
+
+        /* TX packets to QEMU interface */
+        struct NetTxPkt *tx_pkt;
+        uint32_t offload_mode;
+        uint32_t cso_or_gso_size;
+        uint16_t tci;
+        bool needs_vlan;
+
+        struct NetRxPkt *rx_pkt;
+
+        bool tx_sop;
+        bool skip_current_tx_pkt;
+
+        uint32_t device_active;
+        uint32_t last_command;
+
+        uint32_t link_status_and_speed;
+
+        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
+
+        uint32_t temp_mac;   /* To store the low part first */
+
+        MACAddr perm_mac;
+        uint32_t vlan_table[VMXNET3_VFT_SIZE];
+        uint32_t rx_mode;
+        MACAddr *mcast_list;
+        uint32_t mcast_list_len;
+        uint32_t mcast_list_buff_size; /* needed for live migration. */
+
+        /* Compatibility flags for migration */
+        uint32_t compat_flags;
+} VMXNET3State;
-- 
2.17.2


* [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (11 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
                   ` (33 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The guest driver enforces this layout, so the device emulation should
enforce it too.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      | 2 ++
 hw/rdma/vmw/pvrdma_main.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index b019cb843a..10a3c4fb7c 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -20,6 +20,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
+#include "hw/net/vmxnet3_defs.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -85,6 +86,7 @@ typedef struct PVRDMADev {
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
+    VMXNET3State *func0;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ac8c092db0..fa6468d221 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
+    /* Break if not vmxnet3 device in slot 0 */
+    dev->func0 = VMXNET3(pci_get_function_0(pdev));
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2


* [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (12 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:10   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
                   ` (32 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

node_guid should be set as soon as the device is loaded.
Set node_guid to the EUI-64 (64-bit) form of the MAC address of the
vmxnet3 device on PCI function 0.

A new helper function is added to do the conversion.
For example, the MAC 56:b6:44:e9:62:dc is converted to the GID
54b6:44ff:fee9:62dc.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_utils.h      |  9 +++++++++
 hw/rdma/vmw/pvrdma_cmd.c  | 10 ----------
 hw/rdma/vmw/pvrdma_main.c |  5 ++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 989db249ef..202abb3366 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -63,4 +63,13 @@ extern unsigned long pr_dbg_cnt;
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
 void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
 
+static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
+{
+    memcpy(eui, addr, 3);
+    eui[3] = 0xFF;
+    eui[4] = 0xFE;
+    memcpy(eui + 5, addr + 3, 3);
+    eui[0] ^= 2;
+}
+
 #endif
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index a334f6205e..2979582fac 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -592,16 +592,6 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    /* TODO: Since drivers stores node_guid at load_dsr phase then this
-     * assignment is not relevant, i need to figure out a way how to
-     * retrieve MAC of our netdev */
-    if (!cmd->index) {
-        dev->node_guid =
-            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
-        pr_dbg("dev->node_guid=0x%llx\n",
-               (long long unsigned int)be64_to_cpu(dev->node_guid));
-    }
-
     return 0;
 }
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fa6468d221..95e9322b7c 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -264,7 +264,7 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     dsr->caps.sys_image_guid = 0;
     pr_dbg("sys_image_guid=%" PRIx64 "\n", dsr->caps.sys_image_guid);
 
-    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+    dsr->caps.node_guid = dev->node_guid;
     pr_dbg("node_guid=%" PRIx64 "\n", be64_to_cpu(dsr->caps.node_guid));
 
     dsr->caps.phys_port_cnt = MAX_PORTS;
@@ -579,6 +579,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
     /* Break if not vmxnet3 device in slot 0 */
     dev->func0 = VMXNET3(pci_get_function_0(pdev));
 
+    addrconf_addr_eui48((unsigned char *)&dev->node_guid,
+                        (const char *)&dev->func0->conf.macaddr.a);
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2


* [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (13 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:11   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
                   ` (31 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The user should be able to control the device through the state of its
paired Ethernet function, so when the user runs 'ifconfig ens3 down' the
PVRDMA function should report its port as down as well.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 2979582fac..0d3c818c20 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -139,7 +139,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
     resp->hdr.err = 0;
 
-    resp->attrs.state = attrs.state;
+    resp->attrs.state = dev->func0->device_active ? attrs.state :
+                                                    PVRDMA_PORT_DOWN;
     resp->attrs.max_mtu = attrs.max_mtu;
     resp->attrs.active_mtu = attrs.active_mtu;
     resp->attrs.phys_state = attrs.phys_state;
-- 
2.17.2


* [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (14 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:19   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
                   ` (30 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Add the ability to propagate specific WC attributes, such as the GRH_BIT
flag, to the CQE.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 59 +++++++++++++++++++++++--------------
 hw/rdma/rdma_backend.h      |  4 +--
 hw/rdma/vmw/pvrdma_qp_ops.c | 31 +++++++++++--------
 3 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 5675504165..e453bda8f9 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -59,13 +59,24 @@ struct backend_umad {
     char mad[RDMA_MAX_PRIVATE_DATA];
 };
 
-static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
+static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
 
-static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     pr_err("No completion handler is registered\n");
 }
 
+static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
+                                 void *ctx)
+{
+    struct ibv_wc wc = {0};
+
+    wc.status = status;
+    wc.vendor_err = vendor_err;
+
+    comp_handler(ctx, &wc);
+}
+
 static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
 {
     int i, ne;
@@ -90,7 +101,7 @@ static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
             }
             pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
 
-            comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
+            comp_handler(bctx->up_ctx, &wc[i]);
 
             rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
             g_free(bctx);
@@ -184,8 +195,8 @@ static void start_comp_thread(RdmaBackendDev *backend_dev)
                        comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
 }
 
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx))
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                         struct ibv_wc *wc))
 {
     comp_handler = handler;
 }
@@ -369,14 +380,14 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
-                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+                complete_work(IBV_WC_SUCCESS, 0, ctx);
             }
         }
         return;
@@ -385,7 +396,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -396,21 +407,21 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
         }
         wr.wr.ud.remote_qpn = dqpn;
@@ -428,7 +439,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -497,13 +508,13 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
             }
         }
         return;
@@ -512,7 +523,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -523,14 +534,14 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -542,7 +553,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post recv WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -926,9 +937,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
     if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
-                     bctx->up_ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                      bctx->up_ctx);
     } else {
+        struct ibv_wc wc = {0};
         pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
@@ -937,7 +949,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
         memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
-        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+        wc.byte_len = msg->umad_len;
+        wc.status = IBV_WC_SUCCESS;
+        wc.wc_flags = IBV_WC_GRH;
+        comp_handler(bctx->up_ctx, &wc);
     }
 
     g_free(bctx);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 59ad2b874b..8cae40f827 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -57,8 +57,8 @@ int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
                                union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx));
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                        struct ibv_wc *wc));
 void rdma_backend_unregister_comp_handler(void);
 
 int rdma_backend_query_port(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 2130824098..300471a4c9 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -47,7 +47,7 @@ typedef struct PvrdmaRqWqe {
  * 3. Interrupt host
  */
 static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
-                           struct pvrdma_cqe *cqe)
+                           struct pvrdma_cqe *cqe, struct ibv_wc *wc)
 {
     struct pvrdma_cqe *cqe1;
     struct pvrdma_cqne *cqne;
@@ -66,6 +66,7 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pr_dbg("Writing CQE\n");
     cqe1 = pvrdma_ring_next_elem_write(ring);
     if (unlikely(!cqe1)) {
+        pr_dbg("No CQEs in ring\n");
         return -EINVAL;
     }
 
@@ -73,8 +74,20 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     cqe1->wr_id = cqe->wr_id;
     cqe1->qp = cqe->qp;
     cqe1->opcode = cqe->opcode;
-    cqe1->status = cqe->status;
-    cqe1->vendor_err = cqe->vendor_err;
+    cqe1->status = wc->status;
+    cqe1->byte_len = wc->byte_len;
+    cqe1->src_qp = wc->src_qp;
+    cqe1->wc_flags = wc->wc_flags;
+    cqe1->vendor_err = wc->vendor_err;
+
+    pr_dbg("wr_id=%" PRIx64 "\n", cqe1->wr_id);
+    pr_dbg("qp=0x%lx\n", cqe1->qp);
+    pr_dbg("opcode=%d\n", cqe1->opcode);
+    pr_dbg("status=%d\n", cqe1->status);
+    pr_dbg("byte_len=%d\n", cqe1->byte_len);
+    pr_dbg("src_qp=%d\n", cqe1->src_qp);
+    pr_dbg("wc_flags=%d\n", cqe1->wc_flags);
+    pr_dbg("vendor_err=%d\n", cqe1->vendor_err);
 
     pvrdma_ring_write_inc(ring);
 
@@ -99,18 +112,12 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     return 0;
 }
 
-static void pvrdma_qp_ops_comp_handler(int status, unsigned int vendor_err,
-                                       void *ctx)
+static void pvrdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
 
-    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
-    pr_dbg("wr_id=%" PRIx64 "\n", comp_ctx->cqe.wr_id);
-    pr_dbg("status=%d\n", status);
-    pr_dbg("vendor_err=0x%x\n", vendor_err);
-    comp_ctx->cqe.status = status;
-    comp_ctx->cqe.vendor_err = vendor_err;
-    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
+    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe, wc);
+
     g_free(ctx);
 }
 
-- 
2.17.2
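The core of this refactoring is replacing the `(status, vendor_err, ctx)` callback with one that receives the whole `struct ibv_wc`, plus a `complete_work()` shim for call sites that only have a status and vendor error. A minimal standalone sketch of that shape is below; note that `struct ibv_wc` here is a hypothetical reduced stand-in for the real definition in `<infiniband/verbs.h>`, and `record_handler`/`struct seen` are invented for illustration only:

```c
#include <stdint.h>

/* Reduced stand-in for struct ibv_wc (real one lives in infiniband/verbs.h);
 * only the fields touched by this sketch are modeled. */
struct ibv_wc {
    int status;          /* IBV_WC_SUCCESS is 0 in the real enum */
    uint32_t vendor_err;
    uint32_t byte_len;
    int wc_flags;        /* e.g. IBV_WC_GRH */
};

enum { IBV_WC_SUCCESS = 0, IBV_WC_GENERAL_ERR = 8 };
enum { IBV_WC_GRH = 1 };

/* New-style handler: gets the whole work completion, so byte_len,
 * wc_flags, src_qp, ... can reach the CQE, not just status/vendor_err. */
static void (*comp_handler)(void *ctx, struct ibv_wc *wc);

/* Mirrors complete_work() in the patch: legacy error paths that only know
 * (status, vendor_err) wrap them in a zero-initialized ibv_wc. */
static void complete_work(int status, uint32_t vendor_err, void *ctx)
{
    struct ibv_wc wc = {0};

    wc.status = status;
    wc.vendor_err = vendor_err;
    comp_handler(ctx, &wc);
}

/* Hypothetical consumer that records what the handler was given. */
struct seen { int status; uint32_t byte_len; int wc_flags; };

static void record_handler(void *ctx, struct ibv_wc *wc)
{
    struct seen *s = ctx;

    s->status = wc->status;
    s->byte_len = wc->byte_len;
    s->wc_flags = wc->wc_flags;
}
```

A success path such as `mad_read()` can now fill `byte_len` and set `IBV_WC_GRH` directly before invoking `comp_handler`, while all the error paths keep their one-line `complete_work(...)` form.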


* [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (15 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:22   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
                   ` (29 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The guest driver checks the error code in the command's response, so
make sure it is always set.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 0d3c818c20..a326c5d470 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     if (rdma_backend_query_port(&dev->backend_dev,
                                 (struct ibv_port_attr *)&attrs)) {
-        return -ENOMEM;
+        resp->hdr.err = -ENOMEM;
+        goto out;
     }
 
     memset(resp, 0, sizeof(*resp));
@@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->attrs.active_width = 1;
     resp->attrs.active_speed = 1;
 
-    return 0;
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
 }
 
 static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->pkey = PVRDMA_PKEY;
     pr_dbg("pkey=0x%x\n", resp->pkey);
 
-    return 0;
+    return resp->hdr.err;
 }
 
 static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
@@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
     cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
     if (!cq) {
         pr_dbg("Invalid CQ handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     ring = (PvrdmaRing *)cq->opaque;
@@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
@@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
     if (!qp) {
         pr_dbg("Invalid QP handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
@@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
     g_free(ring);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
@@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
                          dev->backend_eth_device_name, gid, cmd->index);
     if (rc < 0) {
-        return -EINVAL;
+        rsp->hdr.err = rc;
+        goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
@@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
                                      &resp->ctx_handle);
 
-    pr_dbg("ret=%d\n", resp->hdr.err);
-
-    return 0;
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 struct cmd_handler {
     uint32_t cmd;
@@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
     }
 
     err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
-                            dsr_info->rsp);
+                                                    dsr_info->rsp);
 out:
     set_reg_val(dev, PVRDMA_REG_ERR, err);
     post_interrupt(dev, INTR_VEC_CMD_RING);
-- 
2.17.2
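Every handler in this patch converges on the same shape: store the error in `rsp->hdr.err` (where the guest driver reads it) and return the same value (which `execute_command()` also writes to `PVRDMA_REG_ERR`). A minimal sketch of that pattern, using a hypothetical reduced response header and an invented `query_something()` handler, not the actual pvrdma structures:

```c
#include <errno.h>

/* Hypothetical reduced stand-ins for the pvrdma response structures. */
struct resp_hdr { int err; int ack; };
struct cmd_resp { struct resp_hdr hdr; int value; };

/* The pattern applied to each handler: on failure, record the error in
 * the response header (visible to the guest driver) and return it too. */
static int query_something(int backend_ok, struct cmd_resp *rsp)
{
    if (!backend_ok) {
        rsp->hdr.err = -ENOMEM;   /* driver sees the error in the response */
        goto out;
    }

    rsp->value = 42;
    rsp->hdr.err = 0;

out:
    return rsp->hdr.err;          /* same code also drives PVRDMA_REG_ERR */
}
```

The `goto out` form keeps a single exit where the debug print and return live, which is why even the trivial handlers now end with `return rsp->hdr.err;` instead of `return 0;`.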


* [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (16 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:23   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
                   ` (28 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The device supports only one port; remove the dead code that handles
more than one port.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c      | 34 ++++++++++++++++------------------
 hw/rdma/rdma_rm.h      |  2 +-
 hw/rdma/rdma_rm_defs.h |  4 ++--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index fe0979415d..0a5ab8935a 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -545,7 +545,7 @@ int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         return -EINVAL;
     }
 
-    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+    memcpy(&dev_res->port.gid_tbl[gid_idx].gid, gid, sizeof(*gid));
 
     return 0;
 }
@@ -556,15 +556,15 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int rc;
 
     rc = rdma_backend_del_gid(backend_dev, ifname,
-                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+                              &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc < 0) {
         pr_dbg("Fail to delete gid\n");
         return -EINVAL;
     }
 
-    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
-           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
-    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+    memset(dev_res->port.gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->port.gid_tbl[gid_idx].gid));
+    dev_res->port.gid_tbl[gid_idx].backend_gid_index = -1;
 
     return 0;
 }
@@ -577,16 +577,16 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
         return -EINVAL;
     }
 
-    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
-        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+    if (unlikely(dev_res->port.gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->port.gid_tbl[sgid_idx].backend_gid_index =
         rdma_backend_get_gid_index(backend_dev,
-                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+                                   &dev_res->port.gid_tbl[sgid_idx].gid);
     }
 
     pr_dbg("backend_gid_index=%d\n",
-           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+           dev_res->port.gid_tbl[sgid_idx].backend_gid_index);
 
-    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+    return dev_res->port.gid_tbl[sgid_idx].backend_gid_index;
 }
 
 static void destroy_qp_hash_key(gpointer data)
@@ -596,15 +596,13 @@ static void destroy_qp_hash_key(gpointer data)
 
 static void init_ports(RdmaDeviceResources *dev_res)
 {
-    int i, j;
+    int i;
 
-    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+    memset(&dev_res->port, 0, sizeof(dev_res->port));
 
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev_res->ports[i].state = IBV_PORT_DOWN;
-        for (j = 0; j < MAX_PORT_GIDS; j++) {
-            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
-        }
+    dev_res->port.state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        dev_res->port.gid_tbl[i].backend_gid_index = -1;
     }
 }
 
@@ -613,7 +611,7 @@ static void fini_ports(RdmaDeviceResources *dev_res,
 {
     int i;
 
-    dev_res->ports[0].state = IBV_PORT_DOWN;
+    dev_res->port.state = IBV_PORT_DOWN;
     for (i = 0; i < MAX_PORT_GIDS; i++) {
         rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
     }
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index a7169b4e89..3c602c04c0 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -79,7 +79,7 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
 static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
                                              int sgid_idx)
 {
-    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+    return &dev_res->port.gid_tbl[sgid_idx].gid;
 }
 
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7b3435f991..0ba61d1838 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -18,7 +18,7 @@
 
 #include "rdma_backend_defs.h"
 
-#define MAX_PORTS             1
+#define MAX_PORTS             1 /* Do not change - we support only one port */
 #define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
@@ -97,7 +97,7 @@ typedef struct RdmaRmPort {
 } RdmaRmPort;
 
 typedef struct RdmaDeviceResources {
-    RdmaRmPort ports[MAX_PORTS];
+    RdmaRmPort port;
     RdmaRmResTbl pd_tbl;
     RdmaRmResTbl mr_tbl;
     RdmaRmResTbl uc_tbl;
-- 
2.17.2


* [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (17 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  9:34   ` Cornelia Huck
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
                   ` (27 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Add a notifier list that is triggered when the system is shut down. This
allows devices and other components to run cleanup code before the VM
terminates.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 include/sysemu/sysemu.h |  1 +
 vl.c                    | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8d6095d98b..0d15f16492 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -80,6 +80,7 @@ void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_system_shutdown_request(ShutdownCause reason);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
+void qemu_register_shutdown_notifier(Notifier *notifier);
 void qemu_system_debug_request(void);
 void qemu_system_vmstop_request(RunState reason);
 void qemu_system_vmstop_request_prepare(void);
diff --git a/vl.c b/vl.c
index 1fcacc5caa..d33d52522c 100644
--- a/vl.c
+++ b/vl.c
@@ -1578,6 +1578,8 @@ static NotifierList suspend_notifiers =
     NOTIFIER_LIST_INITIALIZER(suspend_notifiers);
 static NotifierList wakeup_notifiers =
     NOTIFIER_LIST_INITIALIZER(wakeup_notifiers);
+static NotifierList shutdown_notifiers =
+    NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
 
 ShutdownCause qemu_shutdown_requested_get(void)
@@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
     notifier_list_notify(&powerdown_notifiers, NULL);
 }
 
+static void qemu_system_shutdown(ShutdownCause cause)
+{
+    qapi_event_send_shutdown(shutdown_caused_by_guest(cause));
+    notifier_list_notify(&shutdown_notifiers, &cause);
+}
+
 void qemu_system_powerdown_request(void)
 {
     trace_qemu_system_powerdown_request();
@@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
     notifier_list_add(&powerdown_notifiers, notifier);
 }
 
+void qemu_register_shutdown_notifier(Notifier *notifier)
+{
+    notifier_list_add(&shutdown_notifiers, notifier);
+}
+
 void qemu_system_debug_request(void)
 {
     debug_requested = 1;
@@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
     request = qemu_shutdown_requested();
     if (request) {
         qemu_kill_report();
-        qapi_event_send_shutdown(shutdown_caused_by_guest(request));
+        qemu_system_shutdown(request);
         if (no_shutdown) {
             vm_stop(RUN_STATE_SHUTDOWN);
         } else {
-- 
2.17.2


* [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (18 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:24   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
                   ` (26 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

In order to clean up external resources such as GIDs, QPs etc.,
register to receive a notification when the VM is shut down.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      |  2 ++
 hw/rdma/vmw/pvrdma_main.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 10a3c4fb7c..ffae36986e 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -17,6 +17,7 @@
 #define PVRDMA_PVRDMA_H
 
 #include "qemu/units.h"
+#include "qemu/notify.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
@@ -87,6 +88,7 @@ typedef struct PVRDMADev {
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
     VMXNET3State *func0;
+    Notifier shutdown_notifier;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 95e9322b7c..45a59cddf9 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -24,6 +24,7 @@
 #include "hw/qdev-properties.h"
 #include "cpu.h"
 #include "trace.h"
+#include "sysemu/sysemu.h"
 
 #include "../rdma_rm.h"
 #include "../rdma_backend.h"
@@ -559,6 +560,14 @@ static int pvrdma_check_ram_shared(Object *obj, void *opaque)
     return 0;
 }
 
+static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
+{
+    PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    pvrdma_fini(pci_dev);
+}
+
 static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 {
     int rc;
@@ -623,6 +632,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
+    dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
+    qemu_register_shutdown_notifier(&dev->shutdown_notifier);
+
 out:
     if (rc) {
         error_append_hint(errp, "Device fail to load\n");
-- 
2.17.2


* [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (19 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:25   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
                   ` (25 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

bitmap_zero_extend is designed for extending a bitmap, not for
shrinking or freeing it.
Use g_free instead.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 0a5ab8935a..35a96d9a64 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -43,7 +43,7 @@ static inline void res_tbl_free(RdmaRmResTbl *tbl)
 {
     qemu_mutex_destroy(&tbl->lock);
     g_free(tbl->tbl);
-    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
+    g_free(tbl->bitmap);
 }
 
 static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
-- 
2.17.2


* [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (20 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:25   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
                   ` (24 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

When the device goes down, fini_ports loops over all entries in the GID
table regardless of whether an entry is valid. When an entry is not
valid, skip any further processing in the backend device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 35a96d9a64..e3f6b2f6ea 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -555,6 +555,10 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
 {
     int rc;
 
+    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
+        return 0;
+    }
+
     rc = rdma_backend_del_gid(backend_dev, ifname,
                               &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc < 0) {
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (21 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 12:34   ` Marcel Apfelbaum
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (23 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The interface with the device has changed with the addition of support
for MAD packets.
Adjust the documentation accordingly.

While there, fix a minor mistake which may lead one to think that there
is a relation between using RXE on the host and compatibility with
bare-metal peers.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 84 insertions(+), 19 deletions(-)

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
index 5599318159..9e8d1674b7 100644
--- a/docs/pvrdma.txt
+++ b/docs/pvrdma.txt
@@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need for any special guest
 modifications.
 
 While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
-can work with Soft-RoCE (rxe).
+metal RDMA-enabled machines as peers.
+
+It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
 
 It does not require the whole guest RAM to be pinned allowing memory
 over-commit and, even if not implemented yet, migration support will be
@@ -78,29 +79,93 @@ the required RDMA libraries.
 
 3. Usage
 ========
+
+
+3.1 VM Memory settings
+======================
 Currently the device is working only with memory backed RAM
 and it must be mark as "shared":
    -m 1G \
    -object memory-backend-ram,id=mb1,size=1G,share \
    -numa node,memdev=mb1 \
 
-The pvrdma device is composed of two functions:
- - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
-   but is required to pass the ibdevice GID using its MAC.
-   Examples:
-     For an rxe backend using eth0 interface it will use its mac:
-       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
-     For an SRIOV VF, we take the Ethernet Interface exposed by it:
-       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
- - Function 1 is the actual device:
-       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
-   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
- Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
- The rules of conversion are part of the RoCE spec, but since manual conversion
- is not required, spotting problems is not hard:
-    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
-             MAC: 7c:fe:90:cb:74:3a
-    Note the difference between the first byte of the MAC and the GID.
+
+3.2 MAD Multiplexer
+===================
+MAD Multiplexer is a service that exposes a MAD-like interface for VMs in
+order to overcome the limitation where only a single entity can register
+with the MAD layer to send and receive RDMA-CM MAD packets.
+
+To build rdmacm-mux run
+# make rdmacm-mux
+
+The program accepts 3 command line arguments and exposes a UNIX socket to
+be used to relay control and data messages to and from the service.
+-s unix-socket-path   Path to unix socket to listen on
+                      (default /var/run/rdmacm-mux)
+-d rdma-device-name   Name of RDMA device to register with
+                      (default rxe0)
+-p rdma-device-port   Port number of RDMA device to register with
+                      (default 1)
+The final UNIX socket file name is a concatenation of the 3 arguments so
+for example for device name mlx5_0 and port 2 the file
+/var/run/rdmacm-mux-mlx5_0-2 will be created.
+
+Please refer to contrib/rdmacm-mux for more details.
+
+
+3.3 PCI devices settings
+========================
+A RoCE device exposes two functions - Ethernet and RDMA.
+To support this, the pvrdma device is composed of two PCI functions, an
+Ethernet device of type vmxnet3 on PCI function 0 and a pvrdma device on PCI
+function 1. The Ethernet function can be used for other purposes such as IP.
+
+
+3.4 Device parameters
+=====================
+- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
+  would be the Ethernet device used to create it. For any other physical
+  RoCE device this would be the netdev name of the device.
+- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
+- mad-chardev: The name of the MAD multiplexer char device.
+- ibport: In case of a multi-port device (such as Mellanox's HCA) this
+  specifies the port to use. If not set, port 1 will be used.
+- dev-caps-max-mr-size: The maximum size of MR.
+- dev-caps-max-qp: Maximum number of QPs.
+- dev-caps-max-sge: Maximum number of SGE elements in WR.
+- dev-caps-max-cq: Maximum number of CQs.
+- dev-caps-max-mr: Maximum number of MRs.
+- dev-caps-max-pd: Maximum number of PDs.
+- dev-caps-max-ah: Maximum number of AHs.
+
+Notes:
+- The first 3 parameters are mandatory settings, the rest have their
+  defaults.
+- The dev-caps-prefixed parameters define upper limits; the final values
+  are adjusted by the backend device's limitations.
+
+3.5 Example
+===========
+Define bridge device with vmxnet3 network backend:
+<interface type='bridge'>
+  <mac address='56:b4:44:e9:62:dc'/>
+  <source bridge='bridge1'/>
+  <model type='vmxnet3'/>
+  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
+</interface>
+
+Define pvrdma device:
+<qemu:commandline>
+  <qemu:arg value='-object'/>
+  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
+  <qemu:arg value='-numa'/>
+  <qemu:arg value='node,memdev=mb1'/>
+  <qemu:arg value='-chardev'/>
+  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
+  <qemu:arg value='-device'/>
+  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
+</qemu:commandline>
 
 
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (22 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
                   ` (22 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Hi all.

This is a major enhancement to the pvrdma device to allow it to work with
state of the art applications such as MPI.

As described in patch #5, MAD packets are management packets that are used
for many purposes, including but not limited to the communication layer
above the IB verbs API.

Patch 1 exposes a new external executable (under contrib) that aims to
address a specific limitation in the RDMA userspace MAD stack.

This patch-set mainly presents the MAD enhancement, but during the work on
it I came across some bugs and enhancements that needed to be implemented
before doing any MAD coding. This is the role of patches 2 to 4, 7 to 9
and 15 to 17.

Patches 6 and 18 are cosmetic changes that, while not relevant to this
patchset, are still introduced with it since (at least for 6) they are
hard to decouple.

Patches 12 to 15 couple the pvrdma device with the vmxnet3 device as this
is the configuration enforced by the pvrdma driver in the guest - a vmxnet3
device in function 0 and a pvrdma device in function 1 in the same PCI
slot. Patch 12 moves the needed code from the vmxnet3 device to a new
header file that can be used by pvrdma code, while patches 13 to 15 make
use of it.

Along with this patch-set there is a parallel patch posted to libvirt to
apply the change needed there as part of the process implemented in patches
10 and 11. This change is needed so that the guest will be able to
configure any IP on the Ethernet function of the pvrdma device.
https://www.redhat.com/archives/libvir-list/2018-November/msg00135.html

Since we maintain external resources such as GIDs in the host GID table, we
need to do some cleanup before going down. This is the job of patches 19
and 20. Patches 21 and 22 contain fixes for bugs detected during the work
on the cleanup code that runs during shutdown.

v1 -> v2:
    * Fix compilation issue detected when compiling for mingw
    * Address comment from Eric Blake re version of QEMU in json
      message
    * Fix example from QMP message in json file
    * Fix case where a VM tries to remove an invalid GID from GID table
    * rdmacm-mux: Cleanup entries in socket-gids table when socket is
      closed
    * Cleanup resources (GIDs, QPs etc) when VM goes down

v2 -> v3:
    * Address comment from Cornelia Huck for patch #19
    * Add some R-Bs from Marcel Apfelbaum and Dmitry Fleytman
    * Update docs/pvrdma.txt with the changes made by this patchset
    * Address comments from Shamir Rabinovitch for UMAD multiplexer

Thanks,
Yuval

Yuval Shaia (23):
  contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  hw/rdma: Add ability to force notification without re-arm
  hw/rdma: Return qpn 1 if ibqp is NULL
  hw/rdma: Abort send-op if fail to create addr handler
  hw/rdma: Add support for MAD packets
  hw/pvrdma: Make function reset_device return void
  hw/pvrdma: Make default pkey 0xFFFF
  hw/pvrdma: Set the correct opcode for recv completion
  hw/pvrdma: Set the correct opcode for send completion
  json: Define new QMP message for pvrdma
  hw/pvrdma: Add support to allow guest to configure GID table
  vmxnet3: Move some definitions to header file
  hw/pvrdma: Make sure PCI function 0 is vmxnet3
  hw/rdma: Initialize node_guid from vmxnet3 mac address
  hw/pvrdma: Make device state depend on Ethernet function state
  hw/pvrdma: Fill all CQE fields
  hw/pvrdma: Fill error code in command's response
  hw/rdma: Remove unneeded code that handles more that one port
  vl: Introduce shutdown_notifiers
  hw/pvrdma: Clean device's resource when system is shutdown
  hw/rdma: Do not use bitmap_zero_extend to free bitmap
  hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  docs: Update pvrdma device documentation

 MAINTAINERS                      |   2 +
 Makefile                         |   6 +-
 Makefile.objs                    |   5 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 771 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 docs/pvrdma.txt                  | 103 ++++-
 hw/net/vmxnet3.c                 | 116 +----
 hw/net/vmxnet3_defs.h            | 133 ++++++
 hw/rdma/rdma_backend.c           | 461 +++++++++++++++---
 hw/rdma/rdma_backend.h           |  28 +-
 hw/rdma/rdma_backend_defs.h      |  13 +-
 hw/rdma/rdma_rm.c                | 120 ++++-
 hw/rdma/rdma_rm.h                |  17 +-
 hw/rdma/rdma_rm_defs.h           |  21 +-
 hw/rdma/rdma_utils.h             |  24 +
 hw/rdma/vmw/pvrdma.h             |  10 +-
 hw/rdma/vmw/pvrdma_cmd.c         | 119 +++--
 hw/rdma/vmw/pvrdma_main.c        |  49 +-
 hw/rdma/vmw/pvrdma_qp_ops.c      |  62 ++-
 include/sysemu/sysemu.h          |   1 +
 qapi/qapi-schema.json            |   1 +
 qapi/rdma.json                   |  38 ++
 vl.c                             |  15 +-
 24 files changed, 1868 insertions(+), 307 deletions(-)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
 create mode 100644 hw/net/vmxnet3_defs.h
 create mode 100644 qapi/rdma.json

-- 
2.17.2

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (23 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-17 17:27   ` Shamir Rabinovitch
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
                   ` (21 subsequent siblings)
  46 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for a
given MAD class.
This does not go hand-in-hand with the qemu pvrdma device's requirements,
where each VM is a MAD agent.
Fix it by adding an implementation of an RDMA MAD multiplexer service
which on one hand registers as the sole MAD agent with the kernel module
and on the other hand gives service to more than one VM.

Design Overview:
----------------
A server process registers with the UMAD framework (for this to work the
rdma_cm kernel module needs to be unloaded) and creates a unix socket to
listen for incoming requests from clients.
A client process (such as QEMU) connects to this unix socket and
registers with its own GID.

TX:
---
When a client needs to send an rdma_cm MAD message it constructs it the
same way as without this multiplexer, i.e. creates a umad packet, but this
time it writes its content to the socket instead of calling umad_send().
The server, upon receiving such a message, fetches the local_comm_id from
it so a context for this session can be maintained, and relays the message
to the UMAD layer by calling umad_send().

RX:
---
The server creates a worker thread to process incoming rdma_cm MAD
messages. When an incoming message arrives (umad_recv()), the server,
depending on the message type (attr_id), looks for the target client by
searching either the gid->fd table or the local_comm_id->fd table. With
the extracted fd the server relays the incoming message to the client.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 MAINTAINERS                      |   1 +
 Makefile                         |   3 +
 Makefile.objs                    |   1 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 771 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 6 files changed, 836 insertions(+)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 98a1856afc..e087d58ac6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2231,6 +2231,7 @@ S: Maintained
 F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
+F: contrib/rdmacm-mux/*
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index f2947186a4..94072776ff 100644
--- a/Makefile
+++ b/Makefile
@@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
                 elf2dmp-obj-y \
                 ivshmem-client-obj-y \
                 ivshmem-server-obj-y \
+                rdmacm-mux-obj-y \
                 libvhost-user-obj-y \
                 vhost-user-scsi-obj-y \
                 vhost-user-blk-obj-y \
@@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
 	$(call LINK, $^)
 vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
 	$(call LINK, $^)
+rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
+	$(call LINK, $^)
 
 module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
 	$(call quiet-command,$(PYTHON) $< $@ \
diff --git a/Makefile.objs b/Makefile.objs
index 1e1ff387d7..cc7df3ad80 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
 vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
 vhost-user-blk-obj-y = contrib/vhost-user-blk/
+rdmacm-mux-obj-y = contrib/rdmacm-mux/
 
 ######################################################################
 trace-events-subdirs =
diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
new file mode 100644
index 0000000000..be3eacb6f7
--- /dev/null
+++ b/contrib/rdmacm-mux/Makefile.objs
@@ -0,0 +1,4 @@
+ifdef CONFIG_PVRDMA
+CFLAGS += -libumad -Wno-format-truncation
+rdmacm-mux-obj-y = main.o
+endif
diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
new file mode 100644
index 0000000000..47cf0ac7bc
--- /dev/null
+++ b/contrib/rdmacm-mux/main.c
@@ -0,0 +1,771 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "sys/poll.h"
+#include "sys/ioctl.h"
+#include "pthread.h"
+#include "syslog.h"
+
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "infiniband/umad_types.h"
+#include "infiniband/umad_sa.h"
+#include "infiniband/umad_cm.h"
+
+#include "rdmacm-mux.h"
+
+#define SCALE_US 1000
+#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
+#define SLEEP_SECS 5 /* This is used both in poll() and thread */
+#define SERVER_LISTEN_BACKLOG 10
+#define MAX_CLIENTS 4096
+#define MAD_RMPP_VERSION 0
+#define MAD_METHOD_MASK0 0x8
+
+#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
+
+#define CM_REQ_DGID_POS      80
+#define CM_SIDR_REQ_DGID_POS 44
+
+/* The below can be override by command line parameter */
+#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
+#define RDMA_DEVICE "rxe0"
+#define RDMA_PORT_NUM 1
+
+typedef struct RdmaCmServerArgs {
+    char unix_socket_path[PATH_MAX];
+    char rdma_dev_name[NAME_MAX];
+    int rdma_port_num;
+} RdmaCMServerArgs;
+
+typedef struct CommId2FdEntry {
+    int fd;
+    int ttl; /* Initialized to 2, decrement each timeout, entry delete when 0 */
+    __be64 gid_ifid;
+} CommId2FdEntry;
+
+typedef struct RdmaCmUMadAgent {
+    int port_id;
+    int agent_id;
+    GHashTable *gid2fd; /* Used to find fd of a given gid */
+    GHashTable *commid2fd; /* Used to find fd on of a given comm_id */
+} RdmaCmUMadAgent;
+
+typedef struct RdmaCmServer {
+    bool run;
+    RdmaCMServerArgs args;
+    struct pollfd fds[MAX_CLIENTS];
+    int nfds;
+    RdmaCmUMadAgent umad_agent;
+    pthread_t umad_recv_thread;
+    pthread_rwlock_t lock;
+} RdmaCMServer;
+
+static RdmaCMServer server = {0};
+
+static void usage(const char *progname)
+{
+    printf("Usage: %s [OPTION]...\n"
+           "Start a RDMA-CM multiplexer\n"
+           "\n"
+           "\t-h                    Show this help\n"
+           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
+           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
+           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
+           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
+}
+
+static void help(const char *progname)
+{
+    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    int c;
+    char unix_socket_path[PATH_MAX];
+
+    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
+    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
+    server.args.rdma_port_num = RDMA_PORT_NUM;
+
+    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
+        switch (c) {
+        case 'h':
+            usage(argv[0]);
+            exit(0);
+
+        case 's':
+            /* This is temporary, final name will build below */
+            strncpy(unix_socket_path, optarg, PATH_MAX);
+            break;
+
+        case 'd':
+            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
+            break;
+
+        case 'p':
+            server.args.rdma_port_num = atoi(optarg);
+            break;
+
+        default:
+            help(argv[0]);
+            exit(1);
+        }
+    }
+
+    /* Build unique unix-socket file name */
+    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
+             unix_socket_path, server.args.rdma_dev_name,
+             server.args.rdma_port_num);
+
+    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
+    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
+    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
+}
+
+static void hash_tbl_alloc(void)
+{
+
+    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
+                                                     g_int64_equal,
+                                                     g_free, g_free);
+    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
+                                                        g_int_equal,
+                                                        g_free, g_free);
+}
+
+static void hash_tbl_free(void)
+{
+    if (server.umad_agent.commid2fd) {
+        g_hash_table_destroy(server.umad_agent.commid2fd);
+    }
+    if (server.umad_agent.gid2fd) {
+        g_hash_table_destroy(server.umad_agent.gid2fd);
+    }
+}
+
+
+static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
+{
+    int *fd;
+
+    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    if (!fd) {
+        /* Let's try IPv4 */
+        *gid_ifid |= 0x00000000ffff0000;
+        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    }
+
+    return fd ? *fd : 0;
+}
+
+static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
+{
+    pthread_rwlock_rdlock(&server.lock);
+    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!*fd) {
+        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
+        return -ENOENT;
+    }
+
+    return 0;
+}
+
+static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
+                                         __be64 *gid_idid)
+{
+    CommId2FdEntry *fde;
+
+    pthread_rwlock_rdlock(&server.lock);
+    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!fde) {
+        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
+        return -ENOENT;
+    }
+
+    *fd = fde->fd;
+    *gid_idid = fde->gid_ifid;
+
+    return 0;
+}
+
+static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (fd1) { /* record already exist - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
+                           RDMACM_MUX_ERR_CODE_EACCES;
+    }
+
+    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
+
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx registered on socket %d", (uint64_t)gid_ifid, fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (!fd1) { /* record not exist - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
+    }
+
+    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)));
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx unregistered on socket %d", (uint64_t)gid_ifid, fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
+                                          uint64_t gid_ifid)
+{
+    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
+
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_insert(server.umad_agent.commid2fd,
+                        g_memdup(&comm_id, sizeof(comm_id)),
+                        g_memdup(&fde, sizeof(fde)));
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static gboolean remove_old_comm_ids(gpointer key, gpointer value,
+                                    gpointer user_data)
+{
+    CommId2FdEntry *fde = (CommId2FdEntry *)value;
+
+    return !fde->ttl--;
+}
+
+static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    if (*(int *)value == *(int *)user_data) {
+        syslog(LOG_INFO, "0x%lx unregistered on socket %d", *(uint64_t *)key,
+               *(int *)value);
+        return true;
+    }
+
+    return false;
+}
+
+static void hash_tbl_remove_fd_ifid_pair(int fd)
+{
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
+                                remove_entry_from_gid2fd, (gpointer)&fd);
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
+{
+    struct umad_hdr *hdr = (struct umad_hdr *)mad;
+    char *data = (char *)hdr + sizeof(*hdr);
+    int32_t comm_id;
+    uint16_t attr_id = be16toh(hdr->attr_id);
+    int rc = 0;
+
+    switch (attr_id) {
+    case UMAD_CM_ATTR_REQ:
+        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_SIDR_REQ:
+        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_REP:
+        /* Fall through */
+    case UMAD_CM_ATTR_REJ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREQ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREP:
+        /* Fall through */
+    case UMAD_CM_ATTR_RTU:
+        data += sizeof(comm_id);
+        /* Fall through */
+    case UMAD_CM_ATTR_SIDR_REP:
+        memcpy(&comm_id, data, sizeof(comm_id));
+        if (comm_id) {
+            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
+        }
+        break;
+
+    default:
+        rc = -EINVAL;
+        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
+    }
+
+    return rc;
+}
+
+static void *umad_recv_thread_func(void *args)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    int fd = -2;
+
+    while (server.run) {
+        do {
+            msg.umad_len = sizeof(msg.umad.mad);
+            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
+                           SLEEP_SECS * SCALE_US);
+            if ((rc == -EIO) || (rc == -EINVAL)) {
+                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
+            }
+
+            if (rc == -ETIMEDOUT) {
+                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
+                                            remove_old_comm_ids, NULL);
+            }
+        } while (rc && server.run);
+
+        if (server.run) {
+            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
+            if (rc) {
+                continue;
+            }
+
+            send(fd, &msg, sizeof(msg), 0);
+        }
+    }
+
+    return NULL;
+}
+
+static int read_and_process(int fd)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    struct umad_hdr *hdr;
+    uint32_t *comm_id;
+    uint16_t attr_id;
+
+    rc = recv(fd, &msg, sizeof(msg), 0);
+
+    if (rc < 0 && errno != EWOULDBLOCK) {
+        return -EIO;
+    }
+
+    if (!rc) {
+        return -EPIPE;
+    }
+
+    switch (msg.hdr.msg_type) {
+    case RDMACM_MUX_MSG_TYPE_REG:
+        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_MSG_TYPE_UNREG:
+        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_MSG_TYPE_MAD:
+        /* If this is REQ or REP then store the pair comm_id,fd to be later
+         * used for other messages where gid is unknown */
+        hdr = (struct umad_hdr *)msg.umad.mad;
+        attr_id = be16toh(hdr->attr_id);
+        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
+            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
+            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
+            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
+            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
+                                          msg.hdr.sgid.global.interface_id);
+        }
+
+        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
+                       &msg.umad, msg.umad_len, 1, 0);
+        if (rc) {
+            syslog(LOG_WARNING, "Fail to send MAD message, err=%d", rc);
+        }
+        break;
+
+    default:
+        syslog(LOG_WARNING, "Got invalid message (%d) from %d",
+               msg.hdr.msg_type, fd);
+        rc = RDMACM_MUX_ERR_CODE_EINVAL;
+    }
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
+    msg.hdr.err_code = rc;
+    rc = send(fd, &msg, sizeof(msg), 0);
+
+    return rc == sizeof(msg) ? 0 : -EPIPE;
+}
+
+static int accept_all(void)
+{
+    int fd, rc = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    do {
+        if ((server.nfds + 1) > MAX_CLIENTS) {
+            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
+            rc = -EIO;
+            goto out;
+        }
+
+        fd = accept(server.fds[0].fd, NULL, NULL);
+        if (fd < 0) {
+            if (errno != EWOULDBLOCK) {
+                syslog(LOG_WARNING, "accept() failed");
+                rc = -EIO;
+                goto out;
+            }
+            break;
+        }
+
+        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
+        server.fds[server.nfds].fd = fd;
+        server.fds[server.nfds].events = POLLIN;
+        server.nfds++;
+    } while (fd != -1);
+
+out:
+    pthread_rwlock_unlock(&server.lock);
+    return rc;
+}
+
+static void compress_fds(void)
+{
+    int i, j;
+    int closed = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    for (i = 1; i < server.nfds; i++) {
+        if (!server.fds[i].fd) {
+            closed++;
+            for (j = i; j < server.nfds; j++) {
+                server.fds[j].fd = server.fds[j + 1].fd;
+            }
+        }
+    }
+
+    server.nfds -= closed;
+
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static void close_fd(int idx)
+{
+    close(server.fds[idx].fd);
+    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
+    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
+    server.fds[idx].fd = 0;
+}
+
+static void run(void)
+{
+    int rc, nfds, i;
+    bool compress = false;
+
+    syslog(LOG_INFO, "Service started");
+
+    while (server.run) {
+        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
+        if (rc < 0) {
+            if (errno != EINTR) {
+                syslog(LOG_WARNING, "poll() failed");
+            }
+            continue;
+        }
+
+        if (rc == 0) {
+            continue;
+        }
+
+        nfds = server.nfds;
+        for (i = 0; i < nfds; i++) {
+            if (server.fds[i].revents == 0) {
+                continue;
+            }
+
+            if (server.fds[i].revents != POLLIN) {
+                if (i == 0) {
+                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
+                           server.fds[i].revents);
+                } else {
+                    close_fd(i);
+                    compress = true;
+                }
+                continue;
+            }
+
+            if (i == 0) {
+                rc = accept_all();
+                if (rc) {
+                    continue;
+                }
+            } else {
+                rc = read_and_process(server.fds[i].fd);
+                if (rc) {
+                    close_fd(i);
+                    compress = true;
+                }
+            }
+        }
+
+        if (compress) {
+            compress = false;
+            compress_fds();
+        }
+    }
+}
+
+static void fini_listener(void)
+{
+    int i;
+
+    if (server.fds[0].fd <= 0) {
+        return;
+    }
+
+    for (i = server.nfds - 1; i >= 0; i--) {
+        if (server.fds[i].fd) {
+            close(server.fds[i].fd);
+        }
+    }
+
+    unlink(server.args.unix_socket_path);
+}
+
+static void fini_umad(void)
+{
+    if (server.umad_agent.agent_id) {
+        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
+    }
+
+    if (server.umad_agent.port_id) {
+        umad_close_port(server.umad_agent.port_id);
+    }
+
+    hash_tbl_free();
+}
+
+static void fini(void)
+{
+    if (server.umad_recv_thread) {
+        pthread_join(server.umad_recv_thread, NULL);
+        server.umad_recv_thread = 0;
+    }
+    fini_umad();
+    fini_listener();
+    pthread_rwlock_destroy(&server.lock);
+
+    syslog(LOG_INFO, "Service going down");
+}
+
+static int init_listener(void)
+{
+    struct sockaddr_un sun;
+    int rc, on = 1;
+
+    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (server.fds[0].fd < 0) {
+        syslog(LOG_ALERT, "socket() failed");
+        return -EIO;
+    }
+
+    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
+                    sizeof(on));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "setsockopt() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "ioctl() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT,
+               "Invalid unix_socket_path, size must be less than %zu\n",
+               sizeof(sun.sun_path));
+        rc = -EINVAL;
+        goto err;
+    }
+
+    sun.sun_family = AF_UNIX;
+    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
+                  server.args.unix_socket_path);
+    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT, "Could not copy unix socket path\n");
+        rc = -EINVAL;
+        goto err;
+    }
+
+    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "bind() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "listen() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    server.fds[0].events = POLLIN;
+    server.nfds = 1;
+    server.run = true;
+
+    return 0;
+
+err:
+    close(server.fds[0].fd);
+    return rc;
+}
+
+static int init_umad(void)
+{
+    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
+
+    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
+                                               server.args.rdma_port_num);
+
+    if (server.umad_agent.port_id < 0) {
+        syslog(LOG_WARNING, "umad_open_port() failed");
+        return -EIO;
+    }
+
+    memset(&method_mask, 0, sizeof(method_mask));
+    method_mask[0] = MAD_METHOD_MASK0;
+    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
+                                               UMAD_CLASS_CM,
+                                               UMAD_SA_CLASS_VERSION,
+                                               MAD_RMPP_VERSION, method_mask);
+    if (server.umad_agent.agent_id < 0) {
+        syslog(LOG_WARNING, "umad_register() failed");
+        return -EIO;
+    }
+
+    hash_tbl_alloc();
+
+    return 0;
+}
+
+static void signal_handler(int sig, siginfo_t *siginfo, void *context)
+{
+    static bool warned;
+
+    /* Prevent stop if clients are connected */
+    if (server.nfds != 1) {
+        if (!warned) {
+            syslog(LOG_WARNING,
+                   "Can't stop while active clients exist, resend SIGINT to override");
+            warned = true;
+            return;
+        }
+    }
+
+    if (sig == SIGINT) {
+        server.run = false;
+        fini();
+    }
+
+    exit(0);
+}
+
+static int init(void)
+{
+    int rc;
+    struct sigaction sig = {0};
+
+    rc = init_listener();
+    if (rc) {
+        return rc;
+    }
+
+    rc = init_umad();
+    if (rc) {
+        return rc;
+    }
+
+    pthread_rwlock_init(&server.lock, 0);
+
+    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
+                        NULL);
+    if (rc) {
+        syslog(LOG_ERR, "Fail to create UMAD receiver thread (%d)\n", rc);
+        return rc;
+    }
+
+    sig.sa_sigaction = &signal_handler;
+    sig.sa_flags = SA_SIGINFO;
+    rc = sigaction(SIGINT, &sig, NULL);
+    if (rc < 0) {
+        syslog(LOG_ERR, "Fail to install SIGINT handler (%d)\n", errno);
+        return rc;
+    }
+
+    return 0;
+}
+
+int main(int argc, char *argv[])
+{
+    int rc;
+
+    memset(&server, 0, sizeof(server));
+
+    parse_args(argc, argv);
+
+    rc = init();
+    if (rc) {
+        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
+        rc = -EAGAIN;
+        goto out;
+    }
+
+    run();
+
+out:
+    fini();
+
+    return rc;
+}
diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
new file mode 100644
index 0000000000..03508d52b2
--- /dev/null
+++ b/contrib/rdmacm-mux/rdmacm-mux.h
@@ -0,0 +1,56 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux declarations
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMACM_MUX_H
+#define RDMACM_MUX_H
+
+#include "linux/if.h"
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "rdma/rdma_user_cm.h"
+
+typedef enum RdmaCmMuxMsgType {
+    RDMACM_MUX_MSG_TYPE_REG   = 0,
+    RDMACM_MUX_MSG_TYPE_UNREG = 1,
+    RDMACM_MUX_MSG_TYPE_MAD   = 2,
+    RDMACM_MUX_MSG_TYPE_RESP  = 3,
+} RdmaCmMuxMsgType;
+
+typedef enum RdmaCmMuxErrCode {
+    RDMACM_MUX_ERR_CODE_OK        = 0,
+    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
+    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
+    RDMACM_MUX_ERR_CODE_EACCES    = 3,
+    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
+} RdmaCmMuxErrCode;
+
+typedef struct RdmaCmMuxHdr {
+    RdmaCmMuxMsgType msg_type;
+    union ibv_gid sgid;
+    RdmaCmMuxErrCode err_code;
+} RdmaCmUHdr;
+
+typedef struct RdmaCmUMad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+} RdmaCmUMad;
+
+typedef struct RdmaCmMuxMsg {
+    RdmaCmUHdr hdr;
+    int umad_len;
+    RdmaCmUMad umad;
+} RdmaCmMuxMsg;
+
+#endif
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (24 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
                   ` (20 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Upon completion of an incoming packet the device pushes a CQE to the
driver's RX ring and notifies the driver (MSI-X).
While for data-path packets the driver needs the ability to control
whether it wishes to receive interrupts or not, control-path packets
such as incoming MADs must always trigger a notification; the driver
does not even need to re-arm the notification bit.

Enhance the notification field to support this.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_rm.c           | 12 ++++++++++--
 hw/rdma/rdma_rm_defs.h      |  8 +++++++-
 hw/rdma/vmw/pvrdma_qp_ops.c |  6 ++++--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 8d59a42cd1..4f10fcabcc 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -263,7 +263,7 @@ int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     }
 
     cq->opaque = opaque;
-    cq->notify = false;
+    cq->notify = CNT_CLEAR;
 
     rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
     if (rc) {
@@ -291,7 +291,10 @@ void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
         return;
     }
 
-    cq->notify = notify;
+    if (cq->notify != CNT_SET) {
+        cq->notify = notify ? CNT_ARM : CNT_CLEAR;
+    }
+
     pr_dbg("notify=%d\n", cq->notify);
 }
 
@@ -349,6 +352,11 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
         return -EINVAL;
     }
 
+    if (qp_type == IBV_QPT_GSI) {
+        scq->notify = CNT_SET;
+        rcq->notify = CNT_SET;
+    }
+
     qp = res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
     if (!qp) {
         return -ENOMEM;
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7228151239..9b399063d3 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -49,10 +49,16 @@ typedef struct RdmaRmPD {
     uint32_t ctx_handle;
 } RdmaRmPD;
 
+typedef enum CQNotificationType {
+    CNT_CLEAR,
+    CNT_ARM,
+    CNT_SET,
+} CQNotificationType;
+
 typedef struct RdmaRmCQ {
     RdmaBackendCQ backend_cq;
     void *opaque;
-    bool notify;
+    CQNotificationType notify;
 } RdmaRmCQ;
 
 /* MR (DMA region) */
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index c668afd0ed..762700a205 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -89,8 +89,10 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pvrdma_ring_write_inc(&dev->dsr_info.cq);
 
     pr_dbg("cq->notify=%d\n", cq->notify);
-    if (cq->notify) {
-        cq->notify = false;
+    if (cq->notify != CNT_CLEAR) {
+        if (cq->notify == CNT_ARM) {
+            cq->notify = CNT_CLEAR;
+        }
         post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
     }
 
-- 
2.17.2


* [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (25 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
                   ` (19 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The device does not support QP0, only QP1.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 86e8fe8ab6..3ccc9a2494 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
 
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
-    return qp->ibqp ? qp->ibqp->qp_num : 0;
+    return qp->ibqp ? qp->ibqp->qp_num : 1;
 }
 
 static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
-- 
2.17.2


* [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (26 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
                   ` (18 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Function create_ah might return NULL; in that case, abort the send with
an error completion.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_backend.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index d7a4bbd91f..1e148398a2 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -338,6 +338,10 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
                                 backend_dev->backend_gid_idx, dgid);
+        if (!wr.wr.ud.ah) {
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            goto out_dealloc_cqe_ctx;
+        }
         wr.wr.ud.remote_qpn = dqpn;
         wr.wr.ud.remote_qkey = dqkey;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (27 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void Yuval Shaia
                   ` (17 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

MAD (Management Datagram) packets are widely used by various modules,
both in kernel and in user space; for example, the rdma_* API, which
is used to create and maintain a "connection" layer on top of the RDMA
verbs API, uses several types of MAD packets.
To support MAD packets the device uses an external utility
(contrib/rdmacm-mux) to relay packets from and to the guest driver.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
 hw/rdma/rdma_backend.h      |   4 +-
 hw/rdma/rdma_backend_defs.h |  10 +-
 hw/rdma/vmw/pvrdma.h        |   2 +
 hw/rdma/vmw/pvrdma_main.c   |   4 +-
 5 files changed, 273 insertions(+), 10 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 1e148398a2..3eb0099f8d 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -16,8 +16,13 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qlist.h"
+#include "qapi/qmp/qnum.h"
 
 #include <infiniband/verbs.h>
+#include <infiniband/umad_types.h>
+#include <infiniband/umad.h>
+#include <rdma/rdma_user_cm.h>
 
 #include "trace.h"
 #include "rdma_utils.h"
@@ -33,16 +38,25 @@
 #define VENDOR_ERR_MAD_SEND         0x206
 #define VENDOR_ERR_INVLKEY          0x207
 #define VENDOR_ERR_MR_SMALL         0x208
+#define VENDOR_ERR_INV_MAD_BUFF     0x209
+#define VENDOR_ERR_INV_NUM_SGE      0x210
 
 #define THR_NAME_LEN 16
 #define THR_POLL_TO  5000
 
+#define MAD_HDR_SIZE sizeof(struct ibv_grh)
+
 typedef struct BackendCtx {
-    uint64_t req_id;
     void *up_ctx;
     bool is_tx_req;
+    struct ibv_sge sge; /* Used to save MAD recv buffer */
 } BackendCtx;
 
+struct backend_umad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+};
+
 static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
 
 static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
@@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
+static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
+                    uint32_t num_sge)
+{
+    struct backend_umad umad = {0};
+    char *hdr, *msg;
+    int ret;
+
+    pr_dbg("num_sge=%d\n", num_sge);
+
+    if (num_sge != 2) {
+        return -EINVAL;
+    }
+
+    umad.hdr.length = sge[0].length + sge[1].length;
+    pr_dbg("msg_len=%d\n", umad.hdr.length);
+
+    if (umad.hdr.length > sizeof(umad.mad)) {
+        return -ENOMEM;
+    }
+
+    umad.hdr.addr.qpn = htobe32(1);
+    umad.hdr.addr.grh_present = 1;
+    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    umad.hdr.addr.hop_limit = 1;
+
+    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
+    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+    memcpy(&umad.mad[0], hdr, sge[0].length);
+    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+
+    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
+
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad));
+
+    pr_dbg("qemu_chr_fe_write=%d\n", ret);
+
+    return (ret != sizeof(umad));
+}
+
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
@@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = mad_send(backend_dev, sge, num_sge);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            } else {
+                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+            }
         }
-        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
         return;
     }
 
@@ -370,6 +431,48 @@ out_free_bctx:
     g_free(bctx);
 }
 
+static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
+                                         struct ibv_sge *sge, uint32_t num_sge,
+                                         void *ctx)
+{
+    BackendCtx *bctx;
+    int rc;
+    uint32_t bctx_id;
+
+    if (num_sge != 1) {
+        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
+        return VENDOR_ERR_INV_NUM_SGE;
+    }
+
+    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
+        pr_dbg("Too small buffer for MAD\n");
+        return VENDOR_ERR_INV_MAD_BUFF;
+    }
+
+    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
+    pr_dbg("length=%d\n", sge[0].length);
+    pr_dbg("lkey=%d\n", sge[0].lkey);
+
+    bctx = g_malloc0(sizeof(*bctx));
+
+    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
+    if (unlikely(rc)) {
+        g_free(bctx);
+        pr_dbg("Fail to allocate cqe_ctx\n");
+        return VENDOR_ERR_NOMEM;
+    }
+
+    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
+    bctx->up_ctx = ctx;
+    bctx->sge = *sge;
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+
+    return 0;
+}
+
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
                             RdmaDeviceResources *rdma_dev_res,
                             RdmaBackendQP *qp, uint8_t qp_type,
@@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+            }
         }
         return;
     }
@@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 
     switch (qp_type) {
     case IBV_QPT_GSI:
-        pr_dbg("QP1 unsupported\n");
         return 0;
 
     case IBV_QPT_RC:
@@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
     return 0;
 }
 
+static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
+                                 union ibv_gid *my_gid, int paylen)
+{
+    grh->paylen = htons(paylen);
+    grh->sgid = *sgid;
+    grh->dgid = *my_gid;
+
+    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
+    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+}
+
+static inline int mad_can_receieve(void *opaque)
+{
+    return sizeof(struct backend_umad);
+}
+
+static void mad_read(void *opaque, const uint8_t *buf, int size)
+{
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+    char *mad;
+    struct backend_umad *umad;
+
+    assert(size == sizeof(*umad));
+    umad = (struct backend_umad *)buf;
+
+    pr_dbg("Got %d bytes\n", size);
+    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+
+#ifdef PVRDMA_DEBUG
+    struct umad_hdr *hdr = (struct umad_hdr *)&umad->mad;
+    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
+           hdr->base_version, hdr->mgmt_class, hdr->class_version,
+           hdr->method, hdr->status, be64toh(hdr->tid),
+           hdr->attr_id, hdr->attr_mod);
+#endif
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+    if (!o_ctx_id) {
+        pr_dbg("No more free MAD buffers, waiting for a while\n");
+        sleep(THR_POLL_TO);
+        return;
+    }
+
+    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+    if (unlikely(!bctx)) {
+        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
+        return;
+    }
+
+    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
+
+    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
+                           bctx->sge.length);
+    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                     bctx->up_ctx);
+    } else {
+        memset(mad, 0, bctx->sge.length);
+        build_mad_hdr((struct ibv_grh *)mad,
+                      (union ibv_gid *)&umad->hdr.addr.gid,
+                      &backend_dev->gid, umad->hdr.length);
+        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
+
+        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+    }
+
+    g_free(bctx);
+    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+}
+
+static int mad_init(RdmaBackendDev *backend_dev)
+{
+    struct backend_umad umad = {0};
+    int ret;
+
+    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+        pr_dbg("Missing chardev for MAD multiplexer\n");
+        return -EIO;
+    }
+
+    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
+                             mad_read, NULL, NULL, backend_dev, NULL, true);
+
+    /* Register ourself */
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad.hdr));
+    if (ret != sizeof(umad.hdr)) {
+        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
+    }
+
+    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
+    backend_dev->recv_mads_list.list = qlist_new();
+
+    return 0;
+}
+
+static void mad_stop(RdmaBackendDev *backend_dev)
+{
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+
+    pr_dbg("Closing MAD\n");
+
+    /* Clear MAD buffers list */
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    do {
+        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+        if (o_ctx_id) {
+            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+            if (bctx) {
+                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+                g_free(bctx);
+            }
+        }
+    } while (o_ctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+}
+
+static void mad_fini(RdmaBackendDev *backend_dev)
+{
+    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
+    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp)
+                      CharBackend *mad_chr_be, Error **errp)
 {
     int i;
     int ret = 0;
@@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
-
+    backend_dev->mad_chr_be = mad_chr_be;
     backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
@@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     pr_dbg("interface_id=0x%" PRIx64 "\n",
            be64_to_cpu(backend_dev->gid.global.interface_id));
 
+    ret = mad_init(backend_dev);
+    if (ret) {
+        error_setg(errp, "Fail to initialize mad");
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+
     backend_dev->comp_thread.run = false;
     backend_dev->comp_thread.is_running = false;
 
@@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
 {
     pr_dbg("Stopping rdma_backend\n");
     stop_backend_thread(&backend_dev->comp_thread);
+    mad_stop(backend_dev);
 }
 
 void rdma_backend_fini(RdmaBackendDev *backend_dev)
 {
     rdma_backend_stop(backend_dev);
+    mad_fini(backend_dev);
     g_hash_table_destroy(ah_hash);
     ibv_destroy_comp_channel(backend_dev->channel);
     ibv_close_device(backend_dev->context);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 3ccc9a2494..fc83330251 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -17,6 +17,8 @@
 #define RDMA_BACKEND_H
 
 #include "qapi/error.h"
+#include "chardev/char-fe.h"
+
 #include "rdma_rm_defs.h"
 #include "rdma_backend_defs.h"
 
@@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp);
+                      CharBackend *mad_chr_be, Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 7404f64002..2a7e667075 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -16,8 +16,9 @@
 #ifndef RDMA_BACKEND_DEFS_H
 #define RDMA_BACKEND_DEFS_H
 
-#include <infiniband/verbs.h>
 #include "qemu/thread.h"
+#include "chardev/char-fe.h"
+#include <infiniband/verbs.h>
 
 typedef struct RdmaDeviceResources RdmaDeviceResources;
 
@@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
     bool is_running; /* Set by the thread to report its status */
 } RdmaBackendThread;
 
+typedef struct RecvMadList {
+    QemuMutex lock;
+    QList *list;
+} RecvMadList;
+
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
@@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
     struct ibv_comp_channel *channel;
     uint8_t port_num;
     uint8_t backend_gid_idx;
+    RecvMadList recv_mads_list;
+    CharBackend *mad_chr_be;
 } RdmaBackendDev;
 
 typedef struct RdmaBackendPD {
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e2d9f93cdf..e3742d893a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -19,6 +19,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
+#include "chardev/char-fe.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -83,6 +84,7 @@ typedef struct PVRDMADev {
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
+    CharBackend mad_chr;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ca5fa8d981..6c8c0154fa 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
     DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
                       dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
     DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
+    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, errp);
+                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
+                           errp);
     if (rc) {
         goto out;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (28 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
                   ` (16 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

This function cannot fail, so change it to return void.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 6c8c0154fa..fc2abd34af 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -369,13 +369,11 @@ static int unquiesce_device(PVRDMADev *dev)
     return 0;
 }
 
-static int reset_device(PVRDMADev *dev)
+static void reset_device(PVRDMADev *dev)
 {
     pvrdma_stop(dev);
 
     pr_dbg("Device reset complete\n");
-
-    return 0;
 }
 
 static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
-- 
2.17.2


* [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (29 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
                   ` (15 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Commit 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF") exports the
default pkey as an external definition but omits the change from 0x7FFF
to 0xFFFF.

Fixes: 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF")

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e3742d893a..15c3f28b86 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -52,7 +52,7 @@
 #define PVRDMA_FW_VERSION    14
 
 /* Some defaults */
-#define PVRDMA_PKEY          0x7FFF
+#define PVRDMA_PKEY          0xFFFF
 
 typedef struct DSRInfo {
     dma_addr_t dma;
-- 
2.17.2


* [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (30 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
                   ` (14 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The function pvrdma_post_cqe populates the CQE entry with the opcode
taken from the given completion element. For receive operations this
value was never set. Fix it by setting it to IBV_WC_RECV.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 762700a205..7b0f440fda 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx = g_malloc(sizeof(CompHandlerCtx));
         comp_ctx->dev = dev;
         comp_ctx->cq_handle = qp->recv_cq_handle;
-        comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cqe.opcode = IBV_WC_RECV;
 
         rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
                                &qp->backend_qp, qp->qp_type,
-- 
2.17.2


* [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (31 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma Yuval Shaia
                   ` (13 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The opcode of a work completion (WC) should be set by the device, not
taken from the work element.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 7b0f440fda..3388be1926 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cq_handle = qp->send_cq_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
         comp_ctx->cqe.qp = qp_handle;
-        comp_ctx->cqe.opcode = wqe->hdr.opcode;
+        comp_ctx->cqe.opcode = IBV_WC_SEND;
 
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-- 
2.17.2


* [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (32 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
                   ` (12 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

pvrdma requires that any GID attached to it is also attached to the
backend device in the host.

A new QMP message is defined so the pvrdma device can broadcast any
change made to its GID table. This event is captured by libvirt, which
in turn updates the GID table of the backend device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>
---
 MAINTAINERS           |  1 +
 Makefile              |  3 ++-
 Makefile.objs         |  4 ++++
 qapi/qapi-schema.json |  1 +
 qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 qapi/rdma.json

diff --git a/MAINTAINERS b/MAINTAINERS
index e087d58ac6..a149f68a8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2232,6 +2232,7 @@ F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
 F: contrib/rdmacm-mux/*
+F: qapi/rdma.json
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index 94072776ff..db4ce60ee5 100644
--- a/Makefile
+++ b/Makefile
@@ -599,7 +599,8 @@ qapi-modules = $(SRC_PATH)/qapi/qapi-schema.json $(SRC_PATH)/qapi/common.json \
                $(SRC_PATH)/qapi/tpm.json \
                $(SRC_PATH)/qapi/trace.json \
                $(SRC_PATH)/qapi/transaction.json \
-               $(SRC_PATH)/qapi/ui.json
+               $(SRC_PATH)/qapi/ui.json \
+               $(SRC_PATH)/qapi/rdma.json
 
 qapi/qapi-builtin-types.c qapi/qapi-builtin-types.h \
 qapi/qapi-types.c qapi/qapi-types.h \
diff --git a/Makefile.objs b/Makefile.objs
index cc7df3ad80..76d8028f2f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -21,6 +21,7 @@ util-obj-y += qapi/qapi-types-tpm.o
 util-obj-y += qapi/qapi-types-trace.o
 util-obj-y += qapi/qapi-types-transaction.o
 util-obj-y += qapi/qapi-types-ui.o
+util-obj-y += qapi/qapi-types-rdma.o
 util-obj-y += qapi/qapi-builtin-visit.o
 util-obj-y += qapi/qapi-visit.o
 util-obj-y += qapi/qapi-visit-block-core.o
@@ -40,6 +41,7 @@ util-obj-y += qapi/qapi-visit-tpm.o
 util-obj-y += qapi/qapi-visit-trace.o
 util-obj-y += qapi/qapi-visit-transaction.o
 util-obj-y += qapi/qapi-visit-ui.o
+util-obj-y += qapi/qapi-visit-rdma.o
 util-obj-y += qapi/qapi-events.o
 util-obj-y += qapi/qapi-events-block-core.o
 util-obj-y += qapi/qapi-events-block.o
@@ -58,6 +60,7 @@ util-obj-y += qapi/qapi-events-tpm.o
 util-obj-y += qapi/qapi-events-trace.o
 util-obj-y += qapi/qapi-events-transaction.o
 util-obj-y += qapi/qapi-events-ui.o
+util-obj-y += qapi/qapi-events-rdma.o
 util-obj-y += qapi/qapi-introspect.o
 
 chardev-obj-y = chardev/
@@ -155,6 +158,7 @@ common-obj-y += qapi/qapi-commands-tpm.o
 common-obj-y += qapi/qapi-commands-trace.o
 common-obj-y += qapi/qapi-commands-transaction.o
 common-obj-y += qapi/qapi-commands-ui.o
+common-obj-y += qapi/qapi-commands-rdma.o
 common-obj-y += qapi/qapi-introspect.o
 common-obj-y += qmp.o hmp.o
 endif
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 65b6dc2f6f..a650d80f83 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -94,3 +94,4 @@
 { 'include': 'trace.json' }
 { 'include': 'introspect.json' }
 { 'include': 'misc.json' }
+{ 'include': 'rdma.json' }
diff --git a/qapi/rdma.json b/qapi/rdma.json
new file mode 100644
index 0000000000..804c68ab36
--- /dev/null
+++ b/qapi/rdma.json
@@ -0,0 +1,38 @@
+# -*- Mode: Python -*-
+#
+
+##
+# = RDMA device
+##
+
+##
+# @RDMA_GID_STATUS_CHANGED:
+#
+# Emitted when guest driver adds/deletes GID to/from device
+#
+# @netdev: RoCE Network Device name - char *
+#
+# @gid-status: Add or delete indication - bool
+#
+# @subnet-prefix: Subnet Prefix - uint64
+#
+# @interface-id : Interface ID - uint64
+#
+# Since: 3.2
+#
+# Example:
+#
+# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
+#     "event": "RDMA_GID_STATUS_CHANGED",
+#     "data":
+#         {"netdev": "bridge0",
+#         "interface-id": 15880512517475447892,
+#         "gid-status": true,
+#         "subnet-prefix": 33022}}
+#
+##
+{ 'event': 'RDMA_GID_STATUS_CHANGED',
+  'data': { 'netdev'        : 'str',
+            'gid-status'    : 'bool',
+            'subnet-prefix' : 'uint64',
+            'interface-id'  : 'uint64' } }
-- 
2.17.2


* [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (33 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file Yuval Shaia
                   ` (11 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The RDMA device's GID table is controlled by updating the addresses of
the device's Ethernet function.
Usually the first GID entry is determined by the MAC address, the second
by the first IPv6 address and the third by the IPv4 address. Further
entries can be added by configuring more IP addresses. The reverse also
holds: whenever an address is removed, the corresponding GID entry is
removed as well.

The process is driven by the network and RDMA stacks. Whenever an
address is added, the ib_core driver is notified and calls the device
driver's add_gid function, which in turn updates the device.

To support this in the pvrdma device we need to hook into the
create_bind and destroy_bind HW commands triggered by the pvrdma driver
in the guest. Whenever a change is made to the pvrdma device's GID table
a special QMP message is sent to be processed by libvirt, which updates
the addresses of the backend Ethernet device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 243 +++++++++++++++++++++++-------------
 hw/rdma/rdma_backend.h      |  22 ++--
 hw/rdma/rdma_backend_defs.h |   3 +-
 hw/rdma/rdma_rm.c           | 104 ++++++++++++++-
 hw/rdma/rdma_rm.h           |  17 ++-
 hw/rdma/rdma_rm_defs.h      |   9 +-
 hw/rdma/rdma_utils.h        |  15 +++
 hw/rdma/vmw/pvrdma.h        |   2 +-
 hw/rdma/vmw/pvrdma_cmd.c    |  55 ++++----
 hw/rdma/vmw/pvrdma_main.c   |  25 +---
 hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
 11 files changed, 370 insertions(+), 145 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 3eb0099f8d..5675504165 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -18,12 +18,14 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qnum.h"
+#include "qapi/qapi-events-rdma.h"
 
 #include <infiniband/verbs.h>
 #include <infiniband/umad_types.h>
 #include <infiniband/umad.h>
 #include <rdma/rdma_user_cm.h>
 
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 #include "trace.h"
 #include "rdma_utils.h"
 #include "rdma_rm.h"
@@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
-static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
-                    uint32_t num_sge)
+static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
+                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
 {
-    struct backend_umad umad = {0};
-    char *hdr, *msg;
+    RdmaCmMuxMsg msg = {0};
+    char *hdr, *data;
     int ret;
 
     pr_dbg("num_sge=%d\n", num_sge);
@@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
         return -EINVAL;
     }
 
-    umad.hdr.length = sge[0].length + sge[1].length;
-    pr_dbg("msg_len=%d\n", umad.hdr.length);
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
+    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
 
-    if (umad.hdr.length > sizeof(umad.mad)) {
+    msg.umad_len = sge[0].length + sge[1].length;
+    pr_dbg("umad_len=%d\n", msg.umad_len);
+
+    if (msg.umad_len > sizeof(msg.umad.mad)) {
         return -ENOMEM;
     }
 
-    umad.hdr.addr.qpn = htobe32(1);
-    umad.hdr.addr.grh_present = 1;
-    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    umad.hdr.addr.hop_limit = 1;
+    msg.umad.hdr.addr.qpn = htobe32(1);
+    msg.umad.hdr.addr.grh_present = 1;
+    pr_dbg("sgid_idx=%d\n", sgid_idx);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
+    msg.umad.hdr.addr.gid_index = sgid_idx;
+    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
+    msg.umad.hdr.addr.hop_limit = 1;
 
     hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
-    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
+    pr_dbg_buf("mad_data", data, sge[1].length);
 
-    memcpy(&umad.mad[0], hdr, sge[0].length);
-    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
+    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
 
-    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
     rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
 
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
 
     pr_dbg("qemu_chr_fe_write=%d\n", ret);
 
-    return (ret != sizeof(umad));
+    return (ret != sizeof(msg));
 }
 
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
-                            union ibv_gid *dgid, uint32_t dqpn,
-                            uint32_t dqkey, void *ctx)
+                            uint8_t sgid_idx, union ibv_gid *sgid,
+                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
+                            void *ctx)
 {
     BackendCtx *bctx;
     struct ibv_sge new_sge[MAX_SGE];
@@ -361,7 +372,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            rc = mad_send(backend_dev, sge, num_sge);
+            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
                 comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
@@ -397,8 +408,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     }
 
     if (qp_type == IBV_QPT_UD) {
-        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
-                                backend_dev->backend_gid_idx, dgid);
+        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
@@ -703,9 +713,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
 }
 
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey)
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
 {
     struct ibv_qp_attr attr = {0};
     union ibv_gid ibv_gid = {
@@ -717,13 +727,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
     attr.qp_state = IBV_QPS_RTR;
     attr_mask = IBV_QP_STATE;
 
+    qp->sgid_idx = sgid_idx;
+
     switch (qp_type) {
     case IBV_QPT_RC:
         pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
                be64_to_cpu(ibv_gid.global.subnet_prefix),
                be64_to_cpu(ibv_gid.global.interface_id));
         pr_dbg("dqpn=0x%x\n", dqpn);
-        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
+        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
         pr_dbg("sport_num=%d\n", backend_dev->port_num);
         pr_dbg("rq_psn=0x%x\n", rq_psn);
 
@@ -735,7 +747,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         attr.ah_attr.is_global      = 1;
         attr.ah_attr.grh.hop_limit  = 1;
         attr.ah_attr.grh.dgid       = ibv_gid;
-        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
+        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
         attr.rq_psn                 = rq_psn;
 
         attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
@@ -744,8 +756,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         break;
 
     case IBV_QPT_UD:
+        pr_dbg("qkey=0x%x\n", qkey);
         if (use_qkey) {
-            pr_dbg("qkey=0x%x\n", qkey);
             attr.qkey = qkey;
             attr_mask |= IBV_QP_QKEY;
         }
@@ -861,13 +873,13 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
     grh->dgid = *my_gid;
 
     pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
-    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
-    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
 }
 
 static inline int mad_can_receieve(void *opaque)
 {
-    return sizeof(struct backend_umad);
+    return sizeof(RdmaCmMuxMsg);
 }
 
 static void mad_read(void *opaque, const uint8_t *buf, int size)
@@ -877,13 +889,13 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     unsigned long cqe_ctx_id;
     BackendCtx *bctx;
     char *mad;
-    struct backend_umad *umad;
+    RdmaCmMuxMsg *msg;
 
-    assert(size != sizeof(umad));
-    umad = (struct backend_umad *)buf;
+    assert(size != sizeof(msg));
+    msg = (RdmaCmMuxMsg *)buf;
 
     pr_dbg("Got %d bytes\n", size);
-    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+    pr_dbg("umad_len=%d\n", msg->umad_len);
 
 #ifdef PVRDMA_DEBUG
     struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
@@ -913,15 +925,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
-    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
         comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
                      bctx->up_ctx);
     } else {
+        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
-                      (union ibv_gid *)&umad->hdr.addr.gid,
-                      &backend_dev->gid, umad->hdr.length);
-        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
+                      msg->umad_len);
+        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
         comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
@@ -933,10 +946,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
 static int mad_init(RdmaBackendDev *backend_dev)
 {
-    struct backend_umad umad = {0};
     int ret;
 
-    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+    ret = qemu_chr_fe_backend_connected(backend_dev->mad_chr_be);
+    if (!ret) {
         pr_dbg("Missing chardev for MAD multiplexer\n");
         return -EIO;
     }
@@ -944,14 +957,6 @@ static int mad_init(RdmaBackendDev *backend_dev)
     qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
                              mad_read, NULL, NULL, backend_dev, NULL, true);
 
-    /* Register ourself */
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad.hdr));
-    if (ret != sizeof(umad.hdr)) {
-        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
-    }
-
     qemu_mutex_init(&backend_dev->recv_mads_list.lock);
     backend_dev->recv_mads_list.list = qlist_new();
 
@@ -988,23 +993,120 @@ static void mad_fini(RdmaBackendDev *backend_dev)
     qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
 }
 
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid)
+{
+    union ibv_gid sgid;
+    int ret;
+    int i = 0;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    do {
+        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
+                            &sgid);
+        i++;
+    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
+
+    pr_dbg("gid_index=%d\n", i - 1);
+
+    return ret ? ret : i - 1;
+}
+
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", msg.hdr.err_code);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, true,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return ret;
+}
+
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_UNREG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n",
+               msg.hdr.err_code);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, false,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return 0;
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp)
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp)
 {
     int i;
     int ret = 0;
     int num_ibv_devices;
     struct ibv_device **dev_list;
-    struct ibv_port_attr port_attr;
 
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
     backend_dev->mad_chr_be = mad_chr_be;
-    backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
 
@@ -1041,9 +1143,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         backend_dev->ib_dev = *dev_list;
     }
 
-    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
-           ibv_get_device_name(backend_dev->ib_dev),
-           backend_dev->port_num, backend_dev->backend_gid_idx);
+    pr_dbg("Using backend device %s, port %d\n",
+           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
 
     backend_dev->context = ibv_open_device(backend_dev->ib_dev);
     if (!backend_dev->context) {
@@ -1060,20 +1161,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     }
     pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
 
-    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
-                         &port_attr);
-    if (ret) {
-        error_setg(errp, "Error %d from ibv_query_port", ret);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-
-    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
-        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
-                   port_attr.gid_tbl_len);
-        goto out_destroy_comm_channel;
-    }
-
     ret = init_device_caps(backend_dev, dev_attr);
     if (ret) {
         error_setg(errp, "Failed to initialize device capabilities");
@@ -1081,18 +1168,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         goto out_destroy_comm_channel;
     }
 
-    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
-                         backend_dev->backend_gid_idx, &backend_dev->gid);
-    if (ret) {
-        error_setg(errp, "Failed to query gid %d",
-                   backend_dev->backend_gid_idx);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
-    pr_dbg("interface_id=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.interface_id));
 
     ret = mad_init(backend_dev);
     if (ret) {
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index fc83330251..59ad2b874b 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -28,11 +28,6 @@ enum ibv_special_qp_type {
     IBV_QPT_GSI = 1,
 };
 
-static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
-{
-    return &dev->gid;
-}
-
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
     return qp->ibqp ? qp->ibqp->qp_num : 1;
@@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp);
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
 void rdma_backend_register_comp_handler(void (*handler)(int status,
@@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
                                uint8_t qp_type, uint32_t qkey);
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey);
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
                               uint32_t sq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
@@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
+                            uint8_t sgid_idx, union ibv_gid *sgid,
                             union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
                             void *ctx);
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 2a7e667075..ff8b2426a0 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -37,14 +37,12 @@ typedef struct RecvMadList {
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
-    union ibv_gid gid;
     PCIDevice *dev;
     RdmaDeviceResources *rdma_dev_res;
     struct ibv_device *ib_dev;
     struct ibv_context *context;
     struct ibv_comp_channel *channel;
     uint8_t port_num;
-    uint8_t backend_gid_idx;
     RecvMadList recv_mads_list;
     CharBackend *mad_chr_be;
 } RdmaBackendDev;
@@ -66,6 +64,7 @@ typedef struct RdmaBackendCQ {
 typedef struct RdmaBackendQP {
     struct ibv_pd *ibpd;
     struct ibv_qp *ibqp;
+    uint8_t sgid_idx;
 } RdmaBackendQP;
 
 #endif
diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 4f10fcabcc..fe0979415d 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -391,7 +391,7 @@ out_dealloc_qp:
 }
 
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn)
@@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int ret;
 
     pr_dbg("qpn=0x%x\n", qp_handle);
+    pr_dbg("qkey=0x%x\n", qkey);
 
     qp = rdma_rm_get_qp(dev_res, qp_handle);
     if (!qp) {
@@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         }
 
         if (qp->qp_state == IBV_QPS_RTR) {
+            /* Get backend gid index */
+            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
+            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
+                                                     sgid_idx);
+            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
+                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
+                return -EIO;
+            }
+
             ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
-                                            qp->qp_type, dgid, dqpn, rq_psn,
-                                            qkey, attr_mask & IBV_QP_QKEY);
+                                            qp->qp_type, sgid_idx, dgid, dqpn,
+                                            rq_psn, qkey,
+                                            attr_mask & IBV_QP_QKEY);
             if (ret) {
                 return -EIO;
             }
@@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
     res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
 }
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
+    if (rc <= 0) {
+        pr_dbg("Fail to add gid\n");
+        return -EINVAL;
+    }
+
+    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+
+    return 0;
+}
+
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_del_gid(backend_dev, ifname,
+                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+    if (rc < 0) {
+        pr_dbg("Fail to delete gid\n");
+        return -EINVAL;
+    }
+
+    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
+    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+
+    return 0;
+}
+
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx)
+{
+    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
+        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
+        return -EINVAL;
+    }
+
+    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+            rdma_backend_get_gid_index(backend_dev,
+                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+    }
+
+    pr_dbg("backend_gid_index=%d\n",
+           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+
+    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+}
+
 static void destroy_qp_hash_key(gpointer data)
 {
     g_bytes_unref(data);
 }
 
+static void init_ports(RdmaDeviceResources *dev_res)
+{
+    int i, j;
+
+    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        dev_res->ports[i].state = IBV_PORT_DOWN;
+        for (j = 0; j < MAX_PORT_GIDS; j++) {
+            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
+        }
+    }
+}
+
+static void fini_ports(RdmaDeviceResources *dev_res,
+                       RdmaBackendDev *backend_dev, const char *ifname)
+{
+    int i;
+
+    dev_res->ports[0].state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
+    }
+}
+
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp)
 {
@@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                        dev_attr->max_qp_wr, sizeof(void *));
     res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
 
+    init_ports(dev_res);
+
     return 0;
 }
 
-void rdma_rm_fini(RdmaDeviceResources *dev_res)
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname)
 {
+    fini_ports(dev_res, backend_dev, ifname);
+
     res_tbl_free(&dev_res->uc_tbl);
     res_tbl_free(&dev_res->cqe_ctx_tbl);
     res_tbl_free(&dev_res->qp_tbl);
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index b4e04cc7b4..a7169b4e89 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -22,7 +22,8 @@
 
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp);
-void rdma_rm_fini(RdmaDeviceResources *dev_res);
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname);
 
 int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
                      uint32_t *pd_handle, uint32_t ctx_handle);
@@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
                      uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
 RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn);
@@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
 void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx);
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx);
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx);
+static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
+                                             int sgid_idx)
+{
+    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+}
+
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 9b399063d3..7b3435f991 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -19,7 +19,7 @@
 #include "rdma_backend_defs.h"
 
 #define MAX_PORTS             1
-#define MAX_PORT_GIDS         1
+#define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
 #define MAX_PKEYS             MAX_PORT_PKEYS
@@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
     enum ibv_qp_state qp_state;
 } RdmaRmQP;
 
+typedef struct RdmaRmGid {
+    union ibv_gid gid;
+    int backend_gid_index;
+} RdmaRmGid;
+
 typedef struct RdmaRmPort {
-    union ibv_gid gid_tbl[MAX_PORT_GIDS];
+    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
     enum ibv_port_state state;
 } RdmaRmPort;
 
diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 04c7c2ef5b..989db249ef 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "hw/pci/pci.h"
 #include "sysemu/dma.h"
+#include <stdio.h>
 
 #define pr_info(fmt, ...) \
     fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
@@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
 #define pr_dbg(fmt, ...) \
     fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
             __func__, __LINE__, ## __VA_ARGS__)
+
+#define pr_dbg_buf(title, buf, len) \
+{ \
+    char *b = g_malloc0(len * 3 + 1); \
+    char b1[4]; \
+    for (int i = 0; i < len; i++) { \
+        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
+        strcat(b, b1); \
+    } \
+    pr_dbg("%s (%d): %s\n", title, len, b); \
+    g_free(b); \
+}
+
 #else
 #define init_pr_dbg(void)
 #define pr_dbg(fmt, ...)
+#define pr_dbg_buf(title, buf, len)
 #endif
 
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 15c3f28b86..b019cb843a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -79,8 +79,8 @@ typedef struct PVRDMADev {
     int interrupt_mask;
     struct ibv_device_attr dev_attr;
     uint64_t node_guid;
+    char *backend_eth_device_name;
     char *backend_device_name;
-    uint8_t backend_gid_idx;
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 57d6f41ae6..a334f6205e 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rsp->hdr.response = cmd->hdr.response;
     rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
 
-    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
-                                 cmd->qp_handle, cmd->attr_mask,
-                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
-                                 cmd->attrs.dest_qp_num,
-                                 (enum ibv_qp_state)cmd->attrs.qp_state,
-                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
-                                 cmd->attrs.sq_psn);
+    /* No need to verify sgid_index since it is u8 */
+
+    rsp->hdr.err =
+        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
+                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
+                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
+                          cmd->attrs.dest_qp_num,
+                          (enum ibv_qp_state)cmd->attrs.qp_state,
+                          cmd->attrs.qkey, cmd->attrs.rq_psn,
+                          cmd->attrs.sq_psn);
 
     pr_dbg("ret=%d\n", rsp->hdr.err);
     return rsp->hdr.err;
@@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                        union pvrdma_cmd_resp *rsp)
 {
     struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
-#ifdef PVRDMA_DEBUG
-    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
-    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
-#endif
+    int rc;
+    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
 
     pr_dbg("index=%d\n", cmd->index);
 
@@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
-           (long long unsigned int)be64_to_cpu(*subnet),
-           (long long unsigned int)be64_to_cpu(*if_id));
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
 
-    /* Driver forces to one port only */
-    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
-           sizeof(cmd->new_gid));
+    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                         dev->backend_eth_device_name, gid, cmd->index);
+    if (rc < 0) {
+        return -EINVAL;
+    }
 
     /* TODO: Since drivers stores node_guid at load_dsr phase then this
      * assignment is not relevant, i need to figure out a way how to
      * retrieve MAC of our netdev */
-    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
-    pr_dbg("dev->node_guid=0x%llx\n",
-           (long long unsigned int)be64_to_cpu(dev->node_guid));
+    if (!cmd->index) {
+        dev->node_guid =
+            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
+        pr_dbg("dev->node_guid=0x%llx\n",
+               (long long unsigned int)be64_to_cpu(dev->node_guid));
+    }
 
     return 0;
 }
@@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                         union pvrdma_cmd_resp *rsp)
 {
+    int rc;
+
     struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
 
     pr_dbg("index=%d\n", cmd->index);
@@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
-           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
+    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                        dev->backend_eth_device_name, cmd->index);
+
+    if (rc < 0) {
+        rsp->hdr.err = rc;
+        goto out;
+    }
 
     return 0;
 }
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fc2abd34af..ac8c092db0 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -36,9 +36,9 @@
 #include "pvrdma_qp_ops.h"
 
 static Property pvrdma_dev_properties[] = {
-    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
-    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
-    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
+    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
+    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
+    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
     DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
                        MAX_MR_SIZE),
     DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
@@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     pr_dbg("Initialized\n");
 }
 
-static void init_ports(PVRDMADev *dev, Error **errp)
-{
-    int i;
-
-    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
-
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
-    }
-}
-
 static void uninit_msix(PCIDevice *pdev, int used_vectors)
 {
     PVRDMADev *dev = PVRDMA_DEV(pdev);
@@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
 
     pvrdma_qp_ops_fini();
 
-    rdma_rm_fini(&dev->rdma_dev_res);
+    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
+                 dev->backend_eth_device_name);
 
     rdma_backend_fini(&dev->backend_dev);
 
@@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
-                           errp);
+                           &dev->dev_attr, &dev->mad_chr, errp);
     if (rc) {
         goto out;
     }
@@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
-    init_ports(dev, errp);
-
     rc = pvrdma_qp_ops_init();
     if (rc) {
         goto out;
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 3388be1926..2130824098 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
     RdmaRmQP *qp;
     PvrdmaSqWqe *wqe;
     PvrdmaRing *ring;
+    int sgid_idx;
+    union ibv_gid *sgid;
 
     pr_dbg("qp_handle=0x%x\n", qp_handle);
 
@@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.opcode = IBV_WC_SEND;
 
+        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
+        if (!sgid) {
+            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
+               sgid->global.interface_id);
+
+        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
+                                                 &dev->backend_dev,
+                                                 wqe->hdr.wr.ud.av.gid_index);
+        if (sgid_idx <= 0) {
+            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
+                   wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
+                               sgid_idx, sgid,
                                (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
                                wqe->hdr.wr.ud.remote_qpn,
                                wqe->hdr.wr.ud.remote_qkey, comp_ctx);
-- 
2.17.2


* [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (34 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
                   ` (10 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

pvrdma setup requires a vmxnet3 device on PCI function 0 and a pvrdma device
on PCI function 1 of the same slot.
The pvrdma device needs to access the vmxnet3 device object for several
reasons:
1. To make sure PCI function 0 is a vmxnet3 device.
2. To monitor the vmxnet3 device's state.
3. To configure node_guid according to the vmxnet3 device's MAC address.

To allow this access, the definition of VMXNET3State is moved to a new
header file.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Dmitry Fleytman <dmitry.fleytman@gmail.com>
---
 hw/net/vmxnet3.c      | 116 +-----------------------------------
 hw/net/vmxnet3_defs.h | 133 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 115 deletions(-)
 create mode 100644 hw/net/vmxnet3_defs.h

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 3648630386..54746a4030 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -18,7 +18,6 @@
 #include "qemu/osdep.h"
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
-#include "net/net.h"
 #include "net/tap.h"
 #include "net/checksum.h"
 #include "sysemu/sysemu.h"
@@ -29,6 +28,7 @@
 #include "migration/register.h"
 
 #include "vmxnet3.h"
+#include "vmxnet3_defs.h"
 #include "vmxnet_debug.h"
 #include "vmware_utils.h"
 #include "net_tx_pkt.h"
@@ -131,23 +131,11 @@ typedef struct VMXNET3Class {
     DeviceRealize parent_dc_realize;
 } VMXNET3Class;
 
-#define TYPE_VMXNET3 "vmxnet3"
-#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
-
 #define VMXNET3_DEVICE_CLASS(klass) \
     OBJECT_CLASS_CHECK(VMXNET3Class, (klass), TYPE_VMXNET3)
 #define VMXNET3_DEVICE_GET_CLASS(obj) \
     OBJECT_GET_CLASS(VMXNET3Class, (obj), TYPE_VMXNET3)
 
-/* Cyclic ring abstraction */
-typedef struct {
-    hwaddr pa;
-    uint32_t size;
-    uint32_t cell_size;
-    uint32_t next;
-    uint8_t gen;
-} Vmxnet3Ring;
-
 static inline void vmxnet3_ring_init(PCIDevice *d,
 				     Vmxnet3Ring *ring,
                                      hwaddr pa,
@@ -245,108 +233,6 @@ vmxnet3_dump_rx_descr(struct Vmxnet3_RxDesc *descr)
               descr->rsvd, descr->dtype, descr->ext1, descr->btype);
 }
 
-/* Device state and helper functions */
-#define VMXNET3_RX_RINGS_PER_QUEUE (2)
-
-typedef struct {
-    Vmxnet3Ring tx_ring;
-    Vmxnet3Ring comp_ring;
-
-    uint8_t intr_idx;
-    hwaddr tx_stats_pa;
-    struct UPT1_TxStats txq_stats;
-} Vmxnet3TxqDescr;
-
-typedef struct {
-    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
-    Vmxnet3Ring comp_ring;
-    uint8_t intr_idx;
-    hwaddr rx_stats_pa;
-    struct UPT1_RxStats rxq_stats;
-} Vmxnet3RxqDescr;
-
-typedef struct {
-    bool is_masked;
-    bool is_pending;
-    bool is_asserted;
-} Vmxnet3IntState;
-
-typedef struct {
-        PCIDevice parent_obj;
-        NICState *nic;
-        NICConf conf;
-        MemoryRegion bar0;
-        MemoryRegion bar1;
-        MemoryRegion msix_bar;
-
-        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
-        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
-
-        /* Whether MSI-X support was installed successfully */
-        bool msix_used;
-        hwaddr drv_shmem;
-        hwaddr temp_shared_guest_driver_memory;
-
-        uint8_t txq_num;
-
-        /* This boolean tells whether RX packet being indicated has to */
-        /* be split into head and body chunks from different RX rings  */
-        bool rx_packets_compound;
-
-        bool rx_vlan_stripping;
-        bool lro_supported;
-
-        uint8_t rxq_num;
-
-        /* Network MTU */
-        uint32_t mtu;
-
-        /* Maximum number of fragments for indicated TX packets */
-        uint32_t max_tx_frags;
-
-        /* Maximum number of fragments for indicated RX packets */
-        uint16_t max_rx_frags;
-
-        /* Index for events interrupt */
-        uint8_t event_int_idx;
-
-        /* Whether automatic interrupts masking enabled */
-        bool auto_int_masking;
-
-        bool peer_has_vhdr;
-
-        /* TX packets to QEMU interface */
-        struct NetTxPkt *tx_pkt;
-        uint32_t offload_mode;
-        uint32_t cso_or_gso_size;
-        uint16_t tci;
-        bool needs_vlan;
-
-        struct NetRxPkt *rx_pkt;
-
-        bool tx_sop;
-        bool skip_current_tx_pkt;
-
-        uint32_t device_active;
-        uint32_t last_command;
-
-        uint32_t link_status_and_speed;
-
-        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
-
-        uint32_t temp_mac;   /* To store the low part first */
-
-        MACAddr perm_mac;
-        uint32_t vlan_table[VMXNET3_VFT_SIZE];
-        uint32_t rx_mode;
-        MACAddr *mcast_list;
-        uint32_t mcast_list_len;
-        uint32_t mcast_list_buff_size; /* needed for live migration. */
-
-        /* Compatibility flags for migration */
-        uint32_t compat_flags;
-} VMXNET3State;
-
 /* Interrupt management */
 
 /*
diff --git a/hw/net/vmxnet3_defs.h b/hw/net/vmxnet3_defs.h
new file mode 100644
index 0000000000..6c19d29b12
--- /dev/null
+++ b/hw/net/vmxnet3_defs.h
@@ -0,0 +1,133 @@
+/*
+ * QEMU VMWARE VMXNET3 paravirtual NIC
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "net/net.h"
+#include "hw/net/vmxnet3.h"
+
+#define TYPE_VMXNET3 "vmxnet3"
+#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
+
+/* Device state and helper functions */
+#define VMXNET3_RX_RINGS_PER_QUEUE (2)
+
+/* Cyclic ring abstraction */
+typedef struct {
+    hwaddr pa;
+    uint32_t size;
+    uint32_t cell_size;
+    uint32_t next;
+    uint8_t gen;
+} Vmxnet3Ring;
+
+typedef struct {
+    Vmxnet3Ring tx_ring;
+    Vmxnet3Ring comp_ring;
+
+    uint8_t intr_idx;
+    hwaddr tx_stats_pa;
+    struct UPT1_TxStats txq_stats;
+} Vmxnet3TxqDescr;
+
+typedef struct {
+    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
+    Vmxnet3Ring comp_ring;
+    uint8_t intr_idx;
+    hwaddr rx_stats_pa;
+    struct UPT1_RxStats rxq_stats;
+} Vmxnet3RxqDescr;
+
+typedef struct {
+    bool is_masked;
+    bool is_pending;
+    bool is_asserted;
+} Vmxnet3IntState;
+
+typedef struct {
+        PCIDevice parent_obj;
+        NICState *nic;
+        NICConf conf;
+        MemoryRegion bar0;
+        MemoryRegion bar1;
+        MemoryRegion msix_bar;
+
+        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
+        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
+
+        /* Whether MSI-X support was installed successfully */
+        bool msix_used;
+        hwaddr drv_shmem;
+        hwaddr temp_shared_guest_driver_memory;
+
+        uint8_t txq_num;
+
+        /* This boolean tells whether RX packet being indicated has to */
+        /* be split into head and body chunks from different RX rings  */
+        bool rx_packets_compound;
+
+        bool rx_vlan_stripping;
+        bool lro_supported;
+
+        uint8_t rxq_num;
+
+        /* Network MTU */
+        uint32_t mtu;
+
+        /* Maximum number of fragments for indicated TX packets */
+        uint32_t max_tx_frags;
+
+        /* Maximum number of fragments for indicated RX packets */
+        uint16_t max_rx_frags;
+
+        /* Index for events interrupt */
+        uint8_t event_int_idx;
+
+        /* Whether automatic interrupts masking enabled */
+        bool auto_int_masking;
+
+        bool peer_has_vhdr;
+
+        /* TX packets to QEMU interface */
+        struct NetTxPkt *tx_pkt;
+        uint32_t offload_mode;
+        uint32_t cso_or_gso_size;
+        uint16_t tci;
+        bool needs_vlan;
+
+        struct NetRxPkt *rx_pkt;
+
+        bool tx_sop;
+        bool skip_current_tx_pkt;
+
+        uint32_t device_active;
+        uint32_t last_command;
+
+        uint32_t link_status_and_speed;
+
+        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
+
+        uint32_t temp_mac;   /* To store the low part first */
+
+        MACAddr perm_mac;
+        uint32_t vlan_table[VMXNET3_VFT_SIZE];
+        uint32_t rx_mode;
+        MACAddr *mcast_list;
+        uint32_t mcast_list_len;
+        uint32_t mcast_list_buff_size; /* needed for live migration. */
+
+        /* Compatibility flags for migration */
+        uint32_t compat_flags;
+} VMXNET3State;
-- 
2.17.2


* [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (35 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
                   ` (9 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The guest driver enforces this layout, so the device model should too.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      | 2 ++
 hw/rdma/vmw/pvrdma_main.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index b019cb843a..10a3c4fb7c 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -20,6 +20,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
+#include "hw/net/vmxnet3_defs.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -85,6 +86,7 @@ typedef struct PVRDMADev {
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
+    VMXNET3State *func0;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ac8c092db0..fa6468d221 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
+    /* Break if PCI function 0 is not a vmxnet3 device */
+    dev->func0 = VMXNET3(pci_get_function_0(pdev));
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (36 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
                   ` (8 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

node_guid should be set once the device is loaded.
Make node_guid the EUI-64 interface id derived from the 48-bit MAC address
of the vmxnet3 device in PCI function 0.

A new helper function is added to do the conversion.
For example, the MAC 56:b6:44:e9:62:dc is converted to the GID
interface id 54b6:44ff:fee9:62dc.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_utils.h      |  9 +++++++++
 hw/rdma/vmw/pvrdma_cmd.c  | 10 ----------
 hw/rdma/vmw/pvrdma_main.c |  5 ++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 989db249ef..202abb3366 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -63,4 +63,13 @@ extern unsigned long pr_dbg_cnt;
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
 void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
 
+static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
+{
+    memcpy(eui, addr, 3);
+    eui[3] = 0xFF;
+    eui[4] = 0xFE;
+    memcpy(eui + 5, addr + 3, 3);
+    eui[0] ^= 2;
+}
+
 #endif
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index a334f6205e..2979582fac 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -592,16 +592,6 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    /* TODO: Since drivers stores node_guid at load_dsr phase then this
-     * assignment is not relevant, i need to figure out a way how to
-     * retrieve MAC of our netdev */
-    if (!cmd->index) {
-        dev->node_guid =
-            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
-        pr_dbg("dev->node_guid=0x%llx\n",
-               (long long unsigned int)be64_to_cpu(dev->node_guid));
-    }
-
     return 0;
 }
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fa6468d221..95e9322b7c 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -264,7 +264,7 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     dsr->caps.sys_image_guid = 0;
     pr_dbg("sys_image_guid=%" PRIx64 "\n", dsr->caps.sys_image_guid);
 
-    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+    dsr->caps.node_guid = dev->node_guid;
     pr_dbg("node_guid=%" PRIx64 "\n", be64_to_cpu(dsr->caps.node_guid));
 
     dsr->caps.phys_port_cnt = MAX_PORTS;
@@ -579,6 +579,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
     /* Break if not vmxnet3 device in slot 0 */
     dev->func0 = VMXNET3(pci_get_function_0(pdev));
 
+    addrconf_addr_eui48((unsigned char *)&dev->node_guid,
+                        (const char *)&dev->func0->conf.macaddr.a);
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2

* [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (37 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
                   ` (7 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The user should be able to control the device through the state of its
Ethernet function, so when the user runs 'ifconfig ens3 down' the PVRDMA
function should go down as well.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 2979582fac..0d3c818c20 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -139,7 +139,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
     resp->hdr.err = 0;
 
-    resp->attrs.state = attrs.state;
+    resp->attrs.state = dev->func0->device_active ? attrs.state :
+                                                    PVRDMA_PORT_DOWN;
     resp->attrs.max_mtu = attrs.max_mtu;
     resp->attrs.active_mtu = attrs.active_mtu;
     resp->attrs.phys_state = attrs.phys_state;
-- 
2.17.2

* [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (38 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
                   ` (6 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Add the ability to pass specific WC attributes through to the CQE, such as
the IBV_WC_GRH flag in wc_flags.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 59 +++++++++++++++++++++++--------------
 hw/rdma/rdma_backend.h      |  4 +--
 hw/rdma/vmw/pvrdma_qp_ops.c | 31 +++++++++++--------
 3 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 5675504165..e453bda8f9 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -59,13 +59,24 @@ struct backend_umad {
     char mad[RDMA_MAX_PRIVATE_DATA];
 };
 
-static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
+static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
 
-static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     pr_err("No completion handler is registered\n");
 }
 
+static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
+                                 void *ctx)
+{
+    struct ibv_wc wc = {0};
+
+    wc.status = status;
+    wc.vendor_err = vendor_err;
+
+    comp_handler(ctx, &wc);
+}
+
 static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
 {
     int i, ne;
@@ -90,7 +101,7 @@ static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
             }
             pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
 
-            comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
+            comp_handler(bctx->up_ctx, &wc[i]);
 
             rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
             g_free(bctx);
@@ -184,8 +195,8 @@ static void start_comp_thread(RdmaBackendDev *backend_dev)
                        comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
 }
 
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx))
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                         struct ibv_wc *wc))
 {
     comp_handler = handler;
 }
@@ -369,14 +380,14 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
-                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+                complete_work(IBV_WC_SUCCESS, 0, ctx);
             }
         }
         return;
@@ -385,7 +396,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -396,21 +407,21 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
         }
         wr.wr.ud.remote_qpn = dqpn;
@@ -428,7 +439,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -497,13 +508,13 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
             }
         }
         return;
@@ -512,7 +523,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -523,14 +534,14 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -542,7 +553,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post recv WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -926,9 +937,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
     if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
-                     bctx->up_ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                      bctx->up_ctx);
     } else {
+        struct ibv_wc wc = {0};
         pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
@@ -937,7 +949,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
         memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
-        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+        wc.byte_len = msg->umad_len;
+        wc.status = IBV_WC_SUCCESS;
+        wc.wc_flags = IBV_WC_GRH;
+        comp_handler(bctx->up_ctx, &wc);
     }
 
     g_free(bctx);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 59ad2b874b..8cae40f827 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -57,8 +57,8 @@ int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
                                union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx));
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                        struct ibv_wc *wc));
 void rdma_backend_unregister_comp_handler(void);
 
 int rdma_backend_query_port(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 2130824098..300471a4c9 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -47,7 +47,7 @@ typedef struct PvrdmaRqWqe {
  * 3. Interrupt host
  */
 static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
-                           struct pvrdma_cqe *cqe)
+                           struct pvrdma_cqe *cqe, struct ibv_wc *wc)
 {
     struct pvrdma_cqe *cqe1;
     struct pvrdma_cqne *cqne;
@@ -66,6 +66,7 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pr_dbg("Writing CQE\n");
     cqe1 = pvrdma_ring_next_elem_write(ring);
     if (unlikely(!cqe1)) {
+        pr_dbg("No CQEs in ring\n");
         return -EINVAL;
     }
 
@@ -73,8 +74,20 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     cqe1->wr_id = cqe->wr_id;
     cqe1->qp = cqe->qp;
     cqe1->opcode = cqe->opcode;
-    cqe1->status = cqe->status;
-    cqe1->vendor_err = cqe->vendor_err;
+    cqe1->status = wc->status;
+    cqe1->byte_len = wc->byte_len;
+    cqe1->src_qp = wc->src_qp;
+    cqe1->wc_flags = wc->wc_flags;
+    cqe1->vendor_err = wc->vendor_err;
+
+    pr_dbg("wr_id=%" PRIx64 "\n", cqe1->wr_id);
+    pr_dbg("qp=0x%lx\n", cqe1->qp);
+    pr_dbg("opcode=%d\n", cqe1->opcode);
+    pr_dbg("status=%d\n", cqe1->status);
+    pr_dbg("byte_len=%d\n", cqe1->byte_len);
+    pr_dbg("src_qp=%d\n", cqe1->src_qp);
+    pr_dbg("wc_flags=%d\n", cqe1->wc_flags);
+    pr_dbg("vendor_err=%d\n", cqe1->vendor_err);
 
     pvrdma_ring_write_inc(ring);
 
@@ -99,18 +112,12 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     return 0;
 }
 
-static void pvrdma_qp_ops_comp_handler(int status, unsigned int vendor_err,
-                                       void *ctx)
+static void pvrdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
 
-    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
-    pr_dbg("wr_id=%" PRIx64 "\n", comp_ctx->cqe.wr_id);
-    pr_dbg("status=%d\n", status);
-    pr_dbg("vendor_err=0x%x\n", vendor_err);
-    comp_ctx->cqe.status = status;
-    comp_ctx->cqe.vendor_err = vendor_err;
-    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
+    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe, wc);
+
     g_free(ctx);
 }
 
-- 
2.17.2

* [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (39 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
                   ` (5 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The guest driver checks the error code in the command response, so let's set
it in every command handler.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 0d3c818c20..a326c5d470 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     if (rdma_backend_query_port(&dev->backend_dev,
                                 (struct ibv_port_attr *)&attrs)) {
-        return -ENOMEM;
+        resp->hdr.err = -ENOMEM;
+        goto out;
     }
 
     memset(resp, 0, sizeof(*resp));
@@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->attrs.active_width = 1;
     resp->attrs.active_speed = 1;
 
-    return 0;
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
 }
 
 static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->pkey = PVRDMA_PKEY;
     pr_dbg("pkey=0x%x\n", resp->pkey);
 
-    return 0;
+    return resp->hdr.err;
 }
 
 static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
@@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
     cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
     if (!cq) {
         pr_dbg("Invalid CQ handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     ring = (PvrdmaRing *)cq->opaque;
@@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
@@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
     if (!qp) {
         pr_dbg("Invalid QP handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
@@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
     g_free(ring);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
@@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
                          dev->backend_eth_device_name, gid, cmd->index);
     if (rc < 0) {
-        return -EINVAL;
+        rsp->hdr.err = rc;
+        goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
@@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
                                      &resp->ctx_handle);
 
-    pr_dbg("ret=%d\n", resp->hdr.err);
-
-    return 0;
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 struct cmd_handler {
     uint32_t cmd;
@@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
     }
 
     err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
-                            dsr_info->rsp);
+                                                    dsr_info->rsp);
 out:
     set_reg_val(dev, PVRDMA_REG_ERR, err);
     post_interrupt(dev, INTR_VEC_CMD_RING);
-- 
2.17.2

* [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (40 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
                   ` (4 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The device supports only one port; remove the dead code that handles more
than one port.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c      | 34 ++++++++++++++++------------------
 hw/rdma/rdma_rm.h      |  2 +-
 hw/rdma/rdma_rm_defs.h |  4 ++--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index fe0979415d..0a5ab8935a 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -545,7 +545,7 @@ int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         return -EINVAL;
     }
 
-    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+    memcpy(&dev_res->port.gid_tbl[gid_idx].gid, gid, sizeof(*gid));
 
     return 0;
 }
@@ -556,15 +556,15 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int rc;
 
     rc = rdma_backend_del_gid(backend_dev, ifname,
-                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+                              &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc < 0) {
         pr_dbg("Fail to delete gid\n");
         return -EINVAL;
     }
 
-    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
-           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
-    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+    memset(dev_res->port.gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->port.gid_tbl[gid_idx].gid));
+    dev_res->port.gid_tbl[gid_idx].backend_gid_index = -1;
 
     return 0;
 }
@@ -577,16 +577,16 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
         return -EINVAL;
     }
 
-    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
-        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+    if (unlikely(dev_res->port.gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->port.gid_tbl[sgid_idx].backend_gid_index =
         rdma_backend_get_gid_index(backend_dev,
-                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+                                   &dev_res->port.gid_tbl[sgid_idx].gid);
     }
 
     pr_dbg("backend_gid_index=%d\n",
-           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+           dev_res->port.gid_tbl[sgid_idx].backend_gid_index);
 
-    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+    return dev_res->port.gid_tbl[sgid_idx].backend_gid_index;
 }
 
 static void destroy_qp_hash_key(gpointer data)
@@ -596,15 +596,13 @@ static void destroy_qp_hash_key(gpointer data)
 
 static void init_ports(RdmaDeviceResources *dev_res)
 {
-    int i, j;
+    int i;
 
-    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+    memset(&dev_res->port, 0, sizeof(dev_res->port));
 
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev_res->ports[i].state = IBV_PORT_DOWN;
-        for (j = 0; j < MAX_PORT_GIDS; j++) {
-            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
-        }
+    dev_res->port.state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        dev_res->port.gid_tbl[i].backend_gid_index = -1;
     }
 }
 
@@ -613,7 +611,7 @@ static void fini_ports(RdmaDeviceResources *dev_res,
 {
     int i;
 
-    dev_res->ports[0].state = IBV_PORT_DOWN;
+    dev_res->port.state = IBV_PORT_DOWN;
     for (i = 0; i < MAX_PORT_GIDS; i++) {
         rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
     }
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index a7169b4e89..3c602c04c0 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -79,7 +79,7 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
 static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
                                              int sgid_idx)
 {
-    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+    return &dev_res->port.gid_tbl[sgid_idx].gid;
 }
 
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7b3435f991..0ba61d1838 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -18,7 +18,7 @@
 
 #include "rdma_backend_defs.h"
 
-#define MAX_PORTS             1
+#define MAX_PORTS             1 /* Do not change - we support only one port */
 #define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
@@ -97,7 +97,7 @@ typedef struct RdmaRmPort {
 } RdmaRmPort;
 
 typedef struct RdmaDeviceResources {
-    RdmaRmPort ports[MAX_PORTS];
+    RdmaRmPort port;
     RdmaRmResTbl pd_tbl;
     RdmaRmResTbl mr_tbl;
     RdmaRmResTbl uc_tbl;
-- 
2.17.2

* [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (41 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
                   ` (3 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Introduce a notifier that signals the shutdown event. This allows devices
and other components to run cleanup code that must execute before the VM
is shut down.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 include/sysemu/sysemu.h |  1 +
 vl.c                    | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8d6095d98b..0d15f16492 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -80,6 +80,7 @@ void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_system_shutdown_request(ShutdownCause reason);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
+void qemu_register_shutdown_notifier(Notifier *notifier);
 void qemu_system_debug_request(void);
 void qemu_system_vmstop_request(RunState reason);
 void qemu_system_vmstop_request_prepare(void);
diff --git a/vl.c b/vl.c
index 1fcacc5caa..d33d52522c 100644
--- a/vl.c
+++ b/vl.c
@@ -1578,6 +1578,8 @@ static NotifierList suspend_notifiers =
     NOTIFIER_LIST_INITIALIZER(suspend_notifiers);
 static NotifierList wakeup_notifiers =
     NOTIFIER_LIST_INITIALIZER(wakeup_notifiers);
+static NotifierList shutdown_notifiers =
+    NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
 
 ShutdownCause qemu_shutdown_requested_get(void)
@@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
     notifier_list_notify(&powerdown_notifiers, NULL);
 }
 
+static void qemu_system_shutdown(ShutdownCause cause)
+{
+    qapi_event_send_shutdown(shutdown_caused_by_guest(cause));
+    notifier_list_notify(&shutdown_notifiers, &cause);
+}
+
 void qemu_system_powerdown_request(void)
 {
     trace_qemu_system_powerdown_request();
@@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
     notifier_list_add(&powerdown_notifiers, notifier);
 }
 
+void qemu_register_shutdown_notifier(Notifier *notifier)
+{
+    notifier_list_add(&shutdown_notifiers, notifier);
+}
+
 void qemu_system_debug_request(void)
 {
     debug_requested = 1;
@@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
     request = qemu_shutdown_requested();
     if (request) {
         qemu_kill_report();
-        qapi_event_send_shutdown(shutdown_caused_by_guest(request));
+        qemu_system_shutdown(request);
         if (no_shutdown) {
             vm_stop(RUN_STATE_SHUTDOWN);
         } else {
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (42 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
                   ` (2 subsequent siblings)
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

In order to clean up some external resources, such as GIDs, QPs etc.,
register to receive a notification when the VM is shut down.
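The notifier callback recovers the device from the embedded Notifier field
with container_of. The pointer arithmetic behind that macro can be sketched
as follows (container_of_sketch and the FakeDev/DevNotifier types are
hypothetical stand-ins for illustration, not QEMU's definitions):

```c
#include <stddef.h>

/* Hypothetical stand-in for QEMU's container_of macro: step back
 * from a member pointer to the enclosing structure. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down analogue of a device embedding its shutdown notifier. */
typedef struct DevNotifier {
    void (*notify)(struct DevNotifier *n, void *data);
} DevNotifier;

typedef struct FakeDev {
    int id;
    DevNotifier shutdown_notifier;
} FakeDev;

/* Given a pointer to the embedded member, recover the device,
 * mirroring what pvrdma_shutdown_notifier() does with PVRDMADev. */
static FakeDev *dev_from_notifier(DevNotifier *n)
{
    return container_of_sketch(n, FakeDev, shutdown_notifier);
}
```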

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      |  2 ++
 hw/rdma/vmw/pvrdma_main.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 10a3c4fb7c..ffae36986e 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -17,6 +17,7 @@
 #define PVRDMA_PVRDMA_H
 
 #include "qemu/units.h"
+#include "qemu/notify.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
@@ -87,6 +88,7 @@ typedef struct PVRDMADev {
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
     VMXNET3State *func0;
+    Notifier shutdown_notifier;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 95e9322b7c..45a59cddf9 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -24,6 +24,7 @@
 #include "hw/qdev-properties.h"
 #include "cpu.h"
 #include "trace.h"
+#include "sysemu/sysemu.h"
 
 #include "../rdma_rm.h"
 #include "../rdma_backend.h"
@@ -559,6 +560,14 @@ static int pvrdma_check_ram_shared(Object *obj, void *opaque)
     return 0;
 }
 
+static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
+{
+    PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    pvrdma_fini(pci_dev);
+}
+
 static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 {
     int rc;
@@ -623,6 +632,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
+    dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
+    qemu_register_shutdown_notifier(&dev->shutdown_notifier);
+
 out:
     if (rc) {
         error_append_hint(errp, "Device fail to load\n");
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (43 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

bitmap_zero_extend is designed for extending a bitmap, not for
shrinking or freeing it.
Use g_free instead.
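The underlying hazard is that an extend-style helper is built on realloc
and returns a possibly-moved pointer, so using it for teardown (and
discarding the return value, as the old call did) is wrong on two counts.
A minimal sketch, where bitmap_extend_sketch is a hypothetical
simplification rather than QEMU's bitmap_zero_extend:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplification of an extend-style bitmap helper:
 * grow the word array and zero the newly added words. */
static unsigned long *bitmap_extend_sketch(unsigned long *old,
                                           size_t old_words,
                                           size_t new_words)
{
    unsigned long *p = realloc(old, new_words * sizeof(*p));
    if (p != NULL && new_words > old_words) {
        memset(p + old_words, 0,
               (new_words - old_words) * sizeof(*p));
    }
    /* Discarding this return value leaks the new block and may
     * leave the caller holding a dangling 'old' pointer. */
    return p;
}
```

For freeing, a plain free(bitmap) (g_free in QEMU) is correct and
sufficient; realloc(p, 0) is implementation-defined in C and is not a
portable substitute for free.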

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 0a5ab8935a..35a96d9a64 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -43,7 +43,7 @@ static inline void res_tbl_free(RdmaRmResTbl *tbl)
 {
     qemu_mutex_destroy(&tbl->lock);
     g_free(tbl->tbl);
-    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
+    g_free(tbl->bitmap);
 }
 
 static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (44 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

When the device goes down, the function fini_ports loops over all
entries in the GID table, regardless of whether an entry is valid or
not. In case an entry is not valid we'd like to skip any further
processing in the backend device.
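The validity test this patch adds boils down to treating a GID whose
interface_id is zero as an empty slot. A standalone sketch of that check,
where gid_sketch mimics the layout of union ibv_gid from
<infiniband/verbs.h> for illustration and is not the real type:

```c
#include <stdint.h>
#include <string.h>

/* Mimics union ibv_gid: 16 raw bytes overlaid with a global view. */
typedef union {
    uint8_t raw[16];
    struct {
        uint64_t subnet_prefix;
        uint64_t interface_id;
    } global;
} gid_sketch;

/* An unset gid table entry is all-zero, so a zero interface_id marks
 * the slot as empty and lets teardown skip the backend del_gid call. */
static int gid_is_empty(const gid_sketch *gid)
{
    return gid->global.interface_id == 0;
}
```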

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 35a96d9a64..e3f6b2f6ea 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -555,6 +555,10 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
 {
     int rc;
 
+    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
+        return 0;
+    }
+
     rc = rdma_backend_del_gid(backend_dev, ifname,
                               &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc < 0) {
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation
  2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
                   ` (45 preceding siblings ...)
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
@ 2018-11-13  7:13 ` Yuval Shaia
  46 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-13  7:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The interface with the device has changed with the addition of support
for MAD packets.
Adjust the documentation accordingly.

While there, fix a minor mistake which may lead one to think that there
is a relation between using RXE on the host and compatibility with
bare-metal peers.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 84 insertions(+), 19 deletions(-)

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
index 5599318159..9e8d1674b7 100644
--- a/docs/pvrdma.txt
+++ b/docs/pvrdma.txt
@@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need for any special guest
 modifications.
 
 While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
-can work with Soft-RoCE (rxe).
+metal RDMA-enabled machines as peers.
+
+It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
 
 It does not require the whole guest RAM to be pinned allowing memory
 over-commit and, even if not implemented yet, migration support will be
@@ -78,29 +79,93 @@ the required RDMA libraries.
 
 3. Usage
 ========
+
+
+3.1 VM Memory settings
+======================
 Currently the device is working only with memory backed RAM
 and it must be mark as "shared":
    -m 1G \
    -object memory-backend-ram,id=mb1,size=1G,share \
    -numa node,memdev=mb1 \
 
-The pvrdma device is composed of two functions:
- - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
-   but is required to pass the ibdevice GID using its MAC.
-   Examples:
-     For an rxe backend using eth0 interface it will use its mac:
-       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
-     For an SRIOV VF, we take the Ethernet Interface exposed by it:
-       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
- - Function 1 is the actual device:
-       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
-   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
- Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
- The rules of conversion are part of the RoCE spec, but since manual conversion
- is not required, spotting problems is not hard:
-    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
-             MAC: 7c:fe:90:cb:74:3a
-    Note the difference between the first byte of the MAC and the GID.
+
+3.2 MAD Multiplexer
+===================
+MAD Multiplexer is a service that exposes MAD-like interface for VMs in
+order to overcome the limitation where only single entity can register with
+MAD layer to send and receive RDMA-CM MAD packets.
+
+To build rdmacm-mux run
+# make rdmacm-mux
+
+The program accepts 3 command line arguments and exposes a UNIX socket to
+be used to relay control and data messages to and from the service.
+-s unix-socket-path   Path to unix socket to listen on
+                      (default /var/run/rdmacm-mux)
+-d rdma-device-name   Name of RDMA device to register with
+                      (default rxe0)
+-p rdma-device-port   Port number of RDMA device to register with
+                      (default 1)
+The final UNIX socket file name is a concatenation of the 3 arguments so
+for example for device name mlx5_0 and port 2 the file
+/var/run/rdmacm-mux-mlx5_0-2 will be created.
+
+Please refer to contrib/rdmacm-mux for more details.
+
+
+3.3 PCI devices settings
+========================
+RoCE device exposes two functions - Ethernet and RDMA.
+To support it, pvrdma device is composed of two PCI functions, an Ethernet
+device of type vmxnet3 on PCI slot 0 and a pvrdma device on PCI slot 1. The
+Ethernet function can be used for other Ethernet purposes such as IP.
+
+
+3.4 Device parameters
+=====================
+- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
+  would be the Ethernet device used to create it. For any other physical
+  RoCE device this would be the netdev name of the device.
+- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
+- mad-chardev: The name of the MAD multiplexer char device.
+- ibport: In case of multi-port device (such as Mellanox's HCA) this
+  specify the port to use. If not set 1 will be used.
+- dev-caps-max-mr-size: The maximum size of MR.
+- dev-caps-max-qp: Maximum number of QPs.
+- dev-caps-max-sge: Maximum number of SGE elements in WR.
+- dev-caps-max-cq: Maximum number of CQs.
+- dev-caps-max-mr: Maximum number of MRs.
+- dev-caps-max-pd: Maximum number of PDs.
+- dev-caps-max-ah: Maximum number of AHs.
+
+Notes:
+- The first 3 parameters are mandatory settings, the rest have their
+  defaults.
+- The last 8 parameters (the ones that prefixed by dev-caps) defines the top
+  limits but the final values are adjusted by the backend device limitations.
+
+3.5 Example
+===========
+Define bridge device with vmxnet3 network backend:
+<interface type='bridge'>
+  <mac address='56:b4:44:e9:62:dc'/>
+  <source bridge='bridge1'/>
+  <model type='vmxnet3'/>
+  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
+</interface>
+
+Define pvrdma device:
+<qemu:commandline>
+  <qemu:arg value='-object'/>
+  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
+  <qemu:arg value='-numa'/>
+  <qemu:arg value='node,memdev=mb1'/>
+  <qemu:arg value='-chardev'/>
+  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
+  <qemu:arg value='-device'/>
+  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
+</qemu:commandline>
 
 
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
@ 2018-11-13  9:34   ` Cornelia Huck
  0 siblings, 0 replies; 70+ messages in thread
From: Cornelia Huck @ 2018-11-13  9:34 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, shamir.rabinovitch

On Tue, 13 Nov 2018 09:13:08 +0200
Yuval Shaia <yuval.shaia@oracle.com> wrote:

> Notifier will be used for signaling shutdown event to inform system is
> shutdown. This will allow devices and other component to run some
> cleanup code needed before VM is shutdown.
> 
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  include/sysemu/sysemu.h |  1 +
>  vl.c                    | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
@ 2018-11-17 11:42   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 11:42 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:12 AM, Yuval Shaia wrote:
> Device is not supporting QP0, only QP1.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index 86e8fe8ab6..3ccc9a2494 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
>   
>   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
>   {
> -    return qp->ibqp ? qp->ibqp->qp_num : 0;
> +    return qp->ibqp ? qp->ibqp->qp_num : 1;
>   }
>   
>   static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-17 12:06   ` Marcel Apfelbaum
  2018-11-18  9:33     ` Yuval Shaia
  0 siblings, 1 reply; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:06 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck

Hi Yuval,

On 11/13/18 9:12 AM, Yuval Shaia wrote:
> MAD (Management Datagram) packets are widely used by various modules

Please add a link to the spec; I sent it in the v1 mail thread.
Please also add it as a comment in the code. I know MADs
are a complicated matter, but if somebody wants to have a look...

> both in kernel and in user space for example the rdma_* API which is
> used to create and maintain "connection" layer on top of RDMA uses
> several types of MAD packets.
> To support MAD packets the device uses an external utility
> (contrib/rdmacm-mux) to relay packets from and to the guest driver.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
>   hw/rdma/rdma_backend.h      |   4 +-
>   hw/rdma/rdma_backend_defs.h |  10 +-
>   hw/rdma/vmw/pvrdma.h        |   2 +
>   hw/rdma/vmw/pvrdma_main.c   |   4 +-
>   5 files changed, 273 insertions(+), 10 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 1e148398a2..3eb0099f8d 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c

rdma_backend is getting huge; have you considered splitting
out the MAD-related code?

> @@ -16,8 +16,13 @@
>   #include "qemu/osdep.h"
>   #include "qemu/error-report.h"
>   #include "qapi/error.h"
> +#include "qapi/qmp/qlist.h"
> +#include "qapi/qmp/qnum.h"
>   
>   #include <infiniband/verbs.h>
> +#include <infiniband/umad_types.h>
> +#include <infiniband/umad.h>
> +#include <rdma/rdma_user_cm.h>
>   
>   #include "trace.h"
>   #include "rdma_utils.h"
> @@ -33,16 +38,25 @@
>   #define VENDOR_ERR_MAD_SEND         0x206
>   #define VENDOR_ERR_INVLKEY          0x207
>   #define VENDOR_ERR_MR_SMALL         0x208
> +#define VENDOR_ERR_INV_MAD_BUFF     0x209
> +#define VENDOR_ERR_INV_NUM_SGE      0x210
>   
>   #define THR_NAME_LEN 16
>   #define THR_POLL_TO  5000
>   
> +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
> +
>   typedef struct BackendCtx {
> -    uint64_t req_id;
>       void *up_ctx;
>       bool is_tx_req;
> +    struct ibv_sge sge; /* Used to save MAD recv buffer */
>   } BackendCtx;
>   
> +struct backend_umad {
> +    struct ib_user_mad hdr;
> +    char mad[RDMA_MAX_PRIVATE_DATA];
> +};
> +
>   static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
>   
>   static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> @@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
>       return 0;
>   }
>   
> +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> +                    uint32_t num_sge)
> +{
> +    struct backend_umad umad = {0};
> +    char *hdr, *msg;
> +    int ret;
> +
> +    pr_dbg("num_sge=%d\n", num_sge);
> +
> +    if (num_sge != 2) {
> +        return -EINVAL;
> +    }
> +
> +    umad.hdr.length = sge[0].length + sge[1].length;
> +    pr_dbg("msg_len=%d\n", umad.hdr.length);
> +
> +    if (umad.hdr.length > sizeof(umad.mad)) {
> +        return -ENOMEM;
> +    }
> +
> +    umad.hdr.addr.qpn = htobe32(1);
> +    umad.hdr.addr.grh_present = 1;
> +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> +    umad.hdr.addr.hop_limit = 1;
> +
> +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> +

If rdma_pci_dma_map fails it will return NULL ....

> +    memcpy(&umad.mad[0], hdr, sge[0].length);
> +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> +

... and here we access a NULL pointer.
Maybe it is possible to return some error here.
> +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> +
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> +                            sizeof(umad));
> +
> +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> +
> +    return (ret != sizeof(umad));
> +}
> +
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> @@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           } else if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            rc = mad_send(backend_dev, sge, num_sge);
> +            if (rc) {
> +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            } else {
> +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> +            }
>           }
> -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
>           return;
>       }
>   
> @@ -370,6 +431,48 @@ out_free_bctx:
>       g_free(bctx);
>   }
>   
> +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
> +                                         struct ibv_sge *sge, uint32_t num_sge,
> +                                         void *ctx)
> +{
> +    BackendCtx *bctx;
> +    int rc;
> +    uint32_t bctx_id;
> +
> +    if (num_sge != 1) {
> +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
> +        return VENDOR_ERR_INV_NUM_SGE;
> +    }
> +
> +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
> +        pr_dbg("Too small buffer for MAD\n");
> +        return VENDOR_ERR_INV_MAD_BUFF;
> +    }
> +
> +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
> +    pr_dbg("length=%d\n", sge[0].length);
> +    pr_dbg("lkey=%d\n", sge[0].lkey);
> +
> +    bctx = g_malloc0(sizeof(*bctx));
> +
> +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
> +    if (unlikely(rc)) {
> +        g_free(bctx);
> +        pr_dbg("Fail to allocate cqe_ctx\n");
> +        return VENDOR_ERR_NOMEM;
> +    }
> +
> +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
> +    bctx->up_ctx = ctx;
> +    bctx->sge = *sge;
> +
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +
> +    return 0;
> +}
> +
>   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>                               RdmaDeviceResources *rdma_dev_res,
>                               RdmaBackendQP *qp, uint8_t qp_type,
> @@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>           }
>           if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
> +            if (rc) {
> +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> +            }
>           }
>           return;
>       }
> @@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
>   
>       switch (qp_type) {
>       case IBV_QPT_GSI:
> -        pr_dbg("QP1 unsupported\n");
>           return 0;
>   
>       case IBV_QPT_RC:
> @@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
>       return 0;
>   }
>   
> +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> +                                 union ibv_gid *my_gid, int paylen)
> +{
> +    grh->paylen = htons(paylen);
> +    grh->sgid = *sgid;
> +    grh->dgid = *my_gid;
> +
> +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> +}
> +
> +static inline int mad_can_receieve(void *opaque)
> +{
> +    return sizeof(struct backend_umad);
> +}
> +
> +static void mad_read(void *opaque, const uint8_t *buf, int size)
> +{
> +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> +    QObject *o_ctx_id;
> +    unsigned long cqe_ctx_id;
> +    BackendCtx *bctx;
> +    char *mad;
> +    struct backend_umad *umad;
> +
> +    assert(size != sizeof(umad));
> +    umad = (struct backend_umad *)buf;
> +
> +    pr_dbg("Got %d bytes\n", size);
> +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> +
> +#ifdef PVRDMA_DEBUG
> +    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
> +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
> +           hdr->method, hdr->status, be64toh(hdr->tid),
> +           hdr->attr_id, hdr->attr_mod);
> +#endif
> +
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +    if (!o_ctx_id) {
> +        pr_dbg("No more free MADs buffers, waiting for a while\n");
> +        sleep(THR_POLL_TO);
> +        return;
> +    }
> +
> +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +    if (unlikely(!bctx)) {
> +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
> +        return;
> +    }
> +
> +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
> +
> +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> +                           bctx->sge.length);
> +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> +                     bctx->up_ctx);
> +    } else {
> +        memset(mad, 0, bctx->sge.length);
> +        build_mad_hdr((struct ibv_grh *)mad,
> +                      (union ibv_gid *)&umad->hdr.addr.gid,
> +                      &backend_dev->gid, umad->hdr.length);
> +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> +
> +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> +    }
> +
> +    g_free(bctx);
> +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +}
> +
> +static int mad_init(RdmaBackendDev *backend_dev)
> +{
> +    struct backend_umad umad = {0};
> +    int ret;
> +
> +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> +        pr_dbg("Missing chardev for MAD multiplexer\n");
> +        return -EIO;
> +    }
> +
> +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> +                             mad_read, NULL, NULL, backend_dev, NULL, true);
> +
> +    /* Register ourself */
> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> +                            sizeof(umad.hdr));
> +    if (ret != sizeof(umad.hdr)) {
> +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret)
> +    }
> +
> +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> +    backend_dev->recv_mads_list.list = qlist_new();
> +

What happens if the device fails to register
to rdma_umadmux other than a debug message?
Can the device continue to work?


> +    return 0;
> +}
> +
> +static void mad_stop(RdmaBackendDev *backend_dev)
> +{
> +    QObject *o_ctx_id;
> +    unsigned long cqe_ctx_id;
> +    BackendCtx *bctx;
> +
> +    pr_dbg("Closing MAD\n");
> +
> +    /* Clear MAD buffers list */
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    do {
> +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> +        if (o_ctx_id) {
> +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +            if (bctx) {

Maybe it is worth adding a debug message if we have an orphan context.

> +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +                g_free(bctx);
> +            }
> +        }
> +    } while (o_ctx_id);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +}
> +
> +static void mad_fini(RdmaBackendDev *backend_dev)
> +{
> +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
> +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> +}
> +
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
>                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      Error **errp)
> +                      CharBackend *mad_chr_be, Error **errp)
>   {
>       int i;
>       int ret = 0;
> @@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       memset(backend_dev, 0, sizeof(*backend_dev));
>   
>       backend_dev->dev = pdev;
> -
> +    backend_dev->mad_chr_be = mad_chr_be;
>       backend_dev->backend_gid_idx = backend_gid_idx;
>       backend_dev->port_num = port_num;
>       backend_dev->rdma_dev_res = rdma_dev_res;
> @@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       pr_dbg("interface_id=0x%" PRIx64 "\n",
>              be64_to_cpu(backend_dev->gid.global.interface_id));
>   
> +    ret = mad_init(backend_dev);
> +    if (ret) {
> +        error_setg(errp, "Fail to initialize mad");
> +        ret = -EIO;
> +        goto out_destroy_comm_channel;
> +    }
> +
>       backend_dev->comp_thread.run = false;
>       backend_dev->comp_thread.is_running = false;
>   
> @@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
>   {
>       pr_dbg("Stopping rdma_backend\n");
>       stop_backend_thread(&backend_dev->comp_thread);
> +    mad_stop(backend_dev);
>   }
>   
>   void rdma_backend_fini(RdmaBackendDev *backend_dev)
>   {
>       rdma_backend_stop(backend_dev);
> +    mad_fini(backend_dev);
>       g_hash_table_destroy(ah_hash);
>       ibv_destroy_comp_channel(backend_dev->channel);
>       ibv_close_device(backend_dev->context);
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index 3ccc9a2494..fc83330251 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -17,6 +17,8 @@
>   #define RDMA_BACKEND_H
>   
>   #include "qapi/error.h"
> +#include "chardev/char-fe.h"
> +
>   #include "rdma_rm_defs.h"
>   #include "rdma_backend_defs.h"
>   
> @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
>                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      Error **errp);
> +                      CharBackend *mad_chr_be, Error **errp);
>   void rdma_backend_fini(RdmaBackendDev *backend_dev);
>   void rdma_backend_start(RdmaBackendDev *backend_dev);
>   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> index 7404f64002..2a7e667075 100644
> --- a/hw/rdma/rdma_backend_defs.h
> +++ b/hw/rdma/rdma_backend_defs.h
> @@ -16,8 +16,9 @@
>   #ifndef RDMA_BACKEND_DEFS_H
>   #define RDMA_BACKEND_DEFS_H
>   
> -#include <infiniband/verbs.h>
>   #include "qemu/thread.h"
> +#include "chardev/char-fe.h"
> +#include <infiniband/verbs.h>
>   
>   typedef struct RdmaDeviceResources RdmaDeviceResources;
>   
> @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
>       bool is_running; /* Set by the thread to report its status */
>   } RdmaBackendThread;
>   
> +typedef struct RecvMadList {
> +    QemuMutex lock;
> +    QList *list;
> +} RecvMadList;
> +
>   typedef struct RdmaBackendDev {
>       struct ibv_device_attr dev_attr;
>       RdmaBackendThread comp_thread;
> @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
>       struct ibv_comp_channel *channel;
>       uint8_t port_num;
>       uint8_t backend_gid_idx;
> +    RecvMadList recv_mads_list;
> +    CharBackend *mad_chr_be;
>   } RdmaBackendDev;
>   
>   typedef struct RdmaBackendPD {
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index e2d9f93cdf..e3742d893a 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -19,6 +19,7 @@
>   #include "qemu/units.h"
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msix.h"
> +#include "chardev/char-fe.h"
>   
>   #include "../rdma_backend_defs.h"
>   #include "../rdma_rm_defs.h"
> @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
>       uint8_t backend_port_num;
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
> +    CharBackend mad_chr;
>   } PVRDMADev;
>   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index ca5fa8d981..6c8c0154fa 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
>       DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
>                         dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
>       DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
> +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
>       DEFINE_PROP_END_OF_LIST(),
>   };
>   
> @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>   
>       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
>                              dev->backend_device_name, dev->backend_port_num,
> -                           dev->backend_gid_idx, &dev->dev_attr, errp);
> +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> +                           errp);
>       if (rc) {
>           goto out;
>       }


Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion
  2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
@ 2018-11-17 12:07   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:07 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:12 AM, Yuval Shaia wrote:
> The opcode for a WC should be set by the device, not taken from the work
> element.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index 7b0f440fda..3388be1926 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
>           comp_ctx->cq_handle = qp->send_cq_handle;
>           comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
>           comp_ctx->cqe.qp = qp_handle;
> -        comp_ctx->cqe.opcode = wqe->hdr.opcode;
> +        comp_ctx->cqe.opcode = IBV_WC_SEND;
>   
>           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
>                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
@ 2018-11-17 12:10   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:10 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> node_guid should be set once the device is loaded.
> Derive node_guid, in GUID (EUI-64, 64 bit) format, from the MAC address
> of the vmxnet3 device at PCI function 0.
>
> A new function was added to do the conversion.
> For example, the MAC 56:b6:44:e9:62:dc will be converted to the GUID
> 54b6:44ff:fee9:62dc.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_utils.h      |  9 +++++++++
>   hw/rdma/vmw/pvrdma_cmd.c  | 10 ----------
>   hw/rdma/vmw/pvrdma_main.c |  5 ++++-
>   3 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
> index 989db249ef..202abb3366 100644
> --- a/hw/rdma/rdma_utils.h
> +++ b/hw/rdma/rdma_utils.h
> @@ -63,4 +63,13 @@ extern unsigned long pr_dbg_cnt;
>   void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
>   void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
>   
> +static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
> +{
> +    memcpy(eui, addr, 3);
> +    eui[3] = 0xFF;
> +    eui[4] = 0xFE;
> +    memcpy(eui + 5, addr + 3, 3);
> +    eui[0] ^= 2;
> +}
> +
>   #endif
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index a334f6205e..2979582fac 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -592,16 +592,6 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>           return -EINVAL;
>       }
>   
> -    /* TODO: Since drivers stores node_guid at load_dsr phase then this
> -     * assignment is not relevant, i need to figure out a way how to
> -     * retrieve MAC of our netdev */
> -    if (!cmd->index) {
> -        dev->node_guid =
> -            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
> -        pr_dbg("dev->node_guid=0x%llx\n",
> -               (long long unsigned int)be64_to_cpu(dev->node_guid));
> -    }
> -
>       return 0;
>   }
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index fa6468d221..95e9322b7c 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -264,7 +264,7 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
>       dsr->caps.sys_image_guid = 0;
>       pr_dbg("sys_image_guid=%" PRIx64 "\n", dsr->caps.sys_image_guid);
>   
> -    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
> +    dsr->caps.node_guid = dev->node_guid;
>       pr_dbg("node_guid=%" PRIx64 "\n", be64_to_cpu(dsr->caps.node_guid));
>   
>       dsr->caps.phys_port_cnt = MAX_PORTS;
> @@ -579,6 +579,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>       /* Break if not vmxnet3 device in slot 0 */
>       dev->func0 = VMXNET3(pci_get_function_0(pdev));
>   
> +    addrconf_addr_eui48((unsigned char *)&dev->node_guid,
> +                        (const char *)&dev->func0->conf.macaddr.a);
> +
>       memdev_root = object_resolve_path("/objects", NULL);
>       if (memdev_root) {
>           object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
@ 2018-11-17 12:11   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:11 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> The user should be able to control the device by changing the Ethernet
> function's state, so if the user runs 'ifconfig ens3 down' the PVRDMA
> function should go down as well.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_cmd.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index 2979582fac..0d3c818c20 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -139,7 +139,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
>       resp->hdr.err = 0;
>   
> -    resp->attrs.state = attrs.state;
> +    resp->attrs.state = dev->func0->device_active ? attrs.state :
> +                                                    PVRDMA_PORT_DOWN;
>       resp->attrs.max_mtu = attrs.max_mtu;
>       resp->attrs.active_mtu = attrs.active_mtu;
>       resp->attrs.phys_state = attrs.phys_state;

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
@ 2018-11-17 12:19   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:19 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> Add the ability to pass specific WC attributes, such as the GRH_BIT flag,
> to the CQE.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c      | 59 +++++++++++++++++++++++--------------
>   hw/rdma/rdma_backend.h      |  4 +--
>   hw/rdma/vmw/pvrdma_qp_ops.c | 31 +++++++++++--------
>   3 files changed, 58 insertions(+), 36 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 5675504165..e453bda8f9 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -59,13 +59,24 @@ struct backend_umad {
>       char mad[RDMA_MAX_PRIVATE_DATA];
>   };
>   
> -static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
> +static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
>   
> -static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> +static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
>   {
>       pr_err("No completion handler is registered\n");
>   }
>   
> +static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
> +                                 void *ctx)
> +{
> +    struct ibv_wc wc = {0};
> +
> +    wc.status = status;
> +    wc.vendor_err = vendor_err;
> +
> +    comp_handler(ctx, &wc);
> +}
> +
>   static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
>   {
>       int i, ne;
> @@ -90,7 +101,7 @@ static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
>               }
>               pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
>   
> -            comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
> +            comp_handler(bctx->up_ctx, &wc[i]);
>   
>               rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
>               g_free(bctx);
> @@ -184,8 +195,8 @@ static void start_comp_thread(RdmaBackendDev *backend_dev)
>                          comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
>   }
>   
> -void rdma_backend_register_comp_handler(void (*handler)(int status,
> -                                        unsigned int vendor_err, void *ctx))
> +void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
> +                                                         struct ibv_wc *wc))
>   {
>       comp_handler = handler;
>   }
> @@ -369,14 +380,14 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
>           if (qp_type == IBV_QPT_SMI) {
>               pr_dbg("QP0 unsupported\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> +            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           } else if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
>               rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
>               if (rc) {
> -                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +                complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
>               } else {
> -                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> +                complete_work(IBV_WC_SUCCESS, 0, ctx);
>               }
>           }
>           return;
> @@ -385,7 +396,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       pr_dbg("num_sge=%d\n", num_sge);
>       if (!num_sge) {
>           pr_dbg("num_sge=0\n");
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
>           return;
>       }
>   
> @@ -396,21 +407,21 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
>       if (unlikely(rc)) {
>           pr_dbg("Failed to allocate cqe_ctx\n");
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
>           goto out_free_bctx;
>       }
>   
>       rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, num_sge);
>       if (rc) {
>           pr_dbg("Error: Failed to build host SGE array\n");
> -        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
>           goto out_dealloc_cqe_ctx;
>       }
>   
>       if (qp_type == IBV_QPT_UD) {
>           wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
>           if (!wr.wr.ud.ah) {
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
> +            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
>               goto out_dealloc_cqe_ctx;
>           }
>           wr.wr.ud.remote_qpn = dqpn;
> @@ -428,7 +439,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       if (rc) {
>           pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
>                   qp->ibqp->qp_num);
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
>           goto out_dealloc_cqe_ctx;
>       }
>   
> @@ -497,13 +508,13 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>       if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
>           if (qp_type == IBV_QPT_SMI) {
>               pr_dbg("QP0 unsupported\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> +            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           }
>           if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
>               rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
>               if (rc) {
> -                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> +                complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
>               }
>           }
>           return;
> @@ -512,7 +523,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>       pr_dbg("num_sge=%d\n", num_sge);
>       if (!num_sge) {
>           pr_dbg("num_sge=0\n");
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
>           return;
>       }
>   
> @@ -523,14 +534,14 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>       rc = rdma_rm_alloc_cqe_ctx(rdma_dev_res, &bctx_id, bctx);
>       if (unlikely(rc)) {
>           pr_dbg("Failed to allocate cqe_ctx\n");
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
>           goto out_free_bctx;
>       }
>   
>       rc = build_host_sge_array(rdma_dev_res, new_sge, sge, num_sge);
>       if (rc) {
>           pr_dbg("Error: Failed to build host SGE array\n");
> -        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
>           goto out_dealloc_cqe_ctx;
>       }
>   
> @@ -542,7 +553,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>       if (rc) {
>           pr_dbg("Fail (%d, %d) to post recv WQE to qpn %d\n", rc, errno,
>                   qp->ibqp->qp_num);
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
>           goto out_dealloc_cqe_ctx;
>       }
>   
> @@ -926,9 +937,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>       mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
>                              bctx->sge.length);
>       if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
> -        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> -                     bctx->up_ctx);
> +        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> +                      bctx->up_ctx);
>       } else {
> +        struct ibv_wc wc = {0};
>           pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
>           memset(mad, 0, bctx->sge.length);
>           build_mad_hdr((struct ibv_grh *)mad,
> @@ -937,7 +949,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>           memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
>           rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
>   
> -        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> +        wc.byte_len = msg->umad_len;
> +        wc.status = IBV_WC_SUCCESS;
> +        wc.wc_flags = IBV_WC_GRH;
> +        comp_handler(bctx->up_ctx, &wc);
>       }
>   
>       g_free(bctx);
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index 59ad2b874b..8cae40f827 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -57,8 +57,8 @@ int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
>                                  union ibv_gid *gid);
>   void rdma_backend_start(RdmaBackendDev *backend_dev);
>   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> -void rdma_backend_register_comp_handler(void (*handler)(int status,
> -                                        unsigned int vendor_err, void *ctx));
> +void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
> +                                                        struct ibv_wc *wc));
>   void rdma_backend_unregister_comp_handler(void);
>   
>   int rdma_backend_query_port(RdmaBackendDev *backend_dev,
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index 2130824098..300471a4c9 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -47,7 +47,7 @@ typedef struct PvrdmaRqWqe {
>    * 3. Interrupt host
>    */
>   static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
> -                           struct pvrdma_cqe *cqe)
> +                           struct pvrdma_cqe *cqe, struct ibv_wc *wc)
>   {
>       struct pvrdma_cqe *cqe1;
>       struct pvrdma_cqne *cqne;
> @@ -66,6 +66,7 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
>       pr_dbg("Writing CQE\n");
>       cqe1 = pvrdma_ring_next_elem_write(ring);
>       if (unlikely(!cqe1)) {
> +        pr_dbg("No CQEs in ring\n");
>           return -EINVAL;
>       }
>   
> @@ -73,8 +74,20 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
>       cqe1->wr_id = cqe->wr_id;
>       cqe1->qp = cqe->qp;
>       cqe1->opcode = cqe->opcode;
> -    cqe1->status = cqe->status;
> -    cqe1->vendor_err = cqe->vendor_err;
> +    cqe1->status = wc->status;
> +    cqe1->byte_len = wc->byte_len;
> +    cqe1->src_qp = wc->src_qp;
> +    cqe1->wc_flags = wc->wc_flags;
> +    cqe1->vendor_err = wc->vendor_err;
> +
> +    pr_dbg("wr_id=%" PRIx64 "\n", cqe1->wr_id);
> +    pr_dbg("qp=0x%lx\n", cqe1->qp);
> +    pr_dbg("opcode=%d\n", cqe1->opcode);
> +    pr_dbg("status=%d\n", cqe1->status);
> +    pr_dbg("byte_len=%d\n", cqe1->byte_len);
> +    pr_dbg("src_qp=%d\n", cqe1->src_qp);
> +    pr_dbg("wc_flags=%d\n", cqe1->wc_flags);
> +    pr_dbg("vendor_err=%d\n", cqe1->vendor_err);
>   
>       pvrdma_ring_write_inc(ring);
>   
> @@ -99,18 +112,12 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
>       return 0;
>   }
>   
> -static void pvrdma_qp_ops_comp_handler(int status, unsigned int vendor_err,
> -                                       void *ctx)
> +static void pvrdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
>   {
>       CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
>   
> -    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
> -    pr_dbg("wr_id=%" PRIx64 "\n", comp_ctx->cqe.wr_id);
> -    pr_dbg("status=%d\n", status);
> -    pr_dbg("vendor_err=0x%x\n", vendor_err);
> -    comp_ctx->cqe.status = status;
> -    comp_ctx->cqe.vendor_err = vendor_err;
> -    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
> +    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe, wc);
> +
>       g_free(ctx);
>   }
>   

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
@ 2018-11-17 12:22   ` Marcel Apfelbaum
  2018-11-18  8:24     ` Yuval Shaia
  0 siblings, 1 reply; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:22 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> The driver checks the error code, so let's set it.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
>   1 file changed, 48 insertions(+), 19 deletions(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index 0d3c818c20..a326c5d470 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       if (rdma_backend_query_port(&dev->backend_dev,
>                                   (struct ibv_port_attr *)&attrs)) {
> -        return -ENOMEM;
> +        resp->hdr.err = -ENOMEM;
> +        goto out;
>       }
>   
>       memset(resp, 0, sizeof(*resp));
> @@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->attrs.active_width = 1;
>       resp->attrs.active_speed = 1;
>   
> -    return 0;
> +out:
> +    pr_dbg("ret=%d\n", resp->hdr.err);
> +    return resp->hdr.err;
>   }
>   
>   static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->pkey = PVRDMA_PKEY;
>       pr_dbg("pkey=0x%x\n", resp->pkey);
>   
> -    return 0;
> +    return resp->hdr.err;
>   }
>   
>   static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;

Is it possible to ensure err is 0 by default during hdr creation
instead of manually setting it every time?

Thanks,
Marcel

> +
> +    return rsp->hdr.err;
>   }
>   
>   static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +    return rsp->hdr.err;
>   }
>   
>   static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
> @@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
>       if (!cq) {
>           pr_dbg("Invalid CQ handle\n");
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       ring = (PvrdmaRing *)cq->opaque;
> @@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
> @@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
>       if (!qp) {
>           pr_dbg("Invalid QP handle\n");
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
> @@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
>       g_free(ring);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       pr_dbg("index=%d\n", cmd->index);
>   
>       if (cmd->index >= MAX_PORT_GIDS) {
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> @@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
>                            dev->backend_eth_device_name, gid, cmd->index);
>       if (rc < 0) {
> -        return -EINVAL;
> +        rsp->hdr.err = rc;
> +        goto out;
>       }
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       pr_dbg("index=%d\n", cmd->index);
>   
>       if (cmd->index >= MAX_PORT_GIDS) {
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> @@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>           goto out;
>       }
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
>                                        &resp->ctx_handle);
>   
> -    pr_dbg("ret=%d\n", resp->hdr.err);
> -
> -    return 0;
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +    return rsp->hdr.err;
>   }
>   struct cmd_handler {
>       uint32_t cmd;
> @@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
>       }
>   
>       err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
> -                            dsr_info->rsp);
> +                                                    dsr_info->rsp);
>   out:
>       set_reg_val(dev, PVRDMA_REG_ERR, err);
>       post_interrupt(dev, INTR_VEC_CMD_RING);

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
@ 2018-11-17 12:23   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:23 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> The device supports only one port; let's remove the dead code that
> handles more than one port.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_rm.c      | 34 ++++++++++++++++------------------
>   hw/rdma/rdma_rm.h      |  2 +-
>   hw/rdma/rdma_rm_defs.h |  4 ++--
>   3 files changed, 19 insertions(+), 21 deletions(-)
>
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index fe0979415d..0a5ab8935a 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -545,7 +545,7 @@ int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>           return -EINVAL;
>       }
>   
> -    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
> +    memcpy(&dev_res->port.gid_tbl[gid_idx].gid, gid, sizeof(*gid));
>   
>       return 0;
>   }
> @@ -556,15 +556,15 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>       int rc;
>   
>       rc = rdma_backend_del_gid(backend_dev, ifname,
> -                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
> +                              &dev_res->port.gid_tbl[gid_idx].gid);
>       if (rc < 0) {
>           pr_dbg("Fail to delete gid\n");
>           return -EINVAL;
>       }
>   
> -    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
> -           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
> -    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
> +    memset(dev_res->port.gid_tbl[gid_idx].gid.raw, 0,
> +           sizeof(dev_res->port.gid_tbl[gid_idx].gid));
> +    dev_res->port.gid_tbl[gid_idx].backend_gid_index = -1;
>   
>       return 0;
>   }
> @@ -577,16 +577,16 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
>           return -EINVAL;
>       }
>   
> -    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
> -        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
> +    if (unlikely(dev_res->port.gid_tbl[sgid_idx].backend_gid_index == -1)) {
> +        dev_res->port.gid_tbl[sgid_idx].backend_gid_index =
>           rdma_backend_get_gid_index(backend_dev,
> -                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
> +                                   &dev_res->port.gid_tbl[sgid_idx].gid);
>       }
>   
>       pr_dbg("backend_gid_index=%d\n",
> -           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
> +           dev_res->port.gid_tbl[sgid_idx].backend_gid_index);
>   
> -    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
> +    return dev_res->port.gid_tbl[sgid_idx].backend_gid_index;
>   }
>   
>   static void destroy_qp_hash_key(gpointer data)
> @@ -596,15 +596,13 @@ static void destroy_qp_hash_key(gpointer data)
>   
>   static void init_ports(RdmaDeviceResources *dev_res)
>   {
> -    int i, j;
> +    int i;
>   
> -    memset(dev_res->ports, 0, sizeof(dev_res->ports));
> +    memset(&dev_res->port, 0, sizeof(dev_res->port));
>   
> -    for (i = 0; i < MAX_PORTS; i++) {
> -        dev_res->ports[i].state = IBV_PORT_DOWN;
> -        for (j = 0; j < MAX_PORT_GIDS; j++) {
> -            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
> -        }
> +    dev_res->port.state = IBV_PORT_DOWN;
> +    for (i = 0; i < MAX_PORT_GIDS; i++) {
> +        dev_res->port.gid_tbl[i].backend_gid_index = -1;
>       }
>   }
>   
> @@ -613,7 +611,7 @@ static void fini_ports(RdmaDeviceResources *dev_res,
>   {
>       int i;
>   
> -    dev_res->ports[0].state = IBV_PORT_DOWN;
> +    dev_res->port.state = IBV_PORT_DOWN;
>       for (i = 0; i < MAX_PORT_GIDS; i++) {
>           rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
>       }
> diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
> index a7169b4e89..3c602c04c0 100644
> --- a/hw/rdma/rdma_rm.h
> +++ b/hw/rdma/rdma_rm.h
> @@ -79,7 +79,7 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
>   static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
>                                                int sgid_idx)
>   {
> -    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
> +    return &dev_res->port.gid_tbl[sgid_idx].gid;
>   }
>   
>   #endif
> diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
> index 7b3435f991..0ba61d1838 100644
> --- a/hw/rdma/rdma_rm_defs.h
> +++ b/hw/rdma/rdma_rm_defs.h
> @@ -18,7 +18,7 @@
>   
>   #include "rdma_backend_defs.h"
>   
> -#define MAX_PORTS             1
> +#define MAX_PORTS             1 /* Do not change - we support only one port */
>   #define MAX_PORT_GIDS         255
>   #define MAX_GIDS              MAX_PORT_GIDS
>   #define MAX_PORT_PKEYS        1
> @@ -97,7 +97,7 @@ typedef struct RdmaRmPort {
>   } RdmaRmPort;
>   
>   typedef struct RdmaDeviceResources {
> -    RdmaRmPort ports[MAX_PORTS];
> +    RdmaRmPort port;
>       RdmaRmResTbl pd_tbl;
>       RdmaRmResTbl mr_tbl;
>       RdmaRmResTbl uc_tbl;

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
@ 2018-11-17 12:24   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:24 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> In order to clean up some external resources such as GIDs, QPs etc.,
> register to receive a notification when the VM is shut down.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma.h      |  2 ++
>   hw/rdma/vmw/pvrdma_main.c | 12 ++++++++++++
>   2 files changed, 14 insertions(+)
>
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index 10a3c4fb7c..ffae36986e 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -17,6 +17,7 @@
>   #define PVRDMA_PVRDMA_H
>   
>   #include "qemu/units.h"
> +#include "qemu/notify.h"
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msix.h"
>   #include "chardev/char-fe.h"
> @@ -87,6 +88,7 @@ typedef struct PVRDMADev {
>       RdmaDeviceResources rdma_dev_res;
>       CharBackend mad_chr;
>       VMXNET3State *func0;
> +    Notifier shutdown_notifier;
>   } PVRDMADev;
>   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index 95e9322b7c..45a59cddf9 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -24,6 +24,7 @@
>   #include "hw/qdev-properties.h"
>   #include "cpu.h"
>   #include "trace.h"
> +#include "sysemu/sysemu.h"
>   
>   #include "../rdma_rm.h"
>   #include "../rdma_backend.h"
> @@ -559,6 +560,14 @@ static int pvrdma_check_ram_shared(Object *obj, void *opaque)
>       return 0;
>   }
>   
> +static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
> +{
> +    PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +
> +    pvrdma_fini(pci_dev);
> +}
> +
>   static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>   {
>       int rc;
> @@ -623,6 +632,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>           goto out;
>       }
>   
> +    dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
> +    qemu_register_shutdown_notifier(&dev->shutdown_notifier);
> +
>   out:
>       if (rc) {
>           error_append_hint(errp, "Device fail to load\n");

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
@ 2018-11-17 12:25   ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:25 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> bitmap_zero_extend is designed for extending a bitmap, not for
> shrinking it.
> Use g_free instead.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_rm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index 0a5ab8935a..35a96d9a64 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -43,7 +43,7 @@ static inline void res_tbl_free(RdmaRmResTbl *tbl)
>   {
>       qemu_mutex_destroy(&tbl->lock);
>       g_free(tbl->tbl);
> -    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
> +    g_free(tbl->bitmap);
>   }
>   
>   static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
@ 2018-11-17 12:25   ` Marcel Apfelbaum
  2018-11-18  9:42     ` Yuval Shaia
  0 siblings, 1 reply; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:25 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> When the device goes down, the function fini_ports loops over all
> entries in the GID table regardless of whether an entry is valid. When
> an entry is not valid we'd like to skip any further processing in the
> backend device.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_rm.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index 35a96d9a64..e3f6b2f6ea 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -555,6 +555,10 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>   {
>       int rc;
>   
> +    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
> +        return 0;
> +    }
> +
>       rc = rdma_backend_del_gid(backend_dev, ifname,
>                                 &dev_res->port.gid_tbl[gid_idx].gid);
>       if (rc < 0) {

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
@ 2018-11-17 12:34   ` Marcel Apfelbaum
  2018-11-18  7:27     ` Yuval Shaia
  0 siblings, 1 reply; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:34 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> The interface with the device has changed with the addition of support
> for MAD packets.
> Adjust the documentation accordingly.
>
> While there, fix a minor mistake which may lead one to think that there
> is a relation between using RXE on the host and compatibility with
> bare-metal peers.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 84 insertions(+), 19 deletions(-)
>
> diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
> index 5599318159..9e8d1674b7 100644
> --- a/docs/pvrdma.txt
> +++ b/docs/pvrdma.txt
> @@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need for any special guest
>   modifications.
>   
>   While it complies with the VMware device, it can also communicate with bare
> -metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> -can work with Soft-RoCE (rxe).
> +metal RDMA-enabled machines as peers.
> +
> +It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
>   
>   It does not require the whole guest RAM to be pinned allowing memory
>   over-commit and, even if not implemented yet, migration support will be
> @@ -78,29 +79,93 @@ the required RDMA libraries.
>   
>   3. Usage
>   ========
> +
> +
> +3.1 VM Memory settings
> +======================
>   Currently the device is working only with memory backed RAM
>   and it must be mark as "shared":
>      -m 1G \
>      -object memory-backend-ram,id=mb1,size=1G,share \
>      -numa node,memdev=mb1 \
>   
> -The pvrdma device is composed of two functions:
> - - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
> -   but is required to pass the ibdevice GID using its MAC.
> -   Examples:
> -     For an rxe backend using eth0 interface it will use its mac:
> -       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
> -     For an SRIOV VF, we take the Ethernet Interface exposed by it:
> -       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
> - - Function 1 is the actual device:
> -       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
> -   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
> - Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
> - The rules of conversion are part of the RoCE spec, but since manual conversion
> - is not required, spotting problems is not hard:
> -    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
> -             MAC: 7c:fe:90:cb:74:3a
> -    Note the difference between the first byte of the MAC and the GID.
> +
> +3.2 MAD Multiplexer
> +===================
> +MAD Multiplexer is a service that exposes MAD-like interface for VMs in
> +order to overcome the limitation where only single entity can register with
> +MAD layer to send and receive RDMA-CM MAD packets.
> +
> +To build rdmacm-mux run
> +# make rdmacm-mux
> +
> +The program accepts 3 command line arguments and exposes a UNIX socket to
> +be used to relay control and data messages to and from the service.
> +-s unix-socket-path   Path to unix socket to listen on
> +                      (default /var/run/rdmacm-mux)
> +-d rdma-device-name   Name of RDMA device to register with
> +                      (default rxe0)
> +-p rdma-device-port   Port number of RDMA device to register with
> +                      (default 1)
> +The final UNIX socket file name is a concatenation of the 3 arguments so
> +for example for device name mlx5_0 and port 2 the file
> +/var/run/rdmacm-mux-mlx5_0-2 will be created.
> +
> +Please refer to contrib/rdmacm-mux for more details.
> +
> +
> +3.3 PCI devices settings
> +========================
> +RoCE device exposes two functions - Ethernet and RDMA.
> +To support it, pvrdma device is composed of two PCI functions, an Ethernet
> +device of type vmxnet3 on PCI slot 0 and a pvrdma device on PCI slot 1. The
> +Ethernet function can be used for other Ethernet purposes such as IP.
> +
> +
> +3.4 Device parameters
> +=====================
> +- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
> +  would be the Ethernet device used to create it. For any other physical
> +  RoCE device this would be the netdev name of the device.

I didn't understand, can you please elaborate? We need the ibdev,
that much is clear, but what is the "ethernet device on host"? How do
we get it, and how is it used?

Thanks,
Marcel

> +- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
> +- mad-chardev: The name of the MAD multiplexer char device.
> +- ibport: In case of multi-port device (such as Mellanox's HCA) this
> +  specify the port to use. If not set 1 will be used.
> +- dev-caps-max-mr-size: The maximum size of MR.
> +- dev-caps-max-qp: Maximum number of QPs.
> +- dev-caps-max-sge: Maximum number of SGE elements in WR.
> +- dev-caps-max-cq: Maximum number of CQs.
> +- dev-caps-max-mr: Maximum number of MRs.
> +- dev-caps-max-pd: Maximum number of PDs.
> +- dev-caps-max-ah: Maximum number of AHs.
> +
> +Notes:
> +- The first 3 parameters are mandatory settings, the rest have their
> +  defaults.
> +- The last 8 parameters (the ones that prefixed by dev-caps) defines the top
> +  limits but the final values are adjusted by the backend device limitations.
> +
> +3.5 Example
> +===========
> +Define bridge device with vmxnet3 network backend:
> +<interface type='bridge'>
> +  <mac address='56:b4:44:e9:62:dc'/>
> +  <source bridge='bridge1'/>
> +  <model type='vmxnet3'/>
> +  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
> +</interface>
> +
> +Define pvrdma device:
> +<qemu:commandline>
> +  <qemu:arg value='-object'/>
> +  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
> +  <qemu:arg value='-numa'/>
> +  <qemu:arg value='node,memdev=mb1'/>
> +  <qemu:arg value='-chardev'/>
> +  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
> +  <qemu:arg value='-device'/>
> +  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
> +</qemu:commandline>
>   
>   
>   

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
@ 2018-11-17 12:48   ` Marcel Apfelbaum
  2018-11-18  8:13     ` Yuval Shaia
  0 siblings, 1 reply; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 12:48 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/13/18 9:13 AM, Yuval Shaia wrote:
> The control over the RDMA device's GID table is done by updating the
> device's Ethernet function addresses.
> Usually the first GID entry is determine by the MAC address, the second

s/determine/determined

> by the first IPv6 address and the third by the IPv4 address. Other
> entries can be added by adding more IP addresses. The opposite is the
> same, i.e. whenever an address is removed, the corresponding GID entry
> is removed.
>
> The process is done by the network and RDMA stacks. Whenever an address
> is added the ib_core driver is notified and calls the device driver
> add_gid function which in turn update the device.
>
> To support this in pvrdma device we need to hook into the create_bind
> and destroy_bind HW commands triggered by pvrdma driver in guest.
> Whenever a changed is made to the pvrdma device's GID table a special
without 'a'

> QMP messages is sent to be processed by libvirt to update the address of
> the backend Ethernet device.

So the device can't be used without libvirt? How can we
use it with QEMU alone?

>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c      | 243 +++++++++++++++++++++++-------------

rdma_backend.c is getting larger...

>   hw/rdma/rdma_backend.h      |  22 ++--
>   hw/rdma/rdma_backend_defs.h |   3 +-
>   hw/rdma/rdma_rm.c           | 104 ++++++++++++++-
>   hw/rdma/rdma_rm.h           |  17 ++-
>   hw/rdma/rdma_rm_defs.h      |   9 +-
>   hw/rdma/rdma_utils.h        |  15 +++
>   hw/rdma/vmw/pvrdma.h        |   2 +-
>   hw/rdma/vmw/pvrdma_cmd.c    |  55 ++++----
>   hw/rdma/vmw/pvrdma_main.c   |  25 +---
>   hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
>   11 files changed, 370 insertions(+), 145 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 3eb0099f8d..5675504165 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -18,12 +18,14 @@
>   #include "qapi/error.h"
>   #include "qapi/qmp/qlist.h"
>   #include "qapi/qmp/qnum.h"
> +#include "qapi/qapi-events-rdma.h"
>   
>   #include <infiniband/verbs.h>
>   #include <infiniband/umad_types.h>
>   #include <infiniband/umad.h>
>   #include <rdma/rdma_user_cm.h>
>   
> +#include "contrib/rdmacm-mux/rdmacm-mux.h"
>   #include "trace.h"
>   #include "rdma_utils.h"
>   #include "rdma_rm.h"
> @@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
>       return 0;
>   }
>   
> -static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> -                    uint32_t num_sge)
> +static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
> +                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
>   {
> -    struct backend_umad umad = {0};
> -    char *hdr, *msg;
> +    RdmaCmMuxMsg msg = {0};
> +    char *hdr, *data;
>       int ret;
>   
>       pr_dbg("num_sge=%d\n", num_sge);
> @@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
>           return -EINVAL;
>       }
>   
> -    umad.hdr.length = sge[0].length + sge[1].length;
> -    pr_dbg("msg_len=%d\n", umad.hdr.length);
> +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
> +    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
>   
> -    if (umad.hdr.length > sizeof(umad.mad)) {
> +    msg.umad_len = sge[0].length + sge[1].length;
> +    pr_dbg("umad_len=%d\n", msg.umad_len);
> +
> +    if (msg.umad_len > sizeof(msg.umad.mad)) {
>           return -ENOMEM;
>       }
>   
> -    umad.hdr.addr.qpn = htobe32(1);
> -    umad.hdr.addr.grh_present = 1;
> -    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> -    umad.hdr.addr.hop_limit = 1;
> +    msg.umad.hdr.addr.qpn = htobe32(1);
> +    msg.umad.hdr.addr.grh_present = 1;
> +    pr_dbg("sgid_idx=%d\n", sgid_idx);
> +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
> +    msg.umad.hdr.addr.gid_index = sgid_idx;
> +    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
> +    msg.umad.hdr.addr.hop_limit = 1;
>   
>       hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> -    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> +    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> +
> +    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
> +    pr_dbg_buf("mad_data", data, sge[1].length);
>   
> -    memcpy(&umad.mad[0], hdr, sge[0].length);
> -    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> +    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
> +    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
>   
> -    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> +    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
>       rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
>   
> -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> -                            sizeof(umad));
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
> +                            sizeof(msg));
>   
>       pr_dbg("qemu_chr_fe_write=%d\n", ret);
>   
> -    return (ret != sizeof(umad));
> +    return (ret != sizeof(msg));
>   }
>   
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> -                            union ibv_gid *dgid, uint32_t dqpn,
> -                            uint32_t dqkey, void *ctx)
> +                            uint8_t sgid_idx, union ibv_gid *sgid,
> +                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
> +                            void *ctx)
>   {
>       BackendCtx *bctx;
>       struct ibv_sge new_sge[MAX_SGE];
> @@ -361,7 +372,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           } else if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            rc = mad_send(backend_dev, sge, num_sge);
> +            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
>               if (rc) {
>                   comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
>               } else {
> @@ -397,8 +408,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       }
>   
>       if (qp_type == IBV_QPT_UD) {
> -        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
> -                                backend_dev->backend_gid_idx, dgid);
> +        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
>           if (!wr.wr.ud.ah) {
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
>               goto out_dealloc_cqe_ctx;
> @@ -703,9 +713,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>   }
>   
>   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> -                              uint8_t qp_type, union ibv_gid *dgid,
> -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> -                              bool use_qkey)
> +                              uint8_t qp_type, uint8_t sgid_idx,
> +                              union ibv_gid *dgid, uint32_t dqpn,
> +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
>   {
>       struct ibv_qp_attr attr = {0};
>       union ibv_gid ibv_gid = {
> @@ -717,13 +727,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>       attr.qp_state = IBV_QPS_RTR;
>       attr_mask = IBV_QP_STATE;
>   
> +    qp->sgid_idx = sgid_idx;
> +
>       switch (qp_type) {
>       case IBV_QPT_RC:
>           pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
>                  be64_to_cpu(ibv_gid.global.subnet_prefix),
>                  be64_to_cpu(ibv_gid.global.interface_id));
>           pr_dbg("dqpn=0x%x\n", dqpn);
> -        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
> +        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
>           pr_dbg("sport_num=%d\n", backend_dev->port_num);
>           pr_dbg("rq_psn=0x%x\n", rq_psn);
>   
> @@ -735,7 +747,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>           attr.ah_attr.is_global      = 1;
>           attr.ah_attr.grh.hop_limit  = 1;
>           attr.ah_attr.grh.dgid       = ibv_gid;
> -        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
> +        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
>           attr.rq_psn                 = rq_psn;
>   
>           attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
> @@ -744,8 +756,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>           break;
>   
>       case IBV_QPT_UD:
> +        pr_dbg("qkey=0x%x\n", qkey);
>           if (use_qkey) {
> -            pr_dbg("qkey=0x%x\n", qkey);
>               attr.qkey = qkey;
>               attr_mask |= IBV_QP_QKEY;
>           }
> @@ -861,13 +873,13 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
>       grh->dgid = *my_gid;
>   
>       pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> -    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> -    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> +    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
> +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
>   }
>   
>   static inline int mad_can_receieve(void *opaque)
>   {
> -    return sizeof(struct backend_umad);
> +    return sizeof(RdmaCmMuxMsg);
>   }
>   
>   static void mad_read(void *opaque, const uint8_t *buf, int size)
> @@ -877,13 +889,13 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>       unsigned long cqe_ctx_id;
>       BackendCtx *bctx;
>       char *mad;
> -    struct backend_umad *umad;
> +    RdmaCmMuxMsg *msg;
>   
> -    assert(size != sizeof(umad));
> -    umad = (struct backend_umad *)buf;
> +    assert(size != sizeof(msg));
> +    msg = (RdmaCmMuxMsg *)buf;
>   
>       pr_dbg("Got %d bytes\n", size);
> -    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> +    pr_dbg("umad_len=%d\n", msg->umad_len);
>   
>   #ifdef PVRDMA_DEBUG
>       struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> @@ -913,15 +925,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>   
>       mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
>                              bctx->sge.length);
> -    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> +    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
>           comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
>                        bctx->up_ctx);
>       } else {
> +        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
>           memset(mad, 0, bctx->sge.length);
>           build_mad_hdr((struct ibv_grh *)mad,
> -                      (union ibv_gid *)&umad->hdr.addr.gid,
> -                      &backend_dev->gid, umad->hdr.length);
> -        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> +                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
> +                      msg->umad_len);
> +        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
>           rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
>   
>           comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> @@ -933,10 +946,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>   
>   static int mad_init(RdmaBackendDev *backend_dev)
>   {
> -    struct backend_umad umad = {0};
>       int ret;
>   
> -    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> +    ret = qemu_chr_fe_backend_connected(backend_dev->mad_chr_be);
> +    if (!ret) {
>           pr_dbg("Missing chardev for MAD multiplexer\n");

This may be an error, not a debug message.

>           return -EIO;
>       }
> @@ -944,14 +957,6 @@ static int mad_init(RdmaBackendDev *backend_dev)
>       qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
>                                mad_read, NULL, NULL, backend_dev, NULL, true);
>   
> -    /* Register ourself */
> -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> -                            sizeof(umad.hdr));
> -    if (ret != sizeof(umad.hdr)) {
> -        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> -    }
> -
>       qemu_mutex_init(&backend_dev->recv_mads_list.lock);
>       backend_dev->recv_mads_list.list = qlist_new();
>   
> @@ -988,23 +993,120 @@ static void mad_fini(RdmaBackendDev *backend_dev)
>       qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
>   }
>   
> +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> +                               union ibv_gid *gid)
> +{
> +    union ibv_gid sgid;
> +    int ret;
> +    int i = 0;
> +
> +    pr_dbg("0x%llx, 0x%llx\n",
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> +
> +    do {
> +        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
> +                            &sgid);
> +        i++;
> +    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
> +
> +    pr_dbg("gid_index=%d\n", i - 1);
> +
> +    return ret ? ret : i - 1;
> +}
> +
> +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid)
> +{
> +    RdmaCmMuxMsg msg = {0};
> +    int ret;
> +
> +    pr_dbg("0x%llx, 0x%llx\n",
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> +
> +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
> +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
> +                            sizeof(msg));
> +    if (ret != sizeof(msg)) {
> +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
> +
> +    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
> +                            sizeof(msg));
> +    if (ret != sizeof(msg)) {
> +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
> +
> +    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
> +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", msg.hdr.err_code);
> +        return -EIO;
> +    }
> +
> +    qapi_event_send_rdma_gid_status_changed(ifname, true,
> +                                            gid->global.subnet_prefix,
> +                                            gid->global.interface_id);
> +
> +    return ret;
> +}
> +
> +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid)
> +{
> +    RdmaCmMuxMsg msg = {0};
> +    int ret;
> +
> +    pr_dbg("0x%llx, 0x%llx\n",
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> +
> +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_UNREG;
> +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
> +                            sizeof(msg));
> +    if (ret != sizeof(msg)) {
> +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
> +
> +    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
> +                            sizeof(msg));
> +    if (ret != sizeof(msg)) {
> +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
> +
> +    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
> +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n",
> +               msg.hdr.err_code);
> +        return -EIO;
> +    }
> +
> +    qapi_event_send_rdma_gid_status_changed(ifname, false,
> +                                            gid->global.subnet_prefix,
> +                                            gid->global.interface_id);
> +
> +    return 0;
> +}
> +
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
> -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      CharBackend *mad_chr_be, Error **errp)
> +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> +                      Error **errp)
>   {
>       int i;
>       int ret = 0;
>       int num_ibv_devices;
>       struct ibv_device **dev_list;
> -    struct ibv_port_attr port_attr;
>   
>       memset(backend_dev, 0, sizeof(*backend_dev));
>   
>       backend_dev->dev = pdev;
>       backend_dev->mad_chr_be = mad_chr_be;
> -    backend_dev->backend_gid_idx = backend_gid_idx;
>       backend_dev->port_num = port_num;
>       backend_dev->rdma_dev_res = rdma_dev_res;
>   
> @@ -1041,9 +1143,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>           backend_dev->ib_dev = *dev_list;
>       }
>   
> -    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
> -           ibv_get_device_name(backend_dev->ib_dev),
> -           backend_dev->port_num, backend_dev->backend_gid_idx);
> +    pr_dbg("Using backend device %s, port %d\n",
> +           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
>   
>       backend_dev->context = ibv_open_device(backend_dev->ib_dev);
>       if (!backend_dev->context) {
> @@ -1060,20 +1161,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       }
>       pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
>   
> -    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
> -                         &port_attr);
> -    if (ret) {
> -        error_setg(errp, "Error %d from ibv_query_port", ret);
> -        ret = -EIO;
> -        goto out_destroy_comm_channel;
> -    }
> -
> -    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
> -        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
> -                   port_attr.gid_tbl_len);
> -        goto out_destroy_comm_channel;
> -    }
> -
>       ret = init_device_caps(backend_dev, dev_attr);
>       if (ret) {
>           error_setg(errp, "Failed to initialize device capabilities");
> @@ -1081,18 +1168,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>           goto out_destroy_comm_channel;
>       }
>   
> -    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
> -                         backend_dev->backend_gid_idx, &backend_dev->gid);
> -    if (ret) {
> -        error_setg(errp, "Failed to query gid %d",
> -                   backend_dev->backend_gid_idx);
> -        ret = -EIO;
> -        goto out_destroy_comm_channel;
> -    }
> -    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
> -           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
> -    pr_dbg("interface_id=0x%" PRIx64 "\n",
> -           be64_to_cpu(backend_dev->gid.global.interface_id));
>   
>       ret = mad_init(backend_dev);
>       if (ret) {
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index fc83330251..59ad2b874b 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -28,11 +28,6 @@ enum ibv_special_qp_type {
>       IBV_QPT_GSI = 1,
>   };
>   
> -static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
> -{
> -    return &dev->gid;
> -}
> -
>   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
>   {
>       return qp->ibqp ? qp->ibqp->qp_num : 1;
> @@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
> -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      CharBackend *mad_chr_be, Error **errp);
> +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> +                      Error **errp);
>   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid);
> +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid);
> +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> +                               union ibv_gid *gid);
>   void rdma_backend_start(RdmaBackendDev *backend_dev);
>   void rdma_backend_stop(RdmaBackendDev *backend_dev);
>   void rdma_backend_register_comp_handler(void (*handler)(int status,
> @@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
>   int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>                                  uint8_t qp_type, uint32_t qkey);
>   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> -                              uint8_t qp_type, union ibv_gid *dgid,
> -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> -                              bool use_qkey);
> +                              uint8_t qp_type, uint8_t sgid_idx,
> +                              union ibv_gid *dgid, uint32_t dqpn,
> +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
>   int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
>                                 uint32_t sq_psn, uint32_t qkey, bool use_qkey);
>   int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
> @@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> +                            uint8_t sgid_idx, union ibv_gid *sgid,
>                               union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
>                               void *ctx);
>   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> index 2a7e667075..ff8b2426a0 100644
> --- a/hw/rdma/rdma_backend_defs.h
> +++ b/hw/rdma/rdma_backend_defs.h
> @@ -37,14 +37,12 @@ typedef struct RecvMadList {
>   typedef struct RdmaBackendDev {
>       struct ibv_device_attr dev_attr;
>       RdmaBackendThread comp_thread;
> -    union ibv_gid gid;
>       PCIDevice *dev;
>       RdmaDeviceResources *rdma_dev_res;
>       struct ibv_device *ib_dev;
>       struct ibv_context *context;
>       struct ibv_comp_channel *channel;
>       uint8_t port_num;
> -    uint8_t backend_gid_idx;
>       RecvMadList recv_mads_list;
>       CharBackend *mad_chr_be;
>   } RdmaBackendDev;
> @@ -66,6 +64,7 @@ typedef struct RdmaBackendCQ {
>   typedef struct RdmaBackendQP {
>       struct ibv_pd *ibpd;
>       struct ibv_qp *ibqp;
> +    uint8_t sgid_idx;
>   } RdmaBackendQP;
>   
>   #endif
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index 4f10fcabcc..fe0979415d 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -391,7 +391,7 @@ out_dealloc_qp:
>   }
>   
>   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> -                      uint32_t qp_handle, uint32_t attr_mask,
> +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
>                         union ibv_gid *dgid, uint32_t dqpn,
>                         enum ibv_qp_state qp_state, uint32_t qkey,
>                         uint32_t rq_psn, uint32_t sq_psn)
> @@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>       int ret;
>   
>       pr_dbg("qpn=0x%x\n", qp_handle);
> +    pr_dbg("qkey=0x%x\n", qkey);
>   
>       qp = rdma_rm_get_qp(dev_res, qp_handle);
>       if (!qp) {
> @@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>           }
>   
>           if (qp->qp_state == IBV_QPS_RTR) {
> +            /* Get backend gid index */
> +            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
> +            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
> +                                                     sgid_idx);
> +            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
> +                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
> +                return -EIO;
> +            }
> +
>               ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
> -                                            qp->qp_type, dgid, dqpn, rq_psn,
> -                                            qkey, attr_mask & IBV_QP_QKEY);
> +                                            qp->qp_type, sgid_idx, dgid, dqpn,
> +                                            rq_psn, qkey,
> +                                            attr_mask & IBV_QP_QKEY);
>               if (ret) {
>                   return -EIO;
>               }
> @@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
>       res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
>   }
>   
> +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, union ibv_gid *gid, int gid_idx)
> +{
> +    int rc;
> +
> +    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
> +    if (rc <= 0) {
> +        pr_dbg("Fail to add gid\n");
> +        return -EINVAL;
> +    }
> +
> +    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));

A previous patch removed support for multiple ports, so why do we
still have ports[0]?

> +
> +    return 0;
> +}
> +
> +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, int gid_idx)
> +{
> +    int rc;
> +
> +    rc = rdma_backend_del_gid(backend_dev, ifname,
> +                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
> +    if (rc < 0) {
> +        pr_dbg("Fail to delete gid\n");
> +        return -EINVAL;
> +    }
> +
> +    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
> +           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
> +    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;

Same question as above.
> +
> +    return 0;
> +}
> +
> +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> +                                  RdmaBackendDev *backend_dev, int sgid_idx)
> +{
> +    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
> +        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
> +        return -EINVAL;
> +    }
> +
> +    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
> +        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
> +        rdma_backend_get_gid_index(backend_dev,
> +                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
> +    }
> +
> +    pr_dbg("backend_gid_index=%d\n",
> +           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
> +
> +    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
> +}
> +
>   static void destroy_qp_hash_key(gpointer data)
>   {
>       g_bytes_unref(data);
>   }
>   
> +static void init_ports(RdmaDeviceResources *dev_res)
> +{
> +    int i, j;
> +
> +    memset(dev_res->ports, 0, sizeof(dev_res->ports));
> +
> +    for (i = 0; i < MAX_PORTS; i++) {
> +        dev_res->ports[i].state = IBV_PORT_DOWN;

I might have missed something regarding the ports support,
can you please clarify for me?

Thanks,
Marcel

> +        for (j = 0; j < MAX_PORT_GIDS; j++) {
> +            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
> +        }
> +    }
> +}
> +
> +static void fini_ports(RdmaDeviceResources *dev_res,
> +                       RdmaBackendDev *backend_dev, const char *ifname)
> +{
> +    int i;
> +
> +    dev_res->ports[0].state = IBV_PORT_DOWN;
> +    for (i = 0; i < MAX_PORT_GIDS; i++) {
> +        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
> +    }
> +}
> +
>   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
>                    Error **errp)
>   {
> @@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
>                          dev_attr->max_qp_wr, sizeof(void *));
>       res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
>   
> +    init_ports(dev_res);
> +
>       return 0;
>   }
>   
> -void rdma_rm_fini(RdmaDeviceResources *dev_res)
> +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                  const char *ifname)
>   {
> +    fini_ports(dev_res, backend_dev, ifname);
> +
>       res_tbl_free(&dev_res->uc_tbl);
>       res_tbl_free(&dev_res->cqe_ctx_tbl);
>       res_tbl_free(&dev_res->qp_tbl);
> diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
> index b4e04cc7b4..a7169b4e89 100644
> --- a/hw/rdma/rdma_rm.h
> +++ b/hw/rdma/rdma_rm.h
> @@ -22,7 +22,8 @@
>   
>   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
>                    Error **errp);
> -void rdma_rm_fini(RdmaDeviceResources *dev_res);
> +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                  const char *ifname);
>   
>   int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>                        uint32_t *pd_handle, uint32_t ctx_handle);
> @@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
>                        uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
>   RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
>   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> -                      uint32_t qp_handle, uint32_t attr_mask,
> +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
>                         union ibv_gid *dgid, uint32_t dqpn,
>                         enum ibv_qp_state qp_state, uint32_t qkey,
>                         uint32_t rq_psn, uint32_t sq_psn);
> @@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
>   void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
>   void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
>   
> +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, union ibv_gid *gid, int gid_idx);
> +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, int gid_idx);
> +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> +                                  RdmaBackendDev *backend_dev, int sgid_idx);
> +static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
> +                                             int sgid_idx)
> +{
> +    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
> +}
> +
>   #endif
> diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
> index 9b399063d3..7b3435f991 100644
> --- a/hw/rdma/rdma_rm_defs.h
> +++ b/hw/rdma/rdma_rm_defs.h
> @@ -19,7 +19,7 @@
>   #include "rdma_backend_defs.h"
>   
>   #define MAX_PORTS             1
> -#define MAX_PORT_GIDS         1
> +#define MAX_PORT_GIDS         255
>   #define MAX_GIDS              MAX_PORT_GIDS
>   #define MAX_PORT_PKEYS        1
>   #define MAX_PKEYS             MAX_PORT_PKEYS
> @@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
>       enum ibv_qp_state qp_state;
>   } RdmaRmQP;
>   
> +typedef struct RdmaRmGid {
> +    union ibv_gid gid;
> +    int backend_gid_index;
> +} RdmaRmGid;
> +
>   typedef struct RdmaRmPort {
> -    union ibv_gid gid_tbl[MAX_PORT_GIDS];
> +    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
>       enum ibv_port_state state;
>   } RdmaRmPort;
>   
> diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
> index 04c7c2ef5b..989db249ef 100644
> --- a/hw/rdma/rdma_utils.h
> +++ b/hw/rdma/rdma_utils.h
> @@ -20,6 +20,7 @@
>   #include "qemu/osdep.h"
>   #include "hw/pci/pci.h"
>   #include "sysemu/dma.h"
> +#include "stdio.h"
>   
>   #define pr_info(fmt, ...) \
>       fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
> @@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
>   #define pr_dbg(fmt, ...) \
>       fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
>               __func__, __LINE__, ## __VA_ARGS__)
> +
> +#define pr_dbg_buf(title, buf, len) \
> +{ \
> +    char *b = g_malloc0(len * 3 + 1); \
> +    char b1[4]; \
> +    for (int i = 0; i < len; i++) { \
> +        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
> +        strcat(b, b1); \
> +    } \
> +    pr_dbg("%s (%d): %s\n", title, len, b); \
> +    g_free(b); \
> +}
> +
>   #else
>   #define init_pr_dbg(void)
>   #define pr_dbg(fmt, ...)
> +#define pr_dbg_buf(title, buf, len)
>   #endif
>   
>   void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index 15c3f28b86..b019cb843a 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -79,8 +79,8 @@ typedef struct PVRDMADev {
>       int interrupt_mask;
>       struct ibv_device_attr dev_attr;
>       uint64_t node_guid;
> +    char *backend_eth_device_name;
>       char *backend_device_name;
> -    uint8_t backend_gid_idx;
>       uint8_t backend_port_num;
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index 57d6f41ae6..a334f6205e 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       rsp->hdr.response = cmd->hdr.response;
>       rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
>   
> -    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
> -                                 cmd->qp_handle, cmd->attr_mask,
> -                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> -                                 cmd->attrs.dest_qp_num,
> -                                 (enum ibv_qp_state)cmd->attrs.qp_state,
> -                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
> -                                 cmd->attrs.sq_psn);
> +    /* No need to verify sgid_index since it is u8 */
> +
> +    rsp->hdr.err =
> +        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
> +                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
> +                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> +                          cmd->attrs.dest_qp_num,
> +                          (enum ibv_qp_state)cmd->attrs.qp_state,
> +                          cmd->attrs.qkey, cmd->attrs.rq_psn,
> +                          cmd->attrs.sq_psn);
>   
>       pr_dbg("ret=%d\n", rsp->hdr.err);
>       return rsp->hdr.err;
> @@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>                          union pvrdma_cmd_resp *rsp)
>   {
>       struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
> -#ifdef PVRDMA_DEBUG
> -    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
> -    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
> -#endif
> +    int rc;
> +    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
>   
>       pr_dbg("index=%d\n", cmd->index);
>   
> @@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       }
>   
>       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> -           (long long unsigned int)be64_to_cpu(*subnet),
> -           (long long unsigned int)be64_to_cpu(*if_id));
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
>   
> -    /* Driver forces to one port only */
> -    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
> -           sizeof(cmd->new_gid));
> +    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
> +                         dev->backend_eth_device_name, gid, cmd->index);
> +    if (rc < 0) {
> +        return -EINVAL;
> +    }
>   
>       /* TODO: Since the driver stores node_guid at load_dsr phase, this
>        * assignment is not relevant; we need to figure out a way to
>        * retrieve the MAC of our netdev */
> -    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
> -    pr_dbg("dev->node_guid=0x%llx\n",
> -           (long long unsigned int)be64_to_cpu(dev->node_guid));
> +    if (!cmd->index) {
> +        dev->node_guid =
> +            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
> +        pr_dbg("dev->node_guid=0x%llx\n",
> +               (long long unsigned int)be64_to_cpu(dev->node_guid));
> +    }
>   
>       return 0;
>   }
> @@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>                           union pvrdma_cmd_resp *rsp)
>   {
> +    int rc;
> +
>       struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
>   
>       pr_dbg("index=%d\n", cmd->index);
> @@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>           return -EINVAL;
>       }
>   
> -    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
> -           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
> +    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> +                        dev->backend_eth_device_name, cmd->index);
> +
> +    if (rc < 0) {
> +        rsp->hdr.err = rc;
> +        goto out;
> +    }
>   
>       return 0;
>   }
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index fc2abd34af..ac8c092db0 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -36,9 +36,9 @@
>   #include "pvrdma_qp_ops.h"
>   
>   static Property pvrdma_dev_properties[] = {
> -    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
> -    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
> -    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
> +    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
> +    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
> +    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
>       DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
>                          MAX_MR_SIZE),
>       DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
> @@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
>       pr_dbg("Initialized\n");
>   }
>   
> -static void init_ports(PVRDMADev *dev, Error **errp)
> -{
> -    int i;
> -
> -    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
> -
> -    for (i = 0; i < MAX_PORTS; i++) {
> -        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
> -    }
> -}
> -
>   static void uninit_msix(PCIDevice *pdev, int used_vectors)
>   {
>       PVRDMADev *dev = PVRDMA_DEV(pdev);
> @@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
>   
>       pvrdma_qp_ops_fini();
>   
> -    rdma_rm_fini(&dev->rdma_dev_res);
> +    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
> +                 dev->backend_eth_device_name);
>   
>       rdma_backend_fini(&dev->backend_dev);
>   
> @@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>   
>       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
>                              dev->backend_device_name, dev->backend_port_num,
> -                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> -                           errp);
> +                           &dev->dev_attr, &dev->mad_chr, errp);
>       if (rc) {
>           goto out;
>       }
> @@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>           goto out;
>       }
>   
> -    init_ports(dev, errp);
> -
>       rc = pvrdma_qp_ops_init();
>       if (rc) {
>           goto out;
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index 3388be1926..2130824098 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
>       RdmaRmQP *qp;
>       PvrdmaSqWqe *wqe;
>       PvrdmaRing *ring;
> +    int sgid_idx;
> +    union ibv_gid *sgid;
>   
>       pr_dbg("qp_handle=0x%x\n", qp_handle);
>   
> @@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
>           comp_ctx->cqe.qp = qp_handle;
>           comp_ctx->cqe.opcode = IBV_WC_SEND;
>   
> +        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
> +        if (!sgid) {
> +            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
> +            return -EIO;
> +        }
> +        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
> +               sgid->global.interface_id);
> +
> +        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
> +                                                 &dev->backend_dev,
> +                                                 wqe->hdr.wr.ud.av.gid_index);
> +        if (sgid_idx <= 0) {
> +            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
> +                   wqe->hdr.wr.ud.av.gid_index);
> +            return -EIO;
> +        }
> +
>           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
>                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
> +                               sgid_idx, sgid,
>                                  (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
>                                  wqe->hdr.wr.ud.remote_qpn,
>                                  wqe->hdr.wr.ud.remote_qkey, comp_ctx);

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
@ 2018-11-17 17:27   ` Shamir Rabinovitch
  2018-11-18 10:17     ` Yuval Shaia
  0 siblings, 1 reply; 70+ messages in thread
From: Shamir Rabinovitch @ 2018-11-17 17:27 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, cohuck

On Tue, Nov 13, 2018 at 09:13:14AM +0200, Yuval Shaia wrote:
> The RDMA MAD kernel module (ibcm) disallows more than one MAD agent
> for a given MAD class.
> This does not go hand-in-hand with the qemu pvrdma device's
> requirements, where each VM is a MAD agent.
> Fix it by adding an implementation of an RDMA MAD multiplexer service
> which, on one hand, registers as the sole MAD agent with the kernel
> module and, on the other hand, gives service to more than one VM.
> 
> Design Overview:
> ----------------
> A server process registers with the UMAD framework (for this to work
> the rdma_cm kernel module needs to be unloaded) and creates a unix
> socket to listen for incoming requests from clients.
> A client process (such as QEMU) connects to this unix socket and
> registers with its own GID.
> 
> TX:
> ---
> When a client needs to send an rdma_cm MAD message it constructs it
> the same way as without this multiplexer, i.e. creates a umad packet,
> but this time writes its content to the socket instead of calling
> umad_send().
> The server, upon receiving such a message, fetches the local_comm_id
> from it so a context for this session can be maintained, and relays
> the message to the UMAD layer by calling umad_send().
> 
> RX:
> ---
> The server creates a worker thread to process incoming rdma_cm MAD
> messages. When an incoming message arrives (umad_recv()), the server,
> depending on the message type (attr_id), looks for the target client
> by searching either the gid->fd table or the local_comm_id->fd table.
> With the extracted fd, the server relays the incoming message to the
> client.
> 
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  MAINTAINERS                      |   1 +
>  Makefile                         |   3 +
>  Makefile.objs                    |   1 +
>  contrib/rdmacm-mux/Makefile.objs |   4 +
>  contrib/rdmacm-mux/main.c        | 771 +++++++++++++++++++++++++++++++
>  contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
>  6 files changed, 836 insertions(+)
>  create mode 100644 contrib/rdmacm-mux/Makefile.objs
>  create mode 100644 contrib/rdmacm-mux/main.c
>  create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 98a1856afc..e087d58ac6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2231,6 +2231,7 @@ S: Maintained
>  F: hw/rdma/*
>  F: hw/rdma/vmw/*
>  F: docs/pvrdma.txt
> +F: contrib/rdmacm-mux/*
>  
>  Build and test automation
>  -------------------------
> diff --git a/Makefile b/Makefile
> index f2947186a4..94072776ff 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
>                  elf2dmp-obj-y \
>                  ivshmem-client-obj-y \
>                  ivshmem-server-obj-y \
> +                rdmacm-mux-obj-y \
>                  libvhost-user-obj-y \
>                  vhost-user-scsi-obj-y \
>                  vhost-user-blk-obj-y \
> @@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
>  	$(call LINK, $^)
>  vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
>  	$(call LINK, $^)
> +rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
> +	$(call LINK, $^)
>  
>  module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
>  	$(call quiet-command,$(PYTHON) $< $@ \
> diff --git a/Makefile.objs b/Makefile.objs
> index 1e1ff387d7..cc7df3ad80 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
>  vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
>  vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
>  vhost-user-blk-obj-y = contrib/vhost-user-blk/
> +rdmacm-mux-obj-y = contrib/rdmacm-mux/
>  
>  ######################################################################
>  trace-events-subdirs =
> diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
> new file mode 100644
> index 0000000000..be3eacb6f7
> --- /dev/null
> +++ b/contrib/rdmacm-mux/Makefile.objs
> @@ -0,0 +1,4 @@
> +ifdef CONFIG_PVRDMA
> +CFLAGS += -libumad -Wno-format-truncation
> +rdmacm-mux-obj-y = main.o
> +endif
> diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
> new file mode 100644
> index 0000000000..47cf0ac7bc
> --- /dev/null
> +++ b/contrib/rdmacm-mux/main.c
> @@ -0,0 +1,771 @@
> +/*
> + * QEMU paravirtual RDMA - rdmacm-mux implementation
> + *
> + * Copyright (C) 2018 Oracle
> + * Copyright (C) 2018 Red Hat Inc
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "sys/poll.h"
> +#include "sys/ioctl.h"
> +#include "pthread.h"
> +#include "syslog.h"
> +
> +#include "infiniband/verbs.h"
> +#include "infiniband/umad.h"
> +#include "infiniband/umad_types.h"
> +#include "infiniband/umad_sa.h"
> +#include "infiniband/umad_cm.h"
> +
> +#include "rdmacm-mux.h"
> +
> +#define SCALE_US 1000
> +#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
> +#define SLEEP_SECS 5 /* This is used both in poll() and thread */
> +#define SERVER_LISTEN_BACKLOG 10
> +#define MAX_CLIENTS 4096
> +#define MAD_RMPP_VERSION 0
> +#define MAD_METHOD_MASK0 0x8
> +
> +#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
> +
> +#define CM_REQ_DGID_POS      80
> +#define CM_SIDR_REQ_DGID_POS 44
> +
> +/* The following can be overridden by command line parameters */
> +#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
> +#define RDMA_DEVICE "rxe0"
> +#define RDMA_PORT_NUM 1
> +
> +typedef struct RdmaCmServerArgs {
> +    char unix_socket_path[PATH_MAX];
> +    char rdma_dev_name[NAME_MAX];
> +    int rdma_port_num;
> +} RdmaCMServerArgs;
> +
> +typedef struct CommId2FdEntry {
> +    int fd;
> +    int ttl; /* Initialized to 2, decremented on each timeout; entry deleted when 0 */
> +    __be64 gid_ifid;
> +} CommId2FdEntry;
> +
> +typedef struct RdmaCmUMadAgent {
> +    int port_id;
> +    int agent_id;
> +    GHashTable *gid2fd; /* Used to find fd of a given gid */
> +    GHashTable *commid2fd; /* Used to find fd of a given comm_id */
> +} RdmaCmUMadAgent;
> +
> +typedef struct RdmaCmServer {
> +    bool run;
> +    RdmaCMServerArgs args;
> +    struct pollfd fds[MAX_CLIENTS];
> +    int nfds;
> +    RdmaCmUMadAgent umad_agent;
> +    pthread_t umad_recv_thread;
> +    pthread_rwlock_t lock;
> +} RdmaCMServer;
> +
> +static RdmaCMServer server = {0};

nit - no need for '{0}'

> +
> +static void usage(const char *progname)
> +{
> +    printf("Usage: %s [OPTION]...\n"
> +           "Start a RDMA-CM multiplexer\n"
> +           "\n"
> +           "\t-h                    Show this help\n"
> +           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
> +           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
> +           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
> +           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
> +}
> +
> +static void help(const char *progname)
> +{
> +    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
> +}
> +
> +static void parse_args(int argc, char *argv[])
> +{
> +    int c;
> +    char unix_socket_path[PATH_MAX];
> +
> +    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
> +    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
> +    server.args.rdma_port_num = RDMA_PORT_NUM;
> +
> +    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
> +        switch (c) {
> +        case 'h':
> +            usage(argv[0]);
> +            exit(0);
> +
> +        case 's':
> +            /* This is temporary, the final name is built below */
> +            strncpy(unix_socket_path, optarg, PATH_MAX - 1);
> +            unix_socket_path[PATH_MAX - 1] = '\0';
> +            break;
> +
> +        case 'd':
> +            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
> +            break;
> +
> +        case 'p':
> +            server.args.rdma_port_num = atoi(optarg);
> +            break;
> +
> +        default:
> +            help(argv[0]);
> +            exit(1);
> +        }
> +    }
> +
> +    /* Build unique unix-socket file name */
> +    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
> +             unix_socket_path, server.args.rdma_dev_name,
> +             server.args.rdma_port_num);
> +
> +    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
> +    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
> +    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
> +}
> +
> +static void hash_tbl_alloc(void)
> +{
> +    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
> +                                                     g_int64_equal,
> +                                                     g_free, g_free);
> +    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
> +                                                        g_int_equal,
> +                                                        g_free, g_free);
> +}
> +
> +static void hash_tbl_free(void)
> +{
> +    if (server.umad_agent.commid2fd) {
> +        g_hash_table_destroy(server.umad_agent.commid2fd);
> +    }
> +    if (server.umad_agent.gid2fd) {
> +        g_hash_table_destroy(server.umad_agent.gid2fd);
> +    }
> +}
> +
> +
> +static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
> +{
> +    int *fd;
> +
> +    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
> +    if (!fd) {
> +        /* Let's try IPv4 */
> +        *gid_ifid |= 0x00000000ffff0000;
> +        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
> +    }
> +
> +    return fd ? *fd : 0;
> +}
> +
> +static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
> +{
> +    pthread_rwlock_rdlock(&server.lock);
> +    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    if (!*fd) {
> +        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
> +        return -ENOENT;
> +    }
> +
> +    return 0;
> +}
> +
> +static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
> +                                         __be64 *gid_idid)
> +{
> +    CommId2FdEntry *fde;
> +
> +    pthread_rwlock_rdlock(&server.lock);
> +    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    if (!fde) {
> +        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
> +        return -ENOENT;
> +    }
> +
> +    *fd = fde->fd;
> +    *gid_idid = fde->gid_ifid;
> +
> +    return 0;
> +}
> +
> +static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
> +{
> +    int fd1;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
> +    if (fd1) { /* record already exists - an error */
> +        pthread_rwlock_unlock(&server.lock);
> +        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
> +                           RDMACM_MUX_ERR_CODE_EACCES;
> +    }
> +
> +    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
> +                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
> +
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    syslog(LOG_INFO, "0x%lx registered on socket %d", (uint64_t)gid_ifid, fd);
> +
> +    return RDMACM_MUX_ERR_CODE_OK;
> +}
> +
> +static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
> +{
> +    int fd1;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
> +    if (!fd1) { /* record does not exist - an error */
> +        pthread_rwlock_unlock(&server.lock);
> +        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
> +    }
> +
> +    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
> +                        sizeof(gid_ifid)));
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    syslog(LOG_INFO, "0x%lx unregistered on socket %d", (uint64_t)gid_ifid, fd);
> +
> +    return RDMACM_MUX_ERR_CODE_OK;
> +}
> +
> +static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
> +                                          uint64_t gid_ifid)
> +{
> +    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +    g_hash_table_insert(server.umad_agent.commid2fd,
> +                        g_memdup(&comm_id, sizeof(comm_id)),
> +                        g_memdup(&fde, sizeof(fde)));
> +    pthread_rwlock_unlock(&server.lock);
> +}
> +
> +static gboolean remove_old_comm_ids(gpointer key, gpointer value,
> +                                    gpointer user_data)
> +{
> +    CommId2FdEntry *fde = (CommId2FdEntry *)value;
> +
> +    return !fde->ttl--;
> +}
> +
> +static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
> +                                         gpointer user_data)
> +{
> +    if (*(int *)value == *(int *)user_data) {
> +        syslog(LOG_INFO, "0x%lx unregistered on socket %d", *(uint64_t *)key,
> +               *(int *)value);
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +static void hash_tbl_remove_fd_ifid_pair(int fd)
> +{
> +    pthread_rwlock_wrlock(&server.lock);
> +    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
> +                                remove_entry_from_gid2fd, (gpointer)&fd);
> +    pthread_rwlock_unlock(&server.lock);
> +}
> +
> +static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
> +{
> +    struct umad_hdr *hdr = (struct umad_hdr *)mad;
> +    char *data = (char *)hdr + sizeof(*hdr);
> +    int32_t comm_id;
> +    uint16_t attr_id = be16toh(hdr->attr_id);
> +    int rc = 0;
> +
> +    switch (attr_id) {
> +    case UMAD_CM_ATTR_REQ:
> +        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
> +        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
> +        break;
> +
> +    case UMAD_CM_ATTR_SIDR_REQ:
> +        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
> +        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
> +        break;
> +
> +    case UMAD_CM_ATTR_REP:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_REJ:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_DREQ:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_DREP:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_RTU:
> +        data += sizeof(comm_id);
> +        /* Fall through */
> +    case UMAD_CM_ATTR_SIDR_REP:
> +        memcpy(&comm_id, data, sizeof(comm_id));
> +        if (comm_id) {
> +            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
> +        }
> +        break;
> +
> +    default:
> +        rc = -EINVAL;
> +        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
> +    }
> +
> +    return rc;
> +}
> +
> +static void *umad_recv_thread_func(void *args)
> +{
> +    int rc;
> +    RdmaCmMuxMsg msg = {0};
> +    int fd = -2;
> +
> +    while (server.run) {
> +        do {
> +            msg.umad_len = sizeof(msg.umad.mad);
> +            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
> +                           SLEEP_SECS * SCALE_US);
> +            if ((rc == -EIO) || (rc == -EINVAL)) {
> +                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
> +            }
> +
> +            if (rc == -ETIMEDOUT) {
> +                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
> +                                            remove_old_comm_ids, NULL);
> +            }
> +        } while (rc && server.run);
> +
> +        if (server.run) {
> +            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
> +            if (rc) {
> +                continue;
> +            }
> +
> +            send(fd, &msg, sizeof(msg), 0);
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int read_and_process(int fd)
> +{
> +    int rc;
> +    RdmaCmMuxMsg msg = {0};
> +    struct umad_hdr *hdr;
> +    uint32_t *comm_id;
> +    uint16_t attr_id;
> +
> +    rc = recv(fd, &msg, sizeof(msg), 0);
> +
> +    if (rc < 0 && errno != EWOULDBLOCK) {
> +        return -EIO;
> +    }
> +
> +    if (!rc) {
> +        return -EPIPE;
> +    }
> +
> +    switch (msg.hdr.msg_type) {
> +    case RDMACM_MUX_MSG_TYPE_REG:
> +        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
> +        break;
> +
> +    case RDMACM_MUX_MSG_TYPE_UNREG:
> +        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
> +        break;
> +
> +    case RDMACM_MUX_MSG_TYPE_MAD:
> +        /* If this is REQ or REP then store the pair comm_id,fd to be later
> +         * used for other messages where gid is unknown */
> +        hdr = (struct umad_hdr *)msg.umad.mad;
> +        attr_id = be16toh(hdr->attr_id);
> +        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
> +            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
> +            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
> +            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
> +            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
> +                                          msg.hdr.sgid.global.interface_id);
> +        }
> +
> +        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
> +                       &msg.umad, msg.umad_len, 1, 0);
> +        if (rc) {
> +            syslog(LOG_WARNING, "Fail to send MAD message, err=%d", rc);
> +        }
> +        break;
> +
> +    default:
> +        syslog(LOG_WARNING, "Got invalid message (%d) from %d",
> +               msg.hdr.msg_type, fd);
> +        rc = RDMACM_MUX_ERR_CODE_EINVAL;
> +    }
> +
> +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
> +    msg.hdr.err_code = rc;
> +    rc = send(fd, &msg, sizeof(msg), 0);
> +
> +    return rc == sizeof(msg) ? 0 : -EPIPE;
> +}
> +
> +static int accept_all(void)
> +{
> +    int fd, rc = 0;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    do {
> +        if ((server.nfds + 1) > MAX_CLIENTS) {
> +            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
> +            rc = -EIO;
> +            goto out;
> +        }
> +
> +        fd = accept(server.fds[0].fd, NULL, NULL);
> +        if (fd < 0) {
> +            if (errno != EWOULDBLOCK) {
> +                syslog(LOG_WARNING, "accept() failed");
> +                rc = -EIO;
> +                goto out;
> +            }
> +            break;
> +        }
> +
> +        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
> +        server.fds[server.nfds].fd = fd;
> +        server.fds[server.nfds].events = POLLIN;
> +        server.nfds++;
> +    } while (fd != -1);
> +
> +out:
> +    pthread_rwlock_unlock(&server.lock);
> +    return rc;
> +}
> +
> +static void compress_fds(void)
> +{
> +    int i, j;
> +    int closed = 0;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    for (i = 1; i < server.nfds; i++) {
> +        if (!server.fds[i].fd) {
> +            closed++;
> +            for (j = i; j < server.nfds; j++) {
> +                server.fds[j].fd = server.fds[j + 1].fd;
> +            }
> +        }
> +    }
> +
> +    server.nfds -= closed;
> +
> +    pthread_rwlock_unlock(&server.lock);
> +}
> +
> +static void close_fd(int idx)
> +{
> +    close(server.fds[idx].fd);
> +    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
> +    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
> +    server.fds[idx].fd = 0;
> +}
> +
> +static void run(void)
> +{
> +    int rc, nfds, i;
> +    bool compress = false;
> +
> +    syslog(LOG_INFO, "Service started");
> +
> +    while (server.run) {
> +        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
> +        if (rc < 0) {
> +            if (errno != EINTR) {
> +                syslog(LOG_WARNING, "poll() failed");
> +            }
> +            continue;
> +        }
> +
> +        if (rc == 0) {
> +            continue;
> +        }
> +
> +        nfds = server.nfds;
> +        for (i = 0; i < nfds; i++) {
> +            if (server.fds[i].revents == 0) {
> +                continue;
> +            }
> +
> +            if (server.fds[i].revents != POLLIN) {
> +                if (i == 0) {
> +                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
> +                           server.fds[i].revents);
> +                } else {
> +                    close_fd(i);
> +                    compress = true;
> +                }
> +                continue;
> +            }
> +
> +            if (i == 0) {
> +                rc = accept_all();
> +                if (rc) {
> +                    continue;
> +                }
> +            } else {
> +                rc = read_and_process(server.fds[i].fd);
> +                if (rc) {
> +                    close_fd(i);
> +                    compress = true;
> +                }
> +            }
> +        }
> +
> +        if (compress) {
> +            compress = false;
> +            compress_fds();
> +        }
> +    }
> +}
> +
> +static void fini_listener(void)
> +{
> +    int i;
> +
> +    if (server.fds[0].fd <= 0) {
> +        return;
> +    }
> +
> +    for (i = server.nfds - 1; i >= 0; i--) {
> +        if (server.fds[i].fd) {
> +            close(server.fds[i].fd);
> +        }
> +    }
> +
> +    unlink(server.args.unix_socket_path);
> +}
> +
> +static void fini_umad(void)
> +{
> +    if (server.umad_agent.agent_id) {
> +        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
> +    }
> +
> +    if (server.umad_agent.port_id) {
> +        umad_close_port(server.umad_agent.port_id);
> +    }
> +
> +    hash_tbl_free();
> +}
> +
> +static void fini(void)
> +{
> +    if (server.umad_recv_thread) {
> +        pthread_join(server.umad_recv_thread, NULL);
> +        server.umad_recv_thread = 0;
> +    }
> +    fini_umad();
> +    fini_listener();
> +    pthread_rwlock_destroy(&server.lock);
> +
> +    syslog(LOG_INFO, "Service going down");
> +}
> +
> +static int init_listener(void)
> +{
> +    struct sockaddr_un sun;
> +    int rc, on = 1;
> +
> +    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
> +    if (server.fds[0].fd < 0) {
> +        syslog(LOG_ALERT, "socket() failed");
> +        return -EIO;
> +    }
> +
> +    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
> +                    sizeof(on));
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "setsockopt() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "ioctl() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
> +        syslog(LOG_ALERT,
> +               "Invalid unix_socket_path, size must be less than %ld\n",
> +               sizeof(sun.sun_path));
> +        rc = -EINVAL;
> +        goto err;
> +    }
> +
> +    sun.sun_family = AF_UNIX;
> +    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
> +                  server.args.unix_socket_path);
> +    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
> +        syslog(LOG_ALERT, "Could not copy unix socket path\n");
> +        rc = -EINVAL;
> +        goto err;
> +    }
> +
> +    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "bind() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "listen() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    server.fds[0].events = POLLIN;
> +    server.nfds = 1;
> +    server.run = true;
> +
> +    return 0;
> +
> +err:
> +    close(server.fds[0].fd);
> +    return rc;
> +}
> +
> +static int init_umad(void)
> +{
> +    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
> +
> +    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
> +                                               server.args.rdma_port_num);
> +
> +    if (server.umad_agent.port_id < 0) {
> +        syslog(LOG_WARNING, "umad_open_port() failed");
> +        return -EIO;
> +    }
> +
> +    memset(&method_mask, 0, sizeof(method_mask));
> +    method_mask[0] = MAD_METHOD_MASK0;
> +    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
> +                                               UMAD_CLASS_CM,
> +                                               UMAD_SA_CLASS_VERSION,
> +                                               MAD_RMPP_VERSION, method_mask);
> +    if (server.umad_agent.agent_id < 0) {
> +        syslog(LOG_WARNING, "umad_register() failed");
> +        return -EIO;
> +    }
> +
> +    hash_tbl_alloc();
> +
> +    return 0;
> +}
> +
> +static void signal_handler(int sig, siginfo_t *siginfo, void *context)
> +{
> +    static bool warned;
> +
> +    /* Prevent stop if clients are connected */
> +    if (server.nfds != 1) {
> +        if (!warned) {
> +            syslog(LOG_WARNING,
> +                   "Can't stop while active clients exist, resend SIGINT to override");
> +            warned = true;
> +            return;
> +        }
> +    }
> +
> +    if (sig == SIGINT) {
> +        server.run = false;
> +        fini();
> +    }
> +
> +    exit(0);
> +}
> +
> +static int init(void)
> +{
> +    int rc;
> +    struct sigaction sig = {0};
> +
> +    rc = init_listener();
> +    if (rc) {
> +        return rc;
> +    }
> +
> +    rc = init_umad();
> +    if (rc) {
> +        return rc;
> +    }
> +
> +    pthread_rwlock_init(&server.lock, 0);
> +
> +    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
> +                        NULL);
> +    if (rc) {
> +        syslog(LOG_ERR, "Fail to create UMAD receiver thread (%d)\n", rc);
> +        return rc;
> +    }
> +
> +    sig.sa_sigaction = &signal_handler;
> +    sig.sa_flags = SA_SIGINFO;
> +    rc = sigaction(SIGINT, &sig, NULL);
> +    if (rc < 0) {
> +        syslog(LOG_ERR, "Fail to install SIGINT handler (%d)\n", errno);
> +        return rc;
> +    }
> +
> +    return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    int rc;
> +
> +    memset(&server, 0, sizeof(server));
> +
> +    parse_args(argc, argv);
> +
> +    rc = init();
> +    if (rc) {
> +        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
> +        rc = -EAGAIN;
> +        goto out;
> +    }
> +
> +    run();
> +
> +out:
> +    fini();
> +
> +    return rc;
> +}
> diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
> new file mode 100644
> index 0000000000..03508d52b2
> --- /dev/null
> +++ b/contrib/rdmacm-mux/rdmacm-mux.h
> @@ -0,0 +1,56 @@
> +/*
> + * QEMU paravirtual RDMA - rdmacm-mux declarations
> + *
> + * Copyright (C) 2018 Oracle
> + * Copyright (C) 2018 Red Hat Inc
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef RDMACM_MUX_H
> +#define RDMACM_MUX_H
> +
> +#include "linux/if.h"
> +#include "infiniband/verbs.h"
> +#include "infiniband/umad.h"
> +#include "rdma/rdma_user_cm.h"
> +
> +typedef enum RdmaCmMuxMsgType {
> +    RDMACM_MUX_MSG_TYPE_REG   = 0,
> +    RDMACM_MUX_MSG_TYPE_UNREG = 1,
> +    RDMACM_MUX_MSG_TYPE_MAD   = 2,
> +    RDMACM_MUX_MSG_TYPE_RESP  = 3,
> +} RdmaCmMuxMsgType;
> +
> +typedef enum RdmaCmMuxErrCode {
> +    RDMACM_MUX_ERR_CODE_OK        = 0,
> +    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
> +    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
> +    RDMACM_MUX_ERR_CODE_EACCES    = 3,
> +    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
> +} RdmaCmMuxErrCode;
> +
> +typedef struct RdmaCmMuxHdr {
> +    RdmaCmMuxMsgType msg_type;
> +    union ibv_gid sgid;
> +    RdmaCmMuxErrCode err_code;
> +} RdmaCmUHdr;
> +
> +typedef struct RdmaCmUMad {
> +    struct ib_user_mad hdr;
> +    char mad[RDMA_MAX_PRIVATE_DATA];
> +} RdmaCmUMad;
> +
> +typedef struct RdmaCmMuxMsg {
> +    RdmaCmUHdr hdr;
> +    int umad_len;
> +    RdmaCmUMad umad;
> +} RdmaCmMuxMsg;
> +
> +#endif
> -- 
> 2.17.2
>

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation
  2018-11-17 12:34   ` Marcel Apfelbaum
@ 2018-11-18  7:27     ` Yuval Shaia
  0 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-18  7:27 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck

On Sat, Nov 17, 2018 at 02:34:18PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/13/18 9:13 AM, Yuval Shaia wrote:
> > Interface with the device is changed with the addition of support for
> > MAD packets.
> > Adjust documentation accordingly.
> > 
> > While there, fix a minor mistake which may lead one to think that there is
> > a relation between using RXE on the host and compatibility with bare-metal
> > peers.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
> >   1 file changed, 84 insertions(+), 19 deletions(-)
> > 
> > diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
> > index 5599318159..9e8d1674b7 100644
> > --- a/docs/pvrdma.txt
> > +++ b/docs/pvrdma.txt
> > @@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need for any special guest
> >   modifications.
> >   While it complies with the VMware device, it can also communicate with bare
> > -metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> > -can work with Soft-RoCE (rxe).
> > +metal RDMA-enabled machines as peers.
> > +
> > +It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
> >   It does not require the whole guest RAM to be pinned allowing memory
> >   over-commit and, even if not implemented yet, migration support will be
> > @@ -78,29 +79,93 @@ the required RDMA libraries.
> >   3. Usage
> >   ========
> > +
> > +
> > +3.1 VM Memory settings
> > +======================
> >   Currently the device is working only with memory backed RAM
> >   and it must be mark as "shared":
> >      -m 1G \
> >      -object memory-backend-ram,id=mb1,size=1G,share \
> >      -numa node,memdev=mb1 \
> > -The pvrdma device is composed of two functions:
> > - - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
> > -   but is required to pass the ibdevice GID using its MAC.
> > -   Examples:
> > -     For an rxe backend using eth0 interface it will use its mac:
> > -       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
> > -     For an SRIOV VF, we take the Ethernet Interface exposed by it:
> > -       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
> > - - Function 1 is the actual device:
> > -       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
> > -   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
> > - Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
> > - The rules of conversion are part of the RoCE spec, but since manual conversion
> > - is not required, spotting problems is not hard:
> > -    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
> > -             MAC: 7c:fe:90:cb:74:3a
> > -    Note the difference between the first byte of the MAC and the GID.
> > +
> > +3.2 MAD Multiplexer
> > +===================
> > +MAD Multiplexer is a service that exposes a MAD-like interface to VMs in
> > +order to overcome the limitation that only a single entity can register
> > +with the MAD layer to send and receive RDMA-CM MAD packets.
> > +
> > +To build rdmacm-mux run
> > +# make rdmacm-mux
> > +
> > +The program accepts 3 command line arguments and exposes a UNIX socket to
> > +be used to relay control and data messages to and from the service.
> > +-s unix-socket-path   Path to unix socket to listen on
> > +                      (default /var/run/rdmacm-mux)
> > +-d rdma-device-name   Name of RDMA device to register with
> > +                      (default rxe0)
> > +-p rdma-device-port   Port number of RDMA device to register with
> > +                      (default 1)
> > +The final UNIX socket file name is a concatenation of the 3 arguments so
> > +for example for device name mlx5_0 and port 2 the file
> > +/var/run/rdmacm-mux-mlx5_0-2 will be created.
> > +
> > +Please refer to contrib/rdmacm-mux for more details.
> > +
> > +
> > +3.3 PCI devices settings
> > +========================
> > +RoCE device exposes two functions - Ethernet and RDMA.
> > +To support it, pvrdma device is composed of two PCI functions, an Ethernet
> > +device of type vmxnet3 on PCI slot 0 and a pvrdma device on PCI slot 1. The
> > +Ethernet function can be used for other Ethernet purposes such as IP.
> > +
> > +
> > +3.4 Device parameters
> > +=====================
> > +- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
> > +  would be the Ethernet device used to create it. For any other physical
> > +  RoCE device this would be the netdev name of the device.
> 
> I didn't understand, can you please elaborate? We need the ibdev,
> this is clear, but what is the "ethernet device on host", how do
> we get it and how it is used?

The netdev is used to maintain the port's GID table.

A GID entry is added by assigning a new IPv6 address to the corresponding
Ethernet function; the opposite holds as well, i.e. removing an IPv6 address
from the Ethernet function deletes the corresponding GID from the GID table.

I wish there were a way to extract the netdev from a given ibdev (by means
of an API) but since there isn't - we must take it as a parameter.

> 
> Thanks,
> Marcel
> 
> > +- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
> > +- mad-chardev: The name of the MAD multiplexer char device.
> > +- ibport: In case of a multi-port device (such as Mellanox's HCA) this
> > +  specifies the port to use. If not set, 1 will be used.
> > +- dev-caps-max-mr-size: The maximum size of MR.
> > +- dev-caps-max-qp: Maximum number of QPs.
> > +- dev-caps-max-sge: Maximum number of SGE elements in WR.
> > +- dev-caps-max-cq: Maximum number of CQs.
> > +- dev-caps-max-mr: Maximum number of MRs.
> > +- dev-caps-max-pd: Maximum number of PDs.
> > +- dev-caps-max-ah: Maximum number of AHs.
> > +
> > +Notes:
> > +- The first 3 parameters are mandatory settings, the rest have their
> > +  defaults.
> > +- The last 8 parameters (the ones prefixed by dev-caps) define the upper
> > +  limits, but the final values are adjusted by the backend device's
> > +  limitations.
> > +
> > +3.5 Example
> > +===========
> > +Define bridge device with vmxnet3 network backend:
> > +<interface type='bridge'>
> > +  <mac address='56:b4:44:e9:62:dc'/>
> > +  <source bridge='bridge1'/>
> > +  <model type='vmxnet3'/>
> > +  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
> > +</interface>
> > +
> > +Define pvrdma device:
> > +<qemu:commandline>
> > +  <qemu:arg value='-object'/>
> > +  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
> > +  <qemu:arg value='-numa'/>
> > +  <qemu:arg value='node,memdev=mb1'/>
> > +  <qemu:arg value='-chardev'/>
> > +  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
> > +  <qemu:arg value='-device'/>
> > +  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
> > +</qemu:commandline>
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-17 12:48   ` Marcel Apfelbaum
@ 2018-11-18  8:13     ` Yuval Shaia
  0 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-18  8:13 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Sat, Nov 17, 2018 at 02:48:34PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/13/18 9:13 AM, Yuval Shaia wrote:
> > The control over the RDMA device's GID table is done by updating the
> > device's Ethernet function addresses.
> > Usually the first GID entry is determine by the MAC address, the second
> 
> s/determine/determined
> 
> > by the first IPv6 address and the third by the IPv4 address. Other
> > entries can be added by adding more IP addresses. The opposite is the
> > same, i.e. whenever an address is removed, the corresponding GID entry
> > is removed.
> > 
> > The process is done by the network and RDMA stacks. Whenever an address
> > is added the ib_core driver is notified and calls the device driver
> > add_gid function which in turn update the device.
> > 
> > To support this in pvrdma device we need to hook into the create_bind
> > and destroy_bind HW commands triggered by pvrdma driver in guest.
> > Whenever a changed is made to the pvrdma device's GID table a special
> without 'a'
> 
> > QMP messages is sent to be processed by libvirt to update the address of
> > the backend Ethernet device.
> 
> So the device can't be used without libvirt? How can we
> use it anyway only with QEMU ?

We can't.

(Actually we can, but it involves a hack: we have to know in advance the
IP addresses that will be assigned to the net device in the guest.)

> 
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_backend.c      | 243 +++++++++++++++++++++++-------------
> 
> rdma_backend.c is getting larger...

It's only 1261 lines; let's revisit it on the next major change.

> 
> >   hw/rdma/rdma_backend.h      |  22 ++--
> >   hw/rdma/rdma_backend_defs.h |   3 +-
> >   hw/rdma/rdma_rm.c           | 104 ++++++++++++++-
> >   hw/rdma/rdma_rm.h           |  17 ++-
> >   hw/rdma/rdma_rm_defs.h      |   9 +-
> >   hw/rdma/rdma_utils.h        |  15 +++
> >   hw/rdma/vmw/pvrdma.h        |   2 +-
> >   hw/rdma/vmw/pvrdma_cmd.c    |  55 ++++----
> >   hw/rdma/vmw/pvrdma_main.c   |  25 +---
> >   hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
> >   11 files changed, 370 insertions(+), 145 deletions(-)
> > 
> > diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> > index 3eb0099f8d..5675504165 100644
> > --- a/hw/rdma/rdma_backend.c
> > +++ b/hw/rdma/rdma_backend.c
> > @@ -18,12 +18,14 @@
> >   #include "qapi/error.h"
> >   #include "qapi/qmp/qlist.h"
> >   #include "qapi/qmp/qnum.h"
> > +#include "qapi/qapi-events-rdma.h"
> >   #include <infiniband/verbs.h>
> >   #include <infiniband/umad_types.h>
> >   #include <infiniband/umad.h>
> >   #include <rdma/rdma_user_cm.h>
> > +#include "contrib/rdmacm-mux/rdmacm-mux.h"
> >   #include "trace.h"
> >   #include "rdma_utils.h"
> >   #include "rdma_rm.h"
> > @@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
> >       return 0;
> >   }
> > -static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> > -                    uint32_t num_sge)
> > +static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
> > +                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
> >   {
> > -    struct backend_umad umad = {0};
> > -    char *hdr, *msg;
> > +    RdmaCmMuxMsg msg = {0};
> > +    char *hdr, *data;
> >       int ret;
> >       pr_dbg("num_sge=%d\n", num_sge);
> > @@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> >           return -EINVAL;
> >       }
> > -    umad.hdr.length = sge[0].length + sge[1].length;
> > -    pr_dbg("msg_len=%d\n", umad.hdr.length);
> > +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
> > +    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
> > -    if (umad.hdr.length > sizeof(umad.mad)) {
> > +    msg.umad_len = sge[0].length + sge[1].length;
> > +    pr_dbg("umad_len=%d\n", msg.umad_len);
> > +
> > +    if (msg.umad_len > sizeof(msg.umad.mad)) {
> >           return -ENOMEM;
> >       }
> > -    umad.hdr.addr.qpn = htobe32(1);
> > -    umad.hdr.addr.grh_present = 1;
> > -    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> > -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > -    umad.hdr.addr.hop_limit = 1;
> > +    msg.umad.hdr.addr.qpn = htobe32(1);
> > +    msg.umad.hdr.addr.grh_present = 1;
> > +    pr_dbg("sgid_idx=%d\n", sgid_idx);
> > +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
> > +    msg.umad.hdr.addr.gid_index = sgid_idx;
> > +    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
> > +    msg.umad.hdr.addr.hop_limit = 1;
> >       hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> > -    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > +    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > +
> > +    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
> > +    pr_dbg_buf("mad_data", data, sge[1].length);
> > -    memcpy(&umad.mad[0], hdr, sge[0].length);
> > -    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> > +    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
> > +    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
> > -    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> > +    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
> >       rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> > -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > -                            sizeof(umad));
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
> > +                            sizeof(msg));
> >       pr_dbg("qemu_chr_fe_write=%d\n", ret);
> > -    return (ret != sizeof(umad));
> > +    return (ret != sizeof(msg));
> >   }
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > -                            union ibv_gid *dgid, uint32_t dqpn,
> > -                            uint32_t dqkey, void *ctx)
> > +                            uint8_t sgid_idx, union ibv_gid *sgid,
> > +                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
> > +                            void *ctx)
> >   {
> >       BackendCtx *bctx;
> >       struct ibv_sge new_sge[MAX_SGE];
> > @@ -361,7 +372,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> >           } else if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            rc = mad_send(backend_dev, sge, num_sge);
> > +            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
> >               if (rc) {
> >                   comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> >               } else {
> > @@ -397,8 +408,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >       }
> >       if (qp_type == IBV_QPT_UD) {
> > -        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
> > -                                backend_dev->backend_gid_idx, dgid);
> > +        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
> >           if (!wr.wr.ud.ah) {
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
> >               goto out_dealloc_cqe_ctx;
> > @@ -703,9 +713,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >   }
> >   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> > -                              uint8_t qp_type, union ibv_gid *dgid,
> > -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> > -                              bool use_qkey)
> > +                              uint8_t qp_type, uint8_t sgid_idx,
> > +                              union ibv_gid *dgid, uint32_t dqpn,
> > +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
> >   {
> >       struct ibv_qp_attr attr = {0};
> >       union ibv_gid ibv_gid = {
> > @@ -717,13 +727,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >       attr.qp_state = IBV_QPS_RTR;
> >       attr_mask = IBV_QP_STATE;
> > +    qp->sgid_idx = sgid_idx;
> > +
> >       switch (qp_type) {
> >       case IBV_QPT_RC:
> >           pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
> >                  be64_to_cpu(ibv_gid.global.subnet_prefix),
> >                  be64_to_cpu(ibv_gid.global.interface_id));
> >           pr_dbg("dqpn=0x%x\n", dqpn);
> > -        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
> > +        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
> >           pr_dbg("sport_num=%d\n", backend_dev->port_num);
> >           pr_dbg("rq_psn=0x%x\n", rq_psn);
> > @@ -735,7 +747,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >           attr.ah_attr.is_global      = 1;
> >           attr.ah_attr.grh.hop_limit  = 1;
> >           attr.ah_attr.grh.dgid       = ibv_gid;
> > -        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
> > +        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
> >           attr.rq_psn                 = rq_psn;
> >           attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
> > @@ -744,8 +756,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >           break;
> >       case IBV_QPT_UD:
> > +        pr_dbg("qkey=0x%x\n", qkey);
> >           if (use_qkey) {
> > -            pr_dbg("qkey=0x%x\n", qkey);
> >               attr.qkey = qkey;
> >               attr_mask |= IBV_QP_QKEY;
> >           }
> > @@ -861,13 +873,13 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> >       grh->dgid = *my_gid;
> >       pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> > -    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> > -    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> > +    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
> > +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
> >   }
> >   static inline int mad_can_receieve(void *opaque)
> >   {
> > -    return sizeof(struct backend_umad);
> > +    return sizeof(RdmaCmMuxMsg);
> >   }
> >   static void mad_read(void *opaque, const uint8_t *buf, int size)
> > @@ -877,13 +889,13 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
> >       unsigned long cqe_ctx_id;
> >       BackendCtx *bctx;
> >       char *mad;
> > -    struct backend_umad *umad;
> > +    RdmaCmMuxMsg *msg;
> > -    assert(size != sizeof(umad));
> > -    umad = (struct backend_umad *)buf;
> > +    assert(size != sizeof(msg));
> > +    msg = (RdmaCmMuxMsg *)buf;
> >       pr_dbg("Got %d bytes\n", size);
> > -    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> > +    pr_dbg("umad_len=%d\n", msg->umad_len);
> >   #ifdef PVRDMA_DEBUG
> >       struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> > @@ -913,15 +925,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
> >       mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> >                              bctx->sge.length);
> > -    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> > +    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
> >           comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> >                        bctx->up_ctx);
> >       } else {
> > +        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
> >           memset(mad, 0, bctx->sge.length);
> >           build_mad_hdr((struct ibv_grh *)mad,
> > -                      (union ibv_gid *)&umad->hdr.addr.gid,
> > -                      &backend_dev->gid, umad->hdr.length);
> > -        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> > +                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
> > +                      msg->umad_len);
> > +        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
> >           rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> >           comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> > @@ -933,10 +946,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
> >   static int mad_init(RdmaBackendDev *backend_dev)
> >   {
> > -    struct backend_umad umad = {0};
> >       int ret;
> > -    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> > +    ret = qemu_chr_fe_backend_connected(backend_dev->mad_chr_be);
> > +    if (!ret) {
> >           pr_dbg("Missing chardev for MAD multiplexer\n");
> 
> This may be an error, not a debug message.

Yeah, it is, and the "return -EIO" below makes sure the layer above
decides how to handle it (spoiler: it will exit).

> 
> >           return -EIO;
> >       }
> > @@ -944,14 +957,6 @@ static int mad_init(RdmaBackendDev *backend_dev)
> >       qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> >                                mad_read, NULL, NULL, backend_dev, NULL, true);
> > -    /* Register ourself */
> > -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > -                            sizeof(umad.hdr));
> > -    if (ret != sizeof(umad.hdr)) {
> > -        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> > -    }
> > -
> >       qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> >       backend_dev->recv_mads_list.list = qlist_new();
> > @@ -988,23 +993,120 @@ static void mad_fini(RdmaBackendDev *backend_dev)
> >       qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> >   }
> > +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> > +                               union ibv_gid *gid)
> > +{
> > +    union ibv_gid sgid;
> > +    int ret;
> > +    int i = 0;
> > +
> > +    pr_dbg("0x%llx, 0x%llx\n",
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > +
> > +    do {
> > +        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
> > +                            &sgid);
> > +        i++;
> > +    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
> > +
> > +    pr_dbg("gid_index=%d\n", i - 1);
> > +
> > +    return ret ? ret : i - 1;
> > +}
> > +
> > +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid)
> > +{
> > +    RdmaCmMuxMsg msg = {0};
> > +    int ret;
> > +
> > +    pr_dbg("0x%llx, 0x%llx\n",
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > +
> > +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
> > +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
> > +                            sizeof(msg));
> > +    if (ret != sizeof(msg)) {
> > +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > +
> > +    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
> > +                            sizeof(msg));
> > +    if (ret != sizeof(msg)) {
> > +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > +
> > +    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
> > +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", msg.hdr.err_code);
> > +        return -EIO;
> > +    }
> > +
> > +    qapi_event_send_rdma_gid_status_changed(ifname, true,
> > +                                            gid->global.subnet_prefix,
> > +                                            gid->global.interface_id);
> > +
> > +    return ret;
> > +}
> > +
> > +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid)
> > +{
> > +    RdmaCmMuxMsg msg = {0};
> > +    int ret;
> > +
> > +    pr_dbg("0x%llx, 0x%llx\n",
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > +
> > +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_UNREG;
> > +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
> > +                            sizeof(msg));
> > +    if (ret != sizeof(msg)) {
> > +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > +
> > +    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
> > +                            sizeof(msg));
> > +    if (ret != sizeof(msg)) {
> > +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > +
> > +    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
> > +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n",
> > +               msg.hdr.err_code);
> > +        return -EIO;
> > +    }
> > +
> > +    qapi_event_send_rdma_gid_status_changed(ifname, false,
> > +                                            gid->global.subnet_prefix,
> > +                                            gid->global.interface_id);
> > +
> > +    return 0;
> > +}
> > +
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> > -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      CharBackend *mad_chr_be, Error **errp)
> > +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> > +                      Error **errp)
> >   {
> >       int i;
> >       int ret = 0;
> >       int num_ibv_devices;
> >       struct ibv_device **dev_list;
> > -    struct ibv_port_attr port_attr;
> >       memset(backend_dev, 0, sizeof(*backend_dev));
> >       backend_dev->dev = pdev;
> >       backend_dev->mad_chr_be = mad_chr_be;
> > -    backend_dev->backend_gid_idx = backend_gid_idx;
> >       backend_dev->port_num = port_num;
> >       backend_dev->rdma_dev_res = rdma_dev_res;
> > @@ -1041,9 +1143,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >           backend_dev->ib_dev = *dev_list;
> >       }
> > -    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
> > -           ibv_get_device_name(backend_dev->ib_dev),
> > -           backend_dev->port_num, backend_dev->backend_gid_idx);
> > +    pr_dbg("Using backend device %s, port %d\n",
> > +           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
> >       backend_dev->context = ibv_open_device(backend_dev->ib_dev);
> >       if (!backend_dev->context) {
> > @@ -1060,20 +1161,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       }
> >       pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
> > -    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
> > -                         &port_attr);
> > -    if (ret) {
> > -        error_setg(errp, "Error %d from ibv_query_port", ret);
> > -        ret = -EIO;
> > -        goto out_destroy_comm_channel;
> > -    }
> > -
> > -    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
> > -        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
> > -                   port_attr.gid_tbl_len);
> > -        goto out_destroy_comm_channel;
> > -    }
> > -
> >       ret = init_device_caps(backend_dev, dev_attr);
> >       if (ret) {
> >           error_setg(errp, "Failed to initialize device capabilities");
> > @@ -1081,18 +1168,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >           goto out_destroy_comm_channel;
> >       }
> > -    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
> > -                         backend_dev->backend_gid_idx, &backend_dev->gid);
> > -    if (ret) {
> > -        error_setg(errp, "Failed to query gid %d",
> > -                   backend_dev->backend_gid_idx);
> > -        ret = -EIO;
> > -        goto out_destroy_comm_channel;
> > -    }
> > -    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
> > -           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
> > -    pr_dbg("interface_id=0x%" PRIx64 "\n",
> > -           be64_to_cpu(backend_dev->gid.global.interface_id));
> >       ret = mad_init(backend_dev);
> >       if (ret) {
> > diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> > index fc83330251..59ad2b874b 100644
> > --- a/hw/rdma/rdma_backend.h
> > +++ b/hw/rdma/rdma_backend.h
> > @@ -28,11 +28,6 @@ enum ibv_special_qp_type {
> >       IBV_QPT_GSI = 1,
> >   };
> > -static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
> > -{
> > -    return &dev->gid;
> > -}
> > -
> >   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
> >   {
> >       return qp->ibqp ? qp->ibqp->qp_num : 1;
> > @@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> > -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      CharBackend *mad_chr_be, Error **errp);
> > +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> > +                      Error **errp);
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> > +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid);
> > +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid);
> > +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> > +                               union ibv_gid *gid);
> >   void rdma_backend_start(RdmaBackendDev *backend_dev);
> >   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> >   void rdma_backend_register_comp_handler(void (*handler)(int status,
> > @@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> >   int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >                                  uint8_t qp_type, uint32_t qkey);
> >   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> > -                              uint8_t qp_type, union ibv_gid *dgid,
> > -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> > -                              bool use_qkey);
> > +                              uint8_t qp_type, uint8_t sgid_idx,
> > +                              union ibv_gid *dgid, uint32_t dqpn,
> > +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
> >   int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
> >                                 uint32_t sq_psn, uint32_t qkey, bool use_qkey);
> >   int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
> > @@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > +                            uint8_t sgid_idx, union ibv_gid *sgid,
> >                               union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
> >                               void *ctx);
> >   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> > diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> > index 2a7e667075..ff8b2426a0 100644
> > --- a/hw/rdma/rdma_backend_defs.h
> > +++ b/hw/rdma/rdma_backend_defs.h
> > @@ -37,14 +37,12 @@ typedef struct RecvMadList {
> >   typedef struct RdmaBackendDev {
> >       struct ibv_device_attr dev_attr;
> >       RdmaBackendThread comp_thread;
> > -    union ibv_gid gid;
> >       PCIDevice *dev;
> >       RdmaDeviceResources *rdma_dev_res;
> >       struct ibv_device *ib_dev;
> >       struct ibv_context *context;
> >       struct ibv_comp_channel *channel;
> >       uint8_t port_num;
> > -    uint8_t backend_gid_idx;
> >       RecvMadList recv_mads_list;
> >       CharBackend *mad_chr_be;
> >   } RdmaBackendDev;
> > @@ -66,6 +64,7 @@ typedef struct RdmaBackendCQ {
> >   typedef struct RdmaBackendQP {
> >       struct ibv_pd *ibpd;
> >       struct ibv_qp *ibqp;
> > +    uint8_t sgid_idx;
> >   } RdmaBackendQP;
> >   #endif
> > diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> > index 4f10fcabcc..fe0979415d 100644
> > --- a/hw/rdma/rdma_rm.c
> > +++ b/hw/rdma/rdma_rm.c
> > @@ -391,7 +391,7 @@ out_dealloc_qp:
> >   }
> >   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > -                      uint32_t qp_handle, uint32_t attr_mask,
> > +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
> >                         union ibv_gid *dgid, uint32_t dqpn,
> >                         enum ibv_qp_state qp_state, uint32_t qkey,
> >                         uint32_t rq_psn, uint32_t sq_psn)
> > @@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >       int ret;
> >       pr_dbg("qpn=0x%x\n", qp_handle);
> > +    pr_dbg("qkey=0x%x\n", qkey);
> >       qp = rdma_rm_get_qp(dev_res, qp_handle);
> >       if (!qp) {
> > @@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >           }
> >           if (qp->qp_state == IBV_QPS_RTR) {
> > +            /* Get backend gid index */
> > +            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
> > +            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
> > +                                                     sgid_idx);
> > +            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
> > +                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
> > +                return -EIO;
> > +            }
> > +
> >               ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
> > -                                            qp->qp_type, dgid, dqpn, rq_psn,
> > -                                            qkey, attr_mask & IBV_QP_QKEY);
> > +                                            qp->qp_type, sgid_idx, dgid, dqpn,
> > +                                            rq_psn, qkey,
> > +                                            attr_mask & IBV_QP_QKEY);
> >               if (ret) {
> >                   return -EIO;
> >               }
> > @@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
> >       res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
> >   }
> > +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, union ibv_gid *gid, int gid_idx)
> > +{
> > +    int rc;
> > +
> > +    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
> > +    if (rc <= 0) {
> > +        pr_dbg("Fail to add gid\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
> 
> A previous patch removed multiple ports support, why do we
> have ports[0] ?

The patch you are referring to is patch 18, while here we are only at 11.

Rebasing it to before patch 11 seems a very complex and needless task, so
I left it as a patch on top of all the other MAD-related patches.

That change is not related to the MAD patchset, but I thought we could
take it anyway.

> 
> > +
> > +    return 0;
> > +}
> > +
> > +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, int gid_idx)
> > +{
> > +    int rc;
> > +
> > +    rc = rdma_backend_del_gid(backend_dev, ifname,
> > +                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
> > +    if (rc < 0) {
> > +        pr_dbg("Fail to delete gid\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
> > +           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
> > +    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
> 
> Same question as above.
> > +
> > +    return 0;
> > +}
> > +
> > +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> > +                                  RdmaBackendDev *backend_dev, int sgid_idx)
> > +{
> > +    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
> > +        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
> > +        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
> > +        rdma_backend_get_gid_index(backend_dev,
> > +                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
> > +    }
> > +
> > +    pr_dbg("backend_gid_index=%d\n",
> > +           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
> > +
> > +    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
> > +}
> > +
> >   static void destroy_qp_hash_key(gpointer data)
> >   {
> >       g_bytes_unref(data);
> >   }
> > +static void init_ports(RdmaDeviceResources *dev_res)
> > +{
> > +    int i, j;
> > +
> > +    memset(dev_res->ports, 0, sizeof(dev_res->ports));
> > +
> > +    for (i = 0; i < MAX_PORTS; i++) {
> > +        dev_res->ports[i].state = IBV_PORT_DOWN;
> 
> I might have missed something regarding the ports support,
> can you please clarify for me?
> 
> Thanks,
> Marcel
> 
> > +        for (j = 0; j < MAX_PORT_GIDS; j++) {
> > +            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
> > +        }
> > +    }
> > +}
> > +
> > +static void fini_ports(RdmaDeviceResources *dev_res,
> > +                       RdmaBackendDev *backend_dev, const char *ifname)
> > +{
> > +    int i;
> > +
> > +    dev_res->ports[0].state = IBV_PORT_DOWN;
> > +    for (i = 0; i < MAX_PORT_GIDS; i++) {
> > +        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
> > +    }
> > +}
> > +
> >   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
> >                    Error **errp)
> >   {
> > @@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
> >                          dev_attr->max_qp_wr, sizeof(void *));
> >       res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
> > +    init_ports(dev_res);
> > +
> >       return 0;
> >   }
> > -void rdma_rm_fini(RdmaDeviceResources *dev_res)
> > +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                  const char *ifname)
> >   {
> > +    fini_ports(dev_res, backend_dev, ifname);
> > +
> >       res_tbl_free(&dev_res->uc_tbl);
> >       res_tbl_free(&dev_res->cqe_ctx_tbl);
> >       res_tbl_free(&dev_res->qp_tbl);
> > diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
> > index b4e04cc7b4..a7169b4e89 100644
> > --- a/hw/rdma/rdma_rm.h
> > +++ b/hw/rdma/rdma_rm.h
> > @@ -22,7 +22,8 @@
> >   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
> >                    Error **errp);
> > -void rdma_rm_fini(RdmaDeviceResources *dev_res);
> > +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                  const char *ifname);
> >   int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >                        uint32_t *pd_handle, uint32_t ctx_handle);
> > @@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
> >                        uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
> >   RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
> >   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > -                      uint32_t qp_handle, uint32_t attr_mask,
> > +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
> >                         union ibv_gid *dgid, uint32_t dqpn,
> >                         enum ibv_qp_state qp_state, uint32_t qkey,
> >                         uint32_t rq_psn, uint32_t sq_psn);
> > @@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
> >   void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
> >   void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
> > +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, union ibv_gid *gid, int gid_idx);
> > +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, int gid_idx);
> > +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> > +                                  RdmaBackendDev *backend_dev, int sgid_idx);
> > +static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
> > +                                             int sgid_idx)
> > +{
> > +    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
> > +}
> > +
> >   #endif
> > diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
> > index 9b399063d3..7b3435f991 100644
> > --- a/hw/rdma/rdma_rm_defs.h
> > +++ b/hw/rdma/rdma_rm_defs.h
> > @@ -19,7 +19,7 @@
> >   #include "rdma_backend_defs.h"
> >   #define MAX_PORTS             1
> > -#define MAX_PORT_GIDS         1
> > +#define MAX_PORT_GIDS         255
> >   #define MAX_GIDS              MAX_PORT_GIDS
> >   #define MAX_PORT_PKEYS        1
> >   #define MAX_PKEYS             MAX_PORT_PKEYS
> > @@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
> >       enum ibv_qp_state qp_state;
> >   } RdmaRmQP;
> > +typedef struct RdmaRmGid {
> > +    union ibv_gid gid;
> > +    int backend_gid_index;
> > +} RdmaRmGid;
> > +
> >   typedef struct RdmaRmPort {
> > -    union ibv_gid gid_tbl[MAX_PORT_GIDS];
> > +    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
> >       enum ibv_port_state state;
> >   } RdmaRmPort;
> > diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
> > index 04c7c2ef5b..989db249ef 100644
> > --- a/hw/rdma/rdma_utils.h
> > +++ b/hw/rdma/rdma_utils.h
> > @@ -20,6 +20,7 @@
> >   #include "qemu/osdep.h"
> >   #include "hw/pci/pci.h"
> >   #include "sysemu/dma.h"
> > +#include "stdio.h"
> >   #define pr_info(fmt, ...) \
> >       fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
> > @@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
> >   #define pr_dbg(fmt, ...) \
> >       fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
> >               __func__, __LINE__, ## __VA_ARGS__)
> > +
> > +#define pr_dbg_buf(title, buf, len) \
> > +{ \
> > +    char *b = g_malloc0(len * 3 + 1); \
> > +    char b1[4]; \
> > +    for (int i = 0; i < len; i++) { \
> > +        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
> > +        strcat(b, b1); \
> > +    } \
> > +    pr_dbg("%s (%d): %s\n", title, len, b); \
> > +    g_free(b); \
> > +}
> > +
> >   #else
> >   #define init_pr_dbg(void)
> >   #define pr_dbg(fmt, ...)
> > +#define pr_dbg_buf(title, buf, len)
> >   #endif
> >   void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
> > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > index 15c3f28b86..b019cb843a 100644
> > --- a/hw/rdma/vmw/pvrdma.h
> > +++ b/hw/rdma/vmw/pvrdma.h
> > @@ -79,8 +79,8 @@ typedef struct PVRDMADev {
> >       int interrupt_mask;
> >       struct ibv_device_attr dev_attr;
> >       uint64_t node_guid;
> > +    char *backend_eth_device_name;
> >       char *backend_device_name;
> > -    uint8_t backend_gid_idx;
> >       uint8_t backend_port_num;
> >       RdmaBackendDev backend_dev;
> >       RdmaDeviceResources rdma_dev_res;
> > diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> > index 57d6f41ae6..a334f6205e 100644
> > --- a/hw/rdma/vmw/pvrdma_cmd.c
> > +++ b/hw/rdma/vmw/pvrdma_cmd.c
> > @@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rsp->hdr.response = cmd->hdr.response;
> >       rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
> > -    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
> > -                                 cmd->qp_handle, cmd->attr_mask,
> > -                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> > -                                 cmd->attrs.dest_qp_num,
> > -                                 (enum ibv_qp_state)cmd->attrs.qp_state,
> > -                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
> > -                                 cmd->attrs.sq_psn);
> > +    /* No need to verify sgid_index since it is u8 */
> > +
> > +    rsp->hdr.err =
> > +        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
> > +                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
> > +                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> > +                          cmd->attrs.dest_qp_num,
> > +                          (enum ibv_qp_state)cmd->attrs.qp_state,
> > +                          cmd->attrs.qkey, cmd->attrs.rq_psn,
> > +                          cmd->attrs.sq_psn);
> >       pr_dbg("ret=%d\n", rsp->hdr.err);
> >       return rsp->hdr.err;
> > @@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >                          union pvrdma_cmd_resp *rsp)
> >   {
> >       struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
> > -#ifdef PVRDMA_DEBUG
> > -    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
> > -    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
> > -#endif
> > +    int rc;
> > +    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
> >       pr_dbg("index=%d\n", cmd->index);
> > @@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       }
> >       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> > -           (long long unsigned int)be64_to_cpu(*subnet),
> > -           (long long unsigned int)be64_to_cpu(*if_id));
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > -    /* Driver forces to one port only */
> > -    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
> > -           sizeof(cmd->new_gid));
> > +    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
> > +                         dev->backend_eth_device_name, gid, cmd->index);
> > +    if (rc < 0) {
> > +        return -EINVAL;
> > +    }
> >       /* TODO: Since drivers stores node_guid at load_dsr phase then this
> >        * assignment is not relevant, i need to figure out a way how to
> >        * retrieve MAC of our netdev */
> > -    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
> > -    pr_dbg("dev->node_guid=0x%llx\n",
> > -           (long long unsigned int)be64_to_cpu(dev->node_guid));
> > +    if (!cmd->index) {
> > +        dev->node_guid =
> > +            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
> > +        pr_dbg("dev->node_guid=0x%llx\n",
> > +               (long long unsigned int)be64_to_cpu(dev->node_guid));
> > +    }
> >       return 0;
> >   }
> > @@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >                           union pvrdma_cmd_resp *rsp)
> >   {
> > +    int rc;
> > +
> >       struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
> >       pr_dbg("index=%d\n", cmd->index);
> > @@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >           return -EINVAL;
> >       }
> > -    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
> > -           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
> > +    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> > +                        dev->backend_eth_device_name, cmd->index);
> > +
> > +    if (rc < 0) {
> > +        rsp->hdr.err = rc;
> > +        goto out;
> > +    }
> >       return 0;
> >   }
> > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > index fc2abd34af..ac8c092db0 100644
> > --- a/hw/rdma/vmw/pvrdma_main.c
> > +++ b/hw/rdma/vmw/pvrdma_main.c
> > @@ -36,9 +36,9 @@
> >   #include "pvrdma_qp_ops.h"
> >   static Property pvrdma_dev_properties[] = {
> > -    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
> > -    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
> > -    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
> > +    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
> > +    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
> > +    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
> >       DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
> >                          MAX_MR_SIZE),
> >       DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
> > @@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
> >       pr_dbg("Initialized\n");
> >   }
> > -static void init_ports(PVRDMADev *dev, Error **errp)
> > -{
> > -    int i;
> > -
> > -    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
> > -
> > -    for (i = 0; i < MAX_PORTS; i++) {
> > -        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
> > -    }
> > -}
> > -
> >   static void uninit_msix(PCIDevice *pdev, int used_vectors)
> >   {
> >       PVRDMADev *dev = PVRDMA_DEV(pdev);
> > @@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
> >       pvrdma_qp_ops_fini();
> > -    rdma_rm_fini(&dev->rdma_dev_res);
> > +    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
> > +                 dev->backend_eth_device_name);
> >       rdma_backend_fini(&dev->backend_dev);
> > @@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
> >                              dev->backend_device_name, dev->backend_port_num,
> > -                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> > -                           errp);
> > +                           &dev->dev_attr, &dev->mad_chr, errp);
> >       if (rc) {
> >           goto out;
> >       }
> > @@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >           goto out;
> >       }
> > -    init_ports(dev, errp);
> > -
> >       rc = pvrdma_qp_ops_init();
> >       if (rc) {
> >           goto out;
> > diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> > index 3388be1926..2130824098 100644
> > --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> > +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> > @@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
> >       RdmaRmQP *qp;
> >       PvrdmaSqWqe *wqe;
> >       PvrdmaRing *ring;
> > +    int sgid_idx;
> > +    union ibv_gid *sgid;
> >       pr_dbg("qp_handle=0x%x\n", qp_handle);
> > @@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
> >           comp_ctx->cqe.qp = qp_handle;
> >           comp_ctx->cqe.opcode = IBV_WC_SEND;
> > +        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
> > +        if (!sgid) {
> > +            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
> > +            return -EIO;
> > +        }
> > +        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
> > +               sgid->global.interface_id);
> > +
> > +        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
> > +                                                 &dev->backend_dev,
> > +                                                 wqe->hdr.wr.ud.av.gid_index);
> > +        if (sgid_idx <= 0) {
> > +            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
> > +                   wqe->hdr.wr.ud.av.gid_index);
> > +            return -EIO;
> > +        }
> > +
> >           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
> >                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
> > +                               sgid_idx, sgid,
> >                                  (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
> >                                  wqe->hdr.wr.ud.remote_qpn,
> >                                  wqe->hdr.wr.ud.remote_qkey, comp_ctx);
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response
  2018-11-17 12:22   ` Marcel Apfelbaum
@ 2018-11-18  8:24     ` Yuval Shaia
  2018-11-25  7:35       ` Marcel Apfelbaum
  0 siblings, 1 reply; 70+ messages in thread
From: Yuval Shaia @ 2018-11-18  8:24 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Sat, Nov 17, 2018 at 02:22:31PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/13/18 9:13 AM, Yuval Shaia wrote:
> > Driver checks error code let's set it.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
> >   1 file changed, 48 insertions(+), 19 deletions(-)
> > 
> > diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> > index 0d3c818c20..a326c5d470 100644
> > --- a/hw/rdma/vmw/pvrdma_cmd.c
> > +++ b/hw/rdma/vmw/pvrdma_cmd.c
> > @@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       if (rdma_backend_query_port(&dev->backend_dev,
> >                                   (struct ibv_port_attr *)&attrs)) {
> > -        return -ENOMEM;
> > +        resp->hdr.err = -ENOMEM;
> > +        goto out;
> >       }
> >       memset(resp, 0, sizeof(*resp));
> > @@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       resp->attrs.active_width = 1;
> >       resp->attrs.active_speed = 1;
> > -    return 0;
> > +out:
> > +    pr_dbg("ret=%d\n", resp->hdr.err);
> > +    return resp->hdr.err;
> >   }
> >   static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       resp->pkey = PVRDMA_PKEY;
> >       pr_dbg("pkey=0x%x\n", resp->pkey);
> > -    return 0;
> > +    return resp->hdr.err;
> >   }
> >   static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
> > -    return 0;
> > +    rsp->hdr.err = 0;
> 
> Is it possible to ensure err is 0 by default during hdr creation
> instead of manually setting it every time?

Yes, we can, but since each of these handlers already fills some other
fields in the response, I thought it would be cleaner if they filled the
op-status as well.
I believe filling it at the "handler" level is more modular.
Do you think filling it outside would make the code cleaner?
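For illustration, the two options being weighed can be sketched in plain C. This is a simplified stand-in, not the actual pvrdma structures or handlers (the names rsp_hdr, destroy_pd_a, dispatch_b below are hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified response header mirroring the discussion. */
struct rsp_hdr { int err; };
struct rsp { struct rsp_hdr hdr; };

/* Approach A (as in the patch): each handler fills the op-status itself,
 * alongside the other response fields it already fills. */
static int destroy_pd_a(struct rsp *rsp)
{
    /* ... release resources ... */
    rsp->hdr.err = 0;
    return rsp->hdr.err;
}

/* Approach B (Marcel's suggestion): the dispatcher zeroes the header once,
 * so handlers only need to set err on failure. */
static int dispatch_b(struct rsp *rsp, int (*handler)(struct rsp *))
{
    memset(&rsp->hdr, 0, sizeof(rsp->hdr)); /* err defaults to 0 */
    return handler(rsp);
}

static int destroy_pd_b(struct rsp *rsp)
{
    /* ... release resources; nothing to set on success ... */
    return rsp->hdr.err;
}
```

Both produce the same wire result; the trade-off is per-handler explicitness versus one central default.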

> 
> Thanks,
> Marcel
> 
> > +
> > +    return rsp->hdr.err;
> >   }
> >   static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
> > -    return 0;
> > +    rsp->hdr.err = 0;
> > +
> > +    return rsp->hdr.err;
> >   }
> >   static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
> > @@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
> >       if (!cq) {
> >           pr_dbg("Invalid CQ handle\n");
> > -        return -EINVAL;
> > +        rsp->hdr.err = -EINVAL;
> > +        goto out;
> >       }
> >       ring = (PvrdmaRing *)cq->opaque;
> > @@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
> > -    return 0;
> > +    rsp->hdr.err = 0;
> > +
> > +out:
> > +    pr_dbg("ret=%d\n", rsp->hdr.err);
> > +    return rsp->hdr.err;
> >   }
> >   static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
> > @@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
> >       if (!qp) {
> >           pr_dbg("Invalid QP handle\n");
> > -        return -EINVAL;
> > +        rsp->hdr.err = -EINVAL;
> > +        goto out;
> >       }
> >       rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
> > @@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
> >       g_free(ring);
> > -    return 0;
> > +    rsp->hdr.err = 0;
> > +
> > +out:
> > +    pr_dbg("ret=%d\n", rsp->hdr.err);
> > +    return rsp->hdr.err;
> >   }
> >   static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       pr_dbg("index=%d\n", cmd->index);
> >       if (cmd->index >= MAX_PORT_GIDS) {
> > -        return -EINVAL;
> > +        rsp->hdr.err = -EINVAL;
> > +        goto out;
> >       }
> >       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> > @@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
> >                            dev->backend_eth_device_name, gid, cmd->index);
> >       if (rc < 0) {
> > -        return -EINVAL;
> > +        rsp->hdr.err = rc;
> > +        goto out;
> >       }
> > -    return 0;
> > +    rsp->hdr.err = 0;
> > +
> > +out:
> > +    pr_dbg("ret=%d\n", rsp->hdr.err);
> > +    return rsp->hdr.err;
> >   }
> >   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       pr_dbg("index=%d\n", cmd->index);
> >       if (cmd->index >= MAX_PORT_GIDS) {
> > -        return -EINVAL;
> > +        rsp->hdr.err = -EINVAL;
> > +        goto out;
> >       }
> >       rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> > @@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >           goto out;
> >       }
> > -    return 0;
> > +    rsp->hdr.err = 0;
> > +
> > +out:
> > +    pr_dbg("ret=%d\n", rsp->hdr.err);
> > +    return rsp->hdr.err;
> >   }
> >   static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
> >                                        &resp->ctx_handle);
> > -    pr_dbg("ret=%d\n", resp->hdr.err);
> > -
> > -    return 0;
> > +    pr_dbg("ret=%d\n", rsp->hdr.err);
> > +    return rsp->hdr.err;
> >   }
> >   static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> > @@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
> > -    return 0;
> > +    rsp->hdr.err = 0;
> > +
> > +    return rsp->hdr.err;
> >   }
> >   struct cmd_handler {
> >       uint32_t cmd;
> > @@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
> >       }
> >       err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
> > -                            dsr_info->rsp);
> > +                                                    dsr_info->rsp);
> >   out:
> >       set_reg_val(dev, PVRDMA_REG_ERR, err);
> >       post_interrupt(dev, INTR_VEC_CMD_RING);
> 


* Re: [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets
  2018-11-17 12:06   ` Marcel Apfelbaum
@ 2018-11-18  9:33     ` Yuval Shaia
  0 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-18  9:33 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Sat, Nov 17, 2018 at 02:06:48PM +0200, Marcel Apfelbaum wrote:
> Hi Yuval,
> 
> On 11/13/18 9:12 AM, Yuval Shaia wrote:
> > MAD (Management Datagram) packets are widely used by various modules
> 
> Please add a link to Spec, I sent it in the V1 mail-thread
> Please add it also as a comment in the code. I know MADs
> are a complicated matter, but if somebody wants to have a look...

Added in v4, thanks.

> 
> > both in kernel and in user space for example the rdma_* API which is
> > used to create and maintain "connection" layer on top of RDMA uses
> > several types of MAD packets.
> > To support MAD packets the device uses an external utility
> > (contrib/rdmacm-mux) to relay packets from and to the guest driver.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
> >   hw/rdma/rdma_backend.h      |   4 +-
> >   hw/rdma/rdma_backend_defs.h |  10 +-
> >   hw/rdma/vmw/pvrdma.h        |   2 +
> >   hw/rdma/vmw/pvrdma_main.c   |   4 +-
> >   5 files changed, 273 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> > index 1e148398a2..3eb0099f8d 100644
> > --- a/hw/rdma/rdma_backend.c
> > +++ b/hw/rdma/rdma_backend.c
> 
> rdma_backend is getting huge, have you consider taking out
> mad related code?

It is only 1261 lines; is that so big? Compared to other files under hw/ I
don't see much difference here.

> > @@ -16,8 +16,13 @@
> >   #include "qemu/osdep.h"
> >   #include "qemu/error-report.h"
> >   #include "qapi/error.h"
> > +#include "qapi/qmp/qlist.h"
> > +#include "qapi/qmp/qnum.h"
> >   #include <infiniband/verbs.h>
> > +#include <infiniband/umad_types.h>
> > +#include <infiniband/umad.h>
> > +#include <rdma/rdma_user_cm.h>
> >   #include "trace.h"
> >   #include "rdma_utils.h"
> > @@ -33,16 +38,25 @@
> >   #define VENDOR_ERR_MAD_SEND         0x206
> >   #define VENDOR_ERR_INVLKEY          0x207
> >   #define VENDOR_ERR_MR_SMALL         0x208
> > +#define VENDOR_ERR_INV_MAD_BUFF     0x209
> > +#define VENDOR_ERR_INV_NUM_SGE      0x210
> >   #define THR_NAME_LEN 16
> >   #define THR_POLL_TO  5000
> > +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
> > +
> >   typedef struct BackendCtx {
> > -    uint64_t req_id;
> >       void *up_ctx;
> >       bool is_tx_req;
> > +    struct ibv_sge sge; /* Used to save MAD recv buffer */
> >   } BackendCtx;
> > +struct backend_umad {
> > +    struct ib_user_mad hdr;
> > +    char mad[RDMA_MAX_PRIVATE_DATA];
> > +};
> > +
> >   static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
> >   static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> > @@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
> >       return 0;
> >   }
> > +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> > +                    uint32_t num_sge)
> > +{
> > +    struct backend_umad umad = {0};
> > +    char *hdr, *msg;
> > +    int ret;
> > +
> > +    pr_dbg("num_sge=%d\n", num_sge);
> > +
> > +    if (num_sge != 2) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    umad.hdr.length = sge[0].length + sge[1].length;
> > +    pr_dbg("msg_len=%d\n", umad.hdr.length);
> > +
> > +    if (umad.hdr.length > sizeof(umad.mad)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umad.hdr.addr.qpn = htobe32(1);
> > +    umad.hdr.addr.grh_present = 1;
> > +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> > +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > +    umad.hdr.addr.hop_limit = 1;
> > +
> > +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> > +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > +
> 
> If rdma_pci_dma_map fails it will return NULL ....

Done.
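The usual shape of such a fix is to check both mappings before touching either buffer and to propagate an error instead of dereferencing NULL. A minimal self-contained sketch of that pattern (dma_map/dma_unmap and mad_send_sketch are stand-ins for rdma_pci_dma_map()/rdma_pci_dma_unmap() and the real mad_send(); this is not the v4 code itself):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for rdma_pci_dma_map()/rdma_pci_dma_unmap():
 * returns NULL on failure, like the real mapping helper. */
static char buf_hdr[16] = "hdr", buf_msg[16] = "msg";
static void *dma_map(int which)
{
    return which == 0 ? buf_hdr : (which == 1 ? buf_msg : NULL);
}
static void dma_unmap(void *p) { (void)p; }

/* Check both mappings before copying; unmap whatever did succeed
 * and return an error rather than crash on a NULL pointer. */
static int mad_send_sketch(int hdr_id, int msg_id, char *out, size_t outlen)
{
    void *hdr = dma_map(hdr_id);
    void *msg = dma_map(msg_id);

    if (!hdr || !msg) {
        if (hdr) {
            dma_unmap(hdr);
        }
        if (msg) {
            dma_unmap(msg);
        }
        return -ENOMEM;
    }
    snprintf(out, outlen, "%s%s", (char *)hdr, (char *)msg);
    dma_unmap(msg);
    dma_unmap(hdr);
    return 0;
}
```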

> 
> > +    memcpy(&umad.mad[0], hdr, sge[0].length);
> > +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> > +
> 
> ... and here we access a NULL pointer.
> Maybe is possible to return some error here.
> > +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> > +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> > +
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > +                            sizeof(umad));
> > +
> > +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> > +
> > +    return (ret != sizeof(umad));
> > +}
> > +
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > @@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> >           } else if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            rc = mad_send(backend_dev, sge, num_sge);
> > +            if (rc) {
> > +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            } else {
> > +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> > +            }
> >           }
> > -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
> >           return;
> >       }
> > @@ -370,6 +431,48 @@ out_free_bctx:
> >       g_free(bctx);
> >   }
> > +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
> > +                                         struct ibv_sge *sge, uint32_t num_sge,
> > +                                         void *ctx)
> > +{
> > +    BackendCtx *bctx;
> > +    int rc;
> > +    uint32_t bctx_id;
> > +
> > +    if (num_sge != 1) {
> > +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
> > +        return VENDOR_ERR_INV_NUM_SGE;
> > +    }
> > +
> > +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
> > +        pr_dbg("Too small buffer for MAD\n");
> > +        return VENDOR_ERR_INV_MAD_BUFF;
> > +    }
> > +
> > +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
> > +    pr_dbg("length=%d\n", sge[0].length);
> > +    pr_dbg("lkey=%d\n", sge[0].lkey);
> > +
> > +    bctx = g_malloc0(sizeof(*bctx));
> > +
> > +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
> > +    if (unlikely(rc)) {
> > +        g_free(bctx);
> > +        pr_dbg("Fail to allocate cqe_ctx\n");
> > +        return VENDOR_ERR_NOMEM;
> > +    }
> > +
> > +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
> > +    bctx->up_ctx = ctx;
> > +    bctx->sge = *sge;
> > +
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +
> > +    return 0;
> > +}
> > +
> >   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> >                               RdmaDeviceResources *rdma_dev_res,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> > @@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> >           }
> >           if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
> > +            if (rc) {
> > +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> > +            }
> >           }
> >           return;
> >       }
> > @@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> >       switch (qp_type) {
> >       case IBV_QPT_GSI:
> > -        pr_dbg("QP1 unsupported\n");
> >           return 0;
> >       case IBV_QPT_RC:
> > @@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
> >       return 0;
> >   }
> > +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> > +                                 union ibv_gid *my_gid, int paylen)
> > +{
> > +    grh->paylen = htons(paylen);
> > +    grh->sgid = *sgid;
> > +    grh->dgid = *my_gid;
> > +
> > +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> > +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> > +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> > +}
> > +
> > +static inline int mad_can_receieve(void *opaque)
> > +{
> > +    return sizeof(struct backend_umad);
> > +}
> > +
> > +static void mad_read(void *opaque, const uint8_t *buf, int size)
> > +{
> > +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> > +    QObject *o_ctx_id;
> > +    unsigned long cqe_ctx_id;
> > +    BackendCtx *bctx;
> > +    char *mad;
> > +    struct backend_umad *umad;
> > +
> > +    assert(size != sizeof(umad));
> > +    umad = (struct backend_umad *)buf;
> > +
> > +    pr_dbg("Got %d bytes\n", size);
> > +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> > +
> > +#ifdef PVRDMA_DEBUG
> > +    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> > +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
> > +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
> > +           hdr->method, hdr->status, be64toh(hdr->tid),
> > +           hdr->attr_id, hdr->attr_mod);
> > +#endif
> > +
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +    if (!o_ctx_id) {
> > +        pr_dbg("No more free MADs buffers, waiting for a while\n");
> > +        sleep(THR_POLL_TO);
> > +        return;
> > +    }
> > +
> > +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> > +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +    if (unlikely(!bctx)) {
> > +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
> > +        return;
> > +    }
> > +
> > +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
> > +
> > +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> > +                           bctx->sge.length);
> > +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> > +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> > +                     bctx->up_ctx);
> > +    } else {
> > +        memset(mad, 0, bctx->sge.length);
> > +        build_mad_hdr((struct ibv_grh *)mad,
> > +                      (union ibv_gid *)&umad->hdr.addr.gid,
> > +                      &backend_dev->gid, umad->hdr.length);
> > +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> > +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> > +
> > +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> > +    }
> > +
> > +    g_free(bctx);
> > +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +}
> > +
> > +static int mad_init(RdmaBackendDev *backend_dev)
> > +{
> > +    struct backend_umad umad = {0};
> > +    int ret;
> > +
> > +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> > +        pr_dbg("Missing chardev for MAD multiplexer\n");
> > +        return -EIO;
> > +    }
> > +
> > +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> > +                             mad_read, NULL, NULL, backend_dev, NULL, true);
> > +
> > +    /* Register ourself */
> > +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > +                            sizeof(umad.hdr));
> > +    if (ret != sizeof(umad.hdr)) {
> > +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret)
> > +    }
> > +
> > +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> > +    backend_dev->recv_mads_list.list = qlist_new();
> > +
> 
> What happens if the device fails to register
> to rdma_umadmux other than a debug message?
> Can the device continue to work?

No, it can't; in such a case the device will exit with an error.

As I see it, MAD is a mandatory function of the device; the only reason it was
not there from day one is the major effort it takes to implement.

> 
> 
> > +    return 0;
> > +}
> > +
> > +static void mad_stop(RdmaBackendDev *backend_dev)
> > +{
> > +    QObject *o_ctx_id;
> > +    unsigned long cqe_ctx_id;
> > +    BackendCtx *bctx;
> > +
> > +    pr_dbg("Closing MAD\n");
> > +
> > +    /* Clear MAD buffers list */
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    do {
> > +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> > +        if (o_ctx_id) {
> > +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> > +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +            if (bctx) {
> 
> Maybe it worth adding a debug message if we have some orphan context.

res_tbl_get does this anyway, so there is no need to dump two messages.

> 
> > +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +                g_free(bctx);
> > +            }
> > +        }
> > +    } while (o_ctx_id);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +}
> > +
> > +static void mad_fini(RdmaBackendDev *backend_dev)
> > +{
> > +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
> > +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> > +}
> > +
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> >                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      Error **errp)
> > +                      CharBackend *mad_chr_be, Error **errp)
> >   {
> >       int i;
> >       int ret = 0;
> > @@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       memset(backend_dev, 0, sizeof(*backend_dev));
> >       backend_dev->dev = pdev;
> > -
> > +    backend_dev->mad_chr_be = mad_chr_be;
> >       backend_dev->backend_gid_idx = backend_gid_idx;
> >       backend_dev->port_num = port_num;
> >       backend_dev->rdma_dev_res = rdma_dev_res;
> > @@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       pr_dbg("interface_id=0x%" PRIx64 "\n",
> >              be64_to_cpu(backend_dev->gid.global.interface_id));
> > +    ret = mad_init(backend_dev);
> > +    if (ret) {
> > +        error_setg(errp, "Fail to initialize mad");
> > +        ret = -EIO;
> > +        goto out_destroy_comm_channel;
> > +    }
> > +
> >       backend_dev->comp_thread.run = false;
> >       backend_dev->comp_thread.is_running = false;
> > @@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
> >   {
> >       pr_dbg("Stopping rdma_backend\n");
> >       stop_backend_thread(&backend_dev->comp_thread);
> > +    mad_stop(backend_dev);
> >   }
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev)
> >   {
> >       rdma_backend_stop(backend_dev);
> > +    mad_fini(backend_dev);
> >       g_hash_table_destroy(ah_hash);
> >       ibv_destroy_comp_channel(backend_dev->channel);
> >       ibv_close_device(backend_dev->context);
> > diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> > index 3ccc9a2494..fc83330251 100644
> > --- a/hw/rdma/rdma_backend.h
> > +++ b/hw/rdma/rdma_backend.h
> > @@ -17,6 +17,8 @@
> >   #define RDMA_BACKEND_H
> >   #include "qapi/error.h"
> > +#include "chardev/char-fe.h"
> > +
> >   #include "rdma_rm_defs.h"
> >   #include "rdma_backend_defs.h"
> > @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> >                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      Error **errp);
> > +                      CharBackend *mad_chr_be, Error **errp);
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> >   void rdma_backend_start(RdmaBackendDev *backend_dev);
> >   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> > diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> > index 7404f64002..2a7e667075 100644
> > --- a/hw/rdma/rdma_backend_defs.h
> > +++ b/hw/rdma/rdma_backend_defs.h
> > @@ -16,8 +16,9 @@
> >   #ifndef RDMA_BACKEND_DEFS_H
> >   #define RDMA_BACKEND_DEFS_H
> > -#include <infiniband/verbs.h>
> >   #include "qemu/thread.h"
> > +#include "chardev/char-fe.h"
> > +#include <infiniband/verbs.h>
> >   typedef struct RdmaDeviceResources RdmaDeviceResources;
> > @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
> >       bool is_running; /* Set by the thread to report its status */
> >   } RdmaBackendThread;
> > +typedef struct RecvMadList {
> > +    QemuMutex lock;
> > +    QList *list;
> > +} RecvMadList;
> > +
> >   typedef struct RdmaBackendDev {
> >       struct ibv_device_attr dev_attr;
> >       RdmaBackendThread comp_thread;
> > @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
> >       struct ibv_comp_channel *channel;
> >       uint8_t port_num;
> >       uint8_t backend_gid_idx;
> > +    RecvMadList recv_mads_list;
> > +    CharBackend *mad_chr_be;
> >   } RdmaBackendDev;
> >   typedef struct RdmaBackendPD {
> > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > index e2d9f93cdf..e3742d893a 100644
> > --- a/hw/rdma/vmw/pvrdma.h
> > +++ b/hw/rdma/vmw/pvrdma.h
> > @@ -19,6 +19,7 @@
> >   #include "qemu/units.h"
> >   #include "hw/pci/pci.h"
> >   #include "hw/pci/msix.h"
> > +#include "chardev/char-fe.h"
> >   #include "../rdma_backend_defs.h"
> >   #include "../rdma_rm_defs.h"
> > @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
> >       uint8_t backend_port_num;
> >       RdmaBackendDev backend_dev;
> >       RdmaDeviceResources rdma_dev_res;
> > +    CharBackend mad_chr;
> >   } PVRDMADev;
> >   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
> > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > index ca5fa8d981..6c8c0154fa 100644
> > --- a/hw/rdma/vmw/pvrdma_main.c
> > +++ b/hw/rdma/vmw/pvrdma_main.c
> > @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
> >       DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
> >                         dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
> >       DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
> > +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
> >       DEFINE_PROP_END_OF_LIST(),
> >   };
> > @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
> >                              dev->backend_device_name, dev->backend_port_num,
> > -                           dev->backend_gid_idx, &dev->dev_attr, errp);
> > +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> > +                           errp);
> >       if (rc) {
> >           goto out;
> >       }
> 
> 
> Thanks,
> Marcel

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  2018-11-17 12:25   ` Marcel Apfelbaum
@ 2018-11-18  9:42     ` Yuval Shaia
  0 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-18  9:42 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Sat, Nov 17, 2018 at 02:25:55PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/13/18 9:13 AM, Yuval Shaia wrote:
> > When the device goes down, the function fini_ports loops over all entries in
> > the gid table regardless of whether an entry is valid or not. In case an
> > entry is not valid, we'd like to skip any further processing in the
> > backend device.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_rm.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> > index 35a96d9a64..e3f6b2f6ea 100644
> > --- a/hw/rdma/rdma_rm.c
> > +++ b/hw/rdma/rdma_rm.c
> > @@ -555,6 +555,10 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >   {
> >       int rc;
> > +    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
> > +        return 0;
> > +    }
> > +
> >       rc = rdma_backend_del_gid(backend_dev, ifname,
> >                                 &dev_res->port.gid_tbl[gid_idx].gid);
> >       if (rc < 0) {
> 
> Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

There seems to be a missing space separator between "Apfelbaum" and "<". Is
that ok?

> 
> Thanks,
> Marcel
> 


* Re: [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-17 17:27   ` Shamir Rabinovitch
@ 2018-11-18 10:17     ` Yuval Shaia
  0 siblings, 0 replies; 70+ messages in thread
From: Yuval Shaia @ 2018-11-18 10:17 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, cohuck, yuval.shaia

On Sat, Nov 17, 2018 at 07:27:43PM +0200, Shamir Rabinovitch wrote:
> On Tue, Nov 13, 2018 at 09:13:14AM +0200, Yuval Shaia wrote:
> > The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for a
> > given MAD class.
> > This does not go hand-in-hand with the qemu pvrdma device's requirements,
> > where each VM is a MAD agent.
> > Fix it by adding an implementation of an RDMA MAD multiplexer service
> > which, on the one hand, registers as the sole MAD agent with the kernel
> > module and, on the other hand, serves more than one VM.
> > 
> > Design Overview:
> > ----------------
> > A server process registers with the UMAD framework (for this to work the
> > rdma_cm kernel module needs to be unloaded) and creates a unix socket to
> > listen for incoming requests from clients.
> > A client process (such as QEMU) connects to this unix socket and
> > registers with its own GID.
> > 
> > TX:
> > ---
> > When a client needs to send an rdma_cm MAD message it constructs it the
> > same way as without this multiplexer, i.e. it creates a umad packet, but
> > this time it writes its content to the socket instead of calling
> > umad_send(). The server, upon receiving such a message, fetches
> > local_comm_id from it so a context for this session can be maintained, and
> > relays the message to the UMAD layer by calling umad_send().
> > 
> > RX:
> > ---
> > The server creates a worker thread to process incoming rdma_cm MAD
> > messages. When an incoming message arrives (umad_recv()), the server,
> > depending on the message type (attr_id), looks for the target client by
> > searching either the gid->fd table or the local_comm_id->fd table. With
> > the extracted fd, the server relays the incoming message to the client.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >  MAINTAINERS                      |   1 +
> >  Makefile                         |   3 +
> >  Makefile.objs                    |   1 +
> >  contrib/rdmacm-mux/Makefile.objs |   4 +
> >  contrib/rdmacm-mux/main.c        | 771 +++++++++++++++++++++++++++++++
> >  contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
> >  6 files changed, 836 insertions(+)
> >  create mode 100644 contrib/rdmacm-mux/Makefile.objs
> >  create mode 100644 contrib/rdmacm-mux/main.c
> >  create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
> >
> 
> Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

Thanks Shamir!

Thanks a lot also for all the MAD-related tips and comments you gave
off-list; they made the code much more mature and correct.

> 


* Re: [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response
  2018-11-18  8:24     ` Yuval Shaia
@ 2018-11-25  7:35       ` Marcel Apfelbaum
  0 siblings, 0 replies; 70+ messages in thread
From: Marcel Apfelbaum @ 2018-11-25  7:35 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck



On 11/18/18 10:24 AM, Yuval Shaia wrote:
> On Sat, Nov 17, 2018 at 02:22:31PM +0200, Marcel Apfelbaum wrote:
>>
>> On 11/13/18 9:13 AM, Yuval Shaia wrote:
>>> Driver checks error code let's set it.
>>>
>>> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
>>> ---
>>>    hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
>>>    1 file changed, 48 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
>>> index 0d3c818c20..a326c5d470 100644
>>> --- a/hw/rdma/vmw/pvrdma_cmd.c
>>> +++ b/hw/rdma/vmw/pvrdma_cmd.c
>>> @@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        if (rdma_backend_query_port(&dev->backend_dev,
>>>                                    (struct ibv_port_attr *)&attrs)) {
>>> -        return -ENOMEM;
>>> +        resp->hdr.err = -ENOMEM;
>>> +        goto out;
>>>        }
>>>        memset(resp, 0, sizeof(*resp));
>>> @@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        resp->attrs.active_width = 1;
>>>        resp->attrs.active_speed = 1;
>>> -    return 0;
>>> +out:
>>> +    pr_dbg("ret=%d\n", resp->hdr.err);
>>> +    return resp->hdr.err;
>>>    }
>>>    static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        resp->pkey = PVRDMA_PKEY;
>>>        pr_dbg("pkey=0x%x\n", resp->pkey);
>>> -    return 0;
>>> +    return resp->hdr.err;
>>>    }
>>>    static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>> Is it possible to ensure err is 0 by default during hdr creation
>> instead of manually setting it every time?
> Yes, we can, but since these handlers anyway fill some other fields in the
> response I thought it would be cleaner if they filled the op-status as
> well.
> I believe filling it at the "handler" level is more modular.
> Do you think filling it outside would make the code cleaner?

The only problem with manually clearing the field
is that one might forget to do it, and we may see random
err codes in the future, which would be hard to debug.

Thanks,
Marcel

>
>> Thanks,
>> Marcel
>>
>>> +
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>>> +
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
>>> @@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
>>>        if (!cq) {
>>>            pr_dbg("Invalid CQ handle\n");
>>> -        return -EINVAL;
>>> +        rsp->hdr.err = -EINVAL;
>>> +        goto out;
>>>        }
>>>        ring = (PvrdmaRing *)cq->opaque;
>>> @@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>>> +
>>> +out:
>>> +    pr_dbg("ret=%d\n", rsp->hdr.err);
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
>>> @@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
>>>        if (!qp) {
>>>            pr_dbg("Invalid QP handle\n");
>>> -        return -EINVAL;
>>> +        rsp->hdr.err = -EINVAL;
>>> +        goto out;
>>>        }
>>>        rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
>>> @@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
>>>        g_free(ring);
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>>> +
>>> +out:
>>> +    pr_dbg("ret=%d\n", rsp->hdr.err);
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        pr_dbg("index=%d\n", cmd->index);
>>>        if (cmd->index >= MAX_PORT_GIDS) {
>>> -        return -EINVAL;
>>> +        rsp->hdr.err = -EINVAL;
>>> +        goto out;
>>>        }
>>>        pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
>>> @@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
>>>                             dev->backend_eth_device_name, gid, cmd->index);
>>>        if (rc < 0) {
>>> -        return -EINVAL;
>>> +        rsp->hdr.err = rc;
>>> +        goto out;
>>>        }
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>>> +
>>> +out:
>>> +    pr_dbg("ret=%d\n", rsp->hdr.err);
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        pr_dbg("index=%d\n", cmd->index);
>>>        if (cmd->index >= MAX_PORT_GIDS) {
>>> -        return -EINVAL;
>>> +        rsp->hdr.err = -EINVAL;
>>> +        goto out;
>>>        }
>>>        rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
>>> @@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>            goto out;
>>>        }
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>>> +
>>> +out:
>>> +    pr_dbg("ret=%d\n", rsp->hdr.err);
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
>>>                                         &resp->ctx_handle);
>>> -    pr_dbg("ret=%d\n", resp->hdr.err);
>>> -
>>> -    return 0;
>>> +    pr_dbg("ret=%d\n", rsp->hdr.err);
>>> +    return rsp->hdr.err;
>>>    }
>>>    static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>> @@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>>>        rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
>>> -    return 0;
>>> +    rsp->hdr.err = 0;
>>> +
>>> +    return rsp->hdr.err;
>>>    }
>>>    struct cmd_handler {
>>>        uint32_t cmd;
>>> @@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
>>>        }
>>>        err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
>>> -                            dsr_info->rsp);
>>> +                                                    dsr_info->rsp);
>>>    out:
>>>        set_reg_val(dev, PVRDMA_REG_ERR, err);
>>>        post_interrupt(dev, INTR_VEC_CMD_RING);


end of thread, other threads:[~2018-11-25  7:36 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-13  7:12 [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
2018-11-17 11:42   ` Marcel Apfelbaum
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
2018-11-17 12:06   ` Marcel Apfelbaum
2018-11-18  9:33     ` Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
2018-11-17 12:07   ` Marcel Apfelbaum
2018-11-13  7:12 ` [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
2018-11-17 12:48   ` Marcel Apfelbaum
2018-11-18  8:13     ` Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
2018-11-17 12:10   ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
2018-11-17 12:11   ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
2018-11-17 12:19   ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
2018-11-17 12:22   ` Marcel Apfelbaum
2018-11-18  8:24     ` Yuval Shaia
2018-11-25  7:35       ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
2018-11-17 12:23   ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
2018-11-13  9:34   ` Cornelia Huck
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
2018-11-17 12:24   ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
2018-11-17 12:25   ` Marcel Apfelbaum
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
2018-11-17 12:25   ` Marcel Apfelbaum
2018-11-18  9:42     ` Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
2018-11-17 12:34   ` Marcel Apfelbaum
2018-11-18  7:27     ` Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 00/23] Add support for RDMA MAD Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
2018-11-17 17:27   ` Shamir Rabinovitch
2018-11-18 10:17     ` Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 02/23] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 18/23] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 22/23] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
2018-11-13  7:13 ` [Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation Yuval Shaia
