* [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD
@ 2018-11-22 12:13 Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 01/24] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
                   ` (23 more replies)
  0 siblings, 24 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Hi all.

This is a major enhancement to the pvrdma device to allow it to work with
state-of-the-art applications such as MPI.

As described in patch #5, MAD packets are management packets that are used
for many purposes, including, but not limited to, implementing a
communication layer above the IB verbs API.

Patch 1 adds a new external executable (under contrib) that addresses a
specific limitation in the RDMA userspace MAD stack.

This patch set mainly presents the MAD enhancement, but while working on it
I came across some bugs and enhancements that needed to be addressed before
any MAD coding. This is the role of patches 2 to 4, 7 to 9 and 15 to 17.

Patches 6 and 18 are cosmetic changes; although not strictly relevant to
this patch set, they are included here since (at least for patch 6) they are
hard to decouple.

Patches 12 to 15 couple the pvrdma device with the vmxnet3 device, as this
is the configuration enforced by the pvrdma driver in the guest: a vmxnet3
device in function 0 and a pvrdma device in function 1 of the same PCI slot.
Patch 12 moves the needed code from the vmxnet3 device to a new header file
that can be used by the pvrdma code, while patches 13 to 15 make use of it.
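
For illustration, the function-0/function-1 layout enforced by the guest
driver could look like this on the QEMU command line. This is a sketch only;
the pvrdma property names used below (backend-dev, backend-port,
backend-gid-idx) are assumptions for illustration, and docs/pvrdma.txt is
the authoritative reference:

```shell
# vmxnet3 in function 0, pvrdma in function 1 of the same PCI slot.
# Property names on the pvrdma line are illustrative, not authoritative.
qemu-system-x86_64 ... \
  -netdev tap,id=net0 \
  -device vmxnet3,addr=10.0,multifunction=on,mac=00:11:22:33:44:55,netdev=net0 \
  -device pvrdma,addr=10.1,backend-dev=rxe0,backend-port=1,backend-gid-idx=0
```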

Along with this patch set, a parallel patch was posted to libvirt to apply
the change needed there as part of the process implemented in patches 10
and 11. This change is needed so that the guest can configure any IP address
on the Ethernet function of the pvrdma device.
https://www.redhat.com/archives/libvir-list/2018-November/msg00135.html

Since we maintain external resources, such as GIDs in the host's GID table,
we need to do some cleanup before going down. This is the job of patches 19
and 20.

Patches 21 to 23 contain fixes for bugs detected while working on the
processing of the VM shutdown notification.

Patch 24 fixes documentation.

Review is needed for:
[05] hw/rdma: Add support for MAD packets
[11] hw/pvrdma: Add support to allow guest to configure GID table
[13] hw/pvrdma: Make sure PCI function 0 is vmxnet3
[17] hw/pvrdma: Fill error code in command's response
[23] hw/pvrdma: Do not clean resources on shutdown
[24] docs: Update pvrdma device documentation

And second review is needed for:
[10] qapi: Define new QMP message for pvrdma

v1 -> v2:
    * Fix compilation issue detected when compiling for mingw
    * Address comment from Eric Blake re version of QEMU in json
      message
    * Fix example from QMP message in json file
    * Fix case where a VM tries to remove an invalid GID from GID table
    * rdmacm-mux: Cleanup entries in socket-gids table when socket is
      closed
    * Cleanup resources (GIDs, QPs etc) when VM goes down

v2 -> v3:
    * Address comment from Cornelia Huck for patch #19
    * Add some R-Bs from Marcel Apfelbaum and Dmitry Fleytman
    * Update docs/pvrdma.txt with the changes made by this patchset
    * Address comments from Shamir Rabinovitch for UMAD multiplexer

v3 -> v4:
    * Address some comments from Marcel
    * Add some R-Bs from Cornelia Huck and Shamir Rabinovitch

v4 -> v5:
    * Add one more patch that deletes code that performs unneeded (and
      buggy) cleanup of resources during VM shutdown.
    * Fix race condition that might happen when a MAD response arrives
      before the ack for the send is received.
    * Based qapi patch on Eric Blake's patch "qapi: Reduce Makefile
      boilerplate" per Markus Armbruster's suggestion.
      Please note that this will cause a build error until Eric's patch is
      applied.
    * Add some debug log messages to rdmacm-mux

Yuval Shaia (24):
  contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  hw/rdma: Add ability to force notification without re-arm
  hw/rdma: Return qpn 1 if ibqp is NULL
  hw/rdma: Abort send-op if fail to create addr handler
  hw/rdma: Add support for MAD packets
  hw/pvrdma: Make function reset_device return void
  hw/pvrdma: Make default pkey 0xFFFF
  hw/pvrdma: Set the correct opcode for recv completion
  hw/pvrdma: Set the correct opcode for send completion
  qapi: Define new QMP message for pvrdma
  hw/pvrdma: Add support to allow guest to configure GID table
  vmxnet3: Move some definitions to header file
  hw/pvrdma: Make sure PCI function 0 is vmxnet3
  hw/rdma: Initialize node_guid from vmxnet3 mac address
  hw/pvrdma: Make device state depend on Ethernet function state
  hw/pvrdma: Fill all CQE fields
  hw/pvrdma: Fill error code in command's response
  hw/rdma: Remove unneeded code that handles more than one port
  vl: Introduce shutdown_notifiers
  hw/pvrdma: Clean device's resource when system is shutdown
  hw/rdma: Do not use bitmap_zero_extend to free bitmap
  hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  hw/pvrdma: Do not clean resources on shutdown
  docs: Update pvrdma device documentation

 MAINTAINERS                      |   2 +
 Makefile                         |   3 +
 Makefile.objs                    |   2 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 790 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  61 +++
 docs/pvrdma.txt                  | 103 +++-
 hw/net/vmxnet3.c                 | 116 +----
 hw/net/vmxnet3_defs.h            | 133 ++++++
 hw/rdma/rdma_backend.c           | 513 +++++++++++++++++---
 hw/rdma/rdma_backend.h           |  28 +-
 hw/rdma/rdma_backend_defs.h      |  19 +-
 hw/rdma/rdma_rm.c                | 120 ++++-
 hw/rdma/rdma_rm.h                |  17 +-
 hw/rdma/rdma_rm_defs.h           |  21 +-
 hw/rdma/rdma_utils.h             |  24 +
 hw/rdma/vmw/pvrdma.h             |  10 +-
 hw/rdma/vmw/pvrdma_cmd.c         | 119 +++--
 hw/rdma/vmw/pvrdma_main.c        |  61 ++-
 hw/rdma/vmw/pvrdma_qp_ops.c      |  62 ++-
 include/sysemu/sysemu.h          |   1 +
 qapi/qapi-schema.json            |   1 +
 qapi/rdma.json                   |  38 ++
 vl.c                             |  15 +-
 24 files changed, 1957 insertions(+), 306 deletions(-)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
 create mode 100644 hw/net/vmxnet3_defs.h
 create mode 100644 qapi/rdma.json

-- 
2.17.2


* [Qemu-devel] [PATCH v5 01/24] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 02/24] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for a
given MAD class.
This does not fit the QEMU pvrdma device's requirements, where each VM is
a MAD agent.
Fix it by adding an RDMA MAD multiplexer service which, on the one hand,
registers as the sole MAD agent with the kernel module and, on the other
hand, serves more than one VM.

Design Overview:
----------------
A server process registers with the UMAD framework (for this to work the
rdma_cm kernel module needs to be unloaded) and creates a unix socket on
which it listens for incoming requests from clients.
A client process (such as QEMU) connects to this unix socket and
registers with its own GID.

TX:
---
When a client needs to send an rdma_cm MAD message, it constructs it the
same way as without this multiplexer, i.e. it creates a umad packet, but
this time it writes its content to the socket instead of calling
umad_send(). The server, upon receiving such a message, fetches the
local_comm_id from it so that a context for this session can be maintained,
and relays the message to the UMAD layer by calling umad_send().

RX:
---
The server creates a worker thread to process incoming rdma_cm MAD
messages. When a message arrives (umad_recv()), the server, depending on
the message type (attr_id), looks for the target client by searching either
the gid->fd table or the local_comm_id->fd table. With the extracted fd,
the server relays the incoming message to the client.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
---
 MAINTAINERS                      |   1 +
 Makefile                         |   3 +
 Makefile.objs                    |   1 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 790 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  61 +++
 6 files changed, 860 insertions(+)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 1032406c56..7b68080094 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2334,6 +2334,7 @@ S: Maintained
 F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
+F: contrib/rdmacm-mux/*
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index c8b9efdad4..2b46880eb6 100644
--- a/Makefile
+++ b/Makefile
@@ -362,6 +362,7 @@ dummy := $(call unnest-vars,, \
                 elf2dmp-obj-y \
                 ivshmem-client-obj-y \
                 ivshmem-server-obj-y \
+                rdmacm-mux-obj-y \
                 libvhost-user-obj-y \
                 vhost-user-scsi-obj-y \
                 vhost-user-blk-obj-y \
@@ -579,6 +580,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
 	$(call LINK, $^)
 vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
 	$(call LINK, $^)
+rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
+	$(call LINK, $^)
 
 module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
 	$(call quiet-command,$(PYTHON) $< $@ \
diff --git a/Makefile.objs b/Makefile.objs
index 56af0347d3..319f14d937 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -133,6 +133,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
 vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
 vhost-user-blk-obj-y = contrib/vhost-user-blk/
+rdmacm-mux-obj-y = contrib/rdmacm-mux/
 
 ######################################################################
 trace-events-subdirs =
diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
new file mode 100644
index 0000000000..be3eacb6f7
--- /dev/null
+++ b/contrib/rdmacm-mux/Makefile.objs
@@ -0,0 +1,4 @@
+ifdef CONFIG_PVRDMA
+CFLAGS += -libumad -Wno-format-truncation
+rdmacm-mux-obj-y = main.o
+endif
diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
new file mode 100644
index 0000000000..a4524b19ac
--- /dev/null
+++ b/contrib/rdmacm-mux/main.c
@@ -0,0 +1,790 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "sys/poll.h"
+#include "sys/ioctl.h"
+#include "pthread.h"
+#include "syslog.h"
+
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "infiniband/umad_types.h"
+#include "infiniband/umad_sa.h"
+#include "infiniband/umad_cm.h"
+
+#include "rdmacm-mux.h"
+
+#define SCALE_US 1000
+#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
+#define SLEEP_SECS 5 /* This is used both in poll() and thread */
+#define SERVER_LISTEN_BACKLOG 10
+#define MAX_CLIENTS 4096
+#define MAD_RMPP_VERSION 0
+#define MAD_METHOD_MASK0 0x8
+
+#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
+
+#define CM_REQ_DGID_POS      80
+#define CM_SIDR_REQ_DGID_POS 44
+
+/* The below can be overridden by command line parameters */
+#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
+#define RDMA_DEVICE "rxe0"
+#define RDMA_PORT_NUM 1
+
+typedef struct RdmaCmServerArgs {
+    char unix_socket_path[PATH_MAX];
+    char rdma_dev_name[NAME_MAX];
+    int rdma_port_num;
+} RdmaCMServerArgs;
+
+typedef struct CommId2FdEntry {
+    int fd;
+    int ttl; /* Initialized to 2, decremented on each timeout, deleted at 0 */
+    __be64 gid_ifid;
+} CommId2FdEntry;
+
+typedef struct RdmaCmUMadAgent {
+    int port_id;
+    int agent_id;
+    GHashTable *gid2fd; /* Used to find fd of a given gid */
+    GHashTable *commid2fd; /* Used to find fd of a given comm_id */
+} RdmaCmUMadAgent;
+
+typedef struct RdmaCmServer {
+    bool run;
+    RdmaCMServerArgs args;
+    struct pollfd fds[MAX_CLIENTS];
+    int nfds;
+    RdmaCmUMadAgent umad_agent;
+    pthread_t umad_recv_thread;
+    pthread_rwlock_t lock;
+} RdmaCMServer;
+
+static RdmaCMServer server = {0};
+
+static void usage(const char *progname)
+{
+    printf("Usage: %s [OPTION]...\n"
+           "Start a RDMA-CM multiplexer\n"
+           "\n"
+           "\t-h                    Show this help\n"
+           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
+           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
+           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
+           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
+}
+
+static void help(const char *progname)
+{
+    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    int c;
+    char unix_socket_path[PATH_MAX];
+
+    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
+    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
+    server.args.rdma_port_num = RDMA_PORT_NUM;
+
+    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
+        switch (c) {
+        case 'h':
+            usage(argv[0]);
+            exit(0);
+
+        case 's':
+            /* This is temporary, the final name is built below */
+            strncpy(unix_socket_path, optarg, PATH_MAX);
+            break;
+
+        case 'd':
+            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
+            break;
+
+        case 'p':
+            server.args.rdma_port_num = atoi(optarg);
+            break;
+
+        default:
+            help(argv[0]);
+            exit(1);
+        }
+    }
+
+    /* Build unique unix-socket file name */
+    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
+             unix_socket_path, server.args.rdma_dev_name,
+             server.args.rdma_port_num);
+
+    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
+    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
+    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
+}
+
+static void hash_tbl_alloc(void)
+{
+
+    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
+                                                     g_int64_equal,
+                                                     g_free, g_free);
+    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
+                                                        g_int_equal,
+                                                        g_free, g_free);
+}
+
+static void hash_tbl_free(void)
+{
+    if (server.umad_agent.commid2fd) {
+        g_hash_table_destroy(server.umad_agent.commid2fd);
+    }
+    if (server.umad_agent.gid2fd) {
+        g_hash_table_destroy(server.umad_agent.gid2fd);
+    }
+}
+
+
+static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
+{
+    int *fd;
+
+    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    if (!fd) {
+        /* Let's try IPv4 */
+        *gid_ifid |= 0x00000000ffff0000;
+        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    }
+
+    return fd ? *fd : 0;
+}
+
+static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
+{
+    pthread_rwlock_rdlock(&server.lock);
+    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!*fd) {
+        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
+        return -ENOENT;
+    }
+
+    return 0;
+}
+
+static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
+                                         __be64 *gid_idid)
+{
+    CommId2FdEntry *fde;
+
+    pthread_rwlock_rdlock(&server.lock);
+    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!fde) {
+        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
+        return -ENOENT;
+    }
+
+    *fd = fde->fd;
+    *gid_idid = fde->gid_ifid;
+
+    return 0;
+}
+
+static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (fd1) { /* record already exists - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
+                           RDMACM_MUX_ERR_CODE_EACCES;
+    }
+
+    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
+
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx registered on socket %d",
+           be64toh((uint64_t)gid_ifid), fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (!fd1) { /* record does not exist - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
+    }
+
+    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)));
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx unregistered on socket %d",
+           be64toh((uint64_t)gid_ifid), fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
+                                          uint64_t gid_ifid)
+{
+    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
+
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_insert(server.umad_agent.commid2fd,
+                        g_memdup(&comm_id, sizeof(comm_id)),
+                        g_memdup(&fde, sizeof(fde)));
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static gboolean remove_old_comm_ids(gpointer key, gpointer value,
+                                    gpointer user_data)
+{
+    CommId2FdEntry *fde = (CommId2FdEntry *)value;
+
+    return !fde->ttl--;
+}
+
+static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    if (*(int *)value == *(int *)user_data) {
+        syslog(LOG_INFO, "0x%lx unregistered on socket %d",
+               be64toh(*(uint64_t *)key), *(int *)value);
+        return true;
+    }
+
+    return false;
+}
+
+static void hash_tbl_remove_fd_ifid_pair(int fd)
+{
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
+                                remove_entry_from_gid2fd, (gpointer)&fd);
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
+{
+    struct umad_hdr *hdr = (struct umad_hdr *)mad;
+    char *data = (char *)hdr + sizeof(*hdr);
+    int32_t comm_id = 0;
+    uint16_t attr_id = be16toh(hdr->attr_id);
+    int rc = 0;
+
+    switch (attr_id) {
+    case UMAD_CM_ATTR_REQ:
+        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_SIDR_REQ:
+        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_REP:
+        /* Fall through */
+    case UMAD_CM_ATTR_REJ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREQ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREP:
+        /* Fall through */
+    case UMAD_CM_ATTR_RTU:
+        data += sizeof(comm_id);
+        /* Fall through */
+    case UMAD_CM_ATTR_SIDR_REP:
+        memcpy(&comm_id, data, sizeof(comm_id));
+        if (comm_id) {
+            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
+        }
+        break;
+
+    default:
+        rc = -EINVAL;
+        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
+    }
+
+    syslog(LOG_DEBUG, "mad_to_vm: %d 0x%x 0x%x\n", *fd, attr_id, comm_id);
+
+    return rc;
+}
+
+static void *umad_recv_thread_func(void *args)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    int fd = -2;
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ;
+    msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD;
+
+    while (server.run) {
+        do {
+            msg.umad_len = sizeof(msg.umad.mad);
+            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
+                           SLEEP_SECS * SCALE_US);
+            if ((rc == -EIO) || (rc == -EINVAL)) {
+                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
+            }
+
+            if (rc == -ETIMEDOUT) {
+                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
+                                            remove_old_comm_ids, NULL);
+            }
+        } while (rc && server.run);
+
+        if (server.run) {
+            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
+            if (rc) {
+                continue;
+            }
+
+            send(fd, &msg, sizeof(msg), 0);
+        }
+    }
+
+    return NULL;
+}
+
+static int read_and_process(int fd)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    struct umad_hdr *hdr;
+    uint32_t *comm_id = NULL;
+    uint16_t attr_id;
+
+    rc = recv(fd, &msg, sizeof(msg), 0);
+
+    if (rc < 0 && errno != EWOULDBLOCK) {
+        syslog(LOG_ERR, "Fail to read from socket %d\n", fd);
+        return -EIO;
+    }
+
+    if (!rc) {
+        syslog(LOG_ERR, "Fail to read from socket %d\n", fd);
+        return -EPIPE;
+    }
+
+    if (msg.hdr.msg_type != RDMACM_MUX_MSG_TYPE_REQ) {
+        syslog(LOG_WARNING, "Got non-request message (%d) from socket %d\n",
+               msg.hdr.msg_type, fd);
+        return -EPERM;
+    }
+
+    switch (msg.hdr.op_code) {
+    case RDMACM_MUX_OP_CODE_REG:
+        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_OP_CODE_UNREG:
+        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_OP_CODE_MAD:
+        /* If this is REQ or REP then store the pair comm_id,fd to be later
+         * used for other messages where gid is unknown */
+        hdr = (struct umad_hdr *)msg.umad.mad;
+        attr_id = be16toh(hdr->attr_id);
+        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
+            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
+            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
+            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
+            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
+                                          msg.hdr.sgid.global.interface_id);
+        }
+
+        syslog(LOG_DEBUG, "vm_to_mad: %d 0x%x 0x%x\n", fd, attr_id,
+               comm_id ? *comm_id : 0);
+        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
+                       &msg.umad, msg.umad_len, 1, 0);
+        if (rc) {
+            syslog(LOG_ERR,
+                  "Fail to send MAD message (0x%x) from socket %d, err=%d",
+                  attr_id, fd, rc);
+        }
+        break;
+
+    default:
+        syslog(LOG_ERR, "Got invalid op_code (%d) from socket %d",
+               msg.hdr.msg_type, fd);
+        rc = RDMACM_MUX_ERR_CODE_EINVAL;
+    }
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
+    msg.hdr.err_code = rc;
+    rc = send(fd, &msg, sizeof(msg), 0);
+
+    return rc == sizeof(msg) ? 0 : -EPIPE;
+}
+
+static int accept_all(void)
+{
+    int fd, rc = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    do {
+        if ((server.nfds + 1) > MAX_CLIENTS) {
+            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
+            rc = -EIO;
+            goto out;
+        }
+
+        fd = accept(server.fds[0].fd, NULL, NULL);
+        if (fd < 0) {
+            if (errno != EWOULDBLOCK) {
+                syslog(LOG_WARNING, "accept() failed");
+                rc = -EIO;
+                goto out;
+            }
+            break;
+        }
+
+        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
+        server.fds[server.nfds].fd = fd;
+        server.fds[server.nfds].events = POLLIN;
+        server.nfds++;
+    } while (fd != -1);
+
+out:
+    pthread_rwlock_unlock(&server.lock);
+    return rc;
+}
+
+static void compress_fds(void)
+{
+    int i, j;
+    int closed = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    for (i = 1; i < server.nfds; i++) {
+        if (!server.fds[i].fd) {
+            closed++;
+            for (j = i; j < server.nfds; j++) {
+                server.fds[j].fd = server.fds[j + 1].fd;
+            }
+        }
+    }
+
+    server.nfds -= closed;
+
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static void close_fd(int idx)
+{
+    close(server.fds[idx].fd);
+    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
+    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
+    server.fds[idx].fd = 0;
+}
+
+static void run(void)
+{
+    int rc, nfds, i;
+    bool compress = false;
+
+    syslog(LOG_INFO, "Service started");
+
+    while (server.run) {
+        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
+        if (rc < 0) {
+            if (errno != EINTR) {
+                syslog(LOG_WARNING, "poll() failed");
+            }
+            continue;
+        }
+
+        if (rc == 0) {
+            continue;
+        }
+
+        nfds = server.nfds;
+        for (i = 0; i < nfds; i++) {
+            if (server.fds[i].revents == 0) {
+                continue;
+            }
+
+            if (server.fds[i].revents != POLLIN) {
+                if (i == 0) {
+                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
+                           server.fds[i].revents);
+                } else {
+                    close_fd(i);
+                    compress = true;
+                }
+                continue;
+            }
+
+            if (i == 0) {
+                rc = accept_all();
+                if (rc) {
+                    continue;
+                }
+            } else {
+                rc = read_and_process(server.fds[i].fd);
+                if (rc) {
+                    close_fd(i);
+                    compress = true;
+                }
+            }
+        }
+
+        if (compress) {
+            compress = false;
+            compress_fds();
+        }
+    }
+}
+
+static void fini_listener(void)
+{
+    int i;
+
+    if (server.fds[0].fd <= 0) {
+        return;
+    }
+
+    for (i = server.nfds - 1; i >= 0; i--) {
+        if (server.fds[i].fd) {
+            close(server.fds[i].fd);
+        }
+    }
+
+    unlink(server.args.unix_socket_path);
+}
+
+static void fini_umad(void)
+{
+    if (server.umad_agent.agent_id) {
+        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
+    }
+
+    if (server.umad_agent.port_id) {
+        umad_close_port(server.umad_agent.port_id);
+    }
+
+    hash_tbl_free();
+}
+
+static void fini(void)
+{
+    if (server.umad_recv_thread) {
+        pthread_join(server.umad_recv_thread, NULL);
+        server.umad_recv_thread = 0;
+    }
+    fini_umad();
+    fini_listener();
+    pthread_rwlock_destroy(&server.lock);
+
+    syslog(LOG_INFO, "Service going down");
+}
+
+static int init_listener(void)
+{
+    struct sockaddr_un sun;
+    int rc, on = 1;
+
+    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (server.fds[0].fd < 0) {
+        syslog(LOG_ALERT, "socket() failed");
+        return -EIO;
+    }
+
+    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
+                    sizeof(on));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "setsockopt() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "ioctl() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT,
+               "Invalid unix_socket_path, size must be less than %zu\n",
+               sizeof(sun.sun_path));
+        rc = -EINVAL;
+        goto err;
+    }
+
+    sun.sun_family = AF_UNIX;
+    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
+                  server.args.unix_socket_path);
+    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT, "Could not copy unix socket path\n");
+        rc = -EINVAL;
+        goto err;
+    }
+
+    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "bind() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "listen() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    server.fds[0].events = POLLIN;
+    server.nfds = 1;
+    server.run = true;
+
+    return 0;
+
+err:
+    close(server.fds[0].fd);
+    return rc;
+}
+
+static int init_umad(void)
+{
+    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
+
+    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
+                                               server.args.rdma_port_num);
+
+    if (server.umad_agent.port_id < 0) {
+        syslog(LOG_WARNING, "umad_open_port() failed");
+        return -EIO;
+    }
+
+    memset(&method_mask, 0, sizeof(method_mask));
+    method_mask[0] = MAD_METHOD_MASK0;
+    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
+                                               UMAD_CLASS_CM,
+                                               UMAD_SA_CLASS_VERSION,
+                                               MAD_RMPP_VERSION, method_mask);
+    if (server.umad_agent.agent_id < 0) {
+        syslog(LOG_WARNING, "umad_register() failed");
+        return -EIO;
+    }
+
+    hash_tbl_alloc();
+
+    return 0;
+}
+
+static void signal_handler(int sig, siginfo_t *siginfo, void *context)
+{
+    static bool warned;
+
+    /* Prevent stop if clients are connected */
+    if (server.nfds != 1) {
+        if (!warned) {
+            syslog(LOG_WARNING,
+                   "Can't stop while active clients exist, resend SIGINT to override");
+            warned = true;
+            return;
+        }
+    }
+
+    if (sig == SIGINT) {
+        server.run = false;
+        fini();
+    }
+
+    exit(0);
+}
+
+static int init(void)
+{
+    int rc;
+    struct sigaction sig = {0};
+
+    rc = init_listener();
+    if (rc) {
+        return rc;
+    }
+
+    rc = init_umad();
+    if (rc) {
+        return rc;
+    }
+
+    pthread_rwlock_init(&server.lock, 0);
+
+    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
+                        NULL);
+    if (rc) {
+        syslog(LOG_ERR, "Fail to create UMAD receiver thread (%d)\n", rc);
+        return rc;
+    }
+
+    sig.sa_sigaction = &signal_handler;
+    sig.sa_flags = SA_SIGINFO;
+    rc = sigaction(SIGINT, &sig, NULL);
+    if (rc < 0) {
+        syslog(LOG_ERR, "Fail to install SIGINT handler (%d)\n", errno);
+        return rc;
+    }
+
+    return 0;
+}
+
+int main(int argc, char *argv[])
+{
+    int rc;
+
+    memset(&server, 0, sizeof(server));
+
+    parse_args(argc, argv);
+
+    rc = init();
+    if (rc) {
+        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
+        rc = -EAGAIN;
+        goto out;
+    }
+
+    run();
+
+out:
+    fini();
+
+    return rc;
+}
diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
new file mode 100644
index 0000000000..942a802c47
--- /dev/null
+++ b/contrib/rdmacm-mux/rdmacm-mux.h
@@ -0,0 +1,61 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux declarations
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMACM_MUX_H
+#define RDMACM_MUX_H
+
+#include "linux/if.h"
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "rdma/rdma_user_cm.h"
+
+typedef enum RdmaCmMuxMsgType {
+    RDMACM_MUX_MSG_TYPE_REQ   = 0,
+    RDMACM_MUX_MSG_TYPE_RESP  = 1,
+} RdmaCmMuxMsgType;
+
+typedef enum RdmaCmMuxOpCode {
+    RDMACM_MUX_OP_CODE_REG   = 0,
+    RDMACM_MUX_OP_CODE_UNREG = 1,
+    RDMACM_MUX_OP_CODE_MAD   = 2,
+} RdmaCmMuxOpCode;
+
+typedef enum RdmaCmMuxErrCode {
+    RDMACM_MUX_ERR_CODE_OK        = 0,
+    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
+    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
+    RDMACM_MUX_ERR_CODE_EACCES    = 3,
+    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
+} RdmaCmMuxErrCode;
+
+typedef struct RdmaCmMuxHdr {
+    RdmaCmMuxMsgType msg_type;
+    RdmaCmMuxOpCode op_code;
+    union ibv_gid sgid;
+    RdmaCmMuxErrCode err_code;
+} RdmaCmUHdr;
+
+typedef struct RdmaCmUMad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+} RdmaCmUMad;
+
+typedef struct RdmaCmMuxMsg {
+    RdmaCmUHdr hdr;
+    int umad_len;
+    RdmaCmUMad umad;
+} RdmaCmMuxMsg;
+
+#endif
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH v5 02/24] hw/rdma: Add ability to force notification without re-arm
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 01/24] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 03/24] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Upon completion of an incoming packet the device pushes a CQE to the
driver's RX ring and notifies the driver (MSI-X).
While for data-path incoming packets the driver needs the ability to
control whether it wishes to receive interrupts, for control-path
packets such as an incoming MAD the driver must be notified
unconditionally; it does not even need to re-arm the notification bit.

Enhance the notification field to support this.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_rm.c           | 12 ++++++++++--
 hw/rdma/rdma_rm_defs.h      |  8 +++++++-
 hw/rdma/vmw/pvrdma_qp_ops.c |  6 ++++--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 8d59a42cd1..4f10fcabcc 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -263,7 +263,7 @@ int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     }
 
     cq->opaque = opaque;
-    cq->notify = false;
+    cq->notify = CNT_CLEAR;
 
     rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
     if (rc) {
@@ -291,7 +291,10 @@ void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
         return;
     }
 
-    cq->notify = notify;
+    if (cq->notify != CNT_SET) {
+        cq->notify = notify ? CNT_ARM : CNT_CLEAR;
+    }
+
     pr_dbg("notify=%d\n", cq->notify);
 }
 
@@ -349,6 +352,11 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
         return -EINVAL;
     }
 
+    if (qp_type == IBV_QPT_GSI) {
+        scq->notify = CNT_SET;
+        rcq->notify = CNT_SET;
+    }
+
     qp = res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
     if (!qp) {
         return -ENOMEM;
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7228151239..9b399063d3 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -49,10 +49,16 @@ typedef struct RdmaRmPD {
     uint32_t ctx_handle;
 } RdmaRmPD;
 
+typedef enum CQNotificationType {
+    CNT_CLEAR,
+    CNT_ARM,
+    CNT_SET,
+} CQNotificationType;
+
 typedef struct RdmaRmCQ {
     RdmaBackendCQ backend_cq;
     void *opaque;
-    bool notify;
+    CQNotificationType notify;
 } RdmaRmCQ;
 
 /* MR (DMA region) */
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index c668afd0ed..762700a205 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -89,8 +89,10 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pvrdma_ring_write_inc(&dev->dsr_info.cq);
 
     pr_dbg("cq->notify=%d\n", cq->notify);
-    if (cq->notify) {
-        cq->notify = false;
+    if (cq->notify != CNT_CLEAR) {
+        if (cq->notify == CNT_ARM) {
+            cq->notify = CNT_CLEAR;
+        }
         post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
     }
 
-- 
2.17.2


* [Qemu-devel] [PATCH v5 03/24] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 01/24] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 02/24] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 04/24] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The device does not support QP0, only QP1.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_backend.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 86e8fe8ab6..3ccc9a2494 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
 
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
-    return qp->ibqp ? qp->ibqp->qp_num : 0;
+    return qp->ibqp ? qp->ibqp->qp_num : 1;
 }
 
 static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 04/24] hw/rdma: Abort send-op if fail to create addr handler
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (2 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 03/24] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets Yuval Shaia
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The function create_ah might return NULL; abort the send operation with
an error in that case.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_backend.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index d7a4bbd91f..1e148398a2 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -338,6 +338,10 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
                                 backend_dev->backend_gid_idx, dgid);
+        if (!wr.wr.ud.ah) {
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            goto out_dealloc_cqe_ctx;
+        }
         wr.wr.ud.remote_qpn = dqpn;
         wr.wr.ud.remote_qkey = dqkey;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (3 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 04/24] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-25  7:05   ` Marcel Apfelbaum
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 06/24] hw/pvrdma: Make function reset_device return void Yuval Shaia
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

MAD (Management Datagram) packets are widely used by various modules,
both in kernel and in user space. For example, the rdma_* API, which is
used to create and maintain a "connection" layer on top of RDMA, uses
several types of MAD packets.

For more information please refer to chapter 13.4 in Volume 1 of the
InfiniBand Architecture Specification, Release 1.1, available here:
https://www.infinibandta.org/ibta-specifications-download/

To support MAD packets the device uses an external utility
(contrib/rdmacm-mux) to relay packets to and from the guest driver.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 275 +++++++++++++++++++++++++++++++++++-
 hw/rdma/rdma_backend.h      |   4 +-
 hw/rdma/rdma_backend_defs.h |  10 +-
 hw/rdma/vmw/pvrdma.h        |   2 +
 hw/rdma/vmw/pvrdma_main.c   |   4 +-
 5 files changed, 285 insertions(+), 10 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 1e148398a2..7c220a5798 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -16,8 +16,13 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qlist.h"
+#include "qapi/qmp/qnum.h"
 
 #include <infiniband/verbs.h>
+#include <infiniband/umad_types.h>
+#include <infiniband/umad.h>
+#include <rdma/rdma_user_cm.h>
 
 #include "trace.h"
 #include "rdma_utils.h"
@@ -33,16 +38,25 @@
 #define VENDOR_ERR_MAD_SEND         0x206
 #define VENDOR_ERR_INVLKEY          0x207
 #define VENDOR_ERR_MR_SMALL         0x208
+#define VENDOR_ERR_INV_MAD_BUFF     0x209
+#define VENDOR_ERR_INV_NUM_SGE      0x210
 
 #define THR_NAME_LEN 16
 #define THR_POLL_TO  5000
 
+#define MAD_HDR_SIZE sizeof(struct ibv_grh)
+
 typedef struct BackendCtx {
-    uint64_t req_id;
     void *up_ctx;
     bool is_tx_req;
+    struct ibv_sge sge; /* Used to save MAD recv buffer */
 } BackendCtx;
 
+struct backend_umad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+};
+
 static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
 
 static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
@@ -286,6 +300,61 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
+static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
+                    uint32_t num_sge)
+{
+    struct backend_umad umad = {0};
+    char *hdr, *msg;
+    int ret;
+
+    pr_dbg("num_sge=%d\n", num_sge);
+
+    if (num_sge != 2) {
+        return -EINVAL;
+    }
+
+    umad.hdr.length = sge[0].length + sge[1].length;
+    pr_dbg("msg_len=%d\n", umad.hdr.length);
+
+    if (umad.hdr.length > sizeof(umad.mad)) {
+        return -ENOMEM;
+    }
+
+    umad.hdr.addr.qpn = htobe32(1);
+    umad.hdr.addr.grh_present = 1;
+    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    umad.hdr.addr.hop_limit = 1;
+
+    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
+    if (!hdr) {
+        pr_dbg("Fail to map to sge[0]\n");
+        return -ENOMEM;
+    }
+    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+    if (!msg) {
+        pr_dbg("Fail to map to sge[1]\n");
+        rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
+        return -ENOMEM;
+    }
+
+    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
+    pr_dbg_buf("mad_data", msg, sge[1].length);
+
+    memcpy(&umad.mad[0], hdr, sge[0].length);
+    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+
+    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
+
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad));
+
+    pr_dbg("qemu_chr_fe_write=%d\n", ret);
+
+    return (ret != sizeof(umad));
+}
+
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
@@ -304,9 +373,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = mad_send(backend_dev, sge, num_sge);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            } else {
+                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+            }
         }
-        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
         return;
     }
 
@@ -370,6 +443,48 @@ out_free_bctx:
     g_free(bctx);
 }
 
+static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
+                                         struct ibv_sge *sge, uint32_t num_sge,
+                                         void *ctx)
+{
+    BackendCtx *bctx;
+    int rc;
+    uint32_t bctx_id;
+
+    if (num_sge != 1) {
+        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
+        return VENDOR_ERR_INV_NUM_SGE;
+    }
+
+    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
+        pr_dbg("Too small buffer for MAD\n");
+        return VENDOR_ERR_INV_MAD_BUFF;
+    }
+
+    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
+    pr_dbg("length=%d\n", sge[0].length);
+    pr_dbg("lkey=%d\n", sge[0].lkey);
+
+    bctx = g_malloc0(sizeof(*bctx));
+
+    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
+    if (unlikely(rc)) {
+        g_free(bctx);
+        pr_dbg("Fail to allocate cqe_ctx\n");
+        return VENDOR_ERR_NOMEM;
+    }
+
+    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
+    bctx->up_ctx = ctx;
+    bctx->sge = *sge;
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+
+    return 0;
+}
+
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
                             RdmaDeviceResources *rdma_dev_res,
                             RdmaBackendQP *qp, uint8_t qp_type,
@@ -388,7 +503,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+            }
         }
         return;
     }
@@ -517,7 +635,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 
     switch (qp_type) {
     case IBV_QPT_GSI:
-        pr_dbg("QP1 unsupported\n");
         return 0;
 
     case IBV_QPT_RC:
@@ -748,11 +865,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
     return 0;
 }
 
+static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
+                                 union ibv_gid *my_gid, int paylen)
+{
+    grh->paylen = htons(paylen);
+    grh->sgid = *sgid;
+    grh->dgid = *my_gid;
+
+    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
+    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+}
+
+static inline int mad_can_receieve(void *opaque)
+{
+    return sizeof(struct backend_umad);
+}
+
+static void mad_read(void *opaque, const uint8_t *buf, int size)
+{
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+    char *mad;
+    struct backend_umad *umad;
+
+    assert(size == sizeof(*umad));
+    umad = (struct backend_umad *)buf;
+
+    pr_dbg("Got %d bytes\n", size);
+    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+
+#ifdef PVRDMA_DEBUG
+    struct umad_hdr *hdr = (struct umad_hdr *)&umad->mad;
+    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
+           hdr->base_version, hdr->mgmt_class, hdr->class_version,
+           hdr->method, hdr->status, be64toh(hdr->tid),
+           hdr->attr_id, hdr->attr_mod);
+#endif
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+    if (!o_ctx_id) {
+        pr_dbg("No more free MAD buffers, waiting for a while\n");
+        sleep(THR_POLL_TO);
+        return;
+    }
+
+    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+    if (unlikely(!bctx)) {
+        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
+        return;
+    }
+
+    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
+
+    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
+                           bctx->sge.length);
+    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                     bctx->up_ctx);
+    } else {
+        memset(mad, 0, bctx->sge.length);
+        build_mad_hdr((struct ibv_grh *)mad,
+                      (union ibv_gid *)&umad->hdr.addr.gid,
+                      &backend_dev->gid, umad->hdr.length);
+        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
+
+        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+    }
+
+    g_free(bctx);
+    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+}
+
+static int mad_init(RdmaBackendDev *backend_dev)
+{
+    struct backend_umad umad = {0};
+    int ret;
+
+    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+        pr_dbg("Missing chardev for MAD multiplexer\n");
+        return -EIO;
+    }
+
+    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
+                             mad_read, NULL, NULL, backend_dev, NULL, true);
+
+    /* Register ourself */
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad.hdr));
+    if (ret != sizeof(umad.hdr)) {
+        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
+    }
+
+    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
+    backend_dev->recv_mads_list.list = qlist_new();
+
+    return 0;
+}
+
+static void mad_stop(RdmaBackendDev *backend_dev)
+{
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+
+    pr_dbg("Closing MAD\n");
+
+    /* Clear MAD buffers list */
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    do {
+        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+        if (o_ctx_id) {
+            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+            if (bctx) {
+                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+                g_free(bctx);
+            }
+        }
+    } while (o_ctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+}
+
+static void mad_fini(RdmaBackendDev *backend_dev)
+{
+    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
+    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp)
+                      CharBackend *mad_chr_be, Error **errp)
 {
     int i;
     int ret = 0;
@@ -763,7 +1015,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
-
+    backend_dev->mad_chr_be = mad_chr_be;
     backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
@@ -854,6 +1106,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     pr_dbg("interface_id=0x%" PRIx64 "\n",
            be64_to_cpu(backend_dev->gid.global.interface_id));
 
+    ret = mad_init(backend_dev);
+    if (ret) {
+        error_setg(errp, "Fail to initialize mad");
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+
     backend_dev->comp_thread.run = false;
     backend_dev->comp_thread.is_running = false;
 
@@ -885,11 +1144,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
 {
     pr_dbg("Stopping rdma_backend\n");
     stop_backend_thread(&backend_dev->comp_thread);
+    mad_stop(backend_dev);
 }
 
 void rdma_backend_fini(RdmaBackendDev *backend_dev)
 {
     rdma_backend_stop(backend_dev);
+    mad_fini(backend_dev);
     g_hash_table_destroy(ah_hash);
     ibv_destroy_comp_channel(backend_dev->channel);
     ibv_close_device(backend_dev->context);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 3ccc9a2494..fc83330251 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -17,6 +17,8 @@
 #define RDMA_BACKEND_H
 
 #include "qapi/error.h"
+#include "chardev/char-fe.h"
+
 #include "rdma_rm_defs.h"
 #include "rdma_backend_defs.h"
 
@@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp);
+                      CharBackend *mad_chr_be, Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 7404f64002..2a7e667075 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -16,8 +16,9 @@
 #ifndef RDMA_BACKEND_DEFS_H
 #define RDMA_BACKEND_DEFS_H
 
-#include <infiniband/verbs.h>
 #include "qemu/thread.h"
+#include "chardev/char-fe.h"
+#include <infiniband/verbs.h>
 
 typedef struct RdmaDeviceResources RdmaDeviceResources;
 
@@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
     bool is_running; /* Set by the thread to report its status */
 } RdmaBackendThread;
 
+typedef struct RecvMadList {
+    QemuMutex lock;
+    QList *list;
+} RecvMadList;
+
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
@@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
     struct ibv_comp_channel *channel;
     uint8_t port_num;
     uint8_t backend_gid_idx;
+    RecvMadList recv_mads_list;
+    CharBackend *mad_chr_be;
 } RdmaBackendDev;
 
 typedef struct RdmaBackendPD {
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e2d9f93cdf..e3742d893a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -19,6 +19,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
+#include "chardev/char-fe.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -83,6 +84,7 @@ typedef struct PVRDMADev {
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
+    CharBackend mad_chr;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ca5fa8d981..6c8c0154fa 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
     DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
                       dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
     DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
+    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, errp);
+                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
+                           errp);
     if (rc) {
         goto out;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v5 06/24] hw/pvrdma: Make function reset_device return void
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (4 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 07/24] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

This function cannot fail, so change it to return void.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 6c8c0154fa..fc2abd34af 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -369,13 +369,11 @@ static int unquiesce_device(PVRDMADev *dev)
     return 0;
 }
 
-static int reset_device(PVRDMADev *dev)
+static void reset_device(PVRDMADev *dev)
 {
     pvrdma_stop(dev);
 
     pr_dbg("Device reset complete\n");
-
-    return 0;
 }
 
 static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 07/24] hw/pvrdma: Make default pkey 0xFFFF
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (5 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 06/24] hw/pvrdma: Make function reset_device return void Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 08/24] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Commit 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF") exported the
default pkey as an external definition but omitted the change from
0x7FFF to 0xFFFF.

Fixes: 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF")

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e3742d893a..15c3f28b86 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -52,7 +52,7 @@
 #define PVRDMA_FW_VERSION    14
 
 /* Some defaults */
-#define PVRDMA_PKEY          0x7FFF
+#define PVRDMA_PKEY          0xFFFF
 
 typedef struct DSRInfo {
     dma_addr_t dma;
-- 
2.17.2


* [Qemu-devel] [PATCH v5 08/24] hw/pvrdma: Set the correct opcode for recv completion
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (6 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 07/24] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 09/24] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The function pvrdma_post_cqe populates the CQE entry with the opcode
from the given completion element. For the receive operation the value
was not set; fix it by setting it to IBV_WC_RECV.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 762700a205..7b0f440fda 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx = g_malloc(sizeof(CompHandlerCtx));
         comp_ctx->dev = dev;
         comp_ctx->cq_handle = qp->recv_cq_handle;
-        comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cqe.opcode = IBV_WC_RECV;
 
         rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
                                &qp->backend_qp, qp->qp_type,
-- 
2.17.2


* [Qemu-devel] [PATCH v5 09/24] hw/pvrdma: Set the correct opcode for send completion
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (7 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 08/24] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma Yuval Shaia
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The opcode for a WC should be set by the device and not taken from the
work element.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 7b0f440fda..3388be1926 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cq_handle = qp->send_cq_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
         comp_ctx->cqe.qp = qp_handle;
-        comp_ctx->cqe.opcode = wqe->hdr.opcode;
+        comp_ctx->cqe.opcode = IBV_WC_SEND;
 
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-- 
2.17.2


* [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (8 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 09/24] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-26 10:01   ` Markus Armbruster
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

pvrdma requires that any GID attached to it also be attached to the
backend device in the host.

A new QMP message is defined so the pvrdma device can broadcast any change
made to its GID table. This event is captured by libvirt, which in turn
updates the GID table in the backend device.
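
A management client can consume the event over the standard QMP protocol.
The sketch below is illustrative only: the socket path is an assumption
(QEMU would need to be started with e.g. -qmp unix:/tmp/qmp.sock,server),
and the helper names are ours; only the event name and its fields come from
the schema in this patch.

```python
import json
import socket

def parse_gid_event(msg: dict):
    """Return (op, netdev, subnet_prefix, interface_id) for a
    RDMA_GID_STATUS_CHANGED event, or None for any other QMP message."""
    if msg.get("event") != "RDMA_GID_STATUS_CHANGED":
        return None
    d = msg["data"]
    return ("add" if d["gid-status"] else "del",
            d["netdev"], d["subnet-prefix"], d["interface-id"])

def watch_gid_events(path="/tmp/qmp.sock"):
    """Connect to a QMP unix socket (path is an assumption) and yield
    parsed GID events as they arrive."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    f = s.makefile("rw")
    json.loads(f.readline())                          # QMP greeting banner
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())                          # {"return": {}}
    for line in f:                                    # async events follow
        ev = parse_gid_event(json.loads(line))
        if ev:
            yield ev
```

Feeding parse_gid_event the example event from the schema yields
("add", "bridge0", 33022, 15880512517475447892), which is what libvirt
would act on to update the backend device.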

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 MAINTAINERS           |  1 +
 Makefile.objs         |  1 +
 qapi/qapi-schema.json |  1 +
 qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 41 insertions(+)
 create mode 100644 qapi/rdma.json

diff --git a/MAINTAINERS b/MAINTAINERS
index 7b68080094..525bcdcf41 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2335,6 +2335,7 @@ F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
 F: contrib/rdmacm-mux/*
+F: qapi/rdma.json
 
 Build and test automation
 -------------------------
diff --git a/Makefile.objs b/Makefile.objs
index 319f14d937..fe3566b797 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -1,5 +1,6 @@
 QAPI_MODULES = block-core block char common crypto introspect job migration
 QAPI_MODULES += misc net rocker run-state sockets tpm trace transaction ui
+QAPI_MODULES += rdma
 
 #######################################################################
 # Common libraries for tools and emulators
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 65b6dc2f6f..3bbdfcee84 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -86,6 +86,7 @@
 { 'include': 'char.json' }
 { 'include': 'job.json' }
 { 'include': 'net.json' }
+{ 'include': 'rdma.json' }
 { 'include': 'rocker.json' }
 { 'include': 'tpm.json' }
 { 'include': 'ui.json' }
diff --git a/qapi/rdma.json b/qapi/rdma.json
new file mode 100644
index 0000000000..804c68ab36
--- /dev/null
+++ b/qapi/rdma.json
@@ -0,0 +1,38 @@
+# -*- Mode: Python -*-
+#
+
+##
+# = RDMA device
+##
+
+##
+# @RDMA_GID_STATUS_CHANGED:
+#
+# Emitted when guest driver adds/deletes GID to/from device
+#
+# @netdev: RoCE Network Device name - char *
+#
+# @gid-status: Add or delete indication - bool
+#
+# @subnet-prefix: Subnet Prefix - uint64
+#
+# @interface-id : Interface ID - uint64
+#
+# Since: 3.2
+#
+# Example:
+#
+# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
+#     "event": "RDMA_GID_STATUS_CHANGED",
+#     "data":
+#         {"netdev": "bridge0",
+#         "interface-id": 15880512517475447892,
+#         "gid-status": true,
+#         "subnet-prefix": 33022}}
+#
+##
+{ 'event': 'RDMA_GID_STATUS_CHANGED',
+  'data': { 'netdev'        : 'str',
+            'gid-status'    : 'bool',
+            'subnet-prefix' : 'uint64',
+            'interface-id'  : 'uint64' } }
-- 
2.17.2


* [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (9 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-25  7:29   ` Marcel Apfelbaum
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 12/24] vmxnet3: Move some definitions to header file Yuval Shaia
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Control over the RDMA device's GID table is exercised by updating the
device's Ethernet function addresses. Usually the first GID entry is
derived from the MAC address, the second from the first IPv6 address and
the third from the IPv4 address. Further entries can be added by
configuring more IP addresses. The reverse also holds: whenever an address
is removed, the corresponding GID entry is removed.
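
For illustration, these mappings follow the standard RoCE conventions:
the MAC yields a link-local GID via modified EUI-64, an IPv4 address maps
to an IPv4-mapped GID, and an IPv6 address is used as a GID directly. A
sketch of the first two (the helper names are ours, not from the patch):

```python
import ipaddress

def gid_from_mac(mac: str) -> bytes:
    """Link-local GID: fe80::/64 prefix + modified EUI-64 of the MAC
    (flip the universal/local bit, insert ff:fe in the middle)."""
    b = bytes(int(x, 16) for x in mac.split(":"))
    eui64 = bytes([b[0] ^ 0x02, b[1], b[2], 0xFF, 0xFE, b[3], b[4], b[5]])
    return bytes([0xFE, 0x80] + [0] * 6) + eui64

def gid_from_ipv4(ip: str) -> bytes:
    """IPv4-mapped GID: ::ffff:a.b.c.d as a 16-byte value."""
    return bytes(10) + b"\xff\xff" + ipaddress.IPv4Address(ip).packed
```

For example, MAC 00:1a:2b:3c:4d:5e becomes GID fe80::21a:2bff:fe3c:4d5e,
and adding 192.168.1.5 to the interface produces GID ::ffff:192.168.1.5.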

The process is driven by the network and RDMA stacks. Whenever an address
is added, the ib_core driver is notified and calls the device driver's
add_gid function, which in turn updates the device.

To support this in the pvrdma device we need to hook into the create_bind
and destroy_bind HW commands triggered by the pvrdma driver in the guest.
Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent for libvirt to process and update the address of the
backend Ethernet device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 336 +++++++++++++++++++++++++-----------
 hw/rdma/rdma_backend.h      |  22 +--
 hw/rdma/rdma_backend_defs.h |  11 +-
 hw/rdma/rdma_rm.c           | 104 ++++++++++-
 hw/rdma/rdma_rm.h           |  17 +-
 hw/rdma/rdma_rm_defs.h      |   9 +-
 hw/rdma/rdma_utils.h        |  15 ++
 hw/rdma/vmw/pvrdma.h        |   2 +-
 hw/rdma/vmw/pvrdma_cmd.c    |  55 +++---
 hw/rdma/vmw/pvrdma_main.c   |  25 +--
 hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
 11 files changed, 453 insertions(+), 163 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 7c220a5798..8b5a111bf4 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -15,15 +15,18 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "sysemu/sysemu.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qnum.h"
+#include "qapi/qapi-events-rdma.h"
 
 #include <infiniband/verbs.h>
 #include <infiniband/umad_types.h>
 #include <infiniband/umad.h>
 #include <rdma/rdma_user_cm.h>
 
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 #include "trace.h"
 #include "rdma_utils.h"
 #include "rdma_rm.h"
@@ -160,6 +163,71 @@ static void *comp_handler_thread(void *arg)
     return NULL;
 }
 
+static inline void disable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
+{
+    atomic_set(&backend_dev->rdmacm_mux.can_receive, 0);
+}
+
+static inline void enable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
+{
+    atomic_set(&backend_dev->rdmacm_mux.can_receive, sizeof(RdmaCmMuxMsg));
+}
+
+static inline int rdmacm_mux_can_process_async(RdmaBackendDev *backend_dev)
+{
+    return atomic_read(&backend_dev->rdmacm_mux.can_receive);
+}
+
+static int check_mux_op_status(CharBackend *mad_chr_be)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("Reading response\n");
+    ret = qemu_chr_fe_read_all(mad_chr_be, (uint8_t *)&msg, sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Invalid message size %d, expecting %ld\n", ret, sizeof(msg));
+        return -EIO;
+    }
+
+    if (msg.hdr.msg_type != RDMACM_MUX_MSG_TYPE_RESP) {
+        pr_dbg("Invalid message type %d\n", msg.hdr.msg_type);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Operation failed in mux, error code %d\n", msg.hdr.err_code);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+static int exec_rdmacm_mux_req(RdmaBackendDev *backend_dev, RdmaCmMuxMsg *msg)
+{
+    int rc = 0;
+
+    pr_dbg("Executing request %d\n", msg->hdr.op_code);
+
+    msg->hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ;
+    disable_rdmacm_mux_async(backend_dev);
+    rc = qemu_chr_fe_write(backend_dev->rdmacm_mux.chr_be,
+                           (const uint8_t *)msg, sizeof(*msg));
+    enable_rdmacm_mux_async(backend_dev);
+    if (rc != sizeof(*msg)) {
+        pr_dbg("Fail to send request to rdmacm_mux (rc=%d)\n", rc);
+        return -EIO;
+    }
+
+    rc = check_mux_op_status(backend_dev->rdmacm_mux.chr_be);
+    if (rc) {
+        pr_dbg("Fail to execute rdmacm_mux request %d (rc=%d)\n",
+               msg->hdr.op_code, rc);
+    }
+
+    return 0;
+}
+
 static void stop_backend_thread(RdmaBackendThread *thread)
 {
     thread->run = false;
@@ -300,11 +368,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
-static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
-                    uint32_t num_sge)
+static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
+                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
 {
-    struct backend_umad umad = {0};
-    char *hdr, *msg;
+    RdmaCmMuxMsg msg = {0};
+    char *hdr, *data;
     int ret;
 
     pr_dbg("num_sge=%d\n", num_sge);
@@ -313,26 +381,31 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
         return -EINVAL;
     }
 
-    umad.hdr.length = sge[0].length + sge[1].length;
-    pr_dbg("msg_len=%d\n", umad.hdr.length);
+    msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD;
+    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
 
-    if (umad.hdr.length > sizeof(umad.mad)) {
+    msg.umad_len = sge[0].length + sge[1].length;
+    pr_dbg("umad_len=%d\n", msg.umad_len);
+
+    if (msg.umad_len > sizeof(msg.umad.mad)) {
         return -ENOMEM;
     }
 
-    umad.hdr.addr.qpn = htobe32(1);
-    umad.hdr.addr.grh_present = 1;
-    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    umad.hdr.addr.hop_limit = 1;
+    msg.umad.hdr.addr.qpn = htobe32(1);
+    msg.umad.hdr.addr.grh_present = 1;
+    pr_dbg("sgid_idx=%d\n", sgid_idx);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
+    msg.umad.hdr.addr.gid_index = sgid_idx;
+    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
+    msg.umad.hdr.addr.hop_limit = 1;
 
     hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
     if (!hdr) {
         pr_dbg("Fail to map to sge[0]\n");
         return -ENOMEM;
     }
-    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
-    if (!msg) {
+    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+    if (!data) {
         pr_dbg("Fail to map to sge[1]\n");
         rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
         return -ENOMEM;
@@ -341,25 +414,27 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
     pr_dbg_buf("mad_hdr", hdr, sge[0].length);
     pr_dbg_buf("mad_data", data, sge[1].length);
 
-    memcpy(&umad.mad[0], hdr, sge[0].length);
-    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
+    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
 
-    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
     rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
 
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad));
-
-    pr_dbg("qemu_chr_fe_write=%d\n", ret);
+    ret = exec_rdmacm_mux_req(backend_dev, &msg);
+    if (ret) {
+        pr_dbg("Fail to send MAD to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
 
-    return (ret != sizeof(umad));
+    return 0;
 }
 
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
-                            union ibv_gid *dgid, uint32_t dqpn,
-                            uint32_t dqkey, void *ctx)
+                            uint8_t sgid_idx, union ibv_gid *sgid,
+                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
+                            void *ctx)
 {
     BackendCtx *bctx;
     struct ibv_sge new_sge[MAX_SGE];
@@ -373,7 +448,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            rc = mad_send(backend_dev, sge, num_sge);
+            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
                 comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
@@ -409,8 +484,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     }
 
     if (qp_type == IBV_QPT_UD) {
-        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
-                                backend_dev->backend_gid_idx, dgid);
+        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
@@ -715,9 +789,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
 }
 
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey)
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
 {
     struct ibv_qp_attr attr = {0};
     union ibv_gid ibv_gid = {
@@ -729,13 +803,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
     attr.qp_state = IBV_QPS_RTR;
     attr_mask = IBV_QP_STATE;
 
+    qp->sgid_idx = sgid_idx;
+
     switch (qp_type) {
     case IBV_QPT_RC:
         pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
                be64_to_cpu(ibv_gid.global.subnet_prefix),
                be64_to_cpu(ibv_gid.global.interface_id));
         pr_dbg("dqpn=0x%x\n", dqpn);
-        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
+        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
         pr_dbg("sport_num=%d\n", backend_dev->port_num);
         pr_dbg("rq_psn=0x%x\n", rq_psn);
 
@@ -747,7 +823,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         attr.ah_attr.is_global      = 1;
         attr.ah_attr.grh.hop_limit  = 1;
         attr.ah_attr.grh.dgid       = ibv_gid;
-        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
+        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
         attr.rq_psn                 = rq_psn;
 
         attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
@@ -756,8 +832,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         break;
 
     case IBV_QPT_UD:
+        pr_dbg("qkey=0x%x\n", qkey);
         if (use_qkey) {
-            pr_dbg("qkey=0x%x\n", qkey);
             attr.qkey = qkey;
             attr_mask |= IBV_QP_QKEY;
         }
@@ -873,29 +949,19 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
     grh->dgid = *my_gid;
 
     pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
-    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
-    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
 }
 
-static inline int mad_can_receieve(void *opaque)
+static void process_incoming_mad_req(RdmaBackendDev *backend_dev,
+                                     RdmaCmMuxMsg *msg)
 {
-    return sizeof(struct backend_umad);
-}
-
-static void mad_read(void *opaque, const uint8_t *buf, int size)
-{
-    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
     QObject *o_ctx_id;
     unsigned long cqe_ctx_id;
     BackendCtx *bctx;
     char *mad;
-    struct backend_umad *umad;
 
-    assert(size != sizeof(umad));
-    umad = (struct backend_umad *)buf;
-
-    pr_dbg("Got %d bytes\n", size);
-    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+    pr_dbg("umad_len=%d\n", msg->umad_len);
 
 #ifdef PVRDMA_DEBUG
     struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
@@ -925,15 +991,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
-    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
         comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
                      bctx->up_ctx);
     } else {
+        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
-                      (union ibv_gid *)&umad->hdr.addr.gid,
-                      &backend_dev->gid, umad->hdr.length);
-        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
+                      msg->umad_len);
+        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
         comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
@@ -943,30 +1010,51 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
 }
 
-static int mad_init(RdmaBackendDev *backend_dev)
+static inline int rdmacm_mux_can_receive(void *opaque)
 {
-    struct backend_umad umad = {0};
-    int ret;
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
 
-    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
-        pr_dbg("Missing chardev for MAD multiplexer\n");
-        return -EIO;
+    return rdmacm_mux_can_process_async(backend_dev);
+}
+
+static void rdmacm_mux_read(void *opaque, const uint8_t *buf, int size)
+{
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
+    RdmaCmMuxMsg *msg = (RdmaCmMuxMsg *)buf;
+
+    pr_dbg("Got %d bytes\n", size);
+    pr_dbg("msg_type=%d\n", msg->hdr.msg_type);
+    pr_dbg("op_code=%d\n", msg->hdr.op_code);
+
+    if (msg->hdr.msg_type != RDMACM_MUX_MSG_TYPE_REQ &&
+        msg->hdr.op_code != RDMACM_MUX_OP_CODE_MAD) {
+            pr_dbg("Error: Not a MAD request, skipping\n");
+            return;
     }
+    process_incoming_mad_req(backend_dev, msg);
+}
+
+static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
+{
+    int ret;
 
-    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
-                             mad_read, NULL, NULL, backend_dev, NULL, true);
+    backend_dev->rdmacm_mux.chr_be = mad_chr_be;
 
-    /* Register ourself */
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad.hdr));
-    if (ret != sizeof(umad.hdr)) {
-        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
+    ret = qemu_chr_fe_backend_connected(backend_dev->rdmacm_mux.chr_be);
+    if (!ret) {
+        pr_dbg("Missing chardev for MAD multiplexer\n");
+        return -EIO;
     }
 
     qemu_mutex_init(&backend_dev->recv_mads_list.lock);
     backend_dev->recv_mads_list.list = qlist_new();
 
+    enable_rdmacm_mux_async(backend_dev);
+
+    qemu_chr_fe_set_handlers(backend_dev->rdmacm_mux.chr_be,
+                             rdmacm_mux_can_receive, rdmacm_mux_read, NULL,
+                             NULL, backend_dev, NULL, true);
+
     return 0;
 }
 
@@ -978,6 +1066,8 @@ static void mad_stop(RdmaBackendDev *backend_dev)
 
     pr_dbg("Closing MAD\n");
 
+    disable_rdmacm_mux_async(backend_dev);
+
     /* Clear MAD buffers list */
     qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
     do {
@@ -1000,23 +1090,94 @@ static void mad_fini(RdmaBackendDev *backend_dev)
     qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
 }
 
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid)
+{
+    union ibv_gid sgid;
+    int ret;
+    int i = 0;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    do {
+        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
+                            &sgid);
+        i++;
+    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
+
+    pr_dbg("gid_index=%d\n", i - 1);
+
+    return ret ? ret : i - 1;
+}
+
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.op_code = RDMACM_MUX_OP_CODE_REG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+
+    ret = exec_rdmacm_mux_req(backend_dev, &msg);
+    if (ret) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, true,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return ret;
+}
+
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.op_code = RDMACM_MUX_OP_CODE_UNREG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+
+    ret = exec_rdmacm_mux_req(backend_dev, &msg);
+    if (ret) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, false,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return 0;
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp)
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp)
 {
     int i;
     int ret = 0;
     int num_ibv_devices;
     struct ibv_device **dev_list;
-    struct ibv_port_attr port_attr;
 
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
-    backend_dev->mad_chr_be = mad_chr_be;
-    backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
 
@@ -1053,9 +1214,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         backend_dev->ib_dev = *dev_list;
     }
 
-    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
-           ibv_get_device_name(backend_dev->ib_dev),
-           backend_dev->port_num, backend_dev->backend_gid_idx);
+    pr_dbg("Using backend device %s, port %d\n",
+           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
 
     backend_dev->context = ibv_open_device(backend_dev->ib_dev);
     if (!backend_dev->context) {
@@ -1072,20 +1232,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     }
     pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
 
-    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
-                         &port_attr);
-    if (ret) {
-        error_setg(errp, "Error %d from ibv_query_port", ret);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-
-    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
-        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
-                   port_attr.gid_tbl_len);
-        goto out_destroy_comm_channel;
-    }
-
     ret = init_device_caps(backend_dev, dev_attr);
     if (ret) {
         error_setg(errp, "Failed to initialize device capabilities");
@@ -1093,20 +1239,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         goto out_destroy_comm_channel;
     }
 
-    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
-                         backend_dev->backend_gid_idx, &backend_dev->gid);
-    if (ret) {
-        error_setg(errp, "Failed to query gid %d",
-                   backend_dev->backend_gid_idx);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
-    pr_dbg("interface_id=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.interface_id));
 
-    ret = mad_init(backend_dev);
+    ret = mad_init(backend_dev, mad_chr_be);
     if (ret) {
         error_setg(errp, "Fail to initialize mad");
         ret = -EIO;
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index fc83330251..59ad2b874b 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -28,11 +28,6 @@ enum ibv_special_qp_type {
     IBV_QPT_GSI = 1,
 };
 
-static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
-{
-    return &dev->gid;
-}
-
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
     return qp->ibqp ? qp->ibqp->qp_num : 1;
@@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp);
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
 void rdma_backend_register_comp_handler(void (*handler)(int status,
@@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
                                uint8_t qp_type, uint32_t qkey);
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey);
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
                               uint32_t sq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
@@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
+                            uint8_t sgid_idx, union ibv_gid *sgid,
                             union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
                             void *ctx);
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 2a7e667075..1e5c3dd3bf 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -19,6 +19,7 @@
 #include "qemu/thread.h"
 #include "chardev/char-fe.h"
 #include <infiniband/verbs.h>
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 
 typedef struct RdmaDeviceResources RdmaDeviceResources;
 
@@ -34,19 +35,22 @@ typedef struct RecvMadList {
     QList *list;
 } RecvMadList;
 
+typedef struct RdmaCmMux {
+    CharBackend *chr_be;
+    int can_receive;
+} RdmaCmMux;
+
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
-    union ibv_gid gid;
     PCIDevice *dev;
     RdmaDeviceResources *rdma_dev_res;
     struct ibv_device *ib_dev;
     struct ibv_context *context;
     struct ibv_comp_channel *channel;
     uint8_t port_num;
-    uint8_t backend_gid_idx;
     RecvMadList recv_mads_list;
-    CharBackend *mad_chr_be;
+    RdmaCmMux rdmacm_mux;
 } RdmaBackendDev;
 
 typedef struct RdmaBackendPD {
@@ -66,6 +70,7 @@ typedef struct RdmaBackendCQ {
 typedef struct RdmaBackendQP {
     struct ibv_pd *ibpd;
     struct ibv_qp *ibqp;
+    uint8_t sgid_idx;
 } RdmaBackendQP;
 
 #endif
diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 4f10fcabcc..250254561c 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -391,7 +391,7 @@ out_dealloc_qp:
 }
 
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn)
@@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int ret;
 
     pr_dbg("qpn=0x%x\n", qp_handle);
+    pr_dbg("qkey=0x%x\n", qkey);
 
     qp = rdma_rm_get_qp(dev_res, qp_handle);
     if (!qp) {
@@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         }
 
         if (qp->qp_state == IBV_QPS_RTR) {
+            /* Get backend gid index */
+            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
+            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
+                                                     sgid_idx);
+            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
+                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
+                return -EIO;
+            }
+
             ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
-                                            qp->qp_type, dgid, dqpn, rq_psn,
-                                            qkey, attr_mask & IBV_QP_QKEY);
+                                            qp->qp_type, sgid_idx, dgid, dqpn,
+                                            rq_psn, qkey,
+                                            attr_mask & IBV_QP_QKEY);
             if (ret) {
                 return -EIO;
             }
@@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
     res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
 }
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
+    if (rc) {
+        pr_dbg("Fail to add gid\n");
+        return -EINVAL;
+    }
+
+    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+
+    return 0;
+}
+
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_del_gid(backend_dev, ifname,
+                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+    if (rc) {
+        pr_dbg("Fail to delete gid\n");
+        return -EINVAL;
+    }
+
+    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
+    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+
+    return 0;
+}
+
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx)
+{
+    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
+        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
+        return -EINVAL;
+    }
+
+    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+        rdma_backend_get_gid_index(backend_dev,
+                                   &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+    }
+
+    pr_dbg("backend_gid_index=%d\n",
+           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+
+    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+}
+
 static void destroy_qp_hash_key(gpointer data)
 {
     g_bytes_unref(data);
 }
 
+static void init_ports(RdmaDeviceResources *dev_res)
+{
+    int i, j;
+
+    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        dev_res->ports[i].state = IBV_PORT_DOWN;
+        for (j = 0; j < MAX_PORT_GIDS; j++) {
+            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
+        }
+    }
+}
+
+static void fini_ports(RdmaDeviceResources *dev_res,
+                       RdmaBackendDev *backend_dev, const char *ifname)
+{
+    int i;
+
+    dev_res->ports[0].state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
+    }
+}
+
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp)
 {
@@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                        dev_attr->max_qp_wr, sizeof(void *));
     res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
 
+    init_ports(dev_res);
+
     return 0;
 }
 
-void rdma_rm_fini(RdmaDeviceResources *dev_res)
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname)
 {
+    fini_ports(dev_res, backend_dev, ifname);
+
     res_tbl_free(&dev_res->uc_tbl);
     res_tbl_free(&dev_res->cqe_ctx_tbl);
     res_tbl_free(&dev_res->qp_tbl);
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index b4e04cc7b4..a7169b4e89 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -22,7 +22,8 @@
 
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp);
-void rdma_rm_fini(RdmaDeviceResources *dev_res);
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname);
 
 int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
                      uint32_t *pd_handle, uint32_t ctx_handle);
@@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
                      uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
 RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn);
@@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
 void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx);
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx);
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx);
+static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
+                                             int sgid_idx)
+{
+    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+}
+
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 9b399063d3..7b3435f991 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -19,7 +19,7 @@
 #include "rdma_backend_defs.h"
 
 #define MAX_PORTS             1
-#define MAX_PORT_GIDS         1
+#define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
 #define MAX_PKEYS             MAX_PORT_PKEYS
@@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
     enum ibv_qp_state qp_state;
 } RdmaRmQP;
 
+typedef struct RdmaRmGid {
+    union ibv_gid gid;
+    int backend_gid_index;
+} RdmaRmGid;
+
 typedef struct RdmaRmPort {
-    union ibv_gid gid_tbl[MAX_PORT_GIDS];
+    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
     enum ibv_port_state state;
 } RdmaRmPort;
 
diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 04c7c2ef5b..989db249ef 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "hw/pci/pci.h"
 #include "sysemu/dma.h"
+#include "stdio.h"
 
 #define pr_info(fmt, ...) \
     fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
@@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
 #define pr_dbg(fmt, ...) \
     fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
             __func__, __LINE__, ## __VA_ARGS__)
+
+#define pr_dbg_buf(title, buf, len) \
+{ \
+    char *b = g_malloc0(len * 3 + 1); \
+    char b1[4]; \
+    for (int i = 0; i < len; i++) { \
+        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
+        strcat(b, b1); \
+    } \
+    pr_dbg("%s (%d): %s\n", title, len, b); \
+    g_free(b); \
+}
+
 #else
 #define init_pr_dbg(void)
 #define pr_dbg(fmt, ...)
+#define pr_dbg_buf(title, buf, len)
 #endif
 
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 15c3f28b86..b019cb843a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -79,8 +79,8 @@ typedef struct PVRDMADev {
     int interrupt_mask;
     struct ibv_device_attr dev_attr;
     uint64_t node_guid;
+    char *backend_eth_device_name;
     char *backend_device_name;
-    uint8_t backend_gid_idx;
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 57d6f41ae6..a334f6205e 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rsp->hdr.response = cmd->hdr.response;
     rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
 
-    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
-                                 cmd->qp_handle, cmd->attr_mask,
-                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
-                                 cmd->attrs.dest_qp_num,
-                                 (enum ibv_qp_state)cmd->attrs.qp_state,
-                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
-                                 cmd->attrs.sq_psn);
+    /* No need to verify sgid_index since it is u8 */
+
+    rsp->hdr.err =
+        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
+                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
+                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
+                          cmd->attrs.dest_qp_num,
+                          (enum ibv_qp_state)cmd->attrs.qp_state,
+                          cmd->attrs.qkey, cmd->attrs.rq_psn,
+                          cmd->attrs.sq_psn);
 
     pr_dbg("ret=%d\n", rsp->hdr.err);
     return rsp->hdr.err;
@@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                        union pvrdma_cmd_resp *rsp)
 {
     struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
-#ifdef PVRDMA_DEBUG
-    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
-    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
-#endif
+    int rc;
+    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
 
     pr_dbg("index=%d\n", cmd->index);
 
@@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
-           (long long unsigned int)be64_to_cpu(*subnet),
-           (long long unsigned int)be64_to_cpu(*if_id));
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
 
-    /* Driver forces to one port only */
-    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
-           sizeof(cmd->new_gid));
+    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                         dev->backend_eth_device_name, gid, cmd->index);
+    if (rc < 0) {
+        return -EINVAL;
+    }
 
     /* TODO: Since drivers stores node_guid at load_dsr phase then this
      * assignment is not relevant, i need to figure out a way how to
      * retrieve MAC of our netdev */
-    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
-    pr_dbg("dev->node_guid=0x%llx\n",
-           (long long unsigned int)be64_to_cpu(dev->node_guid));
+    if (!cmd->index) {
+        dev->node_guid =
+            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
+        pr_dbg("dev->node_guid=0x%llx\n",
+               (long long unsigned int)be64_to_cpu(dev->node_guid));
+    }
 
     return 0;
 }
@@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                         union pvrdma_cmd_resp *rsp)
 {
+    int rc;
+
     struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
 
     pr_dbg("index=%d\n", cmd->index);
@@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
-           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
+    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                        dev->backend_eth_device_name, cmd->index);
+
+    if (rc < 0) {
+        rsp->hdr.err = rc;
+        goto out;
+    }
 
     return 0;
 }
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fc2abd34af..ac8c092db0 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -36,9 +36,9 @@
 #include "pvrdma_qp_ops.h"
 
 static Property pvrdma_dev_properties[] = {
-    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
-    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
-    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
+    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
+    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
+    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
     DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
                        MAX_MR_SIZE),
     DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
@@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     pr_dbg("Initialized\n");
 }
 
-static void init_ports(PVRDMADev *dev, Error **errp)
-{
-    int i;
-
-    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
-
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
-    }
-}
-
 static void uninit_msix(PCIDevice *pdev, int used_vectors)
 {
     PVRDMADev *dev = PVRDMA_DEV(pdev);
@@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
 
     pvrdma_qp_ops_fini();
 
-    rdma_rm_fini(&dev->rdma_dev_res);
+    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
+                 dev->backend_eth_device_name);
 
     rdma_backend_fini(&dev->backend_dev);
 
@@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
-                           errp);
+                           &dev->dev_attr, &dev->mad_chr, errp);
     if (rc) {
         goto out;
     }
@@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
-    init_ports(dev, errp);
-
     rc = pvrdma_qp_ops_init();
     if (rc) {
         goto out;
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 3388be1926..2130824098 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
     RdmaRmQP *qp;
     PvrdmaSqWqe *wqe;
     PvrdmaRing *ring;
+    int sgid_idx;
+    union ibv_gid *sgid;
 
     pr_dbg("qp_handle=0x%x\n", qp_handle);
 
@@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.opcode = IBV_WC_SEND;
 
+        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
+        if (!sgid) {
+            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
+               sgid->global.interface_id);
+
+        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
+                                                 &dev->backend_dev,
+                                                 wqe->hdr.wr.ud.av.gid_index);
+        if (sgid_idx <= 0) {
+            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
+                   wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
+                               sgid_idx, sgid,
                                (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
                                wqe->hdr.wr.ud.remote_qpn,
                                wqe->hdr.wr.ud.remote_qkey, comp_ctx);
-- 
2.17.2
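
A note on the guest-to-backend GID index translation introduced by the patch
above: rdma_rm_get_backend_gid_index() resolves the backend's index for a
guest sgid_idx lazily and caches it in the gid table entry. A rough Python
model of that caching scheme (names and the backend lookup callback are
illustrative, not the QEMU code):

```python
INVALID = -1

class GidEntry:
    def __init__(self):
        self.gid = None
        self.backend_gid_index = INVALID  # resolved on first use

def get_backend_gid_index(gid_tbl, backend_lookup, sgid_idx, max_gids):
    """Translate a guest sgid_idx to the backend device's gid index,
    caching the result the way rdma_rm_get_backend_gid_index() does."""
    if sgid_idx < 0 or sgid_idx >= max_gids:
        raise IndexError("invalid sgid_idx %d" % sgid_idx)
    entry = gid_tbl[sgid_idx]
    if entry.backend_gid_index == INVALID:
        # Only the first lookup hits the backend; later calls are cached.
        entry.backend_gid_index = backend_lookup(entry.gid)
    return entry.backend_gid_index

# Toy backend: pretend the host stores this gid at index 3.
tbl = [GidEntry() for _ in range(4)]
tbl[1].gid = "fe80::1"
print(get_backend_gid_index(tbl, lambda gid: 3, 1, 4))  # 3
print(tbl[1].backend_gid_index)                         # 3 (cached)
```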

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [Qemu-devel] [PATCH v5 12/24] vmxnet3: Move some definitions to header file
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (10 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 13/24] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

A pvrdma setup requires a vmxnet3 device on PCI function 0 and the pvrdma
device on PCI function 1.
The pvrdma device needs to access the vmxnet3 device object for several
reasons:
1. To make sure PCI function 0 is vmxnet3.
2. To monitor the vmxnet3 device state.
3. To configure node_guid according to the vmxnet3 device's MAC address.

To make the vmxnet3 device accessible, the definition of VMXNET3State is
moved to a new header file.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Dmitry Fleytman <dmitry.fleytman@gmail.com>
---
 hw/net/vmxnet3.c      | 116 +-----------------------------------
 hw/net/vmxnet3_defs.h | 133 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 115 deletions(-)
 create mode 100644 hw/net/vmxnet3_defs.h

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 3648630386..54746a4030 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -18,7 +18,6 @@
 #include "qemu/osdep.h"
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
-#include "net/net.h"
 #include "net/tap.h"
 #include "net/checksum.h"
 #include "sysemu/sysemu.h"
@@ -29,6 +28,7 @@
 #include "migration/register.h"
 
 #include "vmxnet3.h"
+#include "vmxnet3_defs.h"
 #include "vmxnet_debug.h"
 #include "vmware_utils.h"
 #include "net_tx_pkt.h"
@@ -131,23 +131,11 @@ typedef struct VMXNET3Class {
     DeviceRealize parent_dc_realize;
 } VMXNET3Class;
 
-#define TYPE_VMXNET3 "vmxnet3"
-#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
-
 #define VMXNET3_DEVICE_CLASS(klass) \
     OBJECT_CLASS_CHECK(VMXNET3Class, (klass), TYPE_VMXNET3)
 #define VMXNET3_DEVICE_GET_CLASS(obj) \
     OBJECT_GET_CLASS(VMXNET3Class, (obj), TYPE_VMXNET3)
 
-/* Cyclic ring abstraction */
-typedef struct {
-    hwaddr pa;
-    uint32_t size;
-    uint32_t cell_size;
-    uint32_t next;
-    uint8_t gen;
-} Vmxnet3Ring;
-
 static inline void vmxnet3_ring_init(PCIDevice *d,
 				     Vmxnet3Ring *ring,
                                      hwaddr pa,
@@ -245,108 +233,6 @@ vmxnet3_dump_rx_descr(struct Vmxnet3_RxDesc *descr)
               descr->rsvd, descr->dtype, descr->ext1, descr->btype);
 }
 
-/* Device state and helper functions */
-#define VMXNET3_RX_RINGS_PER_QUEUE (2)
-
-typedef struct {
-    Vmxnet3Ring tx_ring;
-    Vmxnet3Ring comp_ring;
-
-    uint8_t intr_idx;
-    hwaddr tx_stats_pa;
-    struct UPT1_TxStats txq_stats;
-} Vmxnet3TxqDescr;
-
-typedef struct {
-    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
-    Vmxnet3Ring comp_ring;
-    uint8_t intr_idx;
-    hwaddr rx_stats_pa;
-    struct UPT1_RxStats rxq_stats;
-} Vmxnet3RxqDescr;
-
-typedef struct {
-    bool is_masked;
-    bool is_pending;
-    bool is_asserted;
-} Vmxnet3IntState;
-
-typedef struct {
-        PCIDevice parent_obj;
-        NICState *nic;
-        NICConf conf;
-        MemoryRegion bar0;
-        MemoryRegion bar1;
-        MemoryRegion msix_bar;
-
-        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
-        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
-
-        /* Whether MSI-X support was installed successfully */
-        bool msix_used;
-        hwaddr drv_shmem;
-        hwaddr temp_shared_guest_driver_memory;
-
-        uint8_t txq_num;
-
-        /* This boolean tells whether RX packet being indicated has to */
-        /* be split into head and body chunks from different RX rings  */
-        bool rx_packets_compound;
-
-        bool rx_vlan_stripping;
-        bool lro_supported;
-
-        uint8_t rxq_num;
-
-        /* Network MTU */
-        uint32_t mtu;
-
-        /* Maximum number of fragments for indicated TX packets */
-        uint32_t max_tx_frags;
-
-        /* Maximum number of fragments for indicated RX packets */
-        uint16_t max_rx_frags;
-
-        /* Index for events interrupt */
-        uint8_t event_int_idx;
-
-        /* Whether automatic interrupts masking enabled */
-        bool auto_int_masking;
-
-        bool peer_has_vhdr;
-
-        /* TX packets to QEMU interface */
-        struct NetTxPkt *tx_pkt;
-        uint32_t offload_mode;
-        uint32_t cso_or_gso_size;
-        uint16_t tci;
-        bool needs_vlan;
-
-        struct NetRxPkt *rx_pkt;
-
-        bool tx_sop;
-        bool skip_current_tx_pkt;
-
-        uint32_t device_active;
-        uint32_t last_command;
-
-        uint32_t link_status_and_speed;
-
-        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
-
-        uint32_t temp_mac;   /* To store the low part first */
-
-        MACAddr perm_mac;
-        uint32_t vlan_table[VMXNET3_VFT_SIZE];
-        uint32_t rx_mode;
-        MACAddr *mcast_list;
-        uint32_t mcast_list_len;
-        uint32_t mcast_list_buff_size; /* needed for live migration. */
-
-        /* Compatibility flags for migration */
-        uint32_t compat_flags;
-} VMXNET3State;
-
 /* Interrupt management */
 
 /*
diff --git a/hw/net/vmxnet3_defs.h b/hw/net/vmxnet3_defs.h
new file mode 100644
index 0000000000..6c19d29b12
--- /dev/null
+++ b/hw/net/vmxnet3_defs.h
@@ -0,0 +1,133 @@
+/*
+ * QEMU VMWARE VMXNET3 paravirtual NIC
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "net/net.h"
+#include "hw/net/vmxnet3.h"
+
+#define TYPE_VMXNET3 "vmxnet3"
+#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
+
+/* Device state and helper functions */
+#define VMXNET3_RX_RINGS_PER_QUEUE (2)
+
+/* Cyclic ring abstraction */
+typedef struct {
+    hwaddr pa;
+    uint32_t size;
+    uint32_t cell_size;
+    uint32_t next;
+    uint8_t gen;
+} Vmxnet3Ring;
+
+typedef struct {
+    Vmxnet3Ring tx_ring;
+    Vmxnet3Ring comp_ring;
+
+    uint8_t intr_idx;
+    hwaddr tx_stats_pa;
+    struct UPT1_TxStats txq_stats;
+} Vmxnet3TxqDescr;
+
+typedef struct {
+    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
+    Vmxnet3Ring comp_ring;
+    uint8_t intr_idx;
+    hwaddr rx_stats_pa;
+    struct UPT1_RxStats rxq_stats;
+} Vmxnet3RxqDescr;
+
+typedef struct {
+    bool is_masked;
+    bool is_pending;
+    bool is_asserted;
+} Vmxnet3IntState;
+
+typedef struct {
+        PCIDevice parent_obj;
+        NICState *nic;
+        NICConf conf;
+        MemoryRegion bar0;
+        MemoryRegion bar1;
+        MemoryRegion msix_bar;
+
+        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
+        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
+
+        /* Whether MSI-X support was installed successfully */
+        bool msix_used;
+        hwaddr drv_shmem;
+        hwaddr temp_shared_guest_driver_memory;
+
+        uint8_t txq_num;
+
+        /* This boolean tells whether RX packet being indicated has to */
+        /* be split into head and body chunks from different RX rings  */
+        bool rx_packets_compound;
+
+        bool rx_vlan_stripping;
+        bool lro_supported;
+
+        uint8_t rxq_num;
+
+        /* Network MTU */
+        uint32_t mtu;
+
+        /* Maximum number of fragments for indicated TX packets */
+        uint32_t max_tx_frags;
+
+        /* Maximum number of fragments for indicated RX packets */
+        uint16_t max_rx_frags;
+
+        /* Index for events interrupt */
+        uint8_t event_int_idx;
+
+        /* Whether automatic interrupts masking enabled */
+        bool auto_int_masking;
+
+        bool peer_has_vhdr;
+
+        /* TX packets to QEMU interface */
+        struct NetTxPkt *tx_pkt;
+        uint32_t offload_mode;
+        uint32_t cso_or_gso_size;
+        uint16_t tci;
+        bool needs_vlan;
+
+        struct NetRxPkt *rx_pkt;
+
+        bool tx_sop;
+        bool skip_current_tx_pkt;
+
+        uint32_t device_active;
+        uint32_t last_command;
+
+        uint32_t link_status_and_speed;
+
+        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
+
+        uint32_t temp_mac;   /* To store the low part first */
+
+        MACAddr perm_mac;
+        uint32_t vlan_table[VMXNET3_VFT_SIZE];
+        uint32_t rx_mode;
+        MACAddr *mcast_list;
+        uint32_t mcast_list_len;
+        uint32_t mcast_list_buff_size; /* needed for live migration. */
+
+        /* Compatibility flags for migration */
+        uint32_t compat_flags;
+} VMXNET3State;
-- 
2.17.2


* [Qemu-devel] [PATCH v5 13/24] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (11 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 12/24] vmxnet3: Move some definitions to header file Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-25  7:31   ` Marcel Apfelbaum
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 14/24] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The guest driver enforces this; we should too.
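
For illustration, a guest topology that satisfies this check places both
functions in one PCI slot; a command line might look like the sketch below
(a hypothetical invocation: the netdev id, host interface name, and ibdev
values are placeholders, and the pvrdma properties follow the renaming done
earlier in this series):

```shell
qemu-system-x86_64 ... \
    -netdev tap,id=net0 \
    -device vmxnet3,netdev=net0,addr=10.0,multifunction=on \
    -device pvrdma,addr=10.1,ibdev=mlx5_0,ibport=1,netdev=ens3
```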

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      |  2 ++
 hw/rdma/vmw/pvrdma_main.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index b019cb843a..10a3c4fb7c 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -20,6 +20,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
+#include "hw/net/vmxnet3_defs.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -85,6 +86,7 @@ typedef struct PVRDMADev {
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
+    VMXNET3State *func0;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ac8c092db0..b35b5dc5f0 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -565,6 +565,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
     PVRDMADev *dev = PVRDMA_DEV(pdev);
     Object *memdev_root;
     bool ram_shared = false;
+    PCIDevice *func0;
 
     init_pr_dbg();
 
@@ -576,6 +577,17 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
+    func0 = pci_get_function_0(pdev);
+    /* Break if not vmxnet3 device in slot 0 */
+    if (strcmp(object_get_typename(&func0->qdev.parent_obj), TYPE_VMXNET3)) {
+        pr_dbg("func0 type is %s\n",
+               object_get_typename(&func0->qdev.parent_obj));
+        error_setg(errp, "Device on %x.0 must be %s", PCI_SLOT(pdev->devfn),
+                   TYPE_VMXNET3);
+        return;
+    }
+    dev->func0 = VMXNET3(func0);
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2


* [Qemu-devel] [PATCH v5 14/24] hw/rdma: Initialize node_guid from vmxnet3 mac address
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (12 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 13/24] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 15/24] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

node_guid should be set once the device is loaded.
Set node_guid to the modified EUI-64 (64-bit) GID format of the PCI
function 0 vmxnet3 device's MAC address.

A new function was added to do the conversion.
So, for example, the MAC 56:b6:44:e9:62:dc is converted to the GID
54b6:44ff:fee9:62dc.
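
The conversion above is the usual EUI-48 to EUI-64 expansion: insert FF:FE
in the middle of the MAC and flip the universal/local bit. A small Python
sketch of the helper added by this patch, reproducing the example:

```python
def addrconf_addr_eui48(mac: bytes) -> bytes:
    # EUI-48 -> EUI-64: splice FF:FE between the two MAC halves
    # and invert the universal/local bit of the first octet.
    eui = bytearray(mac[:3]) + b"\xff\xfe" + bytearray(mac[3:])
    eui[0] ^= 2
    return bytes(eui)

mac = bytes.fromhex("56b644e962dc")        # 56:b6:44:e9:62:dc
eui = addrconf_addr_eui48(mac)
gid = ":".join(eui.hex()[i:i + 4] for i in range(0, 16, 4))
print(gid)  # 54b6:44ff:fee9:62dc
```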

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_utils.h      |  9 +++++++++
 hw/rdma/vmw/pvrdma_cmd.c  | 10 ----------
 hw/rdma/vmw/pvrdma_main.c |  5 ++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 989db249ef..202abb3366 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -63,4 +63,13 @@ extern unsigned long pr_dbg_cnt;
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
 void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
 
+static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
+{
+    memcpy(eui, addr, 3);
+    eui[3] = 0xFF;
+    eui[4] = 0xFE;
+    memcpy(eui + 5, addr + 3, 3);
+    eui[0] ^= 2;
+}
+
 #endif
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index a334f6205e..2979582fac 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -592,16 +592,6 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    /* TODO: Since drivers stores node_guid at load_dsr phase then this
-     * assignment is not relevant, i need to figure out a way how to
-     * retrieve MAC of our netdev */
-    if (!cmd->index) {
-        dev->node_guid =
-            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
-        pr_dbg("dev->node_guid=0x%llx\n",
-               (long long unsigned int)be64_to_cpu(dev->node_guid));
-    }
-
     return 0;
 }
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index b35b5dc5f0..150404dfa6 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -264,7 +264,7 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     dsr->caps.sys_image_guid = 0;
     pr_dbg("sys_image_guid=%" PRIx64 "\n", dsr->caps.sys_image_guid);
 
-    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+    dsr->caps.node_guid = dev->node_guid;
     pr_dbg("node_guid=%" PRIx64 "\n", be64_to_cpu(dsr->caps.node_guid));
 
     dsr->caps.phys_port_cnt = MAX_PORTS;
@@ -588,6 +588,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
     }
     dev->func0 = VMXNET3(func0);
 
+    addrconf_addr_eui48((unsigned char *)&dev->node_guid,
+                        (const char *)&dev->func0->conf.macaddr.a);
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2


* [Qemu-devel] [PATCH v5 15/24] hw/pvrdma: Make device state depend on Ethernet function state
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (13 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 14/24] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 16/24] hw/pvrdma: Fill all CQE fields Yuval Shaia
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The user should be able to control the device by changing the Ethernet
function state, so if the user runs 'ifconfig ens3 down' the PVRDMA
function should go down as well.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 2979582fac..0d3c818c20 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -139,7 +139,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
     resp->hdr.err = 0;
 
-    resp->attrs.state = attrs.state;
+    resp->attrs.state = dev->func0->device_active ? attrs.state :
+                                                    PVRDMA_PORT_DOWN;
     resp->attrs.max_mtu = attrs.max_mtu;
     resp->attrs.active_mtu = attrs.active_mtu;
     resp->attrs.phys_state = attrs.phys_state;
-- 
2.17.2


* [Qemu-devel] [PATCH v5 16/24] hw/pvrdma: Fill all CQE fields
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (14 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 15/24] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response Yuval Shaia
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

Add the ability to pass specific WC attributes, such as the IBV_WC_GRH
flag, through to the CQE.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_backend.c      | 59 +++++++++++++++++++++++--------------
 hw/rdma/rdma_backend.h      |  4 +--
 hw/rdma/vmw/pvrdma_qp_ops.c | 31 +++++++++++--------
 3 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 8b5a111bf4..6a1e39d4c0 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -60,13 +60,24 @@ struct backend_umad {
     char mad[RDMA_MAX_PRIVATE_DATA];
 };
 
-static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
+static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
 
-static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     pr_err("No completion handler is registered\n");
 }
 
+static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
+                                 void *ctx)
+{
+    struct ibv_wc wc = {0};
+
+    wc.status = status;
+    wc.vendor_err = vendor_err;
+
+    comp_handler(ctx, &wc);
+}
+
 static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
 {
     int i, ne;
@@ -91,7 +102,7 @@ static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
             }
             pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
 
-            comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
+            comp_handler(bctx->up_ctx, &wc[i]);
 
             rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
             g_free(bctx);
@@ -250,8 +261,8 @@ static void start_comp_thread(RdmaBackendDev *backend_dev)
                        comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
 }
 
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx))
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                         struct ibv_wc *wc))
 {
     comp_handler = handler;
 }
@@ -445,14 +456,14 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
-                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+                complete_work(IBV_WC_SUCCESS, 0, ctx);
             }
         }
         return;
@@ -461,7 +472,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -472,21 +483,21 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
         }
         wr.wr.ud.remote_qpn = dqpn;
@@ -504,7 +515,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -573,13 +584,13 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
             }
         }
         return;
@@ -588,7 +599,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -599,14 +610,14 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -618,7 +629,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post recv WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -992,9 +1003,10 @@ static void process_incoming_mad_req(RdmaBackendDev *backend_dev,
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
     if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
-                     bctx->up_ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                      bctx->up_ctx);
     } else {
+        struct ibv_wc wc = {0};
         pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
@@ -1003,7 +1015,10 @@ static void process_incoming_mad_req(RdmaBackendDev *backend_dev,
         memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
-        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+        wc.byte_len = msg->umad_len;
+        wc.status = IBV_WC_SUCCESS;
+        wc.wc_flags = IBV_WC_GRH;
+        comp_handler(bctx->up_ctx, &wc);
     }
 
     g_free(bctx);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 59ad2b874b..8cae40f827 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -57,8 +57,8 @@ int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
                                union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx));
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                        struct ibv_wc *wc));
 void rdma_backend_unregister_comp_handler(void);
 
 int rdma_backend_query_port(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 2130824098..300471a4c9 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -47,7 +47,7 @@ typedef struct PvrdmaRqWqe {
  * 3. Interrupt host
  */
 static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
-                           struct pvrdma_cqe *cqe)
+                           struct pvrdma_cqe *cqe, struct ibv_wc *wc)
 {
     struct pvrdma_cqe *cqe1;
     struct pvrdma_cqne *cqne;
@@ -66,6 +66,7 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pr_dbg("Writing CQE\n");
     cqe1 = pvrdma_ring_next_elem_write(ring);
     if (unlikely(!cqe1)) {
+        pr_dbg("No CQEs in ring\n");
         return -EINVAL;
     }
 
@@ -73,8 +74,20 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     cqe1->wr_id = cqe->wr_id;
     cqe1->qp = cqe->qp;
     cqe1->opcode = cqe->opcode;
-    cqe1->status = cqe->status;
-    cqe1->vendor_err = cqe->vendor_err;
+    cqe1->status = wc->status;
+    cqe1->byte_len = wc->byte_len;
+    cqe1->src_qp = wc->src_qp;
+    cqe1->wc_flags = wc->wc_flags;
+    cqe1->vendor_err = wc->vendor_err;
+
+    pr_dbg("wr_id=%" PRIx64 "\n", cqe1->wr_id);
+    pr_dbg("qp=0x%lx\n", cqe1->qp);
+    pr_dbg("opcode=%d\n", cqe1->opcode);
+    pr_dbg("status=%d\n", cqe1->status);
+    pr_dbg("byte_len=%d\n", cqe1->byte_len);
+    pr_dbg("src_qp=%d\n", cqe1->src_qp);
+    pr_dbg("wc_flags=%d\n", cqe1->wc_flags);
+    pr_dbg("vendor_err=%d\n", cqe1->vendor_err);
 
     pvrdma_ring_write_inc(ring);
 
@@ -99,18 +112,12 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     return 0;
 }
 
-static void pvrdma_qp_ops_comp_handler(int status, unsigned int vendor_err,
-                                       void *ctx)
+static void pvrdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
 
-    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
-    pr_dbg("wr_id=%" PRIx64 "\n", comp_ctx->cqe.wr_id);
-    pr_dbg("status=%d\n", status);
-    pr_dbg("vendor_err=0x%x\n", vendor_err);
-    comp_ctx->cqe.status = status;
-    comp_ctx->cqe.vendor_err = vendor_err;
-    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
+    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe, wc);
+
     g_free(ctx);
 }
 
-- 
2.17.2


* [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (15 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 16/24] hw/pvrdma: Fill all CQE fields Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-25  7:40   ` Marcel Apfelbaum
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 18/24] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The driver checks the error code, so let's set it.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 0d3c818c20..a326c5d470 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     if (rdma_backend_query_port(&dev->backend_dev,
                                 (struct ibv_port_attr *)&attrs)) {
-        return -ENOMEM;
+        resp->hdr.err = -ENOMEM;
+        goto out;
     }
 
     memset(resp, 0, sizeof(*resp));
@@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->attrs.active_width = 1;
     resp->attrs.active_speed = 1;
 
-    return 0;
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
 }
 
 static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->pkey = PVRDMA_PKEY;
     pr_dbg("pkey=0x%x\n", resp->pkey);
 
-    return 0;
+    return resp->hdr.err;
 }
 
 static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
@@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
     cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
     if (!cq) {
         pr_dbg("Invalid CQ handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     ring = (PvrdmaRing *)cq->opaque;
@@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
@@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
     if (!qp) {
         pr_dbg("Invalid QP handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
@@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
     g_free(ring);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
@@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
                          dev->backend_eth_device_name, gid, cmd->index);
     if (rc < 0) {
-        return -EINVAL;
+        rsp->hdr.err = rc;
+        goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
@@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
                                      &resp->ctx_handle);
 
-    pr_dbg("ret=%d\n", resp->hdr.err);
-
-    return 0;
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 struct cmd_handler {
     uint32_t cmd;
@@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
     }
 
     err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
-                            dsr_info->rsp);
+                                                    dsr_info->rsp);
 out:
     set_reg_val(dev, PVRDMA_REG_ERR, err);
     post_interrupt(dev, INTR_VEC_CMD_RING);
-- 
2.17.2


* [Qemu-devel] [PATCH v5 18/24] hw/rdma: Remove unneeded code that handles more that one port
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (16 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 19/24] vl: Introduce shutdown_notifiers Yuval Shaia
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The device supports only one port, so let's remove the dead code that
handles more than one port.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_rm.c      | 34 ++++++++++++++++------------------
 hw/rdma/rdma_rm.h      |  2 +-
 hw/rdma/rdma_rm_defs.h |  4 ++--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 250254561c..b7d4ebe972 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -545,7 +545,7 @@ int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         return -EINVAL;
     }
 
-    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+    memcpy(&dev_res->port.gid_tbl[gid_idx].gid, gid, sizeof(*gid));
 
     return 0;
 }
@@ -556,15 +556,15 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int rc;
 
     rc = rdma_backend_del_gid(backend_dev, ifname,
-                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+                              &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc) {
         pr_dbg("Fail to delete gid\n");
         return -EINVAL;
     }
 
-    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
-           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
-    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+    memset(dev_res->port.gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->port.gid_tbl[gid_idx].gid));
+    dev_res->port.gid_tbl[gid_idx].backend_gid_index = -1;
 
     return 0;
 }
@@ -577,16 +577,16 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
         return -EINVAL;
     }
 
-    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
-        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+    if (unlikely(dev_res->port.gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->port.gid_tbl[sgid_idx].backend_gid_index =
         rdma_backend_get_gid_index(backend_dev,
-                                   &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+                                   &dev_res->port.gid_tbl[sgid_idx].gid);
     }
 
     pr_dbg("backend_gid_index=%d\n",
-           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+           dev_res->port.gid_tbl[sgid_idx].backend_gid_index);
 
-    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+    return dev_res->port.gid_tbl[sgid_idx].backend_gid_index;
 }
 
 static void destroy_qp_hash_key(gpointer data)
@@ -596,15 +596,13 @@ static void destroy_qp_hash_key(gpointer data)
 
 static void init_ports(RdmaDeviceResources *dev_res)
 {
-    int i, j;
+    int i;
 
-    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+    memset(&dev_res->port, 0, sizeof(dev_res->port));
 
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev_res->ports[i].state = IBV_PORT_DOWN;
-        for (j = 0; j < MAX_PORT_GIDS; j++) {
-            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
-        }
+    dev_res->port.state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        dev_res->port.gid_tbl[i].backend_gid_index = -1;
     }
 }
 
@@ -613,7 +611,7 @@ static void fini_ports(RdmaDeviceResources *dev_res,
 {
     int i;
 
-    dev_res->ports[0].state = IBV_PORT_DOWN;
+    dev_res->port.state = IBV_PORT_DOWN;
     for (i = 0; i < MAX_PORT_GIDS; i++) {
         rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
     }
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index a7169b4e89..3c602c04c0 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -79,7 +79,7 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
 static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
                                              int sgid_idx)
 {
-    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+    return &dev_res->port.gid_tbl[sgid_idx].gid;
 }
 
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7b3435f991..0ba61d1838 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -18,7 +18,7 @@
 
 #include "rdma_backend_defs.h"
 
-#define MAX_PORTS             1
+#define MAX_PORTS             1 /* Do not change - we support only one port */
 #define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
@@ -97,7 +97,7 @@ typedef struct RdmaRmPort {
 } RdmaRmPort;
 
 typedef struct RdmaDeviceResources {
-    RdmaRmPort ports[MAX_PORTS];
+    RdmaRmPort port;
     RdmaRmResTbl pd_tbl;
     RdmaRmResTbl mr_tbl;
     RdmaRmResTbl uc_tbl;
-- 
2.17.2


* [Qemu-devel] [PATCH v5 19/24] vl: Introduce shutdown_notifiers
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (17 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 18/24] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 20/24] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

A notifier will be used to signal a shutdown event, informing listeners
that the system is shutting down. This will allow devices and other
components to run cleanup code before the VM shuts down.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 include/sysemu/sysemu.h |  1 +
 vl.c                    | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8d6095d98b..0d15f16492 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -80,6 +80,7 @@ void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_system_shutdown_request(ShutdownCause reason);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
+void qemu_register_shutdown_notifier(Notifier *notifier);
 void qemu_system_debug_request(void);
 void qemu_system_vmstop_request(RunState reason);
 void qemu_system_vmstop_request_prepare(void);
diff --git a/vl.c b/vl.c
index fa25d1ae2d..48177f7dd1 100644
--- a/vl.c
+++ b/vl.c
@@ -1578,6 +1578,8 @@ static NotifierList suspend_notifiers =
     NOTIFIER_LIST_INITIALIZER(suspend_notifiers);
 static NotifierList wakeup_notifiers =
     NOTIFIER_LIST_INITIALIZER(wakeup_notifiers);
+static NotifierList shutdown_notifiers =
+    NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
 
 ShutdownCause qemu_shutdown_requested_get(void)
@@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
     notifier_list_notify(&powerdown_notifiers, NULL);
 }
 
+static void qemu_system_shutdown(ShutdownCause cause)
+{
+    qapi_event_send_shutdown(shutdown_caused_by_guest(cause));
+    notifier_list_notify(&shutdown_notifiers, &cause);
+}
+
 void qemu_system_powerdown_request(void)
 {
     trace_qemu_system_powerdown_request();
@@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
     notifier_list_add(&powerdown_notifiers, notifier);
 }
 
+void qemu_register_shutdown_notifier(Notifier *notifier)
+{
+    notifier_list_add(&shutdown_notifiers, notifier);
+}
+
 void qemu_system_debug_request(void)
 {
     debug_requested = 1;
@@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
     request = qemu_shutdown_requested();
     if (request) {
         qemu_kill_report();
-        qapi_event_send_shutdown(shutdown_caused_by_guest(request));
+        qemu_system_shutdown(request);
         if (no_shutdown) {
             vm_stop(RUN_STATE_SHUTDOWN);
         } else {
-- 
2.17.2


* [Qemu-devel] [PATCH v5 20/24] hw/pvrdma: Clean device's resource when system is shutdown
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (18 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 19/24] vl: Introduce shutdown_notifiers Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 21/24] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

In order to clean up external resources such as GIDs, QPs, etc.,
register to receive a notification when the VM is shut down.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/vmw/pvrdma.h      |  2 ++
 hw/rdma/vmw/pvrdma_main.c | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 10a3c4fb7c..ffae36986e 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -17,6 +17,7 @@
 #define PVRDMA_PVRDMA_H
 
 #include "qemu/units.h"
+#include "qemu/notify.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
@@ -87,6 +88,7 @@ typedef struct PVRDMADev {
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
     VMXNET3State *func0;
+    Notifier shutdown_notifier;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 150404dfa6..23dc9926e3 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -24,6 +24,7 @@
 #include "hw/qdev-properties.h"
 #include "cpu.h"
 #include "trace.h"
+#include "sysemu/sysemu.h"
 
 #include "../rdma_rm.h"
 #include "../rdma_backend.h"
@@ -334,6 +335,9 @@ static void pvrdma_fini(PCIDevice *pdev)
     if (msix_enabled(pdev)) {
         uninit_msix(pdev, RDMA_MAX_INTRS);
     }
+
+    pr_dbg("Device %s %x.%x is down\n", pdev->name, PCI_SLOT(pdev->devfn),
+           PCI_FUNC(pdev->devfn));
 }
 
 static void pvrdma_stop(PVRDMADev *dev)
@@ -559,6 +563,14 @@ static int pvrdma_check_ram_shared(Object *obj, void *opaque)
     return 0;
 }
 
+static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
+{
+    PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    pvrdma_fini(pci_dev);
+}
+
 static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 {
     int rc;
@@ -632,6 +644,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
+    dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
+    qemu_register_shutdown_notifier(&dev->shutdown_notifier);
+
 out:
     if (rc) {
         error_append_hint(errp, "Device fail to load\n");
-- 
2.17.2


* [Qemu-devel] [PATCH v5 21/24] hw/rdma: Do not use bitmap_zero_extend to free bitmap
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (19 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 20/24] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
@ 2018-11-22 12:13 ` Yuval Shaia
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 22/24] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:13 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

bitmap_zero_extend is designed for extending a bitmap, not for
shrinking one.
Use g_free instead.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_rm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index b7d4ebe972..ca127c8c26 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -43,7 +43,7 @@ static inline void res_tbl_free(RdmaRmResTbl *tbl)
 {
     qemu_mutex_destroy(&tbl->lock);
     g_free(tbl->tbl);
-    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
+    g_free(tbl->bitmap);
 }
 
 static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 22/24] hw/rdma: Do not call rdma_backend_del_gid on an empty gid
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (20 preceding siblings ...)
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 21/24] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
@ 2018-11-22 12:14 ` Yuval Shaia
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown Yuval Shaia
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 24/24] docs: Update pvrdma device documentation Yuval Shaia
  23 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:14 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

When the device goes down, the function fini_ports loops over all
entries in the GID table regardless of whether an entry is valid or
not. In case the entry is not valid we'd like to skip any further
processing in the backend device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
---
 hw/rdma/rdma_rm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index ca127c8c26..f5b1295890 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -555,6 +555,10 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
 {
     int rc;
 
+    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
+        return 0;
+    }
+
     rc = rdma_backend_del_gid(backend_dev, ifname,
                               &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc) {
-- 
2.17.2


* [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (21 preceding siblings ...)
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 22/24] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
@ 2018-11-22 12:14 ` Yuval Shaia
  2018-11-25  7:30   ` Yuval Shaia
  2018-11-25  7:41   ` Marcel Apfelbaum
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 24/24] docs: Update pvrdma device documentation Yuval Shaia
  23 siblings, 2 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:14 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

All resources are already cleaned up in the rm_fini phase.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 6a1e39d4c0..8ab25e94b1 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -1075,28 +1075,9 @@ static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
 
 static void mad_stop(RdmaBackendDev *backend_dev)
 {
-    QObject *o_ctx_id;
-    unsigned long cqe_ctx_id;
-    BackendCtx *bctx;
-
-    pr_dbg("Closing MAD\n");
+    pr_dbg("Stopping MAD\n");
 
     disable_rdmacm_mux_async(backend_dev);
-
-    /* Clear MAD buffers list */
-    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
-    do {
-        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
-        if (o_ctx_id) {
-            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
-            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
-            if (bctx) {
-                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
-                g_free(bctx);
-            }
-        }
-    } while (o_ctx_id);
-    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
 }
 
 static void mad_fini(RdmaBackendDev *backend_dev)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 24/24] docs: Update pvrdma device documentation
  2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
                   ` (22 preceding siblings ...)
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown Yuval Shaia
@ 2018-11-22 12:14 ` Yuval Shaia
       [not found]   ` <8b89bfaf-be29-e043-32fa-9615fb4ea0f7@gmail.com>
  23 siblings, 1 reply; 39+ messages in thread
From: Yuval Shaia @ 2018-11-22 12:14 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch, cohuck

The interface with the device has changed with the addition of support
for MAD packets.
Adjust the documentation accordingly.

While there, fix a minor mistake which may lead one to think that there
is a relation between using RXE on the host and compatibility with
bare-metal peers.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 84 insertions(+), 19 deletions(-)

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
index 5599318159..f82b2a69d2 100644
--- a/docs/pvrdma.txt
+++ b/docs/pvrdma.txt
@@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need for any special guest
 modifications.
 
 While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
-can work with Soft-RoCE (rxe).
+metal RDMA-enabled machines as peers.
+
+It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
 
 It does not require the whole guest RAM to be pinned allowing memory
 over-commit and, even if not implemented yet, migration support will be
@@ -78,29 +79,93 @@ the required RDMA libraries.
 
 3. Usage
 ========
+
+
+3.1 VM Memory settings
+======================
 Currently the device is working only with memory backed RAM
 and it must be marked as "shared":
    -m 1G \
    -object memory-backend-ram,id=mb1,size=1G,share \
    -numa node,memdev=mb1 \
 
-The pvrdma device is composed of two functions:
- - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
-   but is required to pass the ibdevice GID using its MAC.
-   Examples:
-     For an rxe backend using eth0 interface it will use its mac:
-       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
-     For an SRIOV VF, we take the Ethernet Interface exposed by it:
-       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
- - Function 1 is the actual device:
-       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
-   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
- Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
- The rules of conversion are part of the RoCE spec, but since manual conversion
- is not required, spotting problems is not hard:
-    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
-             MAC: 7c:fe:90:cb:74:3a
-    Note the difference between the first byte of the MAC and the GID.
+
+3.2 MAD Multiplexer
+===================
+The MAD Multiplexer is a service that exposes a MAD-like interface for VMs
+in order to overcome the limitation that only a single entity can register
+with the MAD layer to send and receive RDMA-CM MAD packets.
+
+To build rdmacm-mux run
+# make rdmacm-mux
+
+The application accepts 3 command line arguments and exposes a UNIX socket
+to pass control and data to it.
+-s unix-socket-path   Path to unix socket to listen on
+                      (default /var/run/rdmacm-mux)
+-d rdma-device-name   Name of RDMA device to register with
+                      (default rxe0)
+-p rdma-device-port   Port number of RDMA device to register with
+                      (default 1)
+The final UNIX socket file name is a concatenation of the 3 arguments,
+so for example for device mlx5_0 on port 2 the socket
+/var/run/rdmacm-mux-mlx5_0-2 will be created.
+
+Please refer to contrib/rdmacm-mux for more details.
+
+
+3.3 PCI devices settings
+========================
+A RoCE device exposes two functions - an Ethernet function and an RDMA one.
+To support this, the pvrdma device is composed of two PCI functions: an
+Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on
+PCI function 1 of the same slot. The Ethernet function can be used for
+other Ethernet purposes such as IP.
+
+
+3.4 Device parameters
+=====================
+- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
+  would be the Ethernet device used to create it. For any other physical
+  RoCE device this would be the netdev name of the device.
+- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
+- mad-chardev: The name of the MAD multiplexer char device.
+- ibport: In case of a multi-port device (such as Mellanox's HCA) this
+  specifies the port to use. If not set, 1 will be used.
+- dev-caps-max-mr-size: The maximum size of MR.
+- dev-caps-max-qp: Maximum number of QPs.
+- dev-caps-max-sge: Maximum number of SGE elements in WR.
+- dev-caps-max-cq: Maximum number of CQs.
+- dev-caps-max-mr: Maximum number of MRs.
+- dev-caps-max-pd: Maximum number of PDs.
+- dev-caps-max-ah: Maximum number of AHs.
+
+Notes:
+- The first 3 parameters are mandatory settings, the rest have
+  defaults.
+- The dev-caps parameters define the upper limits; the final values are
+  adjusted by the backend device's limitations.
+
+3.5 Example
+===========
+Define bridge device with vmxnet3 network backend:
+<interface type='bridge'>
+  <mac address='56:b4:44:e9:62:dc'/>
+  <source bridge='bridge1'/>
+  <model type='vmxnet3'/>
+  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
+</interface>
+
+Define pvrdma device:
+<qemu:commandline>
+  <qemu:arg value='-object'/>
+  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
+  <qemu:arg value='-numa'/>
+  <qemu:arg value='node,memdev=mb1'/>
+  <qemu:arg value='-chardev'/>
+  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
+  <qemu:arg value='-device'/>
+  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
+</qemu:commandline>
 
 
 
-- 
2.17.2


* Re: [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-25  7:05   ` Marcel Apfelbaum
  2018-11-25  7:27     ` Yuval Shaia
  0 siblings, 1 reply; 39+ messages in thread
From: Marcel Apfelbaum @ 2018-11-25  7:05 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/22/18 2:13 PM, Yuval Shaia wrote:
> MAD (Management Datagram) packets are widely used by various modules
> both in kernel and in user space for example the rdma_* API which is
> used to create and maintain "connection" layer on top of RDMA uses
> several types of MAD packets.
>
> For more information please refer to chapter 13.4 in Volume 1
> Architecture Specification, Release 1.1 available here:
> https://www.infinibandta.org/ibta-specifications-download/
>
> To support MAD packets the device uses an external utility
> (contrib/rdmacm-mux) to relay packets from and to the guest driver.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c      | 275 +++++++++++++++++++++++++++++++++++-
>   hw/rdma/rdma_backend.h      |   4 +-
>   hw/rdma/rdma_backend_defs.h |  10 +-
>   hw/rdma/vmw/pvrdma.h        |   2 +
>   hw/rdma/vmw/pvrdma_main.c   |   4 +-
>   5 files changed, 285 insertions(+), 10 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 1e148398a2..7c220a5798 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -16,8 +16,13 @@
>   #include "qemu/osdep.h"
>   #include "qemu/error-report.h"
>   #include "qapi/error.h"
> +#include "qapi/qmp/qlist.h"
> +#include "qapi/qmp/qnum.h"
>   
>   #include <infiniband/verbs.h>
> +#include <infiniband/umad_types.h>
> +#include <infiniband/umad.h>
> +#include <rdma/rdma_user_cm.h>
>   
>   #include "trace.h"
>   #include "rdma_utils.h"
> @@ -33,16 +38,25 @@
>   #define VENDOR_ERR_MAD_SEND         0x206
>   #define VENDOR_ERR_INVLKEY          0x207
>   #define VENDOR_ERR_MR_SMALL         0x208
> +#define VENDOR_ERR_INV_MAD_BUFF     0x209
> +#define VENDOR_ERR_INV_NUM_SGE      0x210
>   
>   #define THR_NAME_LEN 16
>   #define THR_POLL_TO  5000
>   
> +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
> +
>   typedef struct BackendCtx {
> -    uint64_t req_id;
>       void *up_ctx;
>       bool is_tx_req;
> +    struct ibv_sge sge; /* Used to save MAD recv buffer */
>   } BackendCtx;
>   
> +struct backend_umad {
> +    struct ib_user_mad hdr;
> +    char mad[RDMA_MAX_PRIVATE_DATA];
> +};
> +
>   static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
>   
>   static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> @@ -286,6 +300,61 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
>       return 0;
>   }
>   
> +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> +                    uint32_t num_sge)
> +{
> +    struct backend_umad umad = {0};
> +    char *hdr, *msg;
> +    int ret;
> +
> +    pr_dbg("num_sge=%d\n", num_sge);
> +
> +    if (num_sge != 2) {
> +        return -EINVAL;
> +    }
> +
> +    umad.hdr.length = sge[0].length + sge[1].length;
> +    pr_dbg("msg_len=%d\n", umad.hdr.length);
> +
> +    if (umad.hdr.length > sizeof(umad.mad)) {
> +        return -ENOMEM;
> +    }
> +
> +    umad.hdr.addr.qpn = htobe32(1);
> +    umad.hdr.addr.grh_present = 1;
> +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> +    umad.hdr.addr.hop_limit = 1;
> +
> +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> +    if (!hdr) {
> +        pr_dbg("Fail to map to sge[0]\n");
> +        return -ENOMEM;
> +    }
> +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> +    if (!msg) {
> +        pr_dbg("Fail to map to sge[1]\n");
> +        rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> +        return -ENOMEM;
> +    }
> +
> +    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
> +    pr_dbg_buf("mad_data", data, sge[1].length);
> +
> +    memcpy(&umad.mad[0], hdr, sge[0].length);
> +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> +
> +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> +
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> +                            sizeof(umad));
> +
> +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> +
> +    return (ret != sizeof(umad));
> +}
> +
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> @@ -304,9 +373,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           } else if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            rc = mad_send(backend_dev, sge, num_sge);
> +            if (rc) {
> +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            } else {
> +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> +            }
>           }
> -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
>           return;
>       }
>   
> @@ -370,6 +443,48 @@ out_free_bctx:
>       g_free(bctx);
>   }
>   
> +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
> +                                         struct ibv_sge *sge, uint32_t num_sge,
> +                                         void *ctx)
> +{
> +    BackendCtx *bctx;
> +    int rc;
> +    uint32_t bctx_id;
> +
> +    if (num_sge != 1) {
> +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);

Maybe not for this patch set, but please consider using
QEMU error report facilities for errors instead of debug messages
when necessary.

The same goes for using traces on the normal flow instead of pr_dbg,
but this has a lower priority than error reporting.

> +        return VENDOR_ERR_INV_NUM_SGE;
> +    }
> +
> +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
> +        pr_dbg("Too small buffer for MAD\n");
> +        return VENDOR_ERR_INV_MAD_BUFF;
> +    }
> +
> +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
> +    pr_dbg("length=%d\n", sge[0].length);
> +    pr_dbg("lkey=%d\n", sge[0].lkey);
> +
> +    bctx = g_malloc0(sizeof(*bctx));
> +
> +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
> +    if (unlikely(rc)) {
> +        g_free(bctx);
> +        pr_dbg("Fail to allocate cqe_ctx\n");
> +        return VENDOR_ERR_NOMEM;
> +    }
> +
> +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
> +    bctx->up_ctx = ctx;
> +    bctx->sge = *sge;
> +
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +
> +    return 0;
> +}
> +
>   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>                               RdmaDeviceResources *rdma_dev_res,
>                               RdmaBackendQP *qp, uint8_t qp_type,
> @@ -388,7 +503,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>           }
>           if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
> +            if (rc) {
> +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> +            }
>           }
>           return;
>       }
> @@ -517,7 +635,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
>   
>       switch (qp_type) {
>       case IBV_QPT_GSI:
> -        pr_dbg("QP1 unsupported\n");
>           return 0;
>   
>       case IBV_QPT_RC:
> @@ -748,11 +865,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
>       return 0;
>   }
>   
> +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> +                                 union ibv_gid *my_gid, int paylen)
> +{
> +    grh->paylen = htons(paylen);
> +    grh->sgid = *sgid;
> +    grh->dgid = *my_gid;
> +
> +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> +}
> +
> +static inline int mad_can_receieve(void *opaque)
> +{
> +    return sizeof(struct backend_umad);
> +}
> +
> +static void mad_read(void *opaque, const uint8_t *buf, int size)
> +{
> +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> +    QObject *o_ctx_id;
> +    unsigned long cqe_ctx_id;
> +    BackendCtx *bctx;
> +    char *mad;
> +    struct backend_umad *umad;
> +
> +    assert(size != sizeof(umad));
> +    umad = (struct backend_umad *)buf;
> +
> +    pr_dbg("Got %d bytes\n", size);
> +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> +
> +#ifdef PVRDMA_DEBUG
> +    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
> +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
> +           hdr->method, hdr->status, be64toh(hdr->tid),
> +           hdr->attr_id, hdr->attr_mod);
> +#endif
> +
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +    if (!o_ctx_id) {
> +        pr_dbg("No more free MADs buffers, waiting for a while\n");
> +        sleep(THR_POLL_TO);
> +        return;
> +    }
> +
> +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +    if (unlikely(!bctx)) {
> +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
> +        return;
> +    }
> +
> +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
> +
> +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> +                           bctx->sge.length);
> +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> +                     bctx->up_ctx);
> +    } else {
> +        memset(mad, 0, bctx->sge.length);
> +        build_mad_hdr((struct ibv_grh *)mad,
> +                      (union ibv_gid *)&umad->hdr.addr.gid,
> +                      &backend_dev->gid, umad->hdr.length);
> +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> +
> +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> +    }
> +
> +    g_free(bctx);
> +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +}
> +
> +static int mad_init(RdmaBackendDev *backend_dev)
> +{
> +    struct backend_umad umad = {0};
> +    int ret;
> +
> +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> +        pr_dbg("Missing chardev for MAD multiplexer\n");
> +        return -EIO;
> +    }
> +
> +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> +                             mad_read, NULL, NULL, backend_dev, NULL, true);
> +
> +    /* Register ourself */
> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> +                            sizeof(umad.hdr));
> +    if (ret != sizeof(umad.hdr)) {
> +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> +    }
> +
> +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> +    backend_dev->recv_mads_list.list = qlist_new();
> +
> +    return 0;
> +}
> +
> +static void mad_stop(RdmaBackendDev *backend_dev)
> +{
> +    QObject *o_ctx_id;
> +    unsigned long cqe_ctx_id;
> +    BackendCtx *bctx;
> +
> +    pr_dbg("Closing MAD\n");
> +
> +    /* Clear MAD buffers list */
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    do {
> +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);

I suppose you can lock only the above line, but maybe it doesn't
matter at this point.
> +        if (o_ctx_id) {
> +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +            if (bctx) {
> +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +                g_free(bctx);
> +            }
> +        }
> +    } while (o_ctx_id);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +}
> +
> +static void mad_fini(RdmaBackendDev *backend_dev)
> +{
> +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
> +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> +}
> +
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
>                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      Error **errp)
> +                      CharBackend *mad_chr_be, Error **errp)
>   {
>       int i;
>       int ret = 0;
> @@ -763,7 +1015,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       memset(backend_dev, 0, sizeof(*backend_dev));
>   
>       backend_dev->dev = pdev;
> -
> +    backend_dev->mad_chr_be = mad_chr_be;
>       backend_dev->backend_gid_idx = backend_gid_idx;
>       backend_dev->port_num = port_num;
>       backend_dev->rdma_dev_res = rdma_dev_res;
> @@ -854,6 +1106,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       pr_dbg("interface_id=0x%" PRIx64 "\n",
>              be64_to_cpu(backend_dev->gid.global.interface_id));
>   
> +    ret = mad_init(backend_dev);
> +    if (ret) {
> +        error_setg(errp, "Fail to initialize mad");
> +        ret = -EIO;
> +        goto out_destroy_comm_channel;
> +    }
> +
>       backend_dev->comp_thread.run = false;
>       backend_dev->comp_thread.is_running = false;
>   
> @@ -885,11 +1144,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
>   {
>       pr_dbg("Stopping rdma_backend\n");
>       stop_backend_thread(&backend_dev->comp_thread);
> +    mad_stop(backend_dev);
>   }
>   
>   void rdma_backend_fini(RdmaBackendDev *backend_dev)
>   {
>       rdma_backend_stop(backend_dev);
> +    mad_fini(backend_dev);
>       g_hash_table_destroy(ah_hash);
>       ibv_destroy_comp_channel(backend_dev->channel);
>       ibv_close_device(backend_dev->context);
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index 3ccc9a2494..fc83330251 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -17,6 +17,8 @@
>   #define RDMA_BACKEND_H
>   
>   #include "qapi/error.h"
> +#include "chardev/char-fe.h"
> +
>   #include "rdma_rm_defs.h"
>   #include "rdma_backend_defs.h"
>   
> @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
>                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      Error **errp);
> +                      CharBackend *mad_chr_be, Error **errp);
>   void rdma_backend_fini(RdmaBackendDev *backend_dev);
>   void rdma_backend_start(RdmaBackendDev *backend_dev);
>   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> index 7404f64002..2a7e667075 100644
> --- a/hw/rdma/rdma_backend_defs.h
> +++ b/hw/rdma/rdma_backend_defs.h
> @@ -16,8 +16,9 @@
>   #ifndef RDMA_BACKEND_DEFS_H
>   #define RDMA_BACKEND_DEFS_H
>   
> -#include <infiniband/verbs.h>
>   #include "qemu/thread.h"
> +#include "chardev/char-fe.h"
> +#include <infiniband/verbs.h>
>   
>   typedef struct RdmaDeviceResources RdmaDeviceResources;
>   
> @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
>       bool is_running; /* Set by the thread to report its status */
>   } RdmaBackendThread;
>   
> +typedef struct RecvMadList {
> +    QemuMutex lock;
> +    QList *list;
> +} RecvMadList;
> +
>   typedef struct RdmaBackendDev {
>       struct ibv_device_attr dev_attr;
>       RdmaBackendThread comp_thread;
> @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
>       struct ibv_comp_channel *channel;
>       uint8_t port_num;
>       uint8_t backend_gid_idx;
> +    RecvMadList recv_mads_list;
> +    CharBackend *mad_chr_be;
>   } RdmaBackendDev;
>   
>   typedef struct RdmaBackendPD {
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index e2d9f93cdf..e3742d893a 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -19,6 +19,7 @@
>   #include "qemu/units.h"
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msix.h"
> +#include "chardev/char-fe.h"
>   
>   #include "../rdma_backend_defs.h"
>   #include "../rdma_rm_defs.h"
> @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
>       uint8_t backend_port_num;
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
> +    CharBackend mad_chr;
>   } PVRDMADev;
>   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index ca5fa8d981..6c8c0154fa 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
>       DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
>                         dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
>       DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
> +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
>       DEFINE_PROP_END_OF_LIST(),
>   };
>   
> @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>   
>       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
>                              dev->backend_device_name, dev->backend_port_num,
> -                           dev->backend_gid_idx, &dev->dev_attr, errp);
> +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> +                           errp);
>       if (rc) {
>           goto out;
>       }

I only have a few minor comments, but it looks OK to me anyway.

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Thanks,
Marcel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets
  2018-11-25  7:05   ` Marcel Apfelbaum
@ 2018-11-25  7:27     ` Yuval Shaia
  0 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-25  7:27 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Sun, Nov 25, 2018 at 09:05:17AM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/22/18 2:13 PM, Yuval Shaia wrote:
> > MAD (Management Datagram) packets are widely used by various modules,
> > both in kernel and in user space. For example, the rdma_* API, which is
> > used to create and maintain a "connection" layer on top of RDMA, uses
> > several types of MAD packets.
> > 
> > For more information please refer to chapter 13.4 in Volume 1
> > Architecture Specification, Release 1.1 available here:
> > https://www.infinibandta.org/ibta-specifications-download/
> > 
> > To support MAD packets the device uses an external utility
> > (contrib/rdmacm-mux) to relay packets from and to the guest driver.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_backend.c      | 275 +++++++++++++++++++++++++++++++++++-
> >   hw/rdma/rdma_backend.h      |   4 +-
> >   hw/rdma/rdma_backend_defs.h |  10 +-
> >   hw/rdma/vmw/pvrdma.h        |   2 +
> >   hw/rdma/vmw/pvrdma_main.c   |   4 +-
> >   5 files changed, 285 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> > index 1e148398a2..7c220a5798 100644
> > --- a/hw/rdma/rdma_backend.c
> > +++ b/hw/rdma/rdma_backend.c
> > @@ -16,8 +16,13 @@
> >   #include "qemu/osdep.h"
> >   #include "qemu/error-report.h"
> >   #include "qapi/error.h"
> > +#include "qapi/qmp/qlist.h"
> > +#include "qapi/qmp/qnum.h"
> >   #include <infiniband/verbs.h>
> > +#include <infiniband/umad_types.h>
> > +#include <infiniband/umad.h>
> > +#include <rdma/rdma_user_cm.h>
> >   #include "trace.h"
> >   #include "rdma_utils.h"
> > @@ -33,16 +38,25 @@
> >   #define VENDOR_ERR_MAD_SEND         0x206
> >   #define VENDOR_ERR_INVLKEY          0x207
> >   #define VENDOR_ERR_MR_SMALL         0x208
> > +#define VENDOR_ERR_INV_MAD_BUFF     0x209
> > +#define VENDOR_ERR_INV_NUM_SGE      0x210
> >   #define THR_NAME_LEN 16
> >   #define THR_POLL_TO  5000
> > +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
> > +
> >   typedef struct BackendCtx {
> > -    uint64_t req_id;
> >       void *up_ctx;
> >       bool is_tx_req;
> > +    struct ibv_sge sge; /* Used to save MAD recv buffer */
> >   } BackendCtx;
> > +struct backend_umad {
> > +    struct ib_user_mad hdr;
> > +    char mad[RDMA_MAX_PRIVATE_DATA];
> > +};
> > +
> >   static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
> >   static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> > @@ -286,6 +300,61 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
> >       return 0;
> >   }
> > +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> > +                    uint32_t num_sge)
> > +{
> > +    struct backend_umad umad = {0};
> > +    char *hdr, *msg;
> > +    int ret;
> > +
> > +    pr_dbg("num_sge=%d\n", num_sge);
> > +
> > +    if (num_sge != 2) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    umad.hdr.length = sge[0].length + sge[1].length;
> > +    pr_dbg("msg_len=%d\n", umad.hdr.length);
> > +
> > +    if (umad.hdr.length > sizeof(umad.mad)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umad.hdr.addr.qpn = htobe32(1);
> > +    umad.hdr.addr.grh_present = 1;
> > +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> > +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > +    umad.hdr.addr.hop_limit = 1;
> > +
> > +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> > +    if (!hdr) {
> > +        pr_dbg("Fail to map to sge[0]\n");
> > +        return -ENOMEM;
> > +    }
> > +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > +    if (!msg) {
> > +        pr_dbg("Fail to map to sge[1]\n");
> > +        rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> > +        return -ENOMEM;
> > +    }
> > +
> > +    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
> > +    pr_dbg_buf("mad_data", msg, sge[1].length);
> > +
> > +    memcpy(&umad.mad[0], hdr, sge[0].length);
> > +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> > +
> > +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> > +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> > +
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > +                            sizeof(umad));
> > +
> > +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> > +
> > +    return (ret != sizeof(umad));
> > +}
> > +
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > @@ -304,9 +373,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> >           } else if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            rc = mad_send(backend_dev, sge, num_sge);
> > +            if (rc) {
> > +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            } else {
> > +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> > +            }
> >           }
> > -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
> >           return;
> >       }
> > @@ -370,6 +443,48 @@ out_free_bctx:
> >       g_free(bctx);
> >   }
> > +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
> > +                                         struct ibv_sge *sge, uint32_t num_sge,
> > +                                         void *ctx)
> > +{
> > +    BackendCtx *bctx;
> > +    int rc;
> > +    uint32_t bctx_id;
> > +
> > +    if (num_sge != 1) {
> > +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
> 
> Maybe not for this patch set, but please consider using
> QEMU error report facilities for errors instead of debug messages
> when necessary.
> 
> The same goes for using traces on the normal flow instead of pr_dbg,
> but this has a lower priority than error reporting.

Sure, this is next on my plate.

> 
> > +        return VENDOR_ERR_INV_NUM_SGE;
> > +    }
> > +
> > +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
> > +        pr_dbg("Too small buffer for MAD\n");
> > +        return VENDOR_ERR_INV_MAD_BUFF;
> > +    }
> > +
> > +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
> > +    pr_dbg("length=%d\n", sge[0].length);
> > +    pr_dbg("lkey=%d\n", sge[0].lkey);
> > +
> > +    bctx = g_malloc0(sizeof(*bctx));
> > +
> > +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
> > +    if (unlikely(rc)) {
> > +        g_free(bctx);
> > +        pr_dbg("Fail to allocate cqe_ctx\n");
> > +        return VENDOR_ERR_NOMEM;
> > +    }
> > +
> > +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
> > +    bctx->up_ctx = ctx;
> > +    bctx->sge = *sge;
> > +
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +
> > +    return 0;
> > +}
> > +
> >   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> >                               RdmaDeviceResources *rdma_dev_res,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> > @@ -388,7 +503,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> >           }
> >           if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
> > +            if (rc) {
> > +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> > +            }
> >           }
> >           return;
> >       }
> > @@ -517,7 +635,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> >       switch (qp_type) {
> >       case IBV_QPT_GSI:
> > -        pr_dbg("QP1 unsupported\n");
> >           return 0;
> >       case IBV_QPT_RC:
> > @@ -748,11 +865,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
> >       return 0;
> >   }
> > +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> > +                                 union ibv_gid *my_gid, int paylen)
> > +{
> > +    grh->paylen = htons(paylen);
> > +    grh->sgid = *sgid;
> > +    grh->dgid = *my_gid;
> > +
> > +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> > +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> > +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> > +}
> > +
> > +static inline int mad_can_receieve(void *opaque)
> > +{
> > +    return sizeof(struct backend_umad);
> > +}
> > +
> > +static void mad_read(void *opaque, const uint8_t *buf, int size)
> > +{
> > +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> > +    QObject *o_ctx_id;
> > +    unsigned long cqe_ctx_id;
> > +    BackendCtx *bctx;
> > +    char *mad;
> > +    struct backend_umad *umad;
> > +
> > +    assert(size == sizeof(*umad));
> > +    umad = (struct backend_umad *)buf;
> > +
> > +    pr_dbg("Got %d bytes\n", size);
> > +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> > +
> > +#ifdef PVRDMA_DEBUG
> > +    struct umad_hdr *hdr = (struct umad_hdr *)&umad->mad;
> > +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
> > +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
> > +           hdr->method, hdr->status, be64toh(hdr->tid),
> > +           hdr->attr_id, hdr->attr_mod);
> > +#endif
> > +
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +    if (!o_ctx_id) {
> > +        pr_dbg("No more free MADs buffers, waiting for a while\n");
> > +        sleep(THR_POLL_TO);
> > +        return;
> > +    }
> > +
> > +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> > +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +    if (unlikely(!bctx)) {
> > +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
> > +        return;
> > +    }
> > +
> > +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
> > +
> > +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> > +                           bctx->sge.length);
> > +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> > +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> > +                     bctx->up_ctx);
> > +    } else {
> > +        memset(mad, 0, bctx->sge.length);
> > +        build_mad_hdr((struct ibv_grh *)mad,
> > +                      (union ibv_gid *)&umad->hdr.addr.gid,
> > +                      &backend_dev->gid, umad->hdr.length);
> > +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> > +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> > +
> > +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> > +    }
> > +
> > +    g_free(bctx);
> > +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +}
> > +
> > +static int mad_init(RdmaBackendDev *backend_dev)
> > +{
> > +    struct backend_umad umad = {0};
> > +    int ret;
> > +
> > +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> > +        pr_dbg("Missing chardev for MAD multiplexer\n");
> > +        return -EIO;
> > +    }
> > +
> > +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> > +                             mad_read, NULL, NULL, backend_dev, NULL, true);
> > +
> > +    /* Register ourself */
> > +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > +                            sizeof(umad.hdr));
> > +    if (ret != sizeof(umad.hdr)) {
> > +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> > +    }
> > +
> > +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> > +    backend_dev->recv_mads_list.list = qlist_new();
> > +
> > +    return 0;
> > +}
> > +
> > +static void mad_stop(RdmaBackendDev *backend_dev)
> > +{
> > +    QObject *o_ctx_id;
> > +    unsigned long cqe_ctx_id;
> > +    BackendCtx *bctx;
> > +
> > +    pr_dbg("Closing MAD\n");
> > +
> > +    /* Clear MAD buffers list */
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    do {
> > +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> 
> I suppose you can lock only the above line, but maybe it doesn't
> matter at this point.

Sorry, my bad, there is no need for this cleanup; I will squash patch #23
into this one.

> > +        if (o_ctx_id) {
> > +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> > +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +            if (bctx) {
> > +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +                g_free(bctx);
> > +            }
> > +        }
> > +    } while (o_ctx_id);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +}
> > +
> > +static void mad_fini(RdmaBackendDev *backend_dev)
> > +{
> > +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
> > +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> > +}
> > +
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> >                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      Error **errp)
> > +                      CharBackend *mad_chr_be, Error **errp)
> >   {
> >       int i;
> >       int ret = 0;
> > @@ -763,7 +1015,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       memset(backend_dev, 0, sizeof(*backend_dev));
> >       backend_dev->dev = pdev;
> > -
> > +    backend_dev->mad_chr_be = mad_chr_be;
> >       backend_dev->backend_gid_idx = backend_gid_idx;
> >       backend_dev->port_num = port_num;
> >       backend_dev->rdma_dev_res = rdma_dev_res;
> > @@ -854,6 +1106,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       pr_dbg("interface_id=0x%" PRIx64 "\n",
> >              be64_to_cpu(backend_dev->gid.global.interface_id));
> > +    ret = mad_init(backend_dev);
> > +    if (ret) {
> > +        error_setg(errp, "Fail to initialize mad");
> > +        ret = -EIO;
> > +        goto out_destroy_comm_channel;
> > +    }
> > +
> >       backend_dev->comp_thread.run = false;
> >       backend_dev->comp_thread.is_running = false;
> > @@ -885,11 +1144,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
> >   {
> >       pr_dbg("Stopping rdma_backend\n");
> >       stop_backend_thread(&backend_dev->comp_thread);
> > +    mad_stop(backend_dev);
> >   }
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev)
> >   {
> >       rdma_backend_stop(backend_dev);
> > +    mad_fini(backend_dev);
> >       g_hash_table_destroy(ah_hash);
> >       ibv_destroy_comp_channel(backend_dev->channel);
> >       ibv_close_device(backend_dev->context);
> > diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> > index 3ccc9a2494..fc83330251 100644
> > --- a/hw/rdma/rdma_backend.h
> > +++ b/hw/rdma/rdma_backend.h
> > @@ -17,6 +17,8 @@
> >   #define RDMA_BACKEND_H
> >   #include "qapi/error.h"
> > +#include "chardev/char-fe.h"
> > +
> >   #include "rdma_rm_defs.h"
> >   #include "rdma_backend_defs.h"
> > @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> >                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      Error **errp);
> > +                      CharBackend *mad_chr_be, Error **errp);
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> >   void rdma_backend_start(RdmaBackendDev *backend_dev);
> >   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> > diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> > index 7404f64002..2a7e667075 100644
> > --- a/hw/rdma/rdma_backend_defs.h
> > +++ b/hw/rdma/rdma_backend_defs.h
> > @@ -16,8 +16,9 @@
> >   #ifndef RDMA_BACKEND_DEFS_H
> >   #define RDMA_BACKEND_DEFS_H
> > -#include <infiniband/verbs.h>
> >   #include "qemu/thread.h"
> > +#include "chardev/char-fe.h"
> > +#include <infiniband/verbs.h>
> >   typedef struct RdmaDeviceResources RdmaDeviceResources;
> > @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
> >       bool is_running; /* Set by the thread to report its status */
> >   } RdmaBackendThread;
> > +typedef struct RecvMadList {
> > +    QemuMutex lock;
> > +    QList *list;
> > +} RecvMadList;
> > +
> >   typedef struct RdmaBackendDev {
> >       struct ibv_device_attr dev_attr;
> >       RdmaBackendThread comp_thread;
> > @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
> >       struct ibv_comp_channel *channel;
> >       uint8_t port_num;
> >       uint8_t backend_gid_idx;
> > +    RecvMadList recv_mads_list;
> > +    CharBackend *mad_chr_be;
> >   } RdmaBackendDev;
> >   typedef struct RdmaBackendPD {
> > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > index e2d9f93cdf..e3742d893a 100644
> > --- a/hw/rdma/vmw/pvrdma.h
> > +++ b/hw/rdma/vmw/pvrdma.h
> > @@ -19,6 +19,7 @@
> >   #include "qemu/units.h"
> >   #include "hw/pci/pci.h"
> >   #include "hw/pci/msix.h"
> > +#include "chardev/char-fe.h"
> >   #include "../rdma_backend_defs.h"
> >   #include "../rdma_rm_defs.h"
> > @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
> >       uint8_t backend_port_num;
> >       RdmaBackendDev backend_dev;
> >       RdmaDeviceResources rdma_dev_res;
> > +    CharBackend mad_chr;
> >   } PVRDMADev;
> >   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
> > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > index ca5fa8d981..6c8c0154fa 100644
> > --- a/hw/rdma/vmw/pvrdma_main.c
> > +++ b/hw/rdma/vmw/pvrdma_main.c
> > @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
> >       DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
> >                         dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
> >       DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
> > +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
> >       DEFINE_PROP_END_OF_LIST(),
> >   };
> > @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
> >                              dev->backend_device_name, dev->backend_port_num,
> > -                           dev->backend_gid_idx, &dev->dev_attr, errp);
> > +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> > +                           errp);
> >       if (rc) {
> >           goto out;
> >       }
> 
> I only have a few minor comments, but it looks OK to me anyway.
> 
> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Thanks,
> Marcel

Thanks!

> 


* Re: [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
@ 2018-11-25  7:29   ` Marcel Apfelbaum
  2018-11-25  9:10     ` Yuval Shaia
  0 siblings, 1 reply; 39+ messages in thread
From: Marcel Apfelbaum @ 2018-11-25  7:29 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/22/18 2:13 PM, Yuval Shaia wrote:
> The control over the RDMA device's GID table is done by updating the
> device's Ethernet function addresses.
> Usually the first GID entry is determined by the MAC address, the second
> by the first IPv6 address and the third by the IPv4 address. Other
> entries can be added by adding more IP addresses. The opposite is the
> same, i.e. whenever an address is removed, the corresponding GID entry
> is removed.
>
> The process is done by the network and RDMA stacks. Whenever an address
> is added, the ib_core driver is notified and calls the device driver's
> add_gid function, which in turn updates the device.
>
> To support this in the pvrdma device we need to hook into the create_bind
> and destroy_bind HW commands triggered by the pvrdma driver in the guest.
> Whenever a change is made to the pvrdma port's GID table, a special QMP
> message is sent to be processed by libvirt, which updates the address of
> the backend Ethernet device.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c      | 336 +++++++++++++++++++++++++-----------
>   hw/rdma/rdma_backend.h      |  22 +--
>   hw/rdma/rdma_backend_defs.h |  11 +-
>   hw/rdma/rdma_rm.c           | 104 ++++++++++-
>   hw/rdma/rdma_rm.h           |  17 +-
>   hw/rdma/rdma_rm_defs.h      |   9 +-
>   hw/rdma/rdma_utils.h        |  15 ++
>   hw/rdma/vmw/pvrdma.h        |   2 +-
>   hw/rdma/vmw/pvrdma_cmd.c    |  55 +++---
>   hw/rdma/vmw/pvrdma_main.c   |  25 +--
>   hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
>   11 files changed, 453 insertions(+), 163 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 7c220a5798..8b5a111bf4 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -15,15 +15,18 @@
>   
>   #include "qemu/osdep.h"
>   #include "qemu/error-report.h"
> +#include "sysemu/sysemu.h"
>   #include "qapi/error.h"
>   #include "qapi/qmp/qlist.h"
>   #include "qapi/qmp/qnum.h"
> +#include "qapi/qapi-events-rdma.h"
>   
>   #include <infiniband/verbs.h>
>   #include <infiniband/umad_types.h>
>   #include <infiniband/umad.h>
>   #include <rdma/rdma_user_cm.h>
>   
> +#include "contrib/rdmacm-mux/rdmacm-mux.h"
>   #include "trace.h"
>   #include "rdma_utils.h"
>   #include "rdma_rm.h"
> @@ -160,6 +163,71 @@ static void *comp_handler_thread(void *arg)
>       return NULL;
>   }
>   
> +static inline void disable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
> +{
> +    atomic_set(&backend_dev->rdmacm_mux.can_receive, 0);
> +}
> +
> +static inline void enable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
> +{
> +    atomic_set(&backend_dev->rdmacm_mux.can_receive, sizeof(RdmaCmMuxMsg));

Why is sizeof used to set the can_receive field?

> +}
> +
> +static inline int rdmacm_mux_can_process_async(RdmaBackendDev *backend_dev)
> +{
> +    return atomic_read(&backend_dev->rdmacm_mux.can_receive);
> +}
> +
> +static int check_mux_op_status(CharBackend *mad_chr_be)
> +{
> +    RdmaCmMuxMsg msg = {0};
> +    int ret;
> +
> +    pr_dbg("Reading response\n");
> +    ret = qemu_chr_fe_read_all(mad_chr_be, (uint8_t *)&msg, sizeof(msg));
> +    if (ret != sizeof(msg)) {
> > +        pr_dbg("Invalid message size %d, expecting %zu\n", ret, sizeof(msg));
> +        return -EIO;
> +    }
> +
> +    if (msg.hdr.msg_type != RDMACM_MUX_MSG_TYPE_RESP) {
> +        pr_dbg("Invalid message type %d\n", msg.hdr.msg_type);
> +        return -EIO;
> +    }
> +
> +    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
> +        pr_dbg("Operation failed in mux, error code %d\n", msg.hdr.err_code);
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
> +static int exec_rdmacm_mux_req(RdmaBackendDev *backend_dev, RdmaCmMuxMsg *msg)
> +{
> +    int rc = 0;
> +
> +    pr_dbg("Executing request %d\n", msg->hdr.op_code);
> +
> +    msg->hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ;
> +    disable_rdmacm_mux_async(backend_dev);
> +    rc = qemu_chr_fe_write(backend_dev->rdmacm_mux.chr_be,
> +                           (const uint8_t *)msg, sizeof(*msg));
> +    enable_rdmacm_mux_async(backend_dev);
> +    if (rc != sizeof(*msg)) {
> +        pr_dbg("Fail to send request to rdmacm_mux (rc=%d)\n", rc);
> +        return -EIO;
> +    }
> +
> +    rc = check_mux_op_status(backend_dev->rdmacm_mux.chr_be);
> +    if (rc) {
> +        pr_dbg("Fail to execute rdmacm_mux request %d (rc=%d)\n",
> +               msg->hdr.op_code, rc);
> +    }
> +
> +    return 0;
> +}
> +
>   static void stop_backend_thread(RdmaBackendThread *thread)
>   {
>       thread->run = false;
> @@ -300,11 +368,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
>       return 0;
>   }
>   
> -static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> -                    uint32_t num_sge)
> +static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
> +                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
>   {
> -    struct backend_umad umad = {0};
> -    char *hdr, *msg;
> +    RdmaCmMuxMsg msg = {0};
> +    char *hdr, *data;
>       int ret;
>   
>       pr_dbg("num_sge=%d\n", num_sge);
> @@ -313,26 +381,31 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
>           return -EINVAL;
>       }
>   
> -    umad.hdr.length = sge[0].length + sge[1].length;
> -    pr_dbg("msg_len=%d\n", umad.hdr.length);
> +    msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD;
> +    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
>   
> -    if (umad.hdr.length > sizeof(umad.mad)) {
> +    msg.umad_len = sge[0].length + sge[1].length;
> +    pr_dbg("umad_len=%d\n", msg.umad_len);
> +
> +    if (msg.umad_len > sizeof(msg.umad.mad)) {
>           return -ENOMEM;
>       }
>   
> -    umad.hdr.addr.qpn = htobe32(1);
> -    umad.hdr.addr.grh_present = 1;
> -    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> -    umad.hdr.addr.hop_limit = 1;
> +    msg.umad.hdr.addr.qpn = htobe32(1);
> +    msg.umad.hdr.addr.grh_present = 1;
> +    pr_dbg("sgid_idx=%d\n", sgid_idx);
> +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
> +    msg.umad.hdr.addr.gid_index = sgid_idx;
> +    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
> +    msg.umad.hdr.addr.hop_limit = 1;

Why is hop_limit set to 1?

>   
>       hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
>       if (!hdr) {
>           pr_dbg("Fail to map to sge[0]\n");
>           return -ENOMEM;
>       }
> -    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> -    if (!msg) {
> +    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> +    if (!data) {
>           pr_dbg("Fail to map to sge[1]\n");
>           rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
>           return -ENOMEM;
> @@ -341,25 +414,27 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
>       pr_dbg_buf("mad_hdr", hdr, sge[0].length);
>       pr_dbg_buf("mad_data", data, sge[1].length);
>   
> -    memcpy(&umad.mad[0], hdr, sge[0].length);
> -    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> +    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
> +    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
>   
> -    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> +    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
>       rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
>   
> -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> -                            sizeof(umad));
> -
> -    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> +    ret = exec_rdmacm_mux_req(backend_dev, &msg);
> +    if (ret) {
> +        pr_dbg("Fail to send MAD to rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
>   
> -    return (ret != sizeof(umad));
> +    return 0;
>   }
>   
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> -                            union ibv_gid *dgid, uint32_t dqpn,
> -                            uint32_t dqkey, void *ctx)
> +                            uint8_t sgid_idx, union ibv_gid *sgid,
> +                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
> +                            void *ctx)
>   {
>       BackendCtx *bctx;
>       struct ibv_sge new_sge[MAX_SGE];
> @@ -373,7 +448,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           } else if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            rc = mad_send(backend_dev, sge, num_sge);
> +            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
>               if (rc) {
>                   comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
>               } else {
> @@ -409,8 +484,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       }
>   
>       if (qp_type == IBV_QPT_UD) {
> -        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
> -                                backend_dev->backend_gid_idx, dgid);
> +        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
>           if (!wr.wr.ud.ah) {
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
>               goto out_dealloc_cqe_ctx;
> @@ -715,9 +789,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>   }
>   
>   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> -                              uint8_t qp_type, union ibv_gid *dgid,
> -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> -                              bool use_qkey)
> +                              uint8_t qp_type, uint8_t sgid_idx,
> +                              union ibv_gid *dgid, uint32_t dqpn,
> +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
>   {
>       struct ibv_qp_attr attr = {0};
>       union ibv_gid ibv_gid = {
> @@ -729,13 +803,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>       attr.qp_state = IBV_QPS_RTR;
>       attr_mask = IBV_QP_STATE;
>   
> +    qp->sgid_idx = sgid_idx;
> +
>       switch (qp_type) {
>       case IBV_QPT_RC:
>           pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
>                  be64_to_cpu(ibv_gid.global.subnet_prefix),
>                  be64_to_cpu(ibv_gid.global.interface_id));
>           pr_dbg("dqpn=0x%x\n", dqpn);
> -        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
> +        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
>           pr_dbg("sport_num=%d\n", backend_dev->port_num);
>           pr_dbg("rq_psn=0x%x\n", rq_psn);
>   
> @@ -747,7 +823,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>           attr.ah_attr.is_global      = 1;
>           attr.ah_attr.grh.hop_limit  = 1;
>           attr.ah_attr.grh.dgid       = ibv_gid;
> -        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
> +        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
>           attr.rq_psn                 = rq_psn;
>   
>           attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
> @@ -756,8 +832,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>           break;
>   
>       case IBV_QPT_UD:
> +        pr_dbg("qkey=0x%x\n", qkey);
>           if (use_qkey) {
> -            pr_dbg("qkey=0x%x\n", qkey);
>               attr.qkey = qkey;
>               attr_mask |= IBV_QP_QKEY;
>           }
> @@ -873,29 +949,19 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
>       grh->dgid = *my_gid;
>   
>       pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> -    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> -    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> +    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
> +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
>   }
>   
> -static inline int mad_can_receieve(void *opaque)
> +static void process_incoming_mad_req(RdmaBackendDev *backend_dev,
> +                                     RdmaCmMuxMsg *msg)
>   {
> -    return sizeof(struct backend_umad);
> -}
> -
> -static void mad_read(void *opaque, const uint8_t *buf, int size)
> -{
> -    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
>       QObject *o_ctx_id;
>       unsigned long cqe_ctx_id;
>       BackendCtx *bctx;
>       char *mad;
> -    struct backend_umad *umad;
>   
> -    assert(size != sizeof(umad));
> -    umad = (struct backend_umad *)buf;
> -
> -    pr_dbg("Got %d bytes\n", size);
> -    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> +    pr_dbg("umad_len=%d\n", msg->umad_len);
>   
>   #ifdef PVRDMA_DEBUG
>       struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> @@ -925,15 +991,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>   
>       mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
>                              bctx->sge.length);
> -    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> +    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
>           comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
>                        bctx->up_ctx);
>       } else {
> +        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
>           memset(mad, 0, bctx->sge.length);
>           build_mad_hdr((struct ibv_grh *)mad,
> -                      (union ibv_gid *)&umad->hdr.addr.gid,
> -                      &backend_dev->gid, umad->hdr.length);
> -        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> +                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
> +                      msg->umad_len);
> +        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
>           rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
>   
>           comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> @@ -943,30 +1010,51 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
>       rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
>   }
>   
> -static int mad_init(RdmaBackendDev *backend_dev)
> +static inline int rdmacm_mux_can_receive(void *opaque)
>   {
> -    struct backend_umad umad = {0};
> -    int ret;
> +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
>   
> -    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> -        pr_dbg("Missing chardev for MAD multiplexer\n");
> -        return -EIO;
> +    return rdmacm_mux_can_process_async(backend_dev);
> +}
> +
> +static void rdmacm_mux_read(void *opaque, const uint8_t *buf, int size)
> +{
> +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> +    RdmaCmMuxMsg *msg = (RdmaCmMuxMsg *)buf;
> +
> +    pr_dbg("Got %d bytes\n", size);
> +    pr_dbg("msg_type=%d\n", msg->hdr.msg_type);
> +    pr_dbg("op_code=%d\n", msg->hdr.op_code);
> +
> +    if (msg->hdr.msg_type != RDMACM_MUX_MSG_TYPE_REQ &&
> +        msg->hdr.op_code != RDMACM_MUX_OP_CODE_MAD) {
> +            pr_dbg("Error: Not a MAD request, skipping\n");
> +            return;

No error flow in rdmacm_mux_read()? A non-MAD message is silently dropped here; what happens at the caller site on error?

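To make the question concrete, here is a minimal standalone sketch of how the validation step could return an error to its caller instead of silently returning. The header layout and the constant values are simplified stand-ins, not the real contrib/rdmacm-mux/rdmacm-mux.h; note also that the sketch accepts a message only when it is both a REQ and a MAD op, which is stricter than the `&&` test in the patch above.

```c
/* Standalone model of the validation step in rdmacm_mux_read().
 * mux_hdr, RDMACM_MUX_MSG_TYPE_REQ and RDMACM_MUX_OP_CODE_MAD are
 * simplified stand-ins for the real rdmacm-mux.h definitions. */
#include <errno.h>

enum {
    RDMACM_MUX_MSG_TYPE_REQ = 0,
    RDMACM_MUX_OP_CODE_MAD  = 3,
};

struct mux_hdr {
    int msg_type;
    int op_code;
};

/* Return 0 for a MAD request, -EINVAL otherwise, so the caller can
 * count or report dropped messages rather than losing them silently. */
static int validate_mux_msg(const struct mux_hdr *hdr)
{
    if (hdr->msg_type != RDMACM_MUX_MSG_TYPE_REQ ||
        hdr->op_code != RDMACM_MUX_OP_CODE_MAD) {
        return -EINVAL;
    }
    return 0;
}
```

Since the chardev read callback itself returns void, the returned code would only feed a counter or debug trace, but at least the drop would be visible.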
>       }
> +    process_incoming_mad_req(backend_dev, msg);
> +}
> +
> +static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
> +{
> +    int ret;
>   
> -    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> -                             mad_read, NULL, NULL, backend_dev, NULL, true);
> +    backend_dev->rdmacm_mux.chr_be = mad_chr_be;
>   
> -    /* Register ourself */
> -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> -                            sizeof(umad.hdr));
> -    if (ret != sizeof(umad.hdr)) {
> -        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> +    ret = qemu_chr_fe_backend_connected(backend_dev->rdmacm_mux.chr_be);
> +    if (!ret) {
> +        pr_dbg("Missing chardev for MAD multiplexer\n");
> +        return -EIO;
>       }
>   
>       qemu_mutex_init(&backend_dev->recv_mads_list.lock);
>       backend_dev->recv_mads_list.list = qlist_new();
>   
> +    enable_rdmacm_mux_async(backend_dev);
> +
> +    qemu_chr_fe_set_handlers(backend_dev->rdmacm_mux.chr_be,
> +                             rdmacm_mux_can_receive, rdmacm_mux_read, NULL,
> +                             NULL, backend_dev, NULL, true);
> +
>       return 0;
>   }
>   
> @@ -978,6 +1066,8 @@ static void mad_stop(RdmaBackendDev *backend_dev)
>   
>       pr_dbg("Closing MAD\n");
>   
> +    disable_rdmacm_mux_async(backend_dev);
> +
>       /* Clear MAD buffers list */
>       qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
>       do {
> @@ -1000,23 +1090,94 @@ static void mad_fini(RdmaBackendDev *backend_dev)
>       qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
>   }
>   
> +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> +                               union ibv_gid *gid)
> +{
> +    union ibv_gid sgid;
> +    int ret;
> +    int i = 0;
> +
> +    pr_dbg("0x%llx, 0x%llx\n",
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> +
> +    do {
> +        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
> +                            &sgid);
> +        i++;
> +    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
> +
> +    pr_dbg("gid_index=%d\n", i - 1);
> +
> +    return ret ? ret : i - 1;
> +}
> +
> +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid)
> +{
> +    RdmaCmMuxMsg msg = {0};
> +    int ret;
> +
> +    pr_dbg("0x%llx, 0x%llx\n",
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> +
> +    msg.hdr.op_code = RDMACM_MUX_OP_CODE_REG;
> +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> +
> +    ret = exec_rdmacm_mux_req(backend_dev, &msg);
> +    if (ret) {
> +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
> +
> +    qapi_event_send_rdma_gid_status_changed(ifname, true,
> +                                            gid->global.subnet_prefix,
> +                                            gid->global.interface_id);
> +
> +    return ret;
> +}
> +
> +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid)
> +{
> +    RdmaCmMuxMsg msg = {0};
> +    int ret;
> +
> +    pr_dbg("0x%llx, 0x%llx\n",
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> +
> +    msg.hdr.op_code = RDMACM_MUX_OP_CODE_UNREG;
> +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> +
> +    ret = exec_rdmacm_mux_req(backend_dev, &msg);
> +    if (ret) {
> +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
> +        return -EIO;
> +    }
> +
> +    qapi_event_send_rdma_gid_status_changed(ifname, false,
> +                                            gid->global.subnet_prefix,
> +                                            gid->global.interface_id);
> +
> +    return 0;
> +}
> +
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
> -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      CharBackend *mad_chr_be, Error **errp)
> +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> +                      Error **errp)
>   {
>       int i;
>       int ret = 0;
>       int num_ibv_devices;
>       struct ibv_device **dev_list;
> -    struct ibv_port_attr port_attr;
>   
>       memset(backend_dev, 0, sizeof(*backend_dev));
>   
>       backend_dev->dev = pdev;
> -    backend_dev->mad_chr_be = mad_chr_be;
> -    backend_dev->backend_gid_idx = backend_gid_idx;
>       backend_dev->port_num = port_num;
>       backend_dev->rdma_dev_res = rdma_dev_res;
>   
> @@ -1053,9 +1214,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>           backend_dev->ib_dev = *dev_list;
>       }
>   
> -    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
> -           ibv_get_device_name(backend_dev->ib_dev),
> -           backend_dev->port_num, backend_dev->backend_gid_idx);
> +    pr_dbg("Using backend device %s, port %d\n",
> +           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
>   
>       backend_dev->context = ibv_open_device(backend_dev->ib_dev);
>       if (!backend_dev->context) {
> @@ -1072,20 +1232,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       }
>       pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
>   
> -    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
> -                         &port_attr);
> -    if (ret) {
> -        error_setg(errp, "Error %d from ibv_query_port", ret);
> -        ret = -EIO;
> -        goto out_destroy_comm_channel;
> -    }
> -
> -    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
> -        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
> -                   port_attr.gid_tbl_len);
> -        goto out_destroy_comm_channel;
> -    }
> -
>       ret = init_device_caps(backend_dev, dev_attr);
>       if (ret) {
>           error_setg(errp, "Failed to initialize device capabilities");
> @@ -1093,20 +1239,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>           goto out_destroy_comm_channel;
>       }
>   
> -    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
> -                         backend_dev->backend_gid_idx, &backend_dev->gid);
> -    if (ret) {
> -        error_setg(errp, "Failed to query gid %d",
> -                   backend_dev->backend_gid_idx);
> -        ret = -EIO;
> -        goto out_destroy_comm_channel;
> -    }
> -    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
> -           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
> -    pr_dbg("interface_id=0x%" PRIx64 "\n",
> -           be64_to_cpu(backend_dev->gid.global.interface_id));
>   
> -    ret = mad_init(backend_dev);
> +    ret = mad_init(backend_dev, mad_chr_be);
>       if (ret) {
>           error_setg(errp, "Fail to initialize mad");
>           ret = -EIO;
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index fc83330251..59ad2b874b 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -28,11 +28,6 @@ enum ibv_special_qp_type {
>       IBV_QPT_GSI = 1,
>   };
>   
> -static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
> -{
> -    return &dev->gid;
> -}
> -
>   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
>   {
>       return qp->ibqp ? qp->ibqp->qp_num : 1;
> @@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
> -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      CharBackend *mad_chr_be, Error **errp);
> +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> +                      Error **errp);
>   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid);
> +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> +                         union ibv_gid *gid);
> +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> +                               union ibv_gid *gid);
>   void rdma_backend_start(RdmaBackendDev *backend_dev);
>   void rdma_backend_stop(RdmaBackendDev *backend_dev);
>   void rdma_backend_register_comp_handler(void (*handler)(int status,
> @@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
>   int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
>                                  uint8_t qp_type, uint32_t qkey);
>   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> -                              uint8_t qp_type, union ibv_gid *dgid,
> -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> -                              bool use_qkey);
> +                              uint8_t qp_type, uint8_t sgid_idx,
> +                              union ibv_gid *dgid, uint32_t dqpn,
> +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
>   int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
>                                 uint32_t sq_psn, uint32_t qkey, bool use_qkey);
>   int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
> @@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> +                            uint8_t sgid_idx, union ibv_gid *sgid,
>                               union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
>                               void *ctx);
>   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> index 2a7e667075..1e5c3dd3bf 100644
> --- a/hw/rdma/rdma_backend_defs.h
> +++ b/hw/rdma/rdma_backend_defs.h
> @@ -19,6 +19,7 @@
>   #include "qemu/thread.h"
>   #include "chardev/char-fe.h"
>   #include <infiniband/verbs.h>
> +#include "contrib/rdmacm-mux/rdmacm-mux.h"
>   
>   typedef struct RdmaDeviceResources RdmaDeviceResources;
>   
> @@ -34,19 +35,22 @@ typedef struct RecvMadList {
>       QList *list;
>   } RecvMadList;
>   
> +typedef struct RdmaCmMux {
> +    CharBackend *chr_be;
> +    int can_receive;
> +} RdmaCmMux;
> +
>   typedef struct RdmaBackendDev {
>       struct ibv_device_attr dev_attr;
>       RdmaBackendThread comp_thread;
> -    union ibv_gid gid;
>       PCIDevice *dev;
>       RdmaDeviceResources *rdma_dev_res;
>       struct ibv_device *ib_dev;
>       struct ibv_context *context;
>       struct ibv_comp_channel *channel;
>       uint8_t port_num;
> -    uint8_t backend_gid_idx;
>       RecvMadList recv_mads_list;
> -    CharBackend *mad_chr_be;
> +    RdmaCmMux rdmacm_mux;
>   } RdmaBackendDev;
>   
>   typedef struct RdmaBackendPD {
> @@ -66,6 +70,7 @@ typedef struct RdmaBackendCQ {
>   typedef struct RdmaBackendQP {
>       struct ibv_pd *ibpd;
>       struct ibv_qp *ibqp;
> +    uint8_t sgid_idx;
>   } RdmaBackendQP;
>   
>   #endif
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index 4f10fcabcc..250254561c 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -391,7 +391,7 @@ out_dealloc_qp:
>   }
>   
>   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> -                      uint32_t qp_handle, uint32_t attr_mask,
> +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
>                         union ibv_gid *dgid, uint32_t dqpn,
>                         enum ibv_qp_state qp_state, uint32_t qkey,
>                         uint32_t rq_psn, uint32_t sq_psn)
> @@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>       int ret;
>   
>       pr_dbg("qpn=0x%x\n", qp_handle);
> +    pr_dbg("qkey=0x%x\n", qkey);
>   
>       qp = rdma_rm_get_qp(dev_res, qp_handle);
>       if (!qp) {
> @@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>           }
>   
>           if (qp->qp_state == IBV_QPS_RTR) {
> +            /* Get backend gid index */
> +            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
> +            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
> +                                                     sgid_idx);
> +            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
> +                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
> +                return -EIO;
> +            }
> +
>               ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
> -                                            qp->qp_type, dgid, dqpn, rq_psn,
> -                                            qkey, attr_mask & IBV_QP_QKEY);
> +                                            qp->qp_type, sgid_idx, dgid, dqpn,
> +                                            rq_psn, qkey,
> +                                            attr_mask & IBV_QP_QKEY);
>               if (ret) {
>                   return -EIO;
>               }
> @@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
>       res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
>   }
>   
> +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, union ibv_gid *gid, int gid_idx)
> +{
> +    int rc;
> +
> +    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
> +    if (rc) {
> +        pr_dbg("Fail to add gid\n");
> +        return -EINVAL;
> +    }
> +
> +    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
> +
> +    return 0;
> +}
> +
> +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, int gid_idx)
> +{
> +    int rc;
> +
> +    rc = rdma_backend_del_gid(backend_dev, ifname,
> +                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
> +    if (rc) {
> +        pr_dbg("Fail to delete gid\n");
> +        return -EINVAL;
> +    }
> +
> +    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
> +           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
> +    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
> +
> +    return 0;
> +}
> +
> +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> +                                  RdmaBackendDev *backend_dev, int sgid_idx)
> +{
> +    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
> +        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
> +        return -EINVAL;
> +    }
> +
> +    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
> +        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
> +        rdma_backend_get_gid_index(backend_dev,
> +                                   &dev_res->ports[0].gid_tbl[sgid_idx].gid);
> +    }
> +
> +    pr_dbg("backend_gid_index=%d\n",
> +           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
> +
> +    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
> +}
> +
>   static void destroy_qp_hash_key(gpointer data)
>   {
>       g_bytes_unref(data);
>   }
>   
> +static void init_ports(RdmaDeviceResources *dev_res)
> +{
> +    int i, j;
> +
> +    memset(dev_res->ports, 0, sizeof(dev_res->ports));
> +
> +    for (i = 0; i < MAX_PORTS; i++) {
> +        dev_res->ports[i].state = IBV_PORT_DOWN;
> +        for (j = 0; j < MAX_PORT_GIDS; j++) {
> +            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
> +        }
> +    }
> +}
> +
> +static void fini_ports(RdmaDeviceResources *dev_res,
> +                       RdmaBackendDev *backend_dev, const char *ifname)
> +{
> +    int i;
> +
> +    dev_res->ports[0].state = IBV_PORT_DOWN;
> +    for (i = 0; i < MAX_PORT_GIDS; i++) {
> +        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
> +    }
> +}
> +
>   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
>                    Error **errp)
>   {
> @@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
>                          dev_attr->max_qp_wr, sizeof(void *));
>       res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
>   
> +    init_ports(dev_res);
> +
>       return 0;
>   }
>   
> -void rdma_rm_fini(RdmaDeviceResources *dev_res)
> +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                  const char *ifname)
>   {
> +    fini_ports(dev_res, backend_dev, ifname);
> +
>       res_tbl_free(&dev_res->uc_tbl);
>       res_tbl_free(&dev_res->cqe_ctx_tbl);
>       res_tbl_free(&dev_res->qp_tbl);
> diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
> index b4e04cc7b4..a7169b4e89 100644
> --- a/hw/rdma/rdma_rm.h
> +++ b/hw/rdma/rdma_rm.h
> @@ -22,7 +22,8 @@
>   
>   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
>                    Error **errp);
> -void rdma_rm_fini(RdmaDeviceResources *dev_res);
> +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                  const char *ifname);
>   
>   int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>                        uint32_t *pd_handle, uint32_t ctx_handle);
> @@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
>                        uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
>   RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
>   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> -                      uint32_t qp_handle, uint32_t attr_mask,
> +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
>                         union ibv_gid *dgid, uint32_t dqpn,
>                         enum ibv_qp_state qp_state, uint32_t qkey,
>                         uint32_t rq_psn, uint32_t sq_psn);
> @@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
>   void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
>   void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
>   
> +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, union ibv_gid *gid, int gid_idx);
> +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> +                    const char *ifname, int gid_idx);
> +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> +                                  RdmaBackendDev *backend_dev, int sgid_idx);
> +static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
> +                                             int sgid_idx)
> +{
> +    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
> +}
> +
>   #endif
> diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
> index 9b399063d3..7b3435f991 100644
> --- a/hw/rdma/rdma_rm_defs.h
> +++ b/hw/rdma/rdma_rm_defs.h
> @@ -19,7 +19,7 @@
>   #include "rdma_backend_defs.h"
>   
>   #define MAX_PORTS             1
> -#define MAX_PORT_GIDS         1
> +#define MAX_PORT_GIDS         255
>   #define MAX_GIDS              MAX_PORT_GIDS
>   #define MAX_PORT_PKEYS        1
>   #define MAX_PKEYS             MAX_PORT_PKEYS
> @@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
>       enum ibv_qp_state qp_state;
>   } RdmaRmQP;
>   
> +typedef struct RdmaRmGid {
> +    union ibv_gid gid;
> +    int backend_gid_index;
> +} RdmaRmGid;
> +
>   typedef struct RdmaRmPort {
> -    union ibv_gid gid_tbl[MAX_PORT_GIDS];
> +    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
>       enum ibv_port_state state;
>   } RdmaRmPort;
>   
> diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
> index 04c7c2ef5b..989db249ef 100644
> --- a/hw/rdma/rdma_utils.h
> +++ b/hw/rdma/rdma_utils.h
> @@ -20,6 +20,7 @@
>   #include "qemu/osdep.h"
>   #include "hw/pci/pci.h"
>   #include "sysemu/dma.h"
> +#include "stdio.h"
>   
>   #define pr_info(fmt, ...) \
>       fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
> @@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
>   #define pr_dbg(fmt, ...) \
>       fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
>               __func__, __LINE__, ## __VA_ARGS__)
> +
> +#define pr_dbg_buf(title, buf, len) \
> +{ \
> +    char *b = g_malloc0(len * 3 + 1); \
> +    char b1[4]; \
> +    for (int i = 0; i < len; i++) { \
> +        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
> +        strcat(b, b1); \
> +    } \
> +    pr_dbg("%s (%d): %s\n", title, len, b); \
> +    g_free(b); \
> +}
> +
>   #else
>   #define init_pr_dbg(void)
>   #define pr_dbg(fmt, ...)
> +#define pr_dbg_buf(title, buf, len)
>   #endif
>   
>   void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index 15c3f28b86..b019cb843a 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -79,8 +79,8 @@ typedef struct PVRDMADev {
>       int interrupt_mask;
>       struct ibv_device_attr dev_attr;
>       uint64_t node_guid;
> +    char *backend_eth_device_name;
>       char *backend_device_name;
> -    uint8_t backend_gid_idx;
>       uint8_t backend_port_num;
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index 57d6f41ae6..a334f6205e 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       rsp->hdr.response = cmd->hdr.response;
>       rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
>   
> -    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
> -                                 cmd->qp_handle, cmd->attr_mask,
> -                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> -                                 cmd->attrs.dest_qp_num,
> -                                 (enum ibv_qp_state)cmd->attrs.qp_state,
> -                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
> -                                 cmd->attrs.sq_psn);
> +    /* No need to verify sgid_index since it is u8 */
> +
> +    rsp->hdr.err =
> +        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
> +                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
> +                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> +                          cmd->attrs.dest_qp_num,
> +                          (enum ibv_qp_state)cmd->attrs.qp_state,
> +                          cmd->attrs.qkey, cmd->attrs.rq_psn,
> +                          cmd->attrs.sq_psn);
>   
>       pr_dbg("ret=%d\n", rsp->hdr.err);
>       return rsp->hdr.err;
> @@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>                          union pvrdma_cmd_resp *rsp)
>   {
>       struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
> -#ifdef PVRDMA_DEBUG
> -    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
> -    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
> -#endif
> +    int rc;
> +    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
>   
>       pr_dbg("index=%d\n", cmd->index);
>   
> @@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       }
>   
>       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> -           (long long unsigned int)be64_to_cpu(*subnet),
> -           (long long unsigned int)be64_to_cpu(*if_id));
> +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
>   
> -    /* Driver forces to one port only */
> -    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
> -           sizeof(cmd->new_gid));
> +    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
> +                         dev->backend_eth_device_name, gid, cmd->index);
> +    if (rc < 0) {
> +        return -EINVAL;
> +    }
>   
>       /* TODO: Since drivers stores node_guid at load_dsr phase then this
>        * assignment is not relevant, i need to figure out a way how to
>        * retrieve MAC of our netdev */
> -    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
> -    pr_dbg("dev->node_guid=0x%llx\n",
> -           (long long unsigned int)be64_to_cpu(dev->node_guid));
> +    if (!cmd->index) {
> +        dev->node_guid =
> +            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
> +        pr_dbg("dev->node_guid=0x%llx\n",
> +               (long long unsigned int)be64_to_cpu(dev->node_guid));
> +    }
>   
>       return 0;
>   }
> @@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>                           union pvrdma_cmd_resp *rsp)
>   {
> +    int rc;
> +
>       struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
>   
>       pr_dbg("index=%d\n", cmd->index);
> @@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>           return -EINVAL;
>       }
>   
> -    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
> -           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
> +    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> +                        dev->backend_eth_device_name, cmd->index);
> +
> +    if (rc < 0) {
> +        rsp->hdr.err = rc;
> +        goto out;
> +    }
>   
>       return 0;
>   }
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index fc2abd34af..ac8c092db0 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -36,9 +36,9 @@
>   #include "pvrdma_qp_ops.h"
>   
>   static Property pvrdma_dev_properties[] = {
> -    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
> -    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
> -    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
> +    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
> +    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
> +    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
>       DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
>                          MAX_MR_SIZE),
>       DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
> @@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
>       pr_dbg("Initialized\n");
>   }
>   
> -static void init_ports(PVRDMADev *dev, Error **errp)
> -{
> -    int i;
> -
> -    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
> -
> -    for (i = 0; i < MAX_PORTS; i++) {
> -        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
> -    }
> -}
> -
>   static void uninit_msix(PCIDevice *pdev, int used_vectors)
>   {
>       PVRDMADev *dev = PVRDMA_DEV(pdev);
> @@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
>   
>       pvrdma_qp_ops_fini();
>   
> -    rdma_rm_fini(&dev->rdma_dev_res);
> +    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
> +                 dev->backend_eth_device_name);
>   
>       rdma_backend_fini(&dev->backend_dev);
>   
> @@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>   
>       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
>                              dev->backend_device_name, dev->backend_port_num,
> -                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> -                           errp);
> +                           &dev->dev_attr, &dev->mad_chr, errp);
>       if (rc) {
>           goto out;
>       }
> @@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>           goto out;
>       }
>   
> -    init_ports(dev, errp);
> -
>       rc = pvrdma_qp_ops_init();
>       if (rc) {
>           goto out;
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index 3388be1926..2130824098 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
>       RdmaRmQP *qp;
>       PvrdmaSqWqe *wqe;
>       PvrdmaRing *ring;
> +    int sgid_idx;
> +    union ibv_gid *sgid;
>   
>       pr_dbg("qp_handle=0x%x\n", qp_handle);
>   
> @@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
>           comp_ctx->cqe.qp = qp_handle;
>           comp_ctx->cqe.opcode = IBV_WC_SEND;
>   
> +        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
> +        if (!sgid) {
> +            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
> +            return -EIO;
> +        }
> +        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
> +               sgid->global.interface_id);
> +
> +        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
> +                                                 &dev->backend_dev,
> +                                                 wqe->hdr.wr.ud.av.gid_index);
> +        if (sgid_idx <= 0) {
> +            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
> +                   wqe->hdr.wr.ud.av.gid_index);
> +            return -EIO;
> +        }
> +
>           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
>                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
> +                               sgid_idx, sgid,
>                                  (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
>                                  wqe->hdr.wr.ud.remote_qpn,
>                                  wqe->hdr.wr.ud.remote_qkey, comp_ctx);


Thanks,
Marcel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown Yuval Shaia
@ 2018-11-25  7:30   ` Yuval Shaia
  2018-11-25  7:41   ` Marcel Apfelbaum
  1 sibling, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-25  7:30 UTC (permalink / raw)
  To: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, shamir.rabinovitch, cohuck, yuval.shaia

On Thu, Nov 22, 2018 at 02:14:01PM +0200, Yuval Shaia wrote:
> All resources are already cleaned at rm_fini phase.

Please ignore this patch, I will squash it into patch #5.

> 
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  hw/rdma/rdma_backend.c | 21 +--------------------
>  1 file changed, 1 insertion(+), 20 deletions(-)
> 
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 6a1e39d4c0..8ab25e94b1 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -1075,28 +1075,9 @@ static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
>  
>  static void mad_stop(RdmaBackendDev *backend_dev)
>  {
> -    QObject *o_ctx_id;
> -    unsigned long cqe_ctx_id;
> -    BackendCtx *bctx;
> -
> -    pr_dbg("Closing MAD\n");
> +    pr_dbg("Stopping MAD\n");
>  
>      disable_rdmacm_mux_async(backend_dev);
> -
> -    /* Clear MAD buffers list */
> -    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> -    do {
> -        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> -        if (o_ctx_id) {
> -            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> -            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> -            if (bctx) {
> -                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> -                g_free(bctx);
> -            }
> -        }
> -    } while (o_ctx_id);
> -    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
>  }
>  
>  static void mad_fini(RdmaBackendDev *backend_dev)
> -- 
> 2.17.2
> 


* Re: [Qemu-devel] [PATCH v5 13/24] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 13/24] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
@ 2018-11-25  7:31   ` Marcel Apfelbaum
  0 siblings, 0 replies; 39+ messages in thread
From: Marcel Apfelbaum @ 2018-11-25  7:31 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/22/18 2:13 PM, Yuval Shaia wrote:
> The guest driver enforces it; we should too.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma.h      |  2 ++
>   hw/rdma/vmw/pvrdma_main.c | 12 ++++++++++++
>   2 files changed, 14 insertions(+)
>
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index b019cb843a..10a3c4fb7c 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -20,6 +20,7 @@
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msix.h"
>   #include "chardev/char-fe.h"
> +#include "hw/net/vmxnet3_defs.h"
>   
>   #include "../rdma_backend_defs.h"
>   #include "../rdma_rm_defs.h"
> @@ -85,6 +86,7 @@ typedef struct PVRDMADev {
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
>       CharBackend mad_chr;
> +    VMXNET3State *func0;
>   } PVRDMADev;
>   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index ac8c092db0..b35b5dc5f0 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -565,6 +565,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>       PVRDMADev *dev = PVRDMA_DEV(pdev);
>       Object *memdev_root;
>       bool ram_shared = false;
> +    PCIDevice *func0;
>   
>       init_pr_dbg();
>   
> @@ -576,6 +577,17 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>           return;
>       }
>   
> +    func0 = pci_get_function_0(pdev);
> +    /* Break if not vmxnet3 device in slot 0 */
> +    if (strcmp(object_get_typename(&func0->qdev.parent_obj), TYPE_VMXNET3)) {
> +        pr_dbg("func0 type is %s\n",
> +               object_get_typename(&func0->qdev.parent_obj));
> +        error_setg(errp, "Device on %x.0 must be %s", PCI_SLOT(pdev->devfn),
> +                   TYPE_VMXNET3);
> +        return;
> +    }
> +    dev->func0 = VMXNET3(func0);
> +
>       memdev_root = object_resolve_path("/objects", NULL);
>       if (memdev_root) {
>           object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel


* Re: [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response Yuval Shaia
@ 2018-11-25  7:40   ` Marcel Apfelbaum
  2018-11-25 11:53     ` Yuval Shaia
  0 siblings, 1 reply; 39+ messages in thread
From: Marcel Apfelbaum @ 2018-11-25  7:40 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/22/18 2:13 PM, Yuval Shaia wrote:
> The driver checks the error code, so let's set it.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
>   1 file changed, 48 insertions(+), 19 deletions(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index 0d3c818c20..a326c5d470 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       if (rdma_backend_query_port(&dev->backend_dev,
>                                   (struct ibv_port_attr *)&attrs)) {
> -        return -ENOMEM;
> +        resp->hdr.err = -ENOMEM;
> +        goto out;
>       }
>   
>       memset(resp, 0, sizeof(*resp));
> @@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->attrs.active_width = 1;
>       resp->attrs.active_speed = 1;
>   
> -    return 0;
> +out:
> +    pr_dbg("ret=%d\n", resp->hdr.err);
> +    return resp->hdr.err;
>   }
>   
>   static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->pkey = PVRDMA_PKEY;
>       pr_dbg("pkey=0x%x\n", resp->pkey);
>   
> -    return 0;
> +    return resp->hdr.err;
>   }
>   
>   static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +    return rsp->hdr.err;
>   }
>   
>   static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +    return rsp->hdr.err;
>   }
>   
>   static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
> @@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
>       if (!cq) {
>           pr_dbg("Invalid CQ handle\n");
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       ring = (PvrdmaRing *)cq->opaque;
> @@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
> @@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
>       if (!qp) {
>           pr_dbg("Invalid QP handle\n");
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
> @@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
>       g_free(ring);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       pr_dbg("index=%d\n", cmd->index);
>   
>       if (cmd->index >= MAX_PORT_GIDS) {
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> @@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
>                            dev->backend_eth_device_name, gid, cmd->index);
>       if (rc < 0) {
> -        return -EINVAL;
> +        rsp->hdr.err = rc;
> +        goto out;
>       }
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       pr_dbg("index=%d\n", cmd->index);
>   
>       if (cmd->index >= MAX_PORT_GIDS) {
> -        return -EINVAL;
> +        rsp->hdr.err = -EINVAL;
> +        goto out;
>       }
>   
>       rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> @@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
>           goto out;
>       }
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +out:
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>       resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
>                                        &resp->ctx_handle);
>   
> -    pr_dbg("ret=%d\n", resp->hdr.err);
> -
> -    return 0;
> +    pr_dbg("ret=%d\n", rsp->hdr.err);
> +    return rsp->hdr.err;
>   }
>   
>   static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
> @@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
>   
>       rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
>   
> -    return 0;
> +    rsp->hdr.err = 0;
> +
> +    return rsp->hdr.err;
>   }
>   struct cmd_handler {
>       uint32_t cmd;
> @@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
>       }
>   
>       err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
> -                            dsr_info->rsp);
> +                                                    dsr_info->rsp);
>   out:
>       set_reg_val(dev, PVRDMA_REG_ERR, err);
>       post_interrupt(dev, INTR_VEC_CMD_RING);


As I responded in the V4 thread :) one might forget to set hdr.err to 0,
resulting in errors that are hard to debug.

Please consider clearing the field on init.
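
A minimal sketch of that suggestion, using hypothetical simplified stand-ins
for the pvrdma structures (the real union pvrdma_cmd_resp lives in the pvrdma
headers): the dispatcher clears hdr.err once, up front, so handlers only need
to write it on failure.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the pvrdma command structures. */
struct rsp_hdr {
    int err;
};

union cmd_resp {
    struct rsp_hdr hdr;
    char raw[64];
};

/* In this scheme a handler only writes hdr.err on failure. */
static int destroy_pd(union cmd_resp *rsp, int handle_valid)
{
    if (!handle_valid) {
        rsp->hdr.err = -22;     /* -EINVAL */
    }
    return rsp->hdr.err;
}

/* The dispatcher clears the error field before dispatching, so no handler
 * can accidentally leave stale garbage in it. */
static int execute_command(union cmd_resp *rsp, int handle_valid)
{
    rsp->hdr.err = 0;           /* cleared on init, as suggested */
    return destroy_pd(rsp, handle_valid);
}
```

With this, a success path that never touches hdr.err still reports 0.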

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel


* Re: [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown
  2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown Yuval Shaia
  2018-11-25  7:30   ` Yuval Shaia
@ 2018-11-25  7:41   ` Marcel Apfelbaum
  1 sibling, 0 replies; 39+ messages in thread
From: Marcel Apfelbaum @ 2018-11-25  7:41 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck



On 11/22/18 2:14 PM, Yuval Shaia wrote:
> All resources are already cleaned at rm_fini phase.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c | 21 +--------------------
>   1 file changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 6a1e39d4c0..8ab25e94b1 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -1075,28 +1075,9 @@ static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
>   
>   static void mad_stop(RdmaBackendDev *backend_dev)
>   {
> -    QObject *o_ctx_id;
> -    unsigned long cqe_ctx_id;
> -    BackendCtx *bctx;
> -
> -    pr_dbg("Closing MAD\n");
> +    pr_dbg("Stopping MAD\n");
>   
>       disable_rdmacm_mux_async(backend_dev);
> -
> -    /* Clear MAD buffers list */
> -    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> -    do {
> -        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> -        if (o_ctx_id) {
> -            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> -            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> -            if (bctx) {
> -                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> -                g_free(bctx);
> -            }
> -        }
> -    } while (o_ctx_id);
> -    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
>   }
>   
>   static void mad_fini(RdmaBackendDev *backend_dev)
>

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel


* Re: [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-25  7:29   ` Marcel Apfelbaum
@ 2018-11-25  9:10     ` Yuval Shaia
  0 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-25  9:10 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Sun, Nov 25, 2018 at 09:29:19AM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/22/18 2:13 PM, Yuval Shaia wrote:
> > The control over the RDMA device's GID table is done by updating the
> > device's Ethernet function addresses.
> > Usually the first GID entry is determined by the MAC address, the second
> > by the first IPv6 address and the third by the IPv4 address. Other
> > entries can be added by adding more IP addresses. The opposite is the
> > same, i.e. whenever an address is removed, the corresponding GID entry
> > is removed.
> > 
> > The process is done by the network and RDMA stacks. Whenever an address
> > is added, the ib_core driver is notified and calls the device driver's
> > add_gid function, which in turn updates the device.
> > 
> > To support this in pvrdma device we need to hook into the create_bind
> > and destroy_bind HW commands triggered by pvrdma driver in guest.
> > Whenever a change is made to the pvrdma port's GID table, a special QMP
> > message is sent to be processed by libvirt, which updates the address of
> > the backend Ethernet device.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_backend.c      | 336 +++++++++++++++++++++++++-----------
> >   hw/rdma/rdma_backend.h      |  22 +--
> >   hw/rdma/rdma_backend_defs.h |  11 +-
> >   hw/rdma/rdma_rm.c           | 104 ++++++++++-
> >   hw/rdma/rdma_rm.h           |  17 +-
> >   hw/rdma/rdma_rm_defs.h      |   9 +-
> >   hw/rdma/rdma_utils.h        |  15 ++
> >   hw/rdma/vmw/pvrdma.h        |   2 +-
> >   hw/rdma/vmw/pvrdma_cmd.c    |  55 +++---
> >   hw/rdma/vmw/pvrdma_main.c   |  25 +--
> >   hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
> >   11 files changed, 453 insertions(+), 163 deletions(-)
> > 
> > diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> > index 7c220a5798..8b5a111bf4 100644
> > --- a/hw/rdma/rdma_backend.c
> > +++ b/hw/rdma/rdma_backend.c
> > @@ -15,15 +15,18 @@
> >   #include "qemu/osdep.h"
> >   #include "qemu/error-report.h"
> > +#include "sysemu/sysemu.h"
> >   #include "qapi/error.h"
> >   #include "qapi/qmp/qlist.h"
> >   #include "qapi/qmp/qnum.h"
> > +#include "qapi/qapi-events-rdma.h"
> >   #include <infiniband/verbs.h>
> >   #include <infiniband/umad_types.h>
> >   #include <infiniband/umad.h>
> >   #include <rdma/rdma_user_cm.h>
> > +#include "contrib/rdmacm-mux/rdmacm-mux.h"
> >   #include "trace.h"
> >   #include "rdma_utils.h"
> >   #include "rdma_rm.h"
> > @@ -160,6 +163,71 @@ static void *comp_handler_thread(void *arg)
> >       return NULL;
> >   }
> > +static inline void disable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
> > +{
> > +    atomic_set(&backend_dev->rdmacm_mux.can_receive, 0);
> > +}
> > +
> > +static inline void enable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
> > +{
> > +    atomic_set(&backend_dev->rdmacm_mux.can_receive, sizeof(RdmaCmMuxMsg));
> 
> Why sizeof is used to set the can_receive field?

The documentation says: "Return the number of bytes that #IOReadHandler can
accept".
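
A sketch of the gating, with stand-in types (the real RdmaCmMuxMsg is defined
in contrib/rdmacm-mux/rdmacm-mux.h; the size here is hypothetical): the
can_receive field doubles as the chardev can-read handler's return value, so
setting it to 0 blocks async delivery while a synchronous request/response is
in flight, and setting it to sizeof() advertises exactly one whole message.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for RdmaCmMuxMsg. */
typedef struct {
    char payload[104];
} MuxMsgStub;

typedef struct {
    atomic_int can_receive;
} MuxState;

/* The chardev layer polls this handler to learn how many bytes the
 * front end will accept right now. */
static int mux_can_receive(MuxState *s)
{
    return atomic_load(&s->can_receive);
}

/* Advertise 0 while a synchronous exchange is in flight, so no async
 * message can be delivered in between... */
static void disable_async(MuxState *s)
{
    atomic_store(&s->can_receive, 0);
}

/* ...and afterwards advertise one whole message's worth of bytes, hence
 * the sizeof(): messages are consumed whole, never in fragments. */
static void enable_async(MuxState *s)
{
    atomic_store(&s->can_receive, (int)sizeof(MuxMsgStub));
}
```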

> 
> > +}
> > +
> > +static inline int rdmacm_mux_can_process_async(RdmaBackendDev *backend_dev)
> > +{
> > +    return atomic_read(&backend_dev->rdmacm_mux.can_receive);
> > +}
> > +
> > +static int check_mux_op_status(CharBackend *mad_chr_be)
> > +{
> > +    RdmaCmMuxMsg msg = {0};
> > +    int ret;
> > +
> > +    pr_dbg("Reading response\n");
> > +    ret = qemu_chr_fe_read_all(mad_chr_be, (uint8_t *)&msg, sizeof(msg));
> > +    if (ret != sizeof(msg)) {
> > +        pr_dbg("Invalid message size %d, expecting %ld\n", ret, sizeof(msg));
> > +        return -EIO;
> > +    }
> > +
> > +    if (msg.hdr.msg_type != RDMACM_MUX_MSG_TYPE_RESP) {
> > +        pr_dbg("Invalid message type %d\n", msg.hdr.msg_type);
> > +        return -EIO;
> > +    }
> > +
> > +    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
> > +        pr_dbg("Operation failed in mux, error code %d\n", msg.hdr.err_code);
> > +        return -EIO;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int exec_rdmacm_mux_req(RdmaBackendDev *backend_dev, RdmaCmMuxMsg *msg)
> > +{
> > +    int rc = 0;
> > +
> > +    pr_dbg("Executing request %d\n", msg->hdr.op_code);
> > +
> > +    msg->hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ;
> > +    disable_rdmacm_mux_async(backend_dev);
> > +    rc = qemu_chr_fe_write(backend_dev->rdmacm_mux.chr_be,
> > +                           (const uint8_t *)msg, sizeof(*msg));
> > +    enable_rdmacm_mux_async(backend_dev);
> > +    if (rc != sizeof(*msg)) {
> > +        pr_dbg("Fail to send request to rdmacm_mux (rc=%d)\n", rc);
> > +        return -EIO;
> > +    }
> > +
> > +    rc = check_mux_op_status(backend_dev->rdmacm_mux.chr_be);
> > +    if (rc) {
> > +        pr_dbg("Fail to execute rdmacm_mux request %d (rc=%d)\n",
> > +               msg->hdr.op_code, rc);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >   static void stop_backend_thread(RdmaBackendThread *thread)
> >   {
> >       thread->run = false;
> > @@ -300,11 +368,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
> >       return 0;
> >   }
> > -static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> > -                    uint32_t num_sge)
> > +static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
> > +                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
> >   {
> > -    struct backend_umad umad = {0};
> > -    char *hdr, *msg;
> > +    RdmaCmMuxMsg msg = {0};
> > +    char *hdr, *data;
> >       int ret;
> >       pr_dbg("num_sge=%d\n", num_sge);
> > @@ -313,26 +381,31 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> >           return -EINVAL;
> >       }
> > -    umad.hdr.length = sge[0].length + sge[1].length;
> > -    pr_dbg("msg_len=%d\n", umad.hdr.length);
> > +    msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD;
> > +    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
> > -    if (umad.hdr.length > sizeof(umad.mad)) {
> > +    msg.umad_len = sge[0].length + sge[1].length;
> > +    pr_dbg("umad_len=%d\n", msg.umad_len);
> > +
> > +    if (msg.umad_len > sizeof(msg.umad.mad)) {
> >           return -ENOMEM;
> >       }
> > -    umad.hdr.addr.qpn = htobe32(1);
> > -    umad.hdr.addr.grh_present = 1;
> > -    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> > -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > -    umad.hdr.addr.hop_limit = 1;
> > +    msg.umad.hdr.addr.qpn = htobe32(1);
> > +    msg.umad.hdr.addr.grh_present = 1;
> > +    pr_dbg("sgid_idx=%d\n", sgid_idx);
> > +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
> > +    msg.umad.hdr.addr.gid_index = sgid_idx;
> > +    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
> > +    msg.umad.hdr.addr.hop_limit = 1;
> 
> Why is hop_limit set to 1 ?

Probably a habit from IB; I guess I can safely set it to 0xFF, thanks.
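
For context, a tiny sketch with a hypothetical stand-in for the GRH part of
ibv_ah_attr: on RoCE the GRH hop limit is carried as the IP TTL, so 1 confines
traffic to the local subnet while 0xFF lets it cross routers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GRH fields of ibv_ah_attr. */
struct grh_stub {
    uint8_t hop_limit;
};

/* hop_limit maps to the IP TTL on RoCE: 1 = local subnet only,
 * 0xFF = routable across the network. */
static void make_routable(struct grh_stub *grh)
{
    grh->hop_limit = 0xFF;
}
```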

> 
> >       hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> >       if (!hdr) {
> >           pr_dbg("Fail to map to sge[0]\n");
> >           return -ENOMEM;
> >       }
> > -    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > -    if (!msg) {
> > +    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > +    if (!data) {
> >           pr_dbg("Fail to map to sge[1]\n");
> >           rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> >           return -ENOMEM;
> > @@ -341,25 +414,27 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> >       pr_dbg_buf("mad_hdr", hdr, sge[0].length);
> >       pr_dbg_buf("mad_data", data, sge[1].length);
> > -    memcpy(&umad.mad[0], hdr, sge[0].length);
> > -    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> > +    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
> > +    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
> > -    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> > +    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
> >       rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> > -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > -                            sizeof(umad));
> > -
> > -    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> > +    ret = exec_rdmacm_mux_req(backend_dev, &msg);
> > +    if (ret) {
> > +        pr_dbg("Fail to send MAD to rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > -    return (ret != sizeof(umad));
> > +    return 0;
> >   }
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > -                            union ibv_gid *dgid, uint32_t dqpn,
> > -                            uint32_t dqkey, void *ctx)
> > +                            uint8_t sgid_idx, union ibv_gid *sgid,
> > +                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
> > +                            void *ctx)
> >   {
> >       BackendCtx *bctx;
> >       struct ibv_sge new_sge[MAX_SGE];
> > @@ -373,7 +448,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> >           } else if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            rc = mad_send(backend_dev, sge, num_sge);
> > +            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
> >               if (rc) {
> >                   comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> >               } else {
> > @@ -409,8 +484,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >       }
> >       if (qp_type == IBV_QPT_UD) {
> > -        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
> > -                                backend_dev->backend_gid_idx, dgid);
> > +        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
> >           if (!wr.wr.ud.ah) {
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
> >               goto out_dealloc_cqe_ctx;
> > @@ -715,9 +789,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >   }
> >   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> > -                              uint8_t qp_type, union ibv_gid *dgid,
> > -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> > -                              bool use_qkey)
> > +                              uint8_t qp_type, uint8_t sgid_idx,
> > +                              union ibv_gid *dgid, uint32_t dqpn,
> > +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
> >   {
> >       struct ibv_qp_attr attr = {0};
> >       union ibv_gid ibv_gid = {
> > @@ -729,13 +803,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >       attr.qp_state = IBV_QPS_RTR;
> >       attr_mask = IBV_QP_STATE;
> > +    qp->sgid_idx = sgid_idx;
> > +
> >       switch (qp_type) {
> >       case IBV_QPT_RC:
> >           pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
> >                  be64_to_cpu(ibv_gid.global.subnet_prefix),
> >                  be64_to_cpu(ibv_gid.global.interface_id));
> >           pr_dbg("dqpn=0x%x\n", dqpn);
> > -        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
> > +        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
> >           pr_dbg("sport_num=%d\n", backend_dev->port_num);
> >           pr_dbg("rq_psn=0x%x\n", rq_psn);
> > @@ -747,7 +823,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >           attr.ah_attr.is_global      = 1;
> >           attr.ah_attr.grh.hop_limit  = 1;
> >           attr.ah_attr.grh.dgid       = ibv_gid;
> > -        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
> > +        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
> >           attr.rq_psn                 = rq_psn;
> >           attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
> > @@ -756,8 +832,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >           break;
> >       case IBV_QPT_UD:
> > +        pr_dbg("qkey=0x%x\n", qkey);
> >           if (use_qkey) {
> > -            pr_dbg("qkey=0x%x\n", qkey);
> >               attr.qkey = qkey;
> >               attr_mask |= IBV_QP_QKEY;
> >           }
> > @@ -873,29 +949,19 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> >       grh->dgid = *my_gid;
> >       pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> > -    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> > -    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> > +    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
> > +    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
> >   }
> > -static inline int mad_can_receieve(void *opaque)
> > +static void process_incoming_mad_req(RdmaBackendDev *backend_dev,
> > +                                     RdmaCmMuxMsg *msg)
> >   {
> > -    return sizeof(struct backend_umad);
> > -}
> > -
> > -static void mad_read(void *opaque, const uint8_t *buf, int size)
> > -{
> > -    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> >       QObject *o_ctx_id;
> >       unsigned long cqe_ctx_id;
> >       BackendCtx *bctx;
> >       char *mad;
> > -    struct backend_umad *umad;
> > -    assert(size != sizeof(umad));
> > -    umad = (struct backend_umad *)buf;
> > -
> > -    pr_dbg("Got %d bytes\n", size);
> > -    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> > +    pr_dbg("umad_len=%d\n", msg->umad_len);
> >   #ifdef PVRDMA_DEBUG
> >       struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> > @@ -925,15 +991,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
> >       mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> >                              bctx->sge.length);
> > -    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> > +    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
> >           comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> >                        bctx->up_ctx);
> >       } else {
> > +        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
> >           memset(mad, 0, bctx->sge.length);
> >           build_mad_hdr((struct ibv_grh *)mad,
> > -                      (union ibv_gid *)&umad->hdr.addr.gid,
> > -                      &backend_dev->gid, umad->hdr.length);
> > -        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> > +                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
> > +                      msg->umad_len);
> > +        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
> >           rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> >           comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> > @@ -943,30 +1010,51 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
> >       rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> >   }
> > -static int mad_init(RdmaBackendDev *backend_dev)
> > +static inline int rdmacm_mux_can_receive(void *opaque)
> >   {
> > -    struct backend_umad umad = {0};
> > -    int ret;
> > +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> > -    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> > -        pr_dbg("Missing chardev for MAD multiplexer\n");
> > -        return -EIO;
> > +    return rdmacm_mux_can_process_async(backend_dev);
> > +}
> > +
> > +static void rdmacm_mux_read(void *opaque, const uint8_t *buf, int size)
> > +{
> > +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> > +    RdmaCmMuxMsg *msg = (RdmaCmMuxMsg *)buf;
> > +
> > +    pr_dbg("Got %d bytes\n", size);
> > +    pr_dbg("msg_type=%d\n", msg->hdr.msg_type);
> > +    pr_dbg("op_code=%d\n", msg->hdr.op_code);
> > +
> > +    if (msg->hdr.msg_type != RDMACM_MUX_MSG_TYPE_REQ &&
> > +        msg->hdr.op_code != RDMACM_MUX_OP_CODE_MAD) {
> > +            pr_dbg("Error: Not a MAD request, skipping\n");
> > +            return;
> 
> No error flow on mux_read ? What happens at caller site on error ?

This is an async channel; no response is issued.

Plus, by design the combination of REQ and MAD is the only way the mux
passes MAD messages to clients; any other message is considered a bug and
is discarded.

> 
> >       }
> > +    process_incoming_mad_req(backend_dev, msg);
> > +}
> > +
> > +static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
> > +{
> > +    int ret;
> > -    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> > -                             mad_read, NULL, NULL, backend_dev, NULL, true);
> > +    backend_dev->rdmacm_mux.chr_be = mad_chr_be;
> > -    /* Register ourself */
> > -    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > -    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > -                            sizeof(umad.hdr));
> > -    if (ret != sizeof(umad.hdr)) {
> > -        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> > +    ret = qemu_chr_fe_backend_connected(backend_dev->rdmacm_mux.chr_be);
> > +    if (!ret) {
> > +        pr_dbg("Missing chardev for MAD multiplexer\n");
> > +        return -EIO;
> >       }
> >       qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> >       backend_dev->recv_mads_list.list = qlist_new();
> > +    enable_rdmacm_mux_async(backend_dev);
> > +
> > +    qemu_chr_fe_set_handlers(backend_dev->rdmacm_mux.chr_be,
> > +                             rdmacm_mux_can_receive, rdmacm_mux_read, NULL,
> > +                             NULL, backend_dev, NULL, true);
> > +
> >       return 0;
> >   }
> > @@ -978,6 +1066,8 @@ static void mad_stop(RdmaBackendDev *backend_dev)
> >       pr_dbg("Closing MAD\n");
> > +    disable_rdmacm_mux_async(backend_dev);
> > +
> >       /* Clear MAD buffers list */
> >       qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> >       do {
> > @@ -1000,23 +1090,94 @@ static void mad_fini(RdmaBackendDev *backend_dev)
> >       qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> >   }
> > +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> > +                               union ibv_gid *gid)
> > +{
> > +    union ibv_gid sgid;
> > +    int ret;
> > +    int i = 0;
> > +
> > +    pr_dbg("0x%llx, 0x%llx\n",
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > +
> > +    do {
> > +        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
> > +                            &sgid);
> > +        i++;
> > +    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
> > +
> > +    pr_dbg("gid_index=%d\n", i - 1);
> > +
> > +    return ret ? ret : i - 1;
> > +}
> > +
> > +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid)
> > +{
> > +    RdmaCmMuxMsg msg = {0};
> > +    int ret;
> > +
> > +    pr_dbg("0x%llx, 0x%llx\n",
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > +
> > +    msg.hdr.op_code = RDMACM_MUX_OP_CODE_REG;
> > +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> > +
> > +    ret = exec_rdmacm_mux_req(backend_dev, &msg);
> > +    if (ret) {
> > +        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > +
> > +    qapi_event_send_rdma_gid_status_changed(ifname, true,
> > +                                            gid->global.subnet_prefix,
> > +                                            gid->global.interface_id);
> > +
> > +    return ret;
> > +}
> > +
> > +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid)
> > +{
> > +    RdmaCmMuxMsg msg = {0};
> > +    int ret;
> > +
> > +    pr_dbg("0x%llx, 0x%llx\n",
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > +
> > +    msg.hdr.op_code = RDMACM_MUX_OP_CODE_UNREG;
> > +    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
> > +
> > +    ret = exec_rdmacm_mux_req(backend_dev, &msg);
> > +    if (ret) {
> > +        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
> > +        return -EIO;
> > +    }
> > +
> > +    qapi_event_send_rdma_gid_status_changed(ifname, false,
> > +                                            gid->global.subnet_prefix,
> > +                                            gid->global.interface_id);
> > +
> > +    return 0;
> > +}
> > +
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> > -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      CharBackend *mad_chr_be, Error **errp)
> > +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> > +                      Error **errp)
> >   {
> >       int i;
> >       int ret = 0;
> >       int num_ibv_devices;
> >       struct ibv_device **dev_list;
> > -    struct ibv_port_attr port_attr;
> >       memset(backend_dev, 0, sizeof(*backend_dev));
> >       backend_dev->dev = pdev;
> > -    backend_dev->mad_chr_be = mad_chr_be;
> > -    backend_dev->backend_gid_idx = backend_gid_idx;
> >       backend_dev->port_num = port_num;
> >       backend_dev->rdma_dev_res = rdma_dev_res;
> > @@ -1053,9 +1214,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >           backend_dev->ib_dev = *dev_list;
> >       }
> > -    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
> > -           ibv_get_device_name(backend_dev->ib_dev),
> > -           backend_dev->port_num, backend_dev->backend_gid_idx);
> > +    pr_dbg("Using backend device %s, port %d\n",
> > +           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
> >       backend_dev->context = ibv_open_device(backend_dev->ib_dev);
> >       if (!backend_dev->context) {
> > @@ -1072,20 +1232,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       }
> >       pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
> > -    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
> > -                         &port_attr);
> > -    if (ret) {
> > -        error_setg(errp, "Error %d from ibv_query_port", ret);
> > -        ret = -EIO;
> > -        goto out_destroy_comm_channel;
> > -    }
> > -
> > -    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
> > -        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
> > -                   port_attr.gid_tbl_len);
> > -        goto out_destroy_comm_channel;
> > -    }
> > -
> >       ret = init_device_caps(backend_dev, dev_attr);
> >       if (ret) {
> >           error_setg(errp, "Failed to initialize device capabilities");
> > @@ -1093,20 +1239,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >           goto out_destroy_comm_channel;
> >       }
> > -    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
> > -                         backend_dev->backend_gid_idx, &backend_dev->gid);
> > -    if (ret) {
> > -        error_setg(errp, "Failed to query gid %d",
> > -                   backend_dev->backend_gid_idx);
> > -        ret = -EIO;
> > -        goto out_destroy_comm_channel;
> > -    }
> > -    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
> > -           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
> > -    pr_dbg("interface_id=0x%" PRIx64 "\n",
> > -           be64_to_cpu(backend_dev->gid.global.interface_id));
> > -    ret = mad_init(backend_dev);
> > +    ret = mad_init(backend_dev, mad_chr_be);
> >       if (ret) {
> >           error_setg(errp, "Fail to initialize mad");
> >           ret = -EIO;
> > diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> > index fc83330251..59ad2b874b 100644
> > --- a/hw/rdma/rdma_backend.h
> > +++ b/hw/rdma/rdma_backend.h
> > @@ -28,11 +28,6 @@ enum ibv_special_qp_type {
> >       IBV_QPT_GSI = 1,
> >   };
> > -static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
> > -{
> > -    return &dev->gid;
> > -}
> > -
> >   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
> >   {
> >       return qp->ibqp ? qp->ibqp->qp_num : 1;
> > @@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> > -                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      CharBackend *mad_chr_be, Error **errp);
> > +                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
> > +                      Error **errp);
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> > +int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid);
> > +int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
> > +                         union ibv_gid *gid);
> > +int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
> > +                               union ibv_gid *gid);
> >   void rdma_backend_start(RdmaBackendDev *backend_dev);
> >   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> >   void rdma_backend_register_comp_handler(void (*handler)(int status,
> > @@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> >   int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> >                                  uint8_t qp_type, uint32_t qkey);
> >   int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> > -                              uint8_t qp_type, union ibv_gid *dgid,
> > -                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
> > -                              bool use_qkey);
> > +                              uint8_t qp_type, uint8_t sgid_idx,
> > +                              union ibv_gid *dgid, uint32_t dqpn,
> > +                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
> >   int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
> >                                 uint32_t sq_psn, uint32_t qkey, bool use_qkey);
> >   int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
> > @@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > +                            uint8_t sgid_idx, union ibv_gid *sgid,
> >                               union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
> >                               void *ctx);
> >   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> > diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> > index 2a7e667075..1e5c3dd3bf 100644
> > --- a/hw/rdma/rdma_backend_defs.h
> > +++ b/hw/rdma/rdma_backend_defs.h
> > @@ -19,6 +19,7 @@
> >   #include "qemu/thread.h"
> >   #include "chardev/char-fe.h"
> >   #include <infiniband/verbs.h>
> > +#include "contrib/rdmacm-mux/rdmacm-mux.h"
> >   typedef struct RdmaDeviceResources RdmaDeviceResources;
> > @@ -34,19 +35,22 @@ typedef struct RecvMadList {
> >       QList *list;
> >   } RecvMadList;
> > +typedef struct RdmaCmMux {
> > +    CharBackend *chr_be;
> > +    int can_receive;
> > +} RdmaCmMux;
> > +
> >   typedef struct RdmaBackendDev {
> >       struct ibv_device_attr dev_attr;
> >       RdmaBackendThread comp_thread;
> > -    union ibv_gid gid;
> >       PCIDevice *dev;
> >       RdmaDeviceResources *rdma_dev_res;
> >       struct ibv_device *ib_dev;
> >       struct ibv_context *context;
> >       struct ibv_comp_channel *channel;
> >       uint8_t port_num;
> > -    uint8_t backend_gid_idx;
> >       RecvMadList recv_mads_list;
> > -    CharBackend *mad_chr_be;
> > +    RdmaCmMux rdmacm_mux;
> >   } RdmaBackendDev;
> >   typedef struct RdmaBackendPD {
> > @@ -66,6 +70,7 @@ typedef struct RdmaBackendCQ {
> >   typedef struct RdmaBackendQP {
> >       struct ibv_pd *ibpd;
> >       struct ibv_qp *ibqp;
> > +    uint8_t sgid_idx;
> >   } RdmaBackendQP;
> >   #endif
> > diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> > index 4f10fcabcc..250254561c 100644
> > --- a/hw/rdma/rdma_rm.c
> > +++ b/hw/rdma/rdma_rm.c
> > @@ -391,7 +391,7 @@ out_dealloc_qp:
> >   }
> >   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > -                      uint32_t qp_handle, uint32_t attr_mask,
> > +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
> >                         union ibv_gid *dgid, uint32_t dqpn,
> >                         enum ibv_qp_state qp_state, uint32_t qkey,
> >                         uint32_t rq_psn, uint32_t sq_psn)
> > @@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >       int ret;
> >       pr_dbg("qpn=0x%x\n", qp_handle);
> > +    pr_dbg("qkey=0x%x\n", qkey);
> >       qp = rdma_rm_get_qp(dev_res, qp_handle);
> >       if (!qp) {
> > @@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >           }
> >           if (qp->qp_state == IBV_QPS_RTR) {
> > +            /* Get backend gid index */
> > +            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
> > +            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
> > +                                                     sgid_idx);
> > +            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
> > +                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
> > +                return -EIO;
> > +            }
> > +
> >               ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
> > -                                            qp->qp_type, dgid, dqpn, rq_psn,
> > -                                            qkey, attr_mask & IBV_QP_QKEY);
> > +                                            qp->qp_type, sgid_idx, dgid, dqpn,
> > +                                            rq_psn, qkey,
> > +                                            attr_mask & IBV_QP_QKEY);
> >               if (ret) {
> >                   return -EIO;
> >               }
> > @@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
> >       res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
> >   }
> > +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, union ibv_gid *gid, int gid_idx)
> > +{
> > +    int rc;
> > +
> > +    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
> > +    if (rc) {
> > +        pr_dbg("Fail to add gid\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
> > +
> > +    return 0;
> > +}
> > +
> > +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, int gid_idx)
> > +{
> > +    int rc;
> > +
> > +    rc = rdma_backend_del_gid(backend_dev, ifname,
> > +                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
> > +    if (rc) {
> > +        pr_dbg("Fail to delete gid\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
> > +           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
> > +    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
> > +
> > +    return 0;
> > +}
> > +
> > +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> > +                                  RdmaBackendDev *backend_dev, int sgid_idx)
> > +{
> > +    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
> > +        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
> > +        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
> > +        rdma_backend_get_gid_index(backend_dev,
> > +                                   &dev_res->ports[0].gid_tbl[sgid_idx].gid);
> > +    }
> > +
> > +    pr_dbg("backend_gid_index=%d\n",
> > +           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
> > +
> > +    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
> > +}
> > +
> >   static void destroy_qp_hash_key(gpointer data)
> >   {
> >       g_bytes_unref(data);
> >   }
> > +static void init_ports(RdmaDeviceResources *dev_res)
> > +{
> > +    int i, j;
> > +
> > +    memset(dev_res->ports, 0, sizeof(dev_res->ports));
> > +
> > +    for (i = 0; i < MAX_PORTS; i++) {
> > +        dev_res->ports[i].state = IBV_PORT_DOWN;
> > +        for (j = 0; j < MAX_PORT_GIDS; j++) {
> > +            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
> > +        }
> > +    }
> > +}
> > +
> > +static void fini_ports(RdmaDeviceResources *dev_res,
> > +                       RdmaBackendDev *backend_dev, const char *ifname)
> > +{
> > +    int i;
> > +
> > +    dev_res->ports[0].state = IBV_PORT_DOWN;
> > +    for (i = 0; i < MAX_PORT_GIDS; i++) {
> > +        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
> > +    }
> > +}
> > +
> >   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
> >                    Error **errp)
> >   {
> > @@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
> >                          dev_attr->max_qp_wr, sizeof(void *));
> >       res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
> > +    init_ports(dev_res);
> > +
> >       return 0;
> >   }
> > -void rdma_rm_fini(RdmaDeviceResources *dev_res)
> > +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                  const char *ifname)
> >   {
> > +    fini_ports(dev_res, backend_dev, ifname);
> > +
> >       res_tbl_free(&dev_res->uc_tbl);
> >       res_tbl_free(&dev_res->cqe_ctx_tbl);
> >       res_tbl_free(&dev_res->qp_tbl);
> > diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
> > index b4e04cc7b4..a7169b4e89 100644
> > --- a/hw/rdma/rdma_rm.h
> > +++ b/hw/rdma/rdma_rm.h
> > @@ -22,7 +22,8 @@
> >   int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
> >                    Error **errp);
> > -void rdma_rm_fini(RdmaDeviceResources *dev_res);
> > +void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                  const char *ifname);
> >   int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> >                        uint32_t *pd_handle, uint32_t ctx_handle);
> > @@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
> >                        uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
> >   RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
> >   int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > -                      uint32_t qp_handle, uint32_t attr_mask,
> > +                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
> >                         union ibv_gid *dgid, uint32_t dqpn,
> >                         enum ibv_qp_state qp_state, uint32_t qkey,
> >                         uint32_t rq_psn, uint32_t sq_psn);
> > @@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
> >   void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
> >   void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
> > +int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, union ibv_gid *gid, int gid_idx);
> > +int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
> > +                    const char *ifname, int gid_idx);
> > +int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
> > +                                  RdmaBackendDev *backend_dev, int sgid_idx);
> > +static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
> > +                                             int sgid_idx)
> > +{
> > +    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
> > +}
> > +
> >   #endif
> > diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
> > index 9b399063d3..7b3435f991 100644
> > --- a/hw/rdma/rdma_rm_defs.h
> > +++ b/hw/rdma/rdma_rm_defs.h
> > @@ -19,7 +19,7 @@
> >   #include "rdma_backend_defs.h"
> >   #define MAX_PORTS             1
> > -#define MAX_PORT_GIDS         1
> > +#define MAX_PORT_GIDS         255
> >   #define MAX_GIDS              MAX_PORT_GIDS
> >   #define MAX_PORT_PKEYS        1
> >   #define MAX_PKEYS             MAX_PORT_PKEYS
> > @@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
> >       enum ibv_qp_state qp_state;
> >   } RdmaRmQP;
> > +typedef struct RdmaRmGid {
> > +    union ibv_gid gid;
> > +    int backend_gid_index;
> > +} RdmaRmGid;
> > +
> >   typedef struct RdmaRmPort {
> > -    union ibv_gid gid_tbl[MAX_PORT_GIDS];
> > +    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
> >       enum ibv_port_state state;
> >   } RdmaRmPort;
> > diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
> > index 04c7c2ef5b..989db249ef 100644
> > --- a/hw/rdma/rdma_utils.h
> > +++ b/hw/rdma/rdma_utils.h
> > @@ -20,6 +20,7 @@
> >   #include "qemu/osdep.h"
> >   #include "hw/pci/pci.h"
> >   #include "sysemu/dma.h"
> > +#include "stdio.h"
> >   #define pr_info(fmt, ...) \
> >       fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
> > @@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
> >   #define pr_dbg(fmt, ...) \
> >       fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
> >               __func__, __LINE__, ## __VA_ARGS__)
> > +
> > +#define pr_dbg_buf(title, buf, len) \
> > +{ \
> > +    char *b = g_malloc0(len * 3 + 1); \
> > +    char b1[4]; \
> > +    for (int i = 0; i < len; i++) { \
> > +        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
> > +        strcat(b, b1); \
> > +    } \
> > +    pr_dbg("%s (%d): %s\n", title, len, b); \
> > +    g_free(b); \
> > +}
> > +
> >   #else
> >   #define init_pr_dbg(void)
> >   #define pr_dbg(fmt, ...)
> > +#define pr_dbg_buf(title, buf, len)
> >   #endif
> >   void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
> > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > index 15c3f28b86..b019cb843a 100644
> > --- a/hw/rdma/vmw/pvrdma.h
> > +++ b/hw/rdma/vmw/pvrdma.h
> > @@ -79,8 +79,8 @@ typedef struct PVRDMADev {
> >       int interrupt_mask;
> >       struct ibv_device_attr dev_attr;
> >       uint64_t node_guid;
> > +    char *backend_eth_device_name;
> >       char *backend_device_name;
> > -    uint8_t backend_gid_idx;
> >       uint8_t backend_port_num;
> >       RdmaBackendDev backend_dev;
> >       RdmaDeviceResources rdma_dev_res;
> > diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> > index 57d6f41ae6..a334f6205e 100644
> > --- a/hw/rdma/vmw/pvrdma_cmd.c
> > +++ b/hw/rdma/vmw/pvrdma_cmd.c
> > @@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       rsp->hdr.response = cmd->hdr.response;
> >       rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
> > -    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
> > -                                 cmd->qp_handle, cmd->attr_mask,
> > -                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> > -                                 cmd->attrs.dest_qp_num,
> > -                                 (enum ibv_qp_state)cmd->attrs.qp_state,
> > -                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
> > -                                 cmd->attrs.sq_psn);
> > +    /* No need to verify sgid_index since it is u8 */
> > +
> > +    rsp->hdr.err =
> > +        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
> > +                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
> > +                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
> > +                          cmd->attrs.dest_qp_num,
> > +                          (enum ibv_qp_state)cmd->attrs.qp_state,
> > +                          cmd->attrs.qkey, cmd->attrs.rq_psn,
> > +                          cmd->attrs.sq_psn);
> >       pr_dbg("ret=%d\n", rsp->hdr.err);
> >       return rsp->hdr.err;
> > @@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >                          union pvrdma_cmd_resp *rsp)
> >   {
> >       struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
> > -#ifdef PVRDMA_DEBUG
> > -    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
> > -    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
> > -#endif
> > +    int rc;
> > +    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
> >       pr_dbg("index=%d\n", cmd->index);
> > @@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >       }
> >       pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
> > -           (long long unsigned int)be64_to_cpu(*subnet),
> > -           (long long unsigned int)be64_to_cpu(*if_id));
> > +           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
> > +           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
> > -    /* Driver forces to one port only */
> > -    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
> > -           sizeof(cmd->new_gid));
> > +    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
> > +                         dev->backend_eth_device_name, gid, cmd->index);
> > +    if (rc < 0) {
> > +        return -EINVAL;
> > +    }
> >       /* TODO: Since drivers stores node_guid at load_dsr phase then this
> >        * assignment is not relevant, i need to figure out a way how to
> >        * retrieve MAC of our netdev */
> > -    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
> > -    pr_dbg("dev->node_guid=0x%llx\n",
> > -           (long long unsigned int)be64_to_cpu(dev->node_guid));
> > +    if (!cmd->index) {
> > +        dev->node_guid =
> > +            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
> > +        pr_dbg("dev->node_guid=0x%llx\n",
> > +               (long long unsigned int)be64_to_cpu(dev->node_guid));
> > +    }
> >       return 0;
> >   }
> > @@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >   static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >                           union pvrdma_cmd_resp *rsp)
> >   {
> > +    int rc;
> > +
> >       struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
> >       pr_dbg("index=%d\n", cmd->index);
> > @@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
> >           return -EINVAL;
> >       }
> > -    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
> > -           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
> > +    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
> > +                        dev->backend_eth_device_name, cmd->index);
> > +
> > +    if (rc < 0) {
> > +        rsp->hdr.err = rc;
> > +        goto out;
> > +    }
> >       return 0;
> >   }
> > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > index fc2abd34af..ac8c092db0 100644
> > --- a/hw/rdma/vmw/pvrdma_main.c
> > +++ b/hw/rdma/vmw/pvrdma_main.c
> > @@ -36,9 +36,9 @@
> >   #include "pvrdma_qp_ops.h"
> >   static Property pvrdma_dev_properties[] = {
> > -    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
> > -    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
> > -    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
> > +    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
> > +    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
> > +    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
> >       DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
> >                          MAX_MR_SIZE),
> >       DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
> > @@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
> >       pr_dbg("Initialized\n");
> >   }
> > -static void init_ports(PVRDMADev *dev, Error **errp)
> > -{
> > -    int i;
> > -
> > -    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
> > -
> > -    for (i = 0; i < MAX_PORTS; i++) {
> > -        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
> > -    }
> > -}
> > -
> >   static void uninit_msix(PCIDevice *pdev, int used_vectors)
> >   {
> >       PVRDMADev *dev = PVRDMA_DEV(pdev);
> > @@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
> >       pvrdma_qp_ops_fini();
> > -    rdma_rm_fini(&dev->rdma_dev_res);
> > +    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
> > +                 dev->backend_eth_device_name);
> >       rdma_backend_fini(&dev->backend_dev);
> > @@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
> >                              dev->backend_device_name, dev->backend_port_num,
> > -                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> > -                           errp);
> > +                           &dev->dev_attr, &dev->mad_chr, errp);
> >       if (rc) {
> >           goto out;
> >       }
> > @@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >           goto out;
> >       }
> > -    init_ports(dev, errp);
> > -
> >       rc = pvrdma_qp_ops_init();
> >       if (rc) {
> >           goto out;
> > diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> > index 3388be1926..2130824098 100644
> > --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> > +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> > @@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
> >       RdmaRmQP *qp;
> >       PvrdmaSqWqe *wqe;
> >       PvrdmaRing *ring;
> > +    int sgid_idx;
> > +    union ibv_gid *sgid;
> >       pr_dbg("qp_handle=0x%x\n", qp_handle);
> > @@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
> >           comp_ctx->cqe.qp = qp_handle;
> >           comp_ctx->cqe.opcode = IBV_WC_SEND;
> > +        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
> > +        if (!sgid) {
> > +            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
> > +            return -EIO;
> > +        }
> > +        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
> > +               sgid->global.interface_id);
> > +
> > +        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
> > +                                                 &dev->backend_dev,
> > +                                                 wqe->hdr.wr.ud.av.gid_index);
> > +        if (sgid_idx <= 0) {
> > +            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
> > +                   wqe->hdr.wr.ud.av.gid_index);
> > +            return -EIO;
> > +        }
> > +
> >           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
> >                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
> > +                               sgid_idx, sgid,
> >                                  (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
> >                                  wqe->hdr.wr.ud.remote_qpn,
> >                                  wqe->hdr.wr.ud.remote_qkey, comp_ctx);
> 
> 
> Thanks,
> Marcel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response
  2018-11-25  7:40   ` Marcel Apfelbaum
@ 2018-11-25 11:53     ` Yuval Shaia
  0 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-25 11:53 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

> >       err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
> > -                            dsr_info->rsp);
> > +                                                    dsr_info->rsp);
> >   out:
> >       set_reg_val(dev, PVRDMA_REG_ERR, err);
> >       post_interrupt(dev, INTR_VEC_CMD_RING);
> 
> 
> As I responded in the V4 thread :) one might forget to set hdr.error to 0,
> resulting in errors that are hard to debug.
> 
> Please consider clearing the field on init.

Will fix.

> 
> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Please review v6.

> 
> Thanks,
> Marcel


* Re: [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma
  2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma Yuval Shaia
@ 2018-11-26 10:01   ` Markus Armbruster
  2018-11-26 10:08     ` Yuval Shaia
  2018-11-26 20:43     ` Eric Blake
  0 siblings, 2 replies; 39+ messages in thread
From: Markus Armbruster @ 2018-11-26 10:01 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck

Yuval Shaia <yuval.shaia@oracle.com> writes:

> pvrdma requires that the same GID attached to it will be attached to the
> backend device in the host.
>
> A new QMP messages is defined so pvrdma device can broadcast any change
> made to its GID table. This event is captured by libvirt which in turn
> will update the GID table in the backend device.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> ---
>  MAINTAINERS           |  1 +
>  Makefile.objs         |  1 +
>  qapi/qapi-schema.json |  1 +
>  qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 41 insertions(+)
>  create mode 100644 qapi/rdma.json
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7b68080094..525bcdcf41 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2335,6 +2335,7 @@ F: hw/rdma/*
>  F: hw/rdma/vmw/*
>  F: docs/pvrdma.txt
>  F: contrib/rdmacm-mux/*
> +F: qapi/rdma.json
>  
>  Build and test automation
>  -------------------------
> diff --git a/Makefile.objs b/Makefile.objs
> index 319f14d937..fe3566b797 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -1,5 +1,6 @@
>  QAPI_MODULES = block-core block char common crypto introspect job migration
>  QAPI_MODULES += misc net rocker run-state sockets tpm trace transaction ui
> +QAPI_MODULES += rdma

Please keep the list of QAPI modules sorted, e.g. like this:

   QAPI_MODULES = block-core block char common crypto introspect job
   QAPI_MODULES += migration misc net rdma rocker run-state sockets tpm
   QAPI_MODULES += trace transaction ui

>  
>  #######################################################################
>  # Common libraries for tools and emulators
> diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
> index 65b6dc2f6f..3bbdfcee84 100644
> --- a/qapi/qapi-schema.json
> +++ b/qapi/qapi-schema.json
> @@ -86,6 +86,7 @@
>  { 'include': 'char.json' }
>  { 'include': 'job.json' }
>  { 'include': 'net.json' }
> +{ 'include': 'rdma.json' }
>  { 'include': 'rocker.json' }
>  { 'include': 'tpm.json' }
>  { 'include': 'ui.json' }
> diff --git a/qapi/rdma.json b/qapi/rdma.json
> new file mode 100644
> index 0000000000..804c68ab36
> --- /dev/null
> +++ b/qapi/rdma.json
> @@ -0,0 +1,38 @@
> +# -*- Mode: Python -*-
> +#
> +
> +##
> +# = RDMA device
> +##
> +
> +##
> +# @RDMA_GID_STATUS_CHANGED:
> +#
> +# Emitted when guest driver adds/deletes GID to/from device
> +#
> +# @netdev: RoCE Network Device name - char *
> +#
> +# @gid-status: Add or delete indication - bool
> +#
> +# @subnet-prefix: Subnet Prefix - uint64
> +#
> +# @interface-id : Interface ID - uint64
> +#
> +# Since: 3.2
> +#
> +# Example:
> +#
> +# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
> +#     "event": "RDMA_GID_STATUS_CHANGED",
> +#     "data":
> +#         {"netdev": "bridge0",
> +#         "interface-id": 15880512517475447892,
> +#         "gid-status": true,
> +#         "subnet-prefix": 33022}}
> +#
> +##
> +{ 'event': 'RDMA_GID_STATUS_CHANGED',
> +  'data': { 'netdev'        : 'str',
> +            'gid-status'    : 'bool',
> +            'subnet-prefix' : 'uint64',
> +            'interface-id'  : 'uint64' } }

Preferably with Makefile.objs tidied up:
Acked-by: Markus Armbruster <armbru@redhat.com>
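
A side note for consumers of this event: the two uint64 fields carry the raw
8-byte halves of the GID, so the example values in the schema comment decode
back to a readable GID. A minimal sketch of that decoding (the little-endian
byte order is an assumption inferred from the example event, where
subnet-prefix 33022 corresponds to the fe80::/64 link-local prefix):

```python
import struct

def gid_from_event(subnet_prefix: int, interface_id: int) -> str:
    """Rebuild the textual GID from the two uint64 event fields.

    Assumes the integers hold the raw GID bytes in little-endian
    order, which matches the example event in the schema comment.
    """
    raw = struct.pack('<QQ', subnet_prefix, interface_id)
    return ':'.join(raw[i:i + 2].hex() for i in range(0, 16, 2))

# The subnet-prefix from the example event decodes to the IPv6
# link-local prefix.
print(gid_from_event(33022, 0)[:19])  # fe80:0000:0000:0000
```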


* Re: [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma
  2018-11-26 10:01   ` Markus Armbruster
@ 2018-11-26 10:08     ` Yuval Shaia
  2018-11-26 20:43     ` Eric Blake
  1 sibling, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-26 10:08 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck, yuval.shaia

On Mon, Nov 26, 2018 at 11:01:59AM +0100, Markus Armbruster wrote:
> Yuval Shaia <yuval.shaia@oracle.com> writes:
> 
> > pvrdma requires that the same GID attached to it will be attached to the
> > backend device in the host.
> >
> > A new QMP messages is defined so pvrdma device can broadcast any change
> > made to its GID table. This event is captured by libvirt which in turn
> > will update the GID table in the backend device.
> >
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > ---
> >  MAINTAINERS           |  1 +
> >  Makefile.objs         |  1 +
> >  qapi/qapi-schema.json |  1 +
> >  qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 41 insertions(+)
> >  create mode 100644 qapi/rdma.json
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 7b68080094..525bcdcf41 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -2335,6 +2335,7 @@ F: hw/rdma/*
> >  F: hw/rdma/vmw/*
> >  F: docs/pvrdma.txt
> >  F: contrib/rdmacm-mux/*
> > +F: qapi/rdma.json
> >  
> >  Build and test automation
> >  -------------------------
> > diff --git a/Makefile.objs b/Makefile.objs
> > index 319f14d937..fe3566b797 100644
> > --- a/Makefile.objs
> > +++ b/Makefile.objs
> > @@ -1,5 +1,6 @@
> >  QAPI_MODULES = block-core block char common crypto introspect job migration
> >  QAPI_MODULES += misc net rocker run-state sockets tpm trace transaction ui
> > +QAPI_MODULES += rdma
> 
> Please keep the list of QAPI modules sorted, e.g. like this:
> 
>    QAPI_MODULES = block-core block char common crypto introspect job
>    QAPI_MODULES += migration misc net rdma rocker run-state sockets tpm
>    QAPI_MODULES += trace transaction ui

Sure, will do.

> 
> >  
> >  #######################################################################
> >  # Common libraries for tools and emulators
> > diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
> > index 65b6dc2f6f..3bbdfcee84 100644
> > --- a/qapi/qapi-schema.json
> > +++ b/qapi/qapi-schema.json
> > @@ -86,6 +86,7 @@
> >  { 'include': 'char.json' }
> >  { 'include': 'job.json' }
> >  { 'include': 'net.json' }
> > +{ 'include': 'rdma.json' }
> >  { 'include': 'rocker.json' }
> >  { 'include': 'tpm.json' }
> >  { 'include': 'ui.json' }
> > diff --git a/qapi/rdma.json b/qapi/rdma.json
> > new file mode 100644
> > index 0000000000..804c68ab36
> > --- /dev/null
> > +++ b/qapi/rdma.json
> > @@ -0,0 +1,38 @@
> > +# -*- Mode: Python -*-
> > +#
> > +
> > +##
> > +# = RDMA device
> > +##
> > +
> > +##
> > +# @RDMA_GID_STATUS_CHANGED:
> > +#
> > +# Emitted when guest driver adds/deletes GID to/from device
> > +#
> > +# @netdev: RoCE Network Device name - char *
> > +#
> > +# @gid-status: Add or delete indication - bool
> > +#
> > +# @subnet-prefix: Subnet Prefix - uint64
> > +#
> > +# @interface-id : Interface ID - uint64
> > +#
> > +# Since: 3.2
> > +#
> > +# Example:
> > +#
> > +# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
> > +#     "event": "RDMA_GID_STATUS_CHANGED",
> > +#     "data":
> > +#         {"netdev": "bridge0",
> > +#         "interface-id": 15880512517475447892,
> > +#         "gid-status": true,
> > +#         "subnet-prefix": 33022}}
> > +#
> > +##
> > +{ 'event': 'RDMA_GID_STATUS_CHANGED',
> > +  'data': { 'netdev'        : 'str',
> > +            'gid-status'    : 'bool',
> > +            'subnet-prefix' : 'uint64',
> > +            'interface-id'  : 'uint64' } }
> 
> Preferably with Makefile.objs tidied up:
> Acked-by: Markus Armbruster <armbru@redhat.com>

Thanks


* Re: [Qemu-devel] [PATCH v5 24/24] docs: Update pvrdma device documentation
       [not found]   ` <8b89bfaf-be29-e043-32fa-9615fb4ea0f7@gmail.com>
@ 2018-11-26 10:34     ` Marcel Apfelbaum
  2018-11-26 13:05       ` Yuval Shaia
  0 siblings, 1 reply; 39+ messages in thread
From: Marcel Apfelbaum @ 2018-11-26 10:34 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck

Re-sending the comments; some of the recipients didn't get them.

Thanks,
Marcel

On 11/25/18 9:51 AM, Marcel Apfelbaum wrote:
>
>
> On 11/22/18 2:14 PM, Yuval Shaia wrote:
>> Interface with the device is changed with the addition of support for
>> MAD packets.
>> Adjust documentation accordingly.
>>
>> While there fix a minor mistake which may lead to think that there is a
>> relation between using RXE on host and the compatibility with bare-metal
>> peers.
>>
>> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
>> ---
>>   docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 84 insertions(+), 19 deletions(-)
>>
>> diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
>> index 5599318159..f82b2a69d2 100644
>> --- a/docs/pvrdma.txt
>> +++ b/docs/pvrdma.txt
>> @@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need 
>> for any special guest
>>   modifications.
>>     While it complies with the VMware device, it can also communicate 
>> with bare
>> -metal RDMA-enabled machines and does not require an RDMA HCA in the 
>> host, it
>> -can work with Soft-RoCE (rxe).
>> +metal RDMA-enabled machines as peers.
>> +
>> +It does not require an RDMA HCA in the host, it can work with 
>> Soft-RoCE (rxe).
>>     It does not require the whole guest RAM to be pinned allowing memory
>>   over-commit and, even if not implemented yet, migration support 
>> will be
>> @@ -78,29 +79,93 @@ the required RDMA libraries.
>>     3. Usage
>>   ========
>> +
>> +
>> +3.1 VM Memory settings
>> +======================
>>   Currently the device is working only with memory backed RAM
>>   and it must be mark as "shared":
>>      -m 1G \
>>      -object memory-backend-ram,id=mb1,size=1G,share \
>>      -numa node,memdev=mb1 \
>>   -The pvrdma device is composed of two functions:
>> - - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
>> -   but is required to pass the ibdevice GID using its MAC.
>> -   Examples:
>> -     For an rxe backend using eth0 interface it will use its mac:
>> -       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
>> -     For an SRIOV VF, we take the Ethernet Interface exposed by it:
>> -       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
>> - - Function 1 is the actual device:
>> -       -device 
>> pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
>> -   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
>> - Note: Pay special attention that the GID at backend-gid-idx matches 
>> vmxnet's MAC.
>> - The rules of conversion are part of the RoCE spec, but since manual 
>> conversion
>> - is not required, spotting problems is not hard:
>> -    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
>> -             MAC: 7c:fe:90:cb:74:3a
>> -    Note the difference between the first byte of the MAC and the GID.
>> +
>> +3.2 MAD Multiplexer
>> +===================
>> +MAD Multiplexer is a service that exposes a MAD-like interface for VMs
>> +in order to overcome the limitation that only a single entity can
>> +register with the MAD layer to send and receive RDMA-CM MAD packets.
>> +
>> +To build rdmacm-mux run
>> +# make rdmacm-mux
>> +
>> +The application accepts 3 command line arguments and exposes a UNIX 
>> socket
>> +to pass control and data to it.
>> +-s unix-socket-path   Path to unix socket to listen on
>> +                      (default /var/run/rdmacm-mux)
>> +-d rdma-device-name   Name of RDMA device to register with
>> +                      (default rxe0)
>
> I would not default it to rxe0, but require the user to specify an RDMA
> interface. One might think the multiplexer would select the best available
> device and end up with an rxe instance instead of a bare-metal one...
>
>> +-p rdma-device-port   Port number of RDMA device to register with
>> +                      (default 1)
>> +The final UNIX socket file name is a concatenation of the 3 
>> arguments so
>> +for example for device mlx5_0 on port 2 this 
>> /var/run/rdmacm-mux-mlx5_0-2
>> +will be created.
>> +
>> +Please refer to contrib/rdmacm-mux for more details.
>> +
>> +
>> +3.3 PCI devices settings
>> +========================
>> +A RoCE device exposes two functions - an Ethernet one and an RDMA one.
>> +To support this, the pvrdma device is composed of two PCI functions: an
>> +Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on
>> +PCI function 1. The Ethernet function can be used for other Ethernet
>> +purposes such as IP.
>
> Nice !
>
>> +
>> +
>> +3.4 Device parameters
>> +=====================
>> +- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) 
>> this
>> +  would be the Ethernet device used to create it. For any other 
>> physical
>> +  RoCE device this would be the netdev name of the device.
>
> I don't fully understand the above explanation. Can you elaborate
> or give an example?
>
>> +- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
>> +- mad-chardev: The name of the MAD multiplexer char device.
>> +- ibport: In case of a multi-port device (such as Mellanox's HCA) this
>> +  specifies the port to use. If not set, 1 will be used.
>> +- dev-caps-max-mr-size: The maximum size of MR.
>> +- dev-caps-max-qp: Maximum number of QPs.
>> +- dev-caps-max-sge: Maximum number of SGE elements in WR.
>> +- dev-caps-max-cq: Maximum number of CQs.
>> +- dev-caps-max-mr: Maximum number of MRs.
>> +- dev-caps-max-pd: Maximum number of PDs.
>> +- dev-caps-max-ah: Maximum number of AHs.
>> +
>> +Notes:
>> +- The first 3 parameters are mandatory settings, the rest have their
>> +  defaults.
>> +- The last 8 parameters (the ones prefixed by dev-caps) define the top
>> +  limits, but the final values are adjusted by the backend device
>> +  limitations.
>> +
>> +3.5 Example
>> +===========
>> +Define bridge device with vmxnet3 network backend:
>> +<interface type='bridge'>
>> +  <mac address='56:b4:44:e9:62:dc'/>
>> +  <source bridge='bridge1'/>
>> +  <model type='vmxnet3'/>
>> +  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' 
>> function='0x0' multifunction='on'/>
>> +</interface>
>> +
>> +Define pvrdma device:
>> +<qemu:commandline>
>> +  <qemu:arg value='-object'/>
>> +  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
>> +  <qemu:arg value='-numa'/>
>> +  <qemu:arg value='node,memdev=mb1'/>
>> +  <qemu:arg value='-chardev'/>
>> +  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
>> +  <qemu:arg value='-device'/>
>> +  <qemu:arg 
>> value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
>> +</qemu:commandline>
>
> Please be sure to emphasize that the pvrdma works only
> if the QEMU is operated by libvirt. The same about the multiplexer.
>
> Thanks,
> Marcel
>
>


* Re: [Qemu-devel] [PATCH v5 24/24] docs: Update pvrdma device documentation
  2018-11-26 10:34     ` Marcel Apfelbaum
@ 2018-11-26 13:05       ` Yuval Shaia
  0 siblings, 0 replies; 39+ messages in thread
From: Yuval Shaia @ 2018-11-26 13:05 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, cohuck, yuval.shaia

On Mon, Nov 26, 2018 at 12:34:41PM +0200, Marcel Apfelbaum wrote:
> Re-sending the comments, some of the recipients didn't get it,
> 
> Thanks,
> Marcel
> 
> On 11/25/18 9:51 AM, Marcel Apfelbaum wrote:
> > 
> > 
> > On 11/22/18 2:14 PM, Yuval Shaia wrote:
> > > Interface with the device is changed with the addition of support for
> > > MAD packets.
> > > Adjust documentation accordingly.
> > > 
> > > While there fix a minor mistake which may lead to think that there is a
> > > relation between using RXE on host and the compatibility with bare-metal
> > > peers.
> > > 
> > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > > ---
> > >   docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
> > >   1 file changed, 84 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
> > > index 5599318159..f82b2a69d2 100644
> > > --- a/docs/pvrdma.txt
> > > +++ b/docs/pvrdma.txt
> > > @@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need
> > > for any special guest
> > >   modifications.
> > >     While it complies with the VMware device, it can also
> > > communicate with bare
> > > -metal RDMA-enabled machines and does not require an RDMA HCA in the
> > > host, it
> > > -can work with Soft-RoCE (rxe).
> > > +metal RDMA-enabled machines as peers.
> > > +
> > > +It does not require an RDMA HCA in the host, it can work with
> > > Soft-RoCE (rxe).
> > >     It does not require the whole guest RAM to be pinned allowing memory
> > >   over-commit and, even if not implemented yet, migration support
> > > will be
> > > @@ -78,29 +79,93 @@ the required RDMA libraries.
> > >     3. Usage
> > >   ========
> > > +
> > > +
> > > +3.1 VM Memory settings
> > > +======================
> > >   Currently the device is working only with memory backed RAM
> > >   and it must be mark as "shared":
> > >      -m 1G \
> > >      -object memory-backend-ram,id=mb1,size=1G,share \
> > >      -numa node,memdev=mb1 \
> > >   -The pvrdma device is composed of two functions:
> > > - - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
> > > -   but is required to pass the ibdevice GID using its MAC.
> > > -   Examples:
> > > -     For an rxe backend using eth0 interface it will use its mac:
> > > -       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
> > > -     For an SRIOV VF, we take the Ethernet Interface exposed by it:
> > > -       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
> > > - - Function 1 is the actual device:
> > > -       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
> > > -   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
> > > - Note: Pay special attention that the GID at backend-gid-idx
> > > matches vmxnet's MAC.
> > > - The rules of conversion are part of the RoCE spec, but since
> > > manual conversion
> > > - is not required, spotting problems is not hard:
> > > -    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
> > > -             MAC: 7c:fe:90:cb:74:3a
> > > -    Note the difference between the first byte of the MAC and the GID.
> > > +
> > > +3.2 MAD Multiplexer
> > > +===================
> > > +MAD Multiplexer is a service that exposes MAD-like interface for VMs in
> > > +order to overcome the limitation where only single entity can
> > > register with
> > > +MAD layer to send and receive RDMA-CM MAD packets.
> > > +
> > > +To build rdmacm-mux run
> > > +# make rdmacm-mux
> > > +
> > > +The application accepts 3 command line arguments and exposes a UNIX
> > > socket
> > > +to pass control and data to it.
> > > +-s unix-socket-path   Path to unix socket to listen on
> > > +                      (default /var/run/rdmacm-mux)
> > > +-d rdma-device-name   Name of RDMA device to register with
> > > +                      (default rxe0)
> > 
> > I would not default it to rxe0, but request to specify a RDMA interface.
> > One can think the multiplexer may select the best available device
> > and finish with an rxe instance instead of a bare-metal one...

Done.

> > 
> > > +-p rdma-device-port   Port number of RDMA device to register with
> > > +                      (default 1)
> > > +The final UNIX socket file name is a concatenation of the 3
> > > arguments so
> > > +for example for device mlx5_0 on port 2 this
> > > /var/run/rdmacm-mux-mlx5_0-2
> > > +will be created.
> > > +
> > > +Please refer to contrib/rdmacm-mux for more details.
> > > +
> > > +
> > > +3.3 PCI devices settings
> > > +========================
> > > +RoCE device exposes two functions - an Ethernet and RDMA.
> > > +To support it, pvrdma device is composed of two PCI functions, an
> > > Ethernet
> > > +device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI
> > > slot 1. The
> > > +Ethernet function can be used for other Ethernet purposes such as IP.
> > 
> > Nice !
> > 
> > > +
> > > +
> > > +3.4 Device parameters
> > > +=====================
> > > +- netdev: Specifies the Ethernet device on host. For Soft-RoCE
> > > (rxe) this
> > > +  would be the Ethernet device used to create it. For any other
> > > physical
> > > +  RoCE device this would be the netdev name of the device.
> > 
> > I don't fully understand the above explanation. Can you elaborate
> > or give an example?

How about this:
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
  Ethernet device used to create it.

> > 
> > > +- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
> > > +- mad-chardev: The name of the MAD multiplexer char device.
> > > +- ibport: In case of multi-port device (such as Mellanox's HCA) this
> > > +  specify the port to use. If not set 1 will be used.
> > > +- dev-caps-max-mr-size: The maximum size of MR.
> > > +- dev-caps-max-qp: Maximum number of QPs.
> > > +- dev-caps-max-sge: Maximum number of SGE elements in WR.
> > > +- dev-caps-max-cq: Maximum number of CQs.
> > > +- dev-caps-max-mr: Maximum number of MRs.
> > > +- dev-caps-max-pd: Maximum number of PDs.
> > > +- dev-caps-max-ah: Maximum number of AHs.
> > > +
> > > +Notes:
> > > +- The first 3 parameters are mandatory settings, the rest have their
> > > +  defaults.
> > > +- The last 8 parameters (the ones that prefixed by dev-caps)
> > > defines the top
> > > +  limits but the final values is adjusted by the backend device
> > > limitations.
> > > +
> > > +3.5 Example
> > > +===========
> > > +Define bridge device with vmxnet3 network backend:
> > > +<interface type='bridge'>
> > > +  <mac address='56:b4:44:e9:62:dc'/>
> > > +  <source bridge='bridge1'/>
> > > +  <model type='vmxnet3'/>
> > > +  <address type='pci' domain='0x0000' bus='0x00' slot='0x10'
> > > function='0x0' multifunction='on'/>
> > > +</interface>
> > > +
> > > +Define pvrdma device:
> > > +<qemu:commandline>
> > > +  <qemu:arg value='-object'/>
> > > +  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
> > > +  <qemu:arg value='-numa'/>
> > > +  <qemu:arg value='node,memdev=mb1'/>
> > > +  <qemu:arg value='-chardev'/>
> > > +  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
> > > +  <qemu:arg value='-device'/>
> > > +  <qemu:arg
> > > value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
> > > +</qemu:commandline>
> > 
> > Please be sure to emphasize that the pvrdma works only
> > if the QEMU is operated by libvirt. The same about the multiplexer.

Added this:

3.3 Service exposed by libvirt daemon
=====================================
The RDMA device's GID table is controlled by updating the device's
Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also applies: whenever an
address is removed, the corresponding GID entry is removed.
The process is handled by the network and RDMA stacks. Whenever an address
is added, the ib_core driver is notified and calls the device driver's
add_gid function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of
the backend Ethernet device.

pvrdma requires the libvirt daemon to be running.
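
The MAC-based first GID entry mentioned above follows the usual EUI-64
construction (flip the universal/local bit of the first MAC byte and insert
ff:fe between the two MAC halves), which is exactly what the GID/MAC example
pair in the old documentation text shows. A sketch of the rule, not code
from the patch:

```python
def mac_to_link_local_gid(mac: str) -> str:
    """Derive the link-local GID a RoCE device builds from its MAC.

    EUI-64 rule: flip bit 1 (universal/local) of the first byte and
    insert ff:fe between the two MAC halves, under prefix fe80::/64.
    """
    b = bytes(int(x, 16) for x in mac.split(':'))
    eui = bytes([b[0] ^ 0x02]) + b[1:3] + b'\xff\xfe' + b[3:6]
    raw = b'\xfe\x80' + b'\x00' * 6 + eui
    return ':'.join(raw[i:i + 2].hex() for i in range(0, 16, 2))

# Reproduces the example pair from the old documentation text.
print(mac_to_link_local_gid('7c:fe:90:cb:74:3a'))
# fe80:0000:0000:0000:7efe:90ff:fecb:743a
```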

> > 
> > Thanks,
> > Marcel
> > 
> > 
> 


* Re: [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma
  2018-11-26 10:01   ` Markus Armbruster
  2018-11-26 10:08     ` Yuval Shaia
@ 2018-11-26 20:43     ` Eric Blake
  1 sibling, 0 replies; 39+ messages in thread
From: Eric Blake @ 2018-11-26 20:43 UTC (permalink / raw)
  To: Markus Armbruster, Yuval Shaia
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, pbonzini,
	qemu-devel, shamir.rabinovitch, cohuck

On 11/26/18 4:01 AM, Markus Armbruster wrote:
> Yuval Shaia <yuval.shaia@oracle.com> writes:
> 
>> pvrdma requires that the same GID attached to it will be attached to the
>> backend device in the host.
>>
>> A new QMP messages is defined so pvrdma device can broadcast any change
>> made to its GID table. This event is captured by libvirt which in turn
>> will update the GID table in the backend device.
>>
>> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
>> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>> ---

>> +++ b/qapi/rdma.json
>> @@ -0,0 +1,38 @@
>> +# -*- Mode: Python -*-
>> +#
>> +
>> +##
>> +# = RDMA device
>> +##
>> +
>> +##
>> +# @RDMA_GID_STATUS_CHANGED:
>> +#
>> +# Emitted when guest driver adds/deletes GID to/from device
>> +#
>> +# @netdev: RoCE Network Device name - char *
>> +#
>> +# @gid-status: Add or delete indication - bool
>> +#
>> +# @subnet-prefix: Subnet Prefix - uint64
>> +#
>> +# @interface-id : Interface ID - uint64
>> +#
>> +# Since: 3.2

The next release will be 4.0, not 3.2 (we'll probably have to do a 
global search-and-replace in January to catch things that have slipped 
in, as your patch is not alone in that).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


end of thread, other threads:[~2018-11-26 20:43 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-22 12:13 [Qemu-devel] [PATCH v5 00/24] Add support for RDMA MAD Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 01/24] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 02/24] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 03/24] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 04/24] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 05/24] hw/rdma: Add support for MAD packets Yuval Shaia
2018-11-25  7:05   ` Marcel Apfelbaum
2018-11-25  7:27     ` Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 06/24] hw/pvrdma: Make function reset_device return void Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 07/24] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 08/24] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 09/24] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 10/24] qapi: Define new QMP message for pvrdma Yuval Shaia
2018-11-26 10:01   ` Markus Armbruster
2018-11-26 10:08     ` Yuval Shaia
2018-11-26 20:43     ` Eric Blake
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 11/24] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
2018-11-25  7:29   ` Marcel Apfelbaum
2018-11-25  9:10     ` Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 12/24] vmxnet3: Move some definitions to header file Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 13/24] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
2018-11-25  7:31   ` Marcel Apfelbaum
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 14/24] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 15/24] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 16/24] hw/pvrdma: Fill all CQE fields Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 17/24] hw/pvrdma: Fill error code in command's response Yuval Shaia
2018-11-25  7:40   ` Marcel Apfelbaum
2018-11-25 11:53     ` Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 18/24] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 19/24] vl: Introduce shutdown_notifiers Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 20/24] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
2018-11-22 12:13 ` [Qemu-devel] [PATCH v5 21/24] hw/rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 22/24] hw/rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 23/24] hw/pvrdma: Do not clean resources on shutdown Yuval Shaia
2018-11-25  7:30   ` Yuval Shaia
2018-11-25  7:41   ` Marcel Apfelbaum
2018-11-22 12:14 ` [Qemu-devel] [PATCH v5 24/24] docs: Update pvrdma device documentation Yuval Shaia
     [not found]   ` <8b89bfaf-be29-e043-32fa-9615fb4ea0f7@gmail.com>
2018-11-26 10:34     ` Marcel Apfelbaum
2018-11-26 13:05       ` Yuval Shaia
