* [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD
@ 2018-11-08 16:07 Yuval Shaia
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
                   ` (21 more replies)
  0 siblings, 22 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:07 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Hi all.

This is a major enhancement to the pvrdma device, allowing it to work with
state-of-the-art applications such as MPI.

As described in patch #5, MAD packets are management packets that are used
for many purposes, including but not limited to the communication layer
above the IB verbs API.

Patch 1 adds a new external executable (under contrib) that aims to
address a specific limitation in the RDMA userspace MAD stack.

This patch-set mainly presents the MAD enhancement, but during the work on
it I came across some bugs and enhancements that needed to be implemented
before doing any MAD coding. This is the role of patches 2 to 4, 7 to 9 and
15 to 17.

Patches 6 and 18 are cosmetic changes that, while not strictly relevant to
this patch-set, are still introduced with it since (at least for patch 6)
they are hard to decouple.

Patches 12 to 15 couple the pvrdma device with the vmxnet3 device, as this
is the configuration enforced by the pvrdma driver in the guest - a vmxnet3
device in function 0 and a pvrdma device in function 1 of the same PCI
slot. Patch 12 moves the needed code from the vmxnet3 device to a new
header file that can be used by the pvrdma code, while patches 13 to 15
make use of it.
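
With this pairing the guest-visible configuration is expected to look
roughly like the following fragment (the slot number, netdev id and the
remaining options are illustrative only):

    -device vmxnet3,netdev=net0,addr=05.0,multifunction=on \
    -device pvrdma,addr=05.1,...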

Along with this patch-set there is a parallel patch posted to libvirt to
apply the change needed there as part of the process implemented in patches
10 and 11. This change is needed so that the guest is able to configure any
IP address on the Ethernet function of the pvrdma device.
https://www.redhat.com/archives/libvir-list/2018-November/msg00135.html

Since we maintain external resources such as GIDs in the host GID table,
we need to do some cleanup before going down. This is the job of patches 19
and 20.
Patches 21 and 22 contain fixes for bugs detected during the work on the
shutdown cleanup code.

v1 -> v2:
    * Fix compilation issue detected when compiling for mingw
    * Address comment from Eric Blake re version of QEMU in json
      message
    * Fix example from QMP message in json file
    * Fix case where a VM tries to remove an invalid GID from GID table
    * rdmacm-mux: Cleanup entries in socket-gids table when socket is
      closed
    * Cleanup resources (GIDs, QPs etc) when VM goes down

Thanks,
Yuval

Yuval Shaia (22):
  contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  hw/rdma: Add ability to force notification without re-arm
  hw/rdma: Return qpn 1 if ibqp is NULL
  hw/rdma: Abort send-op if fail to create addr handler
  hw/rdma: Add support for MAD packets
  hw/pvrdma: Make function reset_device return void
  hw/pvrdma: Make default pkey 0xFFFF
  hw/pvrdma: Set the correct opcode for recv completion
  hw/pvrdma: Set the correct opcode for send completion
  json: Define new QMP message for pvrdma
  hw/pvrdma: Add support to allow guest to configure GID table
  vmxnet3: Move some definitions to header file
  hw/pvrdma: Make sure PCI function 0 is vmxnet3
  hw/rdma: Initialize node_guid from vmxnet3 mac address
  hw/pvrdma: Make device state depend on Ethernet function state
  hw/pvrdma: Fill all CQE fields
  hw/pvrdma: Fill error code in command's response
  hw/rdma: Remove unneeded code that handles more that one port
  vl: Introduce shutdown_notifiers
  hw/pvrdma: Clean device's resource when system is shutdown
  rdma: Do not use bitmap_zero_extend to fee bitmap
  rdma: Do not call rdma_backend_del_gid on an empty gid

 MAINTAINERS                      |   2 +
 Makefile                         |   6 +-
 Makefile.objs                    |   5 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 770 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 hw/net/vmxnet3.c                 | 116 +----
 hw/net/vmxnet3_defs.h            | 133 ++++++
 hw/rdma/rdma_backend.c           | 461 +++++++++++++++---
 hw/rdma/rdma_backend.h           |  28 +-
 hw/rdma/rdma_backend_defs.h      |  13 +-
 hw/rdma/rdma_rm.c                | 120 ++++-
 hw/rdma/rdma_rm.h                |  17 +-
 hw/rdma/rdma_rm_defs.h           |  21 +-
 hw/rdma/rdma_utils.h             |  24 +
 hw/rdma/vmw/pvrdma.h             |  10 +-
 hw/rdma/vmw/pvrdma_cmd.c         | 119 +++--
 hw/rdma/vmw/pvrdma_main.c        |  49 +-
 hw/rdma/vmw/pvrdma_qp_ops.c      |  62 ++-
 include/sysemu/sysemu.h          |   1 +
 qapi/qapi-schema.json            |   1 +
 qapi/rdma.json                   |  38 ++
 vl.c                             |  15 +-
 23 files changed, 1783 insertions(+), 288 deletions(-)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
 create mode 100644 hw/net/vmxnet3_defs.h
 create mode 100644 qapi/rdma.json

-- 
2.17.2


* [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
@ 2018-11-08 16:07 ` Yuval Shaia
  2018-11-10 20:10   ` Shamir Rabinovitch
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 02/22] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:07 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for a
given MAD class.
This does not fit the qemu pvrdma device's requirements, where each VM is
a MAD agent.
Fix it by adding an RDMA MAD multiplexer service which, on one hand,
registers as the sole MAD agent with the kernel module and, on the other
hand, serves more than one VM.

Design Overview:
----------------
A server process registers with the UMAD framework (for this to work the
rdma_cm kernel module needs to be unloaded) and creates a unix socket on
which it listens for incoming requests from clients.
A client process (such as QEMU) connects to this unix socket and
registers with its own GID.

TX:
---
When a client needs to send an rdma_cm MAD message it constructs it the
same way as without this multiplexer, i.e. it creates a umad packet, but
this time it writes its content to the socket instead of calling
umad_send().
The server, upon receiving such a message, fetches the local_comm_id from
it so a context for this session can be maintained, and relays the message
to the UMAD layer by calling umad_send().

RX:
---
The server creates a worker thread to process incoming rdma_cm MAD
messages. When an incoming message arrives (umad_recv()), the server,
depending on the message type (attr_id), looks for the target client by
searching either the gid->fd table or the local_comm_id->fd table. With
the extracted fd the server relays the incoming message to the client.
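
From the client's point of view the protocol is a plain exchange of
RdmaCmMuxMsg structures (declared in rdmacm-mux.h below) over the unix
socket. The following is a minimal, untested client-side sketch of the
registration step only - it is not part of the patch, and the socket path
is the default one built from the device name (rxe0) and port (1):

    /* Hypothetical client: register a GID with rdmacm-mux */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include "rdmacm-mux.h"

    static int mux_register_gid(uint64_t gid_ifid)
    {
        struct sockaddr_un sun = { .sun_family = AF_UNIX };
        RdmaCmMuxMsg msg = {0};
        int fd;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
            return -1;
        }

        snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
                 "/var/run/rdmacm-mux-rxe0-1");
        if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
            close(fd);
            return -1;
        }

        /* REG message: only msg_type and the source GID are meaningful */
        msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
        msg.hdr.sgid.global.interface_id = gid_ifid;
        if (send(fd, &msg, sizeof(msg), 0) != sizeof(msg)) {
            close(fd);
            return -1;
        }

        /* The server always answers with a full message of type RESP */
        if (recv(fd, &msg, sizeof(msg), 0) != sizeof(msg) ||
            msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
            close(fd);
            return -1;
        }

        /* Keep the socket open; MAD messages travel on the same fd */
        return fd;
    }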

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 MAINTAINERS                      |   1 +
 Makefile                         |   3 +
 Makefile.objs                    |   1 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c        | 770 +++++++++++++++++++++++++++++++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 6 files changed, 835 insertions(+)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 98a1856afc..e087d58ac6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2231,6 +2231,7 @@ S: Maintained
 F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
+F: contrib/rdmacm-mux/*
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index f2947186a4..94072776ff 100644
--- a/Makefile
+++ b/Makefile
@@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
                 elf2dmp-obj-y \
                 ivshmem-client-obj-y \
                 ivshmem-server-obj-y \
+                rdmacm-mux-obj-y \
                 libvhost-user-obj-y \
                 vhost-user-scsi-obj-y \
                 vhost-user-blk-obj-y \
@@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
 	$(call LINK, $^)
 vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
 	$(call LINK, $^)
+rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
+	$(call LINK, $^)
 
 module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
 	$(call quiet-command,$(PYTHON) $< $@ \
diff --git a/Makefile.objs b/Makefile.objs
index 1e1ff387d7..cc7df3ad80 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
 vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
 vhost-user-blk-obj-y = contrib/vhost-user-blk/
+rdmacm-mux-obj-y = contrib/rdmacm-mux/
 
 ######################################################################
 trace-events-subdirs =
diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
new file mode 100644
index 0000000000..be3eacb6f7
--- /dev/null
+++ b/contrib/rdmacm-mux/Makefile.objs
@@ -0,0 +1,4 @@
+ifdef CONFIG_PVRDMA
+CFLAGS += -libumad -Wno-format-truncation
+rdmacm-mux-obj-y = main.o
+endif
diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
new file mode 100644
index 0000000000..0308074b15
--- /dev/null
+++ b/contrib/rdmacm-mux/main.c
@@ -0,0 +1,770 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "sys/poll.h"
+#include "sys/ioctl.h"
+#include "pthread.h"
+#include "syslog.h"
+
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "infiniband/umad_types.h"
+#include "infiniband/umad_sa.h"
+#include "infiniband/umad_cm.h"
+
+#include "rdmacm-mux.h"
+
+#define SCALE_US 1000
+#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
+#define SLEEP_SECS 5 /* This is used both in poll() and thread */
+#define SERVER_LISTEN_BACKLOG 10
+#define MAX_CLIENTS 4096
+#define MAD_RMPP_VERSION 0
+#define MAD_METHOD_MASK0 0x8
+
+#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
+
+#define CM_REQ_DGID_POS      80
+#define CM_SIDR_REQ_DGID_POS 44
+
+/* The below can be override by command line parameter */
+#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
+#define RDMA_DEVICE "rxe0"
+#define RDMA_PORT_NUM 1
+
+typedef struct RdmaCmServerArgs {
+    char unix_socket_path[PATH_MAX];
+    char rdma_dev_name[NAME_MAX];
+    int rdma_port_num;
+} RdmaCMServerArgs;
+
+typedef struct CommId2FdEntry {
+    int fd;
+    int ttl; /* Initialized to 2, decrement each timeout, entry delete when 0 */
+    __be64 gid_ifid;
+} CommId2FdEntry;
+
+typedef struct RdmaCmUMadAgent {
+    int port_id;
+    int agent_id;
+    GHashTable *gid2fd; /* Used to find fd of a given gid */
+    GHashTable *commid2fd; /* Used to find fd on of a given comm_id */
+} RdmaCmUMadAgent;
+
+typedef struct RdmaCmServer {
+    bool run;
+    RdmaCMServerArgs args;
+    struct pollfd fds[MAX_CLIENTS];
+    int nfds;
+    RdmaCmUMadAgent umad_agent;
+    pthread_t umad_recv_thread;
+    pthread_rwlock_t lock;
+} RdmaCMServer;
+
+RdmaCMServer server = {0};
+
+static void usage(const char *progname)
+{
+    printf("Usage: %s [OPTION]...\n"
+           "Start a RDMA-CM multiplexer\n"
+           "\n"
+           "\t-h                    Show this help\n"
+           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
+           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
+           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
+           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
+}
+
+static void help(const char *progname)
+{
+    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    int c;
+    char unix_socket_path[PATH_MAX];
+
+    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
+    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
+    server.args.rdma_port_num = RDMA_PORT_NUM;
+
+    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
+        switch (c) {
+        case 'h':
+            usage(argv[0]);
+            exit(0);
+
+        case 's':
+            /* This is temporary, final name will build below */
+            strncpy(unix_socket_path, optarg, PATH_MAX);
+            break;
+
+        case 'd':
+            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
+            break;
+
+        case 'p':
+            server.args.rdma_port_num = atoi(optarg);
+            break;
+
+        default:
+            help(argv[0]);
+            exit(1);
+        }
+    }
+
+    /* Build unique unix-socket file name */
+    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
+             unix_socket_path, server.args.rdma_dev_name,
+             server.args.rdma_port_num);
+
+    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
+    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
+    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
+}
+
+static void hash_tbl_alloc(void)
+{
+
+    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
+                                                     g_int64_equal,
+                                                     g_free, g_free);
+    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
+                                                        g_int_equal,
+                                                        g_free, g_free);
+}
+
+static void hash_tbl_free(void)
+{
+    if (server.umad_agent.commid2fd) {
+        g_hash_table_destroy(server.umad_agent.commid2fd);
+    }
+    if (server.umad_agent.gid2fd) {
+        g_hash_table_destroy(server.umad_agent.gid2fd);
+    }
+}
+
+
+static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
+{
+    int *fd;
+
+    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    if (!fd) {
+        /* Let's try IPv4 */
+        *gid_ifid |= 0x00000000ffff0000;
+        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
+    }
+
+    return fd ? *fd : 0;
+}
+
+static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
+{
+    pthread_rwlock_rdlock(&server.lock);
+    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!fd) {
+        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
+        return -ENOENT;
+    }
+
+    return 0;
+}
+
+static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
+                                         __be64 *gid_idid)
+{
+    CommId2FdEntry *fde;
+
+    pthread_rwlock_rdlock(&server.lock);
+    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
+    pthread_rwlock_unlock(&server.lock);
+
+    if (!fde) {
+        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
+        return -ENOENT;
+    }
+
+    *fd = fde->fd;
+    *gid_idid = fde->gid_ifid;
+
+    return 0;
+}
+
+static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (fd1) { /* record already exist - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
+                           RDMACM_MUX_ERR_CODE_EACCES;
+    }
+
+    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
+
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx registered on socket %d", (uint64_t)gid_ifid, fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
+{
+    int fd1;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
+    if (!fd1) { /* record not exist - an error */
+        pthread_rwlock_unlock(&server.lock);
+        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
+    }
+
+    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
+                        sizeof(gid_ifid)));
+    pthread_rwlock_unlock(&server.lock);
+
+    syslog(LOG_INFO, "0x%lx unregistered on socket %d", (uint64_t)gid_ifid, fd);
+
+    return RDMACM_MUX_ERR_CODE_OK;
+}
+
+static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
+                                          uint64_t gid_ifid)
+{
+    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
+
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_insert(server.umad_agent.commid2fd,
+                        g_memdup(&comm_id, sizeof(comm_id)),
+                        g_memdup(&fde, sizeof(fde)));
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static gboolean remove_old_comm_ids(gpointer key, gpointer value,
+                                    gpointer user_data)
+{
+    CommId2FdEntry *fde = (CommId2FdEntry *)value;
+
+    return !fde->ttl--;
+}
+
+static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    if (*(int *)value == *(int *)user_data) {
+        syslog(LOG_INFO, "0x%lx unregistered on socket %d", *(uint64_t *)key,
+               *(int *)value);
+        return true;
+    }
+
+    return false;
+}
+
+static void hash_tbl_remove_fd_ifid_pair(int fd)
+{
+    pthread_rwlock_wrlock(&server.lock);
+    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
+                                remove_entry_from_gid2fd, (gpointer)&fd);
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
+{
+    struct umad_hdr *hdr = (struct umad_hdr *)mad;
+    char *data = (char *)hdr + sizeof(*hdr);
+    int32_t comm_id;
+    uint16_t attr_id = be16toh(hdr->attr_id);
+    int rc = 0;
+
+    switch (attr_id) {
+    case UMAD_CM_ATTR_REQ:
+        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_SIDR_REQ:
+        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
+        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
+        break;
+
+    case UMAD_CM_ATTR_REP:
+        /* Fall through */
+    case UMAD_CM_ATTR_REJ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREQ:
+        /* Fall through */
+    case UMAD_CM_ATTR_DREP:
+        /* Fall through */
+    case UMAD_CM_ATTR_RTU:
+        data += sizeof(comm_id);
+        /* Fall through */
+    case UMAD_CM_ATTR_SIDR_REP:
+        memcpy(&comm_id, data, sizeof(comm_id));
+        if (comm_id) {
+            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
+        }
+        break;
+
+    default:
+        rc = -EINVAL;
+        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
+    }
+
+    return rc;
+}
+
+static void *umad_recv_thread_func(void *args)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    int fd = -2;
+
+    while (server.run) {
+        do {
+            msg.umad_len = sizeof(msg.umad.mad);
+            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
+                           SLEEP_SECS * SCALE_US);
+            if ((rc == -EIO) || (rc == -EINVAL)) {
+                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
+            }
+
+            if (rc == -ETIMEDOUT) {
+                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
+                                            remove_old_comm_ids, NULL);
+            }
+        } while (rc && server.run);
+
+        if (server.run) {
+            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
+            if (rc) {
+                continue;
+            }
+
+            send(fd, &msg, sizeof(msg), 0);
+        }
+    }
+
+    return NULL;
+}
+
+static int read_and_process(int fd)
+{
+    int rc;
+    RdmaCmMuxMsg msg = {0};
+    struct umad_hdr *hdr;
+    uint32_t *comm_id;
+    uint16_t attr_id;
+
+    rc = recv(fd, &msg, sizeof(msg), 0);
+
+    if (rc < 0 && errno != EWOULDBLOCK) {
+        return -EIO;
+    }
+
+    if (!rc) {
+        return -EPIPE;
+    }
+
+    switch (msg.hdr.msg_type) {
+    case RDMACM_MUX_MSG_TYPE_REG:
+        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_MSG_TYPE_UNREG:
+        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
+        break;
+
+    case RDMACM_MUX_MSG_TYPE_MAD:
+        /* If this is REQ or REP then store the pair comm_id,fd to be later
+         * used for other messages where gid is unknown */
+        hdr = (struct umad_hdr *)msg.umad.mad;
+        attr_id = be16toh(hdr->attr_id);
+        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
+            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
+            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
+            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
+            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
+                                          msg.hdr.sgid.global.interface_id);
+        }
+
+        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
+                       &msg.umad, msg.umad_len, 1, 0);
+        if (rc) {
+            syslog(LOG_WARNING, "Fail to send MAD message, err=%d", rc);
+        }
+        break;
+
+    default:
+        syslog(LOG_WARNING, "Got invalid message (%d) from %d",
+               msg.hdr.msg_type, fd);
+        rc = RDMACM_MUX_ERR_CODE_EINVAL;
+    }
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
+    msg.hdr.err_code = rc;
+    rc = send(fd, &msg, sizeof(msg), 0);
+
+    return rc == sizeof(msg) ? 0 : -EPIPE;
+}
+
+static int accept_all(void)
+{
+    int fd, rc = 0;;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    do {
+        if ((server.nfds + 1) > MAX_CLIENTS) {
+            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
+            rc = -EIO;
+            goto out;
+        }
+
+        fd = accept(server.fds[0].fd, NULL, NULL);
+        if (fd < 0) {
+            if (errno != EWOULDBLOCK) {
+                syslog(LOG_WARNING, "accept() failed");
+                rc = -EIO;
+                goto out;
+            }
+            break;
+        }
+
+        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
+        server.fds[server.nfds].fd = fd;
+        server.fds[server.nfds].events = POLLIN;
+        server.nfds++;
+    } while (fd != -1);
+
+out:
+    pthread_rwlock_unlock(&server.lock);
+    return rc;
+}
+
+static void compress_fds(void)
+{
+    int i, j;
+    int closed = 0;
+
+    pthread_rwlock_wrlock(&server.lock);
+
+    for (i = 1; i < server.nfds; i++) {
+        if (!server.fds[i].fd) {
+            closed++;
+            for (j = i; j < server.nfds; j++) {
+                server.fds[j].fd = server.fds[j + 1].fd;
+            }
+        }
+    }
+
+    server.nfds -= closed;
+
+    pthread_rwlock_unlock(&server.lock);
+}
+
+static void close_fd(int idx)
+{
+    close(server.fds[idx].fd);
+    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
+    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
+    server.fds[idx].fd = 0;
+}
+
+static void run(void)
+{
+    int rc, nfds, i;
+    bool compress = false;
+
+    syslog(LOG_INFO, "Service started");
+
+    while (server.run) {
+        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
+        if (rc < 0) {
+            if (errno != EINTR) {
+                syslog(LOG_WARNING, "poll() failed");
+            }
+            continue;
+        }
+
+        if (rc == 0) {
+            continue;
+        }
+
+        nfds = server.nfds;
+        for (i = 0; i < nfds; i++) {
+            if (server.fds[i].revents == 0) {
+                continue;
+            }
+
+            if (server.fds[i].revents != POLLIN) {
+                if (i == 0) {
+                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
+                           server.fds[i].revents);
+                } else {
+                    close_fd(i);
+                    compress = true;
+                }
+                continue;
+            }
+
+            if (i == 0) {
+                rc = accept_all();
+                if (rc) {
+                    continue;
+                }
+            } else {
+                rc = read_and_process(server.fds[i].fd);
+                if (rc) {
+                    close_fd(i);
+                    compress = true;
+                }
+            }
+        }
+
+        if (compress) {
+            compress = false;
+            compress_fds();
+        }
+    }
+}
+
+static void fini_listener(void)
+{
+    int i;
+
+    if (server.fds[0].fd <= 0) {
+        return;
+    }
+
+    for (i = server.nfds - 1; i >= 0; i--) {
+        if (server.fds[i].fd) {
+            close(server.fds[i].fd);
+        }
+    }
+
+    unlink(server.args.unix_socket_path);
+}
+
+static void fini_umad(void)
+{
+    if (server.umad_agent.agent_id) {
+        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
+    }
+
+    if (server.umad_agent.port_id) {
+        umad_close_port(server.umad_agent.port_id);
+    }
+
+    hash_tbl_free();
+}
+
+static void fini(void)
+{
+    if (server.umad_recv_thread) {
+        pthread_join(server.umad_recv_thread, NULL);
+        server.umad_recv_thread = 0;
+    }
+    fini_umad();
+    fini_listener();
+    pthread_rwlock_destroy(&server.lock);
+
+    syslog(LOG_INFO, "Service going down");
+}
+
+static int init_listener(void)
+{
+    struct sockaddr_un sun;
+    int rc, on = 1;
+
+    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (server.fds[0].fd < 0) {
+        syslog(LOG_ALERT, "socket() failed");
+        return -EIO;
+    }
+
+    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
+                    sizeof(on));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "setsockopt() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "ioctl() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT,
+               "Invalid unix_socket_path, size must be less than %ld\n",
+               sizeof(sun.sun_path));
+        rc = -EINVAL;
+        goto err;
+    }
+
+    sun.sun_family = AF_UNIX;
+    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
+                  server.args.unix_socket_path);
+    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
+        syslog(LOG_ALERT, "Could not copy unix socket path\n");
+        rc = -EINVAL;
+        goto err;
+    }
+
+    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
+    if (rc < 0) {
+        syslog(LOG_ALERT, "bind() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
+    if (rc < 0) {
+        syslog(LOG_ALERT, "listen() failed");
+        rc = -EIO;
+        goto err;
+    }
+
+    server.fds[0].events = POLLIN;
+    server.nfds = 1;
+    server.run = true;
+
+    return 0;
+
+err:
+    close(server.fds[0].fd);
+    return rc;
+}
+
+static int init_umad(void)
+{
+    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
+
+    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
+                                               server.args.rdma_port_num);
+
+    if (server.umad_agent.port_id < 0) {
+        syslog(LOG_WARNING, "umad_open_port() failed");
+        return -EIO;
+    }
+
+    memset(&method_mask, 0, sizeof(method_mask));
+    method_mask[0] = MAD_METHOD_MASK0;
+    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
+                                               UMAD_CLASS_CM,
+                                               UMAD_SA_CLASS_VERSION,
+                                               MAD_RMPP_VERSION, method_mask);
+    if (server.umad_agent.agent_id < 0) {
+        syslog(LOG_WARNING, "umad_register() failed");
+        return -EIO;
+    }
+
+    hash_tbl_alloc();
+
+    return 0;
+}
+
+static void signal_handler(int sig, siginfo_t *siginfo, void *context)
+{
+    static bool warned;
+
+    /* Prevent stop if clients are connected */
+    if (server.nfds != 1) {
+        if (!warned) {
+            syslog(LOG_WARNING,
+                   "Can't stop while active client exist, resend SIGINT to overid");
+            warned = true;
+            return;
+        }
+    }
+
+    if (sig == SIGINT) {
+        server.run = false;
+        fini();
+    }
+
+    exit(0);
+}
+
+static int init(void)
+{
+    int rc;
+
+    rc = init_listener();
+    if (rc) {
+        return rc;
+    }
+
+    rc = init_umad();
+    if (rc) {
+        return rc;
+    }
+
+    pthread_rwlock_init(&server.lock, 0);
+
+    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
+                        NULL);
+    if (!rc) {
+        return rc;
+    }
+
+    return 0;
+}
+
+int main(int argc, char *argv[])
+{
+    int rc;
+    struct sigaction sig = {0};
+
+    sig.sa_sigaction = &signal_handler;
+    sig.sa_flags = SA_SIGINFO;
+
+    if (sigaction(SIGINT, &sig, NULL) < 0) {
+        syslog(LOG_ERR, "Fail to install SIGINT handler\n");
+        return -EAGAIN;
+    }
+
+    memset(&server, 0, sizeof(server));
+
+    parse_args(argc, argv);
+
+    rc = init();
+    if (rc) {
+        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
+        rc = -EAGAIN;
+        goto out;
+    }
+
+    run();
+
+out:
+    fini();
+
+    return rc;
+}
diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
new file mode 100644
index 0000000000..03508d52b2
--- /dev/null
+++ b/contrib/rdmacm-mux/rdmacm-mux.h
@@ -0,0 +1,56 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux declarations
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ *     Yuval Shaia <yuval.shaia@oracle.com>
+ *     Marcel Apfelbaum <marcel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef RDMACM_MUX_H
+#define RDMACM_MUX_H
+
+#include "linux/if.h"
+#include "infiniband/verbs.h"
+#include "infiniband/umad.h"
+#include "rdma/rdma_user_cm.h"
+
+typedef enum RdmaCmMuxMsgType {
+    RDMACM_MUX_MSG_TYPE_REG   = 0,
+    RDMACM_MUX_MSG_TYPE_UNREG = 1,
+    RDMACM_MUX_MSG_TYPE_MAD   = 2,
+    RDMACM_MUX_MSG_TYPE_RESP  = 3,
+} RdmaCmMuxMsgType;
+
+typedef enum RdmaCmMuxErrCode {
+    RDMACM_MUX_ERR_CODE_OK        = 0,
+    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
+    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
+    RDMACM_MUX_ERR_CODE_EACCES    = 3,
+    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
+} RdmaCmMuxErrCode;
+
+typedef struct RdmaCmMuxHdr {
+    RdmaCmMuxMsgType msg_type;
+    union ibv_gid sgid;
+    RdmaCmMuxErrCode err_code;
+} RdmaCmUHdr;
+
+typedef struct RdmaCmUMad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+} RdmaCmUMad;
+
+typedef struct RdmaCmMuxMsg {
+    RdmaCmUHdr hdr;
+    int umad_len;
+    RdmaCmUMad umad;
+} RdmaCmMuxMsg;
+
+#endif
-- 
2.17.2


* [Qemu-devel] [PATCH v2 02/22] hw/rdma: Add ability to force notification without re-arm
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
@ 2018-11-08 16:07 ` Yuval Shaia
  2018-11-10 17:56   ` Marcel Apfelbaum
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:07 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Upon completion of an incoming packet the device pushes a CQE to the
driver's RX ring and notifies the driver (msix).
While for data-path incoming packets the driver needs the ability to
control whether or not it wishes to receive interrupts, for control-path
packets such as incoming MADs the driver needs to be notified anyway; it
does not even need to re-arm the notification bit.

Enhance the notification field to support this.
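
In effect the per-CQ notification flag becomes a small three-state value.
A rough summary of the intended semantics, using the state names
introduced by the patch below:

    typedef enum CQNotificationType {
        CNT_CLEAR, /* do not interrupt; the driver has not (re-)armed the CQ */
        CNT_ARM,   /* one-shot: interrupt on the next CQE, then drop to CLEAR */
        CNT_SET,   /* always interrupt (GSI/MAD CQs); never cleared by a CQE */
    } CQNotificationType;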

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c           | 12 ++++++++++--
 hw/rdma/rdma_rm_defs.h      |  8 +++++++-
 hw/rdma/vmw/pvrdma_qp_ops.c |  6 ++++--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 8d59a42cd1..4f10fcabcc 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -263,7 +263,7 @@ int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     }
 
     cq->opaque = opaque;
-    cq->notify = false;
+    cq->notify = CNT_CLEAR;
 
     rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
     if (rc) {
@@ -291,7 +291,10 @@ void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
         return;
     }
 
-    cq->notify = notify;
+    if (cq->notify != CNT_SET) {
+        cq->notify = notify ? CNT_ARM : CNT_CLEAR;
+    }
+
     pr_dbg("notify=%d\n", cq->notify);
 }
 
@@ -349,6 +352,11 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
         return -EINVAL;
     }
 
+    if (qp_type == IBV_QPT_GSI) {
+        scq->notify = CNT_SET;
+        rcq->notify = CNT_SET;
+    }
+
     qp = res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
     if (!qp) {
         return -ENOMEM;
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7228151239..9b399063d3 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -49,10 +49,16 @@ typedef struct RdmaRmPD {
     uint32_t ctx_handle;
 } RdmaRmPD;
 
+typedef enum CQNotificationType {
+    CNT_CLEAR,
+    CNT_ARM,
+    CNT_SET,
+} CQNotificationType;
+
 typedef struct RdmaRmCQ {
     RdmaBackendCQ backend_cq;
     void *opaque;
-    bool notify;
+    CQNotificationType notify;
 } RdmaRmCQ;
 
 /* MR (DMA region) */
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index c668afd0ed..762700a205 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -89,8 +89,10 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pvrdma_ring_write_inc(&dev->dsr_info.cq);
 
     pr_dbg("cq->notify=%d\n", cq->notify);
-    if (cq->notify) {
-        cq->notify = false;
+    if (cq->notify != CNT_CLEAR) {
+        if (cq->notify == CNT_ARM) {
+            cq->notify = CNT_CLEAR;
+        }
         post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
     }
 
-- 
2.17.2


* [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 02/22] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
@ 2018-11-08 16:07 ` Yuval Shaia
  2018-11-10 17:59   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 04/22] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:07 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The device does not support QP0, only QP1.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 86e8fe8ab6..3ccc9a2494 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
 
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
-    return qp->ibqp ? qp->ibqp->qp_num : 0;
+    return qp->ibqp ? qp->ibqp->qp_num : 1;
 }
 
 static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
-- 
2.17.2


* [Qemu-devel] [PATCH v2 04/22] hw/rdma: Abort send-op if fail to create addr handler
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (2 preceding siblings ...)
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 17:59   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets Yuval Shaia
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Function create_ah might return NULL; in that case abort the send
operation with an error.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index d7a4bbd91f..1e148398a2 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -338,6 +338,10 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
                                 backend_dev->backend_gid_idx, dgid);
+        if (!wr.wr.ud.ah) {
+            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            goto out_dealloc_cqe_ctx;
+        }
         wr.wr.ud.remote_qpn = dqpn;
         wr.wr.ud.remote_qkey = dqkey;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (3 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 04/22] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:15   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 06/22] hw/pvrdma: Make function reset_device return void Yuval Shaia
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

MAD (Management Datagram) packets are widely used by various modules both
in kernel and in user space. For example the rdma_* API, which is used to
create and maintain a "connection" layer on top of RDMA, uses several
types of MAD packets.
To support MAD packets the device uses an external utility
(contrib/rdmacm-mux) to relay packets from and to the guest driver.
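
The device is wired to the multiplexer through a character device exposed
by the new mad-chardev property. The fragment below is only an
illustrative example (the rxe0 device name is hypothetical and other
device options are elided); rdmacm-mux derives its socket path from the
device name and port number:

    rdmacm-mux -d rxe0 -p 1 &

    qemu-system-x86_64 ... \
        -chardev socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads \
        -device pvrdma,...,mad-chardev=mads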

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
 hw/rdma/rdma_backend.h      |   4 +-
 hw/rdma/rdma_backend_defs.h |  10 +-
 hw/rdma/vmw/pvrdma.h        |   2 +
 hw/rdma/vmw/pvrdma_main.c   |   4 +-
 5 files changed, 273 insertions(+), 10 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 1e148398a2..3eb0099f8d 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -16,8 +16,13 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qlist.h"
+#include "qapi/qmp/qnum.h"
 
 #include <infiniband/verbs.h>
+#include <infiniband/umad_types.h>
+#include <infiniband/umad.h>
+#include <rdma/rdma_user_cm.h>
 
 #include "trace.h"
 #include "rdma_utils.h"
@@ -33,16 +38,25 @@
 #define VENDOR_ERR_MAD_SEND         0x206
 #define VENDOR_ERR_INVLKEY          0x207
 #define VENDOR_ERR_MR_SMALL         0x208
+#define VENDOR_ERR_INV_MAD_BUFF     0x209
+#define VENDOR_ERR_INV_NUM_SGE      0x210
 
 #define THR_NAME_LEN 16
 #define THR_POLL_TO  5000
 
+#define MAD_HDR_SIZE sizeof(struct ibv_grh)
+
 typedef struct BackendCtx {
-    uint64_t req_id;
     void *up_ctx;
     bool is_tx_req;
+    struct ibv_sge sge; /* Used to save MAD recv buffer */
 } BackendCtx;
 
+struct backend_umad {
+    struct ib_user_mad hdr;
+    char mad[RDMA_MAX_PRIVATE_DATA];
+};
+
 static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
 
 static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
@@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
+static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
+                    uint32_t num_sge)
+{
+    struct backend_umad umad = {0};
+    char *hdr, *msg;
+    int ret;
+
+    pr_dbg("num_sge=%d\n", num_sge);
+
+    if (num_sge != 2) {
+        return -EINVAL;
+    }
+
+    umad.hdr.length = sge[0].length + sge[1].length;
+    pr_dbg("msg_len=%d\n", umad.hdr.length);
+
+    if (umad.hdr.length > sizeof(umad.mad)) {
+        return -ENOMEM;
+    }
+
+    umad.hdr.addr.qpn = htobe32(1);
+    umad.hdr.addr.grh_present = 1;
+    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    umad.hdr.addr.hop_limit = 1;
+
+    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
+    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+    memcpy(&umad.mad[0], hdr, sge[0].length);
+    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+
+    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
+
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad));
+
+    pr_dbg("qemu_chr_fe_write=%d\n", ret);
+
+    return (ret != sizeof(umad));
+}
+
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
@@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = mad_send(backend_dev, sge, num_sge);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            } else {
+                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+            }
         }
-        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
         return;
     }
 
@@ -370,6 +431,48 @@ out_free_bctx:
     g_free(bctx);
 }
 
+static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
+                                         struct ibv_sge *sge, uint32_t num_sge,
+                                         void *ctx)
+{
+    BackendCtx *bctx;
+    int rc;
+    uint32_t bctx_id;
+
+    if (num_sge != 1) {
+        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
+        return VENDOR_ERR_INV_NUM_SGE;
+    }
+
+    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
+        pr_dbg("Too small buffer for MAD\n");
+        return VENDOR_ERR_INV_MAD_BUFF;
+    }
+
+    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
+    pr_dbg("length=%d\n", sge[0].length);
+    pr_dbg("lkey=%d\n", sge[0].lkey);
+
+    bctx = g_malloc0(sizeof(*bctx));
+
+    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
+    if (unlikely(rc)) {
+        g_free(bctx);
+        pr_dbg("Fail to allocate cqe_ctx\n");
+        return VENDOR_ERR_NOMEM;
+    }
+
+    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
+    bctx->up_ctx = ctx;
+    bctx->sge = *sge;
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+
+    return 0;
+}
+
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
                             RdmaDeviceResources *rdma_dev_res,
                             RdmaBackendQP *qp, uint8_t qp_type,
@@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
+            if (rc) {
+                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+            }
         }
         return;
     }
@@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 
     switch (qp_type) {
     case IBV_QPT_GSI:
-        pr_dbg("QP1 unsupported\n");
         return 0;
 
     case IBV_QPT_RC:
@@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
     return 0;
 }
 
+static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
+                                 union ibv_gid *my_gid, int paylen)
+{
+    grh->paylen = htons(paylen);
+    grh->sgid = *sgid;
+    grh->dgid = *my_gid;
+
+    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
+    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+}
+
+static inline int mad_can_receieve(void *opaque)
+{
+    return sizeof(struct backend_umad);
+}
+
+static void mad_read(void *opaque, const uint8_t *buf, int size)
+{
+    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+    char *mad;
+    struct backend_umad *umad;
+
+    assert(size != sizeof(umad));
+    umad = (struct backend_umad *)buf;
+
+    pr_dbg("Got %d bytes\n", size);
+    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+
+#ifdef PVRDMA_DEBUG
+    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
+    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
+           hdr->base_version, hdr->mgmt_class, hdr->class_version,
+           hdr->method, hdr->status, be64toh(hdr->tid),
+           hdr->attr_id, hdr->attr_mod);
+#endif
+
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+    if (!o_ctx_id) {
+        pr_dbg("No more free MADs buffers, waiting for a while\n");
+        sleep(THR_POLL_TO);
+        return;
+    }
+
+    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+    if (unlikely(!bctx)) {
+        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
+        return;
+    }
+
+    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
+
+    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
+                           bctx->sge.length);
+    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                     bctx->up_ctx);
+    } else {
+        memset(mad, 0, bctx->sge.length);
+        build_mad_hdr((struct ibv_grh *)mad,
+                      (union ibv_gid *)&umad->hdr.addr.gid,
+                      &backend_dev->gid, umad->hdr.length);
+        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
+
+        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+    }
+
+    g_free(bctx);
+    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+}
+
+static int mad_init(RdmaBackendDev *backend_dev)
+{
+    struct backend_umad umad = {0};
+    int ret;
+
+    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+        pr_dbg("Missing chardev for MAD multiplexer\n");
+        return -EIO;
+    }
+
+    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
+                             mad_read, NULL, NULL, backend_dev, NULL, true);
+
+    /* Register ourself */
+    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+                            sizeof(umad.hdr));
+    if (ret != sizeof(umad.hdr)) {
+        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
+    }
+
+    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
+    backend_dev->recv_mads_list.list = qlist_new();
+
+    return 0;
+}
+
+static void mad_stop(RdmaBackendDev *backend_dev)
+{
+    QObject *o_ctx_id;
+    unsigned long cqe_ctx_id;
+    BackendCtx *bctx;
+
+    pr_dbg("Closing MAD\n");
+
+    /* Clear MAD buffers list */
+    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
+    do {
+        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
+        if (o_ctx_id) {
+            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
+            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+            if (bctx) {
+                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
+                g_free(bctx);
+            }
+        }
+    } while (o_ctx_id);
+    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
+}
+
+static void mad_fini(RdmaBackendDev *backend_dev)
+{
+    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
+    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp)
+                      CharBackend *mad_chr_be, Error **errp)
 {
     int i;
     int ret = 0;
@@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
-
+    backend_dev->mad_chr_be = mad_chr_be;
     backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
@@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     pr_dbg("interface_id=0x%" PRIx64 "\n",
            be64_to_cpu(backend_dev->gid.global.interface_id));
 
+    ret = mad_init(backend_dev);
+    if (ret) {
+        error_setg(errp, "Fail to initialize mad");
+        ret = -EIO;
+        goto out_destroy_comm_channel;
+    }
+
     backend_dev->comp_thread.run = false;
     backend_dev->comp_thread.is_running = false;
 
@@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
 {
     pr_dbg("Stopping rdma_backend\n");
     stop_backend_thread(&backend_dev->comp_thread);
+    mad_stop(backend_dev);
 }
 
 void rdma_backend_fini(RdmaBackendDev *backend_dev)
 {
     rdma_backend_stop(backend_dev);
+    mad_fini(backend_dev);
     g_hash_table_destroy(ah_hash);
     ibv_destroy_comp_channel(backend_dev->channel);
     ibv_close_device(backend_dev->context);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 3ccc9a2494..fc83330251 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -17,6 +17,8 @@
 #define RDMA_BACKEND_H
 
 #include "qapi/error.h"
+#include "chardev/char-fe.h"
+
 #include "rdma_rm_defs.h"
 #include "rdma_backend_defs.h"
 
@@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
                       uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      Error **errp);
+                      CharBackend *mad_chr_be, Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 7404f64002..2a7e667075 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -16,8 +16,9 @@
 #ifndef RDMA_BACKEND_DEFS_H
 #define RDMA_BACKEND_DEFS_H
 
-#include <infiniband/verbs.h>
 #include "qemu/thread.h"
+#include "chardev/char-fe.h"
+#include <infiniband/verbs.h>
 
 typedef struct RdmaDeviceResources RdmaDeviceResources;
 
@@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
     bool is_running; /* Set by the thread to report its status */
 } RdmaBackendThread;
 
+typedef struct RecvMadList {
+    QemuMutex lock;
+    QList *list;
+} RecvMadList;
+
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
@@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
     struct ibv_comp_channel *channel;
     uint8_t port_num;
     uint8_t backend_gid_idx;
+    RecvMadList recv_mads_list;
+    CharBackend *mad_chr_be;
 } RdmaBackendDev;
 
 typedef struct RdmaBackendPD {
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e2d9f93cdf..e3742d893a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -19,6 +19,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
+#include "chardev/char-fe.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -83,6 +84,7 @@ typedef struct PVRDMADev {
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
+    CharBackend mad_chr;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ca5fa8d981..6c8c0154fa 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
     DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
                       dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
     DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
+    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, errp);
+                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
+                           errp);
     if (rc) {
         goto out;
     }
-- 
2.17.2


* [Qemu-devel] [PATCH v2 06/22] hw/pvrdma: Make function reset_device return void
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (4 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:17   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 07/22] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

This function cannot fail, so change it to return void.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 6c8c0154fa..fc2abd34af 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -369,13 +369,11 @@ static int unquiesce_device(PVRDMADev *dev)
     return 0;
 }
 
-static int reset_device(PVRDMADev *dev)
+static void reset_device(PVRDMADev *dev)
 {
     pvrdma_stop(dev);
 
     pr_dbg("Device reset complete\n");
-
-    return 0;
 }
 
 static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 07/22] hw/pvrdma: Make default pkey 0xFFFF
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (5 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 06/22] hw/pvrdma: Make function reset_device return void Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:17   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Commit 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF") exports
default pkey as external definition but omit the change from 0x7FFF to
0xFFFF.

Fixes: 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF")

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e3742d893a..15c3f28b86 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -52,7 +52,7 @@
 #define PVRDMA_FW_VERSION    14
 
 /* Some defaults */
-#define PVRDMA_PKEY          0x7FFF
+#define PVRDMA_PKEY          0xFFFF
 
 typedef struct DSRInfo {
     dma_addr_t dma;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (6 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 07/22] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:18   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The function pvrdma_post_cqe populates the CQE entry with the opcode
taken from the given completion element. For receive operations the
value was not set. Fix it by setting it to IBV_WC_RECV.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 762700a205..7b0f440fda 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx = g_malloc(sizeof(CompHandlerCtx));
         comp_ctx->dev = dev;
         comp_ctx->cq_handle = qp->recv_cq_handle;
-        comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+        comp_ctx->cqe.qp = qp_handle;
+        comp_ctx->cqe.opcode = IBV_WC_RECV;
 
         rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
                                &qp->backend_qp, qp->qp_type,
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (7 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:21   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma Yuval Shaia
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The opcode for the WC should be set by the device and not taken from
the work element.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 7b0f440fda..3388be1926 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cq_handle = qp->send_cq_handle;
         comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
         comp_ctx->cqe.qp = qp_handle;
-        comp_ctx->cqe.opcode = wqe->hdr.opcode;
+        comp_ctx->cqe.opcode = IBV_WC_SEND;
 
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (8 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:25   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 11/22] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

pvrdma requires that any GID attached to it is also attached to the
backend device on the host.

A new QMP message is defined so the pvrdma device can broadcast any
change made to its GID table. This event is captured by libvirt, which
in turn updates the GID table in the backend device.
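
For illustration only (not part of this patch), here is a minimal sketch
of how a management application such as libvirt might consume this event
over a QMP socket and mirror the change onto the named host netdev. The
socket path, the GID-to-address reconstruction and the use of 'ip addr'
are assumptions made for the example; real libvirt code differs.

#!/usr/bin/env python3
# Sketch: subscribe to QMP events and mirror RDMA_GID_STATUS_CHANGED
# onto the named host netdev.  The socket path is hypothetical.
import json
import socket
import subprocess
import ipaddress

QMP_SOCK = "/tmp/qmp-pvrdma.sock"

def gid_to_addr(subnet_prefix, interface_id):
    # The event carries the GID's two 64-bit words exactly as the device
    # stores them (network byte order, re-read as integers on a
    # little-endian host); e.g. the documented subnet-prefix 33022
    # (0x80fe) is the fe80:: link-local prefix - hence "little" here.
    raw = (subnet_prefix.to_bytes(8, "little") +
           interface_id.to_bytes(8, "little"))
    return ipaddress.IPv6Address(raw)

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCK)
f = sock.makefile("rw")
json.loads(f.readline())                              # QMP greeting
f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
f.flush()
json.loads(f.readline())                              # {"return": {}}

for line in f:
    msg = json.loads(line)
    if msg.get("event") != "RDMA_GID_STATUS_CHANGED":
        continue
    data = msg["data"]
    addr = gid_to_addr(data["subnet-prefix"], data["interface-id"])
    op = "add" if data["gid-status"] else "del"
    # Reflect the guest's GID change on the backend Ethernet device
    subprocess.run(["ip", "addr", op, "%s/64" % addr,
                    "dev", data["netdev"]], check=False)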

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 MAINTAINERS           |  1 +
 Makefile              |  3 ++-
 Makefile.objs         |  4 ++++
 qapi/qapi-schema.json |  1 +
 qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 qapi/rdma.json

diff --git a/MAINTAINERS b/MAINTAINERS
index e087d58ac6..a149f68a8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2232,6 +2232,7 @@ F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
 F: contrib/rdmacm-mux/*
+F: qapi/rdma.json
 
 Build and test automation
 -------------------------
diff --git a/Makefile b/Makefile
index 94072776ff..db4ce60ee5 100644
--- a/Makefile
+++ b/Makefile
@@ -599,7 +599,8 @@ qapi-modules = $(SRC_PATH)/qapi/qapi-schema.json $(SRC_PATH)/qapi/common.json \
                $(SRC_PATH)/qapi/tpm.json \
                $(SRC_PATH)/qapi/trace.json \
                $(SRC_PATH)/qapi/transaction.json \
-               $(SRC_PATH)/qapi/ui.json
+               $(SRC_PATH)/qapi/ui.json \
+               $(SRC_PATH)/qapi/rdma.json
 
 qapi/qapi-builtin-types.c qapi/qapi-builtin-types.h \
 qapi/qapi-types.c qapi/qapi-types.h \
diff --git a/Makefile.objs b/Makefile.objs
index cc7df3ad80..76d8028f2f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -21,6 +21,7 @@ util-obj-y += qapi/qapi-types-tpm.o
 util-obj-y += qapi/qapi-types-trace.o
 util-obj-y += qapi/qapi-types-transaction.o
 util-obj-y += qapi/qapi-types-ui.o
+util-obj-y += qapi/qapi-types-rdma.o
 util-obj-y += qapi/qapi-builtin-visit.o
 util-obj-y += qapi/qapi-visit.o
 util-obj-y += qapi/qapi-visit-block-core.o
@@ -40,6 +41,7 @@ util-obj-y += qapi/qapi-visit-tpm.o
 util-obj-y += qapi/qapi-visit-trace.o
 util-obj-y += qapi/qapi-visit-transaction.o
 util-obj-y += qapi/qapi-visit-ui.o
+util-obj-y += qapi/qapi-visit-rdma.o
 util-obj-y += qapi/qapi-events.o
 util-obj-y += qapi/qapi-events-block-core.o
 util-obj-y += qapi/qapi-events-block.o
@@ -58,6 +60,7 @@ util-obj-y += qapi/qapi-events-tpm.o
 util-obj-y += qapi/qapi-events-trace.o
 util-obj-y += qapi/qapi-events-transaction.o
 util-obj-y += qapi/qapi-events-ui.o
+util-obj-y += qapi/qapi-events-rdma.o
 util-obj-y += qapi/qapi-introspect.o
 
 chardev-obj-y = chardev/
@@ -155,6 +158,7 @@ common-obj-y += qapi/qapi-commands-tpm.o
 common-obj-y += qapi/qapi-commands-trace.o
 common-obj-y += qapi/qapi-commands-transaction.o
 common-obj-y += qapi/qapi-commands-ui.o
+common-obj-y += qapi/qapi-commands-rdma.o
 common-obj-y += qapi/qapi-introspect.o
 common-obj-y += qmp.o hmp.o
 endif
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 65b6dc2f6f..a650d80f83 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -94,3 +94,4 @@
 { 'include': 'trace.json' }
 { 'include': 'introspect.json' }
 { 'include': 'misc.json' }
+{ 'include': 'rdma.json' }
diff --git a/qapi/rdma.json b/qapi/rdma.json
new file mode 100644
index 0000000000..804c68ab36
--- /dev/null
+++ b/qapi/rdma.json
@@ -0,0 +1,38 @@
+# -*- Mode: Python -*-
+#
+
+##
+# = RDMA device
+##
+
+##
+# @RDMA_GID_STATUS_CHANGED:
+#
+# Emitted when guest driver adds/deletes GID to/from device
+#
+# @netdev: RoCE Network Device name - char *
+#
+# @gid-status: Add or delete indication - bool
+#
+# @subnet-prefix: Subnet Prefix - uint64
+#
+# @interface-id : Interface ID - uint64
+#
+# Since: 3.2
+#
+# Example:
+#
+# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
+#     "event": "RDMA_GID_STATUS_CHANGED",
+#     "data":
+#         {"netdev": "bridge0",
+#         "interface-id": 15880512517475447892,
+#         "gid-status": true,
+#         "subnet-prefix": 33022}}
+#
+##
+{ 'event': 'RDMA_GID_STATUS_CHANGED',
+  'data': { 'netdev'        : 'str',
+            'gid-status'    : 'bool',
+            'subnet-prefix' : 'uint64',
+            'interface-id'  : 'uint64' } }
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 11/22] hw/pvrdma: Add support to allow guest to configure GID table
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (9 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 12/22] vmxnet3: Move some definitions to header file Yuval Shaia
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The RDMA device's GID table is controlled by updating the addresses of
the device's Ethernet function.
Usually the first GID entry is determined by the MAC address, the second
by the first IPv6 address and the third by the IPv4 address. Other
entries can be added by adding more IP addresses. The reverse also
holds: whenever an address is removed, the corresponding GID entry is
removed.
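
For illustration only (not part of this patch), a short sketch of the
conventional RoCE address-to-GID mappings described above: a link-local
GID derived from the MAC via EUI-64 and an IPv4-mapped (::ffff:a.b.c.d)
GID for the IPv4 address. The MAC and IP values are made up.

#!/usr/bin/env python3
# Sketch: typical derivation of RoCE GID entries from the Ethernet
# function's addresses (example values are made up).
import ipaddress

def gid_from_mac(mac):
    # Link-local GID: fe80::/64 prefix + EUI-64 built from the MAC
    # (flip the universal/local bit, insert ff:fe in the middle).
    b = [int(x, 16) for x in mac.split(":")]
    eui64 = bytes([b[0] ^ 0x02, b[1], b[2], 0xff, 0xfe, b[3], b[4], b[5]])
    prefix = bytes([0xfe, 0x80, 0, 0, 0, 0, 0, 0])
    return ipaddress.IPv6Address(prefix + eui64)

def gid_from_ipv6(ip):
    # An IPv6 address is used as a GID as-is.
    return ipaddress.IPv6Address(ip)

def gid_from_ipv4(ip):
    # RoCEv2 represents an IPv4 address as an IPv4-mapped IPv6 address.
    return ipaddress.IPv6Address("::ffff:" + ip)

print("gid[0] =", gid_from_mac("52:54:00:12:34:56"))   # from MAC
print("gid[1] =", gid_from_ipv6("fdaa::1"))            # first IPv6 addr
print("gid[2] =", gid_from_ipv4("192.168.1.10"))       # IPv4 addr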

The process is handled by the network and RDMA stacks. Whenever an
address is added, the ib_core driver is notified and calls the device
driver's add_gid function, which in turn updates the device.

To support this in the pvrdma device we need to hook into the
create_bind and destroy_bind HW commands triggered by the pvrdma driver
in the guest. Whenever a change is made to the pvrdma device's GID
table, a special QMP message is sent to be processed by libvirt, which
updates the address of the backend Ethernet device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 243 +++++++++++++++++++++++-------------
 hw/rdma/rdma_backend.h      |  22 ++--
 hw/rdma/rdma_backend_defs.h |   3 +-
 hw/rdma/rdma_rm.c           | 104 ++++++++++++++-
 hw/rdma/rdma_rm.h           |  17 ++-
 hw/rdma/rdma_rm_defs.h      |   9 +-
 hw/rdma/rdma_utils.h        |  15 +++
 hw/rdma/vmw/pvrdma.h        |   2 +-
 hw/rdma/vmw/pvrdma_cmd.c    |  55 ++++----
 hw/rdma/vmw/pvrdma_main.c   |  25 +---
 hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
 11 files changed, 370 insertions(+), 145 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 3eb0099f8d..5675504165 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -18,12 +18,14 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qnum.h"
+#include "qapi/qapi-events-rdma.h"
 
 #include <infiniband/verbs.h>
 #include <infiniband/umad_types.h>
 #include <infiniband/umad.h>
 #include <rdma/rdma_user_cm.h>
 
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 #include "trace.h"
 #include "rdma_utils.h"
 #include "rdma_rm.h"
@@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
     return 0;
 }
 
-static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
-                    uint32_t num_sge)
+static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
+                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
 {
-    struct backend_umad umad = {0};
-    char *hdr, *msg;
+    RdmaCmMuxMsg msg = {0};
+    char *hdr, *data;
     int ret;
 
     pr_dbg("num_sge=%d\n", num_sge);
@@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
         return -EINVAL;
     }
 
-    umad.hdr.length = sge[0].length + sge[1].length;
-    pr_dbg("msg_len=%d\n", umad.hdr.length);
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
+    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
 
-    if (umad.hdr.length > sizeof(umad.mad)) {
+    msg.umad_len = sge[0].length + sge[1].length;
+    pr_dbg("umad_len=%d\n", msg.umad_len);
+
+    if (msg.umad_len > sizeof(msg.umad.mad)) {
         return -ENOMEM;
     }
 
-    umad.hdr.addr.qpn = htobe32(1);
-    umad.hdr.addr.grh_present = 1;
-    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    umad.hdr.addr.hop_limit = 1;
+    msg.umad.hdr.addr.qpn = htobe32(1);
+    msg.umad.hdr.addr.grh_present = 1;
+    pr_dbg("sgid_idx=%d\n", sgid_idx);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
+    msg.umad.hdr.addr.gid_index = sgid_idx;
+    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
+    msg.umad.hdr.addr.hop_limit = 1;
 
     hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
-    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+    pr_dbg_buf("mad_hdr", hdr, sge[0].length);
+    pr_dbg_buf("mad_data", data, sge[1].length);
 
-    memcpy(&umad.mad[0], hdr, sge[0].length);
-    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
+    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
 
-    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
     rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
 
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
 
     pr_dbg("qemu_chr_fe_write=%d\n", ret);
 
-    return (ret != sizeof(umad));
+    return (ret != sizeof(msg));
 }
 
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
-                            union ibv_gid *dgid, uint32_t dqpn,
-                            uint32_t dqkey, void *ctx)
+                            uint8_t sgid_idx, union ibv_gid *sgid,
+                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
+                            void *ctx)
 {
     BackendCtx *bctx;
     struct ibv_sge new_sge[MAX_SGE];
@@ -361,7 +372,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
-            rc = mad_send(backend_dev, sge, num_sge);
+            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
                 comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
@@ -397,8 +408,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     }
 
     if (qp_type == IBV_QPT_UD) {
-        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
-                                backend_dev->backend_gid_idx, dgid);
+        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
             comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
@@ -703,9 +713,9 @@ int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
 }
 
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey)
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
 {
     struct ibv_qp_attr attr = {0};
     union ibv_gid ibv_gid = {
@@ -717,13 +727,15 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
     attr.qp_state = IBV_QPS_RTR;
     attr_mask = IBV_QP_STATE;
 
+    qp->sgid_idx = sgid_idx;
+
     switch (qp_type) {
     case IBV_QPT_RC:
         pr_dbg("dgid=0x%" PRIx64 ",%" PRIx64 "\n",
                be64_to_cpu(ibv_gid.global.subnet_prefix),
                be64_to_cpu(ibv_gid.global.interface_id));
         pr_dbg("dqpn=0x%x\n", dqpn);
-        pr_dbg("sgid_idx=%d\n", backend_dev->backend_gid_idx);
+        pr_dbg("sgid_idx=%d\n", qp->sgid_idx);
         pr_dbg("sport_num=%d\n", backend_dev->port_num);
         pr_dbg("rq_psn=0x%x\n", rq_psn);
 
@@ -735,7 +747,7 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         attr.ah_attr.is_global      = 1;
         attr.ah_attr.grh.hop_limit  = 1;
         attr.ah_attr.grh.dgid       = ibv_gid;
-        attr.ah_attr.grh.sgid_index = backend_dev->backend_gid_idx;
+        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
         attr.rq_psn                 = rq_psn;
 
         attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
@@ -744,8 +756,8 @@ int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
         break;
 
     case IBV_QPT_UD:
+        pr_dbg("qkey=0x%x\n", qkey);
         if (use_qkey) {
-            pr_dbg("qkey=0x%x\n", qkey);
             attr.qkey = qkey;
             attr_mask |= IBV_QP_QKEY;
         }
@@ -861,13 +873,13 @@ static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
     grh->dgid = *my_gid;
 
     pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
-    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
-    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
+    pr_dbg("dgid=0x%llx\n", my_gid->global.interface_id);
+    pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
 }
 
 static inline int mad_can_receieve(void *opaque)
 {
-    return sizeof(struct backend_umad);
+    return sizeof(RdmaCmMuxMsg);
 }
 
 static void mad_read(void *opaque, const uint8_t *buf, int size)
@@ -877,13 +889,13 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     unsigned long cqe_ctx_id;
     BackendCtx *bctx;
     char *mad;
-    struct backend_umad *umad;
+    RdmaCmMuxMsg *msg;
 
-    assert(size != sizeof(umad));
-    umad = (struct backend_umad *)buf;
+    assert(size != sizeof(msg));
+    msg = (RdmaCmMuxMsg *)buf;
 
     pr_dbg("Got %d bytes\n", size);
-    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
+    pr_dbg("umad_len=%d\n", msg->umad_len);
 
 #ifdef PVRDMA_DEBUG
     struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
@@ -913,15 +925,16 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
-    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
+    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
         comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
                      bctx->up_ctx);
     } else {
+        pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
-                      (union ibv_gid *)&umad->hdr.addr.gid,
-                      &backend_dev->gid, umad->hdr.length);
-        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
+                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
+                      msg->umad_len);
+        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
         comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
@@ -933,10 +946,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
 
 static int mad_init(RdmaBackendDev *backend_dev)
 {
-    struct backend_umad umad = {0};
     int ret;
 
-    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
+    ret = qemu_chr_fe_backend_connected(backend_dev->mad_chr_be);
+    if (!ret) {
         pr_dbg("Missing chardev for MAD multiplexer\n");
         return -EIO;
     }
@@ -944,14 +957,6 @@ static int mad_init(RdmaBackendDev *backend_dev)
     qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
                              mad_read, NULL, NULL, backend_dev, NULL, true);
 
-    /* Register ourself */
-    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-                            sizeof(umad.hdr));
-    if (ret != sizeof(umad.hdr)) {
-        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
-    }
-
     qemu_mutex_init(&backend_dev->recv_mads_list.lock);
     backend_dev->recv_mads_list.list = qlist_new();
 
@@ -988,23 +993,120 @@ static void mad_fini(RdmaBackendDev *backend_dev)
     qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
 }
 
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid)
+{
+    union ibv_gid sgid;
+    int ret;
+    int i = 0;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    do {
+        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
+                            &sgid);
+        i++;
+    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
+
+    pr_dbg("gid_index=%d\n", i - 1);
+
+    return ret ? ret : i - 1;
+}
+
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Fail to register GID to rdma_umadmux (%d)\n", msg.hdr.err_code);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, true,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return ret;
+}
+
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid)
+{
+    RdmaCmMuxMsg msg = {0};
+    int ret;
+
+    pr_dbg("0x%llx, 0x%llx\n",
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
+
+    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_UNREG;
+    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
+    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    ret = qemu_chr_fe_read_all(backend_dev->mad_chr_be, (uint8_t *)&msg,
+                            sizeof(msg));
+    if (ret != sizeof(msg)) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n", ret);
+        return -EIO;
+    }
+
+    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
+        pr_dbg("Fail to unregister GID from rdma_umadmux (%d)\n",
+               msg.hdr.err_code);
+        return -EIO;
+    }
+
+    qapi_event_send_rdma_gid_status_changed(ifname, false,
+                                            gid->global.subnet_prefix,
+                                            gid->global.interface_id);
+
+    return 0;
+}
+
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp)
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp)
 {
     int i;
     int ret = 0;
     int num_ibv_devices;
     struct ibv_device **dev_list;
-    struct ibv_port_attr port_attr;
 
     memset(backend_dev, 0, sizeof(*backend_dev));
 
     backend_dev->dev = pdev;
     backend_dev->mad_chr_be = mad_chr_be;
-    backend_dev->backend_gid_idx = backend_gid_idx;
     backend_dev->port_num = port_num;
     backend_dev->rdma_dev_res = rdma_dev_res;
 
@@ -1041,9 +1143,8 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         backend_dev->ib_dev = *dev_list;
     }
 
-    pr_dbg("Using backend device %s, port %d, gid_idx %d\n",
-           ibv_get_device_name(backend_dev->ib_dev),
-           backend_dev->port_num, backend_dev->backend_gid_idx);
+    pr_dbg("Using backend device %s, port %d\n",
+           ibv_get_device_name(backend_dev->ib_dev), backend_dev->port_num);
 
     backend_dev->context = ibv_open_device(backend_dev->ib_dev);
     if (!backend_dev->context) {
@@ -1060,20 +1161,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
     }
     pr_dbg("dev->backend_dev.channel=%p\n", backend_dev->channel);
 
-    ret = ibv_query_port(backend_dev->context, backend_dev->port_num,
-                         &port_attr);
-    if (ret) {
-        error_setg(errp, "Error %d from ibv_query_port", ret);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-
-    if (backend_dev->backend_gid_idx >= port_attr.gid_tbl_len) {
-        error_setg(errp, "Invalid backend_gid_idx, should be less than %d",
-                   port_attr.gid_tbl_len);
-        goto out_destroy_comm_channel;
-    }
-
     ret = init_device_caps(backend_dev, dev_attr);
     if (ret) {
         error_setg(errp, "Failed to initialize device capabilities");
@@ -1081,18 +1168,6 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
         goto out_destroy_comm_channel;
     }
 
-    ret = ibv_query_gid(backend_dev->context, backend_dev->port_num,
-                         backend_dev->backend_gid_idx, &backend_dev->gid);
-    if (ret) {
-        error_setg(errp, "Failed to query gid %d",
-                   backend_dev->backend_gid_idx);
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-    pr_dbg("subnet_prefix=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.subnet_prefix));
-    pr_dbg("interface_id=0x%" PRIx64 "\n",
-           be64_to_cpu(backend_dev->gid.global.interface_id));
 
     ret = mad_init(backend_dev);
     if (ret) {
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index fc83330251..59ad2b874b 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -28,11 +28,6 @@ enum ibv_special_qp_type {
     IBV_QPT_GSI = 1,
 };
 
-static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
-{
-    return &dev->gid;
-}
-
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
     return qp->ibqp ? qp->ibqp->qp_num : 1;
@@ -51,9 +46,15 @@ static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
 int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
                       RdmaDeviceResources *rdma_dev_res,
                       const char *backend_device_name, uint8_t port_num,
-                      uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be, Error **errp);
+                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be,
+                      Error **errp);
 void rdma_backend_fini(RdmaBackendDev *backend_dev);
+int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
+                         union ibv_gid *gid);
+int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
+                               union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
 void rdma_backend_register_comp_handler(void (*handler)(int status,
@@ -82,9 +83,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
 int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
                                uint8_t qp_type, uint32_t qkey);
 int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, union ibv_gid *dgid,
-                              uint32_t dqpn, uint32_t rq_psn, uint32_t qkey,
-                              bool use_qkey);
+                              uint8_t qp_type, uint8_t sgid_idx,
+                              union ibv_gid *dgid, uint32_t dqpn,
+                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
                               uint32_t sq_psn, uint32_t qkey, bool use_qkey);
 int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
@@ -94,6 +95,7 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp);
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
                             RdmaBackendQP *qp, uint8_t qp_type,
                             struct ibv_sge *sge, uint32_t num_sge,
+                            uint8_t sgid_idx, union ibv_gid *sgid,
                             union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
                             void *ctx);
 void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
index 2a7e667075..ff8b2426a0 100644
--- a/hw/rdma/rdma_backend_defs.h
+++ b/hw/rdma/rdma_backend_defs.h
@@ -37,14 +37,12 @@ typedef struct RecvMadList {
 typedef struct RdmaBackendDev {
     struct ibv_device_attr dev_attr;
     RdmaBackendThread comp_thread;
-    union ibv_gid gid;
     PCIDevice *dev;
     RdmaDeviceResources *rdma_dev_res;
     struct ibv_device *ib_dev;
     struct ibv_context *context;
     struct ibv_comp_channel *channel;
     uint8_t port_num;
-    uint8_t backend_gid_idx;
     RecvMadList recv_mads_list;
     CharBackend *mad_chr_be;
 } RdmaBackendDev;
@@ -66,6 +64,7 @@ typedef struct RdmaBackendCQ {
 typedef struct RdmaBackendQP {
     struct ibv_pd *ibpd;
     struct ibv_qp *ibqp;
+    uint8_t sgid_idx;
 } RdmaBackendQP;
 
 #endif
diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 4f10fcabcc..fe0979415d 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -391,7 +391,7 @@ out_dealloc_qp:
 }
 
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn)
@@ -400,6 +400,7 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int ret;
 
     pr_dbg("qpn=0x%x\n", qp_handle);
+    pr_dbg("qkey=0x%x\n", qkey);
 
     qp = rdma_rm_get_qp(dev_res, qp_handle);
     if (!qp) {
@@ -430,9 +431,19 @@ int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         }
 
         if (qp->qp_state == IBV_QPS_RTR) {
+            /* Get backend gid index */
+            pr_dbg("Guest sgid_idx=%d\n", sgid_idx);
+            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
+                                                     sgid_idx);
+            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
+                pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", sgid_idx);
+                return -EIO;
+            }
+
             ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
-                                            qp->qp_type, dgid, dqpn, rq_psn,
-                                            qkey, attr_mask & IBV_QP_QKEY);
+                                            qp->qp_type, sgid_idx, dgid, dqpn,
+                                            rq_psn, qkey,
+                                            attr_mask & IBV_QP_QKEY);
             if (ret) {
                 return -EIO;
             }
@@ -523,11 +534,91 @@ void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
     res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
 }
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
+    if (rc <= 0) {
+        pr_dbg("Fail to add gid\n");
+        return -EINVAL;
+    }
+
+    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+
+    return 0;
+}
+
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx)
+{
+    int rc;
+
+    rc = rdma_backend_del_gid(backend_dev, ifname,
+                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+    if (rc < 0) {
+        pr_dbg("Fail to delete gid\n");
+        return -EINVAL;
+    }
+
+    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
+    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+
+    return 0;
+}
+
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx)
+{
+    if (unlikely(sgid_idx < 0 || sgid_idx > MAX_PORT_GIDS)) {
+        pr_dbg("Got invalid sgid_idx %d\n", sgid_idx);
+        return -EINVAL;
+    }
+
+    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+        rdma_backend_get_gid_index(backend_dev,
+                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+    }
+
+    pr_dbg("backend_gid_index=%d\n",
+           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+
+    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+}
+
 static void destroy_qp_hash_key(gpointer data)
 {
     g_bytes_unref(data);
 }
 
+static void init_ports(RdmaDeviceResources *dev_res)
+{
+    int i, j;
+
+    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+
+    for (i = 0; i < MAX_PORTS; i++) {
+        dev_res->ports[i].state = IBV_PORT_DOWN;
+        for (j = 0; j < MAX_PORT_GIDS; j++) {
+            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
+        }
+    }
+}
+
+static void fini_ports(RdmaDeviceResources *dev_res,
+                       RdmaBackendDev *backend_dev, const char *ifname)
+{
+    int i;
+
+    dev_res->ports[0].state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
+    }
+}
+
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp)
 {
@@ -545,11 +636,16 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                        dev_attr->max_qp_wr, sizeof(void *));
     res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
 
+    init_ports(dev_res);
+
     return 0;
 }
 
-void rdma_rm_fini(RdmaDeviceResources *dev_res)
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname)
 {
+    fini_ports(dev_res, backend_dev, ifname);
+
     res_tbl_free(&dev_res->uc_tbl);
     res_tbl_free(&dev_res->cqe_ctx_tbl);
     res_tbl_free(&dev_res->qp_tbl);
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index b4e04cc7b4..a7169b4e89 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -22,7 +22,8 @@
 
 int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr,
                  Error **errp);
-void rdma_rm_fini(RdmaDeviceResources *dev_res);
+void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                  const char *ifname);
 
 int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
                      uint32_t *pd_handle, uint32_t ctx_handle);
@@ -55,7 +56,7 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
                      uint32_t recv_cq_handle, void *opaque, uint32_t *qpn);
 RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
 int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask,
+                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
                       union ibv_gid *dgid, uint32_t dqpn,
                       enum ibv_qp_state qp_state, uint32_t qkey,
                       uint32_t rq_psn, uint32_t sq_psn);
@@ -69,4 +70,16 @@ int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
 void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
 
+int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, union ibv_gid *gid, int gid_idx);
+int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
+                    const char *ifname, int gid_idx);
+int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
+                                  RdmaBackendDev *backend_dev, int sgid_idx);
+static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
+                                             int sgid_idx)
+{
+    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+}
+
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 9b399063d3..7b3435f991 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -19,7 +19,7 @@
 #include "rdma_backend_defs.h"
 
 #define MAX_PORTS             1
-#define MAX_PORT_GIDS         1
+#define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
 #define MAX_PKEYS             MAX_PORT_PKEYS
@@ -86,8 +86,13 @@ typedef struct RdmaRmQP {
     enum ibv_qp_state qp_state;
 } RdmaRmQP;
 
+typedef struct RdmaRmGid {
+    union ibv_gid gid;
+    int backend_gid_index;
+} RdmaRmGid;
+
 typedef struct RdmaRmPort {
-    union ibv_gid gid_tbl[MAX_PORT_GIDS];
+    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
     enum ibv_port_state state;
 } RdmaRmPort;
 
diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 04c7c2ef5b..989db249ef 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "hw/pci/pci.h"
 #include "sysemu/dma.h"
+#include "stdio.h"
 
 #define pr_info(fmt, ...) \
     fprintf(stdout, "%s: %-20s (%3d): " fmt, "rdma",  __func__, __LINE__,\
@@ -40,9 +41,23 @@ extern unsigned long pr_dbg_cnt;
 #define pr_dbg(fmt, ...) \
     fprintf(stdout, "%lx %ld: %-20s (%3d): " fmt, pthread_self(), pr_dbg_cnt++, \
             __func__, __LINE__, ## __VA_ARGS__)
+
+#define pr_dbg_buf(title, buf, len) \
+{ \
+    char *b = g_malloc0(len * 3 + 1); \
+    char b1[4]; \
+    for (int i = 0; i < len; i++) { \
+        sprintf(b1, "%.2X ", buf[i] & 0x000000FF); \
+        strcat(b, b1); \
+    } \
+    pr_dbg("%s (%d): %s\n", title, len, b); \
+    g_free(b); \
+}
+
 #else
 #define init_pr_dbg(void)
 #define pr_dbg(fmt, ...)
+#define pr_dbg_buf(title, buf, len)
 #endif
 
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 15c3f28b86..b019cb843a 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -79,8 +79,8 @@ typedef struct PVRDMADev {
     int interrupt_mask;
     struct ibv_device_attr dev_attr;
     uint64_t node_guid;
+    char *backend_eth_device_name;
     char *backend_device_name;
-    uint8_t backend_gid_idx;
     uint8_t backend_port_num;
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 57d6f41ae6..a334f6205e 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -504,13 +504,16 @@ static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rsp->hdr.response = cmd->hdr.response;
     rsp->hdr.ack = PVRDMA_CMD_MODIFY_QP_RESP;
 
-    rsp->hdr.err = rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
-                                 cmd->qp_handle, cmd->attr_mask,
-                                 (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
-                                 cmd->attrs.dest_qp_num,
-                                 (enum ibv_qp_state)cmd->attrs.qp_state,
-                                 cmd->attrs.qkey, cmd->attrs.rq_psn,
-                                 cmd->attrs.sq_psn);
+    /* No need to verify sgid_index since it is u8 */
+
+    rsp->hdr.err =
+        rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev, cmd->qp_handle,
+                          cmd->attr_mask, cmd->attrs.ah_attr.grh.sgid_index,
+                          (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
+                          cmd->attrs.dest_qp_num,
+                          (enum ibv_qp_state)cmd->attrs.qp_state,
+                          cmd->attrs.qkey, cmd->attrs.rq_psn,
+                          cmd->attrs.sq_psn);
 
     pr_dbg("ret=%d\n", rsp->hdr.err);
     return rsp->hdr.err;
@@ -570,10 +573,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                        union pvrdma_cmd_resp *rsp)
 {
     struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
-#ifdef PVRDMA_DEBUG
-    __be64 *subnet = (__be64 *)&cmd->new_gid[0];
-    __be64 *if_id = (__be64 *)&cmd->new_gid[8];
-#endif
+    int rc;
+    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
 
     pr_dbg("index=%d\n", cmd->index);
 
@@ -582,19 +583,24 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
-           (long long unsigned int)be64_to_cpu(*subnet),
-           (long long unsigned int)be64_to_cpu(*if_id));
+           (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix),
+           (long long unsigned int)be64_to_cpu(gid->global.interface_id));
 
-    /* Driver forces to one port only */
-    memcpy(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, &cmd->new_gid,
-           sizeof(cmd->new_gid));
+    rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                         dev->backend_eth_device_name, gid, cmd->index);
+    if (rc < 0) {
+        return -EINVAL;
+    }
 
     /* TODO: Since drivers stores node_guid at load_dsr phase then this
      * assignment is not relevant, i need to figure out a way how to
      * retrieve MAC of our netdev */
-    dev->node_guid = dev->rdma_dev_res.ports[0].gid_tbl[0].global.interface_id;
-    pr_dbg("dev->node_guid=0x%llx\n",
-           (long long unsigned int)be64_to_cpu(dev->node_guid));
+    if (!cmd->index) {
+        dev->node_guid =
+            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
+        pr_dbg("dev->node_guid=0x%llx\n",
+               (long long unsigned int)be64_to_cpu(dev->node_guid));
+    }
 
     return 0;
 }
@@ -602,6 +608,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
                         union pvrdma_cmd_resp *rsp)
 {
+    int rc;
+
     struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
 
     pr_dbg("index=%d\n", cmd->index);
@@ -610,8 +618,13 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    memset(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw, 0,
-           sizeof(dev->rdma_dev_res.ports[0].gid_tbl[cmd->index].raw));
+    rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
+                        dev->backend_eth_device_name, cmd->index);
+
+    if (rc < 0) {
+        rsp->hdr.err = rc;
+        goto out;
+    }
 
     return 0;
 }
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fc2abd34af..ac8c092db0 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -36,9 +36,9 @@
 #include "pvrdma_qp_ops.h"
 
 static Property pvrdma_dev_properties[] = {
-    DEFINE_PROP_STRING("backend-dev", PVRDMADev, backend_device_name),
-    DEFINE_PROP_UINT8("backend-port", PVRDMADev, backend_port_num, 1),
-    DEFINE_PROP_UINT8("backend-gid-idx", PVRDMADev, backend_gid_idx, 0),
+    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
+    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
+    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
     DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
                        MAX_MR_SIZE),
     DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
@@ -276,17 +276,6 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     pr_dbg("Initialized\n");
 }
 
-static void init_ports(PVRDMADev *dev, Error **errp)
-{
-    int i;
-
-    memset(dev->rdma_dev_res.ports, 0, sizeof(dev->rdma_dev_res.ports));
-
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev->rdma_dev_res.ports[i].state = IBV_PORT_DOWN;
-    }
-}
-
 static void uninit_msix(PCIDevice *pdev, int used_vectors)
 {
     PVRDMADev *dev = PVRDMA_DEV(pdev);
@@ -335,7 +324,8 @@ static void pvrdma_fini(PCIDevice *pdev)
 
     pvrdma_qp_ops_fini();
 
-    rdma_rm_fini(&dev->rdma_dev_res);
+    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
+                 dev->backend_eth_device_name);
 
     rdma_backend_fini(&dev->backend_dev);
 
@@ -612,8 +602,7 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 
     rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
                            dev->backend_device_name, dev->backend_port_num,
-                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
-                           errp);
+                           &dev->dev_attr, &dev->mad_chr, errp);
     if (rc) {
         goto out;
     }
@@ -623,8 +612,6 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
-    init_ports(dev, errp);
-
     rc = pvrdma_qp_ops_init();
     if (rc) {
         goto out;
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 3388be1926..2130824098 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -131,6 +131,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
     RdmaRmQP *qp;
     PvrdmaSqWqe *wqe;
     PvrdmaRing *ring;
+    int sgid_idx;
+    union ibv_gid *sgid;
 
     pr_dbg("qp_handle=0x%x\n", qp_handle);
 
@@ -156,8 +158,26 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
         comp_ctx->cqe.qp = qp_handle;
         comp_ctx->cqe.opcode = IBV_WC_SEND;
 
+        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
+        if (!sgid) {
+            pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+        pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index,
+               sgid->global.interface_id);
+
+        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
+                                                 &dev->backend_dev,
+                                                 wqe->hdr.wr.ud.av.gid_index);
+        if (sgid_idx <= 0) {
+            pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n",
+                   wqe->hdr.wr.ud.av.gid_index);
+            return -EIO;
+        }
+
         rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
                                (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
+                               sgid_idx, sgid,
                                (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
                                wqe->hdr.wr.ud.remote_qpn,
                                wqe->hdr.wr.ud.remote_qkey, comp_ctx);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 12/22] vmxnet3: Move some definitions to header file
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (10 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 11/22] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-12 13:56   ` Dmitry Fleytman
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

pvrdma setup requires a vmxnet3 device on PCI function 0 and a PVRDMA
device on PCI function 1.
The pvrdma device needs to access the vmxnet3 device object for several
reasons:
1. To make sure PCI function 0 is vmxnet3.
2. To monitor the vmxnet3 device state.
3. To configure node_guid according to the vmxnet3 device's MAC address.

To be able to access the vmxnet3 device, the definition of VMXNET3State
is moved to a new header file.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/net/vmxnet3.c      | 116 +-----------------------------------
 hw/net/vmxnet3_defs.h | 133 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+), 115 deletions(-)
 create mode 100644 hw/net/vmxnet3_defs.h

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 3648630386..54746a4030 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -18,7 +18,6 @@
 #include "qemu/osdep.h"
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
-#include "net/net.h"
 #include "net/tap.h"
 #include "net/checksum.h"
 #include "sysemu/sysemu.h"
@@ -29,6 +28,7 @@
 #include "migration/register.h"
 
 #include "vmxnet3.h"
+#include "vmxnet3_defs.h"
 #include "vmxnet_debug.h"
 #include "vmware_utils.h"
 #include "net_tx_pkt.h"
@@ -131,23 +131,11 @@ typedef struct VMXNET3Class {
     DeviceRealize parent_dc_realize;
 } VMXNET3Class;
 
-#define TYPE_VMXNET3 "vmxnet3"
-#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
-
 #define VMXNET3_DEVICE_CLASS(klass) \
     OBJECT_CLASS_CHECK(VMXNET3Class, (klass), TYPE_VMXNET3)
 #define VMXNET3_DEVICE_GET_CLASS(obj) \
     OBJECT_GET_CLASS(VMXNET3Class, (obj), TYPE_VMXNET3)
 
-/* Cyclic ring abstraction */
-typedef struct {
-    hwaddr pa;
-    uint32_t size;
-    uint32_t cell_size;
-    uint32_t next;
-    uint8_t gen;
-} Vmxnet3Ring;
-
 static inline void vmxnet3_ring_init(PCIDevice *d,
 				     Vmxnet3Ring *ring,
                                      hwaddr pa,
@@ -245,108 +233,6 @@ vmxnet3_dump_rx_descr(struct Vmxnet3_RxDesc *descr)
               descr->rsvd, descr->dtype, descr->ext1, descr->btype);
 }
 
-/* Device state and helper functions */
-#define VMXNET3_RX_RINGS_PER_QUEUE (2)
-
-typedef struct {
-    Vmxnet3Ring tx_ring;
-    Vmxnet3Ring comp_ring;
-
-    uint8_t intr_idx;
-    hwaddr tx_stats_pa;
-    struct UPT1_TxStats txq_stats;
-} Vmxnet3TxqDescr;
-
-typedef struct {
-    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
-    Vmxnet3Ring comp_ring;
-    uint8_t intr_idx;
-    hwaddr rx_stats_pa;
-    struct UPT1_RxStats rxq_stats;
-} Vmxnet3RxqDescr;
-
-typedef struct {
-    bool is_masked;
-    bool is_pending;
-    bool is_asserted;
-} Vmxnet3IntState;
-
-typedef struct {
-        PCIDevice parent_obj;
-        NICState *nic;
-        NICConf conf;
-        MemoryRegion bar0;
-        MemoryRegion bar1;
-        MemoryRegion msix_bar;
-
-        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
-        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
-
-        /* Whether MSI-X support was installed successfully */
-        bool msix_used;
-        hwaddr drv_shmem;
-        hwaddr temp_shared_guest_driver_memory;
-
-        uint8_t txq_num;
-
-        /* This boolean tells whether RX packet being indicated has to */
-        /* be split into head and body chunks from different RX rings  */
-        bool rx_packets_compound;
-
-        bool rx_vlan_stripping;
-        bool lro_supported;
-
-        uint8_t rxq_num;
-
-        /* Network MTU */
-        uint32_t mtu;
-
-        /* Maximum number of fragments for indicated TX packets */
-        uint32_t max_tx_frags;
-
-        /* Maximum number of fragments for indicated RX packets */
-        uint16_t max_rx_frags;
-
-        /* Index for events interrupt */
-        uint8_t event_int_idx;
-
-        /* Whether automatic interrupts masking enabled */
-        bool auto_int_masking;
-
-        bool peer_has_vhdr;
-
-        /* TX packets to QEMU interface */
-        struct NetTxPkt *tx_pkt;
-        uint32_t offload_mode;
-        uint32_t cso_or_gso_size;
-        uint16_t tci;
-        bool needs_vlan;
-
-        struct NetRxPkt *rx_pkt;
-
-        bool tx_sop;
-        bool skip_current_tx_pkt;
-
-        uint32_t device_active;
-        uint32_t last_command;
-
-        uint32_t link_status_and_speed;
-
-        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
-
-        uint32_t temp_mac;   /* To store the low part first */
-
-        MACAddr perm_mac;
-        uint32_t vlan_table[VMXNET3_VFT_SIZE];
-        uint32_t rx_mode;
-        MACAddr *mcast_list;
-        uint32_t mcast_list_len;
-        uint32_t mcast_list_buff_size; /* needed for live migration. */
-
-        /* Compatibility flags for migration */
-        uint32_t compat_flags;
-} VMXNET3State;
-
 /* Interrupt management */
 
 /*
diff --git a/hw/net/vmxnet3_defs.h b/hw/net/vmxnet3_defs.h
new file mode 100644
index 0000000000..6c19d29b12
--- /dev/null
+++ b/hw/net/vmxnet3_defs.h
@@ -0,0 +1,133 @@
+/*
+ * QEMU VMWARE VMXNET3 paravirtual NIC
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "net/net.h"
+#include "hw/net/vmxnet3.h"
+
+#define TYPE_VMXNET3 "vmxnet3"
+#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
+
+/* Device state and helper functions */
+#define VMXNET3_RX_RINGS_PER_QUEUE (2)
+
+/* Cyclic ring abstraction */
+typedef struct {
+    hwaddr pa;
+    uint32_t size;
+    uint32_t cell_size;
+    uint32_t next;
+    uint8_t gen;
+} Vmxnet3Ring;
+
+typedef struct {
+    Vmxnet3Ring tx_ring;
+    Vmxnet3Ring comp_ring;
+
+    uint8_t intr_idx;
+    hwaddr tx_stats_pa;
+    struct UPT1_TxStats txq_stats;
+} Vmxnet3TxqDescr;
+
+typedef struct {
+    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
+    Vmxnet3Ring comp_ring;
+    uint8_t intr_idx;
+    hwaddr rx_stats_pa;
+    struct UPT1_RxStats rxq_stats;
+} Vmxnet3RxqDescr;
+
+typedef struct {
+    bool is_masked;
+    bool is_pending;
+    bool is_asserted;
+} Vmxnet3IntState;
+
+typedef struct {
+        PCIDevice parent_obj;
+        NICState *nic;
+        NICConf conf;
+        MemoryRegion bar0;
+        MemoryRegion bar1;
+        MemoryRegion msix_bar;
+
+        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
+        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
+
+        /* Whether MSI-X support was installed successfully */
+        bool msix_used;
+        hwaddr drv_shmem;
+        hwaddr temp_shared_guest_driver_memory;
+
+        uint8_t txq_num;
+
+        /* This boolean tells whether RX packet being indicated has to */
+        /* be split into head and body chunks from different RX rings  */
+        bool rx_packets_compound;
+
+        bool rx_vlan_stripping;
+        bool lro_supported;
+
+        uint8_t rxq_num;
+
+        /* Network MTU */
+        uint32_t mtu;
+
+        /* Maximum number of fragments for indicated TX packets */
+        uint32_t max_tx_frags;
+
+        /* Maximum number of fragments for indicated RX packets */
+        uint16_t max_rx_frags;
+
+        /* Index for events interrupt */
+        uint8_t event_int_idx;
+
+        /* Whether automatic interrupts masking enabled */
+        bool auto_int_masking;
+
+        bool peer_has_vhdr;
+
+        /* TX packets to QEMU interface */
+        struct NetTxPkt *tx_pkt;
+        uint32_t offload_mode;
+        uint32_t cso_or_gso_size;
+        uint16_t tci;
+        bool needs_vlan;
+
+        struct NetRxPkt *rx_pkt;
+
+        bool tx_sop;
+        bool skip_current_tx_pkt;
+
+        uint32_t device_active;
+        uint32_t last_command;
+
+        uint32_t link_status_and_speed;
+
+        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
+
+        uint32_t temp_mac;   /* To store the low part first */
+
+        MACAddr perm_mac;
+        uint32_t vlan_table[VMXNET3_VFT_SIZE];
+        uint32_t rx_mode;
+        MACAddr *mcast_list;
+        uint32_t mcast_list_len;
+        uint32_t mcast_list_buff_size; /* needed for live migration. */
+
+        /* Compatibility flags for migration */
+        uint32_t compat_flags;
+} VMXNET3State;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (11 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 12/22] vmxnet3: Move some definitions to header file Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-10 18:27   ` Marcel Apfelbaum
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 14/22] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The guest driver enforces that PCI function 0 is a vmxnet3 device; the
emulated device should enforce it as well.
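
A rough sketch of how the requirement could be reported gracefully at
realize time (illustrative only; the patch below relies on the QOM cast,
which asserts on a type mismatch):

    PCIDevice *func0 = pci_get_function_0(pdev);

    if (!object_dynamic_cast(OBJECT(func0), TYPE_VMXNET3)) {
        error_setg(errp, "PCI function 0 of the slot must be a vmxnet3 device");
        return;
    }
    dev->func0 = VMXNET3(func0);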

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      | 2 ++
 hw/rdma/vmw/pvrdma_main.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index b019cb843a..10a3c4fb7c 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -20,6 +20,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
+#include "hw/net/vmxnet3_defs.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -85,6 +86,7 @@ typedef struct PVRDMADev {
     RdmaBackendDev backend_dev;
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
+    VMXNET3State *func0;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ac8c092db0..fa6468d221 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
+    /* Break if not vmxnet3 device in slot 0 */
+    dev->func0 = VMXNET3(pci_get_function_0(pdev));
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 14/22] hw/rdma: Initialize node_guid from vmxnet3 mac address
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (12 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 15/22] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

node_guid should be set once the device is loaded.
Make node_guid the EUI-64 (64-bit) GID interface ID derived from the MAC
address of the vmxnet3 device at PCI function 0.

A new helper function was added to do the conversion.
For example, the MAC 56:b6:44:e9:62:dc is converted to the GID interface ID
54b6:44ff:fee9:62dc.
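
An illustrative walk-through of that example using the new helper (a sketch
only, not part of the patch):

    /*
     * MAC 56:b6:44:e9:62:dc
     *   copy the first 3 bytes        -> 56 b6 44 .. .. .. .. ..
     *   insert ff fe in the middle    -> 56 b6 44 ff fe .. .. ..
     *   copy the last 3 bytes         -> 56 b6 44 ff fe e9 62 dc
     *   flip the universal/local bit  -> 0x56 ^ 0x02 = 0x54
     *   EUI-64 interface ID           -> 54 b6 44 ff fe e9 62 dc
     *                                    i.e. 54b6:44ff:fee9:62dc
     */
    uint8_t guid[8];
    const uint8_t mac[6] = { 0x56, 0xb6, 0x44, 0xe9, 0x62, 0xdc };

    addrconf_addr_eui48(guid, (const char *)mac);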

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_utils.h      |  9 +++++++++
 hw/rdma/vmw/pvrdma_cmd.c  | 10 ----------
 hw/rdma/vmw/pvrdma_main.c |  5 ++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 989db249ef..202abb3366 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -63,4 +63,13 @@ extern unsigned long pr_dbg_cnt;
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
 void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
 
+static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
+{
+    memcpy(eui, addr, 3);
+    eui[3] = 0xFF;
+    eui[4] = 0xFE;
+    memcpy(eui + 5, addr + 3, 3);
+    eui[0] ^= 2;
+}
+
 #endif
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index a334f6205e..2979582fac 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -592,16 +592,6 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         return -EINVAL;
     }
 
-    /* TODO: Since drivers stores node_guid at load_dsr phase then this
-     * assignment is not relevant, i need to figure out a way how to
-     * retrieve MAC of our netdev */
-    if (!cmd->index) {
-        dev->node_guid =
-            dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
-        pr_dbg("dev->node_guid=0x%llx\n",
-               (long long unsigned int)be64_to_cpu(dev->node_guid));
-    }
-
     return 0;
 }
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fa6468d221..95e9322b7c 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -264,7 +264,7 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
     dsr->caps.sys_image_guid = 0;
     pr_dbg("sys_image_guid=%" PRIx64 "\n", dsr->caps.sys_image_guid);
 
-    dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+    dsr->caps.node_guid = dev->node_guid;
     pr_dbg("node_guid=%" PRIx64 "\n", be64_to_cpu(dsr->caps.node_guid));
 
     dsr->caps.phys_port_cnt = MAX_PORTS;
@@ -579,6 +579,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
     /* Break if not vmxnet3 device in slot 0 */
     dev->func0 = VMXNET3(pci_get_function_0(pdev));
 
+    addrconf_addr_eui48((unsigned char *)&dev->node_guid,
+                        (const char *)&dev->func0->conf.macaddr.a);
+
     memdev_root = object_resolve_path("/objects", NULL);
     if (memdev_root) {
         object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 15/22] hw/pvrdma: Make device state depend on Ethernet function state
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (13 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 14/22] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 16/22] hw/pvrdma: Fill all CQE fields Yuval Shaia
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The user should be able to control the device by changing the state of its
Ethernet function, so when the user runs 'ifconfig ens3 down' the PVRDMA
function should report its port as down as well.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 2979582fac..0d3c818c20 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -139,7 +139,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
     resp->hdr.err = 0;
 
-    resp->attrs.state = attrs.state;
+    resp->attrs.state = dev->func0->device_active ? attrs.state :
+                                                    PVRDMA_PORT_DOWN;
     resp->attrs.max_mtu = attrs.max_mtu;
     resp->attrs.active_mtu = attrs.active_mtu;
     resp->attrs.phys_state = attrs.phys_state;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 16/22] hw/pvrdma: Fill all CQE fields
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (14 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 15/22] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 17/22] hw/pvrdma: Fill error code in command's response Yuval Shaia
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Add the ability to pass specific WC (work completion) attributes, such as
the IBV_WC_GRH flag, through to the CQE.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_backend.c      | 59 +++++++++++++++++++++++--------------
 hw/rdma/rdma_backend.h      |  4 +--
 hw/rdma/vmw/pvrdma_qp_ops.c | 31 +++++++++++--------
 3 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 5675504165..e453bda8f9 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -59,13 +59,24 @@ struct backend_umad {
     char mad[RDMA_MAX_PRIVATE_DATA];
 };
 
-static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
+static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
 
-static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     pr_err("No completion handler is registered\n");
 }
 
+static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
+                                 void *ctx)
+{
+    struct ibv_wc wc = {0};
+
+    wc.status = status;
+    wc.vendor_err = vendor_err;
+
+    comp_handler(ctx, &wc);
+}
+
 static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
 {
     int i, ne;
@@ -90,7 +101,7 @@ static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
             }
             pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
 
-            comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
+            comp_handler(bctx->up_ctx, &wc[i]);
 
             rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
             g_free(bctx);
@@ -184,8 +195,8 @@ static void start_comp_thread(RdmaBackendDev *backend_dev)
                        comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
 }
 
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx))
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                         struct ibv_wc *wc))
 {
     comp_handler = handler;
 }
@@ -369,14 +380,14 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         } else if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
             } else {
-                comp_handler(IBV_WC_SUCCESS, 0, ctx);
+                complete_work(IBV_WC_SUCCESS, 0, ctx);
             }
         }
         return;
@@ -385,7 +396,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -396,21 +407,21 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
     if (qp_type == IBV_QPT_UD) {
         wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
         if (!wr.wr.ud.ah) {
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
             goto out_dealloc_cqe_ctx;
         }
         wr.wr.ud.remote_qpn = dqpn;
@@ -428,7 +439,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -497,13 +508,13 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
         if (qp_type == IBV_QPT_SMI) {
             pr_dbg("QP0 unsupported\n");
-            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
         }
         if (qp_type == IBV_QPT_GSI) {
             pr_dbg("QP1\n");
             rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
             if (rc) {
-                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+                complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
             }
         }
         return;
@@ -512,7 +523,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     pr_dbg("num_sge=%d\n", num_sge);
     if (!num_sge) {
         pr_dbg("num_sge=0\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
         return;
     }
 
@@ -523,14 +534,14 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     rc = rdma_rm_alloc_cqe_ctx(rdma_dev_res, &bctx_id, bctx);
     if (unlikely(rc)) {
         pr_dbg("Failed to allocate cqe_ctx\n");
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
         goto out_free_bctx;
     }
 
     rc = build_host_sge_array(rdma_dev_res, new_sge, sge, num_sge);
     if (rc) {
         pr_dbg("Error: Failed to build host SGE array\n");
-        comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -542,7 +553,7 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
     if (rc) {
         pr_dbg("Fail (%d, %d) to post recv WQE to qpn %d\n", rc, errno,
                 qp->ibqp->qp_num);
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
         goto out_dealloc_cqe_ctx;
     }
 
@@ -926,9 +937,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
     mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
                            bctx->sge.length);
     if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
-        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
-                     bctx->up_ctx);
+        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
+                      bctx->up_ctx);
     } else {
+        struct ibv_wc wc = {0};
         pr_dbg_buf("mad", msg->umad.mad, msg->umad_len);
         memset(mad, 0, bctx->sge.length);
         build_mad_hdr((struct ibv_grh *)mad,
@@ -937,7 +949,10 @@ static void mad_read(void *opaque, const uint8_t *buf, int size)
         memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
         rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
 
-        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
+        wc.byte_len = msg->umad_len;
+        wc.status = IBV_WC_SUCCESS;
+        wc.wc_flags = IBV_WC_GRH;
+        comp_handler(bctx->up_ctx, &wc);
     }
 
     g_free(bctx);
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 59ad2b874b..8cae40f827 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -57,8 +57,8 @@ int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
                                union ibv_gid *gid);
 void rdma_backend_start(RdmaBackendDev *backend_dev);
 void rdma_backend_stop(RdmaBackendDev *backend_dev);
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-                                        unsigned int vendor_err, void *ctx));
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+                                                        struct ibv_wc *wc));
 void rdma_backend_unregister_comp_handler(void);
 
 int rdma_backend_query_port(RdmaBackendDev *backend_dev,
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 2130824098..300471a4c9 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -47,7 +47,7 @@ typedef struct PvrdmaRqWqe {
  * 3. Interrupt host
  */
 static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
-                           struct pvrdma_cqe *cqe)
+                           struct pvrdma_cqe *cqe, struct ibv_wc *wc)
 {
     struct pvrdma_cqe *cqe1;
     struct pvrdma_cqne *cqne;
@@ -66,6 +66,7 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     pr_dbg("Writing CQE\n");
     cqe1 = pvrdma_ring_next_elem_write(ring);
     if (unlikely(!cqe1)) {
+        pr_dbg("No CQEs in ring\n");
         return -EINVAL;
     }
 
@@ -73,8 +74,20 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     cqe1->wr_id = cqe->wr_id;
     cqe1->qp = cqe->qp;
     cqe1->opcode = cqe->opcode;
-    cqe1->status = cqe->status;
-    cqe1->vendor_err = cqe->vendor_err;
+    cqe1->status = wc->status;
+    cqe1->byte_len = wc->byte_len;
+    cqe1->src_qp = wc->src_qp;
+    cqe1->wc_flags = wc->wc_flags;
+    cqe1->vendor_err = wc->vendor_err;
+
+    pr_dbg("wr_id=%" PRIx64 "\n", cqe1->wr_id);
+    pr_dbg("qp=0x%lx\n", cqe1->qp);
+    pr_dbg("opcode=%d\n", cqe1->opcode);
+    pr_dbg("status=%d\n", cqe1->status);
+    pr_dbg("byte_len=%d\n", cqe1->byte_len);
+    pr_dbg("src_qp=%d\n", cqe1->src_qp);
+    pr_dbg("wc_flags=%d\n", cqe1->wc_flags);
+    pr_dbg("vendor_err=%d\n", cqe1->vendor_err);
 
     pvrdma_ring_write_inc(ring);
 
@@ -99,18 +112,12 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
     return 0;
 }
 
-static void pvrdma_qp_ops_comp_handler(int status, unsigned int vendor_err,
-                                       void *ctx)
+static void pvrdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
 {
     CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
 
-    pr_dbg("cq_handle=%d\n", comp_ctx->cq_handle);
-    pr_dbg("wr_id=%" PRIx64 "\n", comp_ctx->cqe.wr_id);
-    pr_dbg("status=%d\n", status);
-    pr_dbg("vendor_err=0x%x\n", vendor_err);
-    comp_ctx->cqe.status = status;
-    comp_ctx->cqe.vendor_err = vendor_err;
-    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe);
+    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe, wc);
+
     g_free(ctx);
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 17/22] hw/pvrdma: Fill error code in command's response
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (15 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 16/22] hw/pvrdma: Fill all CQE fields Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 18/22] hw/rdma: Remove unneeded code that handles more than one port Yuval Shaia
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The guest driver checks the error code in the command's response, so make
sure every command handler sets it.
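
The pattern every handler converges on looks roughly like this (destroy_foo,
cmd_destroy_foo and foo_exists are made-up names, for illustration only):

    static int destroy_foo(PVRDMADev *dev, union pvrdma_cmd_req *req,
                           union pvrdma_cmd_resp *rsp)
    {
        struct cmd_destroy_foo *cmd = &req->destroy_foo;   /* hypothetical */

        /* Status goes both into the response header, which the guest driver
         * reads, and into the return value, which ends up in PVRDMA_REG_ERR */
        rsp->hdr.err = foo_exists(dev, cmd->handle) ? 0 : -EINVAL;

        pr_dbg("ret=%d\n", rsp->hdr.err);
        return rsp->hdr.err;
    }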

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma_cmd.c | 67 ++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 0d3c818c20..a326c5d470 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     if (rdma_backend_query_port(&dev->backend_dev,
                                 (struct ibv_port_attr *)&attrs)) {
-        return -ENOMEM;
+        resp->hdr.err = -ENOMEM;
+        goto out;
     }
 
     memset(resp, 0, sizeof(*resp));
@@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->attrs.active_width = 1;
     resp->attrs.active_speed = 1;
 
-    return 0;
+out:
+    pr_dbg("ret=%d\n", resp->hdr.err);
+    return resp->hdr.err;
 }
 
 static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->pkey = PVRDMA_PKEY;
     pr_dbg("pkey=0x%x\n", resp->pkey);
 
-    return 0;
+    return resp->hdr.err;
 }
 
 static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 
 static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
@@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
     cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
     if (!cq) {
         pr_dbg("Invalid CQ handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     ring = (PvrdmaRing *)cq->opaque;
@@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
@@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
     if (!qp) {
         pr_dbg("Invalid QP handle\n");
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
@@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
     g_free(ring);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
@@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
                          dev->backend_eth_device_name, gid, cmd->index);
     if (rc < 0) {
-        return -EINVAL;
+        rsp->hdr.err = rc;
+        goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
     pr_dbg("index=%d\n", cmd->index);
 
     if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
+        rsp->hdr.err = -EINVAL;
+        goto out;
     }
 
     rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
@@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
         goto out;
     }
 
-    return 0;
+    rsp->hdr.err = 0;
+
+out:
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
     resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
                                      &resp->ctx_handle);
 
-    pr_dbg("ret=%d\n", resp->hdr.err);
-
-    return 0;
+    pr_dbg("ret=%d\n", rsp->hdr.err);
+    return rsp->hdr.err;
 }
 
 static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -648,7 +675,9 @@ static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
     rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
 
-    return 0;
+    rsp->hdr.err = 0;
+
+    return rsp->hdr.err;
 }
 struct cmd_handler {
     uint32_t cmd;
@@ -696,7 +725,7 @@ int execute_command(PVRDMADev *dev)
     }
 
     err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
-                            dsr_info->rsp);
+                                                    dsr_info->rsp);
 out:
     set_reg_val(dev, PVRDMA_REG_ERR, err);
     post_interrupt(dev, INTR_VEC_CMD_RING);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 18/22] hw/rdma: Remove unneeded code that handles more than one port
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (16 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 17/22] hw/pvrdma: Fill error code in command's response Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers Yuval Shaia
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The device supports only one port, so remove the dead code that handles
more than one port.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c      | 34 ++++++++++++++++------------------
 hw/rdma/rdma_rm.h      |  2 +-
 hw/rdma/rdma_rm_defs.h |  4 ++--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index fe0979415d..0a5ab8935a 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -545,7 +545,7 @@ int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
         return -EINVAL;
     }
 
-    memcpy(&dev_res->ports[0].gid_tbl[gid_idx].gid, gid, sizeof(*gid));
+    memcpy(&dev_res->port.gid_tbl[gid_idx].gid, gid, sizeof(*gid));
 
     return 0;
 }
@@ -556,15 +556,15 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
     int rc;
 
     rc = rdma_backend_del_gid(backend_dev, ifname,
-                              &dev_res->ports[0].gid_tbl[gid_idx].gid);
+                              &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc < 0) {
         pr_dbg("Fail to delete gid\n");
         return -EINVAL;
     }
 
-    memset(dev_res->ports[0].gid_tbl[gid_idx].gid.raw, 0,
-           sizeof(dev_res->ports[0].gid_tbl[gid_idx].gid));
-    dev_res->ports[0].gid_tbl[gid_idx].backend_gid_index = -1;
+    memset(dev_res->port.gid_tbl[gid_idx].gid.raw, 0,
+           sizeof(dev_res->port.gid_tbl[gid_idx].gid));
+    dev_res->port.gid_tbl[gid_idx].backend_gid_index = -1;
 
     return 0;
 }
@@ -577,16 +577,16 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
         return -EINVAL;
     }
 
-    if (unlikely(dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index == -1)) {
-        dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index =
+    if (unlikely(dev_res->port.gid_tbl[sgid_idx].backend_gid_index == -1)) {
+        dev_res->port.gid_tbl[sgid_idx].backend_gid_index =
         rdma_backend_get_gid_index(backend_dev,
-                                       &dev_res->ports[0].gid_tbl[sgid_idx].gid);
+                                   &dev_res->port.gid_tbl[sgid_idx].gid);
     }
 
     pr_dbg("backend_gid_index=%d\n",
-           dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index);
+           dev_res->port.gid_tbl[sgid_idx].backend_gid_index);
 
-    return dev_res->ports[0].gid_tbl[sgid_idx].backend_gid_index;
+    return dev_res->port.gid_tbl[sgid_idx].backend_gid_index;
 }
 
 static void destroy_qp_hash_key(gpointer data)
@@ -596,15 +596,13 @@ static void destroy_qp_hash_key(gpointer data)
 
 static void init_ports(RdmaDeviceResources *dev_res)
 {
-    int i, j;
+    int i;
 
-    memset(dev_res->ports, 0, sizeof(dev_res->ports));
+    memset(&dev_res->port, 0, sizeof(dev_res->port));
 
-    for (i = 0; i < MAX_PORTS; i++) {
-        dev_res->ports[i].state = IBV_PORT_DOWN;
-        for (j = 0; j < MAX_PORT_GIDS; j++) {
-            dev_res->ports[i].gid_tbl[j].backend_gid_index = -1;
-        }
+    dev_res->port.state = IBV_PORT_DOWN;
+    for (i = 0; i < MAX_PORT_GIDS; i++) {
+        dev_res->port.gid_tbl[i].backend_gid_index = -1;
     }
 }
 
@@ -613,7 +611,7 @@ static void fini_ports(RdmaDeviceResources *dev_res,
 {
     int i;
 
-    dev_res->ports[0].state = IBV_PORT_DOWN;
+    dev_res->port.state = IBV_PORT_DOWN;
     for (i = 0; i < MAX_PORT_GIDS; i++) {
         rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
     }
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
index a7169b4e89..3c602c04c0 100644
--- a/hw/rdma/rdma_rm.h
+++ b/hw/rdma/rdma_rm.h
@@ -79,7 +79,7 @@ int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
 static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
                                              int sgid_idx)
 {
-    return &dev_res->ports[0].gid_tbl[sgid_idx].gid;
+    return &dev_res->port.gid_tbl[sgid_idx].gid;
 }
 
 #endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
index 7b3435f991..0ba61d1838 100644
--- a/hw/rdma/rdma_rm_defs.h
+++ b/hw/rdma/rdma_rm_defs.h
@@ -18,7 +18,7 @@
 
 #include "rdma_backend_defs.h"
 
-#define MAX_PORTS             1
+#define MAX_PORTS             1 /* Do not change - we support only one port */
 #define MAX_PORT_GIDS         255
 #define MAX_GIDS              MAX_PORT_GIDS
 #define MAX_PORT_PKEYS        1
@@ -97,7 +97,7 @@ typedef struct RdmaRmPort {
 } RdmaRmPort;
 
 typedef struct RdmaDeviceResources {
-    RdmaRmPort ports[MAX_PORTS];
+    RdmaRmPort port;
     RdmaRmResTbl pd_tbl;
     RdmaRmResTbl mr_tbl;
     RdmaRmResTbl uc_tbl;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (17 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 18/22] hw/rdma: Remove unneeded code that handles more than one port Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:26   ` Cornelia Huck
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 20/22] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

The notifier will be used to signal the shutdown event, informing listeners
that the system is going down. This allows devices and other components to
run the cleanup code they need before the VM is shut down.
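
A usage sketch from the device side (MyDevState, MYDEV and mydev_cleanup are
hypothetical names; patch 20 in this series does the equivalent for pvrdma):

    static void mydev_shutdown_notifier(Notifier *n, void *opaque)
    {
        MyDevState *dev = container_of(n, MyDevState, shutdown_notifier);

        /* Release host-side resources before QEMU exits */
        mydev_cleanup(dev);
    }

    static void mydev_realize(PCIDevice *pdev, Error **errp)
    {
        MyDevState *dev = MYDEV(pdev);

        dev->shutdown_notifier.notify = mydev_shutdown_notifier;
        qemu_register_shutdown_notifier(&dev->shutdown_notifier);
    }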

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 include/sysemu/sysemu.h |  1 +
 vl.c                    | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8d6095d98b..0d15f16492 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -80,6 +80,7 @@ void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_system_shutdown_request(ShutdownCause reason);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
+void qemu_register_shutdown_notifier(Notifier *notifier);
 void qemu_system_debug_request(void);
 void qemu_system_vmstop_request(RunState reason);
 void qemu_system_vmstop_request_prepare(void);
diff --git a/vl.c b/vl.c
index 1fcacc5caa..c5ba750f3e 100644
--- a/vl.c
+++ b/vl.c
@@ -1578,6 +1578,8 @@ static NotifierList suspend_notifiers =
     NOTIFIER_LIST_INITIALIZER(suspend_notifiers);
 static NotifierList wakeup_notifiers =
     NOTIFIER_LIST_INITIALIZER(wakeup_notifiers);
+static NotifierList shutdown_notifiers =
+    NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
 
 ShutdownCause qemu_shutdown_requested_get(void)
@@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
     notifier_list_notify(&powerdown_notifiers, NULL);
 }
 
+static void qemu_system_shutdown(bool by_guest)
+{
+    qapi_event_send_shutdown(by_guest);
+    notifier_list_notify(&shutdown_notifiers, NULL);
+}
+
 void qemu_system_powerdown_request(void)
 {
     trace_qemu_system_powerdown_request();
@@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
     notifier_list_add(&powerdown_notifiers, notifier);
 }
 
+void qemu_register_shutdown_notifier(Notifier *notifier)
+{
+    notifier_list_add(&shutdown_notifiers, notifier);
+}
+
 void qemu_system_debug_request(void)
 {
     debug_requested = 1;
@@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
     request = qemu_shutdown_requested();
     if (request) {
         qemu_kill_report();
-        qapi_event_send_shutdown(shutdown_caused_by_guest(request));
+        qemu_system_shutdown(shutdown_caused_by_guest(request));
         if (no_shutdown) {
             vm_stop(RUN_STATE_SHUTDOWN);
         } else {
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 20/22] hw/pvrdma: Clean device's resource when system is shutdown
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (18 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 21/22] rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 22/22] rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

In order to clean up external resources such as GIDs, QPs etc., register to
receive a notification when the VM is shut down.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/vmw/pvrdma.h      |  2 ++
 hw/rdma/vmw/pvrdma_main.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 10a3c4fb7c..ffae36986e 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -17,6 +17,7 @@
 #define PVRDMA_PVRDMA_H
 
 #include "qemu/units.h"
+#include "qemu/notify.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
@@ -87,6 +88,7 @@ typedef struct PVRDMADev {
     RdmaDeviceResources rdma_dev_res;
     CharBackend mad_chr;
     VMXNET3State *func0;
+    Notifier shutdown_notifier;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 95e9322b7c..45a59cddf9 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -24,6 +24,7 @@
 #include "hw/qdev-properties.h"
 #include "cpu.h"
 #include "trace.h"
+#include "sysemu/sysemu.h"
 
 #include "../rdma_rm.h"
 #include "../rdma_backend.h"
@@ -559,6 +560,14 @@ static int pvrdma_check_ram_shared(Object *obj, void *opaque)
     return 0;
 }
 
+static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
+{
+    PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+    pvrdma_fini(pci_dev);
+}
+
 static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 {
     int rc;
@@ -623,6 +632,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
         goto out;
     }
 
+    dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
+    qemu_register_shutdown_notifier(&dev->shutdown_notifier);
+
 out:
     if (rc) {
         error_append_hint(errp, "Device fail to load\n");
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 21/22] rdma: Do not use bitmap_zero_extend to free bitmap
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (19 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 20/22] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 22/22] rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 0a5ab8935a..35a96d9a64 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -43,7 +43,7 @@ static inline void res_tbl_free(RdmaRmResTbl *tbl)
 {
     qemu_mutex_destroy(&tbl->lock);
     g_free(tbl->tbl);
-    bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
+    g_free(tbl->bitmap);
 }
 
 static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH v2 22/22] rdma: Do not call rdma_backend_del_gid on an empty gid
  2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
                   ` (20 preceding siblings ...)
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 21/22] rdma: Do not use bitmap_zero_extend to free bitmap Yuval Shaia
@ 2018-11-08 16:08 ` Yuval Shaia
  21 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 16:08 UTC (permalink / raw)
  To: yuval.shaia, marcel.apfelbaum, dmitry.fleytman, jasowang, eblake,
	armbru, pbonzini, qemu-devel, shamir.rabinovitch

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 hw/rdma/rdma_rm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 35a96d9a64..e3f6b2f6ea 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -555,6 +555,10 @@ int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
 {
     int rc;
 
+    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
+        return 0;
+    }
+
     rc = rdma_backend_del_gid(backend_dev, ifname,
                               &dev_res->port.gid_tbl[gid_idx].gid);
     if (rc < 0) {
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers Yuval Shaia
@ 2018-11-08 16:26   ` Cornelia Huck
  2018-11-08 20:45     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Cornelia Huck @ 2018-11-08 16:26 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, shamir.rabinovitch

On Thu,  8 Nov 2018 18:08:15 +0200
Yuval Shaia <yuval.shaia@oracle.com> wrote:

> Notifier will be used for signaling shutdown event to inform system is
> shutdown. This will allow devices and other component to run some
> cleanup code needed before VM is shutdown.
> 
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  include/sysemu/sysemu.h |  1 +
>  vl.c                    | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 

> @@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
>      notifier_list_notify(&powerdown_notifiers, NULL);
>  }
>  
> +static void qemu_system_shutdown(bool by_guest)

I would pass the shutdown reason here directly (instead of only whether
this was triggered by the guest or not)...

> +{
> +    qapi_event_send_shutdown(by_guest);
> +    notifier_list_notify(&shutdown_notifiers, NULL);

...and also pass it to the notifiers here. If we have the info anyway,
why not simply pass it along.
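
Something along these lines (untested sketch, just to illustrate):

    static void qemu_system_shutdown(ShutdownCause cause)
    {
        qapi_event_send_shutdown(shutdown_caused_by_guest(cause));
        notifier_list_notify(&shutdown_notifiers, &cause);
    }

The notifier callbacks could then read the cause back from their void *data
argument.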

> +}
> +
>  void qemu_system_powerdown_request(void)
>  {
>      trace_qemu_system_powerdown_request();
> @@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
>      notifier_list_add(&powerdown_notifiers, notifier);
>  }
>  
> +void qemu_register_shutdown_notifier(Notifier *notifier)
> +{
> +    notifier_list_add(&shutdown_notifiers, notifier);
> +}
> +
>  void qemu_system_debug_request(void)
>  {
>      debug_requested = 1;
> @@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
>      request = qemu_shutdown_requested();
>      if (request) {
>          qemu_kill_report();
> -        qapi_event_send_shutdown(shutdown_caused_by_guest(request));
> +        qemu_system_shutdown(shutdown_caused_by_guest(request));
>          if (no_shutdown) {
>              vm_stop(RUN_STATE_SHUTDOWN);
>          } else {

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers
  2018-11-08 16:26   ` Cornelia Huck
@ 2018-11-08 20:45     ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-08 20:45 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, shamir.rabinovitch

On Thu, Nov 08, 2018 at 05:26:06PM +0100, Cornelia Huck wrote:
> On Thu,  8 Nov 2018 18:08:15 +0200
> Yuval Shaia <yuval.shaia@oracle.com> wrote:
> 
> > Notifier will be used for signaling shutdown event to inform system is
> > shutdown. This will allow devices and other component to run some
> > cleanup code needed before VM is shutdown.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >  include/sysemu/sysemu.h |  1 +
> >  vl.c                    | 15 ++++++++++++++-
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> > 
> 
> > @@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
> >      notifier_list_notify(&powerdown_notifiers, NULL);
> >  }
> >  
> > +static void qemu_system_shutdown(bool by_guest)
> 
> I would pass the shutdown reason here directly (instead of only whether
> this was triggered by the guest or not)...
> 
> > +{
> > +    qapi_event_send_shutdown(by_guest);
> > +    notifier_list_notify(&shutdown_notifiers, NULL);
> 
> ...and also pass it to the notifiers here. If we have the info anyway,
> why not simply pass it along.

Agreed, makes sense.

> 
> > +}
> > +
> >  void qemu_system_powerdown_request(void)
> >  {
> >      trace_qemu_system_powerdown_request();
> > @@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
> >      notifier_list_add(&powerdown_notifiers, notifier);
> >  }
> >  
> > +void qemu_register_shutdown_notifier(Notifier *notifier)
> > +{
> > +    notifier_list_add(&shutdown_notifiers, notifier);
> > +}
> > +
> >  void qemu_system_debug_request(void)
> >  {
> >      debug_requested = 1;
> > @@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
> >      request = qemu_shutdown_requested();
> >      if (request) {
> >          qemu_kill_report();
> > -        qapi_event_send_shutdown(shutdown_caused_by_guest(request));
> > +        qemu_system_shutdown(shutdown_caused_by_guest(request));
> >          if (no_shutdown) {
> >              vm_stop(RUN_STATE_SHUTDOWN);
> >          } else {
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/22] hw/rdma: Add ability to force notification without re-arm
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 02/22] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
@ 2018-11-10 17:56   ` Marcel Apfelbaum
  0 siblings, 0 replies; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 17:56 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:07 PM, Yuval Shaia wrote:
> Upon completion of incoming packet the device pushes CQE to driver's RX
> ring and notify the driver (msix).
> While for data-path incoming packets the driver needs the ability to
> control whether it wished to receive interrupts or not, for control-path
> packets such as incoming MAD the driver needs to be notified anyway, it
> even do not need to re-arm the notification bit.
>
> Enhance the notification field to support this.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_rm.c           | 12 ++++++++++--
>   hw/rdma/rdma_rm_defs.h      |  8 +++++++-
>   hw/rdma/vmw/pvrdma_qp_ops.c |  6 ++++--
>   3 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index 8d59a42cd1..4f10fcabcc 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -263,7 +263,7 @@ int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
>       }
>   
>       cq->opaque = opaque;
> -    cq->notify = false;
> +    cq->notify = CNT_CLEAR;
>   
>       rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
>       if (rc) {
> @@ -291,7 +291,10 @@ void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
>           return;
>       }
>   
> -    cq->notify = notify;
> +    if (cq->notify != CNT_SET) {
> +        cq->notify = notify ? CNT_ARM : CNT_CLEAR;
> +    }
> +
>       pr_dbg("notify=%d\n", cq->notify);
>   }
>   
> @@ -349,6 +352,11 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
>           return -EINVAL;
>       }
>   
> +    if (qp_type == IBV_QPT_GSI) {
> +        scq->notify = CNT_SET;
> +        rcq->notify = CNT_SET;
> +    }
> +
>       qp = res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
>       if (!qp) {
>           return -ENOMEM;
> diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
> index 7228151239..9b399063d3 100644
> --- a/hw/rdma/rdma_rm_defs.h
> +++ b/hw/rdma/rdma_rm_defs.h
> @@ -49,10 +49,16 @@ typedef struct RdmaRmPD {
>       uint32_t ctx_handle;
>   } RdmaRmPD;
>   
> +typedef enum CQNotificationType {
> +    CNT_CLEAR,
> +    CNT_ARM,
> +    CNT_SET,
> +} CQNotificationType;
> +
>   typedef struct RdmaRmCQ {
>       RdmaBackendCQ backend_cq;
>       void *opaque;
> -    bool notify;
> +    CQNotificationType notify;
>   } RdmaRmCQ;
>   
>   /* MR (DMA region) */
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index c668afd0ed..762700a205 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -89,8 +89,10 @@ static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
>       pvrdma_ring_write_inc(&dev->dsr_info.cq);
>   
>       pr_dbg("cq->notify=%d\n", cq->notify);
> -    if (cq->notify) {
> -        cq->notify = false;
> +    if (cq->notify != CNT_CLEAR) {
> +        if (cq->notify == CNT_ARM) {
> +            cq->notify = CNT_CLEAR;
> +        }
>           post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
>       }
>   

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
@ 2018-11-10 17:59   ` Marcel Apfelbaum
  2018-11-11  9:12     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 17:59 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch

Hi Yuval,

On 11/8/18 6:07 PM, Yuval Shaia wrote:
> Device is not supporting QP0, only QP1.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index 86e8fe8ab6..3ccc9a2494 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
>   
>   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
>   {
> -    return qp->ibqp ? qp->ibqp->qp_num : 0;
> +    return qp->ibqp ? qp->ibqp->qp_num : 1;

Just to be sure, what are the cases where we don't get a qp_num?
Can we assume all of them are MADs?

Thanks,
Marcel

>   }
>   
>   static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/22] hw/rdma: Abort send-op if fail to create addr handler
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 04/22] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
@ 2018-11-10 17:59   ` Marcel Apfelbaum
  0 siblings, 0 replies; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 17:59 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> Function create_ah might return NULL, let's exit with an error.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index d7a4bbd91f..1e148398a2 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -338,6 +338,10 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>       if (qp_type == IBV_QPT_UD) {
>           wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
>                                   backend_dev->backend_gid_idx, dgid);
> +        if (!wr.wr.ud.ah) {
> +            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
> +            goto out_dealloc_cqe_ctx;
> +        }
>           wr.wr.ud.remote_qpn = dqpn;
>           wr.wr.ud.remote_qkey = dqkey;
>       }

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets Yuval Shaia
@ 2018-11-10 18:15   ` Marcel Apfelbaum
  2018-11-11 10:31     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:15 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch

Hi Yuval

On 11/8/18 6:08 PM, Yuval Shaia wrote:
> MAD (Management Datagram) packets are widely used by various modules,
> both in kernel and in user space. For example the rdma_* API, which is
> used to create and maintain a "connection" layer on top of RDMA, uses
> several types of MAD packets.

Can you add a link to the MAD spec to the commit message, or even in the code?

> To support MAD packets the device uses an external utility
> (contrib/rdmacm-mux) to relay packets from and to the guest driver.

Can the device be used without MADs support?
If not, can you update the pvrdma documentation to
reflect the changes?

> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
>   hw/rdma/rdma_backend.h      |   4 +-
>   hw/rdma/rdma_backend_defs.h |  10 +-
>   hw/rdma/vmw/pvrdma.h        |   2 +
>   hw/rdma/vmw/pvrdma_main.c   |   4 +-
>   5 files changed, 273 insertions(+), 10 deletions(-)
>
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 1e148398a2..3eb0099f8d 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -16,8 +16,13 @@
>   #include "qemu/osdep.h"
>   #include "qemu/error-report.h"
>   #include "qapi/error.h"
> +#include "qapi/qmp/qlist.h"
> +#include "qapi/qmp/qnum.h"
>   
>   #include <infiniband/verbs.h>
> +#include <infiniband/umad_types.h>
> +#include <infiniband/umad.h>
> +#include <rdma/rdma_user_cm.h>
>   
>   #include "trace.h"
>   #include "rdma_utils.h"
> @@ -33,16 +38,25 @@
>   #define VENDOR_ERR_MAD_SEND         0x206
>   #define VENDOR_ERR_INVLKEY          0x207
>   #define VENDOR_ERR_MR_SMALL         0x208
> +#define VENDOR_ERR_INV_MAD_BUFF     0x209
> +#define VENDOR_ERR_INV_NUM_SGE      0x210
>   
>   #define THR_NAME_LEN 16
>   #define THR_POLL_TO  5000
>   
> +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
> +
>   typedef struct BackendCtx {
> -    uint64_t req_id;
>       void *up_ctx;
>       bool is_tx_req;
> +    struct ibv_sge sge; /* Used to save MAD recv buffer */
>   } BackendCtx;
>   
> +struct backend_umad {
> +    struct ib_user_mad hdr;
> +    char mad[RDMA_MAX_PRIVATE_DATA];
> +};
> +
>   static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
>   
>   static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> @@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
>       return 0;
>   }
>   
> +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> +                    uint32_t num_sge)
> +{
> +    struct backend_umad umad = {0};
> +    char *hdr, *msg;
> +    int ret;
> +
> +    pr_dbg("num_sge=%d\n", num_sge);
> +
> +    if (num_sge != 2) {
> +        return -EINVAL;
> +    }
> +
> +    umad.hdr.length = sge[0].length + sge[1].length;
> +    pr_dbg("msg_len=%d\n", umad.hdr.length);
> +
> +    if (umad.hdr.length > sizeof(umad.mad)) {
> +        return -ENOMEM;
> +    }
> +
> +    umad.hdr.addr.qpn = htobe32(1);
> +    umad.hdr.addr.grh_present = 1;
> +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> +    umad.hdr.addr.hop_limit = 1;
> +
> +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> +
> +    memcpy(&umad.mad[0], hdr, sge[0].length);
> +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> +
> +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> +
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> +                            sizeof(umad));
> +
> +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> +
> +    return (ret != sizeof(umad));
> +}
> +
>   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>                               RdmaBackendQP *qp, uint8_t qp_type,
>                               struct ibv_sge *sge, uint32_t num_sge,
> @@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>           } else if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            rc = mad_send(backend_dev, sge, num_sge);
> +            if (rc) {
> +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            } else {
> +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> +            }
>           }
> -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
>           return;
>       }
>   
> @@ -370,6 +431,48 @@ out_free_bctx:
>       g_free(bctx);
>   }
>   
> +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
> +                                         struct ibv_sge *sge, uint32_t num_sge,
> +                                         void *ctx)
> +{
> +    BackendCtx *bctx;
> +    int rc;
> +    uint32_t bctx_id;
> +
> +    if (num_sge != 1) {
> +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
> +        return VENDOR_ERR_INV_NUM_SGE;
> +    }
> +
> +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
> +        pr_dbg("Too small buffer for MAD\n");
> +        return VENDOR_ERR_INV_MAD_BUFF;
> +    }
> +
> +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
> +    pr_dbg("length=%d\n", sge[0].length);
> +    pr_dbg("lkey=%d\n", sge[0].lkey);
> +
> +    bctx = g_malloc0(sizeof(*bctx));
> +
> +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
> +    if (unlikely(rc)) {
> +        g_free(bctx);
> +        pr_dbg("Fail to allocate cqe_ctx\n");
> +        return VENDOR_ERR_NOMEM;
> +    }
> +
> +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
> +    bctx->up_ctx = ctx;
> +    bctx->sge = *sge;
> +
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +
> +    return 0;
> +}
> +
>   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>                               RdmaDeviceResources *rdma_dev_res,
>                               RdmaBackendQP *qp, uint8_t qp_type,
> @@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>           }
>           if (qp_type == IBV_QPT_GSI) {
>               pr_dbg("QP1\n");
> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
> +            if (rc) {
> +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> +            }
>           }
>           return;
>       }
> @@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
>   
>       switch (qp_type) {
>       case IBV_QPT_GSI:
> -        pr_dbg("QP1 unsupported\n");
>           return 0;
>   
>       case IBV_QPT_RC:
> @@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
>       return 0;
>   }
>   
> +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> +                                 union ibv_gid *my_gid, int paylen)
> +{
> +    grh->paylen = htons(paylen);
> +    grh->sgid = *sgid;
> +    grh->dgid = *my_gid;
> +
> +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> +}
> +
> +static inline int mad_can_receieve(void *opaque)
> +{
> +    return sizeof(struct backend_umad);
> +}
> +
> +static void mad_read(void *opaque, const uint8_t *buf, int size)
> +{
> +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> +    QObject *o_ctx_id;
> +    unsigned long cqe_ctx_id;
> +    BackendCtx *bctx;
> +    char *mad;
> +    struct backend_umad *umad;
> +
> +    assert(size != sizeof(umad));
> +    umad = (struct backend_umad *)buf;
> +
> +    pr_dbg("Got %d bytes\n", size);
> +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> +
> +#ifdef PVRDMA_DEBUG
> +    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
> +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
> +           hdr->method, hdr->status, be64toh(hdr->tid),
> +           hdr->attr_id, hdr->attr_mod);
> +#endif
> +
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +    if (!o_ctx_id) {
> +        pr_dbg("No more free MADs buffers, waiting for a while\n");
> +        sleep(THR_POLL_TO);

Why do we sleep here? Seems a little odd.
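(For reference, THR_POLL_TO is defined as 5000 earlier in this file and
sleep() takes seconds, so this call would stall the chardev read handler for
well over an hour. A minimal alternative sketch, illustrative only and not
part of the patch, would be to drop the MAD and return:

    if (!o_ctx_id) {
        pr_dbg("No more free MAD buffers, dropping MAD\n");
        return;
    }
)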

> +        return;
> +    }
> +
> +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +    if (unlikely(!bctx)) {
> +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
> +        return;
> +    }
> +
> +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
> +
> +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> +                           bctx->sge.length);
> +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> +                     bctx->up_ctx);
> +    } else {
> +        memset(mad, 0, bctx->sge.length);
> +        build_mad_hdr((struct ibv_grh *)mad,
> +                      (union ibv_gid *)&umad->hdr.addr.gid,
> +                      &backend_dev->gid, umad->hdr.length);
> +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> +
> +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> +    }
> +
> +    g_free(bctx);
> +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +}
> +
> +static int mad_init(RdmaBackendDev *backend_dev)
> +{
> +    struct backend_umad umad = {0};
> +    int ret;
> +
> +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> +        pr_dbg("Missing chardev for MAD multiplexer\n");
> +        return -EIO;
> +    }
> +
> +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> +                             mad_read, NULL, NULL, backend_dev, NULL, true);
> +
> +    /* Register ourself */
> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> +                            sizeof(umad.hdr));
> +    if (ret != sizeof(umad.hdr)) {
> +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);

Why only a dbg message and not fail the init process in this case?
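(A sketch of what failing the init could look like here, reusing this
function's existing -EIO convention; illustrative only:

    if (ret != sizeof(umad.hdr)) {
        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
        return -EIO;
    }
)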

> +    }
> +
> +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> +    backend_dev->recv_mads_list.list = qlist_new();
> +
> +    return 0;
> +}
> +
> +static void mad_stop(RdmaBackendDev *backend_dev)
> +{
> +    QObject *o_ctx_id;
> +    unsigned long cqe_ctx_id;
> +    BackendCtx *bctx;
> +
> +    pr_dbg("Closing MAD\n");
> +
> +    /* Clear MAD buffers list */
> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);

Does it make sense to lock only around the
qlist_pop() call?


Thanks,
Marcel
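(For illustration, a sketch of the narrower locking suggested above, assuming
nothing else touches a bctx once it has been popped from the list;
illustrative only:

    do {
        qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
        qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
        if (o_ctx_id) {
            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
            /* ... dealloc the cqe ctx and free bctx as in the patch ... */
        }
    } while (o_ctx_id);
)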

> +    do {
> +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> +        if (o_ctx_id) {
> +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +            if (bctx) {
> +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> +                g_free(bctx);
> +            }
> +        }
> +    } while (o_ctx_id);
> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> +}
> +
> +static void mad_fini(RdmaBackendDev *backend_dev)
> +{
> +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
> +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> +}
> +
>   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
>                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      Error **errp)
> +                      CharBackend *mad_chr_be, Error **errp)
>   {
>       int i;
>       int ret = 0;
> @@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       memset(backend_dev, 0, sizeof(*backend_dev));
>   
>       backend_dev->dev = pdev;
> -
> +    backend_dev->mad_chr_be = mad_chr_be;
>       backend_dev->backend_gid_idx = backend_gid_idx;
>       backend_dev->port_num = port_num;
>       backend_dev->rdma_dev_res = rdma_dev_res;
> @@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>       pr_dbg("interface_id=0x%" PRIx64 "\n",
>              be64_to_cpu(backend_dev->gid.global.interface_id));
>   
> +    ret = mad_init(backend_dev);
> +    if (ret) {
> +        error_setg(errp, "Fail to initialize mad");
> +        ret = -EIO;
> +        goto out_destroy_comm_channel;
> +    }
> +
>       backend_dev->comp_thread.run = false;
>       backend_dev->comp_thread.is_running = false;
>   
> @@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
>   {
>       pr_dbg("Stopping rdma_backend\n");
>       stop_backend_thread(&backend_dev->comp_thread);
> +    mad_stop(backend_dev);
>   }
>   
>   void rdma_backend_fini(RdmaBackendDev *backend_dev)
>   {
>       rdma_backend_stop(backend_dev);
> +    mad_fini(backend_dev);
>       g_hash_table_destroy(ah_hash);
>       ibv_destroy_comp_channel(backend_dev->channel);
>       ibv_close_device(backend_dev->context);
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index 3ccc9a2494..fc83330251 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -17,6 +17,8 @@
>   #define RDMA_BACKEND_H
>   
>   #include "qapi/error.h"
> +#include "chardev/char-fe.h"
> +
>   #include "rdma_rm_defs.h"
>   #include "rdma_backend_defs.h"
>   
> @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>                         RdmaDeviceResources *rdma_dev_res,
>                         const char *backend_device_name, uint8_t port_num,
>                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> -                      Error **errp);
> +                      CharBackend *mad_chr_be, Error **errp);
>   void rdma_backend_fini(RdmaBackendDev *backend_dev);
>   void rdma_backend_start(RdmaBackendDev *backend_dev);
>   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> index 7404f64002..2a7e667075 100644
> --- a/hw/rdma/rdma_backend_defs.h
> +++ b/hw/rdma/rdma_backend_defs.h
> @@ -16,8 +16,9 @@
>   #ifndef RDMA_BACKEND_DEFS_H
>   #define RDMA_BACKEND_DEFS_H
>   
> -#include <infiniband/verbs.h>
>   #include "qemu/thread.h"
> +#include "chardev/char-fe.h"
> +#include <infiniband/verbs.h>
>   
>   typedef struct RdmaDeviceResources RdmaDeviceResources;
>   
> @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
>       bool is_running; /* Set by the thread to report its status */
>   } RdmaBackendThread;
>   
> +typedef struct RecvMadList {
> +    QemuMutex lock;
> +    QList *list;
> +} RecvMadList;
> +
>   typedef struct RdmaBackendDev {
>       struct ibv_device_attr dev_attr;
>       RdmaBackendThread comp_thread;
> @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
>       struct ibv_comp_channel *channel;
>       uint8_t port_num;
>       uint8_t backend_gid_idx;
> +    RecvMadList recv_mads_list;
> +    CharBackend *mad_chr_be;
>   } RdmaBackendDev;
>   
>   typedef struct RdmaBackendPD {
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index e2d9f93cdf..e3742d893a 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -19,6 +19,7 @@
>   #include "qemu/units.h"
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msix.h"
> +#include "chardev/char-fe.h"
>   
>   #include "../rdma_backend_defs.h"
>   #include "../rdma_rm_defs.h"
> @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
>       uint8_t backend_port_num;
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
> +    CharBackend mad_chr;
>   } PVRDMADev;
>   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index ca5fa8d981..6c8c0154fa 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
>       DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
>                         dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
>       DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
> +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
>       DEFINE_PROP_END_OF_LIST(),
>   };
>   
> @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>   
>       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
>                              dev->backend_device_name, dev->backend_port_num,
> -                           dev->backend_gid_idx, &dev->dev_attr, errp);
> +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> +                           errp);
>       if (rc) {
>           goto out;
>       }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/22] hw/pvrdma: Make function reset_device return void
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 06/22] hw/pvrdma: Make function reset_device return void Yuval Shaia
@ 2018-11-10 18:17   ` Marcel Apfelbaum
  0 siblings, 0 replies; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:17 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> This function cannot fail - fix it to return void
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_main.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index 6c8c0154fa..fc2abd34af 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -369,13 +369,11 @@ static int unquiesce_device(PVRDMADev *dev)
>       return 0;
>   }
>   
> -static int reset_device(PVRDMADev *dev)
> +static void reset_device(PVRDMADev *dev)
>   {
>       pvrdma_stop(dev);
>   
>       pr_dbg("Device reset complete\n");
> -
> -    return 0;
>   }
>   
>   static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/22] hw/pvrdma: Make default pkey 0xFFFF
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 07/22] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
@ 2018-11-10 18:17   ` Marcel Apfelbaum
  0 siblings, 0 replies; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:17 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> Commit 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF") exports the
> default pkey as an external definition but omits the change from 0x7FFF to
> 0xFFFF.
>
> Fixes: 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF")
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index e3742d893a..15c3f28b86 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -52,7 +52,7 @@
>   #define PVRDMA_FW_VERSION    14
>   
>   /* Some defaults */
> -#define PVRDMA_PKEY          0x7FFF
> +#define PVRDMA_PKEY          0xFFFF
>   
>   typedef struct DSRInfo {
>       dma_addr_t dma;


Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
@ 2018-11-10 18:18   ` Marcel Apfelbaum
  2018-11-11  8:43     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:18 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> The function pvrdma_post_cqe populates the CQE entry with the opcode from
> the given completion element. For receive operations the value was not set.
> Fix it by setting it to IBV_WC_RECV.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index 762700a205..7b0f440fda 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
>           comp_ctx = g_malloc(sizeof(CompHandlerCtx));
>           comp_ctx->dev = dev;
>           comp_ctx->cq_handle = qp->recv_cq_handle;
> -        comp_ctx->cqe.qp = qp_handle;
>           comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
> +        comp_ctx->cqe.qp = qp_handle;

Not sure the above chunk is needed.

> +        comp_ctx->cqe.opcode = IBV_WC_RECV;
>   

Anyway

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

>           rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
>                                  &qp->backend_qp, qp->qp_type,

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
@ 2018-11-10 18:21   ` Marcel Apfelbaum
  2018-11-11  8:04     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:21 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> The opcode for the WC should be set by the device and not taken from the
> work element.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> index 7b0f440fda..3388be1926 100644
> --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> @@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
>           comp_ctx->cq_handle = qp->send_cq_handle;
>           comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
>           comp_ctx->cqe.qp = qp_handle;
> -        comp_ctx->cqe.opcode = wqe->hdr.opcode;
> +        comp_ctx->cqe.opcode = IBV_WC_SEND;

That is interesting, what should happen if the opcode in hdr is different?
Maybe fail the operation?

Thanks,
Marcel

>   
>           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
>                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma Yuval Shaia
@ 2018-11-10 18:25   ` Marcel Apfelbaum
  2018-11-11  7:50     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:25 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> pvrdma requires that any GID attached to it also be attached to the
> backend device on the host.
>
> A new QMP message is defined so the pvrdma device can broadcast any change
> made to its GID table. This event is captured by libvirt, which in turn
> updates the GID table in the backend device.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   MAINTAINERS           |  1 +
>   Makefile              |  3 ++-
>   Makefile.objs         |  4 ++++
>   qapi/qapi-schema.json |  1 +
>   qapi/rdma.json        | 38 ++++++++++++++++++++++++++++++++++++++
>   5 files changed, 46 insertions(+), 1 deletion(-)
>   create mode 100644 qapi/rdma.json
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e087d58ac6..a149f68a8f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2232,6 +2232,7 @@ F: hw/rdma/*
>   F: hw/rdma/vmw/*
>   F: docs/pvrdma.txt
>   F: contrib/rdmacm-mux/*
> +F: qapi/rdma.json
>   
>   Build and test automation
>   -------------------------
> diff --git a/Makefile b/Makefile
> index 94072776ff..db4ce60ee5 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -599,7 +599,8 @@ qapi-modules = $(SRC_PATH)/qapi/qapi-schema.json $(SRC_PATH)/qapi/common.json \
>                  $(SRC_PATH)/qapi/tpm.json \
>                  $(SRC_PATH)/qapi/trace.json \
>                  $(SRC_PATH)/qapi/transaction.json \
> -               $(SRC_PATH)/qapi/ui.json
> +               $(SRC_PATH)/qapi/ui.json \
> +               $(SRC_PATH)/qapi/rdma.json
>   
>   qapi/qapi-builtin-types.c qapi/qapi-builtin-types.h \
>   qapi/qapi-types.c qapi/qapi-types.h \
> diff --git a/Makefile.objs b/Makefile.objs
> index cc7df3ad80..76d8028f2f 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -21,6 +21,7 @@ util-obj-y += qapi/qapi-types-tpm.o
>   util-obj-y += qapi/qapi-types-trace.o
>   util-obj-y += qapi/qapi-types-transaction.o
>   util-obj-y += qapi/qapi-types-ui.o
> +util-obj-y += qapi/qapi-types-rdma.o
>   util-obj-y += qapi/qapi-builtin-visit.o
>   util-obj-y += qapi/qapi-visit.o
>   util-obj-y += qapi/qapi-visit-block-core.o
> @@ -40,6 +41,7 @@ util-obj-y += qapi/qapi-visit-tpm.o
>   util-obj-y += qapi/qapi-visit-trace.o
>   util-obj-y += qapi/qapi-visit-transaction.o
>   util-obj-y += qapi/qapi-visit-ui.o
> +util-obj-y += qapi/qapi-visit-rdma.o
>   util-obj-y += qapi/qapi-events.o
>   util-obj-y += qapi/qapi-events-block-core.o
>   util-obj-y += qapi/qapi-events-block.o
> @@ -58,6 +60,7 @@ util-obj-y += qapi/qapi-events-tpm.o
>   util-obj-y += qapi/qapi-events-trace.o
>   util-obj-y += qapi/qapi-events-transaction.o
>   util-obj-y += qapi/qapi-events-ui.o
> +util-obj-y += qapi/qapi-events-rdma.o
>   util-obj-y += qapi/qapi-introspect.o
>   
>   chardev-obj-y = chardev/
> @@ -155,6 +158,7 @@ common-obj-y += qapi/qapi-commands-tpm.o
>   common-obj-y += qapi/qapi-commands-trace.o
>   common-obj-y += qapi/qapi-commands-transaction.o
>   common-obj-y += qapi/qapi-commands-ui.o
> +common-obj-y += qapi/qapi-commands-rdma.o
>   common-obj-y += qapi/qapi-introspect.o
>   common-obj-y += qmp.o hmp.o
>   endif
> diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
> index 65b6dc2f6f..a650d80f83 100644
> --- a/qapi/qapi-schema.json
> +++ b/qapi/qapi-schema.json
> @@ -94,3 +94,4 @@
>   { 'include': 'trace.json' }
>   { 'include': 'introspect.json' }
>   { 'include': 'misc.json' }
> +{ 'include': 'rdma.json' }
> diff --git a/qapi/rdma.json b/qapi/rdma.json
> new file mode 100644
> index 0000000000..804c68ab36
> --- /dev/null
> +++ b/qapi/rdma.json
> @@ -0,0 +1,38 @@
> +# -*- Mode: Python -*-
> +#
> +
> +##
> +# = RDMA device
> +##
> +
> +##
> +# @RDMA_GID_STATUS_CHANGED:
> +#
> +# Emitted when guest driver adds/deletes GID to/from device
> +#
> +# @netdev: RoCE Network Device name - char *
> +#
> +# @gid-status: Add or delete indication - bool
> +#
> +# @subnet-prefix: Subnet Prefix - uint64
> +#
> +# @interface-id : Interface ID - uint64
> +#
> +# Since: 3.2
> +#
> +# Example:
> +#
> +# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
> +#     "event": "RDMA_GID_STATUS_CHANGED",
> +#     "data":
> +#         {"netdev": "bridge0",
> +#         "interface-id": 15880512517475447892,
> +#         "gid-status": true,
> +#         "subnet-prefix": 33022}}
> +#
> +##
> +{ 'event': 'RDMA_GID_STATUS_CHANGED',
> +  'data': { 'netdev'        : 'str',
> +            'gid-status'    : 'bool',

The 'gid-status' naming as an indication of whether we add or remove a GID
is a little odd, but I can't come up with something better.

Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks,
Marcel

> +            'subnet-prefix' : 'uint64',
> +            'interface-id'  : 'uint64' } }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
@ 2018-11-10 18:27   ` Marcel Apfelbaum
  2018-11-11  7:45     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-10 18:27 UTC (permalink / raw)
  To: Yuval Shaia, dmitry.fleytman, jasowang, eblake, armbru, pbonzini,
	qemu-devel, shamir.rabinovitch



On 11/8/18 6:08 PM, Yuval Shaia wrote:
> The guest driver enforces this, so we should too.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>   hw/rdma/vmw/pvrdma.h      | 2 ++
>   hw/rdma/vmw/pvrdma_main.c | 3 +++
>   2 files changed, 5 insertions(+)
>
> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> index b019cb843a..10a3c4fb7c 100644
> --- a/hw/rdma/vmw/pvrdma.h
> +++ b/hw/rdma/vmw/pvrdma.h
> @@ -20,6 +20,7 @@
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msix.h"
>   #include "chardev/char-fe.h"
> +#include "hw/net/vmxnet3_defs.h"
>   
>   #include "../rdma_backend_defs.h"
>   #include "../rdma_rm_defs.h"
> @@ -85,6 +86,7 @@ typedef struct PVRDMADev {
>       RdmaBackendDev backend_dev;
>       RdmaDeviceResources rdma_dev_res;
>       CharBackend mad_chr;
> +    VMXNET3State *func0;
>   } PVRDMADev;
>   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>   
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index ac8c092db0..fa6468d221 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>           return;
>       }
>   
> +    /* Break if not vmxnet3 device in slot 0 */
> +    dev->func0 = VMXNET3(pci_get_function_0(pdev));
> +

I don't see the error code flow in case VMXNET3 is not func 0.
Am I missing something?


Thanks,
Marcel
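(The VMXNET3() macro is presumably an OBJECT_CHECK-style QOM cast, which
aborts if the function-0 device is of a different type instead of letting
realize report an error. A sketch of an explicit error path, assuming
TYPE_VMXNET3 is exported by vmxnet3_defs.h; illustrative only:

    if (!object_dynamic_cast(OBJECT(pci_get_function_0(pdev)), TYPE_VMXNET3)) {
        error_setg(errp, "PCI function 0 must be a vmxnet3 device");
        return;
    }
    dev->func0 = VMXNET3(pci_get_function_0(pdev));
)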

>       memdev_root = object_resolve_path("/objects", NULL);
>       if (memdev_root) {
>           object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
@ 2018-11-10 20:10   ` Shamir Rabinovitch
  2018-11-11  7:38     ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Shamir Rabinovitch @ 2018-11-10 20:10 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel

On Thu, Nov 08, 2018 at 06:07:57PM +0200, Yuval Shaia wrote:
> The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for a
> given MAD class.
> This does not go hand in hand with the qemu pvrdma device's requirements,
> where each VM is a MAD agent.
> Fix it by adding an implementation of an RDMA MAD multiplexer service
> which, on one hand, registers as the sole MAD agent with the kernel module
> and, on the other hand, gives service to more than one VM.
> 
> Design Overview:
> ----------------
> A server process registers with the UMAD framework (for this to work the
> rdma_cm kernel module needs to be unloaded) and creates a unix socket to
> listen for incoming requests from clients.
> A client process (such as QEMU) connects to this unix socket and
> registers with its own GID.
> 
> TX:
> ---
> When a client needs to send an rdma_cm MAD message it constructs it the
> same way as without this multiplexer, i.e. creates a umad packet, but this
> time it writes its content to the socket instead of calling umad_send().
> The server, upon receiving such a message, fetches the local_comm_id from
> it so a context for this session can be maintained, and relays the message
> to the UMAD layer by calling umad_send().
> 
> RX:
> ---
> The server creates a worker thread to process incoming rdma_cm MAD
> messages. When an incoming message arrives (umad_recv()), the server,
> depending on the message type (attr_id), looks for the target client by
> searching either the gid->fd table or the local_comm_id->fd table. With
> the extracted fd the server relays the incoming message to the client.
> 
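(To make the protocol described above concrete, a minimal client-side sketch;
the real client is the pvrdma device in QEMU, and the message layout comes
from rdmacm-mux.h/main.c below. vm_gid, mad_buf and mad_len are placeholder
names; illustrative only:

    RdmaCmMuxMsg msg = {0};
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    /* connect(sock, ...) to e.g. /var/run/rdmacm-mux-rxe0-1 */

    /* Register this VM's GID with the multiplexer */
    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
    msg.hdr.sgid = vm_gid;                      /* union ibv_gid of the VM */
    send(sock, &msg, sizeof(msg), 0);
    recv(sock, &msg, sizeof(msg), 0);           /* RDMACM_MUX_MSG_TYPE_RESP */
    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
        /* registration failed */
    }

    /* TX: relay a CM MAD instead of calling umad_send() directly */
    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
    msg.umad_len = mad_len;
    memcpy(msg.umad.mad, mad_buf, mad_len);
    send(sock, &msg, sizeof(msg), 0);
)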
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  MAINTAINERS                      |   1 +
>  Makefile                         |   3 +
>  Makefile.objs                    |   1 +
>  contrib/rdmacm-mux/Makefile.objs |   4 +
>  contrib/rdmacm-mux/main.c        | 770 +++++++++++++++++++++++++++++++
>  contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
>  6 files changed, 835 insertions(+)
>  create mode 100644 contrib/rdmacm-mux/Makefile.objs
>  create mode 100644 contrib/rdmacm-mux/main.c
>  create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 98a1856afc..e087d58ac6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2231,6 +2231,7 @@ S: Maintained
>  F: hw/rdma/*
>  F: hw/rdma/vmw/*
>  F: docs/pvrdma.txt
> +F: contrib/rdmacm-mux/*
>  
>  Build and test automation
>  -------------------------
> diff --git a/Makefile b/Makefile
> index f2947186a4..94072776ff 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
>                  elf2dmp-obj-y \
>                  ivshmem-client-obj-y \
>                  ivshmem-server-obj-y \
> +                rdmacm-mux-obj-y \
>                  libvhost-user-obj-y \
>                  vhost-user-scsi-obj-y \
>                  vhost-user-blk-obj-y \
> @@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
>  	$(call LINK, $^)
>  vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
>  	$(call LINK, $^)
> +rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
> +	$(call LINK, $^)
>  
>  module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
>  	$(call quiet-command,$(PYTHON) $< $@ \
> diff --git a/Makefile.objs b/Makefile.objs
> index 1e1ff387d7..cc7df3ad80 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
>  vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
>  vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
>  vhost-user-blk-obj-y = contrib/vhost-user-blk/
> +rdmacm-mux-obj-y = contrib/rdmacm-mux/
>  
>  ######################################################################
>  trace-events-subdirs =
> diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
> new file mode 100644
> index 0000000000..be3eacb6f7
> --- /dev/null
> +++ b/contrib/rdmacm-mux/Makefile.objs
> @@ -0,0 +1,4 @@
> +ifdef CONFIG_PVRDMA
> +CFLAGS += -libumad -Wno-format-truncation
> +rdmacm-mux-obj-y = main.o
> +endif
> diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
> new file mode 100644
> index 0000000000..0308074b15
> --- /dev/null
> +++ b/contrib/rdmacm-mux/main.c
> @@ -0,0 +1,770 @@
> +/*
> + * QEMU paravirtual RDMA - rdmacm-mux implementation
> + *
> + * Copyright (C) 2018 Oracle
> + * Copyright (C) 2018 Red Hat Inc
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "sys/poll.h"
> +#include "sys/ioctl.h"
> +#include "pthread.h"
> +#include "syslog.h"
> +
> +#include "infiniband/verbs.h"
> +#include "infiniband/umad.h"
> +#include "infiniband/umad_types.h"
> +#include "infiniband/umad_sa.h"
> +#include "infiniband/umad_cm.h"
> +
> +#include "rdmacm-mux.h"
> +
> +#define SCALE_US 1000
> +#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
> +#define SLEEP_SECS 5 /* This is used both in poll() and thread */
> +#define SERVER_LISTEN_BACKLOG 10
> +#define MAX_CLIENTS 4096
> +#define MAD_RMPP_VERSION 0
> +#define MAD_METHOD_MASK0 0x8
> +
> +#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
> +
> +#define CM_REQ_DGID_POS      80
> +#define CM_SIDR_REQ_DGID_POS 44
> +
> +/* The below can be override by command line parameter */
> +#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
> +#define RDMA_DEVICE "rxe0"
> +#define RDMA_PORT_NUM 1
> +
> +typedef struct RdmaCmServerArgs {
> +    char unix_socket_path[PATH_MAX];
> +    char rdma_dev_name[NAME_MAX];
> +    int rdma_port_num;
> +} RdmaCMServerArgs;
> +
> +typedef struct CommId2FdEntry {
> +    int fd;
> +    int ttl; /* Initialized to 2, decrement each timeout, entry delete when 0 */
> +    __be64 gid_ifid;
> +} CommId2FdEntry;
> +
> +typedef struct RdmaCmUMadAgent {
> +    int port_id;
> +    int agent_id;
> +    GHashTable *gid2fd; /* Used to find fd of a given gid */
> +    GHashTable *commid2fd; /* Used to find fd on of a given comm_id */
> +} RdmaCmUMadAgent;
> +
> +typedef struct RdmaCmServer {
> +    bool run;
> +    RdmaCMServerArgs args;
> +    struct pollfd fds[MAX_CLIENTS];
> +    int nfds;
> +    RdmaCmUMadAgent umad_agent;
> +    pthread_t umad_recv_thread;
> +    pthread_rwlock_t lock;
> +} RdmaCMServer;
> +
> +RdmaCMServer server = {0};

Maybe static is better here?

> +
> +static void usage(const char *progname)
> +{
> +    printf("Usage: %s [OPTION]...\n"
> +           "Start a RDMA-CM multiplexer\n"
> +           "\n"
> +           "\t-h                    Show this help\n"
> +           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
> +           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
> +           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
> +           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
> +}
> +
> +static void help(const char *progname)
> +{
> +    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
> +}
> +
> +static void parse_args(int argc, char *argv[])
> +{
> +    int c;
> +    char unix_socket_path[PATH_MAX];
> +
> +    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
> +    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
> +    server.args.rdma_port_num = RDMA_PORT_NUM;
> +
> +    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
> +        switch (c) {
> +        case 'h':
> +            usage(argv[0]);
> +            exit(0);
> +
> +        case 's':
> +            /* This is temporary, final name will build below */
> +            strncpy(unix_socket_path, optarg, PATH_MAX);
> +            break;
> +
> +        case 'd':
> +            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
> +            break;
> +
> +        case 'p':
> +            server.args.rdma_port_num = atoi(optarg);
> +            break;
> +
> +        default:
> +            help(argv[0]);
> +            exit(1);
> +        }
> +    }
> +
> +    /* Build unique unix-socket file name */
> +    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
> +             unix_socket_path, server.args.rdma_dev_name,
> +             server.args.rdma_port_num);

Please check for truncation:
"a return value of size or more means that the output was truncated"

> +
> +    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
> +    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
> +    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
> +}
> +
> +static void hash_tbl_alloc(void)
> +{
> +
> +    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
> +                                                     g_int64_equal,
> +                                                     g_free, g_free);
> +    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
> +                                                        g_int_equal,
> +                                                        g_free, g_free);

Any reason not to use 'g_hash_table_new' above?

Can the above functions fail to create a new hash table? If yes, please
add a return status from this function and check it in the caller...
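(One note that may answer the first question: later in this patch the keys
and values are inserted with g_memdup(), e.g.

    g_hash_table_insert(server.umad_agent.gid2fd,
                        g_memdup(&gid_ifid, sizeof(gid_ifid)),
                        g_memdup(&fd, sizeof(fd)));

so the _full variant with g_free destroy notifiers lets the tables free those
copies automatically; with plain g_hash_table_new() they would leak on
remove/destroy. As for failure, GLib allocation aborts on out-of-memory, so
g_hash_table_new_full() does not return NULL.)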

> +}
> +
> +static void hash_tbl_free(void)
> +{
> +    if (server.umad_agent.commid2fd) {
> +        g_hash_table_destroy(server.umad_agent.commid2fd);
> +    }
> +    if (server.umad_agent.gid2fd) {
> +        g_hash_table_destroy(server.umad_agent.gid2fd);
> +    }
> +}
> +
> +
> +static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
> +{
> +    int *fd;
> +
> +    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
> +    if (!fd) {
> +        /* Let's try IPv4 */
> +        *gid_ifid |= 0x00000000ffff0000;
> +        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
> +    }
> +
> +    return fd ? *fd : 0;
> +}
> +
> +static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
> +{
> +    pthread_rwlock_rdlock(&server.lock);
> +    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    if (!fd) {
> +        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
> +        return -ENOENT;
> +    }
> +
> +    return 0;
> +}
> +
> +static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
> +                                         __be64 *gid_idid)
> +{
> +    CommId2FdEntry *fde;
> +
> +    pthread_rwlock_rdlock(&server.lock);
> +    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    if (!fde) {
> +        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
> +        return -ENOENT;
> +    }
> +
> +    *fd = fde->fd;
> +    *gid_idid = fde->gid_ifid;
> +
> +    return 0;
> +}
> +
> +static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
> +{
> +    int fd1;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
> +    if (fd1) { /* record already exist - an error */
> +        pthread_rwlock_unlock(&server.lock);
> +        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
> +                           RDMACM_MUX_ERR_CODE_EACCES;
> +    }
> +
> +    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
> +                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
> +
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    syslog(LOG_INFO, "0x%lx registered on socket %d", (uint64_t)gid_ifid, fd);
> +
> +    return RDMACM_MUX_ERR_CODE_OK;
> +}
> +
> +static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
> +{
> +    int fd1;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
> +    if (!fd1) { /* record not exist - an error */
> +        pthread_rwlock_unlock(&server.lock);
> +        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
> +    }
> +
> +    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
> +                        sizeof(gid_ifid)));
> +    pthread_rwlock_unlock(&server.lock);
> +
> +    syslog(LOG_INFO, "0x%lx unregistered on socket %d", (uint64_t)gid_ifid, fd);
> +
> +    return RDMACM_MUX_ERR_CODE_OK;
> +}
> +
> +static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
> +                                          uint64_t gid_ifid)
> +{
> +    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +    g_hash_table_insert(server.umad_agent.commid2fd,
> +                        g_memdup(&comm_id, sizeof(comm_id)),
> +                        g_memdup(&fde, sizeof(fde)));
> +    pthread_rwlock_unlock(&server.lock);
> +}
> +
> +static gboolean remove_old_comm_ids(gpointer key, gpointer value,
> +                                    gpointer user_data)
> +{
> +    CommId2FdEntry *fde = (CommId2FdEntry *)value;
> +
> +    return !fde->ttl--;
> +}
> +
> +static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
> +                                         gpointer user_data)
> +{
> +    if (*(int *)value == *(int *)user_data) {
> +        syslog(LOG_INFO, "0x%lx unregistered on socket %d", *(uint64_t *)key,
> +               *(int *)value);
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +static void hash_tbl_remove_fd_ifid_pair(int fd)
> +{
> +    pthread_rwlock_wrlock(&server.lock);
> +    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
> +                                remove_entry_from_gid2fd, (gpointer)&fd);
> +    pthread_rwlock_unlock(&server.lock);
> +}
> +
> +static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
> +{
> +    struct umad_hdr *hdr = (struct umad_hdr *)mad;
> +    char *data = (char *)hdr + sizeof(*hdr);
> +    int32_t comm_id;
> +    uint16_t attr_id = be16toh(hdr->attr_id);
> +    int rc = 0;
> +
> +    switch (attr_id) {
> +    case UMAD_CM_ATTR_REQ:
> +        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
> +        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
> +        break;
> +
> +    case UMAD_CM_ATTR_SIDR_REQ:
> +        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
> +        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
> +        break;
> +
> +    case UMAD_CM_ATTR_REP:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_REJ:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_DREQ:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_DREP:
> +        /* Fall through */
> +    case UMAD_CM_ATTR_RTU:
> +        data += sizeof(comm_id);
> +        /* Fall through */
> +    case UMAD_CM_ATTR_SIDR_REP:
> +        memcpy(&comm_id, data, sizeof(comm_id));
> +        if (comm_id) {
> +            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
> +        }
> +        break;
> +
> +    default:
> +        rc = -EINVAL;
> +        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
> +    }
> +
> +    return rc;
> +}
> +
> +static void *umad_recv_thread_func(void *args)
> +{
> +    int rc;
> +    RdmaCmMuxMsg msg = {0};
> +    int fd = -2;
> +
> +    while (server.run) {
> +        do {
> +            msg.umad_len = sizeof(msg.umad.mad);
> +            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
> +                           SLEEP_SECS * SCALE_US);
> +            if ((rc == -EIO) || (rc == -EINVAL)) {
> +                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
> +            }
> +
> +            if (rc == -ETIMEDOUT) {
> +                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
> +                                            remove_old_comm_ids, NULL);
> +            }
> +        } while (rc && server.run);
> +
> +        if (server.run) {
> +            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
> +            if (rc) {
> +                continue;
> +            }
> +
> +            send(fd, &msg, sizeof(msg), 0);
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int read_and_process(int fd)
> +{
> +    int rc;
> +    RdmaCmMuxMsg msg = {0};
> +    struct umad_hdr *hdr;
> +    uint32_t *comm_id;
> +    uint16_t attr_id;
> +
> +    rc = recv(fd, &msg, sizeof(msg), 0);
> +
> +    if (rc < 0 && errno != EWOULDBLOCK) {
> +        return -EIO;
> +    }
> +
> +    if (!rc) {
> +        return -EPIPE;
> +    }
> +
> +    switch (msg.hdr.msg_type) {
> +    case RDMACM_MUX_MSG_TYPE_REG:
> +        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
> +        break;
> +
> +    case RDMACM_MUX_MSG_TYPE_UNREG:
> +        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
> +        break;
> +
> +    case RDMACM_MUX_MSG_TYPE_MAD:
> +        /* If this is REQ or REP then store the pair comm_id,fd to be later
> +         * used for other messages where gid is unknown */
> +        hdr = (struct umad_hdr *)msg.umad.mad;
> +        attr_id = be16toh(hdr->attr_id);
> +        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
> +            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
> +            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
> +            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
> +            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
> +                                          msg.hdr.sgid.global.interface_id);
> +        }
> +
> +        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
> +                       &msg.umad, msg.umad_len, 1, 0);
> +        if (rc) {
> +            syslog(LOG_WARNING, "Fail to send MAD message, err=%d", rc);
> +        }
> +        break;
> +
> +    default:
> +        syslog(LOG_WARNING, "Got invalid message (%d) from %d",
> +               msg.hdr.msg_type, fd);
> +        rc = RDMACM_MUX_ERR_CODE_EINVAL;
> +    }
> +
> +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
> +    msg.hdr.err_code = rc;
> +    rc = send(fd, &msg, sizeof(msg), 0);
> +
> +    return rc == sizeof(msg) ? 0 : -EPIPE;
> +}
> +
> +static int accept_all(void)
> +{
> +    int fd, rc = 0;;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    do {
> +        if ((server.nfds + 1) > MAX_CLIENTS) {
> +            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
> +            rc = -EIO;
> +            goto out;
> +        }
> +
> +        fd = accept(server.fds[0].fd, NULL, NULL);
> +        if (fd < 0) {
> +            if (errno != EWOULDBLOCK) {
> +                syslog(LOG_WARNING, "accept() failed");
> +                rc = -EIO;
> +                goto out;
> +            }
> +            break;
> +        }
> +
> +        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
> +        server.fds[server.nfds].fd = fd;
> +        server.fds[server.nfds].events = POLLIN;
> +        server.nfds++;
> +    } while (fd != -1);
> +
> +out:
> +    pthread_rwlock_unlock(&server.lock);
> +    return rc;
> +}
> +
> +static void compress_fds(void)
> +{
> +    int i, j;
> +    int closed = 0;
> +
> +    pthread_rwlock_wrlock(&server.lock);
> +
> +    for (i = 1; i < server.nfds; i++) {
> +        if (!server.fds[i].fd) {
> +            closed++;
> +            for (j = i; j < server.nfds; j++) {
> +                server.fds[j].fd = server.fds[j + 1].fd;
> +            }
> +        }
> +    }
> +
> +    server.nfds -= closed;
> +
> +    pthread_rwlock_unlock(&server.lock);
> +}
> +
> +static void close_fd(int idx)
> +{
> +    close(server.fds[idx].fd);
> +    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
> +    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
> +    server.fds[idx].fd = 0;
> +}
> +
> +static void run(void)
> +{
> +    int rc, nfds, i;
> +    bool compress = false;
> +
> +    syslog(LOG_INFO, "Service started");
> +
> +    while (server.run) {
> +        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
> +        if (rc < 0) {
> +            if (errno != EINTR) {
> +                syslog(LOG_WARNING, "poll() failed");
> +            }
> +            continue;
> +        }
> +
> +        if (rc == 0) {
> +            continue;
> +        }
> +
> +        nfds = server.nfds;
> +        for (i = 0; i < nfds; i++) {
> +            if (server.fds[i].revents == 0) {
> +                continue;
> +            }
> +
> +            if (server.fds[i].revents != POLLIN) {
> +                if (i == 0) {
> +                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
> +                           server.fds[i].revents);
> +                } else {
> +                    close_fd(i);
> +                    compress = true;
> +                }
> +                continue;
> +            }
> +
> +            if (i == 0) {
> +                rc = accept_all();
> +                if (rc) {
> +                    continue;
> +                }
> +            } else {
> +                rc = read_and_process(server.fds[i].fd);
> +                if (rc) {
> +                    close_fd(i);
> +                    compress = true;
> +                }
> +            }
> +        }
> +
> +        if (compress) {
> +            compress = false;
> +            compress_fds();
> +        }
> +    }
> +}
> +
> +static void fini_listener(void)
> +{
> +    int i;
> +
> +    if (server.fds[0].fd <= 0) {
> +        return;
> +    }
> +
> +    for (i = server.nfds - 1; i >= 0; i--) {
> +        if (server.fds[i].fd) {
> +            close(server.fds[i].fd);
> +        }
> +    }
> +
> +    unlink(server.args.unix_socket_path);
> +}
> +
> +static void fini_umad(void)
> +{
> +    if (server.umad_agent.agent_id) {
> +        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
> +    }
> +
> +    if (server.umad_agent.port_id) {
> +        umad_close_port(server.umad_agent.port_id);
> +    }
> +
> +    hash_tbl_free();
> +}
> +
> +static void fini(void)
> +{
> +    if (server.umad_recv_thread) {
> +        pthread_join(server.umad_recv_thread, NULL);

Can pthread_join be called (raced) from both the signal handler and main?

The man say this:
"If multiple threads simultaneously try to join with the same thread,
the results are undefined."

What ensures that the above will not happen?

> +        server.umad_recv_thread = 0;
> +    }
> +    fini_umad();
> +    fini_listener();
> +    pthread_rwlock_destroy(&server.lock);

The same question as above applies here.

The man say this:
"Results are undefined if a read-write lock is used without first being
initialized."

> +
> +    syslog(LOG_INFO, "Service going down");
> +}
> +
> +static int init_listener(void)
> +{
> +    struct sockaddr_un sun;
> +    int rc, on = 1;
> +
> +    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
> +    if (server.fds[0].fd < 0) {
> +        syslog(LOG_ALERT, "socket() failed");
> +        return -EIO;
> +    }

Since you work with full MAD messages, wouldn't it be better to use
SOCK_DGRAM? Is there any use for a partial MAD message?

From the man page:
"SOCK_DGRAM, for a datagram-oriented socket that preserves message boundaries"

> +
> +    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
> +                    sizeof(on));
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "setsockopt() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "ioctl() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
> +        syslog(LOG_ALERT,
> +               "Invalid unix_socket_path, size must be less than %ld\n",
> +               sizeof(sun.sun_path));
> +        rc = -EINVAL;
> +        goto err;
> +    }
> +
> +    sun.sun_family = AF_UNIX;
> +    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
> +                  server.args.unix_socket_path);
> +    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
> +        syslog(LOG_ALERT, "Could not copy unix socket path\n");
> +        rc = -EINVAL;
> +        goto err;
> +    }
> +
> +    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "bind() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
> +    if (rc < 0) {
> +        syslog(LOG_ALERT, "listen() failed");
> +        rc = -EIO;
> +        goto err;
> +    }
> +
> +    server.fds[0].events = POLLIN;
> +    server.nfds = 1;
> +    server.run = true;
> +
> +    return 0;
> +
> +err:
> +    close(server.fds[0].fd);
> +    return rc;
> +}
> +
> +static int init_umad(void)
> +{
> +    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
> +
> +    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
> +                                               server.args.rdma_port_num);
> +
> +    if (server.umad_agent.port_id < 0) {
> +        syslog(LOG_WARNING, "umad_open_port() failed");
> +        return -EIO;
> +    }
> +
> +    memset(&method_mask, 0, sizeof(method_mask));
> +    method_mask[0] = MAD_METHOD_MASK0;
> +    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
> +                                               UMAD_CLASS_CM,
> +                                               UMAD_SA_CLASS_VERSION,
> +                                               MAD_RMPP_VERSION, method_mask);
> +    if (server.umad_agent.agent_id < 0) {
> +        syslog(LOG_WARNING, "umad_register() failed");
> +        return -EIO;
> +    }
> +
> +    hash_tbl_alloc();
> +
> +    return 0;
> +}
> +
> +static void signal_handler(int sig, siginfo_t *siginfo, void *context)
> +{
> +    static bool warned;
> +
> +    /* Prevent stop if clients are connected */
> +    if (server.nfds != 1) {
> +        if (!warned) {
> +            syslog(LOG_WARNING,
> +                   "Can't stop while active client exist, resend SIGINT to overid");
> +            warned = true;
> +            return;
> +        }
> +    }
> +
> +    if (sig == SIGINT) {
> +        server.run = false;
> +        fini();
> +    }
> +
> +    exit(0);
> +}
> +
> +static int init(void)
> +{
> +    int rc;
> +
> +    rc = init_listener();
> +    if (rc) {
> +        return rc;
> +    }
> +
> +    rc = init_umad();
> +    if (rc) {
> +        return rc;
> +    }
> +
> +    pthread_rwlock_init(&server.lock, 0);
> +
> +    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
> +                        NULL);
> +    if (!rc) {
> +        return rc;
> +    }
> +
> +    return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    int rc;
> +    struct sigaction sig = {0};
> +
> +    sig.sa_sigaction = &signal_handler;
> +    sig.sa_flags = SA_SIGINFO;
> +
> +    if (sigaction(SIGINT, &sig, NULL) < 0) {
> +        syslog(LOG_ERR, "Fail to install SIGINT handler\n");
> +        return -EAGAIN;
> +    }
> +
> +    memset(&server, 0, sizeof(server));
> +
> +    parse_args(argc, argv);
> +
> +    rc = init();
> +    if (rc) {
> +        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
> +        rc = -EAGAIN;
> +        goto out;
> +    }
> +
> +    run();
> +
> +out:
> +    fini();
> +
> +    return rc;
> +}
> diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
> new file mode 100644
> index 0000000000..03508d52b2
> --- /dev/null
> +++ b/contrib/rdmacm-mux/rdmacm-mux.h
> @@ -0,0 +1,56 @@
> +/*
> + * QEMU paravirtual RDMA - rdmacm-mux declarations
> + *
> + * Copyright (C) 2018 Oracle
> + * Copyright (C) 2018 Red Hat Inc
> + *
> + * Authors:
> + *     Yuval Shaia <yuval.shaia@oracle.com>
> + *     Marcel Apfelbaum <marcel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef RDMACM_MUX_H
> +#define RDMACM_MUX_H
> +
> +#include "linux/if.h"
> +#include "infiniband/verbs.h"
> +#include "infiniband/umad.h"
> +#include "rdma/rdma_user_cm.h"
> +
> +typedef enum RdmaCmMuxMsgType {
> +    RDMACM_MUX_MSG_TYPE_REG   = 0,
> +    RDMACM_MUX_MSG_TYPE_UNREG = 1,
> +    RDMACM_MUX_MSG_TYPE_MAD   = 2,
> +    RDMACM_MUX_MSG_TYPE_RESP  = 3,
> +} RdmaCmMuxMsgType;
> +
> +typedef enum RdmaCmMuxErrCode {
> +    RDMACM_MUX_ERR_CODE_OK        = 0,
> +    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
> +    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
> +    RDMACM_MUX_ERR_CODE_EACCES    = 3,
> +    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
> +} RdmaCmMuxErrCode;
> +
> +typedef struct RdmaCmMuxHdr {
> +    RdmaCmMuxMsgType msg_type;
> +    union ibv_gid sgid;
> +    RdmaCmMuxErrCode err_code;
> +} RdmaCmUHdr;
> +
> +typedef struct RdmaCmUMad {
> +    struct ib_user_mad hdr;
> +    char mad[RDMA_MAX_PRIVATE_DATA];
> +} RdmaCmUMad;
> +
> +typedef struct RdmaCmMuxMsg {
> +    RdmaCmUHdr hdr;
> +    int umad_len;
> +    RdmaCmUMad umad;
> +} RdmaCmMuxMsg;
> +
> +#endif
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
  2018-11-10 20:10   ` Shamir Rabinovitch
@ 2018-11-11  7:38     ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11  7:38 UTC (permalink / raw)
  To: Shamir Rabinovitch
  Cc: marcel.apfelbaum, dmitry.fleytman, jasowang, eblake, armbru,
	pbonzini, qemu-devel, yuval.shaia

On Sat, Nov 10, 2018 at 10:10:04PM +0200, Shamir Rabinovitch wrote:
> On Thu, Nov 08, 2018 at 06:07:57PM +0200, Yuval Shaia wrote:
> > RDMA MAD kernel module (ibcm) disallow more than one MAD-agent for a
> > given MAD class.
> > This does not go hand-by-hand with qemu pvrdma device's requirements
> > where each VM is MAD agent.
> > Fix it by adding implementation of RDMA MAD multiplexer service which on
> > one hand register as a sole MAD agent with the kernel module and on the
> > other hand gives service to more than one VM.
> > 
> > Design Overview:
> > ----------------
> > A server process is registered to UMAD framework (for this to work the
> > rdma_cm kernel module needs to be unloaded) and creates a unix socket to
> > listen to incoming request from clients.
> > A client process (such as QEMU) connects to this unix socket and
> > registers with its own GID.
> > 
> > TX:
> > ---
> > When client needs to send rdma_cm MAD message it construct it the same
> > way as without this multiplexer, i.e. creates a umad packet but this
> > time it writes its content to the socket instead of calling umad_send().
> > The server, upon receiving such a message fetch local_comm_id from it so
> > a context for this session can be maintain and relay the message to UMAD
> > layer by calling umad_send().
> > 
> > RX:
> > ---
> > The server creates a worker thread to process incoming rdma_cm MAD
> > messages. When an incoming message arrived (umad_recv()) the server,
> > depending on the message type (attr_id) looks for target client by
> > either searching in gid->fd table or in local_comm_id->fd table. With
> > the extracted fd the server relays to incoming message to the client.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >  MAINTAINERS                      |   1 +
> >  Makefile                         |   3 +
> >  Makefile.objs                    |   1 +
> >  contrib/rdmacm-mux/Makefile.objs |   4 +
> >  contrib/rdmacm-mux/main.c        | 770 +++++++++++++++++++++++++++++++
> >  contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
> >  6 files changed, 835 insertions(+)
> >  create mode 100644 contrib/rdmacm-mux/Makefile.objs
> >  create mode 100644 contrib/rdmacm-mux/main.c
> >  create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 98a1856afc..e087d58ac6 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -2231,6 +2231,7 @@ S: Maintained
> >  F: hw/rdma/*
> >  F: hw/rdma/vmw/*
> >  F: docs/pvrdma.txt
> > +F: contrib/rdmacm-mux/*
> >  
> >  Build and test automation
> >  -------------------------
> > diff --git a/Makefile b/Makefile
> > index f2947186a4..94072776ff 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
> >                  elf2dmp-obj-y \
> >                  ivshmem-client-obj-y \
> >                  ivshmem-server-obj-y \
> > +                rdmacm-mux-obj-y \
> >                  libvhost-user-obj-y \
> >                  vhost-user-scsi-obj-y \
> >                  vhost-user-blk-obj-y \
> > @@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
> >  	$(call LINK, $^)
> >  vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
> >  	$(call LINK, $^)
> > +rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
> > +	$(call LINK, $^)
> >  
> >  module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
> >  	$(call quiet-command,$(PYTHON) $< $@ \
> > diff --git a/Makefile.objs b/Makefile.objs
> > index 1e1ff387d7..cc7df3ad80 100644
> > --- a/Makefile.objs
> > +++ b/Makefile.objs
> > @@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
> >  vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
> >  vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
> >  vhost-user-blk-obj-y = contrib/vhost-user-blk/
> > +rdmacm-mux-obj-y = contrib/rdmacm-mux/
> >  
> >  ######################################################################
> >  trace-events-subdirs =
> > diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
> > new file mode 100644
> > index 0000000000..be3eacb6f7
> > --- /dev/null
> > +++ b/contrib/rdmacm-mux/Makefile.objs
> > @@ -0,0 +1,4 @@
> > +ifdef CONFIG_PVRDMA
> > +CFLAGS += -libumad -Wno-format-truncation
> > +rdmacm-mux-obj-y = main.o
> > +endif
> > diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
> > new file mode 100644
> > index 0000000000..0308074b15
> > --- /dev/null
> > +++ b/contrib/rdmacm-mux/main.c
> > @@ -0,0 +1,770 @@
> > +/*
> > + * QEMU paravirtual RDMA - rdmacm-mux implementation
> > + *
> > + * Copyright (C) 2018 Oracle
> > + * Copyright (C) 2018 Red Hat Inc
> > + *
> > + * Authors:
> > + *     Yuval Shaia <yuval.shaia@oracle.com>
> > + *     Marcel Apfelbaum <marcel@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "sys/poll.h"
> > +#include "sys/ioctl.h"
> > +#include "pthread.h"
> > +#include "syslog.h"
> > +
> > +#include "infiniband/verbs.h"
> > +#include "infiniband/umad.h"
> > +#include "infiniband/umad_types.h"
> > +#include "infiniband/umad_sa.h"
> > +#include "infiniband/umad_cm.h"
> > +
> > +#include "rdmacm-mux.h"
> > +
> > +#define SCALE_US 1000
> > +#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
> > +#define SLEEP_SECS 5 /* This is used both in poll() and thread */
> > +#define SERVER_LISTEN_BACKLOG 10
> > +#define MAX_CLIENTS 4096
> > +#define MAD_RMPP_VERSION 0
> > +#define MAD_METHOD_MASK0 0x8
> > +
> > +#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
> > +
> > +#define CM_REQ_DGID_POS      80
> > +#define CM_SIDR_REQ_DGID_POS 44
> > +
> > +/* The below can be override by command line parameter */
> > +#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
> > +#define RDMA_DEVICE "rxe0"
> > +#define RDMA_PORT_NUM 1
> > +
> > +typedef struct RdmaCmServerArgs {
> > +    char unix_socket_path[PATH_MAX];
> > +    char rdma_dev_name[NAME_MAX];
> > +    int rdma_port_num;
> > +} RdmaCMServerArgs;
> > +
> > +typedef struct CommId2FdEntry {
> > +    int fd;
> > +    int ttl; /* Initialized to 2, decrement each timeout, entry delete when 0 */
> > +    __be64 gid_ifid;
> > +} CommId2FdEntry;
> > +
> > +typedef struct RdmaCmUMadAgent {
> > +    int port_id;
> > +    int agent_id;
> > +    GHashTable *gid2fd; /* Used to find fd of a given gid */
> > +    GHashTable *commid2fd; /* Used to find fd on of a given comm_id */
> > +} RdmaCmUMadAgent;
> > +
> > +typedef struct RdmaCmServer {
> > +    bool run;
> > +    RdmaCMServerArgs args;
> > +    struct pollfd fds[MAX_CLIENTS];
> > +    int nfds;
> > +    RdmaCmUMadAgent umad_agent;
> > +    pthread_t umad_recv_thread;
> > +    pthread_rwlock_t lock;
> > +} RdmaCMServer;
> > +
> > +RdmaCMServer server = {0};
> 
> Maybe static is better here?

Done.

> 
> > +
> > +static void usage(const char *progname)
> > +{
> > +    printf("Usage: %s [OPTION]...\n"
> > +           "Start a RDMA-CM multiplexer\n"
> > +           "\n"
> > +           "\t-h                    Show this help\n"
> > +           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
> > +           "\t-d rdma-device-name   Name of RDMA device to register with (default %s)\n"
> > +           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
> > +           progname, UNIX_SOCKET_PATH, RDMA_DEVICE, RDMA_PORT_NUM);
> > +}
> > +
> > +static void help(const char *progname)
> > +{
> > +    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
> > +}
> > +
> > +static void parse_args(int argc, char *argv[])
> > +{
> > +    int c;
> > +    char unix_socket_path[PATH_MAX];
> > +
> > +    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
> > +    strncpy(server.args.rdma_dev_name, RDMA_DEVICE, NAME_MAX - 1);
> > +    server.args.rdma_port_num = RDMA_PORT_NUM;
> > +
> > +    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
> > +        switch (c) {
> > +        case 'h':
> > +            usage(argv[0]);
> > +            exit(0);
> > +
> > +        case 's':
> > +            /* This is temporary, final name will build below */
> > +            strncpy(unix_socket_path, optarg, PATH_MAX);
> > +            break;
> > +
> > +        case 'd':
> > +            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
> > +            break;
> > +
> > +        case 'p':
> > +            server.args.rdma_port_num = atoi(optarg);
> > +            break;
> > +
> > +        default:
> > +            help(argv[0]);
> > +            exit(1);
> > +        }
> > +    }
> > +
> > +    /* Build unique unix-socket file name */
> > +    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
> > +             unix_socket_path, server.args.rdma_dev_name,
> > +             server.args.rdma_port_num);
> 
> Please check for truncation:
> "a return value of size or more means that the output was truncated"

So, you are suggesting giving a warning to a user who chooses a name with
more than 4096 characters, right?
Give a warning and then what? Exit?
How about just truncating it, printing a log message [1] with the truncated
(corrected) name and continuing as usual?
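
Something like this rough (untested) sketch is what I have in mind; 'rc'
would be a new local int in parse_args():

    rc = snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
                  unix_socket_path, server.args.rdma_dev_name,
                  server.args.rdma_port_num);
    if (rc >= PATH_MAX) {
        syslog(LOG_WARNING, "unix-socket-path truncated to '%s'",
               server.args.unix_socket_path);
    }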

> 
> > +
> > +    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);

[1]
Log message here shows the final name anyway.

> > +    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
> > +    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
> > +}
> > +
> > +static void hash_tbl_alloc(void)
> > +{
> > +
> > +    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
> > +                                                     g_int64_equal,
> > +                                                     g_free, g_free);
> > +    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
> > +                                                        g_int_equal,
> > +                                                        g_free, g_free);
> 
> Any reason not to use 'g_hash_table_new' above?

Just so I can use the cleaners (the g_free destroy callbacks)? I'm not sure
they would be called if I just used the primitive variant (g_hash_table_new);
it is not documented, so I can't rely on it.
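
I.e., roughly the pattern I am relying on (just a sketch; 'key' and 'val' are
placeholders here):

    GHashTable *tbl = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                            g_free, g_free);
    /* keys and values are g_memdup()ed, so the table owns the copies */
    g_hash_table_insert(tbl, g_memdup(&key, sizeof(key)),
                        g_memdup(&val, sizeof(val)));
    g_hash_table_destroy(tbl); /* g_free is called on the copies here */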

> 
> Can the above functions fail to create a new hash table? If yes, please
> add a return status to this function and check it in the caller...

It can't.

> 
> > +}
> > +
> > +static void hash_tbl_free(void)
> > +{
> > +    if (server.umad_agent.commid2fd) {
> > +        g_hash_table_destroy(server.umad_agent.commid2fd);
> > +    }
> > +    if (server.umad_agent.gid2fd) {
> > +        g_hash_table_destroy(server.umad_agent.gid2fd);
> > +    }
> > +}
> > +
> > +
> > +static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
> > +{
> > +    int *fd;
> > +
> > +    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
> > +    if (!fd) {
> > +        /* Let's try IPv4 */
> > +        *gid_ifid |= 0x00000000ffff0000;
> > +        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
> > +    }
> > +
> > +    return fd ? *fd : 0;
> > +}
> > +
> > +static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
> > +{
> > +    pthread_rwlock_rdlock(&server.lock);
> > +    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
> > +    pthread_rwlock_unlock(&server.lock);
> > +
> > +    if (!fd) {
> > +        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
> > +        return -ENOENT;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
> > +                                         __be64 *gid_idid)
> > +{
> > +    CommId2FdEntry *fde;
> > +
> > +    pthread_rwlock_rdlock(&server.lock);
> > +    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
> > +    pthread_rwlock_unlock(&server.lock);
> > +
> > +    if (!fde) {
> > +        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
> > +        return -ENOENT;
> > +    }
> > +
> > +    *fd = fde->fd;
> > +    *gid_idid = fde->gid_ifid;
> > +
> > +    return 0;
> > +}
> > +
> > +static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
> > +{
> > +    int fd1;
> > +
> > +    pthread_rwlock_wrlock(&server.lock);
> > +
> > +    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
> > +    if (fd1) { /* record already exist - an error */
> > +        pthread_rwlock_unlock(&server.lock);
> > +        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
> > +                           RDMACM_MUX_ERR_CODE_EACCES;
> > +    }
> > +
> > +    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
> > +                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
> > +
> > +    pthread_rwlock_unlock(&server.lock);
> > +
> > +    syslog(LOG_INFO, "0x%lx registered on socket %d", (uint64_t)gid_ifid, fd);
> > +
> > +    return RDMACM_MUX_ERR_CODE_OK;
> > +}
> > +
> > +static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
> > +{
> > +    int fd1;
> > +
> > +    pthread_rwlock_wrlock(&server.lock);
> > +
> > +    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
> > +    if (!fd1) { /* record not exist - an error */
> > +        pthread_rwlock_unlock(&server.lock);
> > +        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
> > +    }
> > +
> > +    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
> > +                        sizeof(gid_ifid)));
> > +    pthread_rwlock_unlock(&server.lock);
> > +
> > +    syslog(LOG_INFO, "0x%lx unregistered on socket %d", (uint64_t)gid_ifid, fd);
> > +
> > +    return RDMACM_MUX_ERR_CODE_OK;
> > +}
> > +
> > +static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
> > +                                          uint64_t gid_ifid)
> > +{
> > +    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
> > +
> > +    pthread_rwlock_wrlock(&server.lock);
> > +    g_hash_table_insert(server.umad_agent.commid2fd,
> > +                        g_memdup(&comm_id, sizeof(comm_id)),
> > +                        g_memdup(&fde, sizeof(fde)));
> > +    pthread_rwlock_unlock(&server.lock);
> > +}
> > +
> > +static gboolean remove_old_comm_ids(gpointer key, gpointer value,
> > +                                    gpointer user_data)
> > +{
> > +    CommId2FdEntry *fde = (CommId2FdEntry *)value;
> > +
> > +    return !fde->ttl--;
> > +}
> > +
> > +static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
> > +                                         gpointer user_data)
> > +{
> > +    if (*(int *)value == *(int *)user_data) {
> > +        syslog(LOG_INFO, "0x%lx unregistered on socket %d", *(uint64_t *)key,
> > +               *(int *)value);
> > +        return true;
> > +    }
> > +
> > +    return false;
> > +}
> > +
> > +static void hash_tbl_remove_fd_ifid_pair(int fd)
> > +{
> > +    pthread_rwlock_wrlock(&server.lock);
> > +    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
> > +                                remove_entry_from_gid2fd, (gpointer)&fd);
> > +    pthread_rwlock_unlock(&server.lock);
> > +}
> > +
> > +static int get_fd(const char *mad, int *fd, __be64 *gid_ifid)
> > +{
> > +    struct umad_hdr *hdr = (struct umad_hdr *)mad;
> > +    char *data = (char *)hdr + sizeof(*hdr);
> > +    int32_t comm_id;
> > +    uint16_t attr_id = be16toh(hdr->attr_id);
> > +    int rc = 0;
> > +
> > +    switch (attr_id) {
> > +    case UMAD_CM_ATTR_REQ:
> > +        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
> > +        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
> > +        break;
> > +
> > +    case UMAD_CM_ATTR_SIDR_REQ:
> > +        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
> > +        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
> > +        break;
> > +
> > +    case UMAD_CM_ATTR_REP:
> > +        /* Fall through */
> > +    case UMAD_CM_ATTR_REJ:
> > +        /* Fall through */
> > +    case UMAD_CM_ATTR_DREQ:
> > +        /* Fall through */
> > +    case UMAD_CM_ATTR_DREP:
> > +        /* Fall through */
> > +    case UMAD_CM_ATTR_RTU:
> > +        data += sizeof(comm_id);
> > +        /* Fall through */
> > +    case UMAD_CM_ATTR_SIDR_REP:
> > +        memcpy(&comm_id, data, sizeof(comm_id));
> > +        if (comm_id) {
> > +            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
> > +        }
> > +        break;
> > +
> > +    default:
> > +        rc = -EINVAL;
> > +        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
> > +    }
> > +
> > +    return rc;
> > +}
> > +
> > +static void *umad_recv_thread_func(void *args)
> > +{
> > +    int rc;
> > +    RdmaCmMuxMsg msg = {0};
> > +    int fd = -2;
> > +
> > +    while (server.run) {
> > +        do {
> > +            msg.umad_len = sizeof(msg.umad.mad);
> > +            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
> > +                           SLEEP_SECS * SCALE_US);
> > +            if ((rc == -EIO) || (rc == -EINVAL)) {
> > +                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
> > +            }
> > +
> > +            if (rc == -ETIMEDOUT) {
> > +                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
> > +                                            remove_old_comm_ids, NULL);
> > +            }
> > +        } while (rc && server.run);
> > +
> > +        if (server.run) {
> > +            rc = get_fd(msg.umad.mad, &fd, &msg.hdr.sgid.global.interface_id);
> > +            if (rc) {
> > +                continue;
> > +            }
> > +
> > +            send(fd, &msg, sizeof(msg), 0);
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static int read_and_process(int fd)
> > +{
> > +    int rc;
> > +    RdmaCmMuxMsg msg = {0};
> > +    struct umad_hdr *hdr;
> > +    uint32_t *comm_id;
> > +    uint16_t attr_id;
> > +
> > +    rc = recv(fd, &msg, sizeof(msg), 0);
> > +
> > +    if (rc < 0 && errno != EWOULDBLOCK) {
> > +        return -EIO;
> > +    }
> > +
> > +    if (!rc) {
> > +        return -EPIPE;
> > +    }
> > +
> > +    switch (msg.hdr.msg_type) {
> > +    case RDMACM_MUX_MSG_TYPE_REG:
> > +        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
> > +        break;
> > +
> > +    case RDMACM_MUX_MSG_TYPE_UNREG:
> > +        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
> > +        break;
> > +
> > +    case RDMACM_MUX_MSG_TYPE_MAD:
> > +        /* If this is REQ or REP then store the pair comm_id,fd to be later
> > +         * used for other messages where gid is unknown */
> > +        hdr = (struct umad_hdr *)msg.umad.mad;
> > +        attr_id = be16toh(hdr->attr_id);
> > +        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
> > +            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
> > +            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
> > +            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
> > +            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
> > +                                          msg.hdr.sgid.global.interface_id);
> > +        }
> > +
> > +        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
> > +                       &msg.umad, msg.umad_len, 1, 0);
> > +        if (rc) {
> > +            syslog(LOG_WARNING, "Fail to send MAD message, err=%d", rc);
> > +        }
> > +        break;
> > +
> > +    default:
> > +        syslog(LOG_WARNING, "Got invalid message (%d) from %d",
> > +               msg.hdr.msg_type, fd);
> > +        rc = RDMACM_MUX_ERR_CODE_EINVAL;
> > +    }
> > +
> > +    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
> > +    msg.hdr.err_code = rc;
> > +    rc = send(fd, &msg, sizeof(msg), 0);
> > +
> > +    return rc == sizeof(msg) ? 0 : -EPIPE;
> > +}
> > +
> > +static int accept_all(void)
> > +{
> > +    int fd, rc = 0;;
> > +
> > +    pthread_rwlock_wrlock(&server.lock);
> > +
> > +    do {
> > +        if ((server.nfds + 1) > MAX_CLIENTS) {
> > +            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
> > +            rc = -EIO;
> > +            goto out;
> > +        }
> > +
> > +        fd = accept(server.fds[0].fd, NULL, NULL);
> > +        if (fd < 0) {
> > +            if (errno != EWOULDBLOCK) {
> > +                syslog(LOG_WARNING, "accept() failed");
> > +                rc = -EIO;
> > +                goto out;
> > +            }
> > +            break;
> > +        }
> > +
> > +        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
> > +        server.fds[server.nfds].fd = fd;
> > +        server.fds[server.nfds].events = POLLIN;
> > +        server.nfds++;
> > +    } while (fd != -1);
> > +
> > +out:
> > +    pthread_rwlock_unlock(&server.lock);
> > +    return rc;
> > +}
> > +
> > +static void compress_fds(void)
> > +{
> > +    int i, j;
> > +    int closed = 0;
> > +
> > +    pthread_rwlock_wrlock(&server.lock);
> > +
> > +    for (i = 1; i < server.nfds; i++) {
> > +        if (!server.fds[i].fd) {
> > +            closed++;
> > +            for (j = i; j < server.nfds; j++) {
> > +                server.fds[j].fd = server.fds[j + 1].fd;
> > +            }
> > +        }
> > +    }
> > +
> > +    server.nfds -= closed;
> > +
> > +    pthread_rwlock_unlock(&server.lock);
> > +}
> > +
> > +static void close_fd(int idx)
> > +{
> > +    close(server.fds[idx].fd);
> > +    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
> > +    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
> > +    server.fds[idx].fd = 0;
> > +}
> > +
> > +static void run(void)
> > +{
> > +    int rc, nfds, i;
> > +    bool compress = false;
> > +
> > +    syslog(LOG_INFO, "Service started");
> > +
> > +    while (server.run) {
> > +        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
> > +        if (rc < 0) {
> > +            if (errno != EINTR) {
> > +                syslog(LOG_WARNING, "poll() failed");
> > +            }
> > +            continue;
> > +        }
> > +
> > +        if (rc == 0) {
> > +            continue;
> > +        }
> > +
> > +        nfds = server.nfds;
> > +        for (i = 0; i < nfds; i++) {
> > +            if (server.fds[i].revents == 0) {
> > +                continue;
> > +            }
> > +
> > +            if (server.fds[i].revents != POLLIN) {
> > +                if (i == 0) {
> > +                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
> > +                           server.fds[i].revents);
> > +                } else {
> > +                    close_fd(i);
> > +                    compress = true;
> > +                }
> > +                continue;
> > +            }
> > +
> > +            if (i == 0) {
> > +                rc = accept_all();
> > +                if (rc) {
> > +                    continue;
> > +                }
> > +            } else {
> > +                rc = read_and_process(server.fds[i].fd);
> > +                if (rc) {
> > +                    close_fd(i);
> > +                    compress = true;
> > +                }
> > +            }
> > +        }
> > +
> > +        if (compress) {
> > +            compress = false;
> > +            compress_fds();
> > +        }
> > +    }
> > +}
> > +
> > +static void fini_listener(void)
> > +{
> > +    int i;
> > +
> > +    if (server.fds[0].fd <= 0) {
> > +        return;
> > +    }
> > +
> > +    for (i = server.nfds - 1; i >= 0; i--) {
> > +        if (server.fds[i].fd) {
> > +            close(server.fds[i].fd);
> > +        }
> > +    }
> > +
> > +    unlink(server.args.unix_socket_path);
> > +}
> > +
> > +static void fini_umad(void)
> > +{
> > +    if (server.umad_agent.agent_id) {
> > +        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
> > +    }
> > +
> > +    if (server.umad_agent.port_id) {
> > +        umad_close_port(server.umad_agent.port_id);
> > +    }
> > +
> > +    hash_tbl_free();
> > +}
> > +
> > +static void fini(void)
> > +{
> > +    if (server.umad_recv_thread) {
> > +        pthread_join(server.umad_recv_thread, NULL);
> 
> Can pthread_join() be called (raced) from both the signal handler and main?
> 
> The man page says:
> "If multiple threads simultaneously try to join with the same thread,
> the results are undefined."
> 
> What ensures that the above will not happen?

[2]
To be on the safe side I will move the code that installs the signal handler
to after init() is done; this way there will be only one caller of fini().
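
I.e., main() would end up roughly like this (reusing the existing pieces,
untested):

    int main(int argc, char *argv[])
    {
        int rc;
        struct sigaction sig = {0};

        memset(&server, 0, sizeof(server));

        parse_args(argc, argv);

        rc = init();
        if (rc) {
            syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
            rc = -EAGAIN;
            goto out;
        }

        /* Install the handler only once init() is done (see discussion
         * above), so fini() is not reachable before the server is up */
        sig.sa_sigaction = &signal_handler;
        sig.sa_flags = SA_SIGINFO;
        if (sigaction(SIGINT, &sig, NULL) < 0) {
            syslog(LOG_ERR, "Fail to install SIGINT handler\n");
            rc = -EAGAIN;
            goto out;
        }

        run();

    out:
        fini();

        return rc;
    }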

> 
> > +        server.umad_recv_thread = 0;
> > +    }
> > +    fini_umad();
> > +    fini_listener();
> > +    pthread_rwlock_destroy(&server.lock);
> 
> The same question applies here.
> 
> The man page says:
> "Results are undefined if a read-write lock is used without first being
> initialized."

Done ([2]).

> 
> > +
> > +    syslog(LOG_INFO, "Service going down");
> > +}
> > +
> > +static int init_listener(void)
> > +{
> > +    struct sockaddr_un sun;
> > +    int rc, on = 1;
> > +
> > +    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
> > +    if (server.fds[0].fd < 0) {
> > +        syslog(LOG_ALERT, "socket() failed");
> > +        return -EIO;
> > +    }
> 
> Since you work with full MAD messages, wouldn't it be better to use
> SOCK_DGRAM? Is there any use for a partial MAD message?

This socket is also used for registration messages, which are not MADs.
A registration message is one in which the client (VM) identifies itself
with a GID.
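
Just to illustrate, the registration a client sends over this socket looks
roughly like this ('sock_fd' and 'vm_gid' are placeholders):

    RdmaCmMuxMsg msg = {0};

    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REG;
    msg.hdr.sgid = vm_gid; /* the GID the VM identifies itself with */
    if (send(sock_fd, &msg, sizeof(msg), 0) != sizeof(msg)) {
        /* registration failed */
    }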

> 
> From the man page:
> "SOCK_DGRAM, for a datagram-oriented socket that preserves message boundaries"
> 
> > +
> > +    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
> > +                    sizeof(on));
> > +    if (rc < 0) {
> > +        syslog(LOG_ALERT, "setsockopt() failed");
> > +        rc = -EIO;
> > +        goto err;
> > +    }
> > +
> > +    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
> > +    if (rc < 0) {
> > +        syslog(LOG_ALERT, "ioctl() failed");
> > +        rc = -EIO;
> > +        goto err;
> > +    }
> > +
> > +    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
> > +        syslog(LOG_ALERT,
> > +               "Invalid unix_socket_path, size must be less than %ld\n",
> > +               sizeof(sun.sun_path));
> > +        rc = -EINVAL;
> > +        goto err;
> > +    }
> > +
> > +    sun.sun_family = AF_UNIX;
> > +    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
> > +                  server.args.unix_socket_path);
> > +    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
> > +        syslog(LOG_ALERT, "Could not copy unix socket path\n");
> > +        rc = -EINVAL;
> > +        goto err;
> > +    }
> > +
> > +    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
> > +    if (rc < 0) {
> > +        syslog(LOG_ALERT, "bind() failed");
> > +        rc = -EIO;
> > +        goto err;
> > +    }
> > +
> > +    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
> > +    if (rc < 0) {
> > +        syslog(LOG_ALERT, "listen() failed");
> > +        rc = -EIO;
> > +        goto err;
> > +    }
> > +
> > +    server.fds[0].events = POLLIN;
> > +    server.nfds = 1;
> > +    server.run = true;
> > +
> > +    return 0;
> > +
> > +err:
> > +    close(server.fds[0].fd);
> > +    return rc;
> > +}
> > +
> > +static int init_umad(void)
> > +{
> > +    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
> > +
> > +    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
> > +                                               server.args.rdma_port_num);
> > +
> > +    if (server.umad_agent.port_id < 0) {
> > +        syslog(LOG_WARNING, "umad_open_port() failed");
> > +        return -EIO;
> > +    }
> > +
> > +    memset(&method_mask, 0, sizeof(method_mask));
> > +    method_mask[0] = MAD_METHOD_MASK0;
> > +    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
> > +                                               UMAD_CLASS_CM,
> > +                                               UMAD_SA_CLASS_VERSION,
> > +                                               MAD_RMPP_VERSION, method_mask);
> > +    if (server.umad_agent.agent_id < 0) {
> > +        syslog(LOG_WARNING, "umad_register() failed");
> > +        return -EIO;
> > +    }
> > +
> > +    hash_tbl_alloc();
> > +
> > +    return 0;
> > +}
> > +
> > +static void signal_handler(int sig, siginfo_t *siginfo, void *context)
> > +{
> > +    static bool warned;
> > +
> > +    /* Prevent stop if clients are connected */
> > +    if (server.nfds != 1) {
> > +        if (!warned) {
> > +            syslog(LOG_WARNING,
> > +                   "Can't stop while active client exist, resend SIGINT to overid");
> > +            warned = true;
> > +            return;
> > +        }
> > +    }
> > +
> > +    if (sig == SIGINT) {
> > +        server.run = false;
> > +        fini();
> > +    }
> > +
> > +    exit(0);
> > +}
> > +
> > +static int init(void)
> > +{
> > +    int rc;
> > +
> > +    rc = init_listener();
> > +    if (rc) {
> > +        return rc;
> > +    }
> > +
> > +    rc = init_umad();
> > +    if (rc) {
> > +        return rc;
> > +    }
> > +
> > +    pthread_rwlock_init(&server.lock, 0);
> > +
> > +    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
> > +                        NULL);
> > +    if (!rc) {
> > +        return rc;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +    int rc;
> > +    struct sigaction sig = {0};
> > +
> > +    sig.sa_sigaction = &signal_handler;
> > +    sig.sa_flags = SA_SIGINFO;
> > +
> > +    if (sigaction(SIGINT, &sig, NULL) < 0) {
> > +        syslog(LOG_ERR, "Fail to install SIGINT handler\n");
> > +        return -EAGAIN;
> > +    }
> > +
> > +    memset(&server, 0, sizeof(server));
> > +
> > +    parse_args(argc, argv);
> > +
> > +    rc = init();
> > +    if (rc) {
> > +        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
> > +        rc = -EAGAIN;
> > +        goto out;
> > +    }
> > +
> > +    run();
> > +
> > +out:
> > +    fini();
> > +
> > +    return rc;
> > +}
> > diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
> > new file mode 100644
> > index 0000000000..03508d52b2
> > --- /dev/null
> > +++ b/contrib/rdmacm-mux/rdmacm-mux.h
> > @@ -0,0 +1,56 @@
> > +/*
> > + * QEMU paravirtual RDMA - rdmacm-mux declarations
> > + *
> > + * Copyright (C) 2018 Oracle
> > + * Copyright (C) 2018 Red Hat Inc
> > + *
> > + * Authors:
> > + *     Yuval Shaia <yuval.shaia@oracle.com>
> > + *     Marcel Apfelbaum <marcel@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#ifndef RDMACM_MUX_H
> > +#define RDMACM_MUX_H
> > +
> > +#include "linux/if.h"
> > +#include "infiniband/verbs.h"
> > +#include "infiniband/umad.h"
> > +#include "rdma/rdma_user_cm.h"
> > +
> > +typedef enum RdmaCmMuxMsgType {
> > +    RDMACM_MUX_MSG_TYPE_REG   = 0,
> > +    RDMACM_MUX_MSG_TYPE_UNREG = 1,
> > +    RDMACM_MUX_MSG_TYPE_MAD   = 2,
> > +    RDMACM_MUX_MSG_TYPE_RESP  = 3,
> > +} RdmaCmMuxMsgType;
> > +
> > +typedef enum RdmaCmMuxErrCode {
> > +    RDMACM_MUX_ERR_CODE_OK        = 0,
> > +    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
> > +    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
> > +    RDMACM_MUX_ERR_CODE_EACCES    = 3,
> > +    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
> > +} RdmaCmMuxErrCode;
> > +
> > +typedef struct RdmaCmMuxHdr {
> > +    RdmaCmMuxMsgType msg_type;
> > +    union ibv_gid sgid;
> > +    RdmaCmMuxErrCode err_code;
> > +} RdmaCmUHdr;
> > +
> > +typedef struct RdmaCmUMad {
> > +    struct ib_user_mad hdr;
> > +    char mad[RDMA_MAX_PRIVATE_DATA];
> > +} RdmaCmUMad;
> > +
> > +typedef struct RdmaCmMuxMsg {
> > +    RdmaCmUHdr hdr;
> > +    int umad_len;
> > +    RdmaCmUMad umad;
> > +} RdmaCmMuxMsg;
> > +
> > +#endif
> > -- 
> > 2.17.2
> > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-10 18:27   ` Marcel Apfelbaum
@ 2018-11-11  7:45     ` Yuval Shaia
  2018-11-17 11:41       ` Marcel Apfelbaum
  0 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11  7:45 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, yuval.shaia

On Sat, Nov 10, 2018 at 08:27:44PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/8/18 6:08 PM, Yuval Shaia wrote:
> > Guest driver enforces it, we should also.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/vmw/pvrdma.h      | 2 ++
> >   hw/rdma/vmw/pvrdma_main.c | 3 +++
> >   2 files changed, 5 insertions(+)
> > 
> > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > index b019cb843a..10a3c4fb7c 100644
> > --- a/hw/rdma/vmw/pvrdma.h
> > +++ b/hw/rdma/vmw/pvrdma.h
> > @@ -20,6 +20,7 @@
> >   #include "hw/pci/pci.h"
> >   #include "hw/pci/msix.h"
> >   #include "chardev/char-fe.h"
> > +#include "hw/net/vmxnet3_defs.h"
> >   #include "../rdma_backend_defs.h"
> >   #include "../rdma_rm_defs.h"
> > @@ -85,6 +86,7 @@ typedef struct PVRDMADev {
> >       RdmaBackendDev backend_dev;
> >       RdmaDeviceResources rdma_dev_res;
> >       CharBackend mad_chr;
> > +    VMXNET3State *func0;
> >   } PVRDMADev;
> >   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
> > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > index ac8c092db0..fa6468d221 100644
> > --- a/hw/rdma/vmw/pvrdma_main.c
> > +++ b/hw/rdma/vmw/pvrdma_main.c
> > @@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >           return;
> >       }
> > +    /* Break if not vmxnet3 device in slot 0 */
> > +    dev->func0 = VMXNET3(pci_get_function_0(pdev));
> > +
> 
> I don't see the error code flow in case VMXNET3 is not func 0.
> Am I missing something?

Yes, this is a dynamic cast that aborts the process when the cast fails.

This is the error message you will get when the device on function 0 is not
vmxnet3:

pvrdma_main.c:589:pvrdma_realize: Object 0x557b959841a0 is not an instance of type vmxnet3
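
For reference, VMXNET3() is the usual OBJECT_CHECK()-style QOM cast, roughly:

    /* in hw/net/vmxnet3_defs.h (patch 12), more or less: */
    #define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)

so the type check (and the abort on mismatch) happens inside the cast itself.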

> 
> 
> Thanks,
> Marcel
> 
> >       memdev_root = object_resolve_path("/objects", NULL);
> >       if (memdev_root) {
> >           object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma
  2018-11-10 18:25   ` Marcel Apfelbaum
@ 2018-11-11  7:50     ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11  7:50 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch

On Sat, Nov 10, 2018 at 08:25:46PM +0200, Marcel Apfelbaum wrote:
> 
> > +#
> > +##
> > +{ 'event': 'RDMA_GID_STATUS_CHANGED',
> > +  'data': { 'netdev'        : 'str',
> > +            'gid-status'    : 'bool',
> 
> The 'gid-status' naming, as an indication of whether we add or remove a GID,
> is a little odd, but I can't come up with something better.

How about 'gid-in-use'?

> 
> Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>
> 
> Thanks,
> Marcel
> 
> > +            'subnet-prefix' : 'uint64',
> > +            'interface-id'  : 'uint64' } }
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion
  2018-11-10 18:21   ` Marcel Apfelbaum
@ 2018-11-11  8:04     ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11  8:04 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, yuval.shaia

On Sat, Nov 10, 2018 at 08:21:51PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/8/18 6:08 PM, Yuval Shaia wrote:
> > opcode for WC should be set by the device and not taken from work
> > element.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> > index 7b0f440fda..3388be1926 100644
> > --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> > +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> > @@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
> >           comp_ctx->cq_handle = qp->send_cq_handle;
> >           comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
> >           comp_ctx->cqe.qp = qp_handle;
> > -        comp_ctx->cqe.opcode = wqe->hdr.opcode;
> > +        comp_ctx->cqe.opcode = IBV_WC_SEND;
> 
> That is interesting, what should happen if the opcode in hdr is different?
> Maybe fail the operation?

openmpi builds its entire IB state machine on that, see here:

https://github.com/open-mpi/ompi/blob/3dc1629771177a883cd8f1be6e97ab152e0f4584/opal/mca/btl/openib/btl_openib_component.c#L3512

> 
> Thanks,
> Marcel
> 
> >           rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
> >                                  (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion
  2018-11-10 18:18   ` Marcel Apfelbaum
@ 2018-11-11  8:43     ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11  8:43 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, yuval.shaia

On Sat, Nov 10, 2018 at 08:18:58PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/8/18 6:08 PM, Yuval Shaia wrote:
> > The function pvrdma_post_cqe populates CQE entry with opcode from the
> > given completion element. For receive operation value was not set. Fix
> > it by setting it to IBV_WC_RECV.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
> > index 762700a205..7b0f440fda 100644
> > --- a/hw/rdma/vmw/pvrdma_qp_ops.c
> > +++ b/hw/rdma/vmw/pvrdma_qp_ops.c
> > @@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
> >           comp_ctx = g_malloc(sizeof(CompHandlerCtx));
> >           comp_ctx->dev = dev;
> >           comp_ctx->cq_handle = qp->recv_cq_handle;
> > -        comp_ctx->cqe.qp = qp_handle;
> >           comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
> > +        comp_ctx->cqe.qp = qp_handle;
> 
> Not sure the above chunk is needed.

Right, it is not related to the change, but I did it "while there" to be
consistent with the field order in pvrdma_qp_send() :)

> 
> > +        comp_ctx->cqe.opcode = IBV_WC_RECV;
> 
> Anyway
> 
> Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com>

Thanks.

> 
> Thanks,
> Marcel
> 
> >           rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
> >                                  &qp->backend_qp, qp->qp_type,
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL
  2018-11-10 17:59   ` Marcel Apfelbaum
@ 2018-11-11  9:12     ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11  9:12 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, yuval.shaia

On Sat, Nov 10, 2018 at 07:59:00PM +0200, Marcel Apfelbaum wrote:
> Hi Yuval,
> 
> On 11/8/18 6:07 PM, Yuval Shaia wrote:
> > Device is not supporting QP0, only QP1.
> > 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_backend.h | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> > index 86e8fe8ab6..3ccc9a2494 100644
> > --- a/hw/rdma/rdma_backend.h
> > +++ b/hw/rdma/rdma_backend.h
> > @@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev *dev)
> >   static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
> >   {
> > -    return qp->ibqp ? qp->ibqp->qp_num : 0;
> > +    return qp->ibqp ? qp->ibqp->qp_num : 1;
> 
> Just to be sure, what are the cases where we don't get a qp_num?
> Can we assume all of them are MADs?
> 
> Thanks,
> Marcel

qp->ibqp is set only when the QP type is not QP1 (see
rdma_backend_create_qp()), so when it is NULL we can safely assume this is QP1.

> 
> >   }
> >   static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets
  2018-11-10 18:15   ` Marcel Apfelbaum
@ 2018-11-11 10:31     ` Yuval Shaia
  2018-11-17 11:35       ` Marcel Apfelbaum
  0 siblings, 1 reply; 47+ messages in thread
From: Yuval Shaia @ 2018-11-11 10:31 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, yuval.shaia

On Sat, Nov 10, 2018 at 08:15:27PM +0200, Marcel Apfelbaum wrote:
> Hi Yuval
> 
> On 11/8/18 6:08 PM, Yuval Shaia wrote:
> > MAD (Management Datagram) packets are widely used by various modules
> > both in kernel and in user space for example the rdma_* API which is
> > used to create and maintain "connection" layer on top of RDMA uses
> > several types of MAD packets.
> 
> Can you add a link to the MAD spec to the commit message, or even in the code?

I have no idea where to take it from; does it require a subscription or
something?

> 
> > To support MAD packets the device uses an external utility
> > (contrib/rdmacm-mux) to relay packets from and to the guest driver.
> 
> Can the device be used without MADs support?

Since we have support now, I don't see a reason why we would want to use (or
even expose) the device without MAD support.

> If not, can you update the pvrdma documentation to
> reflect the changes?

Sure, missed that, will document the changes in v3.

> 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> >   hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
> >   hw/rdma/rdma_backend.h      |   4 +-
> >   hw/rdma/rdma_backend_defs.h |  10 +-
> >   hw/rdma/vmw/pvrdma.h        |   2 +
> >   hw/rdma/vmw/pvrdma_main.c   |   4 +-
> >   5 files changed, 273 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> > index 1e148398a2..3eb0099f8d 100644
> > --- a/hw/rdma/rdma_backend.c
> > +++ b/hw/rdma/rdma_backend.c
> > @@ -16,8 +16,13 @@
> >   #include "qemu/osdep.h"
> >   #include "qemu/error-report.h"
> >   #include "qapi/error.h"
> > +#include "qapi/qmp/qlist.h"
> > +#include "qapi/qmp/qnum.h"
> >   #include <infiniband/verbs.h>
> > +#include <infiniband/umad_types.h>
> > +#include <infiniband/umad.h>
> > +#include <rdma/rdma_user_cm.h>
> >   #include "trace.h"
> >   #include "rdma_utils.h"
> > @@ -33,16 +38,25 @@
> >   #define VENDOR_ERR_MAD_SEND         0x206
> >   #define VENDOR_ERR_INVLKEY          0x207
> >   #define VENDOR_ERR_MR_SMALL         0x208
> > +#define VENDOR_ERR_INV_MAD_BUFF     0x209
> > +#define VENDOR_ERR_INV_NUM_SGE      0x210
> >   #define THR_NAME_LEN 16
> >   #define THR_POLL_TO  5000
> > +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
> > +
> >   typedef struct BackendCtx {
> > -    uint64_t req_id;
> >       void *up_ctx;
> >       bool is_tx_req;
> > +    struct ibv_sge sge; /* Used to save MAD recv buffer */
> >   } BackendCtx;
> > +struct backend_umad {
> > +    struct ib_user_mad hdr;
> > +    char mad[RDMA_MAX_PRIVATE_DATA];
> > +};
> > +
> >   static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
> >   static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
> > @@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
> >       return 0;
> >   }
> > +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
> > +                    uint32_t num_sge)
> > +{
> > +    struct backend_umad umad = {0};
> > +    char *hdr, *msg;
> > +    int ret;
> > +
> > +    pr_dbg("num_sge=%d\n", num_sge);
> > +
> > +    if (num_sge != 2) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    umad.hdr.length = sge[0].length + sge[1].length;
> > +    pr_dbg("msg_len=%d\n", umad.hdr.length);
> > +
> > +    if (umad.hdr.length > sizeof(umad.mad)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umad.hdr.addr.qpn = htobe32(1);
> > +    umad.hdr.addr.grh_present = 1;
> > +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
> > +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > +    umad.hdr.addr.hop_limit = 1;
> > +
> > +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
> > +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
> > +
> > +    memcpy(&umad.mad[0], hdr, sge[0].length);
> > +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
> > +
> > +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
> > +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
> > +
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > +                            sizeof(umad));
> > +
> > +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
> > +
> > +    return (ret != sizeof(umad));
> > +}
> > +
> >   void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> >                               struct ibv_sge *sge, uint32_t num_sge,
> > @@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
> >               comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
> >           } else if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            rc = mad_send(backend_dev, sge, num_sge);
> > +            if (rc) {
> > +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            } else {
> > +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
> > +            }
> >           }
> > -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
> >           return;
> >       }
> > @@ -370,6 +431,48 @@ out_free_bctx:
> >       g_free(bctx);
> >   }
> > +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
> > +                                         struct ibv_sge *sge, uint32_t num_sge,
> > +                                         void *ctx)
> > +{
> > +    BackendCtx *bctx;
> > +    int rc;
> > +    uint32_t bctx_id;
> > +
> > +    if (num_sge != 1) {
> > +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
> > +        return VENDOR_ERR_INV_NUM_SGE;
> > +    }
> > +
> > +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
> > +        pr_dbg("Too small buffer for MAD\n");
> > +        return VENDOR_ERR_INV_MAD_BUFF;
> > +    }
> > +
> > +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
> > +    pr_dbg("length=%d\n", sge[0].length);
> > +    pr_dbg("lkey=%d\n", sge[0].lkey);
> > +
> > +    bctx = g_malloc0(sizeof(*bctx));
> > +
> > +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
> > +    if (unlikely(rc)) {
> > +        g_free(bctx);
> > +        pr_dbg("Fail to allocate cqe_ctx\n");
> > +        return VENDOR_ERR_NOMEM;
> > +    }
> > +
> > +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
> > +    bctx->up_ctx = ctx;
> > +    bctx->sge = *sge;
> > +
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +
> > +    return 0;
> > +}
> > +
> >   void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> >                               RdmaDeviceResources *rdma_dev_res,
> >                               RdmaBackendQP *qp, uint8_t qp_type,
> > @@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
> >           }
> >           if (qp_type == IBV_QPT_GSI) {
> >               pr_dbg("QP1\n");
> > -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
> > +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
> > +            if (rc) {
> > +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
> > +            }
> >           }
> >           return;
> >       }
> > @@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> >       switch (qp_type) {
> >       case IBV_QPT_GSI:
> > -        pr_dbg("QP1 unsupported\n");
> >           return 0;
> >       case IBV_QPT_RC:
> > @@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
> >       return 0;
> >   }
> > +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
> > +                                 union ibv_gid *my_gid, int paylen)
> > +{
> > +    grh->paylen = htons(paylen);
> > +    grh->sgid = *sgid;
> > +    grh->dgid = *my_gid;
> > +
> > +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
> > +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
> > +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
> > +}
> > +
> > +static inline int mad_can_receieve(void *opaque)
> > +{
> > +    return sizeof(struct backend_umad);
> > +}
> > +
> > +static void mad_read(void *opaque, const uint8_t *buf, int size)
> > +{
> > +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
> > +    QObject *o_ctx_id;
> > +    unsigned long cqe_ctx_id;
> > +    BackendCtx *bctx;
> > +    char *mad;
> > +    struct backend_umad *umad;
> > +
> > +    assert(size != sizeof(umad));
> > +    umad = (struct backend_umad *)buf;
> > +
> > +    pr_dbg("Got %d bytes\n", size);
> > +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
> > +
> > +#ifdef PVRDMA_DEBUG
> > +    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
> > +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
> > +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
> > +           hdr->method, hdr->status, be64toh(hdr->tid),
> > +           hdr->attr_id, hdr->attr_mod);
> > +#endif
> > +
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> > +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +    if (!o_ctx_id) {
> > +        pr_dbg("No more free MADs buffers, waiting for a while\n");
> > +        sleep(THR_POLL_TO);
> 
> Why do we sleep here? Seems a little odd.

Well, this should never happen. The guest driver, on load, posts 512
buffers and then refills them on every MAD completion.
But what should the device do when the pool is empty? The best I found is
to wait and retry, just to cover a burst that should clear soon.

Any other idea is welcome.
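
If a single long sleep looks too odd, a bounded wait-and-retry could replace
it - a rough, untested sketch ('i' would be a new local and the two constants
are made-up names; it still blocks, like the current sleep):

    for (i = 0; !o_ctx_id && i < MAD_BUFF_RETRIES; i++) {
        g_usleep(MAD_BUFF_WAIT_US);
        qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
        qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
    }
    if (!o_ctx_id) {
        pr_dbg("No free MAD buffer, dropping the MAD\n");
        return;
    }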

> 
> > +        return;
> > +    }
> > +
> > +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> > +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +    if (unlikely(!bctx)) {
> > +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
> > +        return;
> > +    }
> > +
> > +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
> > +
> > +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
> > +                           bctx->sge.length);
> > +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
> > +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
> > +                     bctx->up_ctx);
> > +    } else {
> > +        memset(mad, 0, bctx->sge.length);
> > +        build_mad_hdr((struct ibv_grh *)mad,
> > +                      (union ibv_gid *)&umad->hdr.addr.gid,
> > +                      &backend_dev->gid, umad->hdr.length);
> > +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
> > +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
> > +
> > +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
> > +    }
> > +
> > +    g_free(bctx);
> > +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +}
> > +
> > +static int mad_init(RdmaBackendDev *backend_dev)
> > +{
> > +    struct backend_umad umad = {0};
> > +    int ret;
> > +
> > +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
> > +        pr_dbg("Missing chardev for MAD multiplexer\n");
> > +        return -EIO;
> > +    }
> > +
> > +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
> > +                             mad_read, NULL, NULL, backend_dev, NULL, true);
> > +
> > +    /* Register ourself */
> > +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
> > +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
> > +                            sizeof(umad.hdr));
> > +    if (ret != sizeof(umad.hdr)) {
> > +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
> 
> Why only a dbg message and not failing the init in this case?

You are correct, this code is moved and fixed in patch #11.
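
For reference, the direction of that fix is simply to propagate the error
so realize aborts, along these lines (a sketch only; the actual change is
in patch #11):

    if (ret != sizeof(umad.hdr)) {
        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
        return -EIO;   /* mad_init() fails, pvrdma_realize() bails out */
    }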

> 
> > +    }
> > +
> > +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
> > +    backend_dev->recv_mads_list.list = qlist_new();
> > +
> > +    return 0;
> > +}
> > +
> > +static void mad_stop(RdmaBackendDev *backend_dev)
> > +{
> > +    QObject *o_ctx_id;
> > +    unsigned long cqe_ctx_id;
> > +    BackendCtx *bctx;
> > +
> > +    pr_dbg("Closing MAD\n");
> > +
> > +    /* Clear MAD buffers list */
> > +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
> 
> Does it make sense to lock only around the
> qlist_pop call?

Yes, it does.
But since this function is called when the device is going down, I do not
see a reason to release the lock just to allow "list" users to append
entries in between.
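
For completeness, the narrower scope being discussed would look roughly
like this (sketch only, not what the patch does):

    do {
        qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
        qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
        if (o_ctx_id) {
            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
            if (bctx) {
                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
                g_free(bctx);
            }
        }
    } while (o_ctx_id);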

> 
> 
> Thanks,
> Marcel
> 
> > +    do {
> > +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
> > +        if (o_ctx_id) {
> > +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
> > +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +            if (bctx) {
> > +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
> > +                g_free(bctx);
> > +            }
> > +        }
> > +    } while (o_ctx_id);
> > +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
> > +}
> > +
> > +static void mad_fini(RdmaBackendDev *backend_dev)
> > +{
> > +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
> > +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
> > +}
> > +
> >   int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> >                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      Error **errp)
> > +                      CharBackend *mad_chr_be, Error **errp)
> >   {
> >       int i;
> >       int ret = 0;
> > @@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       memset(backend_dev, 0, sizeof(*backend_dev));
> >       backend_dev->dev = pdev;
> > -
> > +    backend_dev->mad_chr_be = mad_chr_be;
> >       backend_dev->backend_gid_idx = backend_gid_idx;
> >       backend_dev->port_num = port_num;
> >       backend_dev->rdma_dev_res = rdma_dev_res;
> > @@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >       pr_dbg("interface_id=0x%" PRIx64 "\n",
> >              be64_to_cpu(backend_dev->gid.global.interface_id));
> > +    ret = mad_init(backend_dev);
> > +    if (ret) {
> > +        error_setg(errp, "Fail to initialize mad");
> > +        ret = -EIO;
> > +        goto out_destroy_comm_channel;
> > +    }
> > +
> >       backend_dev->comp_thread.run = false;
> >       backend_dev->comp_thread.is_running = false;
> > @@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
> >   {
> >       pr_dbg("Stopping rdma_backend\n");
> >       stop_backend_thread(&backend_dev->comp_thread);
> > +    mad_stop(backend_dev);
> >   }
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev)
> >   {
> >       rdma_backend_stop(backend_dev);
> > +    mad_fini(backend_dev);
> >       g_hash_table_destroy(ah_hash);
> >       ibv_destroy_comp_channel(backend_dev->channel);
> >       ibv_close_device(backend_dev->context);
> > diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> > index 3ccc9a2494..fc83330251 100644
> > --- a/hw/rdma/rdma_backend.h
> > +++ b/hw/rdma/rdma_backend.h
> > @@ -17,6 +17,8 @@
> >   #define RDMA_BACKEND_H
> >   #include "qapi/error.h"
> > +#include "chardev/char-fe.h"
> > +
> >   #include "rdma_rm_defs.h"
> >   #include "rdma_backend_defs.h"
> > @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
> >                         RdmaDeviceResources *rdma_dev_res,
> >                         const char *backend_device_name, uint8_t port_num,
> >                         uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
> > -                      Error **errp);
> > +                      CharBackend *mad_chr_be, Error **errp);
> >   void rdma_backend_fini(RdmaBackendDev *backend_dev);
> >   void rdma_backend_start(RdmaBackendDev *backend_dev);
> >   void rdma_backend_stop(RdmaBackendDev *backend_dev);
> > diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
> > index 7404f64002..2a7e667075 100644
> > --- a/hw/rdma/rdma_backend_defs.h
> > +++ b/hw/rdma/rdma_backend_defs.h
> > @@ -16,8 +16,9 @@
> >   #ifndef RDMA_BACKEND_DEFS_H
> >   #define RDMA_BACKEND_DEFS_H
> > -#include <infiniband/verbs.h>
> >   #include "qemu/thread.h"
> > +#include "chardev/char-fe.h"
> > +#include <infiniband/verbs.h>
> >   typedef struct RdmaDeviceResources RdmaDeviceResources;
> > @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
> >       bool is_running; /* Set by the thread to report its status */
> >   } RdmaBackendThread;
> > +typedef struct RecvMadList {
> > +    QemuMutex lock;
> > +    QList *list;
> > +} RecvMadList;
> > +
> >   typedef struct RdmaBackendDev {
> >       struct ibv_device_attr dev_attr;
> >       RdmaBackendThread comp_thread;
> > @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
> >       struct ibv_comp_channel *channel;
> >       uint8_t port_num;
> >       uint8_t backend_gid_idx;
> > +    RecvMadList recv_mads_list;
> > +    CharBackend *mad_chr_be;
> >   } RdmaBackendDev;
> >   typedef struct RdmaBackendPD {
> > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > index e2d9f93cdf..e3742d893a 100644
> > --- a/hw/rdma/vmw/pvrdma.h
> > +++ b/hw/rdma/vmw/pvrdma.h
> > @@ -19,6 +19,7 @@
> >   #include "qemu/units.h"
> >   #include "hw/pci/pci.h"
> >   #include "hw/pci/msix.h"
> > +#include "chardev/char-fe.h"
> >   #include "../rdma_backend_defs.h"
> >   #include "../rdma_rm_defs.h"
> > @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
> >       uint8_t backend_port_num;
> >       RdmaBackendDev backend_dev;
> >       RdmaDeviceResources rdma_dev_res;
> > +    CharBackend mad_chr;
> >   } PVRDMADev;
> >   #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
> > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > index ca5fa8d981..6c8c0154fa 100644
> > --- a/hw/rdma/vmw/pvrdma_main.c
> > +++ b/hw/rdma/vmw/pvrdma_main.c
> > @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
> >       DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
> >                         dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
> >       DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
> > +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
> >       DEFINE_PROP_END_OF_LIST(),
> >   };
> > @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> >       rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
> >                              dev->backend_device_name, dev->backend_port_num,
> > -                           dev->backend_gid_idx, &dev->dev_attr, errp);
> > +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
> > +                           errp);
> >       if (rc) {
> >           goto out;
> >       }
> 


* Re: [Qemu-devel] [PATCH v2 12/22] vmxnet3: Move some definitions to header file
  2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 12/22] vmxnet3: Move some definitions to header file Yuval Shaia
@ 2018-11-12 13:56   ` Dmitry Fleytman
  0 siblings, 0 replies; 47+ messages in thread
From: Dmitry Fleytman @ 2018-11-12 13:56 UTC (permalink / raw)
  To: yuval.shaia
  Cc: marcel.apfelbaum, Jason Wang, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Qemu Developers, shamir.rabinovitch

On Thu, Nov 8, 2018 at 6:09 PM Yuval Shaia <yuval.shaia@oracle.com> wrote:
>
> pvrdma setup requires vmxnet3 device on PCI function 0 and PVRDMA device
> on PCI function 1.
> pvrdma device needs to access vmxnet3 device object for several reasons:
> 1. Make sure PCI function 0 is vmxnet3.
> 2. To monitor vmxnet3 device state.
> 3. To configure node_guid according to the vmxnet3 device's MAC address.
>
> To be able to access vmxnet3 device the definition of VMXNET3State is
> moved to a new header file.
>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>

Reviewed-by: Dmitry Fleytman <dmitry.fleytman@gmail.com>

> ---
>  hw/net/vmxnet3.c      | 116 +-----------------------------------
>  hw/net/vmxnet3_defs.h | 133 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 134 insertions(+), 115 deletions(-)
>  create mode 100644 hw/net/vmxnet3_defs.h
>
> diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
> index 3648630386..54746a4030 100644
> --- a/hw/net/vmxnet3.c
> +++ b/hw/net/vmxnet3.c
> @@ -18,7 +18,6 @@
>  #include "qemu/osdep.h"
>  #include "hw/hw.h"
>  #include "hw/pci/pci.h"
> -#include "net/net.h"
>  #include "net/tap.h"
>  #include "net/checksum.h"
>  #include "sysemu/sysemu.h"
> @@ -29,6 +28,7 @@
>  #include "migration/register.h"
>
>  #include "vmxnet3.h"
> +#include "vmxnet3_defs.h"
>  #include "vmxnet_debug.h"
>  #include "vmware_utils.h"
>  #include "net_tx_pkt.h"
> @@ -131,23 +131,11 @@ typedef struct VMXNET3Class {
>      DeviceRealize parent_dc_realize;
>  } VMXNET3Class;
>
> -#define TYPE_VMXNET3 "vmxnet3"
> -#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
> -
>  #define VMXNET3_DEVICE_CLASS(klass) \
>      OBJECT_CLASS_CHECK(VMXNET3Class, (klass), TYPE_VMXNET3)
>  #define VMXNET3_DEVICE_GET_CLASS(obj) \
>      OBJECT_GET_CLASS(VMXNET3Class, (obj), TYPE_VMXNET3)
>
> -/* Cyclic ring abstraction */
> -typedef struct {
> -    hwaddr pa;
> -    uint32_t size;
> -    uint32_t cell_size;
> -    uint32_t next;
> -    uint8_t gen;
> -} Vmxnet3Ring;
> -
>  static inline void vmxnet3_ring_init(PCIDevice *d,
>                                      Vmxnet3Ring *ring,
>                                       hwaddr pa,
> @@ -245,108 +233,6 @@ vmxnet3_dump_rx_descr(struct Vmxnet3_RxDesc *descr)
>                descr->rsvd, descr->dtype, descr->ext1, descr->btype);
>  }
>
> -/* Device state and helper functions */
> -#define VMXNET3_RX_RINGS_PER_QUEUE (2)
> -
> -typedef struct {
> -    Vmxnet3Ring tx_ring;
> -    Vmxnet3Ring comp_ring;
> -
> -    uint8_t intr_idx;
> -    hwaddr tx_stats_pa;
> -    struct UPT1_TxStats txq_stats;
> -} Vmxnet3TxqDescr;
> -
> -typedef struct {
> -    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
> -    Vmxnet3Ring comp_ring;
> -    uint8_t intr_idx;
> -    hwaddr rx_stats_pa;
> -    struct UPT1_RxStats rxq_stats;
> -} Vmxnet3RxqDescr;
> -
> -typedef struct {
> -    bool is_masked;
> -    bool is_pending;
> -    bool is_asserted;
> -} Vmxnet3IntState;
> -
> -typedef struct {
> -        PCIDevice parent_obj;
> -        NICState *nic;
> -        NICConf conf;
> -        MemoryRegion bar0;
> -        MemoryRegion bar1;
> -        MemoryRegion msix_bar;
> -
> -        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
> -        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
> -
> -        /* Whether MSI-X support was installed successfully */
> -        bool msix_used;
> -        hwaddr drv_shmem;
> -        hwaddr temp_shared_guest_driver_memory;
> -
> -        uint8_t txq_num;
> -
> -        /* This boolean tells whether RX packet being indicated has to */
> -        /* be split into head and body chunks from different RX rings  */
> -        bool rx_packets_compound;
> -
> -        bool rx_vlan_stripping;
> -        bool lro_supported;
> -
> -        uint8_t rxq_num;
> -
> -        /* Network MTU */
> -        uint32_t mtu;
> -
> -        /* Maximum number of fragments for indicated TX packets */
> -        uint32_t max_tx_frags;
> -
> -        /* Maximum number of fragments for indicated RX packets */
> -        uint16_t max_rx_frags;
> -
> -        /* Index for events interrupt */
> -        uint8_t event_int_idx;
> -
> -        /* Whether automatic interrupts masking enabled */
> -        bool auto_int_masking;
> -
> -        bool peer_has_vhdr;
> -
> -        /* TX packets to QEMU interface */
> -        struct NetTxPkt *tx_pkt;
> -        uint32_t offload_mode;
> -        uint32_t cso_or_gso_size;
> -        uint16_t tci;
> -        bool needs_vlan;
> -
> -        struct NetRxPkt *rx_pkt;
> -
> -        bool tx_sop;
> -        bool skip_current_tx_pkt;
> -
> -        uint32_t device_active;
> -        uint32_t last_command;
> -
> -        uint32_t link_status_and_speed;
> -
> -        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
> -
> -        uint32_t temp_mac;   /* To store the low part first */
> -
> -        MACAddr perm_mac;
> -        uint32_t vlan_table[VMXNET3_VFT_SIZE];
> -        uint32_t rx_mode;
> -        MACAddr *mcast_list;
> -        uint32_t mcast_list_len;
> -        uint32_t mcast_list_buff_size; /* needed for live migration. */
> -
> -        /* Compatibility flags for migration */
> -        uint32_t compat_flags;
> -} VMXNET3State;
> -
>  /* Interrupt management */
>
>  /*
> diff --git a/hw/net/vmxnet3_defs.h b/hw/net/vmxnet3_defs.h
> new file mode 100644
> index 0000000000..6c19d29b12
> --- /dev/null
> +++ b/hw/net/vmxnet3_defs.h
> @@ -0,0 +1,133 @@
> +/*
> + * QEMU VMWARE VMXNET3 paravirtual NIC
> + *
> + * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
> + *
> + * Developed by Daynix Computing LTD (http://www.daynix.com)
> + *
> + * Authors:
> + * Dmitry Fleytman <dmitry@daynix.com>
> + * Tamir Shomer <tamirs@daynix.com>
> + * Yan Vugenfirer <yan@daynix.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "net/net.h"
> +#include "hw/net/vmxnet3.h"
> +
> +#define TYPE_VMXNET3 "vmxnet3"
> +#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
> +
> +/* Device state and helper functions */
> +#define VMXNET3_RX_RINGS_PER_QUEUE (2)
> +
> +/* Cyclic ring abstraction */
> +typedef struct {
> +    hwaddr pa;
> +    uint32_t size;
> +    uint32_t cell_size;
> +    uint32_t next;
> +    uint8_t gen;
> +} Vmxnet3Ring;
> +
> +typedef struct {
> +    Vmxnet3Ring tx_ring;
> +    Vmxnet3Ring comp_ring;
> +
> +    uint8_t intr_idx;
> +    hwaddr tx_stats_pa;
> +    struct UPT1_TxStats txq_stats;
> +} Vmxnet3TxqDescr;
> +
> +typedef struct {
> +    Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
> +    Vmxnet3Ring comp_ring;
> +    uint8_t intr_idx;
> +    hwaddr rx_stats_pa;
> +    struct UPT1_RxStats rxq_stats;
> +} Vmxnet3RxqDescr;
> +
> +typedef struct {
> +    bool is_masked;
> +    bool is_pending;
> +    bool is_asserted;
> +} Vmxnet3IntState;
> +
> +typedef struct {
> +        PCIDevice parent_obj;
> +        NICState *nic;
> +        NICConf conf;
> +        MemoryRegion bar0;
> +        MemoryRegion bar1;
> +        MemoryRegion msix_bar;
> +
> +        Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
> +        Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
> +
> +        /* Whether MSI-X support was installed successfully */
> +        bool msix_used;
> +        hwaddr drv_shmem;
> +        hwaddr temp_shared_guest_driver_memory;
> +
> +        uint8_t txq_num;
> +
> +        /* This boolean tells whether RX packet being indicated has to */
> +        /* be split into head and body chunks from different RX rings  */
> +        bool rx_packets_compound;
> +
> +        bool rx_vlan_stripping;
> +        bool lro_supported;
> +
> +        uint8_t rxq_num;
> +
> +        /* Network MTU */
> +        uint32_t mtu;
> +
> +        /* Maximum number of fragments for indicated TX packets */
> +        uint32_t max_tx_frags;
> +
> +        /* Maximum number of fragments for indicated RX packets */
> +        uint16_t max_rx_frags;
> +
> +        /* Index for events interrupt */
> +        uint8_t event_int_idx;
> +
> +        /* Whether automatic interrupts masking enabled */
> +        bool auto_int_masking;
> +
> +        bool peer_has_vhdr;
> +
> +        /* TX packets to QEMU interface */
> +        struct NetTxPkt *tx_pkt;
> +        uint32_t offload_mode;
> +        uint32_t cso_or_gso_size;
> +        uint16_t tci;
> +        bool needs_vlan;
> +
> +        struct NetRxPkt *rx_pkt;
> +
> +        bool tx_sop;
> +        bool skip_current_tx_pkt;
> +
> +        uint32_t device_active;
> +        uint32_t last_command;
> +
> +        uint32_t link_status_and_speed;
> +
> +        Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
> +
> +        uint32_t temp_mac;   /* To store the low part first */
> +
> +        MACAddr perm_mac;
> +        uint32_t vlan_table[VMXNET3_VFT_SIZE];
> +        uint32_t rx_mode;
> +        MACAddr *mcast_list;
> +        uint32_t mcast_list_len;
> +        uint32_t mcast_list_buff_size; /* needed for live migration. */
> +
> +        /* Compatibility flags for migration */
> +        uint32_t compat_flags;
> +} VMXNET3State;
> --
> 2.17.2
>


* Re: [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets
  2018-11-11 10:31     ` Yuval Shaia
@ 2018-11-17 11:35       ` Marcel Apfelbaum
  0 siblings, 0 replies; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 11:35 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch

Hi Yuval,

On 11/11/18 12:31 PM, Yuval Shaia wrote:
> On Sat, Nov 10, 2018 at 08:15:27PM +0200, Marcel Apfelbaum wrote:
>> Hi Yuval
>>
>> On 11/8/18 6:08 PM, Yuval Shaia wrote:
>>> MAD (Management Datagram) packets are widely used by various modules,
>>> both in kernel and in user space. For example, the rdma_* API, which is
>>> used to create and maintain a "connection" layer on top of RDMA, uses
>>> several types of MAD packets.
>> Can you add a link to the MAD spec to the commit, or even in the code?
> I have no idea where to get it from; does it require some subscription or
> so?

No subscription required:
     https://www.infinibandta.org/ibta-specifications-download/
     Volume 1 Architecture Specification, Release 1.1
     Chapter 13.4


>>> To support MAD packets the device uses an external utility
>>> (contrib/rdmacm-mux) to relay packets from and to the guest driver.
>> Can the device be used without MADs support?
> Since we have support now, I don't see a reason why we would want to use
> (or even expose) the device without MAD support.

Good point, we just need to make sure users know how to enable MADs.
Thanks,
Marcel
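
For reference, enabling MADs with this series means running the rdmacm-mux
helper from contrib and pointing the device's new mad-chardev property at
it, roughly along these lines (the exact rdmacm-mux flags and the socket
path are assumptions here; patch #1 is authoritative):

    # start the MAD multiplexer on the host (flags are an assumption)
    contrib/rdmacm-mux/rdmacm-mux -d mlx5_0 -p 1 &

    # wire the pvrdma device to it; function 0 must be vmxnet3 (patch #13)
    qemu-system-x86_64 ... \
      -chardev socket,path=/path/to/rdmacm-mux-socket,id=mads \
      -device vmxnet3,multifunction=on,addr=<slot>.0 \
      -device pvrdma,addr=<slot>.1,mad-chardev=mads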

>> If not, can you update the pvrdma documentation to
>> reflect the changes?
> Sure, missed that, will document the changes in v3.
>
>>> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
>>> ---
>>>    hw/rdma/rdma_backend.c      | 263 +++++++++++++++++++++++++++++++++++-
>>>    hw/rdma/rdma_backend.h      |   4 +-
>>>    hw/rdma/rdma_backend_defs.h |  10 +-
>>>    hw/rdma/vmw/pvrdma.h        |   2 +
>>>    hw/rdma/vmw/pvrdma_main.c   |   4 +-
>>>    5 files changed, 273 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
>>> index 1e148398a2..3eb0099f8d 100644
>>> --- a/hw/rdma/rdma_backend.c
>>> +++ b/hw/rdma/rdma_backend.c
>>> @@ -16,8 +16,13 @@
>>>    #include "qemu/osdep.h"
>>>    #include "qemu/error-report.h"
>>>    #include "qapi/error.h"
>>> +#include "qapi/qmp/qlist.h"
>>> +#include "qapi/qmp/qnum.h"
>>>    #include <infiniband/verbs.h>
>>> +#include <infiniband/umad_types.h>
>>> +#include <infiniband/umad.h>
>>> +#include <rdma/rdma_user_cm.h>
>>>    #include "trace.h"
>>>    #include "rdma_utils.h"
>>> @@ -33,16 +38,25 @@
>>>    #define VENDOR_ERR_MAD_SEND         0x206
>>>    #define VENDOR_ERR_INVLKEY          0x207
>>>    #define VENDOR_ERR_MR_SMALL         0x208
>>> +#define VENDOR_ERR_INV_MAD_BUFF     0x209
>>> +#define VENDOR_ERR_INV_NUM_SGE      0x210
>>>    #define THR_NAME_LEN 16
>>>    #define THR_POLL_TO  5000
>>> +#define MAD_HDR_SIZE sizeof(struct ibv_grh)
>>> +
>>>    typedef struct BackendCtx {
>>> -    uint64_t req_id;
>>>        void *up_ctx;
>>>        bool is_tx_req;
>>> +    struct ibv_sge sge; /* Used to save MAD recv buffer */
>>>    } BackendCtx;
>>> +struct backend_umad {
>>> +    struct ib_user_mad hdr;
>>> +    char mad[RDMA_MAX_PRIVATE_DATA];
>>> +};
>>> +
>>>    static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
>>>    static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
>>> @@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
>>>        return 0;
>>>    }
>>> +static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
>>> +                    uint32_t num_sge)
>>> +{
>>> +    struct backend_umad umad = {0};
>>> +    char *hdr, *msg;
>>> +    int ret;
>>> +
>>> +    pr_dbg("num_sge=%d\n", num_sge);
>>> +
>>> +    if (num_sge != 2) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    umad.hdr.length = sge[0].length + sge[1].length;
>>> +    pr_dbg("msg_len=%d\n", umad.hdr.length);
>>> +
>>> +    if (umad.hdr.length > sizeof(umad.mad)) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    umad.hdr.addr.qpn = htobe32(1);
>>> +    umad.hdr.addr.grh_present = 1;
>>> +    umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
>>> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
>>> +    umad.hdr.addr.hop_limit = 1;
>>> +
>>> +    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
>>> +    msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
>>> +
>>> +    memcpy(&umad.mad[0], hdr, sge[0].length);
>>> +    memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
>>> +
>>> +    rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
>>> +    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
>>> +
>>> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
>>> +                            sizeof(umad));
>>> +
>>> +    pr_dbg("qemu_chr_fe_write=%d\n", ret);
>>> +
>>> +    return (ret != sizeof(umad));
>>> +}
>>> +
>>>    void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>>>                                RdmaBackendQP *qp, uint8_t qp_type,
>>>                                struct ibv_sge *sge, uint32_t num_sge,
>>> @@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
>>>                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
>>>            } else if (qp_type == IBV_QPT_GSI) {
>>>                pr_dbg("QP1\n");
>>> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
>>> +            rc = mad_send(backend_dev, sge, num_sge);
>>> +            if (rc) {
>>> +                comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
>>> +            } else {
>>> +                comp_handler(IBV_WC_SUCCESS, 0, ctx);
>>> +            }
>>>            }
>>> -        pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
>>>            return;
>>>        }
>>> @@ -370,6 +431,48 @@ out_free_bctx:
>>>        g_free(bctx);
>>>    }
>>> +static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
>>> +                                         struct ibv_sge *sge, uint32_t num_sge,
>>> +                                         void *ctx)
>>> +{
>>> +    BackendCtx *bctx;
>>> +    int rc;
>>> +    uint32_t bctx_id;
>>> +
>>> +    if (num_sge != 1) {
>>> +        pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
>>> +        return VENDOR_ERR_INV_NUM_SGE;
>>> +    }
>>> +
>>> +    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
>>> +        pr_dbg("Too small buffer for MAD\n");
>>> +        return VENDOR_ERR_INV_MAD_BUFF;
>>> +    }
>>> +
>>> +    pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
>>> +    pr_dbg("length=%d\n", sge[0].length);
>>> +    pr_dbg("lkey=%d\n", sge[0].lkey);
>>> +
>>> +    bctx = g_malloc0(sizeof(*bctx));
>>> +
>>> +    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
>>> +    if (unlikely(rc)) {
>>> +        g_free(bctx);
>>> +        pr_dbg("Fail to allocate cqe_ctx\n");
>>> +        return VENDOR_ERR_NOMEM;
>>> +    }
>>> +
>>> +    pr_dbg("bctx_id %d, bctx %p, ctx %p\n", bctx_id, bctx, ctx);
>>> +    bctx->up_ctx = ctx;
>>> +    bctx->sge = *sge;
>>> +
>>> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
>>> +    qlist_append_int(backend_dev->recv_mads_list.list, bctx_id);
>>> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>    void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>>>                                RdmaDeviceResources *rdma_dev_res,
>>>                                RdmaBackendQP *qp, uint8_t qp_type,
>>> @@ -388,7 +491,10 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
>>>            }
>>>            if (qp_type == IBV_QPT_GSI) {
>>>                pr_dbg("QP1\n");
>>> -            comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
>>> +            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
>>> +            if (rc) {
>>> +                comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
>>> +            }
>>>            }
>>>            return;
>>>        }
>>> @@ -517,7 +623,6 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
>>>        switch (qp_type) {
>>>        case IBV_QPT_GSI:
>>> -        pr_dbg("QP1 unsupported\n");
>>>            return 0;
>>>        case IBV_QPT_RC:
>>> @@ -748,11 +853,146 @@ static int init_device_caps(RdmaBackendDev *backend_dev,
>>>        return 0;
>>>    }
>>> +static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
>>> +                                 union ibv_gid *my_gid, int paylen)
>>> +{
>>> +    grh->paylen = htons(paylen);
>>> +    grh->sgid = *sgid;
>>> +    grh->dgid = *my_gid;
>>> +
>>> +    pr_dbg("paylen=%d (net=0x%x)\n", paylen, grh->paylen);
>>> +    pr_dbg("my_gid=0x%llx\n", my_gid->global.interface_id);
>>> +    pr_dbg("gid=0x%llx\n", sgid->global.interface_id);
>>> +}
>>> +
>>> +static inline int mad_can_receieve(void *opaque)
>>> +{
>>> +    return sizeof(struct backend_umad);
>>> +}
>>> +
>>> +static void mad_read(void *opaque, const uint8_t *buf, int size)
>>> +{
>>> +    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
>>> +    QObject *o_ctx_id;
>>> +    unsigned long cqe_ctx_id;
>>> +    BackendCtx *bctx;
>>> +    char *mad;
>>> +    struct backend_umad *umad;
>>> +
>>> +    assert(size != sizeof(umad));
>>> +    umad = (struct backend_umad *)buf;
>>> +
>>> +    pr_dbg("Got %d bytes\n", size);
>>> +    pr_dbg("umad->hdr.length=%d\n", umad->hdr.length);
>>> +
>>> +#ifdef PVRDMA_DEBUG
>>> +    struct umad_hdr *hdr = (struct umad_hdr *)&msg->umad.mad;
>>> +    pr_dbg("bv %x cls %x cv %x mtd %x st %d tid %" PRIx64 " at %x atm %x\n",
>>> +           hdr->base_version, hdr->mgmt_class, hdr->class_version,
>>> +           hdr->method, hdr->status, be64toh(hdr->tid),
>>> +           hdr->attr_id, hdr->attr_mod);
>>> +#endif
>>> +
>>> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
>>> +    o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
>>> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
>>> +    if (!o_ctx_id) {
>>> +        pr_dbg("No more free MADs buffers, waiting for a while\n");
>>> +        sleep(THR_POLL_TO);
>> Why do we sleep here? Seems a little odd.
> Well, this should never happen. The guest driver, on load, posts 512
> receive buffers and then refills the pool on every MAD completion.
> But what should the device do when the pool is empty? The best I could
> come up with is to wait and retry, just to cover a short burst that will
> resolve itself soon.
>
> Any other idea is welcome.
>
>>> +        return;
>>> +    }
>>> +
>>> +    cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
>>> +    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
>>> +    if (unlikely(!bctx)) {
>>> +        pr_dbg("Error: Fail to find ctx for %ld\n", cqe_ctx_id);
>>> +        return;
>>> +    }
>>> +
>>> +    pr_dbg("id %ld, bctx %p, ctx %p\n", cqe_ctx_id, bctx, bctx->up_ctx);
>>> +
>>> +    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
>>> +                           bctx->sge.length);
>>> +    if (!mad || bctx->sge.length < umad->hdr.length + MAD_HDR_SIZE) {
>>> +        comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
>>> +                     bctx->up_ctx);
>>> +    } else {
>>> +        memset(mad, 0, bctx->sge.length);
>>> +        build_mad_hdr((struct ibv_grh *)mad,
>>> +                      (union ibv_gid *)&umad->hdr.addr.gid,
>>> +                      &backend_dev->gid, umad->hdr.length);
>>> +        memcpy(&mad[MAD_HDR_SIZE], umad->mad, umad->hdr.length);
>>> +        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
>>> +
>>> +        comp_handler(IBV_WC_SUCCESS, 0, bctx->up_ctx);
>>> +    }
>>> +
>>> +    g_free(bctx);
>>> +    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
>>> +}
>>> +
>>> +static int mad_init(RdmaBackendDev *backend_dev)
>>> +{
>>> +    struct backend_umad umad = {0};
>>> +    int ret;
>>> +
>>> +    if (!qemu_chr_fe_backend_connected(backend_dev->mad_chr_be)) {
>>> +        pr_dbg("Missing chardev for MAD multiplexer\n");
>>> +        return -EIO;
>>> +    }
>>> +
>>> +    qemu_chr_fe_set_handlers(backend_dev->mad_chr_be, mad_can_receieve,
>>> +                             mad_read, NULL, NULL, backend_dev, NULL, true);
>>> +
>>> +    /* Register ourself */
>>> +    memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
>>> +    ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
>>> +                            sizeof(umad.hdr));
>>> +    if (ret != sizeof(umad.hdr)) {
>>> +        pr_dbg("Fail to register to rdma_umadmux (%d)\n", ret);
>> Why only a dbg message and not failing the init in this case?
> You are correct, this code is moved and fixed in patch #11.
>
>>> +    }
>>> +
>>> +    qemu_mutex_init(&backend_dev->recv_mads_list.lock);
>>> +    backend_dev->recv_mads_list.list = qlist_new();
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void mad_stop(RdmaBackendDev *backend_dev)
>>> +{
>>> +    QObject *o_ctx_id;
>>> +    unsigned long cqe_ctx_id;
>>> +    BackendCtx *bctx;
>>> +
>>> +    pr_dbg("Closing MAD\n");
>>> +
>>> +    /* Clear MAD buffers list */
>>> +    qemu_mutex_lock(&backend_dev->recv_mads_list.lock);
>> Does it make sense to lock only around the
>> qlist_pop call?
> Yes, it does.
> But since this function is called when the device is going down, I do not
> see a reason to release the lock just to allow "list" users to append
> entries in between.
>
>>
>> Thanks,
>> Marcel
>>
>>> +    do {
>>> +        o_ctx_id = qlist_pop(backend_dev->recv_mads_list.list);
>>> +        if (o_ctx_id) {
>>> +            cqe_ctx_id = qnum_get_uint(qobject_to(QNum, o_ctx_id));
>>> +            bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
>>> +            if (bctx) {
>>> +                rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
>>> +                g_free(bctx);
>>> +            }
>>> +        }
>>> +    } while (o_ctx_id);
>>> +    qemu_mutex_unlock(&backend_dev->recv_mads_list.lock);
>>> +}
>>> +
>>> +static void mad_fini(RdmaBackendDev *backend_dev)
>>> +{
>>> +    qlist_destroy_obj(QOBJECT(backend_dev->recv_mads_list.list));
>>> +    qemu_mutex_destroy(&backend_dev->recv_mads_list.lock);
>>> +}
>>> +
>>>    int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>>>                          RdmaDeviceResources *rdma_dev_res,
>>>                          const char *backend_device_name, uint8_t port_num,
>>>                          uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
>>> -                      Error **errp)
>>> +                      CharBackend *mad_chr_be, Error **errp)
>>>    {
>>>        int i;
>>>        int ret = 0;
>>> @@ -763,7 +1003,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>>>        memset(backend_dev, 0, sizeof(*backend_dev));
>>>        backend_dev->dev = pdev;
>>> -
>>> +    backend_dev->mad_chr_be = mad_chr_be;
>>>        backend_dev->backend_gid_idx = backend_gid_idx;
>>>        backend_dev->port_num = port_num;
>>>        backend_dev->rdma_dev_res = rdma_dev_res;
>>> @@ -854,6 +1094,13 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>>>        pr_dbg("interface_id=0x%" PRIx64 "\n",
>>>               be64_to_cpu(backend_dev->gid.global.interface_id));
>>> +    ret = mad_init(backend_dev);
>>> +    if (ret) {
>>> +        error_setg(errp, "Fail to initialize mad");
>>> +        ret = -EIO;
>>> +        goto out_destroy_comm_channel;
>>> +    }
>>> +
>>>        backend_dev->comp_thread.run = false;
>>>        backend_dev->comp_thread.is_running = false;
>>> @@ -885,11 +1132,13 @@ void rdma_backend_stop(RdmaBackendDev *backend_dev)
>>>    {
>>>        pr_dbg("Stopping rdma_backend\n");
>>>        stop_backend_thread(&backend_dev->comp_thread);
>>> +    mad_stop(backend_dev);
>>>    }
>>>    void rdma_backend_fini(RdmaBackendDev *backend_dev)
>>>    {
>>>        rdma_backend_stop(backend_dev);
>>> +    mad_fini(backend_dev);
>>>        g_hash_table_destroy(ah_hash);
>>>        ibv_destroy_comp_channel(backend_dev->channel);
>>>        ibv_close_device(backend_dev->context);
>>> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
>>> index 3ccc9a2494..fc83330251 100644
>>> --- a/hw/rdma/rdma_backend.h
>>> +++ b/hw/rdma/rdma_backend.h
>>> @@ -17,6 +17,8 @@
>>>    #define RDMA_BACKEND_H
>>>    #include "qapi/error.h"
>>> +#include "chardev/char-fe.h"
>>> +
>>>    #include "rdma_rm_defs.h"
>>>    #include "rdma_backend_defs.h"
>>> @@ -50,7 +52,7 @@ int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
>>>                          RdmaDeviceResources *rdma_dev_res,
>>>                          const char *backend_device_name, uint8_t port_num,
>>>                          uint8_t backend_gid_idx, struct ibv_device_attr *dev_attr,
>>> -                      Error **errp);
>>> +                      CharBackend *mad_chr_be, Error **errp);
>>>    void rdma_backend_fini(RdmaBackendDev *backend_dev);
>>>    void rdma_backend_start(RdmaBackendDev *backend_dev);
>>>    void rdma_backend_stop(RdmaBackendDev *backend_dev);
>>> diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
>>> index 7404f64002..2a7e667075 100644
>>> --- a/hw/rdma/rdma_backend_defs.h
>>> +++ b/hw/rdma/rdma_backend_defs.h
>>> @@ -16,8 +16,9 @@
>>>    #ifndef RDMA_BACKEND_DEFS_H
>>>    #define RDMA_BACKEND_DEFS_H
>>> -#include <infiniband/verbs.h>
>>>    #include "qemu/thread.h"
>>> +#include "chardev/char-fe.h"
>>> +#include <infiniband/verbs.h>
>>>    typedef struct RdmaDeviceResources RdmaDeviceResources;
>>> @@ -28,6 +29,11 @@ typedef struct RdmaBackendThread {
>>>        bool is_running; /* Set by the thread to report its status */
>>>    } RdmaBackendThread;
>>> +typedef struct RecvMadList {
>>> +    QemuMutex lock;
>>> +    QList *list;
>>> +} RecvMadList;
>>> +
>>>    typedef struct RdmaBackendDev {
>>>        struct ibv_device_attr dev_attr;
>>>        RdmaBackendThread comp_thread;
>>> @@ -39,6 +45,8 @@ typedef struct RdmaBackendDev {
>>>        struct ibv_comp_channel *channel;
>>>        uint8_t port_num;
>>>        uint8_t backend_gid_idx;
>>> +    RecvMadList recv_mads_list;
>>> +    CharBackend *mad_chr_be;
>>>    } RdmaBackendDev;
>>>    typedef struct RdmaBackendPD {
>>> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
>>> index e2d9f93cdf..e3742d893a 100644
>>> --- a/hw/rdma/vmw/pvrdma.h
>>> +++ b/hw/rdma/vmw/pvrdma.h
>>> @@ -19,6 +19,7 @@
>>>    #include "qemu/units.h"
>>>    #include "hw/pci/pci.h"
>>>    #include "hw/pci/msix.h"
>>> +#include "chardev/char-fe.h"
>>>    #include "../rdma_backend_defs.h"
>>>    #include "../rdma_rm_defs.h"
>>> @@ -83,6 +84,7 @@ typedef struct PVRDMADev {
>>>        uint8_t backend_port_num;
>>>        RdmaBackendDev backend_dev;
>>>        RdmaDeviceResources rdma_dev_res;
>>> +    CharBackend mad_chr;
>>>    } PVRDMADev;
>>>    #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>>> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
>>> index ca5fa8d981..6c8c0154fa 100644
>>> --- a/hw/rdma/vmw/pvrdma_main.c
>>> +++ b/hw/rdma/vmw/pvrdma_main.c
>>> @@ -51,6 +51,7 @@ static Property pvrdma_dev_properties[] = {
>>>        DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
>>>                          dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
>>>        DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
>>> +    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
>>>        DEFINE_PROP_END_OF_LIST(),
>>>    };
>>> @@ -613,7 +614,8 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>>>        rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
>>>                               dev->backend_device_name, dev->backend_port_num,
>>> -                           dev->backend_gid_idx, &dev->dev_attr, errp);
>>> +                           dev->backend_gid_idx, &dev->dev_attr, &dev->mad_chr,
>>> +                           errp);
>>>        if (rc) {
>>>            goto out;
>>>        }


* Re: [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-11  7:45     ` Yuval Shaia
@ 2018-11-17 11:41       ` Marcel Apfelbaum
  2018-11-18  9:16         ` Yuval Shaia
  0 siblings, 1 reply; 47+ messages in thread
From: Marcel Apfelbaum @ 2018-11-17 11:41 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch



On 11/11/18 9:45 AM, Yuval Shaia wrote:
> On Sat, Nov 10, 2018 at 08:27:44PM +0200, Marcel Apfelbaum wrote:
>>
>> On 11/8/18 6:08 PM, Yuval Shaia wrote:
>>> The guest driver enforces this; we should, too.
>>>
>>> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
>>> ---
>>>    hw/rdma/vmw/pvrdma.h      | 2 ++
>>>    hw/rdma/vmw/pvrdma_main.c | 3 +++
>>>    2 files changed, 5 insertions(+)
>>>
>>> diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
>>> index b019cb843a..10a3c4fb7c 100644
>>> --- a/hw/rdma/vmw/pvrdma.h
>>> +++ b/hw/rdma/vmw/pvrdma.h
>>> @@ -20,6 +20,7 @@
>>>    #include "hw/pci/pci.h"
>>>    #include "hw/pci/msix.h"
>>>    #include "chardev/char-fe.h"
>>> +#include "hw/net/vmxnet3_defs.h"
>>>    #include "../rdma_backend_defs.h"
>>>    #include "../rdma_rm_defs.h"
>>> @@ -85,6 +86,7 @@ typedef struct PVRDMADev {
>>>        RdmaBackendDev backend_dev;
>>>        RdmaDeviceResources rdma_dev_res;
>>>        CharBackend mad_chr;
>>> +    VMXNET3State *func0;
>>>    } PVRDMADev;
>>>    #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
>>> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
>>> index ac8c092db0..fa6468d221 100644
>>> --- a/hw/rdma/vmw/pvrdma_main.c
>>> +++ b/hw/rdma/vmw/pvrdma_main.c
>>> @@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
>>>            return;
>>>        }
>>> +    /* Break if not vmxnet3 device in slot 0 */
>>> +    dev->func0 = VMXNET3(pci_get_function_0(pdev));
>>> +
>> I don't see the error code flow in case VMXNET3 is not func 0.
>> Am I missing something?
> Yes, this is a dynamic cast that will abort the process if the cast fails.
>
> This is the error message you will get in case the device on function
> 0 is not vmxnet3:
>
> pvrdma_main.c:589:pvrdma_realize: Object 0x557b959841a0 is not an instance of type vmxnet3

I am not sure we will see this error if QEMU is compiled in release mode.
I think object_dynamic_cast_assert throws this error only if
CONFIG_QOM_CAST_DEBUG is set, and it is possible that flag is not set in
release builds.

Thanks,
Marcel
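
A rough sketch of an explicit check that would also trip without
CONFIG_QOM_CAST_DEBUG (the error message wording is mine, not from the
series):

    Object *func0 = OBJECT(pci_get_function_0(pdev));

    if (!object_dynamic_cast(func0, TYPE_VMXNET3)) {
        error_setg(errp, "Device on PCI function 0 is not a vmxnet3 device");
        return;
    }
    dev->func0 = VMXNET3(func0);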

>
>>
>> Thanks,
>> Marcel
>>
>>>        memdev_root = object_resolve_path("/objects", NULL);
>>>        if (memdev_root) {
>>>            object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);


* Re: [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3
  2018-11-17 11:41       ` Marcel Apfelbaum
@ 2018-11-18  9:16         ` Yuval Shaia
  0 siblings, 0 replies; 47+ messages in thread
From: Yuval Shaia @ 2018-11-18  9:16 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: dmitry.fleytman, jasowang, eblake, armbru, pbonzini, qemu-devel,
	shamir.rabinovitch, yuval.shaia

On Sat, Nov 17, 2018 at 01:41:41PM +0200, Marcel Apfelbaum wrote:
> 
> 
> On 11/11/18 9:45 AM, Yuval Shaia wrote:
> > On Sat, Nov 10, 2018 at 08:27:44PM +0200, Marcel Apfelbaum wrote:
> > > 
> > > On 11/8/18 6:08 PM, Yuval Shaia wrote:
>>>> The guest driver enforces this; we should, too.
> > > > 
> > > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > > > ---
> > > >    hw/rdma/vmw/pvrdma.h      | 2 ++
> > > >    hw/rdma/vmw/pvrdma_main.c | 3 +++
> > > >    2 files changed, 5 insertions(+)
> > > > 
> > > > diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
> > > > index b019cb843a..10a3c4fb7c 100644
> > > > --- a/hw/rdma/vmw/pvrdma.h
> > > > +++ b/hw/rdma/vmw/pvrdma.h
> > > > @@ -20,6 +20,7 @@
> > > >    #include "hw/pci/pci.h"
> > > >    #include "hw/pci/msix.h"
> > > >    #include "chardev/char-fe.h"
> > > > +#include "hw/net/vmxnet3_defs.h"
> > > >    #include "../rdma_backend_defs.h"
> > > >    #include "../rdma_rm_defs.h"
> > > > @@ -85,6 +86,7 @@ typedef struct PVRDMADev {
> > > >        RdmaBackendDev backend_dev;
> > > >        RdmaDeviceResources rdma_dev_res;
> > > >        CharBackend mad_chr;
> > > > +    VMXNET3State *func0;
> > > >    } PVRDMADev;
> > > >    #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
> > > > diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> > > > index ac8c092db0..fa6468d221 100644
> > > > --- a/hw/rdma/vmw/pvrdma_main.c
> > > > +++ b/hw/rdma/vmw/pvrdma_main.c
> > > > @@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
> > > >            return;
> > > >        }
> > > > +    /* Break if not vmxnet3 device in slot 0 */
> > > > +    dev->func0 = VMXNET3(pci_get_function_0(pdev));
> > > > +
> > > I don't see the error code flow in case VMXNET3 is not func 0.
> > > Am I missing something?
> > Yes, this is a dynamic cast that will abort the process if the cast fails.
> > 
> > This is the error message you will get in case the device on function
> > 0 is not vmxnet3:
> > 
> > pvrdma_main.c:589:pvrdma_realize: Object 0x557b959841a0 is not an instance of type vmxnet3
> 
> I am not sure we will see this error if QEMU is compiled in release mode.
> I think object_dynamic_cast_assert throws this error only if
> CONFIG_QOM_CAST_DEBUG is set, and it is possible that flag is not set in
> release builds.
> 
> Thanks,
> Marcel

Done.
Thanks.

> 
> > 
> > > 
> > > Thanks,
> > > Marcel
> > > 
> > > >        memdev_root = object_resolve_path("/objects", NULL);
> > > >        if (memdev_root) {
> > > >            object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
> 


Thread overview: 47+ messages
2018-11-08 16:07 [Qemu-devel] [PATCH v2 00/22] Add support for RDMA MAD Yuval Shaia
2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 01/22] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer Yuval Shaia
2018-11-10 20:10   ` Shamir Rabinovitch
2018-11-11  7:38     ` Yuval Shaia
2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 02/22] hw/rdma: Add ability to force notification without re-arm Yuval Shaia
2018-11-10 17:56   ` Marcel Apfelbaum
2018-11-08 16:07 ` [Qemu-devel] [PATCH v2 03/22] hw/rdma: Return qpn 1 if ibqp is NULL Yuval Shaia
2018-11-10 17:59   ` Marcel Apfelbaum
2018-11-11  9:12     ` Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 04/22] hw/rdma: Abort send-op if fail to create addr handler Yuval Shaia
2018-11-10 17:59   ` Marcel Apfelbaum
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 05/22] hw/rdma: Add support for MAD packets Yuval Shaia
2018-11-10 18:15   ` Marcel Apfelbaum
2018-11-11 10:31     ` Yuval Shaia
2018-11-17 11:35       ` Marcel Apfelbaum
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 06/22] hw/pvrdma: Make function reset_device return void Yuval Shaia
2018-11-10 18:17   ` Marcel Apfelbaum
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 07/22] hw/pvrdma: Make default pkey 0xFFFF Yuval Shaia
2018-11-10 18:17   ` Marcel Apfelbaum
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 08/22] hw/pvrdma: Set the correct opcode for recv completion Yuval Shaia
2018-11-10 18:18   ` Marcel Apfelbaum
2018-11-11  8:43     ` Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 09/22] hw/pvrdma: Set the correct opcode for send completion Yuval Shaia
2018-11-10 18:21   ` Marcel Apfelbaum
2018-11-11  8:04     ` Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 10/22] json: Define new QMP message for pvrdma Yuval Shaia
2018-11-10 18:25   ` Marcel Apfelbaum
2018-11-11  7:50     ` Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 11/22] hw/pvrdma: Add support to allow guest to configure GID table Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 12/22] vmxnet3: Move some definitions to header file Yuval Shaia
2018-11-12 13:56   ` Dmitry Fleytman
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 13/22] hw/pvrdma: Make sure PCI function 0 is vmxnet3 Yuval Shaia
2018-11-10 18:27   ` Marcel Apfelbaum
2018-11-11  7:45     ` Yuval Shaia
2018-11-17 11:41       ` Marcel Apfelbaum
2018-11-18  9:16         ` Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 14/22] hw/rdma: Initialize node_guid from vmxnet3 mac address Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 15/22] hw/pvrdma: Make device state depend on Ethernet function state Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 16/22] hw/pvrdma: Fill all CQE fields Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 17/22] hw/pvrdma: Fill error code in command's response Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 18/22] hw/rdma: Remove unneeded code that handles more that one port Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 19/22] vl: Introduce shutdown_notifiers Yuval Shaia
2018-11-08 16:26   ` Cornelia Huck
2018-11-08 20:45     ` Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 20/22] hw/pvrdma: Clean device's resource when system is shutdown Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 21/22] rdma: Do not use bitmap_zero_extend to fee bitmap Yuval Shaia
2018-11-08 16:08 ` [Qemu-devel] [PATCH v2 22/22] rdma: Do not call rdma_backend_del_gid on an empty gid Yuval Shaia
