All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround
@ 2013-08-09 20:05 mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 1/6] rdma: use resp.len after validation in qemu_rdma_registration_stop mrhines
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Changes since v1:

1. IPv6 support over RDMA ethernet is broken in linux right now.
   Although a patch is in review on linux-rdma, we need a work-around to make sure
   the user knows why it's not working.

   See PATCH 0/5 for a detailed description.

Isaku Yamahata (3):
  rdma: use resp.len after validation in qemu_rdma_registration_stop
  rdma: validate RDMAControlHeader::len
  rdma: check if RDMAControlHeader::len match transferred byte

Michael R. Hines (3):
  rdma: proper getaddrinfo() handling
  rdma: IPv6 over Ethernet (RoCE) is broken in linux - workaround
  rdma: remaining documentation fixes

 migration-rdma.c |  275 +++++++++++++++++++++++++++++++++++++++++++-----------
 qmp-commands.hx  |   10 ++
 2 files changed, 231 insertions(+), 54 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 for-1.6 1/6] rdma: use resp.len after validation in qemu_rdma_registration_stop
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
@ 2013-08-09 20:05 ` mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 2/6] rdma: validate RDMAControlHeader::len mrhines
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: Isaku Yamahata <yamahata@private.email.ne.jp>

resp.len is given from remote host. So should be validated before use.
Otherwise memcpy can access beyond the buffer.

Cc: Michael R. Hines <mrhines@us.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Isaku Yamahata <yamahata@private.email.ne.jp>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/migration-rdma.c b/migration-rdma.c
index 3a380d4..6721266 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -3045,10 +3045,6 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
             return ret;
         }
 
-        qemu_rdma_move_header(rdma, reg_result_idx, &resp);
-        memcpy(rdma->block,
-            rdma->wr_data[reg_result_idx].control_curr, resp.len);
-
         nb_remote_blocks = resp.len / sizeof(RDMARemoteBlock);
 
         /*
@@ -3070,6 +3066,9 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
             return -EINVAL;
         }
 
+        qemu_rdma_move_header(rdma, reg_result_idx, &resp);
+        memcpy(rdma->block,
+            rdma->wr_data[reg_result_idx].control_curr, resp.len);
         for (i = 0; i < nb_remote_blocks; i++) {
             network_to_remote_block(&rdma->block[i]);
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 for-1.6 2/6] rdma: validate RDMAControlHeader::len
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 1/6] rdma: use resp.len after validation in qemu_rdma_registration_stop mrhines
@ 2013-08-09 20:05 ` mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 3/6] rdma: check if RDMAControlHeader::len match transferred byte mrhines
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: Isaku Yamahata <yamahata@private.email.ne.jp>

RMDAControlHeader::len is provided from remote, so validate it.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Isaku Yamahata <yamahata@private.email.ne.jp>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/migration-rdma.c b/migration-rdma.c
index 6721266..ebe1f55 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -1424,6 +1424,7 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
      * The copy makes the RDMAControlHeader simpler to manipulate
      * for the time being.
      */
+    assert(head->len <= RDMA_CONTROL_MAX_BUFFER - sizeof(*head));
     memcpy(wr->control, head, sizeof(RDMAControlHeader));
     control_to_network((void *) wr->control);
 
@@ -1504,6 +1505,10 @@ static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
                 control_desc[head->type], head->type, head->len);
         return -EIO;
     }
+    if (head->len > RDMA_CONTROL_MAX_BUFFER - sizeof(*head)) {
+        fprintf(stderr, "too long length: %d\n", head->len);
+        return -EINVAL;
+    }
 
     return 0;
 }
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 for-1.6 3/6] rdma: check if RDMAControlHeader::len match transferred byte
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 1/6] rdma: use resp.len after validation in qemu_rdma_registration_stop mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 2/6] rdma: validate RDMAControlHeader::len mrhines
@ 2013-08-09 20:05 ` mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 4/6] rdma: proper getaddrinfo() handling mrhines
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: Isaku Yamahata <yamahata@private.email.ne.jp>

RDMAControlHeader::len is provided from remote, so check if the value
match the actual transferred byte_len.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Isaku Yamahata <yamahata@private.email.ne.jp>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |   32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/migration-rdma.c b/migration-rdma.c
index ebe1f55..30e08cd 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -1214,7 +1214,8 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
  * (of any kind) has completed.
  * Return the work request ID that completed.
  */
-static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out)
+static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
+                               uint32_t *byte_len)
 {
     int ret;
     struct ibv_wc wc;
@@ -1285,6 +1286,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out)
     }
 
     *wr_id_out = wc.wr_id;
+    if (byte_len) {
+        *byte_len = wc.byte_len;
+    }
 
     return  0;
 }
@@ -1302,7 +1306,8 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out)
  * completions only need to be recorded, but do not actually
  * need further processing.
  */
-static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested)
+static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
+                                    uint32_t *byte_len)
 {
     int num_cq_events = 0, ret = 0;
     struct ibv_cq *cq;
@@ -1314,7 +1319,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested)
     }
     /* poll cq first */
     while (wr_id != wrid_requested) {
-        ret = qemu_rdma_poll(rdma, &wr_id_in);
+        ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
         if (ret < 0) {
             return ret;
         }
@@ -1356,7 +1361,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested)
         }
 
         while (wr_id != wrid_requested) {
-            ret = qemu_rdma_poll(rdma, &wr_id_in);
+            ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
             if (ret < 0) {
                 goto err_block_for_wrid;
             }
@@ -1442,7 +1447,7 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
         return ret;
     }
 
-    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL);
+    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
     if (ret < 0) {
         fprintf(stderr, "rdma migration: send polling control error!\n");
     }
@@ -1483,7 +1488,9 @@ static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx)
 static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
                 RDMAControlHeader *head, int expecting, int idx)
 {
-    int ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx);
+    uint32_t byte_len;
+    int ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx,
+                                       &byte_len);
 
     if (ret < 0) {
         fprintf(stderr, "rdma migration: recv polling control error!\n");
@@ -1509,6 +1516,11 @@ static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
         fprintf(stderr, "too long length: %d\n", head->len);
         return -EINVAL;
     }
+    if (sizeof(*head) + head->len != byte_len) {
+        fprintf(stderr, "Malformed length: %d byte_len %d\n",
+                head->len, byte_len);
+        return -EINVAL;
+    }
 
     return 0;
 }
@@ -1738,7 +1750,7 @@ retry:
                 count++, current_index, chunk,
                 sge.addr, length, rdma->nb_sent, block->nb_chunks);
 
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
 
         if (ret < 0) {
             fprintf(stderr, "Failed to Wait for previous write to complete "
@@ -1882,7 +1894,7 @@ retry:
 
     if (ret == ENOMEM) {
         DDPRINTF("send queue is full. wait a little....\n");
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
         if (ret < 0) {
             fprintf(stderr, "rdma migration: failed to make "
                             "room in full send queue! %d\n", ret);
@@ -2471,7 +2483,7 @@ static int qemu_rdma_drain_cq(QEMUFile *f, RDMAContext *rdma)
     }
 
     while (rdma->nb_sent) {
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
         if (ret < 0) {
             fprintf(stderr, "rdma migration: complete polling error!\n");
             return -EIO;
@@ -2607,7 +2619,7 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
      */
     while (1) {
         uint64_t wr_id, wr_id_in;
-        int ret = qemu_rdma_poll(rdma, &wr_id_in);
+        int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
         if (ret < 0) {
             fprintf(stderr, "rdma migration: polling error! %d\n", ret);
             goto err;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 for-1.6 4/6] rdma: proper getaddrinfo() handling
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
                   ` (2 preceding siblings ...)
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 3/6] rdma: check if RDMAControlHeader::len match transferred byte mrhines
@ 2013-08-09 20:05 ` mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 5/6] rdma: IPv6 over Ethernet (RoCE) is broken in linux - workaround mrhines
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

getaddrinfo() already knows what it's doing,
but it can potentially return multiple addresses.
We need to handle that...

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |   56 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/migration-rdma.c b/migration-rdma.c
index 30e08cd..d71cca5 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -392,7 +392,6 @@ typedef struct RDMAContext {
     uint64_t unregistrations[RDMA_SIGNALED_SEND_MAX];
 
     GHashTable *blockmap;
-    bool ipv6;
 } RDMAContext;
 
 /*
@@ -745,7 +744,7 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
     char port_str[16];
     struct rdma_cm_event *cm_event;
     char ip[40] = "unknown";
-    int af = rdma->ipv6 ? PF_INET6 : PF_INET;
+    struct addrinfo *e;
 
     if (rdma->host == NULL || !strcmp(rdma->host, "")) {
         ERROR(errp, "RDMA hostname has not been set");
@@ -775,18 +774,23 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
         goto err_resolve_get_addr;
     }
 
-    inet_ntop(af, &((struct sockaddr_in *) res->ai_addr)->sin_addr,
-                                ip, sizeof ip);
-    DPRINTF("%s => %s\n", rdma->host, ip);
+    for (e = res; e != NULL; e = e->ai_next) {
+        inet_ntop(e->ai_family,
+            &((struct sockaddr_in *) e->ai_addr)->sin_addr, ip, sizeof ip);
+        DPRINTF("Trying %s => %s\n", rdma->host, ip);
 
-    /* resolve the first address */
-    ret = rdma_resolve_addr(rdma->cm_id, NULL, res->ai_addr,
-            RDMA_RESOLVE_TIMEOUT_MS);
-    if (ret) {
-        ERROR(errp, "could not resolve address %s", rdma->host);
-        goto err_resolve_get_addr;
+        /* resolve the first address */
+        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_addr,
+                RDMA_RESOLVE_TIMEOUT_MS);
+        if (!ret) {
+            goto route;
+        }
     }
 
+    ERROR(errp, "could not resolve address %s", rdma->host);
+    goto err_resolve_get_addr;
+
+route:
     qemu_rdma_dump_gid("source_resolve_addr", rdma->cm_id);
 
     ret = rdma_get_cm_event(rdma->channel, &cm_event);
@@ -2260,8 +2264,6 @@ err_rdma_source_connect:
 static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
 {
     int ret = -EINVAL, idx;
-    int af = rdma->ipv6 ? PF_INET6 : PF_INET;
-    struct sockaddr_in sin;
     struct rdma_cm_id *listen_id;
     char ip[40] = "unknown";
     struct addrinfo *res;
@@ -2292,35 +2294,36 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
         goto err_dest_init_create_listen_id;
     }
 
-    memset(&sin, 0, sizeof(sin));
-    sin.sin_family = af;
-    sin.sin_port = htons(rdma->port);
     snprintf(port_str, 16, "%d", rdma->port);
     port_str[15] = '\0';
 
     if (rdma->host && strcmp("", rdma->host)) {
+        struct addrinfo *e;
+
         ret = getaddrinfo(rdma->host, port_str, NULL, &res);
         if (ret < 0) {
             ERROR(errp, "could not getaddrinfo address %s", rdma->host);
             goto err_dest_init_bind_addr;
         }
 
+        for (e = res; e != NULL; e = e->ai_next) {
+            inet_ntop(e->ai_family,
+                &((struct sockaddr_in *) e->ai_addr)->sin_addr, ip, sizeof ip);
+            DPRINTF("Trying %s => %s\n", rdma->host, ip);
+            ret = rdma_bind_addr(listen_id, e->ai_addr);
+            if (!ret) {
+                goto listen;
+            }
+        }
 
-        inet_ntop(af, &((struct sockaddr_in *) res->ai_addr)->sin_addr,
-                                    ip, sizeof ip);
+        ERROR(errp, "Error: could not rdma_bind_addr!");
+        goto err_dest_init_bind_addr;
     } else {
         ERROR(errp, "migration host and port not specified!");
         ret = -EINVAL;
         goto err_dest_init_bind_addr;
     }
-
-    DPRINTF("%s => %s\n", rdma->host, ip);
-
-    ret = rdma_bind_addr(listen_id, res->ai_addr);
-    if (ret) {
-        ERROR(errp, "Error: could not rdma_bind_addr!");
-        goto err_dest_init_bind_addr;
-    }
+listen:
 
     rdma->listen_id = listen_id;
     qemu_rdma_dump_gid("dest_init", listen_id);
@@ -2351,7 +2354,6 @@ static void *qemu_rdma_data_init(const char *host_port, Error **errp)
         if (addr != NULL) {
             rdma->port = atoi(addr->port);
             rdma->host = g_strdup(addr->host);
-            rdma->ipv6 = addr->ipv6;
         } else {
             ERROR(errp, "bad RDMA migration address '%s'", host_port);
             g_free(rdma);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 for-1.6 5/6] rdma: IPv6 over Ethernet (RoCE) is broken in linux - workaround
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
                   ` (3 preceding siblings ...)
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 4/6] rdma: proper getaddrinfo() handling mrhines
@ 2013-08-09 20:05 ` mrhines
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes mrhines
  2013-08-14 16:27 ` [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround Anthony Liguori
  6 siblings, 0 replies; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

We've gotten reports from multiple testers (including Frank Yangjie
and myself) that RDMA IPv6 support over RocE (Ethernet) is broken
in linux.

A patch to Linux is still in review:

http://comments.gmane.org/gmane.linux.drivers.rdma/16448

If the user is listening on '[::]', then we will not have a opened a device
yet and have no way of verifying if the device is RoCE or not.

In this case, the source VM will throw an error for ALL types of
connections (both IPv4 and IPv6) if the destination machine does not have
a regular infiniband network available for use.

The only way to gaurantee that an error is thrown for broken kernels is
for the management software to choose a *specific* interface at bind time
and validate what time of hardware it is.

Unfortunately, this puts the user in a fix:

 If the source VM connects with an IPv4 address without knowing that the
 destination has bound to '[::]' the migration will unconditionally fail
 unless the management software is not explicitly listening on the the IPv4
 address while using a RoCE-based device.

 If the source VM connects with an IPv6 address, then we're OK because we can
 throw an error on the source (and similarly on the destination).

 But in mixed environments, this will be broken for a while until it is fixed
 inside linux.

We do provide a *tiny* bit of help in mixed environments, though in this patch:

We can list all of the devices in the system and check to see if all the
devices are RoCE or Infiniband.

If we detect that we have a *pure* RoCE environment, then we can safely
thrown an error even if the management sofware has specified '[::]' as the
bind address.

However, if there is are multiple hetergeneous devices, then we cannot make
this assumption and the user just has to be sure they know what they are doing.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |  189 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 169 insertions(+), 20 deletions(-)

diff --git a/migration-rdma.c b/migration-rdma.c
index d71cca5..3d1266f 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -707,15 +707,27 @@ static int __qemu_rdma_delete_block(RDMAContext *rdma, ram_addr_t block_offset)
  */
 static void qemu_rdma_dump_id(const char *who, struct ibv_context *verbs)
 {
+    struct ibv_port_attr port;
+
+    if (ibv_query_port(verbs, 1, &port)) {
+        fprintf(stderr, "FAILED TO QUERY PORT INFORMATION!\n");
+        return;
+    }
+
     printf("%s RDMA Device opened: kernel name %s "
            "uverbs device name %s, "
-           "infiniband_verbs class device path %s,"
-           " infiniband class device path %s\n",
+           "infiniband_verbs class device path %s, "
+           "infiniband class device path %s, "
+           "transport: (%d) %s\n",
                 who,
                 verbs->device->name,
                 verbs->device->dev_name,
                 verbs->device->dev_path,
-                verbs->device->ibdev_path);
+                verbs->device->ibdev_path,
+                port.link_layer,
+                (port.link_layer == IBV_LINK_LAYER_INFINIBAND) ? "Infiniband" :
+                 ((port.link_layer == IBV_LINK_LAYER_ETHERNET) 
+                    ? "Ethernet" : "Unknown"));
 }
 
 /*
@@ -733,6 +745,132 @@ static void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
 }
 
 /*
+ * As of now, IPv6 over RoCE / iWARP is not supported by linux.
+ * We will try the next addrinfo struct, and fail if there are
+ * no other valid addresses to bind against.
+ *
+ * If user is listening on '[::]', then we will not have a opened a device
+ * yet and have no way of verifying if the device is RoCE or not.
+ *
+ * In this case, the source VM will throw an error for ALL types of
+ * connections (both IPv4 and IPv6) if the destination machine does not have
+ * a regular infiniband network available for use.
+ *
+ * The only way to gaurantee that an error is thrown for broken kernels is
+ * for the management software to choose a *specific* interface at bind time
+ * and validate what time of hardware it is.
+ *
+ * Unfortunately, this puts the user in a fix:
+ * 
+ *  If the source VM connects with an IPv4 address without knowing that the
+ *  destination has bound to '[::]' the migration will unconditionally fail
+ *  unless the management software is explicitly listening on the the IPv4
+ *  address while using a RoCE-based device.
+ *
+ *  If the source VM connects with an IPv6 address, then we're OK because we can
+ *  throw an error on the source (and similarly on the destination).
+ * 
+ *  But in mixed environments, this will be broken for a while until it is fixed
+ *  inside linux.
+ *
+ * We do provide a *tiny* bit of help in this function: We can list all of the
+ * devices in the system and check to see if all the devices are RoCE or
+ * Infiniband. 
+ *
+ * If we detect that we have a *pure* RoCE environment, then we can safely
+ * thrown an error even if the management sofware has specified '[::]' as the
+ * bind address.
+ *
+ * However, if there is are multiple hetergeneous devices, then we cannot make
+ * this assumption and the user just has to be sure they know what they are
+ * doing.
+ *
+ * Patches are being reviewed on linux-rdma.
+ */
+static int qemu_rdma_broken_ipv6_kernel(Error **errp, struct ibv_context *verbs)
+{
+    struct ibv_port_attr port_attr;
+
+    /* This bug only exists in linux, to our knowledge. */
+#ifdef CONFIG_LINUX
+
+    /* 
+     * Verbs are only NULL if management has bound to '[::]'.
+     * 
+     * Let's iterate through all the devices and see if there any pure IB
+     * devices (non-ethernet).
+     * 
+     * If not, then we can safely proceed with the migration.
+     * Otherwise, there are no gaurantees until the bug is fixed in linux.
+     */
+    if (!verbs) {
+	    int num_devices, x;
+        struct ibv_device ** dev_list = ibv_get_device_list(&num_devices);
+        bool roce_found = false;
+        bool ib_found = false;
+
+        for (x = 0; x < num_devices; x++) {
+            verbs = ibv_open_device(dev_list[x]);
+
+            if (ibv_query_port(verbs, 1, &port_attr)) {
+                ibv_close_device(verbs);
+                ERROR(errp, "Could not query initial IB port");
+                return -EINVAL;
+            }
+
+            if (port_attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
+                ib_found = true;
+            } else if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
+                roce_found = true;
+            }
+
+            ibv_close_device(verbs);
+
+        }
+
+        if (roce_found) {
+            if (ib_found) {
+                fprintf(stderr, "WARN: migrations may fail:"
+                                " IPv6 over RoCE / iWARP in linux"
+                                " is broken. But since you appear to have a"
+                                " mixed RoCE / IB environment, be sure to only"
+                                " migrate over the IB fabric until the kernel "
+                                " fixes the bug.\n");
+            } else {
+                ERROR(errp, "You only have RoCE / iWARP devices in your systems"
+                            " and your management software has specified '[::]'"
+                            ", but IPv6 over RoCE / iWARP is not supported in Linux.");
+                return -ENONET;
+            }
+        }
+
+        return 0;
+    }
+
+    /*
+     * If we have a verbs context, that means that some other than '[::]' was
+     * used by the management software for binding. In which case we can actually 
+     * warn the user about a potential broken kernel;
+     */
+
+    /* IB ports start with 1, not 0 */
+    if (ibv_query_port(verbs, 1, &port_attr)) {
+        ERROR(errp, "Could not query initial IB port");
+        return -EINVAL;
+    }
+
+    if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
+        ERROR(errp, "Linux kernel's RoCE / iWARP does not support IPv6 "
+                    "(but patches on linux-rdma in progress)");
+        return -ENONET;
+    }
+
+#endif
+
+    return 0;
+}
+
+/*
  * Figure out which RDMA device corresponds to the requested IP hostname
  * Also create the initial connection manager identifiers for opening
  * the connection.
@@ -740,22 +878,22 @@ static void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
 static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
 {
     int ret;
-    struct addrinfo *res;
+    struct rdma_addrinfo *res;
     char port_str[16];
     struct rdma_cm_event *cm_event;
     char ip[40] = "unknown";
-    struct addrinfo *e;
+    struct rdma_addrinfo *e;
 
     if (rdma->host == NULL || !strcmp(rdma->host, "")) {
         ERROR(errp, "RDMA hostname has not been set");
-        return -1;
+        return -EINVAL;
     }
 
     /* create CM channel */
     rdma->channel = rdma_create_event_channel();
     if (!rdma->channel) {
         ERROR(errp, "could not create CM channel");
-        return -1;
+        return -EINVAL;
     }
 
     /* create CM id */
@@ -768,21 +906,24 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
     snprintf(port_str, 16, "%d", rdma->port);
     port_str[15] = '\0';
 
-    ret = getaddrinfo(rdma->host, port_str, NULL, &res);
+    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
     if (ret < 0) {
-        ERROR(errp, "could not getaddrinfo address %s", rdma->host);
+        ERROR(errp, "could not rdma_getaddrinfo address %s", rdma->host);
         goto err_resolve_get_addr;
     }
 
     for (e = res; e != NULL; e = e->ai_next) {
         inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_addr)->sin_addr, ip, sizeof ip);
+            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
         DPRINTF("Trying %s => %s\n", rdma->host, ip);
 
-        /* resolve the first address */
-        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_addr,
+        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_dst_addr,
                 RDMA_RESOLVE_TIMEOUT_MS);
         if (!ret) {
+            ret = qemu_rdma_broken_ipv6_kernel(errp, rdma->cm_id->verbs);
+            if (ret) {
+                continue;
+            }
             goto route;
         }
     }
@@ -803,6 +944,7 @@ route:
         ERROR(errp, "result not equal to event_addr_resolved %s",
                 rdma_event_str(cm_event->event));
         perror("rdma_resolve_addr");
+        ret = -EINVAL;
         goto err_resolve_get_addr;
     }
     rdma_ack_cm_event(cm_event);
@@ -823,6 +965,7 @@ route:
         ERROR(errp, "result not equal to event_route_resolved: %s",
                         rdma_event_str(cm_event->event));
         rdma_ack_cm_event(cm_event);
+        ret = -EINVAL;
         goto err_resolve_get_addr;
     }
     rdma_ack_cm_event(cm_event);
@@ -837,8 +980,7 @@ err_resolve_get_addr:
 err_resolve_create_id:
     rdma_destroy_event_channel(rdma->channel);
     rdma->channel = NULL;
-
-    return -1;
+    return ret;
 }
 
 /*
@@ -2266,7 +2408,7 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
     int ret = -EINVAL, idx;
     struct rdma_cm_id *listen_id;
     char ip[40] = "unknown";
-    struct addrinfo *res;
+    struct rdma_addrinfo *res;
     char port_str[16];
 
     for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
@@ -2298,20 +2440,27 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
     port_str[15] = '\0';
 
     if (rdma->host && strcmp("", rdma->host)) {
-        struct addrinfo *e;
+        struct rdma_addrinfo *e;
 
-        ret = getaddrinfo(rdma->host, port_str, NULL, &res);
+        ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
         if (ret < 0) {
-            ERROR(errp, "could not getaddrinfo address %s", rdma->host);
+            ERROR(errp, "could not rdma_getaddrinfo address %s", rdma->host);
             goto err_dest_init_bind_addr;
         }
 
         for (e = res; e != NULL; e = e->ai_next) {
             inet_ntop(e->ai_family,
-                &((struct sockaddr_in *) e->ai_addr)->sin_addr, ip, sizeof ip);
+                &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
             DPRINTF("Trying %s => %s\n", rdma->host, ip);
-            ret = rdma_bind_addr(listen_id, e->ai_addr);
+            ret = rdma_bind_addr(listen_id, e->ai_dst_addr);
             if (!ret) {
+                if (e->ai_family == AF_INET6) {
+                    ret = qemu_rdma_broken_ipv6_kernel(errp, listen_id->verbs);
+                    if (ret) {
+                        continue;
+                    }
+                }
+                    
                 goto listen;
             }
         }
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
                   ` (4 preceding siblings ...)
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 5/6] rdma: IPv6 over Ethernet (RoCE) is broken in linux - workaround mrhines
@ 2013-08-09 20:05 ` mrhines
  2013-08-09 20:26   ` Eric Blake
  2013-08-14 16:27 ` [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround Anthony Liguori
  6 siblings, 1 reply; 11+ messages in thread
From: mrhines @ 2013-08-09 20:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Was missing 'setup-time' in some of the QMP documentation...

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 qmp-commands.hx |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/qmp-commands.hx b/qmp-commands.hx
index 2e59b0d..cf47e3f 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2621,6 +2621,12 @@ The main json-object contains the following:
 - "total-time": total amount of ms since migration started.  If
                 migration has ended, it returns the total migration
                 time (json-int)
+- "setup-time" amount of setup time in milliseconds _before_ the
+               iterations begin but _after_ the QMP command is issued.
+               This is designed to provide an accounting of any activities
+               (such as RDMA pinning) which may be expensive, but do not 
+               actually occur during the iterative migration rounds 
+               themselves. (json-int)
 - "downtime": only present when migration has finished correctly
               total amount in ms for downtime that happened (json-int)
 - "expected-downtime": only present while migration is active
@@ -2674,6 +2680,7 @@ Examples:
           "remaining":123,
           "total":246,
           "total-time":12345,
+          "setup-time":12345,
           "downtime":12345,
           "duplicate":123,
           "normal":123,
@@ -2698,6 +2705,7 @@ Examples:
             "remaining":123,
             "total":246,
             "total-time":12345,
+            "setup-time":12345,
             "expected-downtime":12345,
             "duplicate":123,
             "normal":123,
@@ -2717,6 +2725,7 @@ Examples:
             "remaining":1053304,
             "transferred":3720,
             "total-time":12345,
+            "setup-time":12345,
             "expected-downtime":12345,
             "duplicate":123,
             "normal":123,
@@ -2742,6 +2751,7 @@ Examples:
             "remaining":1053304,
             "transferred":3720,
             "total-time":12345,
+            "setup-time":12345,
             "expected-downtime":12345,
             "duplicate":10,
             "normal":3333,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes mrhines
@ 2013-08-09 20:26   ` Eric Blake
  2013-08-09 20:32     ` Michael R. Hines
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2013-08-09 20:26 UTC (permalink / raw)
  To: mrhines
  Cc: yamahata, aliguori, quintela, frank.yangjie, qemu-devel,
	owasserm, mrhines, pbonzini

[-- Attachment #1: Type: text/plain, Size: 986 bytes --]

On 08/09/2013 02:05 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Was missing 'setup-time' in some of the QMP documentation...
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  qmp-commands.hx |   10 ++++++++++
>  1 file changed, 10 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>

> @@ -2742,6 +2751,7 @@ Examples:
>              "remaining":1053304,
>              "transferred":3720,
>              "total-time":12345,
> +            "setup-time":12345,

Theoretically, if setup-time == total-time, then we are still in the
setup phase, right?  The example might be more realistic if it uses a
smaller number for setup time.  But at the end of the day, I don't
actually care enough to force a respin, when we are this close to the
1.6 deadline and this deserves to be in 1.6.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes
  2013-08-09 20:26   ` Eric Blake
@ 2013-08-09 20:32     ` Michael R. Hines
  2013-08-09 20:36       ` Eric Blake
  0 siblings, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-08-09 20:32 UTC (permalink / raw)
  To: Eric Blake
  Cc: yamahata, aliguori, quintela, frank.yangjie, qemu-devel,
	owasserm, mrhines, pbonzini

On 08/09/2013 04:26 PM, Eric Blake wrote:
> On 08/09/2013 02:05 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Was missing 'setup-time' in some of the QMP documentation...
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   qmp-commands.hx |   10 ++++++++++
>>   1 file changed, 10 insertions(+)
> Reviewed-by: Eric Blake <eblake@redhat.com>
>
>> @@ -2742,6 +2751,7 @@ Examples:
>>               "remaining":1053304,
>>               "transferred":3720,
>>               "total-time":12345,
>> +            "setup-time":12345,
> Theoretically, if setup-time == total-time, then we are still in the
> setup phase, right?  The example might be more realistic if it uses a
> smaller number for setup time.  But at the end of the day, I don't
> actually care enough to force a respin, when we are this close to the
> 1.6 deadline and this deserves to be in 1.6.
>

Ooops - well, when I was reading the file - all the numbers seemed like 
place-holders, not realistic numbers.

This documentation is kind of hard to maintain in the first place - 
something like this is more the job of a tool like 'virt-test' which 
could actually exercise QMP and dump the output during a test cycle in 
case the user actually wanted to see what the migration life cycle 
actually looks like.

- Michael

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes
  2013-08-09 20:32     ` Michael R. Hines
@ 2013-08-09 20:36       ` Eric Blake
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Blake @ 2013-08-09 20:36 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: yamahata, aliguori, quintela, frank.yangjie, qemu-devel,
	owasserm, mrhines, pbonzini

[-- Attachment #1: Type: text/plain, Size: 1210 bytes --]

On 08/09/2013 02:32 PM, Michael R. Hines wrote:

>>>               "total-time":12345,
>>> +            "setup-time":12345,
>> Theoretically, if setup-time == total-time, then we are still in the
>> setup phase, right?  The example might be more realistic if it uses a
>> smaller number for setup time.  But at the end of the day, I don't
>> actually care enough to force a respin, when we are this close to the
>> 1.6 deadline and this deserves to be in 1.6.
>>
> 
> Ooops - well, when I was reading the file - all the numbers seemed like
> place-holders, not realistic numbers.

Indeed, which is why I'm okay with no respin.

> 
> This documentation is kind of hard to maintain in the first place -
> something like this is more the job of a tool like 'virt-test' which
> could actually exercise QMP and dump the output during a test cycle in
> case the user actually wanted to see what the migration life cycle
> actually looks like.

Yes, I'd love to have more of this file generated (why are we forced to
duplicate efforts into both the .json and the .hx files again?)

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround
  2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
                   ` (5 preceding siblings ...)
  2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes mrhines
@ 2013-08-14 16:27 ` Anthony Liguori
  6 siblings, 0 replies; 11+ messages in thread
From: Anthony Liguori @ 2013-08-14 16:27 UTC (permalink / raw)
  To: mrhines, qemu-devel
  Cc: yamahata, aliguori, quintela, frank.yangjie, owasserm, mrhines, pbonzini

Applied.  Thanks.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-08-14 16:27 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-09 20:05 [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround mrhines
2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 1/6] rdma: use resp.len after validation in qemu_rdma_registration_stop mrhines
2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 2/6] rdma: validate RDMAControlHeader::len mrhines
2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 3/6] rdma: check if RDMAControlHeader::len match transferred byte mrhines
2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 4/6] rdma: proper getaddrinfo() handling mrhines
2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 5/6] rdma: IPv6 over Ethernet (RoCE) is broken in linux - workaround mrhines
2013-08-09 20:05 ` [Qemu-devel] [PATCH v2 for-1.6 6/6] rdma: remaining documentation fixes mrhines
2013-08-09 20:26   ` Eric Blake
2013-08-09 20:32     ` Michael R. Hines
2013-08-09 20:36       ` Eric Blake
2013-08-14 16:27 ` [Qemu-devel] [PATCH v2 for-1.6 0/6] rdma: uh oh! IPv6 broken in linux - need workaround Anthony Liguori

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.