All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes
@ 2017-07-17 11:09 Dr. David Alan Gilbert (git)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 1/6] migration/rdma: Fix race on source Dr. David Alan Gilbert (git)
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

  This is a bunch of RDMA fixes, the first is a race
I spotted a while ago that you don't hit during normal
operation; the rest are to do with migration failure and
cancellation that I started looking at because of lp1545052 which
is a failure to recover on the source if the destination
fails.
 
  I'm pretty sure there are other cases where the source
might hang waiting for a failed destination; particularly
if the destination hangs rather than fails completely;
one I know of is waiting for the event after the rdma_disconnect
but I don't have a good fix for it.  Suggestions welcome.
 
v4
  Fix more error paths in qemu_rdma_block_for_wrid - often
  the 'ret' wasn't set. Fix that and also fix the case we
  add to also set the ret. Spotted by Peter.

Dr. David Alan Gilbert (6):
  migration/rdma: Fix race on source
  migration: Close file on failed migration load
  migration/rdma: fix qemu_rdma_block_for_wrid error paths
  migration/rdma: Allow cancelling while waiting for wrid
  migration/rdma: Safely convert control types
  migration/rdma: Send error during cancelling

 migration/migration.c |   1 +
 migration/rdma.c      | 137 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 101 insertions(+), 37 deletions(-)

-- 
2.13.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v4 1/6] migration/rdma: Fix race on source
  2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
@ 2017-07-17 11:09 ` Dr. David Alan Gilbert (git)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 2/6] migration: Close file on failed migration load Dr. David Alan Gilbert (git)
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Fix a race where the destination might try and send the source a
WRID_READY before the source has done a post-recv for it.

rdma_post_recv has to happen after the qp exists, and we're
OK since we've already called qemu_rdma_source_init that calls
qemu_alloc_qp.

This corresponds to:
https://bugzilla.redhat.com/show_bug.cgi?id=1285044

The race can be triggered by adding a few ms wait before this
post_recv_control (which was originally due to me turning on loads of
debug).

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/rdma.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index c6bc607a03..6111e10c70 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2365,6 +2365,12 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
 
     caps_to_network(&cap);
 
+    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY);
+    if (ret) {
+        ERROR(errp, "posting second control recv");
+        goto err_rdma_source_connect;
+    }
+
     ret = rdma_connect(rdma->cm_id, &conn_param);
     if (ret) {
         perror("rdma_connect");
@@ -2405,12 +2411,6 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
 
     rdma_ack_cm_event(cm_event);
 
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY);
-    if (ret) {
-        ERROR(errp, "posting second control recv!");
-        goto err_rdma_source_connect;
-    }
-
     rdma->control_ready_expected = 1;
     rdma->nb_sent = 0;
     return 0;
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v4 2/6] migration: Close file on failed migration load
  2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 1/6] migration/rdma: Fix race on source Dr. David Alan Gilbert (git)
@ 2017-07-17 11:09 ` Dr. David Alan Gilbert (git)
  2017-07-17 19:49   ` Juan Quintela
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Closing the file before exit on a failure allows
the source to cleanup better, especially with RDMA.

Partial fix for https://bugs.launchpad.net/qemu/+bug/1545052

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/migration.c b/migration/migration.c
index a0db40d364..8552f54ab4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -348,6 +348,7 @@ static void process_incoming_migration_co(void *opaque)
         migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                           MIGRATION_STATUS_FAILED);
         error_report("load of migration failed: %s", strerror(-ret));
+        qemu_fclose(mis->from_src_file);
         exit(EXIT_FAILURE);
     }
     mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths
  2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 1/6] migration/rdma: Fix race on source Dr. David Alan Gilbert (git)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 2/6] migration: Close file on failed migration load Dr. David Alan Gilbert (git)
@ 2017-07-17 11:09 ` Dr. David Alan Gilbert (git)
  2017-07-18  1:20   ` Peter Xu
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 4/6] migration/rdma: Allow cancelling while waiting for wrid Dr. David Alan Gilbert (git)
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The two places that 'goto err_block_for_wrid' weren't setting ret
and so would end up returning 0 even though we've failed.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/rdma.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 6111e10c70..59810aec2e 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1521,14 +1521,16 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
             yield_until_fd_readable(rdma->comp_channel->fd);
         }
 
-        if (ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx)) {
+        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
+        if (ret) {
             perror("ibv_get_cq_event");
             goto err_block_for_wrid;
         }
 
         num_cq_events++;
 
-        if (ibv_req_notify_cq(cq, 0)) {
+        ret = -ibv_req_notify_cq(cq, 0);
+        if (ret) {
             goto err_block_for_wrid;
         }
 
@@ -1564,6 +1566,8 @@ err_block_for_wrid:
     if (num_cq_events) {
         ibv_ack_cq_events(cq, num_cq_events);
     }
+
+    rdma->error_state = ret;
     return ret;
 }
 
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v4 4/6] migration/rdma: Allow cancelling while waiting for wrid
  2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
                   ` (2 preceding siblings ...)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths Dr. David Alan Gilbert (git)
@ 2017-07-17 11:09 ` Dr. David Alan Gilbert (git)
  2017-07-18  1:23   ` Peter Xu
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 5/6] migration/rdma: Safely convert control types Dr. David Alan Gilbert (git)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 6/6] migration/rdma: Send error during cancelling Dr. David Alan Gilbert (git)
  5 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When waiting for a WRID, if the other side dies we end up waiting
for ever with no way to cancel the migration.
Cure this by poll()ing the fd first with a timeout and checking
error flags and migration state.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/rdma.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 59810aec2e..0cf55a6d5b 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1466,6 +1466,56 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
     return  0;
 }
 
+/* Wait for activity on the completion channel.
+ * Returns 0 on success, none-0 on error.
+ */
+static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
+{
+    /*
+     * Coroutine doesn't start until migration_fd_process_incoming()
+     * so don't yield unless we know we're running inside of a coroutine.
+     */
+    if (rdma->migration_started_on_destination) {
+        yield_until_fd_readable(rdma->comp_channel->fd);
+    } else {
+        /* This is the source side, we're in a separate thread
+         * or destination prior to migration_fd_process_incoming()
+         * we can't yield; so we have to poll the fd.
+         * But we need to be able to handle 'cancel' or an error
+         * without hanging forever.
+         */
+        while (!rdma->error_state  && !rdma->received_error) {
+            GPollFD pfds[1];
+            pfds[0].fd = rdma->comp_channel->fd;
+            pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+            /* 0.1s timeout, should be fine for a 'cancel' */
+            switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) {
+            case 1: /* fd active */
+                return 0;
+
+            case 0: /* Timeout, go around again */
+                break;
+
+            default: /* Error of some type -
+                      * I don't trust errno from qemu_poll_ns
+                     */
+                error_report("%s: poll failed", __func__);
+                return -EPIPE;
+            }
+
+            if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
+                /* Bail out and let the cancellation happen */
+                return -EPIPE;
+            }
+        }
+    }
+
+    if (rdma->received_error) {
+        return -EPIPE;
+    }
+    return rdma->error_state;
+}
+
 /*
  * Block until the next work request has completed.
  *
@@ -1513,12 +1563,9 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
     }
 
     while (1) {
-        /*
-         * Coroutine doesn't start until migration_fd_process_incoming()
-         * so don't yield unless we know we're running inside of a coroutine.
-         */
-        if (rdma->migration_started_on_destination) {
-            yield_until_fd_readable(rdma->comp_channel->fd);
+        ret = qemu_rdma_wait_comp_channel(rdma);
+        if (ret) {
+            goto err_block_for_wrid;
         }
 
         ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v4 5/6] migration/rdma: Safely convert control types
  2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
                   ` (3 preceding siblings ...)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 4/6] migration/rdma: Allow cancelling while waiting for wrid Dr. David Alan Gilbert (git)
@ 2017-07-17 11:09 ` Dr. David Alan Gilbert (git)
  2017-07-17 16:20   ` Juan Quintela
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 6/6] migration/rdma: Send error during cancelling Dr. David Alan Gilbert (git)
  5 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

control_desc[] is an array of strings that correspond to a
series of message types; they're used only for error messages, but if
the message type is seriously broken then we could go off the end of
the array.

Convert the array to a function control_desc() that bound checks.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/rdma.c | 54 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 0cf55a6d5b..972167d899 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -165,20 +165,6 @@ enum {
     RDMA_CONTROL_UNREGISTER_FINISHED, /* unpinning finished */
 };
 
-static const char *control_desc[] = {
-    [RDMA_CONTROL_NONE] = "NONE",
-    [RDMA_CONTROL_ERROR] = "ERROR",
-    [RDMA_CONTROL_READY] = "READY",
-    [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
-    [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST",
-    [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT",
-    [RDMA_CONTROL_COMPRESS] = "COMPRESS",
-    [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
-    [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
-    [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
-    [RDMA_CONTROL_UNREGISTER_REQUEST] = "UNREGISTER REQUEST",
-    [RDMA_CONTROL_UNREGISTER_FINISHED] = "UNREGISTER FINISHED",
-};
 
 /*
  * Memory and MR structures used to represent an IB Send/Recv work request.
@@ -251,6 +237,30 @@ typedef struct QEMU_PACKED RDMADestBlock {
     uint32_t padding;
 } RDMADestBlock;
 
+static const char *control_desc(unsigned int rdma_control)
+{
+    static const char *strs[] = {
+        [RDMA_CONTROL_NONE] = "NONE",
+        [RDMA_CONTROL_ERROR] = "ERROR",
+        [RDMA_CONTROL_READY] = "READY",
+        [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
+        [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST",
+        [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT",
+        [RDMA_CONTROL_COMPRESS] = "COMPRESS",
+        [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
+        [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
+        [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
+        [RDMA_CONTROL_UNREGISTER_REQUEST] = "UNREGISTER REQUEST",
+        [RDMA_CONTROL_UNREGISTER_FINISHED] = "UNREGISTER FINISHED",
+    };
+
+    if (rdma_control > RDMA_CONTROL_UNREGISTER_FINISHED) {
+        return "??BAD CONTROL VALUE??";
+    }
+
+    return strs[rdma_control];
+}
+
 static uint64_t htonll(uint64_t v)
 {
     union { uint32_t lv[2]; uint64_t llv; } u;
@@ -1641,7 +1651,7 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
                                    .num_sge = 1,
                                 };
 
-    trace_qemu_rdma_post_send_control(control_desc[head->type]);
+    trace_qemu_rdma_post_send_control(control_desc(head->type));
 
     /*
      * We don't actually need to do a memcpy() in here if we used
@@ -1720,16 +1730,16 @@ static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
     network_to_control((void *) rdma->wr_data[idx].control);
     memcpy(head, rdma->wr_data[idx].control, sizeof(RDMAControlHeader));
 
-    trace_qemu_rdma_exchange_get_response_start(control_desc[expecting]);
+    trace_qemu_rdma_exchange_get_response_start(control_desc(expecting));
 
     if (expecting == RDMA_CONTROL_NONE) {
-        trace_qemu_rdma_exchange_get_response_none(control_desc[head->type],
+        trace_qemu_rdma_exchange_get_response_none(control_desc(head->type),
                                              head->type);
     } else if (head->type != expecting || head->type == RDMA_CONTROL_ERROR) {
         error_report("Was expecting a %s (%d) control message"
                 ", but got: %s (%d), length: %d",
-                control_desc[expecting], expecting,
-                control_desc[head->type], head->type, head->len);
+                control_desc(expecting), expecting,
+                control_desc(head->type), head->type, head->len);
         if (head->type == RDMA_CONTROL_ERROR) {
             rdma->received_error = true;
         }
@@ -1839,7 +1849,7 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
             }
         }
 
-        trace_qemu_rdma_exchange_send_waiting(control_desc[resp->type]);
+        trace_qemu_rdma_exchange_send_waiting(control_desc(resp->type));
         ret = qemu_rdma_exchange_get_response(rdma, resp,
                                               resp->type, RDMA_WRID_DATA);
 
@@ -1851,7 +1861,7 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
         if (resp_idx) {
             *resp_idx = RDMA_WRID_DATA;
         }
-        trace_qemu_rdma_exchange_send_received(control_desc[resp->type]);
+        trace_qemu_rdma_exchange_send_received(control_desc(resp->type));
     }
 
     rdma->control_ready_expected = 1;
@@ -3401,7 +3411,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque)
             ret = -EIO;
             goto out;
         default:
-            error_report("Unknown control message %s", control_desc[head.type]);
+            error_report("Unknown control message %s", control_desc(head.type));
             ret = -EIO;
             goto out;
         }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v4 6/6] migration/rdma: Send error during cancelling
  2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
                   ` (4 preceding siblings ...)
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 5/6] migration/rdma: Safely convert control types Dr. David Alan Gilbert (git)
@ 2017-07-17 11:09 ` Dr. David Alan Gilbert (git)
  5 siblings, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2017-07-17 11:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, lvivier, peterx

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When we issue a cancel and clean up the RDMA channel
send a CONTROL_ERROR to get the destination to quit.

The rdma_cleanup code waits for the event to come back
from the rdma_disconnect; but that wont happen until the
destination quits and there's currently nothing to force
it.

Note this makes the case of a cancel work while the destination
is alive, and it already works if the destination is
truly dead.  Note it doesn't fix the case where the destination
is hung (we get stuck waiting for the rdma_disconnect event).

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/rdma.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 972167d899..ca56594328 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2269,7 +2269,9 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
     int ret, idx;
 
     if (rdma->cm_id && rdma->connected) {
-        if (rdma->error_state && !rdma->received_error) {
+        if ((rdma->error_state ||
+             migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) &&
+            !rdma->received_error) {
             RDMAControlHeader head = { .len = 0,
                                        .type = RDMA_CONTROL_ERROR,
                                        .repeat = 1,
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v4 5/6] migration/rdma: Safely convert control types
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 5/6] migration/rdma: Safely convert control types Dr. David Alan Gilbert (git)
@ 2017-07-17 16:20   ` Juan Quintela
  0 siblings, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2017-07-17 16:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, lvivier, peterx

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> control_desc[] is an array of strings that correspond to a
> series of message types; they're used only for error messages, but if
> the message type is seriously broken then we could go off the end of
> the array.
>
> Convert the array to a function control_desc() that bound checks.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/6] migration: Close file on failed migration load
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 2/6] migration: Close file on failed migration load Dr. David Alan Gilbert (git)
@ 2017-07-17 19:49   ` Juan Quintela
  0 siblings, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2017-07-17 19:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, lvivier, peterx

"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Closing the file before exit on a failure allows
> the source to cleanup better, especially with RDMA.
>
> Partial fix for https://bugs.launchpad.net/qemu/+bug/1545052
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Also to this version.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths Dr. David Alan Gilbert (git)
@ 2017-07-18  1:20   ` Peter Xu
  2017-07-18 19:04     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2017-07-18  1:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, quintela, lvivier

On Mon, Jul 17, 2017 at 12:09:33PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The two places that 'goto err_block_for_wrid' weren't setting ret
> and so would end up returning 0 even though we've failed.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/rdma.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 6111e10c70..59810aec2e 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1521,14 +1521,16 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
>              yield_until_fd_readable(rdma->comp_channel->fd);
>          }
>  
> -        if (ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx)) {
> +        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
> +        if (ret) {
>              perror("ibv_get_cq_event");
>              goto err_block_for_wrid;
>          }
>  
>          num_cq_events++;
>  
> -        if (ibv_req_notify_cq(cq, 0)) {
> +        ret = -ibv_req_notify_cq(cq, 0);

(I didn't really notice that it is returning a positive value for
 error...)

Reviewed-by: Peter Xu <peterx@redhat.com>

> +        if (ret) {
>              goto err_block_for_wrid;
>          }
>  
> @@ -1564,6 +1566,8 @@ err_block_for_wrid:
>      if (num_cq_events) {
>          ibv_ack_cq_events(cq, num_cq_events);
>      }
> +
> +    rdma->error_state = ret;
>      return ret;
>  }
>  
> -- 
> 2.13.0
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v4 4/6] migration/rdma: Allow cancelling while waiting for wrid
  2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 4/6] migration/rdma: Allow cancelling while waiting for wrid Dr. David Alan Gilbert (git)
@ 2017-07-18  1:23   ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2017-07-18  1:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, quintela, lvivier

On Mon, Jul 17, 2017 at 12:09:34PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> When waiting for a WRID, if the other side dies we end up waiting
> for ever with no way to cancel the migration.
> Cure this by poll()ing the fd first with a timeout and checking
> error flags and migration state.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths
  2017-07-18  1:20   ` Peter Xu
@ 2017-07-18 19:04     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2017-07-18 19:04 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, quintela, lvivier

* Peter Xu (peterx@redhat.com) wrote:
> On Mon, Jul 17, 2017 at 12:09:33PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The two places that 'goto err_block_for_wrid' weren't setting ret
> > and so would end up returning 0 even though we've failed.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/rdma.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 6111e10c70..59810aec2e 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -1521,14 +1521,16 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
> >              yield_until_fd_readable(rdma->comp_channel->fd);
> >          }
> >  
> > -        if (ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx)) {
> > +        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
> > +        if (ret) {
> >              perror("ibv_get_cq_event");
> >              goto err_block_for_wrid;
> >          }
> >  
> >          num_cq_events++;
> >  
> > -        if (ibv_req_notify_cq(cq, 0)) {
> > +        ret = -ibv_req_notify_cq(cq, 0);
> 
> (I didn't really notice that it is returning a positive value for
>  error...)

Yes, the manpage said that and I followed it into the source to check;
none of the ibv interfaces are in any way consistent.

From the manpage:
   ibv_get_cq_event() returns 0 on success, and -1 on error.
   ibv_ack_cq_events() returns no value.
   ibv_req_notify_cq() returns 0 on success, or the value of errno on failure

   etc

(and -1 will do as a ret value if we havent got a real errno)

Dave

> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> > +        if (ret) {
> >              goto err_block_for_wrid;
> >          }
> >  
> > @@ -1564,6 +1566,8 @@ err_block_for_wrid:
> >      if (num_cq_events) {
> >          ibv_ack_cq_events(cq, num_cq_events);
> >      }
> > +
> > +    rdma->error_state = ret;
> >      return ret;
> >  }
> >  
> > -- 
> > 2.13.0
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-07-18 19:04 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-17 11:09 [Qemu-devel] [PATCH v4 0/6] A bunch of RDMA fixes Dr. David Alan Gilbert (git)
2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 1/6] migration/rdma: Fix race on source Dr. David Alan Gilbert (git)
2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 2/6] migration: Close file on failed migration load Dr. David Alan Gilbert (git)
2017-07-17 19:49   ` Juan Quintela
2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 3/6] migration/rdma: fix qemu_rdma_block_for_wrid error paths Dr. David Alan Gilbert (git)
2017-07-18  1:20   ` Peter Xu
2017-07-18 19:04     ` Dr. David Alan Gilbert
2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 4/6] migration/rdma: Allow cancelling while waiting for wrid Dr. David Alan Gilbert (git)
2017-07-18  1:23   ` Peter Xu
2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 5/6] migration/rdma: Safely convert control types Dr. David Alan Gilbert (git)
2017-07-17 16:20   ` Juan Quintela
2017-07-17 11:09 ` [Qemu-devel] [PATCH v4 6/6] migration/rdma: Send error during cancelling Dr. David Alan Gilbert (git)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.