All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] An RDMA race?
@ 2015-12-11 17:48 Dr. David Alan Gilbert
  2015-12-12  3:40 ` Michael R. Hines
  0 siblings, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-11 17:48 UTC (permalink / raw)
  To: michael; +Cc: qemu-devel

Hi Michael,
   I think I've got an RDMA race condition, but I'm being a little
cautious at the moment and wondered if you agree with the following
diagnosis.

It's showing up in a world of mine that's sending more control messages
from the destination->source and I'm seeing the following.

We normally expect:

   src                        dest
     ----------->control ready->
   Sees SEND_CONTROL signal to ack that it has been sent
         <-----control message--
   Sees RECV_CONTROL message from dest        
           
but what I'm seeing is:
   src                        dest
     ----------->control ready->
         <-----control message--
   Sees RECV_CONTROL message from dest        
   Sees SEND_CONTROL signal to ack that it has been sent

which seems a bit odd - that means that the source is sending
a message, and getting the reply before it gets the acknowledgment
from it's own stack that it sent the message the reply is to ?!
It's rare (~1 in 100 migrations ish) and only in my world
where I'm sending more control messages.

But that confuses qemu_rdma_post_send_control when it sends
the READY, because it calls block_for_wrid(SEND_CONTROL),
that sees the RECV_CONTROL (which it loses) and then the SEND_CONTROL.

Does this sound a sane explanation?   My current fix (below)
I only use it for RDMA_CONTROL_READY , I'm torn between worrying
that the race is potentially more general, but worry what will
happen if I stop making post_send_control wait at the end for
all the other control messages.
It seems to work :-)

What do you reckon?

Dave


From 332b867fb0f63be47d89822b7a10b2a519b431fc Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Fri, 11 Dec 2015 14:53:11 +0000
Subject: [PATCH] Don't wait for send-control on a ready

---
 migration/rdma.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 trace-events     |  1 +
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index b0feddc..6a0e59f 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -314,7 +314,13 @@ typedef struct RDMAContext {
      * the READY message before send() does, in which case we need to
      * know if it completed.
      */
-    int control_ready_expected;
+    bool control_ready_expected;
+
+    /*
+     * Set when we've sent a control message but not yet waited for the
+     * result.
+     */
+    bool control_sent_expected;
 
     /* number of outstanding writes */
     int nb_sent;
@@ -1454,6 +1460,13 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
         rdma->control_ready_expected = 0;
     }
 
+    if (rdma->control_sent_expected &&
+        (wr_id >= RDMA_WRID_SEND_CONTROL)) {
+        trace_qemu_rdma_poll_send(wrid_desc[RDMA_WRID_SEND_CONTROL],
+                  wr_id - RDMA_WRID_SEND_CONTROL, wr_id, rdma->nb_sent);
+        rdma->control_sent_expected = 0;
+    }
+
     if (wr_id == RDMA_WRID_RDMA_WRITE) {
         uint64_t chunk =
             (wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
@@ -1609,6 +1622,7 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
                                        RDMAControlHeader *head)
 {
     int ret = 0;
+    bool wait;
     RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_WRID_CONTROL];
     struct ibv_send_wr *bad_wr;
     struct ibv_sge sge = {
@@ -1626,6 +1640,26 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
 
     trace_qemu_rdma_post_send_control(control_desc[head->type]);
 
+    if (rdma->control_sent_expected) {
+        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+        if (ret < 0) {
+            error_report("%s: send polling control error (entry) %s",
+                         __func__, strerror(ret));
+            return ret;
+        }
+    }
+
+    switch (head->type) {
+    case RDMA_CONTROL_READY:
+        wait=false;
+        rdma->control_sent_expected = true;
+        break;
+
+    default:
+        wait=true;
+        break;
+    }
+
     /*
      * We don't actually need to do a memcpy() in here if we used
      * the "sge" properly, but since we're only sending control messages
@@ -1646,13 +1680,15 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
     ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
 
     if (ret > 0) {
-        error_report("Failed to use post IB SEND for control");
+        error_report("Failed to use post IB SEND for control: %s", strerror(ret));
         return -ret;
     }
 
-    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
-    if (ret < 0) {
-        error_report("rdma migration: send polling control error");
+    if (wait) {
+        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+        if (ret < 0) {
+            error_report("rdma migration: send polling control error");
+        }
     }
 
     return ret;
diff --git a/trace-events b/trace-events
index 9043b56..0eeb6b2 100644
--- a/trace-events
+++ b/trace-events
@@ -1447,6 +1447,7 @@ qemu_rdma_exchange_send_received(const char *desc) "Response %s received."
 qemu_rdma_fill(size_t control_len, size_t size) "RDMA %zd of %zd bytes already in buffer"
 qemu_rdma_init_ram_blocks(int blocks) "Allocated %d local ram block structures"
 qemu_rdma_poll_recv(const char *compstr, int64_t comp, int64_t id, int sent) "completion %s #%" PRId64 " received (%" PRId64 ") left %d"
+qemu_rdma_poll_send(const char *compstr, int64_t comp, int64_t id, int sent) "completion %s #%" PRId64 " received (%" PRId64 ") left %d"
 qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block, uint64_t chunk, void *local, void *remote) "completions %s (%" PRId64 ") left %d, block %" PRIu64 ", chunk: %" PRIu64 " %p %p"
 qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
 qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
-- 
2.5.0

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] An RDMA race?
  2015-12-11 17:48 [Qemu-devel] An RDMA race? Dr. David Alan Gilbert
@ 2015-12-12  3:40 ` Michael R. Hines
  2015-12-14 10:53   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 7+ messages in thread
From: Michael R. Hines @ 2015-12-12  3:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Michael Hines, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1904 bytes --]

David,

Thanks for including my email directly. It helps a lot.

Below, I'm going to assume that only "dest" is calling
qemu_rdma_exchange_recv()
and only src is calling qemu_rdma_exchange_send(), since you didn't specify
who
is sending and who is receiving.

If that assumption is wrong, please respond again.

Comments inline.....

On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <dgilbert@redhat.com
> wrote:

> Hi Michael,
>    I think I've got an RDMA race condition, but I'm being a little
> cautious at the moment and wondered if you agree with the following
> diagnosis.
>
> It's showing up in a world of mine that's sending more control messages
> from the destination->source and I'm seeing the following.
>
> We normally expect:
>
>    src                        dest
>      ----------->control ready->
>

If src is sending, this is not correct. Dest should send the ready message
if it is receiving, not src, which breaks the above assumption. So, I'll
reverse the assumption previously and continue with your observation and
assume that src is receiving instead of dest, which should instead look
like:

src  (receiving)                      dest (sending)
     ----------->control ready->



>    Sees SEND_CONTROL signal to ack that it has been sent
>

I'll assume here that you meant that dest sees the ready message and is
then later sends something.


>          <-----control message--
>    Sees RECV_CONTROL message from dest
>
>
Similar assumption for the receiver (src).


> but what I'm seeing is:
>    src                        dest
>      ----------->control ready->
>          <-----control message--
>    Sees RECV_CONTROL message from dest
>

hmmmmm....


>    Sees SEND_CONTROL signal to ack that it has been sent
>
>
There's not enough information here....... do you have a multi-threaded
send or receive or something?

Do the work request IDs match up?

- Michael

[-- Attachment #2: Type: text/html, Size: 3322 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] An RDMA race?
  2015-12-12  3:40 ` Michael R. Hines
@ 2015-12-14 10:53   ` Dr. David Alan Gilbert
  2015-12-20  7:08     ` Michael R. Hines
  0 siblings, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-14 10:53 UTC (permalink / raw)
  To: Michael R. Hines; +Cc: Michael Hines, qemu-devel

* Michael R. Hines (mhines@digitalocean.com) wrote:
> David,
> 
> Thanks for including my email directly. It helps a lot.
> 
> Below, I'm going to assume that only "dest" is calling
> qemu_rdma_exchange_recv()
> and only src is calling qemu_rdma_exchange_send(), since you didn't specify
> who
> is sending and who is receiving.
> 
> If that assumption is wrong, please respond again.

That's correct.

> Comments inline.....
> 
> On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <dgilbert@redhat.com
> > wrote:
> 
> > Hi Michael,
> >    I think I've got an RDMA race condition, but I'm being a little
> > cautious at the moment and wondered if you agree with the following
> > diagnosis.
> >
> > It's showing up in a world of mine that's sending more control messages
> > from the destination->source and I'm seeing the following.
> >
> > We normally expect:
> >
> >    src                        dest
> >      ----------->control ready->
> >
> 
> If src is sending, this is not correct. Dest should send the ready message
> if it is receiving, not src, which breaks the above assumption. So, I'll
> reverse the assumption previously and continue with your observation and
> assume that src is receiving instead of dest, which should instead look
> like:

Gah! Yes, I got the label the wrong way around; it's dest sending control ready.

> src  (receiving)                      dest (sending)
>      ----------->control ready->
> 
> 
> 
> >    Sees SEND_CONTROL signal to ack that it has been sent
> >
> 
> I'll assume here that you meant that dest sees the ready message and is
> then later sends something.
> 
> 
> >          <-----control message--
> >    Sees RECV_CONTROL message from dest
> >
> >
> Similar assumption for the receiver (src).
> 
> 
> > but what I'm seeing is:
> >    src                        dest
> >      ----------->control ready->
> >          <-----control message--
> >    Sees RECV_CONTROL message from dest
> >
> 
> hmmmmm....
> 
> 
> >    Sees SEND_CONTROL signal to ack that it has been sent
> >
> >
> There's not enough information here....... do you have a multi-threaded
> send or receive or something?

No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
so the change which got me to trigger this bug was that I'd
added a new control message 'notify write' which explicitly
told the destination it had a page written to; at the RDMA level
that was the only change.

> Do the work request IDs match up?

Yes I think so; I also added a sequence number to the 'ready' messages
to check I wasn't losing one.
I had a chat to one of our RDMA guys (Doug Ledford) and he said
it's perfectly legal for RDMA to take longer to return the signal
from the send than for the round trip of the destination responding;
the 'signal' doesn't happen until an ack has been received from the
destination card anyway, so the ack can get delayed or retried.
So I think we do need to fix this; the question then is how do we fix
it for all control messages without breaking anything else.   Are there
any cases that rely on having received the signal from the send before
continuing, or could i just do what I'm doing for all control messages?

Dave

> - Michael
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] An RDMA race?
  2015-12-14 10:53   ` Dr. David Alan Gilbert
@ 2015-12-20  7:08     ` Michael R. Hines
  2015-12-20  7:09       ` Michael R. Hines
  2016-01-04 18:15       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 7+ messages in thread
From: Michael R. Hines @ 2015-12-20  7:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Michael Hines, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4124 bytes --]

Adding such a control message would defeat the benefits of RDMA, as there
shouldn't be any signalling in the actual DMA path, or RDMA latency would
be too high. If you're sending control messages for individual writes, then
you need to change up your design. It's OK to design ACKs for groups of
writes, depending on the requirements.

So, the out-of-order issue you're seeing is only with your new message, not
the original messages?

Can you describe/document it in more detail so I can help advise?

- Michael

On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert <dgilbert@redhat.com
> wrote:

> * Michael R. Hines (mhines@digitalocean.com) wrote:
> > David,
> >
> > Thanks for including my email directly. It helps a lot.
> >
> > Below, I'm going to assume that only "dest" is calling
> > qemu_rdma_exchange_recv()
> > and only src is calling qemu_rdma_exchange_send(), since you didn't
> specify
> > who
> > is sending and who is receiving.
> >
> > If that assumption is wrong, please respond again.
>
> That's correct.
>
> > Comments inline.....
> >
> > On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <
> dgilbert@redhat.com
> > > wrote:
> >
> > > Hi Michael,
> > >    I think I've got an RDMA race condition, but I'm being a little
> > > cautious at the moment and wondered if you agree with the following
> > > diagnosis.
> > >
> > > It's showing up in a world of mine that's sending more control messages
> > > from the destination->source and I'm seeing the following.
> > >
> > > We normally expect:
> > >
> > >    src                        dest
> > >      ----------->control ready->
> > >
> >
> > If src is sending, this is not correct. Dest should send the ready
> message
> > if it is receiving, not src, which breaks the above assumption. So, I'll
> > reverse the assumption previously and continue with your observation and
> > assume that src is receiving instead of dest, which should instead look
> > like:
>
> Gah! Yes, I got the label the wrong way around; it's dest sending control
> ready.
>
> > src  (receiving)                      dest (sending)
> >      ----------->control ready->
> >
> >
> >
> > >    Sees SEND_CONTROL signal to ack that it has been sent
> > >
> >
> > I'll assume here that you meant that dest sees the ready message and is
> > then later sends something.
> >
> >
> > >          <-----control message--
> > >    Sees RECV_CONTROL message from dest
> > >
> > >
> > Similar assumption for the receiver (src).
> >
> >
> > > but what I'm seeing is:
> > >    src                        dest
> > >      ----------->control ready->
> > >          <-----control message--
> > >    Sees RECV_CONTROL message from dest
> > >
> >
> > hmmmmm....
> >
> >
> > >    Sees SEND_CONTROL signal to ack that it has been sent
> > >
> > >
> > There's not enough information here....... do you have a multi-threaded
> > send or receive or something?
>
> No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
> so the change which got me to trigger this bug was that I'd
> added a new control message 'notify write' which explicitly
> told the destination it had a page written to; at the RDMA level
> that was the only change.
>
> > Do the work request IDs match up?
>
> Yes I think so; I also added a sequence number to the 'ready' messages
> to check I wasn't losing one.
> I had a chat to one of our RDMA guys (Doug Ledford) and he said
> it's perfectly legal for RDMA to take longer to return the signal
> from the send than for the round trip of the destination responding;
> the 'signal' doesn't happen until an ack has been received from the
> destination card anyway, so the ack can get delayed or retried.
> So I think we do need to fix this; the question then is how do we fix
> it for all control messages without breaking anything else.   Are there
> any cases that rely on having received the signal from the send before
> continuing, or could i just do what I'm doing for all control messages?
>
> Dave
>
> > - Michael
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>



-- 
/*
 * Michael R. Hines
 * https://michael.hinespot.com
 */

[-- Attachment #2: Type: text/html, Size: 5740 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] An RDMA race?
  2015-12-20  7:08     ` Michael R. Hines
@ 2015-12-20  7:09       ` Michael R. Hines
  2016-01-04 18:15       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 7+ messages in thread
From: Michael R. Hines @ 2015-12-20  7:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Michael Hines, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4536 bytes --]

And, yes, out-of-order messages are totally fine ----- we just have to be
careful with the design.

- Michael

On Sun, Dec 20, 2015 at 3:08 PM, Michael R. Hines <mhines@digitalocean.com>
wrote:

> Adding such a control message would defeat the benefits of RDMA, as there
> shouldn't be any signalling in the actual DMA path, or RDMA latency would
> be too high. If you're sending control messages for individual writes, then
> you need to change up your design. It's OK to design ACKs for groups of
> writes, depending on the requirements.
>
> So, the out-of-order issue you're seeing is only with your new message,
> not the original messages?
>
> Can you describe/document it in more detail so I can help advise?
>
> - Michael
>
> On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert <
> dgilbert@redhat.com> wrote:
>
>> * Michael R. Hines (mhines@digitalocean.com) wrote:
>> > David,
>> >
>> > Thanks for including my email directly. It helps a lot.
>> >
>> > Below, I'm going to assume that only "dest" is calling
>> > qemu_rdma_exchange_recv()
>> > and only src is calling qemu_rdma_exchange_send(), since you didn't
>> specify
>> > who
>> > is sending and who is receiving.
>> >
>> > If that assumption is wrong, please respond again.
>>
>> That's correct.
>>
>> > Comments inline.....
>> >
>> > On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <
>> dgilbert@redhat.com
>> > > wrote:
>> >
>> > > Hi Michael,
>> > >    I think I've got an RDMA race condition, but I'm being a little
>> > > cautious at the moment and wondered if you agree with the following
>> > > diagnosis.
>> > >
>> > > It's showing up in a world of mine that's sending more control
>> messages
>> > > from the destination->source and I'm seeing the following.
>> > >
>> > > We normally expect:
>> > >
>> > >    src                        dest
>> > >      ----------->control ready->
>> > >
>> >
>> > If src is sending, this is not correct. Dest should send the ready
>> message
>> > if it is receiving, not src, which breaks the above assumption. So, I'll
>> > reverse the assumption previously and continue with your observation and
>> > assume that src is receiving instead of dest, which should instead look
>> > like:
>>
>> Gah! Yes, I got the label the wrong way around; it's dest sending control
>> ready.
>>
>> > src  (receiving)                      dest (sending)
>> >      ----------->control ready->
>> >
>> >
>> >
>> > >    Sees SEND_CONTROL signal to ack that it has been sent
>> > >
>> >
>> > I'll assume here that you meant that dest sees the ready message and is
>> > then later sends something.
>> >
>> >
>> > >          <-----control message--
>> > >    Sees RECV_CONTROL message from dest
>> > >
>> > >
>> > Similar assumption for the receiver (src).
>> >
>> >
>> > > but what I'm seeing is:
>> > >    src                        dest
>> > >      ----------->control ready->
>> > >          <-----control message--
>> > >    Sees RECV_CONTROL message from dest
>> > >
>> >
>> > hmmmmm....
>> >
>> >
>> > >    Sees SEND_CONTROL signal to ack that it has been sent
>> > >
>> > >
>> > There's not enough information here....... do you have a multi-threaded
>> > send or receive or something?
>>
>> No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
>> so the change which got me to trigger this bug was that I'd
>> added a new control message 'notify write' which explicitly
>> told the destination it had a page written to; at the RDMA level
>> that was the only change.
>>
>> > Do the work request IDs match up?
>>
>> Yes I think so; I also added a sequence number to the 'ready' messages
>> to check I wasn't losing one.
>> I had a chat to one of our RDMA guys (Doug Ledford) and he said
>> it's perfectly legal for RDMA to take longer to return the signal
>> from the send than for the round trip of the destination responding;
>> the 'signal' doesn't happen until an ack has been received from the
>> destination card anyway, so the ack can get delayed or retried.
>> So I think we do need to fix this; the question then is how do we fix
>> it for all control messages without breaking anything else.   Are there
>> any cases that rely on having received the signal from the send before
>> continuing, or could i just do what I'm doing for all control messages?
>>
>> Dave
>>
>> > - Michael
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
>
>
>
> --
> /*
>  * Michael R. Hines
>  * https://michael.hinespot.com
>  */
>



-- 
/*
 * Michael R. Hines
 * https://michael.hinespot.com
 */

[-- Attachment #2: Type: text/html, Size: 6546 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] An RDMA race?
  2015-12-20  7:08     ` Michael R. Hines
  2015-12-20  7:09       ` Michael R. Hines
@ 2016-01-04 18:15       ` Dr. David Alan Gilbert
  2016-01-09 11:03         ` Michael R. Hines
  1 sibling, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-04 18:15 UTC (permalink / raw)
  To: Michael R. Hines; +Cc: Michael Hines, qemu-devel

* Michael R. Hines (mhines@digitalocean.com) wrote:
> Adding such a control message would defeat the benefits of RDMA, as there
> shouldn't be any signalling in the actual DMA path, or RDMA latency would
> be too high. If you're sending control messages for individual writes, then
> you need to change up your design. It's OK to design ACKs for groups of
> writes, depending on the requirements.

I started off with sending individual messages, and then once I had it working
I made it group them to send one message every 2048 pages.
The performance isn't very good though, and I've not yet analysed why.

> So, the out-of-order issue you're seeing is only with your new message, not
> the original messages?

Yes I believe they're only on the new messages; however:
  1) I'm sending a lot more control messages, so if there's a race I'm
    a lot more likely to trigger it. (I'm not sure I'm triggering it in the
    case where I group those 2048 together) - so does this mean it would
    occasionally trigger on the unmodified code?

  2) My reading of the existing code is that I think it could happen;
    a) the source is ready to send something and is waiting for a CONTROL_READY,
    b) the destination sends the CONTROL_READY
        (blocking in qemu_rdma_post_send_control call to 
         qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL)
    c) The source sends it's data
    d) That arrives at the destination
    e) finally the WRID_SEND_CONTROL arrives back

   It's having d/e the wrong way round which is the race I think I'm seeing
   and then we lose (d)'s data.

> Can you describe/document it in more detail so I can help advise?

There are 2 cases where the destination needs to know which pages it's received:
  i) In COLO or checkpointing where it's receiving a partial new checkpoint;
    since it's only receiving a partial checkpoint it needs to know what it's
    received. This allows the destination to avoid copying the whole of it's
    received checkpoint and only copy the bits that changed.

 ii) On postcopy once a page is received by the destination the page has to
    be atomically placed;  I've not thought too hard about that yet.

Dave

> 
> - Michael
> 
> On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert <dgilbert@redhat.com
> > wrote:
> 
> > * Michael R. Hines (mhines@digitalocean.com) wrote:
> > > David,
> > >
> > > Thanks for including my email directly. It helps a lot.
> > >
> > > Below, I'm going to assume that only "dest" is calling
> > > qemu_rdma_exchange_recv()
> > > and only src is calling qemu_rdma_exchange_send(), since you didn't
> > specify
> > > who
> > > is sending and who is receiving.
> > >
> > > If that assumption is wrong, please respond again.
> >
> > That's correct.
> >
> > > Comments inline.....
> > >
> > > On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <
> > dgilbert@redhat.com
> > > > wrote:
> > >
> > > > Hi Michael,
> > > >    I think I've got an RDMA race condition, but I'm being a little
> > > > cautious at the moment and wondered if you agree with the following
> > > > diagnosis.
> > > >
> > > > It's showing up in a world of mine that's sending more control messages
> > > > from the destination->source and I'm seeing the following.
> > > >
> > > > We normally expect:
> > > >
> > > >    src                        dest
> > > >      ----------->control ready->
> > > >
> > >
> > > If src is sending, this is not correct. Dest should send the ready
> > message
> > > if it is receiving, not src, which breaks the above assumption. So, I'll
> > > reverse the assumption previously and continue with your observation and
> > > assume that src is receiving instead of dest, which should instead look
> > > like:
> >
> > Gah! Yes, I got the label the wrong way around; it's dest sending control
> > ready.
> >
> > > src  (receiving)                      dest (sending)
> > >      ----------->control ready->
> > >
> > >
> > >
> > > >    Sees SEND_CONTROL signal to ack that it has been sent
> > > >
> > >
> > > I'll assume here that you meant that dest sees the ready message and is
> > > then later sends something.
> > >
> > >
> > > >          <-----control message--
> > > >    Sees RECV_CONTROL message from dest
> > > >
> > > >
> > > Similar assumption for the receiver (src).
> > >
> > >
> > > > but what I'm seeing is:
> > > >    src                        dest
> > > >      ----------->control ready->
> > > >          <-----control message--
> > > >    Sees RECV_CONTROL message from dest
> > > >
> > >
> > > hmmmmm....
> > >
> > >
> > > >    Sees SEND_CONTROL signal to ack that it has been sent
> > > >
> > > >
> > > There's not enough information here....... do you have a multi-threaded
> > > send or receive or something?
> >
> > No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
> > so the change which got me to trigger this bug was that I'd
> > added a new control message 'notify write' which explicitly
> > told the destination it had a page written to; at the RDMA level
> > that was the only change.
> >
> > > Do the work request IDs match up?
> >
> > Yes I think so; I also added a sequence number to the 'ready' messages
> > to check I wasn't losing one.
> > I had a chat to one of our RDMA guys (Doug Ledford) and he said
> > it's perfectly legal for RDMA to take longer to return the signal
> > from the send than for the round trip of the destination responding;
> > the 'signal' doesn't happen until an ack has been received from the
> > destination card anyway, so the ack can get delayed or retried.
> > So I think we do need to fix this; the question then is how do we fix
> > it for all control messages without breaking anything else.   Are there
> > any cases that rely on having received the signal from the send before
> > continuing, or could i just do what I'm doing for all control messages?
> >
> > Dave
> >
> > > - Michael
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
> 
> 
> -- 
> /*
>  * Michael R. Hines
>  * https://michael.hinespot.com
>  */
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] An RDMA race?
  2016-01-04 18:15       ` Dr. David Alan Gilbert
@ 2016-01-09 11:03         ` Michael R. Hines
  0 siblings, 0 replies; 7+ messages in thread
From: Michael R. Hines @ 2016-01-09 11:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Michael Hines, qemu-devel

I don't mind ACKing this change if we could agree on some kind of 
regression test for this. (I have an RDMA card at home that I could run 
tests on if need be).

The way that virt-test goes about this is not sufficient. The way I do 
testing for RDMA is that I not only confirm that
the migration succeded or failed, but I actually compare serial console 
output for funny keywords, like "panic"
and so forth to make a poor-man's attempt guess at whether or not there 
was any memory corruption.

Do you have a testing harness for yourself? (I'd also like to know what 
the COLO guys are doing).

Maybe we can coalesce around something?

- Michael

On 01/04/2016 12:15 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mhines@digitalocean.com) wrote:
>> Adding such a control message would defeat the benefits of RDMA, as there
>> shouldn't be any signalling in the actual DMA path, or RDMA latency would
>> be too high. If you're sending control messages for individual writes, then
>> you need to change up your design. It's OK to design ACKs for groups of
>> writes, depending on the requirements.
> I started off with sending individual messages, and then once I had it working
> I made it group them to send one message every 2048 pages.
> The performance isn't very good though, and I've not yet analysed why.
>
>> So, the out-of-order issue you're seeing is only with your new message, not
>> the original messages?
> Yes I believe they're only on the new messages; however:
>    1) I'm sending a lot more control messages, so if there's a race I'm
>      a lot more likely to trigger it. (I'm not sure I'm triggering it in the
>      case where I group those 2048 together) - so does this mean it would
>      occasionally trigger on the unmodified code?
>
>    2) My reading of the existing code is that I think it could happen;
>      a) the source is ready to send something and is waiting for a CONTROL_READY,
>      b) the destination sends the CONTROL_READY
>          (blocking in qemu_rdma_post_send_control call to
>           qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL)
>      c) The source sends it's data
>      d) That arrives at the destination
>      e) finally the WRID_SEND_CONTROL arrives back
>
>     It's having d/e the wrong way round which is the race I think I'm seeing
>     and then we lose (d)'s data.
>
>> Can you describe/document it in more detail so I can help advise?
> There are 2 cases where the destination needs to know which pages it's received:
>    i) In COLO or checkpointing where it's receiving a partial new checkpoint;
>      since it's only receiving a partial checkpoint it needs to know what it's
>      received. This allows the destination to avoid copying the whole of it's
>      received checkpoint and only copy the bits that changed.
>
>   ii) On postcopy once a page is received by the destination the page has to
>      be atomically placed;  I've not thought too hard about that yet.
>
> Dave
>
>> - Michael
>>
>> On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert <dgilbert@redhat.com
>>> wrote:
>>> * Michael R. Hines (mhines@digitalocean.com) wrote:
>>>> David,
>>>>
>>>> Thanks for including my email directly. It helps a lot.
>>>>
>>>> Below, I'm going to assume that only "dest" is calling
>>>> qemu_rdma_exchange_recv()
>>>> and only src is calling qemu_rdma_exchange_send(), since you didn't
>>> specify
>>>> who
>>>> is sending and who is receiving.
>>>>
>>>> If that assumption is wrong, please respond again.
>>> That's correct.
>>>
>>>> Comments inline.....
>>>>
>>>> On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <
>>> dgilbert@redhat.com
>>>>> wrote:
>>>>> Hi Michael,
>>>>>     I think I've got an RDMA race condition, but I'm being a little
>>>>> cautious at the moment and wondered if you agree with the following
>>>>> diagnosis.
>>>>>
>>>>> It's showing up in a world of mine that's sending more control messages
>>>>> from the destination->source and I'm seeing the following.
>>>>>
>>>>> We normally expect:
>>>>>
>>>>>     src                        dest
>>>>>       ----------->control ready->
>>>>>
>>>> If src is sending, this is not correct. Dest should send the ready
>>> message
>>>> if it is receiving, not src, which breaks the above assumption. So, I'll
>>>> reverse the assumption previously and continue with your observation and
>>>> assume that src is receiving instead of dest, which should instead look
>>>> like:
>>> Gah! Yes, I got the label the wrong way around; it's dest sending control
>>> ready.
>>>
>>>> src  (receiving)                      dest (sending)
>>>>       ----------->control ready->
>>>>
>>>>
>>>>
>>>>>     Sees SEND_CONTROL signal to ack that it has been sent
>>>>>
>>>> I'll assume here that you meant that dest sees the ready message and is
>>>> then later sends something.
>>>>
>>>>
>>>>>           <-----control message--
>>>>>     Sees RECV_CONTROL message from dest
>>>>>
>>>>>
>>>> Similar assumption for the receiver (src).
>>>>
>>>>
>>>>> but what I'm seeing is:
>>>>>     src                        dest
>>>>>       ----------->control ready->
>>>>>           <-----control message--
>>>>>     Sees RECV_CONTROL message from dest
>>>>>
>>>> hmmmmm....
>>>>
>>>>
>>>>>     Sees SEND_CONTROL signal to ack that it has been sent
>>>>>
>>>>>
>>>> There's not enough information here....... do you have a multi-threaded
>>>> send or receive or something?
>>> No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
>>> so the change which got me to trigger this bug was that I'd
>>> added a new control message 'notify write' which explicitly
>>> told the destination it had a page written to; at the RDMA level
>>> that was the only change.
>>>
>>>> Do the work request IDs match up?
>>> Yes I think so; I also added a sequence number to the 'ready' messages
>>> to check I wasn't losing one.
>>> I had a chat to one of our RDMA guys (Doug Ledford) and he said
>>> it's perfectly legal for RDMA to take longer to return the signal
>>> from the send than for the round trip of the destination responding;
>>> the 'signal' doesn't happen until an ack has been received from the
>>> destination card anyway, so the ack can get delayed or retried.
>>> So I think we do need to fix this; the question then is how do we fix
>>> it for all control messages without breaking anything else.   Are there
>>> any cases that rely on having received the signal from the send before
>>> continuing, or could i just do what I'm doing for all control messages?
>>>
>>> Dave
>>>
>>>> - Michael
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>
>>
>> -- 
>> /*
>>   * Michael R. Hines
>>   * https://michael.hinespot.com
>>   */
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

-- 
/*
  * Michael R. Hines
  * Platform Engineer, DigitalOcean.
  */

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-01-09 11:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-11 17:48 [Qemu-devel] An RDMA race? Dr. David Alan Gilbert
2015-12-12  3:40 ` Michael R. Hines
2015-12-14 10:53   ` Dr. David Alan Gilbert
2015-12-20  7:08     ` Michael R. Hines
2015-12-20  7:09       ` Michael R. Hines
2016-01-04 18:15       ` Dr. David Alan Gilbert
2016-01-09 11:03         ` Michael R. Hines

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.