* [PATCH] libxc: succeed silently on restore
@ 2010-09-02 16:14 Ian Campbell
  2010-09-02 17:01 ` Ian Campbell
  2010-09-02 17:14 ` Ian Jackson
  0 siblings, 2 replies; 9+ messages in thread
From: Ian Campbell @ 2010-09-02 16:14 UTC (permalink / raw)
  To: xen-devel; +Cc: Brendan Cully

# HG changeset patch
# User Ian Campbell <ian.campbell@citrix.com>
# Date 1283444053 -3600
# Node ID 5ad37819cddd19a270659d50ddf48da27ef987f6
# Parent  5906bc603a3ce47ca01b33bef7d952af822459fd
libxc: succeed silently on restore.

When doing a non-checkpoint restore we should expect an EOF at the end
of the restore and not log errors:
  xc: error: 0-length read: Internal error
  xc: error: read_exact_timed failed (read rc: 0, errno: 0): Internal error
  xc: error: Error when reading batch size (0 = Success): Internal error
  xc: error: error when buffering batch, finishing (0 = Success): Internal error
(I especially like the implied "Error == Success" ;-))

We allow for an EOF precisely at the start of a new batch only when
the restore is already marked as completed. In this case rdexact will
fail silently and propagate the specific failure upwards so that
higher levels can remain silent as well.

This is hackier than I would like, but short of a massive refactoring
it is the best I could come up with.

I think this does something sensible for Remus too, although I've not
actually tried it. It's not clear that the select timeout case
shouldn't be logged in a less scary way too, since it isn't really an
error in the restore (although presumably it represents some sort of
failure on the active VM on the other end of the pipe).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Brendan Cully <brendan@cs.ubc.ca>

diff -r 5906bc603a3c -r 5ad37819cddd tools/libxc/xc_domain_restore.c
--- a/tools/libxc/xc_domain_restore.c	Thu Sep 02 14:38:44 2010 +0100
+++ b/tools/libxc/xc_domain_restore.c	Thu Sep 02 17:14:13 2010 +0100
@@ -49,7 +49,7 @@ struct restore_ctx {
 
 #ifndef __MINIOS__
 static ssize_t rdexact(struct xc_interface *xch, struct restore_ctx *ctx,
-                       int fd, void* buf, size_t size)
+                       int fd, void* buf, size_t size, int eof)
 {
     size_t offset = 0;
     ssize_t len;
@@ -69,7 +69,7 @@ static ssize_t rdexact(struct xc_interfa
             if ( len == -1 && errno == EINTR )
                 continue;
             if ( !FD_ISSET(fd, &rfds) ) {
-                ERROR("read_exact_timed failed (select returned %zd)", len);
+                ERROR("rdexact failed (select returned %zd)", len);
                 errno = ETIMEDOUT;
                 return -1;
             }
@@ -79,11 +79,14 @@ static ssize_t rdexact(struct xc_interfa
         if ( (len == -1) && ((errno == EINTR) || (errno == EAGAIN)) )
             continue;
         if ( len == 0 ) {
+            /* If we are expecting EOF then fail silently with errno = 0 */
+            if ( offset == 0 && eof )
+                return -2;
             ERROR("0-length read");
             errno = 0;
         }
         if ( len <= 0 ) {
-            ERROR("read_exact_timed failed (read rc: %d, errno: %d)", len, errno);
+            ERROR("rdexact failed (read rc: %d, errno: %d)", len, errno);
             return -1;
         }
         offset += len;
@@ -92,9 +95,11 @@ static ssize_t rdexact(struct xc_interfa
     return 0;
 }
 
-#define RDEXACT(fd,buf,size) rdexact(xch, ctx, fd, buf, size)
+#define RDEXACT(fd,buf,size) rdexact(xch, ctx, fd, buf, size, 0)
+#define RDEXACT_EOF(fd,buf,size) rdexact(xch, ctx, fd, buf, size, ctx->completed)
 #else
 #define RDEXACT read_exact
+#define RDEXACT_EOF read_exact
 #endif
 /*
 ** In the state file (or during transfer), all page-table pages are
@@ -671,11 +676,13 @@ static int pagebuf_get_one(xc_interface 
 {
     int count, countpages, oldcount, i;
     void* ptmp;
+    int rc;
 
-    if ( RDEXACT(fd, &count, sizeof(count)) )
+    if ( (rc = RDEXACT_EOF(fd, &count, sizeof(count))) )
     {
-        PERROR("Error when reading batch size");
-        return -1;
+        if ( rc == -1 )
+            PERROR("Error when reading batch size");
+        return rc;
     }
 
     // DPRINTF("reading batch of %d pages\n", count);
@@ -1289,8 +1296,9 @@ int xc_domain_restore(xc_interface *xch,
 
     // DPRINTF("Buffered checkpoint\n");
 
-    if ( pagebuf_get(xch, ctx, &pagebuf, io_fd, dom) ) {
-        PERROR("error when buffering batch, finishing");
+    if ( (frc = pagebuf_get(xch, ctx, &pagebuf, io_fd, dom)) ) {
+        if ( frc == -1 )
+            PERROR("error when buffering batch, finishing");
         goto finish;
     }
     memset(&tmptail, 0, sizeof(tmptail));


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 16:14 [PATCH] libxc: succeed silently on restore Ian Campbell
@ 2010-09-02 17:01 ` Ian Campbell
  2010-09-02 17:07   ` Ian Jackson
  2010-09-02 18:16   ` Brendan Cully
  2010-09-02 17:14 ` Ian Jackson
  1 sibling, 2 replies; 9+ messages in thread
From: Ian Campbell @ 2010-09-02 17:01 UTC (permalink / raw)
  To: xen-devel, Ian Jackson; +Cc: Brendan Cully

So it turns out that there is a similar issue on migration:
        xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768    0%xc: error: rdexact failed (select returned 0): Internal error
        xc: error: Error when reading batch size (110 = Connection timed out): Internal error
        xc: error: error when buffering batch, finishing (110 = Connection timed out): Internal error

I'm not so sure what can be done about this case: the way
xc_domain_restore is (currently) designed, it relies on the saving end
closing its FD when it is done in order to generate an EOF at the
receiver end, signalling the end of the migration.

The xl migration protocol has a postamble which prevents us from
closing the FD, so instead what happens is that the sender finishes the
save and then sits waiting for the ACK from the receiver, and the
receiver hits the Remus heartbeat timeout, which causes us to continue.
This isn't ideal from the downtime point of view, nor from a general
design POV.

Perhaps we should insert an explicit done marker into the xc save
protocol, which would be appended in the non-checkpoint case? Only the
save end is aware of whether the migration is a checkpoint or not (and
only implicitly, via callbacks->checkpoint <> NULL), but that is OK, I
think.
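
For illustration only, a sketch of how the receive side might recognise
such a marker; the SKETCH_LAST_CHECKPOINT name, its value and the helper
are invented here and are not part of the current save format:

/* Sketch only: a made-up marker value, not an existing save-format flag. */
#define SKETCH_LAST_CHECKPOINT (-1000)

/*
 * Would sit roughly where pagebuf_get_one() has just read the batch-size
 * word: a match means the sender has declared the stream finished, so the
 * receiver can stop waiting for further checkpoints instead of relying on
 * EOF or the heartbeat timeout.
 */
static int sketch_is_last_checkpoint(int count, int *stream_finished)
{
    if ( count == SKETCH_LAST_CHECKPOINT )
    {
        *stream_finished = 1;
        return 1;   /* marker consumed; no page data follows it */
    }
    return 0;       /* an ordinary batch (or some other flag value) */
}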

Ian.


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 17:01 ` Ian Campbell
@ 2010-09-02 17:07   ` Ian Jackson
  2010-09-02 18:26     ` Ian Campbell
  2010-09-02 18:16   ` Brendan Cully
  1 sibling, 1 reply; 9+ messages in thread
From: Ian Jackson @ 2010-09-02 17:07 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Brendan Cully, xen-devel

Ian Campbell writes ("Re: [Xen-devel] [PATCH] libxc: succeed silently on restore"):
> I'm not so sure what can be done about this case, the way
> xc_domain_restore is (currently) designed it relies on the saving end to
> close its FD when it is done in order to generate an EOF at the receiver
> end to signal the end of the migration.

This was introduced in the Remus patches and is IMO not correct.

> The xl migration protocol has a postamble which prevents us from closing
> the FD and so instead what happens is that the sender finishes the save
> and then sits waiting for the ACK from the receiver so the receiver hits
> the remus heartbeat timeout which causes us to continue. This isn't
> ideal from the downtime point of view nor from just a general design
> POV.

The xl migration protocol postamble is needed to try to mitigate the
consequences of network failure, where otherwise it is easy to get
into situations where neither the sender nor the receiver can safely
resume the domain.

> Perhaps we should insert an explicit done marker into the xc save
> protocol which would be appended in the non-checkpoint case? Only the
> save end is aware if the migration is a checkpoint or not (and only
> implicitly via callbacks->checkpoint <> NULL) but that is OK, I think.

There _is_ an explicit done marker: the sender stops sending pages and
sends a register dump.  It's just that remus then wants to continue
anyway.

The solution is that the interface to xc_domain_restore should be
extended so that:
 * Callers specify whether they are expecting a series of checkpoints,
   or just one.
 * When it returns you find out whether the response was "we got
   exactly the one checkpoint you were expecting" or "the network
   connection failed too soon" or "we got some checkpoints and then
   the network connection failed".
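
One illustrative shape for such an interface (a sketch only: the names,
the outcome values and the reduced parameter list are invented here and
are not the actual libxc signature):

/* Sketch only: every name below is invented for illustration. */
enum sketch_restore_result {
    SKETCH_RESTORE_ONE_CHECKPOINT,   /* got exactly the one expected      */
    SKETCH_RESTORE_FAILED_TOO_SOON,  /* link failed before any checkpoint */
    SKETCH_RESTORE_SOME_THEN_FAILED  /* >= 1 checkpoint, then link failed */
};

/*
 * The caller states up front whether it expects a stream of checkpoints
 * (checkpointed_stream != 0) or just one, and learns on return which of
 * the outcomes above actually happened.
 */
int sketch_domain_restore(int io_fd, unsigned int dom,
                          int checkpointed_stream,
                          enum sketch_restore_result *result);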

A related problem is that it is very difficult for the caller to
determine when the replication has been properly set up: ie, to know
when the receiver has got at least one whole checkpoint.

Ian.


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 16:14 [PATCH] libxc: succeed silently on restore Ian Campbell
  2010-09-02 17:01 ` Ian Campbell
@ 2010-09-02 17:14 ` Ian Jackson
  1 sibling, 0 replies; 9+ messages in thread
From: Ian Jackson @ 2010-09-02 17:14 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Brendan Cully, xen-devel

Ian Campbell writes ("[Xen-devel] [PATCH] libxc: succeed silently on restore"):
> We allow for an EOF precisely at the start of a new batch only when
> the restore is already marked as completed. In this case rdexact will
> fail silently and propagate the specific failure upwards so in order
> for higher levels to remain silent as well.

That seems correct to me.  I'll wait for comments from Brendan though.

Ian.


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 17:01 ` Ian Campbell
  2010-09-02 17:07   ` Ian Jackson
@ 2010-09-02 18:16   ` Brendan Cully
  2010-09-02 18:29     ` Ian Campbell
  1 sibling, 1 reply; 9+ messages in thread
From: Brendan Cully @ 2010-09-02 18:16 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson

On Thursday, 02 September 2010 at 18:01, Ian Campbell wrote:
> So it turns out that there is a similar issue on migration:
>         xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768    0%xc: error: rdexact failed (select returned 0): Internal error
>         xc: error: Error when reading batch size (110 = Connection timed out): Internal error
>         xc: error: error when buffering batch, finishing (110 = Connection timed out): Internal error
> 
> I'm not so sure what can be done about this case, the way
> xc_domain_restore is (currently) designed it relies on the saving end to
> close its FD when it is done in order to generate an EOF at the receiver
> end to signal the end of the migration.
> 
> The xl migration protocol has a postamble which prevents us from closing
> the FD and so instead what happens is that the sender finishes the save
> and then sits waiting for the ACK from the receiver so the receiver hits
> the remus heartbeat timeout which causes us to continue. This isn't
> ideal from the downtime point of view nor from just a general design
> POV.
> 
> Perhaps we should insert an explicit done marker into the xc save
> protocol which would be appended in the non-checkpoint case? Only the
> save end is aware if the migration is a checkpoint or not (and only
> implicitly via callbacks->checkpoint <> NULL) but that is OK, I think.

I think this can be done trivially? We can just add another negative
length record at the end of memory copying (like the debug flag, tmem,
hvm extensions, etc) if we're running the new xl migration protocol
and expect restore to exit after receiving the first full
checkpoint. Or, if you're not as worried about preserving the existing
semantics, make the minus flag indicate that callbacks->checkpoint is
not null, and only continue reading past the first complete checkpoint
if you see that minus flag on the receive side.

Isn't that sufficient?
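
To make that concrete, a sketch of the sending side emitting such a
record after the final checkpoint; the flag value and helper are
invented for illustration (matching the receive-side sketch earlier in
the thread) and are not real save-format IDs:

#include <unistd.h>

/* Sketch only: a made-up negative-length flag, not a real save-format ID. */
#define SKETCH_LAST_CHECKPOINT (-1000)

/*
 * Would be emitted by the saving side after the last complete checkpoint,
 * but only when no further checkpoints will follow (i.e. when
 * callbacks->checkpoint is NULL rather than a Remus-style loop).
 */
static int sketch_send_last_checkpoint(int io_fd)
{
    int marker = SKETCH_LAST_CHECKPOINT;

    if ( write(io_fd, &marker, sizeof(marker)) != (ssize_t)sizeof(marker) )
        return -1;
    return 0;
}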


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 17:07   ` Ian Jackson
@ 2010-09-02 18:26     ` Ian Campbell
  2010-09-03 10:30       ` Ian Jackson
  0 siblings, 1 reply; 9+ messages in thread
From: Ian Campbell @ 2010-09-02 18:26 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Brendan Cully, xen-devel

On Thu, 2010-09-02 at 18:07 +0100, Ian Jackson wrote:
> Ian Campbell writes ("Re: [Xen-devel] [PATCH] libxc: succeed silently on restore"):
> > I'm not so sure what can be done about this case, the way
> > xc_domain_restore is (currently) designed it relies on the saving end to
> > close its FD when it is done in order to generate an EOF at the receiver
> > end to signal the end of the migration.
> 
> This was introduced in the Remus patches and is IMO not correct.
> 
> > The xl migration protocol has a postamble which prevents us from closing
> > the FD and so instead what happens is that the sender finishes the save
> > and then sits waiting for the ACK from the receiver so the receiver hits
> > the remus heartbeat timeout which causes us to continue. This isn't
> > ideal from the downtime point of view nor from just a general design
> > POV.
> 
> The xl migration protocol postamble is needed to try to mitigate the
> consequences of network failure, where otherwise it is easy to get
> into situations where neither the sender nor the receiver can safely
> resume the domain.

Yes, I wasn't suggesting getting rid of the postamble, just commenting
on why we can't simply close the sending fd as xc_domain_restore
currently expects.

> > Perhaps we should insert an explicit done marker into the xc save
> > protocol which would be appended in the non-checkpoint case? Only the
> > save end is aware if the migration is a checkpoint or not (and only
> > implicitly via callbacks->checkpoint <> NULL) but that is OK, I think.
> 
> There _is_ an explicit done marker: the sender stops sending pages and
> sends a register dump.  It's just that remus then wants to continue
> anyway.

I was suggesting a second "alldone" marker to be sent after the register
dump and other tail bits when there are no more checkpoints to come.
But...

> The solution is that the interface to xc_domain_restore should be
> extended so that:
>  * Callers specify whether they are expecting a series of checkpoints,
>    or just one.
>  * When it returns you find out whether the response was "we got
>    exactly the one checkpoint you were expecting" or "the network
>    connection failed too soon" or "we got some checkpoints and then
>    the network connection failed".

... I like this idea more. I'll see what I can rustle up.

> A related problem is that it is very difficult for the caller to
> determine when the replication has been properly set up: ie, to know
> when the receiver has got at least one whole checkpoint.

I think this actually does work with the code as it is -- the receiver
will return an error if it doesn't get at least one whole checkpoint,
and will return success and commit to the most recent complete
checkpoint otherwise.

Ian.


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 18:16   ` Brendan Cully
@ 2010-09-02 18:29     ` Ian Campbell
  2010-09-02 18:39       ` Brendan Cully
  0 siblings, 1 reply; 9+ messages in thread
From: Ian Campbell @ 2010-09-02 18:29 UTC (permalink / raw)
  To: Brendan Cully; +Cc: xen-devel, Ian Jackson

On Thu, 2010-09-02 at 19:16 +0100, Brendan Cully wrote:
> On Thursday, 02 September 2010 at 18:01, Ian Campbell wrote:
> > So it turns out that there is a similar issue on migration:
> >         xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768    0%xc: error: rdexact failed (select returned 0): Internal error
> >         xc: error: Error when reading batch size (110 = Connection timed out): Internal error
> >         xc: error: error when buffering batch, finishing (110 = Connection timed out): Internal error
> > 
> > I'm not so sure what can be done about this case, the way
> > xc_domain_restore is (currently) designed it relies on the saving end to
> > close its FD when it is done in order to generate an EOF at the receiver
> > end to signal the end of the migration.
> > 
> > The xl migration protocol has a postamble which prevents us from closing
> > the FD and so instead what happens is that the sender finishes the save
> > and then sits waiting for the ACK from the receiver so the receiver hits
> > the remus heartbeat timeout which causes us to continue. This isn't
> > ideal from the downtime point of view nor from just a general design
> > POV.
> > 
> > Perhaps we should insert an explicit done marker into the xc save
> > protocol which would be appended in the non-checkpoint case? Only the
> > save end is aware if the migration is a checkpoint or not (and only
> > implicitly via callbacks->checkpoint <> NULL) but that is OK, I think.
> 
> I think this can be done trivially? We can just add another negative
> length record at the end of memory copying (like the debug flag, tmem,
> hvm extensions, etc) if we're running the new xl migration protocol
> and expect restore to exit after receiving the first full
> checkpoint. Or, if you're not as worried about preserving the existing
> semantics, make the minus flag indicate that callbacks->checkpoint is
> not null, and only continue reading past the first complete checkpoint
> if you see that minus flag on the receive side.
> 
> Isn't that sufficient?

It would probably work but isn't there a benefit to having the receiver
know that it is partaking in a multiple checkpoint restore and being
told how many iterations there were etc?

Ian.


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 18:29     ` Ian Campbell
@ 2010-09-02 18:39       ` Brendan Cully
  0 siblings, 0 replies; 9+ messages in thread
From: Brendan Cully @ 2010-09-02 18:39 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson

On Thursday, 02 September 2010 at 19:29, Ian Campbell wrote:
> On Thu, 2010-09-02 at 19:16 +0100, Brendan Cully wrote:
> > On Thursday, 02 September 2010 at 18:01, Ian Campbell wrote:
> > > So it turns out that there is a similar issue on migration:
> > >         xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768    0%xc: error: rdexact failed (select returned 0): Internal error
> > >         xc: error: Error when reading batch size (110 = Connection timed out): Internal error
> > >         xc: error: error when buffering batch, finishing (110 = Connection timed out): Internal error
> > > 
> > > I'm not so sure what can be done about this case, the way
> > > xc_domain_restore is (currently) designed it relies on the saving end to
> > > close its FD when it is done in order to generate an EOF at the receiver
> > > end to signal the end of the migration.
> > > 
> > > The xl migration protocol has a postamble which prevents us from closing
> > > the FD and so instead what happens is that the sender finishes the save
> > > and then sits waiting for the ACK from the receiver so the receiver hits
> > > the remus heartbeat timeout which causes us to continue. This isn't
> > > ideal from the downtime point of view nor from just a general design
> > > POV.
> > > 
> > > Perhaps we should insert an explicit done marker into the xc save
> > > protocol which would be appended in the non-checkpoint case? Only the
> > > save end is aware if the migration is a checkpoint or not (and only
> > > implicitly via callbacks->checkpoint <> NULL) but that is OK, I think.
> > 
> > I think this can be done trivially? We can just add another negative
> > length record at the end of memory copying (like the debug flag, tmem,
> > hvm extensions, etc) if we're running the new xl migration protocol
> > and expect restore to exit after receiving the first full
> > checkpoint. Or, if you're not as worried about preserving the existing
> > semantics, make the minus flag indicate that callbacks->checkpoint is
> > not null, and only continue reading past the first complete checkpoint
> > if you see that minus flag on the receive side.
> > 
> > Isn't that sufficient?
> 
> It would probably work but isn't there a benefit to having the receiver
> know that it is partaking in a multiple checkpoint restore and being
> told how many iterations there were etc?

Is there?

The minusflag does tell the receiver that it is participating in a
multiple checkpoint restore (when it receives the flag). I can't
really see a reason why the sender should want to tell the receiver to
expect n checkpoints (as opposed to 1 or continuous). I suppose it
would be nice if the sender could gracefully abort a continual
checkpoint process without causing the receiver to activate the
VM. Yet another minusflag? :)

I have no objection to more aggressive refactoring (the current
protocol and code are gross); I'm just noting that this particular
problem also has an easy fix.


* Re: [PATCH] libxc: succeed silently on restore
  2010-09-02 18:26     ` Ian Campbell
@ 2010-09-03 10:30       ` Ian Jackson
  0 siblings, 0 replies; 9+ messages in thread
From: Ian Jackson @ 2010-09-03 10:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Brendan Cully, xen-devel

Ian Campbell writes ("Re: [Xen-devel] [PATCH] libxc: succeed silently on restore"):
> On Thu, 2010-09-02 at 18:07 +0100, Ian Jackson wrote:
> > A related problem is that it is very difficult for the caller to
> > determine when the replication has been properly set up: ie, to know
> > when the receiver has got at least one whole checkpoint.
> 
> I think this actually does work with the code as it is -- the receive
> will return error if it doesn't get at least one whole checkpoint and
> will return success and commit to the most recent complete checkpoint
> otherwise.

Except that the caller doesn't get either of these indications until
the replication session eventually fails.

Ian.

