* [PATCH RFC 00/20] Add postcopy live migration support
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Hi,

We're a team of three fourth-year undergraduate software engineering students at
the University of Waterloo in Canada.  In late 2015 we posted on the list [1] to
ask for a project to undertake for our program's capstone design project, and
Andrew Cooper pointed us in the direction of the live migration implementation
as an area that could use some attention.  We were particularly interested in
post-copy live migration (as evaluated by [2] and discussed on the list at [3]),
and have been working on an implementation of this on and off since then.

We now have a working implementation of this scheme, and are submitting it for
comment.  The changes are also available as the 'postcopy' branch of the GitHub
repository at [4].

As a brief overview of our approach:
- We introduce a mechanism by which libxl can indicate to the libxc stream
  helper process that the iterative migration precopy loop should be terminated
  and postcopy should begin.
- At this point, we suspend the domain, collect the final set of dirty pfns and
  write these pfns (and _not_ their contents) into the stream.
- At the destination, the xc restore logic registers itself as a pager for the
  migrating domain, 'evicts' all of the pfns indicated by the sender as
  outstanding, and then resumes the domain at the destination.
- As the domain executes, the migration sender continues to push the remaining
  outstanding pages to the receiver in the background.  The receiver
  monitors both the stream for incoming page data and the paging ring event
  channel for page faults triggered by the guest.  Page faults are forwarded on
  the back-channel migration stream to the migration sender, which prioritizes
  these pages for transmission.
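
As a rough illustration of the receiver's event loop described above (all of
the function and field names in this sketch are hypothetical, not the actual
patch code):

    #include <poll.h>

    /* Sketch: multiplex the incoming migration stream and the paging
     * ring's event channel until no pages remain outstanding. */
    static int postcopy_receive_loop(struct xc_sr_context *ctx)
    {
        struct pollfd fds[2] = {
            { .fd = ctx->fd,                .events = POLLIN }, /* page data   */
            { .fd = ctx->restore.evtchn_fd, .events = POLLIN }, /* page faults */
        };

        while ( pages_outstanding(ctx) )
        {
            if ( poll(fds, 2, -1) < 0 )
                return -1;

            if ( fds[0].revents & POLLIN )
                /* Page data arrived from the sender: load it and unpause
                 * any vcpus paused waiting for one of these pages. */
                handle_incoming_page_data(ctx);

            if ( fds[1].revents & POLLIN )
                /* The guest faulted on an unmigrated page: forward its pfn
                 * over the back-channel so the sender transmits it first. */
                forward_fault_to_sender(ctx);
        }

        return 0;
    }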

By leveraging the existing paging API, we are able to implement the postcopy
scheme without any hypervisor modifications - all of our changes are confined to
the userspace toolstack.  However, we inherit from the paging API the
requirement that the domains be HVM and that the host have HAP/EPT support.

We haven't yet had the opportunity to perform a quantitative evaluation of the
performance trade-offs between the traditional pre-copy and our post-copy
strategies, but intend to.  Informally, we've been testing our implementation by
migrating a domain running the x86 memtest program (which is obviously a
tremendously write-heavy workload), and have observed a substantial reduction in
total time required for migration completion (at the expense of a visually
obvious 'slowdown' in the execution of the program).  We've also noticed that,
when performing a postcopy without any leading precopy iterations, the time
required at the destination to 'evict' all of the outstanding pages is
substantial - possibly because there is no batching mechanism by which pages can
be evicted - so this area in particular might require further attention.
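
For concreteness, the eviction step we have in mind is, in spirit, a per-page
loop over the paging interface along the following lines (a sketch only, using
the public xc_mem_paging_*() calls; the wrapper function itself is
hypothetical).  Each page costs a nominate/evict pair, so the total eviction
time grows linearly with the number of outstanding pfns:

    #include <xenctrl.h>

    /* Sketch: evict every pfn the sender reported as unsent, one
     * nominate/evict pair per page. */
    static int evict_outstanding_pfns(xc_interface *xch, uint32_t domid,
                                      const xen_pfn_t *pfns, size_t count)
    {
        size_t i;

        for ( i = 0; i < count; ++i )
        {
            /* Mark the frame as a paging candidate... */
            if ( xc_mem_paging_nominate(xch, domid, pfns[i]) )
                return -1;

            /* ...then drop it; subsequent guest accesses to this pfn
             * raise requests on the paging ring. */
            if ( xc_mem_paging_evict(xch, domid, pfns[i]) )
                return -1;
        }

        return 0;
    }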

We're really interested in any feedback you might have!

Thanks!

Harley Armstrong, Chester Lin, Joshua Otto

[1] https://lists.gt.net/xen/devel/410255
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.184.2368
[3] https://lists.gt.net/xen/devel/261568
[4] https://github.com/jtotto/xen

Joshua Otto (20):
  tools: rename COLO 'postcopy' to 'aftercopy'
  libxc/xc_sr: parameterise write_record() on fd
  libxc/xc_sr_restore.c: use write_record() in
    send_checkpoint_dirty_pfn_list()
  libxc/xc_sr_save.c: add WRITE_TRIVIAL_RECORD_FN()
  libxc/xc_sr: factor out filter_pages()
  libxc/xc_sr: factor helpers out of handle_page_data()
  migration: defer precopy policy to libxl
  libxl/migration: add precopy tuning parameters
  libxc/xc_sr_save: introduce save batch types
  libxc/xc_sr_save.c: initialise rec.data before free()
  libxc/migration: correct hvm record ordering specification
  libxc/migration: specify postcopy live migration
  libxc/migration: add try_read_record()
  libxc/migration: implement the sender side of postcopy live migration
  libxc/migration: implement the receiver side of postcopy live
    migration
  libxl/libxl_stream_write.c: track callback chains with an explicit
    phase
  libxl/libxl_stream_read.c: track callback chains with an explicit
    phase
  libxl/migration: implement the sender side of postcopy live migration
  libxl/migration: implement the receiver side of postcopy live
    migration
  tools: expose postcopy live migration support in libxl and xl

 docs/specs/libxc-migration-stream.pandoc |  184 ++++-
 docs/specs/libxl-migration-stream.pandoc |   19 +-
 tools/libxc/include/xenguest.h           |  170 ++--
 tools/libxc/xc_nomigrate.c               |    3 +-
 tools/libxc/xc_private.c                 |   21 +-
 tools/libxc/xc_private.h                 |    2 +
 tools/libxc/xc_sr_common.c               |  118 ++-
 tools/libxc/xc_sr_common.h               |  152 +++-
 tools/libxc/xc_sr_common_x86.c           |    2 +-
 tools/libxc/xc_sr_restore.c              | 1297 +++++++++++++++++++++++++-----
 tools/libxc/xc_sr_restore_x86_hvm.c      |   38 +-
 tools/libxc/xc_sr_save.c                 |  828 +++++++++++++++----
 tools/libxc/xc_sr_save_x86_hvm.c         |   18 +-
 tools/libxc/xc_sr_save_x86_pv.c          |   17 +-
 tools/libxc/xc_sr_stream_format.h        |   15 +-
 tools/libxc/xg_save_restore.h            |   16 +-
 tools/libxl/libxl.h                      |   44 +-
 tools/libxl/libxl_colo_restore.c         |    2 +-
 tools/libxl/libxl_colo_save.c            |    2 +-
 tools/libxl/libxl_create.c               |  167 +++-
 tools/libxl/libxl_dom_save.c             |   55 +-
 tools/libxl/libxl_domain.c               |   41 +-
 tools/libxl/libxl_internal.h             |   79 +-
 tools/libxl/libxl_remus.c                |    2 +-
 tools/libxl/libxl_save_callout.c         |    3 +-
 tools/libxl/libxl_save_helper.c          |    7 +-
 tools/libxl/libxl_save_msgs_gen.pl       |   10 +-
 tools/libxl/libxl_sr_stream_format.h     |   13 +-
 tools/libxl/libxl_stream_read.c          |  136 +++-
 tools/libxl/libxl_stream_write.c         |  161 ++--
 tools/ocaml/libs/xl/xenlight_stubs.c     |    2 +-
 tools/xl/xl.h                            |    7 +-
 tools/xl/xl_cmdtable.c                   |   25 +-
 tools/xl/xl_migrate.c                    |   85 +-
 tools/xl/xl_vmcontrol.c                  |    8 +-
 35 files changed, 3144 insertions(+), 605 deletions(-)

-- 
2.7.4


* [PATCH RFC 01/20] tools: rename COLO 'postcopy' to 'aftercopy'
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

The COLO xc domain save and restore procedures both make use of a 'postcopy'
callback to defer part of each checkpoint operation to xl.  In this context, the
name 'postcopy' is meant as "the callback invoked immediately after this
checkpoint's memory copy."  This is an unfortunate name collision with the
other common use of 'postcopy' in the context of live migration, where it is
used to mean "a memory migration that permits the guest to execute at the
destination before all of its memory is migrated by servicing accesses to
unmigrated memory via a network page-fault."

Mechanically rename 'postcopy' -> 'aftercopy' to free up the postcopy namespace
while preserving the original intent of the name in the COLO context.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h     | 4 ++--
 tools/libxc/xc_sr_restore.c        | 4 ++--
 tools/libxc/xc_sr_save.c           | 4 ++--
 tools/libxl/libxl_colo_restore.c   | 2 +-
 tools/libxl/libxl_colo_save.c      | 2 +-
 tools/libxl/libxl_remus.c          | 2 +-
 tools/libxl/libxl_save_msgs_gen.pl | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 40902ee..aa8cc8b 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -53,7 +53,7 @@ struct save_callbacks {
      * xc_domain_save then flushes the output buffer, while the
      *  guest continues to run.
      */
-    int (*postcopy)(void* data);
+    int (*aftercopy)(void* data);
 
     /* Called after the memory checkpoint has been flushed
      * out into the network. Typical actions performed in this
@@ -115,7 +115,7 @@ struct restore_callbacks {
      * Callback function resumes the guest & the device model,
      * returns to xc_domain_restore.
      */
-    int (*postcopy)(void* data);
+    int (*aftercopy)(void* data);
 
     /* A checkpoint record has been found in the stream.
      * returns: */
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 3549f0a..ee06b3d 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -576,7 +576,7 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
                                                 ctx->restore.callbacks->data);
 
         /* Resume secondary vm */
-        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
+        ret = ctx->restore.callbacks->aftercopy(ctx->restore.callbacks->data);
         HANDLE_CALLBACK_RETURN_VALUE(ret);
 
         /* Wait for a new checkpoint */
@@ -855,7 +855,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     {
         /* this is COLO restore */
         assert(callbacks->suspend &&
-               callbacks->postcopy &&
+               callbacks->aftercopy &&
                callbacks->wait_checkpoint &&
                callbacks->restore_results);
     }
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index f98c827..fc63a55 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -863,7 +863,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
                 }
             }
 
-            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            rc = ctx->save.callbacks->aftercopy(ctx->save.callbacks->data);
             if ( rc <= 0 )
                 goto err;
 
@@ -951,7 +951,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
     if ( hvm )
         assert(callbacks->switch_qemu_logdirty);
     if ( ctx.save.checkpointed )
-        assert(callbacks->checkpoint && callbacks->postcopy);
+        assert(callbacks->checkpoint && callbacks->aftercopy);
     if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
         assert(callbacks->wait_checkpoint);
 
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index 0c535bd..7d8f9ff 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -246,7 +246,7 @@ void libxl__colo_restore_setup(libxl__egc *egc,
     if (init_dsps(&crcs->dsps))
         goto out;
 
-    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
+    callbacks->aftercopy = libxl__colo_restore_domain_resume_callback;
     callbacks->wait_checkpoint = libxl__colo_restore_domain_wait_checkpoint_callback;
     callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
     callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index f687d5a..5921196 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -145,7 +145,7 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
 
     callbacks->suspend = libxl__colo_save_domain_suspend_callback;
     callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
-    callbacks->postcopy = libxl__colo_save_domain_resume_callback;
+    callbacks->aftercopy = libxl__colo_save_domain_resume_callback;
     callbacks->wait_checkpoint = libxl__colo_save_domain_wait_checkpoint_callback;
 
     libxl__checkpoint_devices_setup(egc, &dss->cds);
diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
index 29a4783..1453365 100644
--- a/tools/libxl/libxl_remus.c
+++ b/tools/libxl/libxl_remus.c
@@ -110,7 +110,7 @@ void libxl__remus_setup(libxl__egc *egc, libxl__remus_state *rs)
     dss->sws.checkpoint_callback = remus_checkpoint_stream_written;
 
     callbacks->suspend = libxl__remus_domain_suspend_callback;
-    callbacks->postcopy = libxl__remus_domain_resume_callback;
+    callbacks->aftercopy = libxl__remus_domain_resume_callback;
     callbacks->checkpoint = libxl__remus_domain_save_checkpoint_callback;
 
     libxl__checkpoint_devices_setup(egc, cds);
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 3ae7373..27845bb 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -24,7 +24,7 @@ our @msgs = (
                                                 'unsigned long', 'done',
                                                 'unsigned long', 'total'] ],
     [  3, 'srcxA',  "suspend", [] ],
-    [  4, 'srcxA',  "postcopy", [] ],
+    [  4, 'srcxA',  "aftercopy", [] ],
     [  5, 'srcxA',  "checkpoint", [] ],
     [  6, 'srcxA',  "wait_checkpoint", [] ],
     [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
-- 
2.7.4


* [PATCH RFC 02/20] libxc/xc_sr: parameterise write_record() on fd
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Right now, write_split_record() - which is delegated to by
write_record() - implicitly writes to ctx->fd.  This means it can't be
used with the restore context's send_back_fd, which is unhandy.

Add an 'fd' parameter to both write_record() and write_split_record(),
and mechanically update all existing callsites to pass ctx->fd for it.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_common.c       |  6 +++---
 tools/libxc/xc_sr_common.h       |  8 ++++----
 tools/libxc/xc_sr_common_x86.c   |  2 +-
 tools/libxc/xc_sr_save.c         |  6 +++---
 tools/libxc/xc_sr_save_x86_hvm.c |  5 +++--
 tools/libxc/xc_sr_save_x86_pv.c  | 17 +++++++++--------
 6 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 48fa676..c1babf6 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -52,8 +52,8 @@ const char *rec_type_to_str(uint32_t type)
     return "Reserved";
 }
 
-int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
-                       void *buf, size_t sz)
+int write_split_record(struct xc_sr_context *ctx, int fd,
+                       struct xc_sr_record *rec, void *buf, size_t sz)
 {
     static const char zeroes[(1u << REC_ALIGN_ORDER) - 1] = { 0 };
 
@@ -81,7 +81,7 @@ int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
     if ( sz )
         assert(buf);
 
-    if ( writev_exact(ctx->fd, parts, ARRAY_SIZE(parts)) )
+    if ( writev_exact(fd, parts, ARRAY_SIZE(parts)) )
         goto err;
 
     return 0;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index a83f22a..2f33ccc 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -361,8 +361,8 @@ struct xc_sr_record
  *
  * Returns 0 on success and non0 on failure.
  */
-int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
-                       void *buf, size_t sz);
+int write_split_record(struct xc_sr_context *ctx, int fd,
+                       struct xc_sr_record *rec, void *buf, size_t sz);
 
 /*
  * Writes a record to the stream, applying correct padding where appropriate.
@@ -371,10 +371,10 @@ int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
  *
  * Returns 0 on success and non0 on failure.
  */
-static inline int write_record(struct xc_sr_context *ctx,
+static inline int write_record(struct xc_sr_context *ctx, int fd,
                                struct xc_sr_record *rec)
 {
-    return write_split_record(ctx, rec, NULL, 0);
+    return write_split_record(ctx, fd, rec, NULL, 0);
 }
 
 /*
diff --git a/tools/libxc/xc_sr_common_x86.c b/tools/libxc/xc_sr_common_x86.c
index 98f1cef..7b3dd50 100644
--- a/tools/libxc/xc_sr_common_x86.c
+++ b/tools/libxc/xc_sr_common_x86.c
@@ -18,7 +18,7 @@ int write_tsc_info(struct xc_sr_context *ctx)
         return -1;
     }
 
-    return write_record(ctx, &rec);
+    return write_record(ctx, ctx->fd, &rec);
 }
 
 int handle_tsc_info(struct xc_sr_context *ctx, struct xc_sr_record *rec)
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index fc63a55..61fc4a4 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -53,7 +53,7 @@ static int write_end_record(struct xc_sr_context *ctx)
 {
     struct xc_sr_record end = { REC_TYPE_END, 0, NULL };
 
-    return write_record(ctx, &end);
+    return write_record(ctx, ctx->fd, &end);
 }
 
 /*
@@ -63,7 +63,7 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 {
     struct xc_sr_record checkpoint = { REC_TYPE_CHECKPOINT, 0, NULL };
 
-    return write_record(ctx, &checkpoint);
+    return write_record(ctx, ctx->fd, &checkpoint);
 }
 
 /*
@@ -646,7 +646,7 @@ static int verify_frames(struct xc_sr_context *ctx)
 
     DPRINTF("Enabling verify mode");
 
-    rc = write_record(ctx, &rec);
+    rc = write_record(ctx, ctx->fd, &rec);
     if ( rc )
         goto out;
 
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
index e485928..ea4b780 100644
--- a/tools/libxc/xc_sr_save_x86_hvm.c
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -42,7 +42,7 @@ static int write_hvm_context(struct xc_sr_context *ctx)
     }
 
     hvm_rec.length = hvm_buf_size;
-    rc = write_record(ctx, &hvm_rec);
+    rc = write_record(ctx, ctx->fd, &hvm_rec);
     if ( rc < 0 )
     {
         PERROR("error write HVM_CONTEXT record");
@@ -112,7 +112,8 @@ static int write_hvm_params(struct xc_sr_context *ctx)
         }
     }
 
-    rc = write_split_record(ctx, &rec, entries, hdr.count * sizeof(*entries));
+    rc = write_split_record(ctx, ctx->fd, &rec, entries,
+                            hdr.count * sizeof(*entries));
     if ( rc )
         PERROR("Failed to write HVM_PARAMS record");
 
diff --git a/tools/libxc/xc_sr_save_x86_pv.c b/tools/libxc/xc_sr_save_x86_pv.c
index f218d17..2b2c050 100644
--- a/tools/libxc/xc_sr_save_x86_pv.c
+++ b/tools/libxc/xc_sr_save_x86_pv.c
@@ -571,9 +571,9 @@ static int write_one_vcpu_basic(struct xc_sr_context *ctx, uint32_t id)
     }
 
     if ( ctx->x86_pv.width == 8 )
-        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x64));
+        rc = write_split_record(ctx, ctx->fd, &rec, &vcpu, sizeof(vcpu.x64));
     else
-        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x32));
+        rc = write_split_record(ctx, ctx->fd, &rec, &vcpu, sizeof(vcpu.x32));
 
  err:
     return rc;
@@ -609,7 +609,7 @@ static int write_one_vcpu_extended(struct xc_sr_context *ctx, uint32_t id)
         return -1;
     }
 
-    return write_split_record(ctx, &rec, &domctl.u.ext_vcpucontext,
+    return write_split_record(ctx, ctx->fd, &rec, &domctl.u.ext_vcpucontext,
                               domctl.u.ext_vcpucontext.size);
 }
 
@@ -664,7 +664,8 @@ static int write_one_vcpu_xsave(struct xc_sr_context *ctx, uint32_t id)
         goto err;
     }
 
-    rc = write_split_record(ctx, &rec, buffer, domctl.u.vcpuextstate.size);
+    rc = write_split_record(ctx, ctx->fd, &rec, buffer,
+                            domctl.u.vcpuextstate.size);
     if ( rc )
         goto err;
 
@@ -730,7 +731,7 @@ static int write_one_vcpu_msrs(struct xc_sr_context *ctx, uint32_t id)
         goto err;
     }
 
-    rc = write_split_record(ctx, &rec, buffer,
+    rc = write_split_record(ctx, ctx->fd, &rec, buffer,
                             domctl.u.vcpu_msrs.msr_count *
                             sizeof(xen_domctl_vcpu_msr_t));
     if ( rc )
@@ -805,7 +806,7 @@ static int write_x86_pv_info(struct xc_sr_context *ctx)
             .data = &info
         };
 
-    return write_record(ctx, &rec);
+    return write_record(ctx, ctx->fd, &rec);
 }
 
 /*
@@ -846,7 +847,7 @@ static int write_x86_pv_p2m_frames(struct xc_sr_context *ctx)
     else
         data = (uint64_t *)ctx->x86_pv.p2m_pfns;
 
-    rc = write_split_record(ctx, &rec, data, datasz);
+    rc = write_split_record(ctx, ctx->fd, &rec, data, datasz);
 
     if ( data != (uint64_t *)ctx->x86_pv.p2m_pfns )
         free(data);
@@ -866,7 +867,7 @@ static int write_shared_info(struct xc_sr_context *ctx)
         .data = ctx->x86_pv.shinfo,
     };
 
-    return write_record(ctx, &rec);
+    return write_record(ctx, ctx->fd, &rec);
 }
 
 /*
-- 
2.7.4


* [PATCH RFC 03/20] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list()
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Teach send_checkpoint_dirty_pfn_list() to use write_record()'s new fd
parameter, avoiding the need for a manual writev().

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_restore.c | 27 ++++-----------------------
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index ee06b3d..481a904 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -420,7 +420,6 @@ static int send_checkpoint_dirty_pfn_list(struct xc_sr_context *ctx)
     int rc = -1;
     unsigned count, written;
     uint64_t i, *pfns = NULL;
-    struct iovec *iov = NULL;
     xc_shadow_op_stats_t stats = { 0, ctx->restore.p2m_size };
     struct xc_sr_record rec =
     {
@@ -467,35 +466,17 @@ static int send_checkpoint_dirty_pfn_list(struct xc_sr_context *ctx)
         pfns[written++] = i;
     }
 
-    /* iovec[] for writev(). */
-    iov = malloc(3 * sizeof(*iov));
-    if ( !iov )
-    {
-        ERROR("Unable to allocate memory for sending dirty bitmap");
-        goto err;
-    }
-
+    rec.data = pfns;
     rec.length = count * sizeof(*pfns);
 
-    iov[0].iov_base = &rec.type;
-    iov[0].iov_len = sizeof(rec.type);
-
-    iov[1].iov_base = &rec.length;
-    iov[1].iov_len = sizeof(rec.length);
-
-    iov[2].iov_base = pfns;
-    iov[2].iov_len = count * sizeof(*pfns);
-
-    if ( writev_exact(ctx->restore.send_back_fd, iov, 3) )
-    {
-        PERROR("Failed to write dirty bitmap to stream");
+    rc = write_record(ctx, ctx->restore.send_back_fd, &rec);
+    if ( rc )
         goto err;
-    }
 
     rc = 0;
+
  err:
     free(pfns);
-    free(iov);
     return rc;
 }
 
-- 
2.7.4


* [PATCH RFC 04/20] libxc/xc_sr_save.c: add WRITE_TRIVIAL_RECORD_FN()
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Writing the libxc save stream requires writing a few 'trivial' records,
consisting only of a header with a particular type.  As a readability
aid, it's nice to have obviously-named functions that write these sorts
of records into the stream - for example, the first such function was
write_end_record(), which reads much more pleasantly at its call-site
than write_generic_record(REC_TYPE_END) would.  However, it's tedious
and error-prone to copy-paste the generic body of such a function for
each new trivial record type.

Add a helper macro that takes a name base and a record type and defines
the corresponding trivial record write function.  Use this to re-define
the two existing trivial record functions, write_end_record() and
write_checkpoint_record().
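
(For example, WRITE_TRIVIAL_RECORD_FN(end, REC_TYPE_END) expands to a
write_end_record() whose body is identical to the function it replaces: it
builds a record { REC_TYPE_END, 0, NULL } and hands it to write_record().)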

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_save.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 61fc4a4..86f6903 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -47,24 +47,18 @@ static int write_headers(struct xc_sr_context *ctx, uint16_t guest_type)
 }
 
 /*
- * Writes an END record into the stream.
+ * Declares a helper function to write an empty record of a particular type.
  */
-static int write_end_record(struct xc_sr_context *ctx)
-{
-    struct xc_sr_record end = { REC_TYPE_END, 0, NULL };
-
-    return write_record(ctx, ctx->fd, &end);
-}
-
-/*
- * Writes a CHECKPOINT record into the stream.
- */
-static int write_checkpoint_record(struct xc_sr_context *ctx)
-{
-    struct xc_sr_record checkpoint = { REC_TYPE_CHECKPOINT, 0, NULL };
+#define WRITE_TRIVIAL_RECORD_FN(name, type)                         \
+    static int write_ ## name ## _record(struct xc_sr_context *ctx) \
+    {                                                               \
+        struct xc_sr_record name = { (type), 0, NULL };             \
+                                                                    \
+        return write_record(ctx, ctx->fd, &name);                   \
+    }
 
-    return write_record(ctx, ctx->fd, &checkpoint);
-}
+WRITE_TRIVIAL_RECORD_FN(end,                 REC_TYPE_END);
+WRITE_TRIVIAL_RECORD_FN(checkpoint,          REC_TYPE_CHECKPOINT);
 
 /*
  * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
-- 
2.7.4


* [PATCH RFC 05/20] libxc/xc_sr: factor out filter_pages()
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

When processing a PAGE_DATA record, the restore side needs to set the
types of incoming pages using the appropriate restore op and filter the
list of pfns in the record to the subset that are 'backed' - i.e.
accompanied by real backing data in the stream that needs to be filled
in.

Both of these steps are also required when processing postcopy records,
so factor them out into common helpers.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_restore.c | 100 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 75 insertions(+), 25 deletions(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 481a904..8574ee8 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -194,6 +194,68 @@ int populate_pfns(struct xc_sr_context *ctx, unsigned count,
     return rc;
 }
 
+static void set_page_types(struct xc_sr_context *ctx, unsigned count,
+                           xen_pfn_t *pfns, uint32_t *types)
+{
+    unsigned i;
+
+    for ( i = 0; i < count; ++i )
+        ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
+}
+
+/*
+ * Given count pfns and their types, allocate and fill in buffer bpfns with only
+ * those pfns that are 'backed' by real page data that needs to be migrated.
+ * The caller must later free() *bpfns.
+ *
+ * Returns 0 on success and non-0 on failure.  *bpfns can be free()ed even after
+ * failure.
+ */
+static int filter_pages(struct xc_sr_context *ctx,
+                        unsigned count,
+                        xen_pfn_t *pfns,
+                        uint32_t *types,
+                        /* OUT */ unsigned *nr_pages,
+                        /* OUT */ xen_pfn_t **bpfns)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned i;
+
+    *nr_pages = 0;
+    *bpfns = malloc(count * sizeof(**bpfns));
+    if ( !(*bpfns) )
+    {
+        ERROR("Failed to allocate %zu bytes to process page data",
+              count * sizeof(**bpfns));
+        return -1;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_NOTAB:
+
+        case XEN_DOMCTL_PFINFO_L1TAB:
+        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+            (*bpfns)[(*nr_pages)++] = pfns[i];
+            break;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * Given a list of pfns, their types, and a block of page data from the
  * stream, populate and record their types, map the relevant subset and copy
@@ -203,7 +265,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
                              xen_pfn_t *pfns, uint32_t *types, void *page_data)
 {
     xc_interface *xch = ctx->xch;
-    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
+    xen_pfn_t *mfns = NULL;
     int *map_errs = malloc(count * sizeof(*map_errs));
     int rc;
     void *mapping = NULL, *guest_page = NULL;
@@ -211,11 +273,11 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
         j,         /* j indexes the subset of pfns we decide to map. */
         nr_pages = 0;
 
-    if ( !mfns || !map_errs )
+    if ( !map_errs )
     {
         rc = -1;
         ERROR("Failed to allocate %zu bytes to process page data",
-              count * (sizeof(*mfns) + sizeof(*map_errs)));
+              count * sizeof(*map_errs));
         goto err;
     }
 
@@ -226,31 +288,19 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
         goto err;
     }
 
-    for ( i = 0; i < count; ++i )
-    {
-        ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
-
-        switch ( types[i] )
-        {
-        case XEN_DOMCTL_PFINFO_NOTAB:
-
-        case XEN_DOMCTL_PFINFO_L1TAB:
-        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
-        case XEN_DOMCTL_PFINFO_L2TAB:
-        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+    set_page_types(ctx, count, pfns, types);
 
-        case XEN_DOMCTL_PFINFO_L3TAB:
-        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
-        case XEN_DOMCTL_PFINFO_L4TAB:
-        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
-            mfns[nr_pages++] = ctx->restore.ops.pfn_to_gfn(ctx, pfns[i]);
-            break;
-        }
+    rc = filter_pages(ctx, count, pfns, types, &nr_pages, &mfns);
+    if ( rc )
+    {
+        ERROR("Failed to filter mfns for batch of %u pages", count);
+        goto err;
     }
 
+    /* Map physically-backed pfns ('bpfns') to their gmfns. */
+    for ( i = 0; i < nr_pages; ++i )
+        mfns[i] = ctx->restore.ops.pfn_to_gfn(ctx, mfns[i]);
+
     /* Nothing to do? */
     if ( nr_pages == 0 )
         goto done;
-- 
2.7.4


* [PATCH RFC 06/20] libxc/xc_sr: factor helpers out of handle_page_data()
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

When processing a PAGE_DATA record, the restore code:
1) applies a number of sanity checks on the record's headers and size
2) decodes the list of packed page info into pfns and their types
3) using the pfn and type info, populates and fills the pages into the
   guest using process_page_data()

Steps 1) and 2) are also useful at various other stages of postcopy live
migrations, so factor them into reusable helper routines.
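
(Hypothetically, a later postcopy record handler could then be structured as:

    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_PAGE_DATA);
    if ( rc )
        goto err;

    rc = decode_pages_record(ctx, rec->data, &pfns, &types, &pages_of_data);
    if ( rc )
        goto err;

where REC_TYPE_POSTCOPY_PAGE_DATA stands in for whichever record type the
later postcopy patches define.)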

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_common.c        | 38 +++++++++++++++-
 tools/libxc/xc_sr_common.h        | 10 +++++
 tools/libxc/xc_sr_restore.c       | 94 ++++++++++++++++++++++++---------------
 tools/libxc/xc_sr_save.c          |  2 +-
 tools/libxc/xc_sr_stream_format.h |  6 +--
 5 files changed, 109 insertions(+), 41 deletions(-)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index c1babf6..f443974 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -140,13 +140,49 @@ int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec)
     return 0;
 };
 
+int validate_pages_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                          uint32_t expected_type)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+
+    if ( rec->type != expected_type )
+    {
+        ERROR("%s record type expected, instead received record of type "
+              "%08x (%s)", rec_type_to_str(expected_type), rec->type,
+              rec_type_to_str(rec->type));
+        return -1;
+    }
+    else if ( rec->length < sizeof(*pages) )
+    {
+        ERROR("%s record truncated: length %u, min %zu",
+              rec_type_to_str(rec->type), rec->length, sizeof(*pages));
+        return -1;
+    }
+    else if ( pages->count < 1 )
+    {
+        ERROR("Expected at least 1 pfn in %s record",
+              rec_type_to_str(rec->type));
+        return -1;
+    }
+    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
+    {
+        ERROR("%s record (length %u) too short to contain %u"
+              " pfns worth of information", rec_type_to_str(rec->type),
+              rec->length, pages->count);
+        return -1;
+    }
+
+    return 0;
+}
+
 static void __attribute__((unused)) build_assertions(void)
 {
     BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
     BUILD_BUG_ON(sizeof(struct xc_sr_dhdr) != 16);
     BUILD_BUG_ON(sizeof(struct xc_sr_rhdr) != 8);
 
-    BUILD_BUG_ON(sizeof(struct xc_sr_rec_page_data_header)  != 8);
+    BUILD_BUG_ON(sizeof(struct xc_sr_rec_pages_header)      != 8);
     BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_info)       != 8);
     BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_p2m_frames) != 8);
     BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_vcpu_hdr)   != 8);
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 2f33ccc..b1aa88e 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -392,6 +392,16 @@ static inline int write_record(struct xc_sr_context *ctx, int fd,
 int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec);
 
 /*
+ * Given a record of one of the page data types, validate it by:
+ * - checking its actual type against its specific expected type
+ * - sanity checking its actual length against its claimed length
+ *
+ * Returns 0 on success and non-0 on failure.
+ */
+int validate_pages_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                          uint32_t expected_type);
+
+/*
  * This would ideally be private in restore.c, but is needed by
  * x86_pv_localise_page() if we receive pagetables frames ahead of the
  * contents of the frames they point at.
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 8574ee8..4e3c472 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -376,39 +376,25 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
 }
 
 /*
- * Validate a PAGE_DATA record from the stream, and pass the results to
- * process_page_data() to actually perform the legwork.
+ * Given a PAGE_DATA record, decode each packed entry into its encoded pfn and
+ * type, storing the results in newly-allocated pfns and types buffers that the
+ * caller must later free().  *pfns and *types may safely be free()ed even after
+ * failure.
  */
-static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+static int decode_pages_record(struct xc_sr_context *ctx,
+                               struct xc_sr_rec_pages_header *pages,
+                               /* OUT */ xen_pfn_t **pfns,
+                               /* OUT */ uint32_t **types,
+                               /* OUT */ unsigned *pages_of_data)
 {
     xc_interface *xch = ctx->xch;
-    struct xc_sr_rec_page_data_header *pages = rec->data;
-    unsigned i, pages_of_data = 0;
-    int rc = -1;
-
-    xen_pfn_t *pfns = NULL, pfn;
-    uint32_t *types = NULL, type;
-
-    if ( rec->length < sizeof(*pages) )
-    {
-        ERROR("PAGE_DATA record truncated: length %u, min %zu",
-              rec->length, sizeof(*pages));
-        goto err;
-    }
-    else if ( pages->count < 1 )
-    {
-        ERROR("Expected at least 1 pfn in PAGE_DATA record");
-        goto err;
-    }
-    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
-    {
-        ERROR("PAGE_DATA record (length %u) too short to contain %u"
-              " pfns worth of information", rec->length, pages->count);
-        goto err;
-    }
+    unsigned i;
+    xen_pfn_t pfn;
+    uint32_t type;
 
-    pfns = malloc(pages->count * sizeof(*pfns));
-    types = malloc(pages->count * sizeof(*types));
+    *pfns = malloc(pages->count * sizeof(**pfns));
+    *types = malloc(pages->count * sizeof(**types));
+    *pages_of_data = 0;
-    if ( !pfns || !types )
+    if ( !*pfns || !*types )
     {
         ERROR("Unable to allocate enough memory for %u pfns",
@@ -418,14 +404,14 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
 
     for ( i = 0; i < pages->count; ++i )
     {
-        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
+        pfn = pages->pfn[i] & REC_PFINFO_PFN_MASK;
         if ( !ctx->restore.ops.pfn_is_valid(ctx, pfn) )
         {
             ERROR("pfn %#"PRIpfn" (index %u) outside domain maximum", pfn, i);
             goto err;
         }
 
-        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
+        type = (pages->pfn[i] & REC_PFINFO_TYPE_MASK) >> 32;
         if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
              ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
         {
@@ -434,14 +420,50 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
             goto err;
         }
         else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
-            /* NOTAB and all L1 through L4 tables (including pinned) should
-             * have a page worth of data in the record. */
-            pages_of_data++;
+            /* NOTAB and all L1 through L4 tables (including pinned) require the
+             * migration of a page of real data. */
+            (*pages_of_data)++;
 
-        pfns[i] = pfn;
-        types[i] = type;
+        (*pfns)[i] = pfn;
+        (*types)[i] = type;
     }
 
+    return 0;
+
+ err:
+    free(*pfns);
+    *pfns = NULL;
+
+    free(*types);
+    *types = NULL;
+
+    *pages_of_data = 0;
+
+    return -1;
+}
+
+/*
+ * Validate a PAGE_DATA record from the stream, and pass the results to
+ * process_page_data() to actually perform the legwork.
+ */
+static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+    unsigned pages_of_data;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL;
+    uint32_t *types = NULL;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_PAGE_DATA);
+    if ( rc )
+        goto err;
+
+    rc = decode_pages_record(ctx, pages, &pfns, &types, &pages_of_data);
+    if ( rc )
+        goto err;
+
     if ( rec->length != (sizeof(*pages) +
                          (sizeof(uint64_t) * pages->count) +
                          (PAGE_SIZE * pages_of_data)) )
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 86f6903..797aec5 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -83,7 +83,7 @@ static int write_batch(struct xc_sr_context *ctx)
     void *page, *orig_page;
     uint64_t *rec_pfns = NULL;
     struct iovec *iov = NULL; int iovcnt = 0;
-    struct xc_sr_rec_page_data_header hdr = { 0 };
+    struct xc_sr_rec_pages_header hdr = { 0 };
     struct xc_sr_record rec =
     {
         .type = REC_TYPE_PAGE_DATA,
diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
index 3291b25..32400b2 100644
--- a/tools/libxc/xc_sr_stream_format.h
+++ b/tools/libxc/xc_sr_stream_format.h
@@ -80,15 +80,15 @@ struct xc_sr_rhdr
 #define REC_TYPE_OPTIONAL             0x80000000U
 
 /* PAGE_DATA */
-struct xc_sr_rec_page_data_header
+struct xc_sr_rec_pages_header
 {
     uint32_t count;
     uint32_t _res1;
     uint64_t pfn[0];
 };
 
-#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
-#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
+#define REC_PFINFO_PFN_MASK  0x000fffffffffffffULL
+#define REC_PFINFO_TYPE_MASK 0xf000000000000000ULL
 
 /* X86_PV_INFO */
 struct xc_sr_rec_x86_pv_info
-- 
2.7.4


* [PATCH RFC 07/20] migration: defer precopy policy to libxl
From: Joshua Otto @ 2017-03-27  9:06 UTC
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

The precopy phase of the xc_domain_save() live migration algorithm has
historically been implemented to run until either a) (almost) no pages
are dirty or b) some fixed, hard-coded maximum number of precopy
iterations has been exceeded.  This policy and its implementation are
less than ideal for a few reasons:
- the logic of the policy is intertwined with the control flow of the
  mechanism of the precopy stage
- it can't take into account facts external to the immediate
  migration context, such as interactive user input or the passage of
  wall-clock time
- it does not permit the user to change their mind, over time, about
  what to do at the end of the precopy (they get an unconditional
  transition into the stop-and-copy phase of the migration)

To permit users to implement arbitrary higher-level policies governing
when the live migration precopy phase should end, and what should be
done next:
- add a precopy_policy() callback to the xc_domain_save() user-supplied
  callbacks
- during the precopy phase of live migrations, consult this policy after
  each batch of pages transmitted and take the dictated action, which
  may be to a) abort the migration entirely, b) continue with the
  precopy, or c) proceed to the stop-and-copy phase.
- provide an implementation of the old policy as such a callback in
  libxl and plumb it through the IPC machinery to libxc, effectively
  maintaining the old policy for now (a sketch of such a callback follows
  below)
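
To illustrate the shape of such a callback - as a hypothetical sketch only,
using the constants (5 iterations, 50 dirty pages) that the old libxc code
hard-coded, rather than the exact libxl implementation:

    static int old_style_precopy_policy(struct precopy_stats stats, void *user)
    {
        /* Proceed to stop-and-copy once the guest is nearly clean... */
        if ( stats.dirty_count >= 0 && stats.dirty_count <= 50 )
            return XGS_POLICY_STOP_AND_COPY;

        /* ...or unconditionally once the iteration cap is reached. */
        if ( stats.iteration >= 5 )
            return XGS_POLICY_STOP_AND_COPY;

        return XGS_POLICY_CONTINUE_PRECOPY;
    }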

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h     |  23 ++++-
 tools/libxc/xc_nomigrate.c         |   3 +-
 tools/libxc/xc_sr_common.h         |   7 +-
 tools/libxc/xc_sr_save.c           | 194 ++++++++++++++++++++++++++-----------
 tools/libxl/libxl_dom_save.c       |  20 ++++
 tools/libxl/libxl_save_callout.c   |   3 +-
 tools/libxl/libxl_save_helper.c    |   7 +-
 tools/libxl/libxl_save_msgs_gen.pl |   4 +-
 8 files changed, 189 insertions(+), 72 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index aa8cc8b..30ffb6f 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -39,6 +39,14 @@
  */
 struct xenevtchn_handle;
 
+/* For save's precopy_policy(). */
+struct precopy_stats
+{
+    unsigned iteration;
+    unsigned total_written;
+    long dirty_count; /* -1 if unknown */
+};
+
 /* callbacks provided by xc_domain_save */
 struct save_callbacks {
     /* Called after expiration of checkpoint interval,
@@ -46,6 +54,17 @@ struct save_callbacks {
      */
     int (*suspend)(void* data);
 
+    /* Called after every batch of page data sent during the precopy phase of a
+     * live migration to ask the caller what to do next based on the current
+     * state of the precopy migration.
+     */
+#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
+                                        * tidy up. */
+#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
+#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
+                                        * remaining dirty pages. */
+    int (*precopy_policy)(struct precopy_stats stats, void *data);
+
     /* Called after the guest's dirty pages have been
      *  copied into an output buffer.
      * Callback function resumes the guest & the device model,
@@ -100,8 +119,8 @@ typedef enum {
  *        doesn't use checkpointing
  * @return 0 on success, -1 on failure
  */
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
-                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
+int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
+                   uint32_t flags /* XCFLAGS_xxx */,
                    struct save_callbacks* callbacks, int hvm,
                    xc_migration_stream_t stream_type, int recv_fd);
 
diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
index 15c838f..2af64e4 100644
--- a/tools/libxc/xc_nomigrate.c
+++ b/tools/libxc/xc_nomigrate.c
@@ -20,8 +20,7 @@
 #include <xenctrl.h>
 #include <xenguest.h>
 
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
-                   uint32_t max_factor, uint32_t flags,
+int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
                    struct save_callbacks* callbacks, int hvm,
                    xc_migration_stream_t stream_type, int recv_fd)
 {
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index b1aa88e..a9160bd 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -198,12 +198,11 @@ struct xc_sr_context
             /* Further debugging information in the stream. */
             bool debug;
 
-            /* Parameters for tweaking live migration. */
-            unsigned max_iterations;
-            unsigned dirty_threshold;
-
             unsigned long p2m_size;
 
+            struct precopy_stats stats;
+            int policy_decision;
+
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 797aec5..eb95334 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx)
 }
 
 /*
+ * Test if the batch is full.
+ */
+static bool batch_full(struct xc_sr_context *ctx)
+{
+    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
+}
+
+/*
+ * Test if the batch is empty.
+ */
+static bool batch_empty(struct xc_sr_context *ctx)
+{
+    return ctx->save.nr_batch_pfns == 0;
+}
+
+/*
  * Flush a batch of pfns into the stream.
  */
 static int flush_batch(struct xc_sr_context *ctx)
 {
     int rc = 0;
 
-    if ( ctx->save.nr_batch_pfns == 0 )
+    if ( batch_empty(ctx) )
         return rc;
 
     rc = write_batch(ctx);
@@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx)
 }
 
 /*
- * Add a single pfn to the batch, flushing the batch if full.
+ * Add a single pfn to the batch.
  */
-static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
+static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
-    int rc = 0;
-
-    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
-        rc = flush_batch(ctx);
-
-    if ( rc == 0 )
-        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
-
-    return rc;
+    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
+    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
 }
 
 /*
@@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx)
  * Send a subset of pages in the guests p2m, according to the dirty bitmap.
  * Used for each subsequent iteration of the live migration loop.
  *
+ * During the precopy stage of a live migration, test the user-supplied
+ * policy function after each batch of pages and cut off the operation
+ * early if indicated.  Unless aborting, the dirty pages remaining in this round
+ * are transferred into the deferred_pages bitmap.
+ *
  * Bitmap is bounded by p2m_size.
  */
 static int send_dirty_pages(struct xc_sr_context *ctx,
-                            unsigned long entries)
+                            unsigned long entries, bool precopy)
 {
     xc_interface *xch = ctx->xch;
     xen_pfn_t p;
@@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
-    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
+    int (*precopy_policy)(struct precopy_stats, void *) =
+        ctx->save.callbacks->precopy_policy;
+    void *data = ctx->save.callbacks->data;
+
+    assert(batch_empty(ctx));
+    for ( p = 0, written = 0; p < ctx->save.p2m_size; )
     {
-        if ( !test_bit(p, dirty_bitmap) )
-            continue;
+        if ( ctx->save.live && precopy )
+        {
+            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
+            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
+            {
+                return -1;
+            }
+            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
+            {
+                /* Any outstanding dirty pages are now deferred until the next
+                 * phase of the migration. */
+                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
+                          ctx->save.p2m_size);
+                if ( entries > written )
+                    ctx->save.nr_deferred_pages += entries - written;
+
+                goto done;
+            }
+        }
 
-        rc = add_to_batch(ctx, p);
+        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
+        {
+            if ( test_and_clear_bit(p, dirty_bitmap) )
+            {
+                add_to_batch(ctx, p);
+                ++written;
+                ++ctx->save.stats.total_written;
+            }
+        }
+
+        rc = flush_batch(ctx);
         if ( rc )
             return rc;
 
-        /* Update progress every 4MB worth of memory sent. */
-        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
-            xc_report_progress_step(xch, written, entries);
-
-        ++written;
+        /* Update progress after every batch (4MB) worth of memory sent. */
+        xc_report_progress_step(xch, written, entries);
     }
 
-    rc = flush_batch(ctx);
-    if ( rc )
-        return rc;
-
     if ( written > entries )
         DPRINTF("Bitmap contained more entries than expected...");
 
     xc_report_progress_step(xch, entries, entries);
 
+ done:
     return ctx->save.ops.check_vm_state(ctx);
 }
 
@@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
  * Send all pages in the guests p2m.  Used as the first iteration of the live
  * migration loop, and for a non-live save.
  */
-static int send_all_pages(struct xc_sr_context *ctx)
+static int send_all_pages(struct xc_sr_context *ctx, bool precopy)
 {
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
     bitmap_set(dirty_bitmap, ctx->save.p2m_size);
 
-    return send_dirty_pages(ctx, ctx->save.p2m_size);
+    return send_dirty_pages(ctx, ctx->save.p2m_size, precopy);
 }
 
 static int enable_logdirty(struct xc_sr_context *ctx)
@@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
     xc_interface *xch = ctx->xch;
     char *new_str = NULL;
 
-    if ( asprintf(&new_str, "Frames iteration %u of %u",
-                  iter, ctx->save.max_iterations) == -1 )
+    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
     {
         PERROR("Unable to allocate new progress string");
         return -1;
@@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx)
     xc_interface *xch = ctx->xch;
     xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
     char *progress_str = NULL;
-    unsigned x;
     int rc;
 
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    int (*precopy_policy)(struct precopy_stats, void *) =
+        ctx->save.callbacks->precopy_policy;
+    void *data = ctx->save.callbacks->data;
+
     rc = update_progress_string(ctx, &progress_str, 0);
     if ( rc )
         goto out;
 
-    rc = send_all_pages(ctx);
+#define CONSULT_POLICY                                                        \
+    do {                                                                      \
+        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )                  \
+        {                                                                     \
+            rc = -1;                                                          \
+            goto out;                                                         \
+        }                                                                     \
+        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )  \
+        {                                                                     \
+            rc = 0;                                                           \
+            goto out;                                                         \
+        }                                                                     \
+    } while (0)
+
+    ctx->save.stats = (struct precopy_stats)
+        {
+            .iteration     = 0,
+            .total_written = 0,
+            .dirty_count   = -1
+        };
+    rc = send_all_pages(ctx, /* precopy */ true);
     if ( rc )
         goto out;
 
-    for ( x = 1;
-          ((x < ctx->save.max_iterations) &&
-           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
+    /* send_all_pages() has updated the stats */
+    CONSULT_POLICY;
+
+    for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration )
     {
         if ( xc_shadow_control(
                  xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
@@ -493,18 +559,42 @@ static int send_memory_live(struct xc_sr_context *ctx)
             goto out;
         }
 
-        if ( stats.dirty_count == 0 )
-            break;
+        /* Check the new dirty_count against the policy. */
+        ctx->save.stats.dirty_count = stats.dirty_count;
+        ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
+        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
+        {
+            rc = -1;
+            goto out;
+        }
+        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
+        {
+            bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
+                      ctx->save.p2m_size);
+            ctx->save.nr_deferred_pages += stats.dirty_count;
+            rc = 0;
+            goto out;
+        }
+
+        /* After this point we won't know how many pages are really dirty until
+         * the next iteration. */
+        ctx->save.stats.dirty_count = -1;
 
-        rc = update_progress_string(ctx, &progress_str, x);
+        rc = update_progress_string(ctx, &progress_str,
+                                    ctx->save.stats.iteration);
         if ( rc )
             goto out;
 
-        rc = send_dirty_pages(ctx, stats.dirty_count);
+        rc = send_dirty_pages(ctx, stats.dirty_count, /* precopy */ true);
         if ( rc )
             goto out;
+
+        /* send_dirty_pages() has updated the stats */
+        CONSULT_POLICY;
     }
 
+#undef CONSULT_POLICY
+
  out:
     xc_set_progress_prefix(xch, NULL);
     free(progress_str);
@@ -595,7 +685,7 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
     if ( ctx->save.live )
     {
         rc = update_progress_string(ctx, &progress_str,
-                                    ctx->save.max_iterations);
+                                    ctx->save.stats.iteration);
         if ( rc )
             goto out;
     }
@@ -614,7 +704,8 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         }
     }
 
-    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
+    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages,
+                          /* precopy */ false);
     if ( rc )
         goto out;
 
@@ -645,7 +736,7 @@ static int verify_frames(struct xc_sr_context *ctx)
         goto out;
 
     xc_set_progress_prefix(xch, "Frames verify");
-    rc = send_all_pages(ctx);
+    rc = send_all_pages(ctx, /* precopy */ false);
     if ( rc )
         goto out;
 
@@ -719,7 +810,7 @@ static int send_domain_memory_nonlive(struct xc_sr_context *ctx)
 
     xc_set_progress_prefix(xch, "Frames");
 
-    rc = send_all_pages(ctx);
+    rc = send_all_pages(ctx, /* precopy */ false);
     if ( rc )
         goto err;
 
@@ -910,8 +1001,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
 };
 
 int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
-                   uint32_t max_iters, uint32_t max_factor, uint32_t flags,
-                   struct save_callbacks* callbacks, int hvm,
+                   uint32_t flags, struct save_callbacks* callbacks, int hvm,
                    xc_migration_stream_t stream_type, int recv_fd)
 {
     struct xc_sr_context ctx =
@@ -932,25 +1022,17 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
            stream_type == XC_MIG_STREAM_REMUS ||
            stream_type == XC_MIG_STREAM_COLO);
 
-    /*
-     * TODO: Find some time to better tweak the live migration algorithm.
-     *
-     * These parameters are better than the legacy algorithm especially for
-     * busy guests.
-     */
-    ctx.save.max_iterations = 5;
-    ctx.save.dirty_threshold = 50;
-
     /* Sanity checks for callbacks. */
     if ( hvm )
         assert(callbacks->switch_qemu_logdirty);
+    if ( ctx.save.live )
+        assert(callbacks->precopy_policy);
     if ( ctx.save.checkpointed )
         assert(callbacks->checkpoint && callbacks->aftercopy);
     if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
         assert(callbacks->wait_checkpoint);
 
-    DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
-            io_fd, dom, max_iters, max_factor, flags, hvm);
+    DPRINTF("fd %d, dom %u, flags %u, hvm %d", io_fd, dom, flags, hvm);
 
     if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
     {
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 77fe30e..6d28cce 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -328,6 +328,25 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
     return rc;
 }
 
+/*
+ * This is the live migration precopy policy - it's called periodically during
+ * the precopy phase of live migrations, and is responsible for deciding when
+ * the precopy phase should terminate and what should be done next.
+ *
+ * The policy implemented here behaves identically to the policy previously
+ * hard-coded into xc_domain_save() - it proceeds to the stop-and-copy phase of
+ * the live migration when either fewer than 50 pages are dirty or at least 5
+ * precopy rounds have completed.
+ */
+static int libxl__save_live_migration_simple_precopy_policy(
+    struct precopy_stats stats, void *user)
+{
+    return ((stats.dirty_count >= 0 && stats.dirty_count < 50) ||
+            stats.iteration >= 5)
+        ? XGS_POLICY_STOP_AND_COPY
+        : XGS_POLICY_CONTINUE_PRECOPY;
+}
+
 /*----- main code for saving, in order of execution -----*/
 
 void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
@@ -401,6 +420,7 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
     if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
         callbacks->suspend = libxl__domain_suspend_callback;
 
+    callbacks->precopy_policy = libxl__save_live_migration_simple_precopy_policy;
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
 
     dss->sws.ao  = dss->ao;
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 46b892c..026b572 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -89,8 +89,7 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_save_state *dss,
         libxl__srm_callout_enumcallbacks_save(&shs->callbacks.save.a);
 
     const unsigned long argnums[] = {
-        dss->domid, 0, 0, dss->xcflags, dss->hvm,
-        cbflags, dss->checkpointed_stream,
+        dss->domid, dss->xcflags, dss->hvm, cbflags, dss->checkpointed_stream,
     };
 
     shs->ao = ao;
diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c
index d3def6b..0241a6b 100644
--- a/tools/libxl/libxl_save_helper.c
+++ b/tools/libxl/libxl_save_helper.c
@@ -251,8 +251,6 @@ int main(int argc, char **argv)
         io_fd =                             atoi(NEXTARG);
         recv_fd =                           atoi(NEXTARG);
         uint32_t dom =                      strtoul(NEXTARG,0,10);
-        uint32_t max_iters =                strtoul(NEXTARG,0,10);
-        uint32_t max_factor =               strtoul(NEXTARG,0,10);
         uint32_t flags =                    strtoul(NEXTARG,0,10);
         int hvm =                           atoi(NEXTARG);
         unsigned cbflags =                  strtoul(NEXTARG,0,10);
@@ -264,9 +262,8 @@ int main(int argc, char **argv)
         startup("save");
         setup_signals(save_signal_handler);
 
-        r = xc_domain_save(xch, io_fd, dom, max_iters, max_factor, flags,
-                           &helper_save_callbacks, hvm, stream_type,
-                           recv_fd);
+        r = xc_domain_save(xch, io_fd, dom, flags, &helper_save_callbacks, hvm,
+                           stream_type, recv_fd);
         complete(r);
 
     } else if (!strcmp(mode,"--restore-domain")) {
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 27845bb..50c97b4 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -33,6 +33,7 @@ our @msgs = (
                                               'xen_pfn_t', 'console_gfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
+    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
 );
 
 #----------------------------------------
@@ -141,7 +142,8 @@ static void bytes_put(unsigned char *const buf, int *len,
 
 END
 
-foreach my $simpletype (qw(int uint16_t uint32_t unsigned), 'unsigned long', 'xen_pfn_t') {
+foreach my $simpletype (qw(int uint16_t uint32_t unsigned),
+                        'unsigned long', 'xen_pfn_t', 'struct precopy_stats') {
     my $typeid = typeid($simpletype);
     $out_body{'callout'} .= <<END;
 static int ${typeid}_get(const unsigned char **msg,
-- 
2.7.4



* [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (6 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 07/20] migration: defer precopy policy to libxl Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-29 21:08   ` Andrew Cooper
  2017-03-27  9:06 ` [PATCH RFC 09/20] libxc/xc_sr_save: introduce save batch types Joshua Otto
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

In the context of the live migration algorithm, the precopy iteration
count refers to the number of page-copying iterations performed prior to
the suspension of the guest and transmission of the final set of dirty
pages.  Similarly, the precopy dirty threshold refers to the dirty page
count below which we judge it more profitable to proceed to
stop-and-copy rather than continue with the precopy.  These would be
helpful tuning parameters to work with when migrating particularly busy
guests, as they enable an administrator to reap the available benefits
of the precopy algorithm (the transmission of guest pages _not_ in the
writable working set can be completed without guest downtime) while
reducing the total amount of time required for the migration (as
iterations of the precopy loop that will certainly be redundant can be
skipped in favour of an earlier suspension).

To expose these tuning parameters to users:
- introduce a new libxl API function, libxl_domain_live_migrate(),
  taking the same parameters as libxl_domain_suspend() _and_
  precopy_iterations and precopy_dirty_threshold parameters, and
  consider these parameters in the precopy policy

  (though a pair of new parameters on their own might not warrant an
  entirely new API function, it is added in anticipation of a number of
  additional migration-only parameters that would be cumbersome on the
  whole to tack on to the existing suspend API)

- switch xl migrate to the new libxl_domain_live_migrate() and add new
  --precopy-iterations and --precopy-threshold parameters to pass
  through
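
As a usage sketch, a busy guest could then be migrated with (the domain
and host names here are hypothetical):

    # At most 3 precopy rounds, suspending early once the dirty page
    # count falls to 200 or fewer at the end of a round:
    xl migrate --precopy-iterations 3 --precopy-threshold 200 busy-guest dsthost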

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl.h          | 10 ++++++++++
 tools/libxl/libxl_dom_save.c | 20 +++++++++++---------
 tools/libxl/libxl_domain.c   | 27 +++++++++++++++++++++++++--
 tools/libxl/libxl_internal.h |  2 ++
 tools/xl/xl_cmdtable.c       | 22 +++++++++++++---------
 tools/xl/xl_migrate.c        | 31 +++++++++++++++++++++++++++----
 6 files changed, 88 insertions(+), 24 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 833f866..84ac96a 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1375,6 +1375,16 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd,
 #define LIBXL_SUSPEND_DEBUG 1
 #define LIBXL_SUSPEND_LIVE 2
 
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int fd,
+                              int flags, /* LIBXL_SUSPEND_* */
+                              unsigned int precopy_iterations,
+                              unsigned int precopy_dirty_threshold,
+                              const libxl_asyncop_how *ao_how)
+                              LIBXL_EXTERNAL_CALLERS_ONLY;
+
+#define LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT 5
+#define LIBXL_LM_DIRTY_THRESHOLD_DEFAULT 50
+
 /* @param suspend_cancel [from xenctrl.h:xc_domain_resume( @param fast )]
  *   If this parameter is true, use co-operative resume. The guest
  *   must support this.
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 6d28cce..10d5012 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -332,19 +332,21 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
  * This is the live migration precopy policy - it's called periodically during
  * the precopy phase of live migrations, and is responsible for deciding when
  * the precopy phase should terminate and what should be done next.
- *
- * The policy implemented here behaves identically to the policy previously
- * hard-coded into xc_domain_save() - it proceeds to the stop-and-copy phase of
- * the live migration when either fewer than 50 pages are dirty or at least 5
- * precopy rounds have completed.
  */
 static int libxl__save_live_migration_simple_precopy_policy(
     struct precopy_stats stats, void *user)
 {
-    return ((stats.dirty_count >= 0 && stats.dirty_count < 50) ||
-            stats.iteration >= 5)
-        ? XGS_POLICY_STOP_AND_COPY
-        : XGS_POLICY_CONTINUE_PRECOPY;
+    libxl__save_helper_state *shs = user;
+    libxl__domain_save_state *dss = shs->caller_state;
+
+    if (stats.dirty_count >= 0 &&
+        stats.dirty_count <= dss->precopy_dirty_threshold)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    if (stats.iteration >= dss->precopy_iterations)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
 /*----- main code for saving, in order of execution -----*/
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index 08eccd0..b1cf643 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -486,8 +486,10 @@ static void domain_suspend_cb(libxl__egc *egc,
 
 }
 
-int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
-                         const libxl_asyncop_how *ao_how)
+static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                             unsigned int precopy_iterations,
+                             unsigned int precopy_dirty_threshold,
+                             const libxl_asyncop_how *ao_how)
 {
     AO_CREATE(ctx, domid, ao_how);
     int rc;
@@ -510,6 +512,8 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     dss->live = flags & LIBXL_SUSPEND_LIVE;
     dss->debug = flags & LIBXL_SUSPEND_DEBUG;
     dss->checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_NONE;
+    dss->precopy_iterations = precopy_iterations;
+    dss->precopy_dirty_threshold = precopy_dirty_threshold;
 
     rc = libxl__fd_flags_modify_save(gc, dss->fd,
                                      ~(O_NONBLOCK|O_NDELAY), 0,
@@ -523,6 +527,25 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     return AO_CREATE_FAIL(rc);
 }
 
+int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                         const libxl_asyncop_how *ao_how)
+{
+    return do_domain_suspend(ctx, domid, fd, flags,
+                             LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
+                             LIBXL_LM_DIRTY_THRESHOLD_DEFAULT, ao_how);
+}
+
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                              unsigned int precopy_iterations,
+                              unsigned int precopy_dirty_threshold,
+                              const libxl_asyncop_how *ao_how)
+{
+    flags |= LIBXL_SUSPEND_LIVE;
+
+    return do_domain_suspend(ctx, domid, fd, flags, precopy_iterations,
+                             precopy_dirty_threshold, ao_how);
+}
+
 int libxl_domain_pause(libxl_ctx *ctx, uint32_t domid)
 {
     int ret;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index f1d8f9a..45d607a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3292,6 +3292,8 @@ struct libxl__domain_save_state {
     int live;
     int debug;
     int checkpointed_stream;
+    unsigned int precopy_iterations;
+    unsigned int precopy_dirty_threshold;
     const libxl_domain_remus_info *remus;
     /* private */
     int rc;
diff --git a/tools/xl/xl_cmdtable.c b/tools/xl/xl_cmdtable.c
index 7d97811..6df66fb 100644
--- a/tools/xl/xl_cmdtable.c
+++ b/tools/xl/xl_cmdtable.c
@@ -157,15 +157,19 @@ struct cmd_spec cmd_table[] = {
       &main_migrate, 0, 1,
       "Migrate a domain to another host",
       "[options] <Domain> <host>",
-      "-h              Print this help.\n"
-      "-C <config>     Send <config> instead of config file from creation.\n"
-      "-s <sshcommand> Use <sshcommand> instead of ssh.  String will be passed\n"
-      "                to sh. If empty, run <host> instead of ssh <host> xl\n"
-      "                migrate-receive [-d -e]\n"
-      "-e              Do not wait in the background (on <host>) for the death\n"
-      "                of the domain.\n"
-      "--debug         Print huge (!) amount of debug during the migration process.\n"
-      "-p              Do not unpause domain after migrating it."
+      "-h                   Print this help.\n"
+      "-C <config>          Send <config> instead of config file from creation.\n"
+      "-s <sshcommand>      Use <sshcommand> instead of ssh.  String will be passed\n"
+      "                     to sh. If empty, run <host> instead of ssh <host> xl\n"
+      "                     migrate-receive [-d -e]\n"
+      "-e                   Do not wait in the background (on <host>) for the death\n"
+      "                     of the domain.\n"
+      "--debug              Print huge (!) amount of debug during the migration process.\n"
+      "-p                   Do not unpause domain after migrating it.\n"
+      "--precopy-iterations Perform at most this many iterations of the precopy\n"
+      "                     memory migration loop before suspending the domain.\n"
+      "--precopy-threshold  If fewer than this many pages are dirty at the end of a\n"
+      "                     copy round, exit the precopy loop and suspend the domain."
     },
     { "restore",
       &main_restore, 0, 1,
diff --git a/tools/xl/xl_migrate.c b/tools/xl/xl_migrate.c
index 1f0e87d..1bb3fb4 100644
--- a/tools/xl/xl_migrate.c
+++ b/tools/xl/xl_migrate.c
@@ -177,7 +177,9 @@ static void migrate_do_preamble(int send_fd, int recv_fd, pid_t child,
 }
 
 static void migrate_domain(uint32_t domid, const char *rune, int debug,
-                           const char *override_config_file)
+                           const char *override_config_file,
+                           unsigned int precopy_iterations,
+                           unsigned int precopy_dirty_threshold)
 {
     pid_t child = -1;
     int rc;
@@ -205,7 +207,9 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
 
     if (debug)
         flags |= LIBXL_SUSPEND_DEBUG;
-    rc = libxl_domain_suspend(ctx, domid, send_fd, flags, NULL);
+    rc = libxl_domain_live_migrate(ctx, domid, send_fd, flags,
+                                   precopy_iterations, precopy_dirty_threshold,
+                                   NULL);
     if (rc) {
         fprintf(stderr, "migration sender: libxl_domain_suspend failed"
                 " (rc=%d)\n", rc);
@@ -537,13 +541,17 @@ int main_migrate(int argc, char **argv)
     char *rune = NULL;
     char *host;
     int opt, daemonize = 1, monitor = 1, debug = 0, pause_after_migration = 0;
+    int precopy_iterations = LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
+        precopy_dirty_threshold = LIBXL_LM_DIRTY_THRESHOLD_DEFAULT;
     static struct option opts[] = {
         {"debug", 0, 0, 0x100},
         {"live", 0, 0, 0x200},
+        {"precopy-iterations", 1, 0, 'i'},
+        {"precopy-threshold", 1, 0, 'd'},
         COMMON_LONG_OPTS
     };
 
-    SWITCH_FOREACH_OPT(opt, "FC:s:ep", opts, "migrate", 2) {
+    SWITCH_FOREACH_OPT(opt, "FC:s:epi:d:", opts, "migrate", 2) {
     case 'C':
         config_filename = optarg;
         break;
@@ -560,6 +568,20 @@ int main_migrate(int argc, char **argv)
     case 'p':
         pause_after_migration = 1;
         break;
+    case 'i':
+        precopy_iterations = atoi(optarg);
+        if (precopy_iterations < 0) {
+            fprintf(stderr, "negative precopy iterations not supported\n");
+            return EXIT_FAILURE;
+        }
+        break;
+    case 'd':
+        precopy_dirty_threshold = atoi(optarg);
+        if (precopy_dirty_threshold < 0) {
+            fprintf(stderr, "negative dirty threshold not supported\n");
+            return EXIT_FAILURE;
+        }
+        break;
     case 0x100: /* --debug */
         debug = 1;
         break;
@@ -596,7 +618,8 @@ int main_migrate(int argc, char **argv)
                   pause_after_migration ? " -p" : "");
     }
 
-    migrate_domain(domid, rune, debug, config_filename);
+    migrate_domain(domid, rune, debug, config_filename, precopy_iterations,
+                   precopy_dirty_threshold);
     return EXIT_SUCCESS;
 }
 
-- 
2.7.4



* [PATCH RFC 09/20] libxc/xc_sr_save: introduce save batch types
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (7 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free() Joshua Otto
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

To write guest pages into the stream, the save logic builds up batches
of pfns to be written and performs all of the work necessary to write
them whenever a full batch has been accumulated.  Writing a PAGE_DATA
batch entails determining the types of all pfns in the batch, mapping
the subset of pfns that are backed by real memory, constructing a
PAGE_DATA record describing the batch, and writing everything into the
stream.

Postcopy live migration introduces several new types of batches.  To
enable the postcopy logic to re-use the bulk of the code used to manage
and write PAGE_DATA records, introduce a batch_type member to the save
context (which for now can take on only a single value), and refactor
write_batch() to take the batch_type into account when preparing and
writing each record.

While refactoring write_batch(), factor the operation of querying the
page types of a batch into a subroutine that is usable independently of
write_batch().

No functional change.
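
As a sketch of how a later patch might build on this (the
XC_SR_SAVE_BATCH_POSTCOPY_PFN name below is illustrative, not something
introduced here), a new batch type would be wired up by extending the
enum and the parallel lookup tables together:

    enum {
        XC_SR_SAVE_BATCH_PRECOPY_PAGE,
        XC_SR_SAVE_BATCH_POSTCOPY_PFN,      /* illustrative only */
    } batch_type;

    static const unsigned batch_sizes[] =
    {
        [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE,
        [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = 1024,    /* illustrative only */
    };

    /* ... with matching entries in batch_includes_contents[] (false here,
     * as only the pfns themselves would be sent) and batch_rec_types[]. */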

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_common.h    |   3 +
 tools/libxc/xc_sr_save.c      | 217 ++++++++++++++++++++++++++++--------------
 tools/libxc/xg_save_restore.h |   2 +-
 3 files changed, 151 insertions(+), 71 deletions(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index a9160bd..ee463d9 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -203,6 +203,9 @@ struct xc_sr_context
             struct precopy_stats stats;
             int policy_decision;
 
+            enum {
+                XC_SR_SAVE_BATCH_PRECOPY_PAGE
+            } batch_type;
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index eb95334..ac97d93 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -3,6 +3,23 @@
 
 #include "xc_sr_common.h"
 
+#define MAX_BATCH_SIZE MAX_PRECOPY_BATCH_SIZE
+
+static const unsigned batch_sizes[] =
+{
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE
+};
+
+static const bool batch_includes_contents[] =
+{
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true
+};
+
+static const uint32_t batch_rec_types[] =
+{
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA
+};
+
 /*
  * Writes an Image header and Domain header into the stream.
  */
@@ -61,19 +78,80 @@ WRITE_TRIVIAL_RECORD_FN(end,                 REC_TYPE_END);
 WRITE_TRIVIAL_RECORD_FN(checkpoint,          REC_TYPE_CHECKPOINT);
 
 /*
+ * This function:
+ * - maps each pfn in the current batch to its gfn
+ * - gets the type of each pfn in the batch.
+ *
+ * The caller must free() both of the returned buffers.  Both pointers are safe
+ * to free() after failure.
+ */
+static int get_batch_info(struct xc_sr_context *ctx,
+                          /* OUT */ xen_pfn_t **p_mfns,
+                          /* OUT */ xen_pfn_t **p_types)
+{
+    int rc = -1;
+    unsigned nr_pfns = ctx->save.nr_batch_pfns;
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns, *types;
+    unsigned i;
+
+    assert(p_mfns);
+    assert(p_types);
+
+    *p_mfns = mfns = malloc(nr_pfns * sizeof(*mfns));
+    *p_types = types = malloc(nr_pfns * sizeof(*types));
+
+    if ( !mfns || !types )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+        types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
+                                                      ctx->save.batch_pfns[i]);
+
+    /* The type query domctl accepts batches of at most 1024 pfns, so we need to
+     * break our batch here into appropriately-sized sub-batches. */
+    for ( i = 0; i < nr_pfns; i += 1024 )
+    {
+        rc = xc_get_pfn_type_batch(xch, ctx->domid, min(1024U, nr_pfns - i), &types[i]);
+        if ( rc )
+        {
+            PERROR("Failed to get types for pfn batch");
+            goto err;
+        }
+    }
+
+    rc = 0;
+    goto done;
+
+ err:
+    free(mfns);
+    *p_mfns = NULL;
+
+    free(types);
+    *p_types = NULL;
+
+ done:
+    return rc;
+}
+
+/*
  * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
  * is constructed in ctx->save.batch_pfns.
  *
  * This function:
- * - gets the types for each pfn in the batch.
  * - for each pfn with real data:
  *   - maps and attempts to localise the pages.
  * - construct and writes a PAGE_DATA record into the stream.
  */
-static int write_batch(struct xc_sr_context *ctx)
+static int write_batch(struct xc_sr_context *ctx, xen_pfn_t *mfns,
+                       xen_pfn_t *types)
 {
     xc_interface *xch = ctx->xch;
-    xen_pfn_t *mfns = NULL, *types = NULL;
+    xen_pfn_t *bmfns = NULL;
     void *guest_mapping = NULL;
     void **guest_data = NULL;
     void **local_pages = NULL;
@@ -84,17 +162,16 @@ static int write_batch(struct xc_sr_context *ctx)
     uint64_t *rec_pfns = NULL;
     struct iovec *iov = NULL; int iovcnt = 0;
     struct xc_sr_rec_pages_header hdr = { 0 };
+    bool send_page_contents = batch_includes_contents[ctx->save.batch_type];
     struct xc_sr_record rec =
     {
-        .type = REC_TYPE_PAGE_DATA,
+        .type = batch_rec_types[ctx->save.batch_type],
     };
 
     assert(nr_pfns != 0);
 
-    /* Mfns of the batch pfns. */
-    mfns = malloc(nr_pfns * sizeof(*mfns));
-    /* Types of the batch pfns. */
-    types = malloc(nr_pfns * sizeof(*types));
+    /* The subset of mfns that are physically-backed. */
+    bmfns = malloc(nr_pfns * sizeof(*bmfns));
     /* Errors from attempting to map the gfns. */
     errors = malloc(nr_pfns * sizeof(*errors));
     /* Pointers to page data to send.  Mapped gfns or local allocations. */
@@ -104,19 +181,16 @@ static int write_batch(struct xc_sr_context *ctx)
     /* iovec[] for writev(). */
     iov = malloc((nr_pfns + 4) * sizeof(*iov));
 
-    if ( !mfns || !types || !errors || !guest_data || !local_pages || !iov )
+    if ( !bmfns || !errors || !guest_data || !local_pages || !iov )
     {
         ERROR("Unable to allocate arrays for a batch of %u pages",
               nr_pfns);
         goto err;
     }
 
+    /* Mark likely-ballooned pages as deferred. */
     for ( i = 0; i < nr_pfns; ++i )
     {
-        types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
-                                                      ctx->save.batch_pfns[i]);
-
-        /* Likely a ballooned page. */
         if ( mfns[i] == INVALID_MFN )
         {
             set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
@@ -124,39 +198,9 @@ static int write_batch(struct xc_sr_context *ctx)
         }
     }
 
-    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
-    if ( rc )
-    {
-        PERROR("Failed to get types for pfn batch");
-        goto err;
-    }
-    rc = -1;
-
-    for ( i = 0; i < nr_pfns; ++i )
-    {
-        switch ( types[i] )
-        {
-        case XEN_DOMCTL_PFINFO_BROKEN:
-        case XEN_DOMCTL_PFINFO_XALLOC:
-        case XEN_DOMCTL_PFINFO_XTAB:
-            continue;
-        }
-
-        mfns[nr_pages++] = mfns[i];
-    }
-
-    if ( nr_pages > 0 )
+    if ( send_page_contents )
     {
-        guest_mapping = xenforeignmemory_map(xch->fmem,
-            ctx->domid, PROT_READ, nr_pages, mfns, errors);
-        if ( !guest_mapping )
-        {
-            PERROR("Failed to map guest pages");
-            goto err;
-        }
-        nr_pages_mapped = nr_pages;
-
-        for ( i = 0, p = 0; i < nr_pfns; ++i )
+        for ( i = 0; i < nr_pfns; ++i )
         {
             switch ( types[i] )
             {
@@ -166,36 +210,62 @@ static int write_batch(struct xc_sr_context *ctx)
                 continue;
             }
 
-            if ( errors[p] )
+            bmfns[nr_pages++] = mfns[i];
+        }
+
+        if ( nr_pages > 0 )
+        {
+            guest_mapping = xenforeignmemory_map(xch->fmem,
+                ctx->domid, PROT_READ, nr_pages, bmfns, errors);
+            if ( !guest_mapping )
             {
-                ERROR("Mapping of pfn %#"PRIpfn" (mfn %#"PRIpfn") failed %d",
-                      ctx->save.batch_pfns[i], mfns[p], errors[p]);
+                PERROR("Failed to map guest pages");
                 goto err;
             }
+            nr_pages_mapped = nr_pages;
 
-            orig_page = page = guest_mapping + (p * PAGE_SIZE);
-            rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
+            for ( i = 0, p = 0; i < nr_pfns; ++i )
+            {
+                switch ( types[i] )
+                {
+                case XEN_DOMCTL_PFINFO_BROKEN:
+                case XEN_DOMCTL_PFINFO_XALLOC:
+                case XEN_DOMCTL_PFINFO_XTAB:
+                    continue;
+                }
+
+                if ( errors[p] )
+                {
+                    ERROR("Mapping of pfn %#"PRIpfn" (mfn %#"PRIpfn") failed %d",
+                          ctx->save.batch_pfns[i], bmfns[p], errors[p]);
+                    goto err;
+                }
 
-            if ( orig_page != page )
-                local_pages[i] = page;
+                orig_page = page = guest_mapping + (p * PAGE_SIZE);
+                rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
 
-            if ( rc )
-            {
-                if ( rc == -1 && errno == EAGAIN )
+                if ( orig_page != page )
+                    local_pages[i] = page;
+
+                if ( rc )
                 {
-                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
-                    ++ctx->save.nr_deferred_pages;
-                    types[i] = XEN_DOMCTL_PFINFO_XTAB;
-                    --nr_pages;
+                    if ( rc == -1 && errno == EAGAIN )
+                    {
+                        set_bit(ctx->save.batch_pfns[i],
+                                ctx->save.deferred_pages);
+                        ++ctx->save.nr_deferred_pages;
+                        types[i] = XEN_DOMCTL_PFINFO_XTAB;
+                        --nr_pages;
+                    }
+                    else
+                        goto err;
                 }
                 else
-                    goto err;
-            }
-            else
-                guest_data[i] = page;
+                    guest_data[i] = page;
 
-            rc = -1;
-            ++p;
+                rc = -1;
+                ++p;
+            }
         }
     }
 
@@ -264,8 +334,7 @@ static int write_batch(struct xc_sr_context *ctx)
     free(local_pages);
     free(guest_data);
     free(errors);
-    free(types);
-    free(mfns);
+    free(bmfns);
 
     return rc;
 }
@@ -275,7 +344,7 @@ static int write_batch(struct xc_sr_context *ctx)
  */
 static bool batch_full(struct xc_sr_context *ctx)
 {
-    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
+    return ctx->save.nr_batch_pfns == batch_sizes[ctx->save.batch_type];
 }
 
 /*
@@ -292,11 +361,18 @@ static bool batch_empty(struct xc_sr_context *ctx)
 static int flush_batch(struct xc_sr_context *ctx)
 {
     int rc = 0;
+    xen_pfn_t *mfns = NULL, *types = NULL;
 
     if ( batch_empty(ctx) )
         return rc;
 
-    rc = write_batch(ctx);
+    rc = get_batch_info(ctx, &mfns, &types);
+    if ( rc )
+        return rc;
+
+    rc = write_batch(ctx, mfns, types);
+    free(mfns);
+    free(types);
 
     if ( !rc )
     {
@@ -313,7 +389,7 @@ static int flush_batch(struct xc_sr_context *ctx)
  */
 static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
-    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
+    assert(ctx->save.nr_batch_pfns < batch_sizes[ctx->save.batch_type]);
     ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
 }
 
@@ -383,6 +459,7 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
     void *data = ctx->save.callbacks->data;
 
     assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_PRECOPY_PAGE;
     for ( p = 0, written = 0; p < ctx->save.p2m_size; )
     {
         if ( ctx->save.live && precopy )
diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h
index 303081d..40debf6 100644
--- a/tools/libxc/xg_save_restore.h
+++ b/tools/libxc/xg_save_restore.h
@@ -24,7 +24,7 @@
 ** We process save/restore/migrate in batches of pages; the below
 ** determines how many pages we (at maximum) deal with in each batch.
 */
-#define MAX_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
+#define MAX_PRECOPY_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
 
 /* When pinning page tables at the end of restore, we also use batching. */
 #define MAX_PIN_BATCH  1024
-- 
2.7.4



* [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free()
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (8 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 09/20] libxc/xc_sr_save: introduce save batch types Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-28 19:59   ` Andrew Cooper
  2017-03-27  9:06 ` [PATCH RFC 11/20] libxc/migration: correct hvm record ordering specification Joshua Otto
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

colo_merge_secondary_dirty_bitmap() unconditionally free()s the .data
member of its local xc_sr_record structure rec on its exit path.
However, if the initial call to read_record() fails then this member is
uninitialised.  Initialise it.
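
The hazard has this shape (a minimal sketch, not the verbatim code):

    struct xc_sr_record rec;          /* rec.data holds stack garbage here */
    int rc;

    rc = read_record(ctx, fd, &rec);  /* on failure, rec.data is never written */
    /* ... */
    free(rec.data);                   /* unconditional: undefined behaviour
                                       * if read_record() failed */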

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_save.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index ac97d93..6acc8d3 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -681,7 +681,7 @@ static int send_memory_live(struct xc_sr_context *ctx)
 static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
-    struct xc_sr_record rec;
+    struct xc_sr_record rec = { 0, 0, NULL };
     uint64_t *pfns = NULL;
     uint64_t pfn;
     unsigned count, i;
-- 
2.7.4



* [PATCH RFC 11/20] libxc/migration: correct hvm record ordering specification
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (9 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free() Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 12/20] libxc/migration: specify postcopy live migration Joshua Otto
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

The libxc migration stream specification document asserts that, within
an hvm migration stream, "HVM_PARAMS must precede HVM_CONTEXT, as
certain parameters can affect the validity of architectural state in the
context."  This sounds reasonable, but the in-tree implementation of hvm
domain save actually writes these records in the _reverse_ order, with
HVM_CONTEXT first and HVM_PARAMS next.  This has been the case for the
entire history of that implementation, seemingly to no ill effect, so
update the spec to reflect this.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 docs/specs/libxc-migration-stream.pandoc | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index 31eba10..96a6cb0 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -668,11 +668,8 @@ A typical save record for an x86 HVM guest image would look like:
 2. Domain header
 3. Many PAGE\_DATA records
 4. TSC\_INFO
-5. HVM\_PARAMS
-6. HVM\_CONTEXT
-
-HVM\_PARAMS must precede HVM\_CONTEXT, as certain parameters can affect
-the validity of architectural state in the context.
+5. HVM\_CONTEXT
+6. HVM\_PARAMS
 
 
 Legacy Images (x86 only)
-- 
2.7.4



* [PATCH RFC 12/20] libxc/migration: specify postcopy live migration
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (10 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 11/20] libxc/migration: correct hvm record ordering specification Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 13/20] libxc/migration: add try_read_record() Joshua Otto
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

- allocate the new postcopy record type numbers
- augment the stream format specification to include these new types and
  their role in the protocol
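
As a minimal receiver-side sketch (handle_pfn() is a hypothetical
callback), the body of one of the pfn-list records specified below
(POSTCOPY_PFNS or POSTCOPY_FAULT) could be walked like this:

    /* Assumes the 8-octet header (4-byte count, 4 reserved bytes) shown
     * in the record diagrams, followed by count 64-bit pfn entries. */
    static int walk_pfn_list_record(struct xc_sr_record *rec)
    {
        struct xc_sr_rec_pages_header *hdr = rec->data;
        uint64_t *pfns = (uint64_t *)((char *)rec->data + sizeof(*hdr));
        uint32_t i;

        if ( rec->length < sizeof(*hdr) + (uint64_t)hdr->count * sizeof(*pfns) )
            return -1;                  /* malformed record */

        for ( i = 0; i < hdr->count; ++i )
            handle_pfn(pfns[i]);        /* hypothetical */

        return 0;
    }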

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 docs/specs/libxc-migration-stream.pandoc | 177 ++++++++++++++++++++++++++++++-
 tools/libxc/xc_sr_common.c               |   7 ++
 tools/libxc/xc_sr_stream_format.h        |   9 +-
 3 files changed, 191 insertions(+), 2 deletions(-)

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index 96a6cb0..8ff8da5 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -3,7 +3,8 @@
   Andrew Cooper <<andrew.cooper3@citrix.com>>
   Wen Congyang <<wency@cn.fujitsu.com>>
   Yang Hongyang <<hongyang.yang@easystack.cn>>
-% Revision 1
+  Joshua Otto <<jtotto@uwaterloo.ca>>
+% Revision 2
 
 Introduction
 ============
@@ -231,6 +232,20 @@ type         0x00000000: END
 
              0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
 
+             0x00000010: POSTCOPY_BEGIN
+
+             0x00000011: POSTCOPY_PFNS_BEGIN
+
+             0x00000012: POSTCOPY_PFNS
+
+             0x00000013: POSTCOPY_TRANSITION
+
+             0x00000014: POSTCOPY_PAGE_DATA
+
+             0x00000015: POSTCOPY_FAULT
+
+             0x00000016: POSTCOPY_COMPLETE
+
              0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
              records.
 
@@ -624,6 +639,142 @@ The count of pfns is: record->length/sizeof(uint64_t).
 
 \clearpage
 
+POSTCOPY_BEGIN
+--------------
+
+This record must only appear in a truly _live_ migration stream, and is
+transmitted by the migration sender to signal to the destination that
+the migration will (as soon as possible) transition from the memory
+pre-copy phase to the post-copy phase, during which remaining unmigrated
+domain memory is paged over the network on-demand _after_ the guest has
+resumed.
+
+This record _must_ be followed immediately by the domain CPU context
+records (e.g. TSC_INFO, HVM_CONTEXT and HVM_PARAMS for HVM domains).
+This is for practical reasons: in the HVM case, the PAGING_RING_PFN
+parameter must be known at the destination before preparation for paging
+can begin.
+
+This record contains no fields; its body_length is 0.
+
+\clearpage
+
+POSTCOPY_PFNS_BEGIN
+-------------------
+
+During the initiation sequence of a postcopy live migration, this record
+immediately follows the final domain CPU context record and indicates
+the beginning of a sequence of 0 or more POSTCOPY_PFNS records.  The
+destination uses this record as a cue to prepare for postcopy paging.
+
+This record contains no fields; its body_length is 0.
+
+\clearpage
+
+POSTCOPY_PFNS
+-------------
+
+Each POSTCOPY_PFNS record contains an unordered list of 'postcopy PFNS'
+- i.e. pfns that are dirty at the sender and require migration during
+the postcopy phase.  The structure of the record is identical to that of
+the PAGE_DATA record type, but omitting any actual trailing page
+contents.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+
+\clearpage
+
+POSTCOPY_TRANSITION
+-------------------
+
+This record is transmitted by a postcopy live migration sender after the
+final POSTCOPY_PFNS record, and indicates that the embedded libxc stream
+will be interrupted by content in the higher-layer stream necessary to
+permit resumption of the domain at the destination, and further that,
+when the higher-layer content is complete, the domain should be resumed
+in postcopy mode at the destination.
+
+This record contains no fields; its body_length is 0.
+
+\clearpage
+
+POSTCOPY_PAGE_DATA
+------------------
+
+This record is identical in meaning and format to the PAGE_DATA record
+type, and is transmitted during live migration by the sender during the
+postcopy phase to transfer batches of outstanding domain memory.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+    | page_data[0]...                                 |
+    ...
+    +-------------------------------------------------+
+    | page_data[C-1]...                               |
+    ...
+    +-------------------------------------------------+
+
+It is an error for an XTAB, BROKEN or XALLOC pfn to be transmitted in a
+record of this type, so all pfns must be accompanied by backing data.
+It is an error for a pfn not previously included in a POSTCOPY_PFNS
+record to be included in a record of this type.
+
+\clearpage
+
+POSTCOPY_FAULT
+--------------
+
+A POSTCOPY_FAULT record is transmitted by a postcopy live migration
+_destination_ to communicate an urgent need for a batch of pfns.  It is
+identical in format to the POSTCOPY_PFNS record type, _except_ that the
+type of each page is not encoded in the transmitted pfns.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+
+\clearpage
+
+POSTCOPY_COMPLETE
+-----------------
+
+A postcopy live migration _destination_ transmits a POSTCOPY_COMPLETE
+record when the postcopy phase of a migration is complete, if one was
+entered.
+
+This record contains no fields; its body_length is 0.
+
+In addition to reporting the phase completion to the sender, this record
+also enables the migration sender to flush its receive stream of
+in-flight POSTCOPY_FAULT records before handing control of the stream
+back to a higher layer.
+
+\clearpage
+
 Layout
 ======
 
@@ -671,6 +822,30 @@ A typical save record for an x86 HVM guest image would look like:
 5. HVM\_CONTEXT
 6. HVM\_PARAMS
 
+x86 HVM Postcopy Live Migration
+-------------------------------
+
+The bi-directional migration stream for postcopy live migration of an
+x86 HVM guest image would look like:
+
+ 1. Image header
+ 2. Domain header
+ 3. Many (or few!) PAGE\_DATA records
+ 4. POSTCOPY\_BEGIN
+ 5. TSC\_INFO
+ 6. HVM\_CONTEXT
+ 7. HVM\_PARAMS
+ 8. POSTCOPY\_PFNS\_BEGIN
+ 9. Many POSTCOPY\_PFNS records
+10. POSTCOPY\_TRANSITION
+... higher layer stream content ...
+11. Many POSTCOPY\_PAGE\_DATA records
+
+During 11, the destination would reply with (hopefully not too) many
+POSTCOPY\_FAULT records.
+
+After 11, the destination would transmit a final POSTCOPY\_COMPLETE.
+
 
 Legacy Images (x86 only)
 ========================
diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index f443974..090b5fd 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -38,6 +38,13 @@ static const char *mandatory_rec_types[] =
     [REC_TYPE_VERIFY]                       = "Verify",
     [REC_TYPE_CHECKPOINT]                   = "Checkpoint",
     [REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST]    = "Checkpoint dirty pfn list",
+    [REC_TYPE_POSTCOPY_BEGIN]               = "Postcopy begin",
+    [REC_TYPE_POSTCOPY_PFNS_BEGIN]          = "Postcopy pfns begin",
+    [REC_TYPE_POSTCOPY_PFNS]                = "Postcopy pfns",
+    [REC_TYPE_POSTCOPY_TRANSITION]          = "Postcopy transition",
+    [REC_TYPE_POSTCOPY_PAGE_DATA]           = "Postcopy page data",
+    [REC_TYPE_POSTCOPY_FAULT]               = "Postcopy fault",
+    [REC_TYPE_POSTCOPY_COMPLETE]            = "Postcopy complete",
 };
 
 const char *rec_type_to_str(uint32_t type)
diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
index 32400b2..d16d0c7 100644
--- a/tools/libxc/xc_sr_stream_format.h
+++ b/tools/libxc/xc_sr_stream_format.h
@@ -76,10 +76,17 @@ struct xc_sr_rhdr
 #define REC_TYPE_VERIFY                     0x0000000dU
 #define REC_TYPE_CHECKPOINT                 0x0000000eU
 #define REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST  0x0000000fU
+#define REC_TYPE_POSTCOPY_BEGIN             0x00000010U
+#define REC_TYPE_POSTCOPY_PFNS_BEGIN        0x00000011U
+#define REC_TYPE_POSTCOPY_PFNS              0x00000012U
+#define REC_TYPE_POSTCOPY_TRANSITION        0x00000013U
+#define REC_TYPE_POSTCOPY_PAGE_DATA         0x00000014U
+#define REC_TYPE_POSTCOPY_FAULT             0x00000015U
+#define REC_TYPE_POSTCOPY_COMPLETE          0x00000016U
 
 #define REC_TYPE_OPTIONAL             0x80000000U
 
-/* PAGE_DATA */
+/* PAGE_DATA/POSTCOPY_PFNS/POSTCOPY_PAGE_DATA/POSTCOPY_FAULT */
 struct xc_sr_rec_pages_header
 {
     uint32_t count;
-- 
2.7.4



* [PATCH RFC 13/20] libxc/migration: add try_read_record()
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (11 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 12/20] libxc/migration: specify postcopy live migration Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-04-12 15:16   ` Wei Liu
  2017-03-27  9:06 ` [PATCH RFC 14/20] libxc/migration: implement the sender side of postcopy live migration Joshua Otto
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Enable non-blocking migration record reads by adding a helper routine that
manages the context of a record read across multiple invocations as the record's
data becomes available over time.
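
A minimal sketch of the intended calling pattern (wait_until_readable()
stands in for whatever poll()/select() machinery the caller already
has):

    struct xc_sr_read_record_context rrctx;
    struct xc_sr_record rec;
    int rc;

    read_record_init(&rrctx, ctx);

    while ( (rc = try_read_record(&rrctx, fd, &rec)) != 0 )
    {
        if ( errno != EAGAIN && errno != EWOULDBLOCK )
            break;                      /* fatal error */

        wait_until_readable(fd);        /* hypothetical poll() wrapper */
    }

    read_record_destroy(&rrctx);        /* safe after success or failure */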

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_private.c   | 21 +++++++++++----
 tools/libxc/xc_private.h   |  2 ++
 tools/libxc/xc_sr_common.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_common.h | 39 +++++++++++++++++++++++++++
 4 files changed, 124 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 72e6242..2c53b22 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -633,26 +633,37 @@ void bitmap_byte_to_64(uint64_t *lp, const uint8_t *bp, int nbits)
     }
 }
 
-int read_exact(int fd, void *data, size_t size)
+int try_read_exact(int fd, void *data, size_t size, size_t *offset)
 {
-    size_t offset = 0;
     ssize_t len;
 
-    while ( offset < size )
+    assert(offset);
+    *offset = 0;
+    while ( *offset < size )
     {
-        len = read(fd, (char *)data + offset, size - offset);
+        len = read(fd, (char *)data + *offset, size - *offset);
         if ( (len == -1) && (errno == EINTR) )
             continue;
         if ( len == 0 )
             errno = 0;
         if ( len <= 0 )
             return -1;
-        offset += len;
+        *offset += len;
     }
 
     return 0;
 }
 
+int read_exact(int fd, void *data, size_t size)
+{
+    size_t offset;
+    int rc;
+
+    rc = try_read_exact(fd, data, size, &offset);
+    assert(rc == -1 || offset == size);
+    return rc;
+}
+
 int write_exact(int fd, const void *data, size_t size)
 {
     size_t offset = 0;
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 1c27b0f..aaae344 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -384,6 +384,8 @@ int xc_flush_mmu_updates(xc_interface *xch, struct xc_mmu *mmu);
 
 /* Return 0 on success; -1 on error setting errno. */
 int read_exact(int fd, void *data, size_t size); /* EOF => -1, errno=0 */
+/* Like read_exact(), but stores the length read before any error in *offset. */
+int try_read_exact(int fd, void *data, size_t size, size_t *offset);
 int write_exact(int fd, const void *data, size_t size);
 int writev_exact(int fd, const struct iovec *iov, int iovcnt);
 
diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 090b5fd..b762775 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -147,6 +147,73 @@ int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec)
     return 0;
 };
 
+int try_read_record(struct xc_sr_read_record_context *rrctx, int fd,
+                    struct xc_sr_record *rec)
+{
+    int rc;
+    xc_interface *xch = rrctx->ctx->xch;
+    size_t offset_out, dataoff, datasz;
+
+    /* If the header isn't yet complete, attempt to finish it first. */
+    if ( rrctx->offset < sizeof(rrctx->rhdr) )
+    {
+        rc = try_read_exact(fd, (char *)&rrctx->rhdr + rrctx->offset,
+                            sizeof(rrctx->rhdr) - rrctx->offset, &offset_out);
+        rrctx->offset += offset_out;
+
+        if ( rc )
+            return rc;
+        else
+            assert(rrctx->offset == sizeof(rrctx->rhdr));
+    }
+
+    datasz = ROUNDUP(rrctx->rhdr.length, REC_ALIGN_ORDER);
+
+    if ( datasz )
+    {
+        if ( !rrctx->data )
+        {
+            rrctx->data = malloc(datasz);
+
+            if ( !rrctx->data )
+            {
+                ERROR("Unable to allocate %zu bytes for record (0x%08x, %s)",
+                      datasz, rrctx->rhdr.type,
+                      rec_type_to_str(rrctx->rhdr.type));
+                return -1;
+            }
+        }
+
+        dataoff = rrctx->offset - sizeof(rrctx->rhdr);
+        rc = try_read_exact(fd, (char *)rrctx->data + dataoff, datasz - dataoff,
+                            &offset_out);
+        rrctx->offset += offset_out;
+
+        if ( rc == -1 )
+        {
+            /* Differentiate between expected and fatal errors. */
+            if ( (errno != EAGAIN) && (errno != EWOULDBLOCK) )
+            {
+                free(rrctx->data);
+                rrctx->data = NULL;
+                PERROR("Failed to read %zu bytes for record (0x%08x, %s)",
+                       datasz, rrctx->rhdr.type,
+                       rec_type_to_str(rrctx->rhdr.type));
+            }
+
+            return rc;
+        }
+    }
+
+    /* Success!  Fill in the output record structure. */
+    rec->type   = rrctx->rhdr.type;
+    rec->length = rrctx->rhdr.length;
+    rec->data   = rrctx->data;
+    rrctx->data = NULL;
+
+    return 0;
+}
+
 int validate_pages_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
                           uint32_t expected_type)
 {
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index ee463d9..b52355d 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -394,6 +394,45 @@ static inline int write_record(struct xc_sr_context *ctx, int fd,
 int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec);
 
 /*
+ * try_read_record() (prototype below) reads a record from a _non-blocking_
+ * stream over the course of one or more invocations.  Context for the record
+ * read is maintained in an xc_sr_read_record_context.
+ *
+ * The protocol is:
+ * - call read_record_init() on an uninitialized or previously-destroyed
+ *   read-record context prior to using it to read a record
+ * - call try_read_record() with this initialized context one or more times
+ *   - rc < 0 and errno == EAGAIN/EWOULDBLOCK => try again
+ *   - rc < 0 otherwise => failure
+ *   - rc == 0 => a complete record has been read, and is filled into
+ *     try_read_record()'s rec argument
+ * - after either failure or completion of a record, destroy the context with
+ *   read_record_destroy()
+ */
+struct xc_sr_read_record_context
+{
+    struct xc_sr_context *ctx;
+    size_t offset;
+    struct xc_sr_rhdr rhdr;
+    void *data;
+};
+
+static inline void read_record_init(struct xc_sr_read_record_context *rrctx,
+                                    struct xc_sr_context *ctx)
+{
+    *rrctx = (struct xc_sr_read_record_context) { .ctx = ctx };
+}
+
+int try_read_record(struct xc_sr_read_record_context *rrctx, int fd,
+                    struct xc_sr_record *rec);
+
+static inline void read_record_destroy(struct xc_sr_read_record_context *rrctx)
+{
+    free(rrctx->data);
+    rrctx->data = NULL;
+}
+
+/*
  * Given a record of one of the page data types, validate it by:
  * - checking its actual type against its specific expected type
  * - sanity checking its actual length against its claimed length
-- 
2.7.4
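
An illustrative sketch of the calling convention for try_read_record(), as
documented in the xc_sr_common.h comment above (not part of the patch - a
stream context ctx and an already non-blocking fd are assumed, and a real
caller such as the postcopy sender would poll() between attempts rather
than spin):

    struct xc_sr_read_record_context rrctx;
    struct xc_sr_record rec;
    int rc;

    read_record_init(&rrctx, ctx);

    do {
        /* Each call makes as much progress as the fd currently allows. */
        rc = try_read_record(&rrctx, fd, &rec);
    } while ( rc && (errno == EAGAIN || errno == EWOULDBLOCK) );

    read_record_destroy(&rrctx);

    if ( rc == 0 )
    {
        /* rec now describes a complete record; ownership of rec.data has
         * passed to the caller, who must eventually free() it. */
    }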



* [PATCH RFC 14/20] libxc/migration: implement the sender side of postcopy live migration
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (12 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 13/20] libxc/migration: add try_read_record() Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 15/20] libxc/migration: implement the receiver " Joshua Otto
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Add a new 'postcopy' phase to the live migration algorithm, during which
unmigrated domain memory is paged over the network on-demand _after_ the
guest has been resumed at the destination.

To do so:
- Add a new precopy policy option, XGS_POLICY_POSTCOPY, that policies
  can use to request a transition to the postcopy live migration phase
  rather than a stop-and-copy of the remaining dirty pages.
- Add support to xc_domain_save() for this policy option by breaking out
  of the precopy loop early, transmitting the final set of dirty pfns
  and all remaining domain state (including higher-layer state) except
  memory, and entering a postcopy loop during which the remaining page
  data is pushed in the background.  Remote requests for specific pages
  in response to faults in the domain are serviced with priority in this
  loop.
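
As a concrete (hypothetical) example, a precopy policy requesting the
postcopy transition after a bounded number of iterations might look like
the sketch below - the iteration limit is arbitrary, and only the
'iteration' field of struct precopy_stats is assumed here:

    static int example_precopy_policy(struct precopy_stats stats, void *data)
    {
        /* Sketch only: give precopy 5 iterations to converge, then ask
         * xc_domain_save() to suspend the guest and transition to the
         * postcopy phase rather than performing a stop-and-copy. */
        if ( stats.iteration >= 5 )
            return XGS_POLICY_POSTCOPY;

        return XGS_POLICY_CONTINUE_PRECOPY;
    }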

The new save callbacks required for this migration phase are stubbed in
libxl for now, to be replaced in a subsequent patch that adds libxl
support for this migration phase.  Support for this phase on the
migration receiver side follows immediately in the next patch.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h     |  82 +++++---
 tools/libxc/xc_sr_common.h         |   5 +-
 tools/libxc/xc_sr_save.c           | 421 ++++++++++++++++++++++++++++++++++---
 tools/libxc/xc_sr_save_x86_hvm.c   |  13 ++
 tools/libxc/xg_save_restore.h      |  16 +-
 tools/libxl/libxl_dom_save.c       |  11 +-
 tools/libxl/libxl_save_msgs_gen.pl |   6 +-
 7 files changed, 487 insertions(+), 67 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 30ffb6f..16441c9 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -63,41 +63,57 @@ struct save_callbacks {
 #define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
 #define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
                                         * remaining dirty pages. */
+#define XGS_POLICY_POSTCOPY         2  /* Suspend the guest and transition into
+                                        * the postcopy phase of the migration. */
     int (*precopy_policy)(struct precopy_stats stats, void *data);
 
-    /* Called after the guest's dirty pages have been
-     *  copied into an output buffer.
-     * Callback function resumes the guest & the device model,
-     *  returns to xc_domain_save.
-     * xc_domain_save then flushes the output buffer, while the
-     *  guest continues to run.
-     */
-    int (*aftercopy)(void* data);
-
-    /* Called after the memory checkpoint has been flushed
-     * out into the network. Typical actions performed in this
-     * callback include:
-     *   (a) send the saved device model state (for HVM guests),
-     *   (b) wait for checkpoint ack
-     *   (c) release the network output buffer pertaining to the acked checkpoint.
-     *   (c) sleep for the checkpoint interval.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint */
-    int (*checkpoint)(void* data);
-
-    /*
-     * Called after the checkpoint callback.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint
-     */
-    int (*wait_checkpoint)(void* data);
-
-    /* Enable qemu-dm logging dirty pages to xen */
-    int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        struct {
+            /* Called during a live migration's transition to the postcopy phase
+             * to yield control of the stream back to a higher layer so it can
+             * transmit records needed for resumption of the guest at the
+             * destination (e.g. device model state, xenstore context) */
+            int (*postcopy_transition)(void *data);
+        };
+
+        struct {
+            /* Called after the guest's dirty pages have been
+             *  copied into an output buffer.
+             * Callback function resumes the guest & the device model,
+             *  returns to xc_domain_save.
+             * xc_domain_save then flushes the output buffer, while the
+             *  guest continues to run.
+             */
+            int (*aftercopy)(void* data);
+
+            /* Called after the memory checkpoint has been flushed
+             * out into the network. Typical actions performed in this
+             * callback include:
+             *   (a) send the saved device model state (for HVM guests),
+             *   (b) wait for checkpoint ack
+             *   (c) release the network output buffer pertaining to the acked
+             *       checkpoint.
+             *   (d) sleep for the checkpoint interval.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint */
+            int (*checkpoint)(void* data);
+
+            /*
+             * Called after the checkpoint callback.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint
+             */
+            int (*wait_checkpoint)(void* data);
+
+            /* Enable qemu-dm logging dirty pages to xen */
+            int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+        };
+    };
 
     /* to be provided as the last argument to each callback function */
     void* data;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index b52355d..0043791 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -204,13 +204,16 @@ struct xc_sr_context
             int policy_decision;
 
             enum {
-                XC_SR_SAVE_BATCH_PRECOPY_PAGE
+                XC_SR_SAVE_BATCH_PRECOPY_PAGE,
+                XC_SR_SAVE_BATCH_POSTCOPY_PFN,
+                XC_SR_SAVE_BATCH_POSTCOPY_PAGE
             } batch_type;
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
             unsigned long nr_deferred_pages;
             xc_hypercall_buffer_t dirty_bitmap_hbuf;
+            unsigned long nr_final_dirty_pages;
         } save;
 
         struct /* Restore data. */
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 6acc8d3..51d7016 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -3,21 +3,28 @@
 
 #include "xc_sr_common.h"
 
-#define MAX_BATCH_SIZE MAX_PRECOPY_BATCH_SIZE
+#define MAX_BATCH_SIZE \
+    max(max(MAX_PRECOPY_BATCH_SIZE, MAX_PFN_BATCH_SIZE), MAX_POSTCOPY_BATCH_SIZE)
 
 static const unsigned batch_sizes[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = MAX_PFN_BATCH_SIZE,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = MAX_POSTCOPY_BATCH_SIZE
 };
 
 static const bool batch_includes_contents[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = true,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = false,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = true
 };
 
 static const uint32_t batch_rec_types[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = REC_TYPE_POSTCOPY_PFNS,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = REC_TYPE_POSTCOPY_PAGE_DATA
 };
 
 /*
@@ -76,6 +83,9 @@ static int write_headers(struct xc_sr_context *ctx, uint16_t guest_type)
 
 WRITE_TRIVIAL_RECORD_FN(end,                 REC_TYPE_END);
 WRITE_TRIVIAL_RECORD_FN(checkpoint,          REC_TYPE_CHECKPOINT);
+WRITE_TRIVIAL_RECORD_FN(postcopy_begin,      REC_TYPE_POSTCOPY_BEGIN);
+WRITE_TRIVIAL_RECORD_FN(postcopy_pfns_begin, REC_TYPE_POSTCOPY_PFNS_BEGIN);
+WRITE_TRIVIAL_RECORD_FN(postcopy_transition, REC_TYPE_POSTCOPY_TRANSITION);
 
 /*
  * This function:
@@ -394,6 +404,108 @@ static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 }
 
 /*
+ * This function:
+ * - flushes the current batch of postcopy pfns into the migration stream
+ * - clears the dirty bits of all pfns with no migratable backing data
+ * - counts the number of pfns that _do_ have migratable backing data, adding
+ *   it to nr_final_dirty_pages
+ */
+static int flush_postcopy_pfns_batch(struct xc_sr_context *ctx)
+{
+    int rc = 0;
+    xen_pfn_t *pfns = ctx->save.batch_pfns, *mfns = NULL, *types = NULL;
+    unsigned i, nr_pfns = ctx->save.nr_batch_pfns;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PFN);
+
+    if ( batch_empty(ctx) )
+        return rc;
+
+    rc = get_batch_info(ctx, &mfns, &types);
+    if ( rc )
+        return rc;
+
+    /* Consider any pages not backed by a physical page of data to have been
+     * 'cleaned' at this point - there's no sense wasting room in a subsequent
+     * postcopy batch to duplicate the type information. */
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            clear_bit(pfns[i], dirty_bitmap);
+            continue;
+        }
+
+        ++ctx->save.nr_final_dirty_pages;
+    }
+
+    rc = write_batch(ctx, mfns, types);
+    free(mfns);
+    free(types);
+
+    if ( !rc )
+    {
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+                                    MAX_BATCH_SIZE *
+                                    sizeof(*ctx->save.batch_pfns));
+    }
+
+    return rc;
+}
+
+/*
+ * This function:
+ * - writes a POSTCOPY_PFNS_BEGIN record into the stream
+ * - writes 0 or more POSTCOPY_PFNS records specifying the subset of domain
+ *   memory that must be migrated during the upcoming postcopy phase of the
+ *   migration
+ * - counts the number of pfns in this subset, storing it in
+ *   nr_final_dirty_pages
+ */
+static int send_postcopy_pfns(struct xc_sr_context *ctx)
+{
+    xen_pfn_t p;
+    int rc;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    /* The true nr_final_dirty_pages is iteratively computed by
+     * flush_postcopy_pfns_batch(), which counts only pages actually backed by
+     * data we need to migrate. */
+    ctx->save.nr_final_dirty_pages = 0;
+
+    rc = write_postcopy_pfns_begin_record(ctx);
+    if ( rc )
+        return rc;
+
+    assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PFN;
+    for ( p = 0; p < ctx->save.p2m_size; ++p )
+    {
+        if ( !test_bit(p, dirty_bitmap) )
+            continue;
+
+        if ( batch_full(ctx) )
+        {
+            rc = flush_postcopy_pfns_batch(ctx);
+            if ( rc )
+                return rc;
+        }
+
+        add_to_batch(ctx, p);
+    }
+
+    return flush_postcopy_pfns_batch(ctx);
+}
+
+/*
  * Pause/suspend the domain, and refresh ctx->dominfo if required.
  */
 static int suspend_domain(struct xc_sr_context *ctx)
@@ -731,15 +843,12 @@ static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
 }
 
 /*
- * Suspend the domain and send dirty memory.
- * This is the last iteration of the live migration and the
- * heart of the checkpointed stream.
+ * Suspend the domain and determine the final set of dirty pages.
  */
-static int suspend_and_send_dirty(struct xc_sr_context *ctx)
+static int suspend_and_check_dirty(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
     xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
-    char *progress_str = NULL;
     int rc;
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
@@ -759,16 +868,6 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         goto out;
     }
 
-    if ( ctx->save.live )
-    {
-        rc = update_progress_string(ctx, &progress_str,
-                                    ctx->save.stats.iteration);
-        if ( rc )
-            goto out;
-    }
-    else
-        xc_set_progress_prefix(xch, "Checkpointed save");
-
     bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
 
     if ( !ctx->save.live && ctx->save.checkpointed == XC_MIG_STREAM_COLO )
@@ -781,20 +880,36 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         }
     }
 
-    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages,
-                          /* precopy */ false);
-    if ( rc )
-        goto out;
+    if ( !ctx->save.live || ctx->save.policy_decision != XGS_POLICY_POSTCOPY )
+    {
+        /* If we aren't transitioning to a postcopy live migration, then rather
+         * than explicitly counting the number of final dirty pages, simply
+         * (somewhat crudely) estimate it as this sum to save time.  If we _are_
+         * about to begin postcopy then we don't bother, since our count must in
+         * that case be exact and we'll work it out later on. */
+        ctx->save.nr_final_dirty_pages =
+            stats.dirty_count + ctx->save.nr_deferred_pages;
+    }
 
     bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
     ctx->save.nr_deferred_pages = 0;
 
  out:
-    xc_set_progress_prefix(xch, NULL);
-    free(progress_str);
     return rc;
 }
 
+static int suspend_and_send_dirty(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = suspend_and_check_dirty(ctx);
+    if ( rc )
+        return rc;
+
+    return send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages,
+                            /* precopy */ false);
+}
+
 static int verify_frames(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
@@ -835,11 +950,13 @@ static int verify_frames(struct xc_sr_context *ctx)
 }
 
 /*
- * Send all domain memory.  This is the heart of the live migration loop.
+ * Send all domain memory, modulo postcopy pages.  This is the heart of the live
+ * migration loop.
  */
 static int send_domain_memory_live(struct xc_sr_context *ctx)
 {
     int rc;
+    xc_interface *xch = ctx->xch;
 
     rc = enable_logdirty(ctx);
     if ( rc )
@@ -849,10 +966,20 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
     if ( rc )
         goto out;
 
-    rc = suspend_and_send_dirty(ctx);
+    rc = suspend_and_check_dirty(ctx);
     if ( rc )
         goto out;
 
+    if ( ctx->save.policy_decision == XGS_POLICY_STOP_AND_COPY )
+    {
+        xc_set_progress_prefix(xch, "Final precopy iteration");
+        rc = send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages,
+                              /* precopy */ false);
+        xc_set_progress_prefix(xch, NULL);
+        if ( rc )
+            goto out;
+    }
+
     if ( ctx->save.debug && ctx->save.checkpointed != XC_MIG_STREAM_NONE )
     {
         rc = verify_frames(ctx);
@@ -864,12 +991,209 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
     return rc;
 }
 
+static int handle_postcopy_faults(struct xc_sr_context *ctx,
+                                  struct xc_sr_record *rec,
+                                  /* OUT */ unsigned long *nr_new_fault_pfns,
+                                  /* OUT */ xen_pfn_t *last_fault_pfn)
+{
+    int rc;
+    unsigned i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *fault_pages = rec->data;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    assert(nr_new_fault_pfns);
+    *nr_new_fault_pfns = 0;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_FAULT);
+    if ( rc )
+        return rc;
+
+    DBGPRINTF("Handling a batch of %"PRIu32" faults!", fault_pages->count);
+
+    assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PAGE);
+    for ( i = 0; i < fault_pages->count; ++i )
+    {
+        if ( test_and_clear_bit(fault_pages->pfn[i], dirty_bitmap) )
+        {
+            if ( batch_full(ctx) )
+            {
+                rc = flush_batch(ctx);
+                if ( rc )
+                    return rc;
+            }
+
+            add_to_batch(ctx, fault_pages->pfn[i]);
+            ++(*nr_new_fault_pfns);
+        }
+    }
+
+    /* _Don't_ flush yet - fill out the rest of the batch. */
+
+    assert(fault_pages->count);
+    *last_fault_pfn = fault_pages->pfn[fault_pages->count - 1];
+    return 0;
+}
+
+/*
+ * Now that the guest has resumed at the destination, send all of the remaining
+ * dirty pages.  Periodically check for pages needed by the destination to make
+ * progress.
+ */
+static int postcopy_domain_memory(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    int recv_fd = ctx->save.recv_fd;
+    int old_flags;
+    struct xc_sr_read_record_context rrctx;
+    struct xc_sr_record rec = { 0, 0, NULL };
+    unsigned long nr_new_fault_pfns;
+    unsigned long pages_remaining = ctx->save.nr_final_dirty_pages;
+    xen_pfn_t last_fault_pfn, p;
+    bool received_postcopy_complete = false;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    read_record_init(&rrctx, ctx);
+
+    /* First, configure the receive stream as non-blocking so we can
+     * periodically poll it for fault requests. */
+    old_flags = fcntl(recv_fd, F_GETFL);
+    if ( old_flags == -1 )
+    {
+        rc = old_flags;
+        goto err;
+    }
+
+    assert(!(old_flags & O_NONBLOCK));
+
+    rc = fcntl(recv_fd, F_SETFL, old_flags | O_NONBLOCK);
+    if ( rc == -1 )
+    {
+        goto err;
+    }
+
+    xc_set_progress_prefix(xch, "Postcopy phase");
+
+    assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PAGE;
+
+    p = 0;
+    while ( pages_remaining )
+    {
+        /* Between (small) batches, poll the receive stream for new
+         * POSTCOPY_FAULT messages. */
+        for ( ; ; )
+        {
+            rc = try_read_record(&rrctx, recv_fd, &rec);
+            if ( rc )
+            {
+                if ( (errno == EAGAIN) || (errno == EWOULDBLOCK) )
+                {
+                    break;
+                }
+
+                goto err;
+            }
+            else
+            {
+                /* Tear down and re-initialize the read record context for the
+                 * next request record. */
+                read_record_destroy(&rrctx);
+                read_record_init(&rrctx, ctx);
+
+                if ( rec.type == REC_TYPE_POSTCOPY_COMPLETE )
+                {
+                    /* The restore side may ultimately not need all of the pages
+                     * we think it does - for example, the guest may release
+                     * some outstanding pages.  If this occurs, we'll receive
+                     * this record before we'd otherwise expect to. */
+                    received_postcopy_complete = true;
+                    goto done;
+                }
+
+                rc = handle_postcopy_faults(ctx, &rec, &nr_new_fault_pfns,
+                                            &last_fault_pfn);
+                if ( rc )
+                    goto err;
+
+                free(rec.data);
+                rec.data = NULL;
+
+                assert(pages_remaining >= nr_new_fault_pfns);
+                pages_remaining -= nr_new_fault_pfns;
+
+                /* To take advantage of any locality present in the postcopy
+                 * faults, continue the background copy process from the newest
+                 * page in the fault batch. */
+                p = (last_fault_pfn + 1) % ctx->save.p2m_size;
+            }
+        }
+
+        /* Now that we've serviced all of the POSTCOPY_FAULT requests we know
+         * about for now, fill out the current batch with background pages. */
+        for ( ;
+              pages_remaining && !batch_full(ctx);
+              p = (p + 1) % ctx->save.p2m_size )
+        {
+            if ( test_and_clear_bit(p, dirty_bitmap) )
+            {
+                add_to_batch(ctx, p);
+                --pages_remaining;
+            }
+        }
+
+        rc = flush_batch(ctx);
+        if ( rc )
+            goto err;
+
+        xc_report_progress_step(
+            xch, ctx->save.nr_final_dirty_pages - pages_remaining,
+            ctx->save.nr_final_dirty_pages);
+    }
+
+ done:
+    /* Revert the receive stream to the (blocking) state we found it in. */
+    rc = fcntl(recv_fd, F_SETFL, old_flags);
+    if ( rc == -1 )
+        goto err;
+
+    if ( !received_postcopy_complete )
+    {
+        /* Flush any outstanding POSTCOPY_FAULT requests from the migration
+         * stream by reading until a POSTCOPY_COMPLETE is received. */
+        do
+        {
+            rc = read_record(ctx, recv_fd, &rec);
+            if ( rc )
+                goto err;
+        } while ( rec.type != REC_TYPE_POSTCOPY_COMPLETE );
+    }
+
+ err:
+    xc_set_progress_prefix(xch, NULL);
+    free(rec.data);
+    read_record_destroy(&rrctx);
+    return rc;
+}
+
 /*
  * Checkpointed save.
  */
 static int send_domain_memory_checkpointed(struct xc_sr_context *ctx)
 {
-    return suspend_and_send_dirty(ctx);
+    int rc;
+    xc_interface *xch = ctx->xch;
+
+    xc_set_progress_prefix(xch, "Checkpointed save");
+    rc = suspend_and_send_dirty(ctx);
+    xc_set_progress_prefix(xch, NULL);
+
+    return rc;
 }
 
 /*
@@ -998,11 +1322,50 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
             goto err;
         }
 
+        /* End-of-checkpoint records are handled differently in the case of
+         * postcopy migration, so we need to alert the destination before
+         * sending them. */
+        if ( ctx->save.live &&
+             ctx->save.policy_decision == XGS_POLICY_POSTCOPY )
+        {
+            rc = write_postcopy_begin_record(ctx);
+            if ( rc )
+                goto err;
+        }
+
         rc = ctx->save.ops.end_of_checkpoint(ctx);
         if ( rc )
             goto err;
 
-        if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE )
+        if ( ctx->save.live &&
+             ctx->save.policy_decision == XGS_POLICY_POSTCOPY )
+        {
+            xc_report_progress_single(xch, "Beginning postcopy transition");
+
+            rc = send_postcopy_pfns(ctx);
+            if ( rc )
+                goto err;
+
+            rc = write_postcopy_transition_record(ctx);
+            if ( rc )
+                goto err;
+
+            /* Yield control to libxl to finish the transition.  Note that this
+             * callback returns _non-zero_ upon success. */
+            rc = ctx->save.callbacks->postcopy_transition(
+                ctx->save.callbacks->data);
+            if ( !rc )
+            {
+                rc = -1;
+                goto err;
+            }
+
+            /* When libxl is done, we can begin the postcopy loop. */
+            rc = postcopy_domain_memory(ctx);
+            if ( rc )
+                goto err;
+        }
+        else if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE )
         {
             /*
              * We have now completed the initial live portion of the checkpoint
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
index ea4b780..13df25b 100644
--- a/tools/libxc/xc_sr_save_x86_hvm.c
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -92,6 +92,9 @@ static int write_hvm_params(struct xc_sr_context *ctx)
     unsigned int i;
     int rc;
 
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
     for ( i = 0; i < ARRAY_SIZE(params); i++ )
     {
         uint32_t index = params[i];
@@ -106,6 +109,16 @@ static int write_hvm_params(struct xc_sr_context *ctx)
 
         if ( value != 0 )
         {
+            if ( ctx->save.live &&
+                 ctx->save.policy_decision == XGS_POLICY_POSTCOPY &&
+                 ( index == HVM_PARAM_CONSOLE_PFN ||
+                   index == HVM_PARAM_STORE_PFN ||
+                   index == HVM_PARAM_IOREQ_PFN ||
+                   index == HVM_PARAM_BUFIOREQ_PFN ||
+                   index == HVM_PARAM_PAGING_RING_PFN ) &&
+                 test_and_clear_bit(value, dirty_bitmap) )
+                --ctx->save.nr_final_dirty_pages;
+
             entries[hdr.count].index = index;
             entries[hdr.count].value = value;
             hdr.count++;
diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h
index 40debf6..9f5b223 100644
--- a/tools/libxc/xg_save_restore.h
+++ b/tools/libxc/xg_save_restore.h
@@ -24,7 +24,21 @@
 ** We process save/restore/migrate in batches of pages; the below
 ** determines how many pages we (at maximum) deal with in each batch.
 */
-#define MAX_PRECOPY_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
+#define MAX_PRECOPY_BATCH_SIZE ((size_t)1024U)   /* up to 1024 pages (4MB) */
+
+/*
+** We process the migration postcopy transition in batches of pfns to ensure
+** that we stay within the record size bound.  Because these records contain
+** only pfns (and _not_ their contents), we can accommodate many more of them
+** in a batch.
+*/
+#define MAX_PFN_BATCH_SIZE ((4U << 20) / sizeof(uint64_t)) /* up to 512k pfns */
+
+/*
+** The postcopy background copy uses a smaller batch size to ensure it can
+** quickly respond to remote faults.
+*/
+#define MAX_POSTCOPY_BATCH_SIZE ((size_t)64U)
 
 /* When pinning page tables at the end of restore, we also use batching. */
 #define MAX_PIN_BATCH  1024
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 10d5012..4ef9ca5 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -349,6 +349,12 @@ static int libxl__save_live_migration_simple_precopy_policy(
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
+static void libxl__save_live_migration_postcopy_transition_callback(void *user)
+{
+    /* XXX we're not yet ready to deal with this */
+    assert(0);
+}
+
 /*----- main code for saving, in order of execution -----*/
 
 void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
@@ -419,8 +425,11 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
             dss->xcflags |= XCFLAGS_CHECKPOINT_COMPRESS;
     }
 
-    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
+    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE) {
         callbacks->suspend = libxl__domain_suspend_callback;
+        callbacks->postcopy_transition =
+            libxl__save_live_migration_postcopy_transition_callback;
+    }
 
     callbacks->precopy_policy = libxl__save_live_migration_simple_precopy_policy;
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 50c97b4..5647b97 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -33,7 +33,8 @@ our @msgs = (
                                               'xen_pfn_t', 'console_gfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
-    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
+    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ],
+    [ 11, 'scxA',   "postcopy_transition", [] ]
 );
 
 #----------------------------------------
@@ -225,6 +226,7 @@ foreach my $sr (qw(save restore)) {
 
     f_decl("${setcallbacks}_${sr}", 'helper', 'void',
            "(struct ${sr}_callbacks *cbs, unsigned cbflags)");
+    f_more("${setcallbacks}_${sr}", "    memset(cbs, 0, sizeof(*cbs));\n");
 
     f_more("${receiveds}_${sr}",
            <<END_ALWAYS.($debug ? <<END_DEBUG : '').<<END_ALWAYS);
@@ -335,7 +337,7 @@ END_ALWAYS
         my $c_v = "(1u<<$msgnum)";
         my $c_cb = "cbs->$name";
         $f_more_sr->("    if ($c_cb) cbflags |= $c_v;\n", $enumcallbacks);
-        $f_more_sr->("    $c_cb = (cbflags & $c_v) ? ${encode}_${name} : 0;\n",
+        $f_more_sr->("    if (cbflags & $c_v) $c_cb = ${encode}_${name};\n",
                      $setcallbacks);
     }
     $f_more_sr->("        return 1;\n    }\n\n");
-- 
2.7.4



* [PATCH RFC 15/20] libxc/migration: implement the receiver side of postcopy live migration
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (13 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 14/20] libxc/migration: implement the sender side of postcopy live migration Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 16/20] libxl/libxl_stream_write.c: track callback chains with an explicit phase Joshua Otto
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

Add the receive-side logic for a new 'postcopy' phase in the live
migration algorithm.

To support this migration phase:
- Augment the main restore record-processing logic to recognize and
  handle the postcopy-initiation records.
- Add the core logic for the phase, postcopy_restore(), which marks as
  paged-out all pfns reported by the sender as outstanding at the
  beginning of the phase, and subsequently serves as a pager for this
  subset of memory by forwarding paging requests to the migration sender
  and filling the outstanding domain memory as it is received.
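
The eviction step at the beginning of the phase reduces, in essence, to the
following per-pfn sketch (error reporting elided; xch, domid and pfn are as
used by the batch-processing logic in this patch):

    /* Mark a populated pfn as paged-out before the guest resumes.
     * Neither call may fail at this point: the guest isn't executing
     * yet, so no conflicting foreign or hypervisor mappings of the
     * page can exist. */
    if ( xc_mem_paging_nominate(xch, domid, pfn) < 0 ||
         xc_mem_paging_evict(xch, domid, pfn) < 0 )
        return -1;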

The new restore callbacks required for this migration phase are stubbed
in libxl for now, to be replaced in a subsequent patch that adds libxl
support for this migration phase.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h      |  63 ++-
 tools/libxc/xc_sr_common.h          |  82 +++-
 tools/libxc/xc_sr_restore.c         | 890 +++++++++++++++++++++++++++++++++++-
 tools/libxc/xc_sr_restore_x86_hvm.c |  38 +-
 tools/libxl/libxl_create.c          |  15 +
 tools/libxl/libxl_save_msgs_gen.pl  |   2 +-
 6 files changed, 1049 insertions(+), 41 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 16441c9..684afc8 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -146,35 +146,50 @@ struct restore_callbacks {
      */
     int (*suspend)(void* data);
 
-    /* Called after the secondary vm is ready to resume.
-     * Callback function resumes the guest & the device model,
-     * returns to xc_domain_restore.
-     */
-    int (*aftercopy)(void* data);
+    union {
+        struct {
+            /* Called upon receipt of the POSTCOPY_TRANSITION record in the
+             * stream to yield control of the stream to the higher layer so that
+             * the remaining data needed to resume the domain in the postcopy
+             * phase can be obtained.  Returns as soon as the higher layer is
+             * finished with the stream.
+             *
+             * Returns 1 on success, 0 on failure. */
+            int (*postcopy_transition)(void *data);
+        };
+
+        struct {
+            /* Called after the secondary vm is ready to resume.
+             * Callback function resumes the guest & the device model,
+             * returns to xc_domain_restore.
+             */
+            int (*aftercopy)(void* data);
 
-    /* A checkpoint record has been found in the stream.
-     * returns: */
+            /* A checkpoint record has been found in the stream.
+             * returns: */
 #define XGR_CHECKPOINT_ERROR    0 /* Terminate processing */
 #define XGR_CHECKPOINT_SUCCESS  1 /* Continue reading more data from the stream */
 #define XGR_CHECKPOINT_FAILOVER 2 /* Failover and resume VM */
-    int (*checkpoint)(void* data);
-
-    /*
-     * Called after the checkpoint callback.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint
-     */
-    int (*wait_checkpoint)(void* data);
+            int (*checkpoint)(void* data);
 
-    /*
-     * callback to send store gfn and console gfn to xl
-     * if we want to resume vm before xc_domain_save()
-     * exits.
-     */
-    void (*restore_results)(xen_pfn_t store_gfn, xen_pfn_t console_gfn,
-                            void *data);
+            /*
+             * Called after the checkpoint callback.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint
+             */
+            int (*wait_checkpoint)(void* data);
+
+            /*
+             * callback to send store gfn and console gfn to xl
+             * if we want to resume vm before xc_domain_save()
+             * exits.
+             */
+            void (*restore_results)(xen_pfn_t store_gfn, xen_pfn_t console_gfn,
+                                    void *data);
+        };
+    };
 
     /* to be provided as the last argument to each callback function */
     void* data;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 0043791..cdb933c 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -3,6 +3,10 @@
 
 #include <stdbool.h>
 
+#include <xenevtchn.h>
+
+#include <xen/vm_event.h>
+
 #include "xg_private.h"
 #include "xg_save_restore.h"
 #include "xc_dom.h"
@@ -232,6 +236,82 @@ struct xc_sr_context
             uint32_t guest_type;
             uint32_t guest_page_size;
 
+            /* Is this a postcopy live migration? */
+            bool postcopy;
+
+            struct xc_sr_restore_paging
+            {
+                xenevtchn_handle *xce_handle;
+                int port;
+                vm_event_back_ring_t back_ring;
+                uint32_t evtchn_port;
+                void *ring_page;
+                void *buffer;
+
+                struct xc_sr_pending_postcopy_request
+                {
+                    xen_pfn_t pfn; /* == INVALID_PFN when not in use */
+
+                    /* As from vm_event_request_t */
+                    uint32_t flags;
+                    uint32_t vcpu_id;
+                } *pending_requests;
+
+                /* The total count of outstanding and requested pfns.  The
+                 * postcopy phase is complete when this reaches 0. */
+                unsigned nr_pending_pfns;
+
+                /* Prior to the receipt of the first POSTCOPY_PFNS record, all
+                 * pfns are 'invalid', meaning that we don't (yet) believe that
+                 * they need to be migrated as part of the postcopy phase.
+                 *
+                 * Pfns received in POSTCOPY_PFNS records become 'outstanding',
+                 * meaning that they must be migrated but haven't yet been
+                 * requested, received or dropped.
+                 *
+                 * A pfn transitions from outstanding to requested when we
+                 * receive a request for it on the paging ring and request it
+                 * from the sender, before having received it.  There is at
+                 * least one valid entry in pending_requests for each requested
+                 * pfn.
+                 *
+                 * A pfn transitions from either outstanding or requested to
+                 * ready when its contents are received.  Responses to all
+                 * previous pager requests for this pfn are pushed at this time,
+                 * and subsequent pager requests for this pfn can be responded
+                 * to immediately.
+                 *
+                 * A pfn transitions from outstanding to dropped if we're
+                 * notified on the ring of the drop.  We track this explicitly
+                 * so that we don't panic upon subsequently receiving the
+                 * contents of this page from the sender.
+                 *
+                 * In summary, the per-pfn postcopy state machine is:
+                 *
+                 * invalid -> outstanding -> requested -> ready
+                 *                |                        ^
+                 *                +------------------------+
+                 *                |
+                 *                +-----------> dropped
+                 *
+                 * The state of each pfn is tracked using these four bitmaps. */
+                unsigned long *outstanding_pfns;
+                unsigned long *requested_pfns;
+                unsigned long *ready_pfns;
+                unsigned long *dropped_pfns;
+
+                /* Used to accumulate batches of pfns for which we must forward
+                 * paging requests to the sender. */
+                uint64_t *request_batch;
+
+                /* For teardown. */
+                bool evtchn_bound, evtchn_opened, paging_enabled, buffer_locked;
+
+                /* So we can sanity-check the sequence of postcopy records in
+                 * the stream. */
+                bool ready;
+            } paging;
+
             /* Plain VM, or checkpoints over time. */
             int checkpointed;
 
@@ -255,7 +335,7 @@ struct xc_sr_context
              * INPUT:  evtchn & domid
              * OUTPUT: gfn
              */
-            xen_pfn_t    xenstore_gfn,    console_gfn;
+            xen_pfn_t    xenstore_gfn,    console_gfn,    paging_ring_gfn;
             unsigned int xenstore_evtchn, console_evtchn;
             domid_t      xenstore_domid,  console_domid;
 
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 4e3c472..38c218f 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -1,6 +1,7 @@
 #include <arpa/inet.h>
 
 #include <assert.h>
+#include <poll.h>
 
 #include "xc_sr_common.h"
 
@@ -78,6 +79,30 @@ static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
     return test_bit(pfn, ctx->restore.populated_pfns);
 }
 
+static int pfn_bitmap_realloc(struct xc_sr_context *ctx, unsigned long **bitmap,
+                              size_t old_sz, size_t new_sz)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long *p;
+
+    assert(bitmap);
+    if ( *bitmap )
+    {
+        p = realloc(*bitmap, new_sz);
+        if ( !p )
+        {
+            ERROR("Failed to realloc restore bitmap");
+            errno = ENOMEM;
+            return -1;
+        }
+
+        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+        *bitmap = p;
+    }
+
+    return 0;
+}
+
 /*
  * Set a pfn as populated, expanding the tracking structures if needed. To
  * avoid realloc()ing too excessively, the size increased to the nearest power
@@ -85,13 +110,21 @@ static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
  */
 static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
-    xc_interface *xch = ctx->xch;
+    int rc = 0;
 
     if ( pfn > ctx->restore.max_populated_pfn )
     {
         xen_pfn_t new_max;
         size_t old_sz, new_sz;
-        unsigned long *p;
+        unsigned i;
+        unsigned long **bitmaps[] =
+        {
+            &ctx->restore.populated_pfns,
+            &ctx->restore.paging.outstanding_pfns,
+            &ctx->restore.paging.requested_pfns,
+            &ctx->restore.paging.ready_pfns,
+            &ctx->restore.paging.dropped_pfns
+        };
 
         /* Round up to the nearest power of two larger than pfn, less 1. */
         new_max = pfn;
@@ -106,17 +139,13 @@ static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
 
         old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
         new_sz = bitmap_size(new_max + 1);
-        p = realloc(ctx->restore.populated_pfns, new_sz);
-        if ( !p )
-        {
-            ERROR("Failed to realloc populated bitmap");
-            errno = ENOMEM;
-            return -1;
-        }
 
-        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+        for ( i = 0; i < ARRAY_SIZE(bitmaps) && !rc; ++i )
+            rc = pfn_bitmap_realloc(ctx, bitmaps[i], old_sz, new_sz);
+
+        if ( rc )
+            return rc;
 
-        ctx->restore.populated_pfns    = p;
         ctx->restore.max_populated_pfn = new_max;
     }
 
@@ -484,6 +513,811 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
 }
 
 /*
+ * To prepare for entry to the postcopy phase of live migration:
+ * - enable paging on the domain, and set up the paging ring and event channel
+ * - allocate a locked and aligned paging buffer
+ * - allocate the postcopy page bookkeeping structures
+ */
+static int postcopy_paging_setup(struct xc_sr_context *ctx)
+{
+    int rc;
+    unsigned i;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    xc_interface *xch = ctx->xch;
+
+    /* Sanity-check the migration stream. */
+    if ( !ctx->restore.postcopy )
+    {
+        ERROR("Received POSTCOPY_PFNS_BEGIN before POSTCOPY_BEGIN");
+        return -1;
+    }
+
+    paging->ring_page = xc_vm_event_enable(xch, ctx->domid,
+                                           HVM_PARAM_PAGING_RING_PFN,
+                                           &paging->evtchn_port);
+    if ( !paging->ring_page )
+    {
+        PERROR("Failed to enable paging");
+        return -1;
+    }
+    paging->paging_enabled = true;
+
+    paging->xce_handle = xenevtchn_open(NULL, 0);
+    if ( !paging->xce_handle )
+    {
+        ERROR("Failed to open paging evtchn");
+        return -1;
+    }
+    paging->evtchn_opened = true;
+
+    rc = xenevtchn_bind_interdomain(paging->xce_handle, ctx->domid,
+                                    paging->evtchn_port);
+    if ( rc < 0 )
+    {
+        ERROR("Failed to bind paging evtchn");
+        return rc;
+    }
+    paging->evtchn_bound = true;
+    paging->port = rc;
+
+    SHARED_RING_INIT((vm_event_sring_t *)paging->ring_page);
+    BACK_RING_INIT(&paging->back_ring, (vm_event_sring_t *)paging->ring_page,
+                   PAGE_SIZE);
+
+    errno = posix_memalign(&paging->buffer, PAGE_SIZE, PAGE_SIZE);
+    if ( errno != 0 )
+    {
+        PERROR("Failed to allocate paging buffer");
+        return -1;
+    }
+
+    rc = mlock(paging->buffer, PAGE_SIZE);
+    if ( rc < 0 )
+    {
+        PERROR("Failed to lock paging buffer");
+        return rc;
+    }
+    paging->buffer_locked = true;
+
+    paging->outstanding_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+    paging->requested_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+    paging->ready_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+    paging->dropped_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+
+    paging->pending_requests = malloc(RING_SIZE(&paging->back_ring) *
+                                      sizeof(*paging->pending_requests));
+    paging->request_batch = malloc(RING_SIZE(&paging->back_ring) *
+                                   sizeof(*paging->request_batch));
+    if ( !paging->outstanding_pfns ||
+         !paging->requested_pfns ||
+         !paging->ready_pfns ||
+         !paging->dropped_pfns ||
+         !paging->pending_requests ||
+         !paging->request_batch )
+    {
+        PERROR("Failed to allocate pfn state tracking buffers");
+        return -1;
+    }
+
+    /* All slots are initially empty. */
+    for ( i = 0; i < RING_SIZE(&paging->back_ring); ++i )
+        paging->pending_requests[i].pfn = INVALID_PFN;
+
+    paging->ready = true;
+
+    return 0;
+}
+
+static void postcopy_paging_cleanup(struct xc_sr_context *ctx)
+{
+    int rc;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    xc_interface *xch = ctx->xch;
+
+    if ( paging->ring_page )
+        munmap(paging->ring_page, PAGE_SIZE);
+
+    if ( paging->paging_enabled )
+    {
+        rc = xc_vm_event_control(xch, ctx->domid, XEN_VM_EVENT_DISABLE,
+                                 XEN_DOMCTL_VM_EVENT_OP_PAGING, NULL);
+        if ( rc != 0 )
+            ERROR("Failed to disable paging");
+    }
+
+    if ( paging->evtchn_bound )
+    {
+        rc = xenevtchn_unbind(paging->xce_handle, paging->port);
+        if ( rc != 0 )
+            ERROR("Failed to unbind event port");
+    }
+
+    if ( paging->evtchn_opened )
+    {
+        rc = xenevtchn_close(paging->xce_handle);
+        if ( rc != 0 )
+            ERROR("Failed to close event channel");
+    }
+
+    if ( paging->buffer )
+    {
+        if ( paging->buffer_locked )
+            munlock(paging->buffer, PAGE_SIZE);
+
+        free(paging->buffer);
+    }
+
+    free(paging->outstanding_pfns);
+    free(paging->requested_pfns);
+    free(paging->ready_pfns);
+    free(paging->dropped_pfns);
+    free(paging->pending_requests);
+    free(paging->request_batch);
+}
+
+/* Helpers to query and transition the state of postcopy pfns. */
+#define CHECK_STATE_BITMAP_FN(state)                                      \
+    static inline bool postcopy_pfn_ ## state (struct xc_sr_context *ctx, \
+                                               xen_pfn_t pfn)             \
+    {                                                                     \
+        assert(pfn <= ctx->restore.max_populated_pfn);                    \
+        return test_bit(pfn, ctx->restore.paging. state ## _pfns);        \
+    }
+
+CHECK_STATE_BITMAP_FN(outstanding);
+CHECK_STATE_BITMAP_FN(requested);
+CHECK_STATE_BITMAP_FN(ready);
+CHECK_STATE_BITMAP_FN(dropped);
+
+static inline bool postcopy_pfn_invalid(struct xc_sr_context *ctx,
+                                        xen_pfn_t pfn)
+{
+    return !postcopy_pfn_outstanding(ctx, pfn) &&
+           !postcopy_pfn_requested(ctx, pfn) &&
+           !postcopy_pfn_ready(ctx, pfn) &&
+           !postcopy_pfn_dropped(ctx, pfn);
+}
+
+static inline void mark_postcopy_pfn_outstanding(struct xc_sr_context *ctx,
+                                                 xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_invalid(ctx, pfn));
+
+    set_bit(pfn, ctx->restore.paging.outstanding_pfns);
+}
+
+static inline void mark_postcopy_pfn_requested(struct xc_sr_context *ctx,
+                                               xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_outstanding(ctx, pfn));
+
+    clear_bit(pfn, ctx->restore.paging.outstanding_pfns);
+    set_bit(pfn, ctx->restore.paging.requested_pfns);
+}
+
+static inline void mark_postcopy_pfn_ready(struct xc_sr_context *ctx,
+                                           xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_outstanding(ctx, pfn) ||
+           postcopy_pfn_requested(ctx, pfn));
+
+    clear_bit(pfn, ctx->restore.paging.outstanding_pfns);
+    clear_bit(pfn, ctx->restore.paging.requested_pfns);
+    set_bit(pfn, ctx->restore.paging.ready_pfns);
+}
+
+static inline void mark_postcopy_pfn_dropped(struct xc_sr_context *ctx,
+                                             xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_outstanding(ctx, pfn));
+
+    clear_bit(pfn, ctx->restore.paging.outstanding_pfns);
+    set_bit(pfn, ctx->restore.paging.dropped_pfns);
+}
+
+static int process_postcopy_pfns(struct xc_sr_context *ctx, unsigned count,
+                                 xen_pfn_t *pfns, uint32_t *types)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    xen_pfn_t *bpfns = NULL, bpfn;
+    int rc;
+    unsigned i, nr_pages;
+
+    rc = populate_pfns(ctx, count, pfns, types);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", count);
+        goto err;
+    }
+
+    set_page_types(ctx, count, pfns, types);
+
+    rc = filter_pages(ctx, count, pfns, types, &nr_pages, &bpfns);
+    if ( rc )
+    {
+        ERROR("Failed to filter mfns for batch of %u pages", count);
+        goto err;
+    }
+
+    /* Nothing to do? */
+    if ( nr_pages == 0 )
+        goto done;
+
+    /* Fully evict all backed pages in the batch. */
+    for ( i = 0; i < nr_pages; ++i )
+    {
+        bpfn = bpfns[i];
+        rc = -1;
+
+        /* We should never see the same pfn twice at this stage.  */
+        if ( !postcopy_pfn_invalid(ctx, bpfn) )
+        {
+            ERROR("Duplicate postcopy pfn %"PRI_xen_pfn, bpfn);
+            goto err;
+        }
+
+        /* We now consider this pfn 'outstanding' - pending, and not yet
+         * requested. */
+        mark_postcopy_pfn_outstanding(ctx, bpfn);
+        ++paging->nr_pending_pfns;
+
+        /* Neither nomination nor eviction can be permitted to fail - the guest
+         * isn't yet running, so a failure would imply a foreign or hypervisor
+         * mapping on the page, and that would be bogus because the migration
+         * isn't yet complete. */
+        rc = xc_mem_paging_nominate(xch, ctx->domid, bpfn);
+        if ( rc < 0 )
+        {
+            PERROR("Error nominating postcopy pfn %"PRI_xen_pfn, bpfn);
+            goto err;
+        }
+
+        rc = xc_mem_paging_evict(xch, ctx->domid, bpfn);
+        if ( rc < 0 )
+        {
+            PERROR("Error evicting postcopy pfn %"PRI_xen_pfn, bpfn);
+            goto err;
+        }
+    }
+
+ done:
+    rc = 0;
+
+ err:
+    free(bpfns);
+
+    return rc;
+}
+
+static int handle_postcopy_pfns(struct xc_sr_context *ctx,
+                                struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+    unsigned pages_of_data;
+    int rc;
+    xen_pfn_t *pfns = NULL;
+    uint32_t *types = NULL;
+
+    /* Sanity-check the migration stream. */
+    if ( !ctx->restore.paging.ready )
+    {
+        ERROR("Received POSTCOPY_PFNS record before POSTCOPY_PFNS_BEGIN");
+        rc = -1;
+        goto err;
+    }
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_PFNS);
+    if ( rc )
+        goto err;
+
+    rc = decode_pages_record(ctx, pages, &pfns, &types, &pages_of_data);
+    if ( rc )
+        goto err;
+
+    if ( rec->length != (sizeof(*pages) + (sizeof(uint64_t) * pages->count)) )
+    {
+        ERROR("POSTCOPY_PFNS record wrong size: length %u, expected "
+              "%zu + %zu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count));
+        rc = -1;
+        goto err;
+    }
+
+    rc = process_postcopy_pfns(ctx, pages->count, pfns, types);
+
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+static int handle_postcopy_transition(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    void *data = ctx->restore.callbacks->data;
+
+    /* Sanity-check the migration stream. */
+    if ( !ctx->restore.paging.ready )
+    {
+        ERROR("Received POSTCOPY_TRANSITION record before POSTCOPY_PFNS_BEGIN");
+        return -1;
+    }
+
+    rc = ctx->restore.ops.stream_complete(ctx);
+    if ( rc )
+        return rc;
+
+    ctx->restore.callbacks->restore_results(ctx->restore.xenstore_gfn,
+                                            ctx->restore.console_gfn,
+                                            data);
+
+    /* Asynchronously resume the guest.  We'll return when we've been handed
+     * back control of the stream, so that we can begin filling in the
+     * outstanding postcopy page data and forwarding guest requests for specific
+     * pages. */
+    IPRINTF("Postcopy transition: resuming guest");
+    return ctx->restore.callbacks->postcopy_transition(data) ? 0 : -1;
+}
+
+static int postcopy_load_page(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                              void *page_data)
+{
+    int rc;
+    unsigned i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    struct xc_sr_pending_postcopy_request *preq;
+    vm_event_response_t rsp;
+    vm_event_back_ring_t *back_ring = &paging->back_ring;
+
+    assert(postcopy_pfn_outstanding(ctx, pfn) ||
+           postcopy_pfn_requested(ctx, pfn));
+
+    memcpy(paging->buffer, page_data, PAGE_SIZE);
+    rc = xc_mem_paging_load(ctx->xch, ctx->domid, pfn, paging->buffer);
+    if ( rc < 0 )
+    {
+        PERROR("Failed to paging load pfn %"PRI_xen_pfn, pfn);
+        return rc;
+    }
+
+    if ( postcopy_pfn_requested(ctx, pfn) )
+    {
+        for ( i = 0; i < RING_SIZE(back_ring); ++i )
+        {
+            preq = &paging->pending_requests[i];
+            if ( preq->pfn != pfn )
+                continue;
+
+            /* Put the response on the ring. */
+            rsp = (vm_event_response_t)
+            {
+                .version = VM_EVENT_INTERFACE_VERSION,
+                .vcpu_id = preq->vcpu_id,
+                .flags   = (preq->flags & VM_EVENT_FLAG_VCPU_PAUSED),
+                .reason  = VM_EVENT_REASON_MEM_PAGING,
+                .u       = { .mem_paging = { .gfn = pfn } }
+            };
+
+            memcpy(RING_GET_RESPONSE(back_ring, back_ring->rsp_prod_pvt),
+                   &rsp, sizeof(rsp));
+            ++back_ring->rsp_prod_pvt;
+
+            /* And free the pending request slot. */
+            preq->pfn = INVALID_PFN;
+        }
+    }
+
+    --paging->nr_pending_pfns;
+    mark_postcopy_pfn_ready(ctx, pfn);
+    return 0;
+}
+
+static int process_postcopy_page_data(struct xc_sr_context *ctx, unsigned count,
+                                      xen_pfn_t *pfns, uint32_t *types,
+                                      void *page_data)
+{
+    int rc;
+    unsigned i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    bool push_responses = false;
+
+    for ( i = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_XTAB:
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+            ERROR("Received postcopy pfn %"PRI_xen_pfn
+                  " with invalid type %"PRIu32, pfns[i], types[i]);
+            return -1;
+        default:
+            if ( postcopy_pfn_invalid(ctx, pfns[i]) )
+            {
+                ERROR("Expected pfn %"PRI_xen_pfn" to be invalid", pfns[i]);
+                return -1;
+            }
+            else if ( postcopy_pfn_ready(ctx, pfns[i]) )
+            {
+                ERROR("pfn %"PRI_xen_pfn" already received", pfns[i]);
+                return -1;
+            }
+            else if ( postcopy_pfn_dropped(ctx, pfns[i]) )
+            {
+                /* Nothing to do - move on to the next page. */
+                page_data += PAGE_SIZE;
+            }
+            else
+            {
+                if ( postcopy_pfn_requested(ctx, pfns[i]) )
+                {
+                    DBGPRINTF("Received requested pfn %"PRI_xen_pfn, pfns[i]);
+                    push_responses = true;
+                }
+
+                rc = postcopy_load_page(ctx, pfns[i], page_data);
+                if ( rc )
+                    return rc;
+
+                page_data += PAGE_SIZE;
+            }
+            break;
+        }
+    }
+
+    if ( push_responses )
+    {
+        /* We put at least one response on the ring as a result of processing
+         * this batch of pages, so we need to push them and kick the ring event
+         * channel. */
+        RING_PUSH_RESPONSES(&paging->back_ring);
+
+        rc = xenevtchn_notify(paging->xce_handle, paging->port);
+        if ( rc )
+        {
+            ERROR("Failed to notify paging event channel");
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+static int handle_postcopy_page_data(struct xc_sr_context *ctx,
+                                     struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+    unsigned pages_of_data;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL;
+    uint32_t *types = NULL;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_PAGE_DATA);
+    if ( rc )
+        goto err;
+
+    rc = decode_pages_record(ctx, pages, &pfns, &types, &pages_of_data);
+    if ( rc )
+        goto err;
+
+    if ( rec->length != (sizeof(*pages) +
+                         (sizeof(uint64_t) * pages->count) +
+                         (PAGE_SIZE * pages_of_data)) )
+    {
+        ERROR("POSTCOPY_PAGE_DATA record wrong size: length %u, expected "
+              "%zu + %zu + %lu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
+        rc = -1;
+        goto err;
+    }
+
+    rc = process_postcopy_page_data(ctx, pages->count, pfns, types,
+                                    &pages->pfn[pages->count]);
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+static int forward_postcopy_paging_requests(struct xc_sr_context *ctx,
+                                            unsigned nr_batch_requests)
+{
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    size_t batchsz = nr_batch_requests * sizeof(*paging->request_batch);
+    struct xc_sr_rec_pages_header phdr =
+    {
+        .count = nr_batch_requests
+    };
+    struct xc_sr_record rec =
+    {
+        .type   = REC_TYPE_POSTCOPY_FAULT,
+        .length = sizeof(phdr),
+        .data   = &phdr
+    };
+
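+    /* The pfn batch itself is carried as the record body, written out
+     * immediately after the record header. */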
+    return write_split_record(ctx, ctx->restore.send_back_fd, &rec,
+                              paging->request_batch, batchsz);
+}
+
+static int handle_postcopy_paging_requests(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    struct xc_sr_pending_postcopy_request *preq;
+    vm_event_back_ring_t *back_ring = &paging->back_ring;
+    vm_event_request_t req;
+    vm_event_response_t rsp;
+    xen_pfn_t pfn;
+    bool put_responses = false, drop_requested;
+    unsigned i, nr_batch_requests = 0;
+
+    while ( RING_HAS_UNCONSUMED_REQUESTS(back_ring) )
+    {
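+        /* Take a private copy of the request before acting on it, rather
+         * than reading it in place on the shared ring. */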
+        RING_COPY_REQUEST(back_ring, back_ring->req_cons, &req);
+        ++back_ring->req_cons;
+
+        drop_requested = !!(req.u.mem_paging.flags & MEM_PAGING_DROP_PAGE);
+        pfn = req.u.mem_paging.gfn;
+
+        DBGPRINTF("Postcopy page fault! %"PRI_xen_pfn, pfn);
+
+        if ( postcopy_pfn_invalid(ctx, pfn) )
+        {
+            ERROR("pfn %"PRI_xen_pfn" does not need to be migrated", pfn);
+            rc = -1;
+            goto err;
+        }
+        else if ( postcopy_pfn_ready(ctx, pfn) || drop_requested )
+        {
+            if ( drop_requested )
+            {
+                if ( postcopy_pfn_outstanding(ctx, pfn) )
+                {
+                    mark_postcopy_pfn_dropped(ctx, pfn);
+                    --paging->nr_pending_pfns;
+                }
+                else
+                {
+                    ERROR("Pager requesting we drop non-paged "
+                          "(or previously-requested) pfn %"PRI_xen_pfn, pfn);
+                    rc = -1;
+                    goto err;
+                }
+            }
+
+            /* This page has already been loaded (or has been dropped), so we can
+             * respond immediately. */
+            rsp = (vm_event_response_t)
+            {
+                .version = VM_EVENT_INTERFACE_VERSION,
+                .vcpu_id = req.vcpu_id,
+                .flags   = (req.flags & VM_EVENT_FLAG_VCPU_PAUSED),
+                .reason  = VM_EVENT_REASON_MEM_PAGING,
+                .u       = { .mem_paging = { .gfn = pfn } }
+            };
+
+            memcpy(RING_GET_RESPONSE(back_ring, back_ring->rsp_prod_pvt),
+                   &rsp, sizeof(rsp));
+            ++back_ring->rsp_prod_pvt;
+
+            put_responses = true;
+        }
+        else /* implies not dropped AND either outstanding or requested */
+        {
+            if ( postcopy_pfn_outstanding(ctx, pfn) )
+            {
+                /* This is the first time this pfn has been requested. */
+                mark_postcopy_pfn_requested(ctx, pfn);
+
+                paging->request_batch[nr_batch_requests] = pfn;
+                ++nr_batch_requests;
+            }
+
+            /* Find a free pending_requests slot. */
+            for ( i = 0; i < RING_SIZE(back_ring); ++i )
+            {
+                preq = &paging->pending_requests[i];
+                if ( preq->pfn == INVALID_PFN )
+                {
+                    /* Claim this slot. */
+                    preq->pfn = pfn;
+
+                    preq->flags = req.flags;
+                    preq->vcpu_id = req.vcpu_id;
+                    break;
+                }
+            }
+
+            /* We _must_ find a free slot - there cannot be more outstanding
+             * requests than there are slots in the ring. */
+            assert(i < RING_SIZE(back_ring));
+        }
+    }
+
+    if ( put_responses )
+    {
+        RING_PUSH_RESPONSES(back_ring);
+
+        rc = xenevtchn_notify(paging->xce_handle, paging->port);
+        if ( rc )
+        {
+            ERROR("Failed to notify paging event channel");
+            goto err;
+        }
+    }
+
+    if ( nr_batch_requests )
+    {
+        rc = forward_postcopy_paging_requests(ctx, nr_batch_requests);
+        if ( rc )
+        {
+            ERROR("Failed to forward postcopy paging requests");
+            goto err;
+        }
+    }
+
+    rc = 0;
+
+ err:
+    return rc;
+}
+
+static int write_postcopy_complete_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record rec = { REC_TYPE_POSTCOPY_COMPLETE };
+
+    return write_record(ctx, ctx->restore.send_back_fd, &rec);
+}
+
+static int postcopy_restore(struct xc_sr_context *ctx)
+{
+    int rc;
+    int recv_fd = ctx->fd;
+    int old_flags;
+    int port;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    struct xc_sr_read_record_context rrctx;
+    struct xc_sr_record rec = { 0, 0, NULL };
+    struct pollfd pfds[] =
+    {
+        { .fd = xenevtchn_fd(paging->xce_handle), .events = POLLIN },
+        { .fd = recv_fd,                          .events = POLLIN }
+    };
+
+    assert(ctx->restore.postcopy);
+    assert(paging->xce_handle);
+
+    read_record_init(&rrctx, ctx);
+
+    /* For the duration of the postcopy loop, configure the receive stream as
+     * non-blocking. */
+    old_flags = fcntl(recv_fd, F_GETFL);
+    if ( old_flags == -1 )
+    {
+        rc = old_flags;
+        goto err;
+    }
+
+    assert(!(old_flags & O_NONBLOCK));
+
+    rc = fcntl(recv_fd, F_SETFL, old_flags | O_NONBLOCK);
+    if ( rc == -1 )
+    {
+        goto err;
+    }
+
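+    /* Main postcopy event loop: alternate between draining newly-arrived page
+     * data from the migration stream and servicing fault notifications from
+     * the paging ring, until no postcopy pfns remain pending. */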
+    while ( paging->nr_pending_pfns )
+    {
+        rc = poll(pfds, ARRAY_SIZE(pfds), -1);
+        if ( rc < 0 )
+        {
+            if ( errno == EINTR )
+                continue;
+
+            PERROR("Failed to poll the pager event channel/restore stream");
+            goto err;
+        }
+
+        /* Fill in any newly received page data first, on the off chance that
+         * new pager requests are for that data. */
+        if ( rc && pfds[1].revents & POLLIN )
+        {
+            rc = try_read_record(&rrctx, recv_fd, &rec);
+            if ( rc && (errno != EAGAIN) && (errno != EWOULDBLOCK) )
+            {
+                goto err;
+            }
+            else if ( !rc )
+            {
+                read_record_destroy(&rrctx);
+                read_record_init(&rrctx, ctx);
+
+                rc = handle_postcopy_page_data(ctx, &rec);
+                if ( rc )
+                    goto err;
+
+                free(rec.data);
+                rec.data = NULL;
+            }
+        }
+
+        if ( rc && pfds[0].revents & POLLIN )
+        {
+            port = xenevtchn_pending(paging->xce_handle);
+            if ( port == -1 )
+            {
+                ERROR("Failed to read port from pager event channel");
+                rc = -1;
+                goto err;
+            }
+
+            rc = xenevtchn_unmask(paging->xce_handle, port);
+            if ( rc != 0 )
+            {
+                ERROR("Failed to unmask pager event channel port");
+                goto err;
+            }
+
+            rc = handle_postcopy_paging_requests(ctx);
+            if ( rc )
+                goto err;
+        }
+    }
+
+    /* At this point, all outstanding postcopy pages have been loaded.  We now
+     * need only flush any outstanding requests that may have accumulated in the
+     * ring while we were processing the final POSTCOPY_PAGE_DATA records. */
+    rc = handle_postcopy_paging_requests(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_postcopy_complete_record(ctx);
+    if ( rc )
+        goto err;
+
+    /* End-of-stream synchronization: make the receive stream blocking again,
+     * and wait to receive what must be the END record. */
+    rc = fcntl(recv_fd, F_SETFL, old_flags);
+    if ( rc == -1 )
+        goto err;
+
+    rc = read_record(ctx, recv_fd, &rec);
+    if ( rc )
+    {
+        goto err;
+    }
+    else if ( rec.type != REC_TYPE_END )
+    {
+        ERROR("Expected end of stream, received %s", rec_type_to_str(rec.type));
+        rc = -1;
+        goto err;
+    }
+
+ err:
+    /* If _we_ fail here, we can't safely synchronize with the completion of
+     * domain resumption because it might be waiting for us (to fulfill a pager
+     * request).  Since we therefore can't know whether or not the domain was
+     * unpaused, just abruptly bail and let the sender assume the worst. */
+    free(rec.data);
+    read_record_destroy(&rrctx);
+
+    return rc;
+}
+
+/*
  * Send checkpoint dirty pfn list to primary.
  */
 static int send_checkpoint_dirty_pfn_list(struct xc_sr_context *ctx)
@@ -702,6 +1536,25 @@ static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
         rc = handle_checkpoint(ctx);
         break;
 
+    case REC_TYPE_POSTCOPY_BEGIN:
+        if ( ctx->restore.postcopy )
+            rc = -1;
+        else
+            ctx->restore.postcopy = true;
+        break;
+
+    case REC_TYPE_POSTCOPY_PFNS_BEGIN:
+        rc = postcopy_paging_setup(ctx);
+        break;
+
+    case REC_TYPE_POSTCOPY_PFNS:
+        rc = handle_postcopy_pfns(ctx, rec);
+        break;
+
+    case REC_TYPE_POSTCOPY_TRANSITION:
+        rc = handle_postcopy_transition(ctx);
+        break;
+
     default:
         rc = ctx->restore.ops.process_record(ctx, rec);
         break;
@@ -774,6 +1627,10 @@ static void cleanup(struct xc_sr_context *ctx)
     if ( ctx->restore.checkpointed == XC_MIG_STREAM_COLO )
         xc_hypercall_buffer_free_pages(xch, dirty_bitmap,
                                    NRPAGES(bitmap_size(ctx->restore.p2m_size)));
+
+    if ( ctx->restore.postcopy )
+        postcopy_paging_cleanup(ctx);
+
     free(ctx->restore.buffered_records);
     free(ctx->restore.populated_pfns);
     if ( ctx->restore.ops.cleanup(ctx) )
@@ -836,7 +1693,8 @@ static int restore(struct xc_sr_context *ctx)
                 goto err;
         }
 
-    } while ( rec.type != REC_TYPE_END );
+    } while ( rec.type != REC_TYPE_END &&
+              rec.type != REC_TYPE_POSTCOPY_TRANSITION );
 
  remus_failover:
 
@@ -847,6 +1705,14 @@ static int restore(struct xc_sr_context *ctx)
         IPRINTF("COLO Failover");
         goto done;
     }
+    else if ( ctx->restore.postcopy )
+    {
+        rc = postcopy_restore(ctx);
+        if ( rc )
+            goto err;
+
+        goto done;
+    }
 
     /*
      * With Remus, if we reach here, there must be some error on primary,
diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
index 49d22c7..7be3218 100644
--- a/tools/libxc/xc_sr_restore_x86_hvm.c
+++ b/tools/libxc/xc_sr_restore_x86_hvm.c
@@ -27,6 +27,27 @@ static int handle_hvm_context(struct xc_sr_context *ctx,
     return 0;
 }
 
+static int handle_hvm_magic_page(struct xc_sr_context *ctx,
+                                 struct xc_sr_rec_hvm_params_entry *entry)
+{
+    int rc;
+    xen_pfn_t pfn = entry->value;
+
+    if ( ctx->restore.postcopy )
+    {
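+        /* During a postcopy restore the magic pages may not have been
+         * populated yet (their contents may still be in transit), so they
+         * must be explicitly populated before being cleared or used. */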
+        rc = populate_pfns(ctx, 1, &pfn, NULL);
+        if ( rc )
+            return rc;
+    }
+
+    if ( entry->index != HVM_PARAM_PAGING_RING_PFN )
+    {
+        xc_clear_domain_page(ctx->xch, ctx->domid, pfn);
+    }
+
+    return 0;
+}
+
 /*
  * Process an HVM_PARAMS record from the stream.
  */
@@ -52,18 +73,29 @@ static int handle_hvm_params(struct xc_sr_context *ctx,
         {
         case HVM_PARAM_CONSOLE_PFN:
             ctx->restore.console_gfn = entry->value;
-            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            rc = handle_hvm_magic_page(ctx, entry);
             break;
         case HVM_PARAM_STORE_PFN:
             ctx->restore.xenstore_gfn = entry->value;
-            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            rc = handle_hvm_magic_page(ctx, entry);
+            break;
+        case HVM_PARAM_PAGING_RING_PFN:
+            ctx->restore.paging_ring_gfn = entry->value;
+            rc = handle_hvm_magic_page(ctx, entry);
             break;
         case HVM_PARAM_IOREQ_PFN:
         case HVM_PARAM_BUFIOREQ_PFN:
-            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            rc = handle_hvm_magic_page(ctx, entry);
             break;
         }
 
+        if ( rc )
+        {
+            PERROR("populate/clear magic HVM page %"PRId64" = 0x%016"PRIx64,
+                   entry->index, entry->value);
+            return rc;
+        }
+
         rc = xc_hvm_param_set(xch, ctx->domid, entry->index, entry->value);
         if ( rc < 0 )
         {
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index b65c971..8f4af0a 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -745,6 +745,8 @@ static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc);
 
+static void domcreate_postcopy_transition_callback(void *user);
+
 static void domcreate_launch_dm(libxl__egc *egc, libxl__multidev *aodevs,
                                 int ret);
 
@@ -1097,6 +1099,11 @@ static void domcreate_bootloader_done(libxl__egc *egc,
             libxl__remus_restore_setup(egc, dcs);
             /* fall through */
         case LIBXL_CHECKPOINTED_STREAM_NONE:
+            /* When the restore helper initiates the postcopy transition, pick
+             * up in domcreate_postcopy_transition_callback() */
+            callbacks->postcopy_transition =
+                domcreate_postcopy_transition_callback;
+
             libxl__stream_read_start(egc, &dcs->srs);
         }
         return;
@@ -1106,6 +1113,14 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     domcreate_stream_done(egc, &dcs->srs, rc);
 }
 
+/* ----- postcopy live migration ----- */
+
+static void domcreate_postcopy_transition_callback(void *user)
+{
+    /* XXX we're not ready to deal with this yet */
+    assert(0);
+}
+
 void libxl__srm_callout_callback_restore_results(xen_pfn_t store_mfn,
           xen_pfn_t console_mfn, void *user)
 {
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 5647b97..7f59e03 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -34,7 +34,7 @@ our @msgs = (
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
     [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ],
-    [ 11, 'scxA',   "postcopy_transition", [] ]
+    [ 11, 'srcxA',  "postcopy_transition", [] ]
 );
 
 #----------------------------------------
-- 
2.7.4



* [PATCH RFC 16/20] libxl/libxl_stream_write.c: track callback chains with an explicit phase
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (14 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 15/20] libxc/migration: implement the receiver " Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 17/20] libxl/libxl_stream_read.c: " Joshua Otto
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

There are three callback chains through libxl_stream_write: the 'normal'
straight-through save path initiated by libxl__stream_write_start(), the
iterated checkpoint path initiated each time by
libxl__stream_write_start_checkpoint(), and the (short) back-channel
checkpoint path initiated by libxl__stream_write_checkpoint_state().
These paths share significant common code but handle failure and
completion slightly differently, so it is necessary to keep track of
the callback chain currently in progress and act accordingly at various
points.

Until now, a collection of booleans in the stream write state has been
used to indicate the current callback chain.  However, the set of
callback chains is really better described by an enum, since only one
callback chain can actually be active at one time.  In anticipation of
the addition of a new chain for postcopy live migration, refactor the
existing logic to use an enum rather than booleans for callback chain
tracking.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
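Not part of the patch - a reviewer's sketch of the pattern being adopted.
The enum values are borrowed from the real stream state, but the helper is
hypothetical: a single phase enum cannot represent impossible states such
as in_checkpoint and in_checkpoint_state both being true, and a switch
over it makes the per-chain behaviour explicit.

    #include <stddef.h>

    enum sws_phase {
        SWS_PHASE_NORMAL,
        SWS_PHASE_CHECKPOINT,
        SWS_PHASE_CHECKPOINT_STATE
    };

    /* Hypothetical helper: which end record (if any) closes each phase? */
    static const char *phase_end_record_name(enum sws_phase phase)
    {
        switch (phase) {
        case SWS_PHASE_NORMAL:           return "end record";
        case SWS_PHASE_CHECKPOINT:       return "checkpoint end record";
        case SWS_PHASE_CHECKPOINT_STATE: return NULL; /* no end record */
        }
        return NULL; /* unreachable */
    }
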
 tools/libxl/libxl_internal.h     |  7 ++-
 tools/libxl/libxl_stream_write.c | 96 ++++++++++++++++++----------------------
 2 files changed, 48 insertions(+), 55 deletions(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 45d607a..e99d2ef 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3201,9 +3201,12 @@ struct libxl__stream_write_state {
     /* Private */
     int rc;
     bool running;
-    bool in_checkpoint;
+    enum {
+        SWS_PHASE_NORMAL,
+        SWS_PHASE_CHECKPOINT,
+        SWS_PHASE_CHECKPOINT_STATE
+    } phase;
     bool sync_teardown;  /* Only used to coordinate shutdown on error path. */
-    bool in_checkpoint_state;
     libxl__save_helper_state shs;
 
     /* Main stream-writing data. */
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index c96a6a2..8f2a1c9 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -89,12 +89,9 @@ static void emulator_context_read_done(libxl__egc *egc,
                                        int rc, int onwrite, int errnoval);
 static void emulator_context_record_done(libxl__egc *egc,
                                          libxl__stream_write_state *stream);
-static void write_end_record(libxl__egc *egc,
-                             libxl__stream_write_state *stream);
+static void write_phase_end_record(libxl__egc *egc,
+                                   libxl__stream_write_state *stream);
 
-/* Event chain unique to checkpointed streams. */
-static void write_checkpoint_end_record(libxl__egc *egc,
-                                        libxl__stream_write_state *stream);
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 
@@ -213,7 +210,7 @@ void libxl__stream_write_init(libxl__stream_write_state *stream)
 
     stream->rc = 0;
     stream->running = false;
-    stream->in_checkpoint = false;
+    stream->phase = SWS_PHASE_NORMAL;
     stream->sync_teardown = false;
     FILLZERO(stream->dc);
     stream->record_done_callback = NULL;
@@ -294,9 +291,9 @@ void libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                           libxl__stream_write_state *stream)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint);
+    assert(stream->phase == SWS_PHASE_NORMAL);
     assert(!stream->back_channel);
-    stream->in_checkpoint = true;
+    stream->phase = SWS_PHASE_CHECKPOINT;
 
     write_emulator_xenstore_record(egc, stream);
 }
@@ -431,12 +428,8 @@ static void emulator_xenstore_record_done(libxl__egc *egc,
 
     if (dss->type == LIBXL_DOMAIN_TYPE_HVM)
         write_emulator_context_record(egc, stream);
-    else {
-        if (stream->in_checkpoint)
-            write_checkpoint_end_record(egc, stream);
-        else
-            write_end_record(egc, stream);
-    }
+    else
+        write_phase_end_record(egc, stream);
 }
 
 static void write_emulator_context_record(libxl__egc *egc,
@@ -534,34 +527,35 @@ static void emulator_context_record_done(libxl__egc *egc,
     free(stream->emu_body);
     stream->emu_body = NULL;
 
-    if (stream->in_checkpoint)
-        write_checkpoint_end_record(egc, stream);
-    else
-        write_end_record(egc, stream);
+    write_phase_end_record(egc, stream);
 }
 
-static void write_end_record(libxl__egc *egc,
-                             libxl__stream_write_state *stream)
+static void write_phase_end_record(libxl__egc *egc,
+                                   libxl__stream_write_state *stream)
 {
     struct libxl__sr_rec_hdr rec;
+    sws_record_done_cb cb;
+    const char *what;
 
     FILLZERO(rec);
-    rec.type = REC_TYPE_END;
-
-    setup_write(egc, stream, "end record",
-                &rec, NULL, stream_success);
-}
-
-static void write_checkpoint_end_record(libxl__egc *egc,
-                                        libxl__stream_write_state *stream)
-{
-    struct libxl__sr_rec_hdr rec;
 
-    FILLZERO(rec);
-    rec.type = REC_TYPE_CHECKPOINT_END;
+    switch (stream->phase) {
+    case SWS_PHASE_NORMAL:
+        rec.type = REC_TYPE_END;
+        what     = "end record";
+        cb       = stream_success;
+        break;
+    case SWS_PHASE_CHECKPOINT:
+        rec.type = REC_TYPE_CHECKPOINT_END;
+        what     = "checkpoint end record";
+        cb       = checkpoint_end_record_done;
+        break;
+    default:
+        /* SWS_PHASE_CHECKPOINT_STATE has no end record */
+        assert(false);
+    }
 
-    setup_write(egc, stream, "checkpoint end record",
-                &rec, NULL, checkpoint_end_record_done);
+    setup_write(egc, stream, what, &rec, NULL, cb);
 }
 
 static void checkpoint_end_record_done(libxl__egc *egc,
@@ -582,21 +576,20 @@ static void stream_complete(libxl__egc *egc,
 {
     assert(stream->running);
 
-    if (stream->in_checkpoint) {
+    switch (stream->phase) {
+    case SWS_PHASE_NORMAL:
+        stream_done(egc, stream, rc);
+        break;
+    case SWS_PHASE_CHECKPOINT:
         assert(rc);
-
         /*
          * If an error is encountered while in a checkpoint, pass it
          * back to libxc.  The failure will come back around to us via
          * libxl__xc_domain_save_done()
          */
         checkpoint_done(egc, stream, rc);
-        return;
-    }
-
-    if (stream->in_checkpoint_state) {
-        assert(rc);
-
+        break;
+    case SWS_PHASE_CHECKPOINT_STATE:
         /*
          * If an error is encountered while in a checkpoint, pass it
          * back to libxc.  The failure will come back around to us via
@@ -606,17 +599,15 @@ static void stream_complete(libxl__egc *egc,
          *    libxl__stream_write_abort()
          */
         checkpoint_state_done(egc, stream, rc);
-        return;
+        break;
     }
-
-    stream_done(egc, stream, rc);
 }
 
 static void stream_done(libxl__egc *egc,
                         libxl__stream_write_state *stream, int rc)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint_state);
+    assert(stream->phase != SWS_PHASE_CHECKPOINT_STATE);
     stream->running = false;
 
     if (stream->emu_carefd)
@@ -640,9 +631,9 @@ static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_write_state *stream,
                             int rc)
 {
-    assert(stream->in_checkpoint);
+    assert(stream->phase == SWS_PHASE_CHECKPOINT);
 
-    stream->in_checkpoint = false;
+    stream->phase = SWS_PHASE_NORMAL;
     stream->checkpoint_callback(egc, stream, rc);
 }
 
@@ -699,9 +690,8 @@ void libxl__stream_write_checkpoint_state(libxl__egc *egc,
     struct libxl__sr_rec_hdr rec;
 
     assert(stream->running);
-    assert(!stream->in_checkpoint);
-    assert(!stream->in_checkpoint_state);
-    stream->in_checkpoint_state = true;
+    assert(stream->phase == SWS_PHASE_NORMAL);
+    stream->phase = SWS_PHASE_CHECKPOINT_STATE;
 
     FILLZERO(rec);
     rec.type = REC_TYPE_CHECKPOINT_STATE;
@@ -720,8 +710,8 @@ static void write_checkpoint_state_done(libxl__egc *egc,
 static void checkpoint_state_done(libxl__egc *egc,
                                   libxl__stream_write_state *stream, int rc)
 {
-    assert(stream->in_checkpoint_state);
-    stream->in_checkpoint_state = false;
+    assert(stream->phase == SWS_PHASE_CHECKPOINT_STATE);
+    stream->phase = SWS_PHASE_NORMAL;
     stream->checkpoint_callback(egc, stream, rc);
 }
 
-- 
2.7.4



* [PATCH RFC 17/20] libxl/libxl_stream_read.c: track callback chains with an explicit phase
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (15 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 16/20] libxl/libxl_stream_write.c: track callback chains with an explicit phase Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 18/20] libxl/migration: implement the sender side of postcopy live migration Joshua Otto
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

As the previous patch did for libxl_stream_write, do the same for
libxl_stream_read.  libxl_stream_read already has a notion of phase for
its record-buffering behaviour - this is combined with the callback
chain phase.  Again, this is done to support the addition of a new
callback chain for postcopy live migration.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl_internal.h    |  7 ++--
 tools/libxl/libxl_stream_read.c | 83 +++++++++++++++++++++--------------------
 2 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index e99d2ef..c754706 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3123,9 +3123,7 @@ struct libxl__stream_read_state {
     /* Private */
     int rc;
     bool running;
-    bool in_checkpoint;
     bool sync_teardown; /* Only used to coordinate shutdown on error path. */
-    bool in_checkpoint_state;
     libxl__save_helper_state shs;
     libxl__conversion_helper_state chs;
 
@@ -3135,8 +3133,9 @@ struct libxl__stream_read_state {
     LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; /* NOGC */
     enum {
         SRS_PHASE_NORMAL,
-        SRS_PHASE_BUFFERING,
-        SRS_PHASE_UNBUFFERING,
+        SRS_PHASE_CHECKPOINT_BUFFERING,
+        SRS_PHASE_CHECKPOINT_UNBUFFERING,
+        SRS_PHASE_CHECKPOINT_STATE
     } phase;
     bool recursion_guard;
 
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 89c2f21..4cb553e 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -29,14 +29,15 @@
  * processed, and all records will be processed in queue order.
  *
  * Internal states:
- *           running  phase       in_         record   incoming
- *                                checkpoint  _queue   _record
+ *           running  phase                   record   incoming
+ *                                            _queue   _record
  *
- * Undefined    undef  undef        undef       undef    undef
- * Idle         false  undef        false       0        0
- * Active       true   NORMAL       false       0/1      0/partial
- * Active       true   BUFFERING    true        any      0/partial
- * Active       true   UNBUFFERING  true        any      0
+ * Undefined    undef  undef                    undef    undef
+ * Idle         false  undef                    0        0
+ * Active       true   NORMAL                   0/1      0/partial
+ * Active       true   CHECKPOINT_BUFFERING     any      0/partial
+ * Active       true   CHECKPOINT_UNBUFFERING   any      0
+ * Active       true   CHECKPOINT_STATE         0/1      0/partial
  *
  * While reading data from the stream, 'dc' is active and a callback
  * is expected.  Most actions in process_record() start a callback of
@@ -48,12 +49,12 @@
  *   Records are read one at time and immediately processed.  (The
  *   record queue will not contain more than a single record.)
  *
- * PHASE_BUFFERING:
+ * PHASE_CHECKPOINT_BUFFERING:
  *   This phase is used in checkpointed streams, when libxc signals
  *   the presence of a checkpoint in the stream.  Records are read and
  *   buffered until a CHECKPOINT_END record has been read.
  *
- * PHASE_UNBUFFERING:
+ * PHASE_CHECKPOINT_UNBUFFERING:
  *   Once a CHECKPOINT_END record has been read, all buffered records
  *   are processed.
  *
@@ -172,6 +173,12 @@ static void checkpoint_state_done(libxl__egc *egc,
 
 /*----- Helpers -----*/
 
+static inline bool stream_in_checkpoint(libxl__stream_read_state *stream)
+{
+    return stream->phase == SRS_PHASE_CHECKPOINT_BUFFERING ||
+           stream->phase == SRS_PHASE_CHECKPOINT_UNBUFFERING;
+}
+
 /* Helper to set up reading some data from the stream. */
 static int setup_read(libxl__stream_read_state *stream,
                       const char *what, void *ptr, size_t nr_bytes,
@@ -210,7 +217,6 @@ void libxl__stream_read_init(libxl__stream_read_state *stream)
 
     stream->rc = 0;
     stream->running = false;
-    stream->in_checkpoint = false;
     stream->sync_teardown = false;
     FILLZERO(stream->dc);
     FILLZERO(stream->hdr);
@@ -297,10 +303,9 @@ void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                          libxl__stream_read_state *stream)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint);
+    assert(stream->phase == SRS_PHASE_NORMAL);
 
-    stream->in_checkpoint = true;
-    stream->phase = SRS_PHASE_BUFFERING;
+    stream->phase = SRS_PHASE_CHECKPOINT_BUFFERING;
 
     /*
      * Libxc has handed control of the fd to us.  Start reading some
@@ -392,6 +397,7 @@ static void stream_continue(libxl__egc *egc,
 
     switch (stream->phase) {
     case SRS_PHASE_NORMAL:
+    case SRS_PHASE_CHECKPOINT_STATE:
         /*
          * Normal phase (regular migration or restore from file):
          *
@@ -416,9 +422,9 @@ static void stream_continue(libxl__egc *egc,
         }
         break;
 
-    case SRS_PHASE_BUFFERING: {
+    case SRS_PHASE_CHECKPOINT_BUFFERING: {
         /*
-         * Buffering phase (checkpointed streams only):
+         * Buffering phase:
          *
          * logically:
          *   do { read_record(); } while ( not CHECKPOINT_END );
@@ -431,8 +437,6 @@ static void stream_continue(libxl__egc *egc,
         libxl__sr_record_buf *rec = LIBXL_STAILQ_LAST(
             &stream->record_queue, libxl__sr_record_buf, entry);
 
-        assert(stream->in_checkpoint);
-
         if (!rec || (rec->hdr.type != REC_TYPE_CHECKPOINT_END)) {
             setup_read_record(egc, stream);
             break;
@@ -442,19 +446,18 @@ static void stream_continue(libxl__egc *egc,
          * There are now some number of buffered records, with a
          * CHECKPOINT_END at the end. Start processing them all.
          */
-        stream->phase = SRS_PHASE_UNBUFFERING;
+        stream->phase = SRS_PHASE_CHECKPOINT_UNBUFFERING;
     }
         /* FALLTHROUGH */
-    case SRS_PHASE_UNBUFFERING:
+    case SRS_PHASE_CHECKPOINT_UNBUFFERING:
         /*
-         * Unbuffering phase (checkpointed streams only):
+         * Unbuffering phase:
          *
          * logically:
          *   do { process_record(); } while ( not CHECKPOINT_END );
          *
          * Process all records collected during the buffering phase.
          */
-        assert(stream->in_checkpoint);
 
         while (process_record(egc, stream))
             ; /*
@@ -625,7 +628,7 @@ static bool process_record(libxl__egc *egc,
         break;
 
     case REC_TYPE_CHECKPOINT_END:
-        if (!stream->in_checkpoint) {
+        if (!stream_in_checkpoint(stream)) {
             LOG(ERROR, "Unexpected CHECKPOINT_END record in stream");
             rc = ERROR_FAIL;
             goto err;
@@ -634,7 +637,7 @@ static bool process_record(libxl__egc *egc,
         break;
 
     case REC_TYPE_CHECKPOINT_STATE:
-        if (!stream->in_checkpoint_state) {
+        if (stream->phase != SRS_PHASE_CHECKPOINT_STATE) {
             LOG(ERROR, "Unexpected CHECKPOINT_STATE record in stream");
             rc = ERROR_FAIL;
             goto err;
@@ -743,7 +746,12 @@ static void stream_complete(libxl__egc *egc,
 {
     assert(stream->running);
 
-    if (stream->in_checkpoint) {
+    switch (stream->phase) {
+    case SRS_PHASE_NORMAL:
+        stream_done(egc, stream, rc);
+        break;
+    case SRS_PHASE_CHECKPOINT_BUFFERING:
+    case SRS_PHASE_CHECKPOINT_UNBUFFERING:
         assert(rc);
 
         /*
@@ -752,10 +760,8 @@ static void stream_complete(libxl__egc *egc,
          * libxl__xc_domain_restore_done()
          */
         checkpoint_done(egc, stream, rc);
-        return;
-    }
-
-    if (stream->in_checkpoint_state) {
+        break;
+    case SRS_PHASE_CHECKPOINT_STATE:
         assert(rc);
 
         /*
@@ -767,10 +773,8 @@ static void stream_complete(libxl__egc *egc,
          *    libxl__stream_read_abort()
          */
         checkpoint_state_done(egc, stream, rc);
-        return;
+        break;
     }
-
-    stream_done(egc, stream, rc);
 }
 
 static void checkpoint_done(libxl__egc *egc,
@@ -778,18 +782,17 @@ static void checkpoint_done(libxl__egc *egc,
 {
     int ret;
 
-    assert(stream->in_checkpoint);
+    assert(stream_in_checkpoint(stream));
 
     if (rc == 0)
         ret = XGR_CHECKPOINT_SUCCESS;
-    else if (stream->phase == SRS_PHASE_BUFFERING)
+    else if (stream->phase == SRS_PHASE_CHECKPOINT_BUFFERING)
         ret = XGR_CHECKPOINT_FAILOVER;
     else
         ret = XGR_CHECKPOINT_ERROR;
 
     stream->checkpoint_callback(egc, stream, ret);
 
-    stream->in_checkpoint = false;
     stream->phase = SRS_PHASE_NORMAL;
 }
 
@@ -799,8 +802,7 @@ static void stream_done(libxl__egc *egc,
     libxl__sr_record_buf *rec, *trec;
 
     assert(stream->running);
-    assert(!stream->in_checkpoint);
-    assert(!stream->in_checkpoint_state);
+    assert(stream->phase == SRS_PHASE_NORMAL);
     stream->running = false;
 
     if (stream->incoming_record)
@@ -955,9 +957,8 @@ void libxl__stream_read_checkpoint_state(libxl__egc *egc,
                                          libxl__stream_read_state *stream)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint);
-    assert(!stream->in_checkpoint_state);
-    stream->in_checkpoint_state = true;
+    assert(stream->phase == SRS_PHASE_NORMAL);
+    stream->phase = SRS_PHASE_CHECKPOINT_STATE;
 
     setup_read_record(egc, stream);
 }
@@ -965,8 +966,8 @@ void libxl__stream_read_checkpoint_state(libxl__egc *egc,
 static void checkpoint_state_done(libxl__egc *egc,
                                   libxl__stream_read_state *stream, int rc)
 {
-    assert(stream->in_checkpoint_state);
-    stream->in_checkpoint_state = false;
+    assert(stream->phase == SRS_PHASE_CHECKPOINT_STATE);
+    stream->phase = SRS_PHASE_NORMAL;
     stream->checkpoint_callback(egc, stream, rc);
 }
 
-- 
2.7.4



* [PATCH RFC 18/20] libxl/migration: implement the sender side of postcopy live migration
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (16 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 17/20] libxl/libxl_stream_read.c: " Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 19/20] libxl/migration: implement the receiver " Joshua Otto
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

To make the libxl sender capable of supporting postcopy live migration:
- Add a postcopy transition callback chain through the stream writer (this
  callback chain is nearly identical to the checkpoint callback chain, and
  differs meaningfully only in its failure/completion behaviour)
- Wire this callback chain up to the xc postcopy callback entries in the domain
  save logic.
- Add parameters to libxl_domain_live_migrate() to permit bidirectional
  communication between the sender and receiver and enable the caller to reason
  about the safety of recovery from a postcopy failure.

No mechanism is introduced yet to enable library clients to induce a postcopy
live migration - this will follow after the libxl postcopy receiver logic.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
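Not part of the patch - a sketch of how a caller might use the revised
prototype (mirroring the xl_migrate.c change below); ctx, domid, send_fd,
recv_fd, precopy_iterations and precopy_dirty_threshold are assumed to be
set up already, and error handling is elided.

    bool postcopy_transitioned = false;
    int rc = libxl_domain_live_migrate(ctx, domid, send_fd,
                                       LIBXL_SUSPEND_LIVE,
                                       precopy_iterations,
                                       precopy_dirty_threshold,
                                       recv_fd,
                                       &postcopy_transitioned,
                                       NULL /* ao_how: synchronous */);
    if (rc && postcopy_transitioned) {
        /* The domain may already have executed at the destination, so
         * resuming it locally is no longer a safe way to recover. */
    }
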
 docs/specs/libxl-migration-stream.pandoc | 19 ++++++++-
 tools/libxl/libxl.h                      |  4 +-
 tools/libxl/libxl_dom_save.c             | 25 +++++++++++-
 tools/libxl/libxl_domain.c               | 25 ++++++++----
 tools/libxl/libxl_internal.h             | 21 ++++++++--
 tools/libxl/libxl_sr_stream_format.h     | 13 +++---
 tools/libxl/libxl_stream_write.c         | 69 ++++++++++++++++++++++++++++++--
 tools/xl/xl_migrate.c                    |  5 ++-
 8 files changed, 155 insertions(+), 26 deletions(-)

diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
index a1ba1ac..8d00cd7 100644
--- a/docs/specs/libxl-migration-stream.pandoc
+++ b/docs/specs/libxl-migration-stream.pandoc
@@ -2,7 +2,8 @@
 % Andrew Cooper <<andrew.cooper3@citrix.com>>
   Wen Congyang <<wency@cn.fujitsu.com>>
   Yang Hongyang <<hongyang.yang@easystack.cn>>
-% Revision 2
+  Joshua Otto <<jtotto@uwaterloo.ca>>
+% Revision 3
 
 Introduction
 ============
@@ -123,7 +124,9 @@ type         0x00000000: END
 
              0x00000005: CHECKPOINT_STATE
 
-             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
+             0x00000006: POSTCOPY_TRANSITION_END
+
+             0x00000007 - 0x7FFFFFFF: Reserved for future _mandatory_
              records.
 
              0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
@@ -304,6 +307,18 @@ While Secondary is running in below loop:
     b. Send _CHECKPOINT\_SVM\_SUSPENDED_ to primary
 4. Checkpoint
 
+POSTCOPY\_TRANSITION\_END
+-------------------------
+
+A postcopy transition end record marks the end of a postcopy transition in a
+libxl live migration stream.  It indicates that control of the stream should be
+returned to libxc for the postcopy memory migration phase.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The postcopy transition end record contains no fields; its body_length is 0.
+
 Future Extensions
 =================
 
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 84ac96a..99d187b 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1375,10 +1375,12 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd,
 #define LIBXL_SUSPEND_DEBUG 1
 #define LIBXL_SUSPEND_LIVE 2
 
-int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int fd,
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
                               int flags, /* LIBXL_SUSPEND_* */
                               unsigned int precopy_iterations,
                               unsigned int precopy_dirty_threshold,
+                              int recv_fd,
+                              bool *postcopy_transitioned, /* OUT */
                               const libxl_asyncop_how *ao_how)
                               LIBXL_EXTERNAL_CALLERS_ONLY;
 
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 4ef9ca5..9e565ae 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -349,10 +349,31 @@ static int libxl__save_live_migration_simple_precopy_policy(
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *sws, int rc);
+
 static void libxl__save_live_migration_postcopy_transition_callback(void *user)
 {
-    /* XXX we're not yet ready to deal with this */
-    assert(0);
+    libxl__save_helper_state *shs = user;
+    libxl__stream_write_state *sws = CONTAINER_OF(shs, *sws, shs);
+    sws->postcopy_transition_callback = postcopy_transition_done;
+    libxl__stream_write_start_postcopy_transition(shs->egc, sws);
+}
+
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *sws,
+                                     int rc)
+{
+    libxl__domain_save_state *dss = sws->dss;
+
+    /* Past here, it's _possible_ that the domain may execute at the
+     * destination, so - unless we're given positive confirmation by the
+     * destination that it failed to resume there - we must assume it has. */
+    assert(dss->postcopy_transitioned);
+    *dss->postcopy_transitioned = !rc;
+
+    /* Return control to libxc. */
+    libxl__xc_domain_saverestore_async_callback_done(egc, &sws->shs, !rc);
 }
 
 /*----- main code for saving, in order of execution -----*/
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index b1cf643..ea778a6 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -488,7 +488,8 @@ static void domain_suspend_cb(libxl__egc *egc,
 
 static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
                              unsigned int precopy_iterations,
-                             unsigned int precopy_dirty_threshold,
+                             unsigned int precopy_dirty_threshold, int recv_fd,
+                             bool *postcopy_transitioned,
                              const libxl_asyncop_how *ao_how)
 {
     AO_CREATE(ctx, domid, ao_how);
@@ -508,6 +509,8 @@ static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
 
     dss->domid = domid;
     dss->fd = fd;
+    dss->recv_fd = recv_fd;
+    dss->postcopy_transitioned = postcopy_transitioned;
     dss->type = type;
     dss->live = flags & LIBXL_SUSPEND_LIVE;
     dss->debug = flags & LIBXL_SUSPEND_DEBUG;
@@ -532,18 +535,26 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
 {
     return do_domain_suspend(ctx, domid, fd, flags,
                              LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
-                             LIBXL_LM_DIRTY_THRESHOLD_DEFAULT, ao_how);
+                             LIBXL_LM_DIRTY_THRESHOLD_DEFAULT, -1,
+                             NULL, ao_how);
 }
 
-int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
-                              unsigned int precopy_iterations,
-                              unsigned int precopy_dirty_threshold,
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
+                              int flags, unsigned int precopy_iterations,
+                              unsigned int precopy_dirty_threshold, int recv_fd,
+                              bool *postcopy_transitioned,
                               const libxl_asyncop_how *ao_how)
 {
+    if (!postcopy_transitioned) {
+        errno = EINVAL;
+        return -1;
+    }
+
     flags |= LIBXL_SUSPEND_LIVE;
 
-    return do_domain_suspend(ctx, domid, fd, flags, precopy_iterations,
-                             precopy_dirty_threshold, ao_how);
+    return do_domain_suspend(ctx, domid, send_fd, flags, precopy_iterations,
+                             precopy_dirty_threshold, recv_fd,
+                             postcopy_transitioned, ao_how);
 }
 
 int libxl_domain_pause(libxl_ctx *ctx, uint32_t domid)
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index c754706..ae272d7 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3194,17 +3194,25 @@ struct libxl__stream_write_state {
     void (*completion_callback)(libxl__egc *egc,
                                 libxl__stream_write_state *sws,
                                 int rc);
-    void (*checkpoint_callback)(libxl__egc *egc,
-                                libxl__stream_write_state *sws,
-                                int rc);
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        void (*checkpoint_callback)(libxl__egc *egc,
+                                    libxl__stream_write_state *sws,
+                                    int rc);
+        void (*postcopy_transition_callback)(libxl__egc *egc,
+                                             libxl__stream_write_state *sws,
+                                             int rc);
+    };
     /* Private */
     int rc;
     bool running;
     enum {
         SWS_PHASE_NORMAL,
         SWS_PHASE_CHECKPOINT,
-        SWS_PHASE_CHECKPOINT_STATE
+        SWS_PHASE_CHECKPOINT_STATE,
+        SWS_PHASE_POSTCOPY_TRANSITION
     } phase;
+    bool postcopy_transitioned;
     bool sync_teardown;  /* Only used to coordinate shutdown on error path. */
     libxl__save_helper_state shs;
 
@@ -3227,6 +3235,10 @@ _hidden void libxl__stream_write_init(libxl__stream_write_state *stream);
 _hidden void libxl__stream_write_start(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 _hidden void
+libxl__stream_write_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream);
+_hidden void
 libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                      libxl__stream_write_state *stream);
 _hidden void
@@ -3290,6 +3302,7 @@ struct libxl__domain_save_state {
     int fd;
     int fdfl; /* original flags on fd */
     int recv_fd;
+    bool *postcopy_transitioned;
     libxl_domain_type type;
     int live;
     int debug;
diff --git a/tools/libxl/libxl_sr_stream_format.h b/tools/libxl/libxl_sr_stream_format.h
index 75f5190..a789126 100644
--- a/tools/libxl/libxl_sr_stream_format.h
+++ b/tools/libxl/libxl_sr_stream_format.h
@@ -31,12 +31,13 @@ typedef struct libxl__sr_rec_hdr
 /* All records must be aligned up to an 8 octet boundary */
 #define REC_ALIGN_ORDER              3U
 
-#define REC_TYPE_END                    0x00000000U
-#define REC_TYPE_LIBXC_CONTEXT          0x00000001U
-#define REC_TYPE_EMULATOR_XENSTORE_DATA 0x00000002U
-#define REC_TYPE_EMULATOR_CONTEXT       0x00000003U
-#define REC_TYPE_CHECKPOINT_END         0x00000004U
-#define REC_TYPE_CHECKPOINT_STATE       0x00000005U
+#define REC_TYPE_END                     0x00000000U
+#define REC_TYPE_LIBXC_CONTEXT           0x00000001U
+#define REC_TYPE_EMULATOR_XENSTORE_DATA  0x00000002U
+#define REC_TYPE_EMULATOR_CONTEXT        0x00000003U
+#define REC_TYPE_CHECKPOINT_END          0x00000004U
+#define REC_TYPE_CHECKPOINT_STATE        0x00000005U
+#define REC_TYPE_POSTCOPY_TRANSITION_END 0x00000006U
 
 typedef struct libxl__sr_emulator_hdr
 {
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index 8f2a1c9..1c4b1f1 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -22,6 +22,9 @@
  * Entry points from outside:
  *  - libxl__stream_write_start()
  *     - Start writing a stream from the start.
+ *  - libxl__stream_write_postcopy_transition()
+ *     - Write the records required to permit postcopy resumption at the
+ *       migration target.
  *  - libxl__stream_write_start_checkpoint()
  *     - Write the records which form a checkpoint into a stream.
  *
@@ -65,6 +68,9 @@ static void stream_complete(libxl__egc *egc,
                             libxl__stream_write_state *stream, int rc);
 static void stream_done(libxl__egc *egc,
                         libxl__stream_write_state *stream, int rc);
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *stream,
+                                     int rc);
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_write_state *stream,
                             int rc);
@@ -91,7 +97,9 @@ static void emulator_context_record_done(libxl__egc *egc,
                                          libxl__stream_write_state *stream);
 static void write_phase_end_record(libxl__egc *egc,
                                    libxl__stream_write_state *stream);
-
+static void postcopy_transition_end_record_done(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream);
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 
@@ -211,6 +219,7 @@ void libxl__stream_write_init(libxl__stream_write_state *stream)
     stream->rc = 0;
     stream->running = false;
     stream->phase = SWS_PHASE_NORMAL;
+    stream->postcopy_transitioned = false;
     stream->sync_teardown = false;
     FILLZERO(stream->dc);
     stream->record_done_callback = NULL;
@@ -287,6 +296,22 @@ void libxl__stream_write_start(libxl__egc *egc,
     stream_complete(egc, stream, rc);
 }
 
+void libxl__stream_write_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream)
+{
+    libxl__domain_save_state *dss = stream->dss;
+
+    assert(stream->running);
+    assert(dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE);
+    assert(stream->phase == SWS_PHASE_NORMAL);
+    assert(!stream->postcopy_transitioned);
+
+    stream->phase = SWS_PHASE_POSTCOPY_TRANSITION;
+
+    write_emulator_xenstore_record(egc, stream);
+}
+
 void libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                           libxl__stream_write_state *stream)
 {
@@ -369,7 +394,7 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
      * If the stream is not still alive, we must not continue any work.
      */
     if (libxl__stream_write_inuse(stream)) {
-        if (dss->checkpointed_stream != LIBXL_CHECKPOINTED_STREAM_NONE)
+        if (dss->checkpointed_stream != LIBXL_CHECKPOINTED_STREAM_NONE) {
             /*
              * For remus, if libxl__xc_domain_save_done() completes,
              * there was an error sending data to the secondary.
@@ -377,8 +402,17 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
              * return value (Please refer to libxl__remus_teardown())
              */
             stream_complete(egc, stream, 0);
-        else
+        } else if (stream->postcopy_transitioned) {
+            /*
+             * If, on the other hand, this is a normal migration that had a
+             * postcopy migration stage, we're completely done at this point and
+             * want to report any error received here to our caller.
+             */
+            assert(stream->phase == SWS_PHASE_NORMAL);
+            write_phase_end_record(egc, stream);
+        } else {
             write_emulator_xenstore_record(egc, stream);
+        }
     }
 }
 
@@ -550,6 +584,11 @@ static void write_phase_end_record(libxl__egc *egc,
         what     = "checkpoint end record";
         cb       = checkpoint_end_record_done;
         break;
+    case SWS_PHASE_POSTCOPY_TRANSITION:
+        rec.type = REC_TYPE_POSTCOPY_TRANSITION_END;
+        what     = "postcopy transition end record";
+        cb       = postcopy_transition_end_record_done;
+        break;
     default:
         /* SWS_PHASE_CHECKPOINT_STATE has no end record */
         assert(false);
@@ -558,6 +597,13 @@ static void write_phase_end_record(libxl__egc *egc,
     setup_write(egc, stream, what, &rec, NULL, cb);
 }
 
+static void postcopy_transition_end_record_done(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream)
+{
+    postcopy_transition_done(egc, stream, 0);
+}
+
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream)
 {
@@ -600,6 +646,13 @@ static void stream_complete(libxl__egc *egc,
          */
         checkpoint_state_done(egc, stream, rc);
         break;
+    case SWS_PHASE_POSTCOPY_TRANSITION:
+        /*
+         * To deal with errors during the postcopy transition, we use the same
+         * strategy as during checkpoints.
+         */
+        postcopy_transition_done(egc, stream, rc);
+        break;
     }
 }
 
@@ -627,6 +680,16 @@ static void stream_done(libxl__egc *egc,
     }
 }
 
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *stream,
+                                     int rc)
+{
+    assert(stream->phase == SWS_PHASE_POSTCOPY_TRANSITION);
+    stream->postcopy_transitioned = true;
+    stream->phase = SWS_PHASE_NORMAL;
+    stream->postcopy_transition_callback(egc, stream, rc);
+}
+
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_write_state *stream,
                             int rc)
diff --git a/tools/xl/xl_migrate.c b/tools/xl/xl_migrate.c
index 1bb3fb4..1ffc32b 100644
--- a/tools/xl/xl_migrate.c
+++ b/tools/xl/xl_migrate.c
@@ -188,6 +188,7 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
     char rc_buf;
     uint8_t *config_data;
     int config_len, flags = LIBXL_SUSPEND_LIVE;
+    bool postcopy_transitioned;
 
     save_domain_core_begin(domid, override_config_file,
                            &config_data, &config_len);
@@ -209,7 +210,9 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
         flags |= LIBXL_SUSPEND_DEBUG;
     rc = libxl_domain_live_migrate(ctx, domid, send_fd, flags,
                                    precopy_iterations, precopy_dirty_threshold,
-                                   NULL);
+                                   recv_fd, &postcopy_transitioned, NULL);
+    assert(!postcopy_transitioned);
+
     if (rc) {
         fprintf(stderr, "migration sender: libxl_domain_suspend failed"
                 " (rc=%d)\n", rc);
-- 
2.7.4



* [PATCH RFC 19/20] libxl/migration: implement the receiver side of postcopy live migration
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (17 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 18/20] libxl/migration: implement the sender side of postcopy live migration Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-27  9:06 ` [PATCH RFC 20/20] tools: expose postcopy live migration support in libxl and xl Joshua Otto
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

To make the libxl receiver capable of supporting postcopy live
migration:
- As was done for the libxl stream writer, add a symmetric callback
  chain through the stream reader that reads the sequence of libxl
  records necessary to resume the guest and enter the postcopy phase.
  This chain is very similar to the checkpoint chain.
- Add a new postcopy path through the domain creation sequence that
  permits the xc memory postcopy phase to proceed in parallel to the
  libxl domain creation and resumption sequence.
- Add an out-parameter to libxl_domain_create_restore(),
  'postcopy_resumed', that callers can test to determine whether or not
  further action is required on their part post-migration to get the
  guest running (see the sketch below).

A subsequent patch will introduce a mechanism by which library clients
can _induce_ a postcopy live migration.
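
For illustration, a hypothetical library client prepared for both
outcomes might look something like this (a minimal sketch only - error
handling and the surrounding setup are elided, and all variable names
are illustrative):

    uint32_t domid;
    bool postcopy_resumed;
    int rc;

    rc = libxl_domain_create_restore(ctx, &d_config, &domid, restore_fd,
                                     send_back_fd, &postcopy_resumed,
                                     &params, NULL, NULL);
    if (!rc && postcopy_resumed) {
        /* The guest was already unpaused as part of the postcopy
         * resumption - no separate unpause step is required. */
    } else if (!rc) {
        /* Normal restore - unpause the guest as before. */
    }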

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl.h                  |  28 ++++++-
 tools/libxl/libxl_create.c           | 156 +++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_internal.h         |  43 +++++++++-
 tools/libxl/libxl_stream_read.c      |  57 +++++++++++++
 tools/ocaml/libs/xl/xenlight_stubs.c |   2 +-
 tools/xl/xl_vmcontrol.c              |   2 +-
 6 files changed, 273 insertions(+), 15 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 99d187b..51e8760 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1296,6 +1296,7 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
 int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 uint32_t *domid, int restore_fd,
                                 int send_back_fd,
+                                bool *postcopy_resumed, /* OUT */
                                 const libxl_domain_restore_params *params,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
@@ -1315,8 +1316,9 @@ static inline int libxl_domain_create_restore_0x040200(
 
     libxl_domain_restore_params_init(&params);
 
-    ret = libxl_domain_create_restore(
-        ctx, d_config, domid, restore_fd, -1, &params, ao_how, aop_console_how);
+    ret = libxl_domain_create_restore(ctx, d_config, domid, restore_fd,
+                                      -1, NULL, &params, ao_how,
+                                      aop_console_how);
 
     libxl_domain_restore_params_dispose(&params);
     return ret;
@@ -1336,11 +1338,31 @@ static inline int libxl_domain_create_restore_0x040400(
     LIBXL_EXTERNAL_CALLERS_ONLY
 {
     return libxl_domain_create_restore(ctx, d_config, domid, restore_fd,
-                                       -1, params, ao_how, aop_console_how);
+                                       -1, NULL, params, ao_how,
+                                       aop_console_how);
 }
 
 #define libxl_domain_create_restore libxl_domain_create_restore_0x040400
 
+#elif defined(LIBXL_API_VERSION) && LIBXL_API_VERSION >= 0x040700 \
+                                 && LIBXL_API_VERSION < 0x040900
+
+static inline int libxl_domain_create_restore_0x040700(
+    libxl_ctx *ctx, libxl_domain_config *d_config,
+    uint32_t *domid, int restore_fd,
+    int send_back_fd,
+    const libxl_domain_restore_params *params,
+    const libxl_asyncop_how *ao_how,
+    const libxl_asyncprogress_how *aop_console_how)
+    LIBXL_EXTERNAL_CALLERS_ONLY
+{
+    return libxl_domain_create_restore(ctx, d_config, domid, restore_fd,
+                                       -1, NULL, params, ao_how,
+                                       aop_console_how);
+}
+
+#define libxl_domain_create_restore libxl_domain_create_restore_0x040700
+
 #endif
 
 int libxl_domain_soft_reset(libxl_ctx *ctx,
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 8f4af0a..184b278 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -745,8 +745,20 @@ static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc);
 
+/* If a postcopy migration is initiated by the sending side during a live
+ * migration, this function returns control of the stream to the stream reader
+ * so it can finish the libxl stream. */
 static void domcreate_postcopy_transition_callback(void *user);
 
+/* When the stream reader postcopy transition completes, this callback is
+ * invoked.  It transfers control of the restore stream back to the helper. */
+void domcreate_postcopy_transition_complete_callback(
+    libxl__egc *egc, libxl__stream_read_state *srs, int rc);
+
+static void domcreate_postcopy_stream_done(libxl__egc *egc,
+                                           libxl__stream_read_state *srs,
+                                           int ret);
+
 static void domcreate_launch_dm(libxl__egc *egc, libxl__multidev *aodevs,
                                 int ret);
 
@@ -773,6 +785,10 @@ static void domcreate_destruction_cb(libxl__egc *egc,
                                      libxl__domain_destroy_state *dds,
                                      int rc);
 
+static void domcreate_report_result(libxl__egc *egc,
+                                    libxl__domain_create_state *dcs,
+                                    int rc);
+
 static void initiate_domain_create(libxl__egc *egc,
                                    libxl__domain_create_state *dcs)
 {
@@ -1104,6 +1120,13 @@ static void domcreate_bootloader_done(libxl__egc *egc,
             callbacks->postcopy_transition =
                 domcreate_postcopy_transition_callback;
 
+            /* When the stream reader is finished reading the postcopy
+             * transition, we'll find out in the
+             * domcreate_postcopy_transition_complete_callback(), where we'll
+             * hand control of the stream back to the libxc helper. */
+            dcs->srs.postcopy_transition_callback =
+                domcreate_postcopy_transition_complete_callback;
+
             libxl__stream_read_start(egc, &dcs->srs);
         }
         return;
@@ -1117,8 +1140,73 @@ static void domcreate_bootloader_done(libxl__egc *egc,
 
 static void domcreate_postcopy_transition_callback(void *user)
 {
-    /* XXX we're not ready to deal with this yet */
-    assert(0);
+    libxl__save_helper_state *shs = user;
+    libxl__domain_create_state *dcs = shs->caller_state;
+    libxl__stream_read_state *srs = &dcs->srs;
+
+    libxl__stream_read_start_postcopy_transition(shs->egc, srs);
+}
+
+void domcreate_postcopy_transition_complete_callback(
+    libxl__egc *egc, libxl__stream_read_state *srs, int rc)
+{
+    libxl__domain_create_state *dcs = srs->dcs;
+
+    if (!rc)
+        srs->completion_callback = domcreate_postcopy_stream_done;
+
+    /* If all is well (for now) we'll find out about the eventual termination
+     * of the restore helper/stream through domcreate_postcopy_stream_done().
+     * Otherwise, we'll find out sooner through domcreate_stream_done(). */
+    libxl__xc_domain_saverestore_async_callback_done(egc, &srs->shs, !rc);
+
+    if (!rc) {
+        /* In parallel, resume the guest. */
+        dcs->postcopy.active = true;
+        dcs->postcopy.resume.state = DCS_POSTCOPY_RESUME_INPROGRESS;
+        dcs->postcopy.stream.state = DCS_POSTCOPY_STREAM_INPROGRESS;
+        domcreate_stream_done(egc, srs, 0);
+    }
+}
+
+static void domcreate_postcopy_stream_done(libxl__egc *egc,
+                                           libxl__stream_read_state *srs,
+                                           int ret)
+{
+    libxl__domain_create_state *dcs = srs->dcs;
+
+    EGC_GC;
+
+    assert(dcs->postcopy.stream.state == DCS_POSTCOPY_STREAM_INPROGRESS);
+
+    switch (dcs->postcopy.resume.state) {
+    case DCS_POSTCOPY_RESUME_INPROGRESS:
+        if (ret) {
+            /* The stream failed, and the resumption is still in progress.
+             * Stash our return code for resumption to find later. */
+            dcs->postcopy.stream.state = DCS_POSTCOPY_STREAM_FAILED;
+            dcs->postcopy.stream.rc = ret;
+        } else {
+            /* We've successfully completed, but the resumption is still humming
+             * away. */
+            dcs->postcopy.stream.state = DCS_POSTCOPY_STREAM_SUCCESS;
+
+            /* Just let it finish.  Nothing to do for now. */
+            LOG(INFO, "Postcopy stream completed _before_ domain unpaused");
+        }
+        break;
+    case DCS_POSTCOPY_RESUME_FAILED:
+        /* The resumption failed first, so report its result. */
+        dcs->callback(egc, dcs, dcs->postcopy.resume.rc, dcs->guest_domid);
+        break;
+    case DCS_POSTCOPY_RESUME_SUCCESS:
+        /* This is the expected case - resumption completed, and some time later
+         * the final postcopy pages were migrated and the stream wrapped up.
+         * We're now totally done! */
+        LOG(INFO, "Postcopy stream completed after domain unpaused");
+        dcs->callback(egc, dcs, ret, dcs->guest_domid);
+        break;
+    }
 }
 
 void libxl__srm_callout_callback_restore_results(xen_pfn_t store_mfn,
@@ -1572,7 +1660,8 @@ static void domcreate_complete(libxl__egc *egc,
         }
         dcs->guest_domid = -1;
     }
-    dcs->callback(egc, dcs, rc, dcs->guest_domid);
+
+    domcreate_report_result(egc, dcs, rc);
 }
 
 static void domcreate_destruction_cb(libxl__egc *egc,
@@ -1585,7 +1674,55 @@ static void domcreate_destruction_cb(libxl__egc *egc,
     if (rc)
         LOGD(ERROR, dds->domid, "unable to destroy domain following failed creation");
 
-    dcs->callback(egc, dcs, ERROR_FAIL, dcs->guest_domid);
+    domcreate_report_result(egc, dcs, ERROR_FAIL);
+}
+
+static void domcreate_report_result(libxl__egc *egc,
+                                    libxl__domain_create_state *dcs,
+                                    int rc)
+{
+    EGC_GC;
+
+    if (!dcs->postcopy.active) {
+        /* If we aren't presently in the process of completing a postcopy
+         * resumption (the norm), everything is all cleaned up and we can report
+         * our result directly. */
+        LOG(INFO, "No postcopy at all");
+        dcs->callback(egc, dcs, rc, dcs->guest_domid);
+    } else {
+        switch (dcs->postcopy.stream.state) {
+        case DCS_POSTCOPY_STREAM_INPROGRESS:
+        case DCS_POSTCOPY_STREAM_SUCCESS:
+            /* If we haven't yet failed, try to unpause the guest. */
+            rc = rc ?: libxl_domain_unpause(CTX, dcs->guest_domid);
+            if (dcs->postcopy_resumed)
+                *dcs->postcopy_resumed = !rc;
+
+            if (dcs->postcopy.stream.state == DCS_POSTCOPY_STREAM_SUCCESS) {
+                /* The stream finished successfully, so we can report our local
+                 * result as the overall result. */
+                dcs->callback(egc, dcs, rc, dcs->guest_domid);
+                LOG(INFO, "Postcopy domain unpaused after stream completed");
+            } else if (rc) {
+                /* The stream isn't done yet, but we failed.  Tell it to bail,
+                 * and stash our return code for the postcopy stream completion
+                 * callback to find. */
+                dcs->postcopy.resume.state = DCS_POSTCOPY_RESUME_FAILED;
+                dcs->postcopy.resume.rc = rc;
+
+                libxl__stream_read_abort(egc, &dcs->srs, -1);
+            } else {
+                dcs->postcopy.resume.state = DCS_POSTCOPY_RESUME_SUCCESS;
+                LOG(INFO, "Postcopy domain unpaused before stream completed");
+            }
+            break;
+        case DCS_POSTCOPY_STREAM_FAILED:
+            /* The stream failed.  Now that we're done, tie things up by
+             * reporting the stream's result. */
+            dcs->callback(egc, dcs, dcs->postcopy.stream.rc, dcs->guest_domid);
+            break;
+        }
+    }
 }
 
 /*----- application-facing domain creation interface -----*/
@@ -1609,6 +1746,7 @@ static void domain_create_cb(libxl__egc *egc,
 
 static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid, int restore_fd, int send_back_fd,
+                            bool *postcopy_resumed, /* OUT */
                             const libxl_domain_restore_params *params,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
@@ -1617,6 +1755,9 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
     libxl__app_domain_create_state *cdcs;
     int rc;
 
+    if (postcopy_resumed)
+        *postcopy_resumed = false;
+
     GCNEW(cdcs);
     cdcs->dcs.ao = ao;
     cdcs->dcs.guest_config = d_config;
@@ -1631,6 +1772,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                                          &cdcs->dcs.restore_fdfl);
         if (rc < 0) goto out_err;
     }
+    cdcs->dcs.postcopy_resumed = postcopy_resumed;
     cdcs->dcs.callback = domain_create_cb;
     cdcs->dcs.domid_soft_reset = INVALID_DOMID;
 
@@ -1852,13 +1994,13 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             const libxl_asyncprogress_how *aop_console_how)
 {
     unset_disk_colo_restore(d_config);
-    return do_domain_create(ctx, d_config, domid, -1, -1, NULL,
+    return do_domain_create(ctx, d_config, domid, -1, -1, NULL, NULL,
                             ao_how, aop_console_how);
 }
 
 int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 uint32_t *domid, int restore_fd,
-                                int send_back_fd,
+                                int send_back_fd, bool *postcopy_resumed,
                                 const libxl_domain_restore_params *params,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
@@ -1870,7 +2012,7 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
     }
 
     return do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
-                            params, ao_how, aop_console_how);
+                            postcopy_resumed, params, ao_how, aop_console_how);
 }
 
 int libxl_domain_soft_reset(libxl_ctx *ctx,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index ae272d7..0a7c0d1 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3117,9 +3117,15 @@ struct libxl__stream_read_state {
     void (*completion_callback)(libxl__egc *egc,
                                 libxl__stream_read_state *srs,
                                 int rc);
-    void (*checkpoint_callback)(libxl__egc *egc,
-                                libxl__stream_read_state *srs,
-                                int rc);
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        void (*checkpoint_callback)(libxl__egc *egc,
+                                    libxl__stream_read_state *srs,
+                                    int rc);
+        void (*postcopy_transition_callback)(libxl__egc *egc,
+                                             libxl__stream_read_state *srs,
+                                             int rc);
+    };
     /* Private */
     int rc;
     bool running;
@@ -3133,10 +3139,12 @@ struct libxl__stream_read_state {
     LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; /* NOGC */
     enum {
         SRS_PHASE_NORMAL,
+        SRS_PHASE_POSTCOPY_TRANSITION,
         SRS_PHASE_CHECKPOINT_BUFFERING,
         SRS_PHASE_CHECKPOINT_UNBUFFERING,
         SRS_PHASE_CHECKPOINT_STATE
     } phase;
+    bool postcopy_transitioned;
     bool recursion_guard;
 
     /* Only used while actively reading a record from the stream. */
@@ -3150,6 +3158,9 @@ struct libxl__stream_read_state {
 _hidden void libxl__stream_read_init(libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_start(libxl__egc *egc,
                                       libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                                  libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_checkpoint_state(libxl__egc *egc,
@@ -3702,8 +3713,34 @@ struct libxl__domain_create_state {
     int restore_fd, libxc_fd;
     int restore_fdfl; /* original flags of restore_fd */
     int send_back_fd;
+    bool *postcopy_resumed;
     libxl_domain_restore_params restore_params;
     uint32_t domid_soft_reset;
+    struct {
+        /* Is a postcopy resumption in progress? (i.e. does the rest of this
+         * state have any meaning?) */
+        bool active;
+
+        struct {
+            enum {
+                DCS_POSTCOPY_RESUME_INPROGRESS,
+                DCS_POSTCOPY_RESUME_FAILED,
+                DCS_POSTCOPY_RESUME_SUCCESS
+            } state;
+
+            int rc;
+        } resume;
+
+        struct {
+            enum {
+                DCS_POSTCOPY_STREAM_INPROGRESS,
+                DCS_POSTCOPY_STREAM_FAILED,
+                DCS_POSTCOPY_STREAM_SUCCESS
+            } state;
+
+            int rc;
+        } stream;
+    } postcopy;
     libxl__domain_create_cb *callback;
     libxl_asyncprogress_how aop_console_how;
     /* private to domain_create */
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 4cb553e..8e9b720 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -35,6 +35,7 @@
  * Undefined    undef  undef                    undef    undef
  * Idle         false  undef                    0        0
  * Active       true   NORMAL                   0/1      0/partial
+ * Active       true   POSTCOPY_TRANSITION      0/1      0/partial
  * Active       true   CHECKPOINT_BUFFERING     any      0/partial
  * Active       true   CHECKPOINT_UNBUFFERING   any      0
  * Active       true   CHECKPOINT_STATE         0/1      0/partial
@@ -133,6 +134,8 @@
 /* Success/error/cleanup handling. */
 static void stream_complete(libxl__egc *egc,
                             libxl__stream_read_state *stream, int rc);
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_read_state *stream, int rc);
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_read_state *stream, int rc);
 static void stream_done(libxl__egc *egc,
@@ -222,6 +225,7 @@ void libxl__stream_read_init(libxl__stream_read_state *stream)
     FILLZERO(stream->hdr);
     LIBXL_STAILQ_INIT(&stream->record_queue);
     stream->phase = SRS_PHASE_NORMAL;
+    stream->postcopy_transitioned = false;
     stream->recursion_guard = false;
     stream->incoming_record = NULL;
     FILLZERO(stream->emu_dc);
@@ -299,6 +303,26 @@ void libxl__stream_read_start(libxl__egc *egc,
     stream_complete(egc, stream, rc);
 }
 
+void libxl__stream_read_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_read_state *stream)
+{
+    int checkpointed_stream = stream->dcs->restore_params.checkpointed_stream;
+
+    assert(stream->running);
+    assert(checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE);
+    assert(stream->phase == SRS_PHASE_NORMAL);
+    assert(!stream->postcopy_transitioned);
+
+    stream->phase = SRS_PHASE_POSTCOPY_TRANSITION;
+
+    /*
+     * Libxc has handed control of the fd to us.  Start reading some
+     * libxl records out of it.
+     */
+    stream_continue(egc, stream);
+}
+
 void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                          libxl__stream_read_state *stream)
 {
@@ -397,6 +421,7 @@ static void stream_continue(libxl__egc *egc,
 
     switch (stream->phase) {
     case SRS_PHASE_NORMAL:
+    case SRS_PHASE_POSTCOPY_TRANSITION:
     case SRS_PHASE_CHECKPOINT_STATE:
         /*
          * Normal phase (regular migration or restore from file):
@@ -576,6 +601,13 @@ static bool process_record(libxl__egc *egc,
 
     LOG(DEBUG, "Record: %u, length %u", rec->hdr.type, rec->hdr.length);
 
+    if (stream->postcopy_transitioned &&
+        rec->hdr.type != REC_TYPE_END) {
+        rc = ERROR_FAIL;
+        LOG(ERROR, "Received non-end record after postcopy transition");
+        goto err;
+    }
+
     switch (rec->hdr.type) {
 
     case REC_TYPE_END:
@@ -627,6 +659,15 @@ static bool process_record(libxl__egc *egc,
         write_emulator_blob(egc, stream, rec);
         break;
 
+    case REC_TYPE_POSTCOPY_TRANSITION_END:
+        if (stream->phase != SRS_PHASE_POSTCOPY_TRANSITION) {
+            LOG(ERROR, "Unexpected POSTCOPY_TRANSITION_END record in stream");
+            rc = ERROR_FAIL;
+            goto err;
+        }
+        postcopy_transition_done(egc, stream, 0);
+        break;
+
     case REC_TYPE_CHECKPOINT_END:
         if (!stream_in_checkpoint(stream)) {
             LOG(ERROR, "Unexpected CHECKPOINT_END record in stream");
@@ -761,6 +802,13 @@ static void stream_complete(libxl__egc *egc,
          */
         checkpoint_done(egc, stream, rc);
         break;
+    case SRS_PHASE_POSTCOPY_TRANSITION:
+        assert(rc);
+
+        /*
+         * To deal with errors during the postcopy transition, we use the same
+         * strategy as during checkpoints.
+         */
     case SRS_PHASE_CHECKPOINT_STATE:
         assert(rc);
 
@@ -777,6 +825,15 @@ static void stream_complete(libxl__egc *egc,
     }
 }
 
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_read_state *stream, int rc)
+{
+    assert(stream->phase == SRS_PHASE_POSTCOPY_TRANSITION);
+    stream->postcopy_transitioned = true;
+    stream->phase = SRS_PHASE_NORMAL;
+    stream->postcopy_transition_callback(egc, stream, rc);
+}
+
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_read_state *stream, int rc)
 {
diff --git a/tools/ocaml/libs/xl/xenlight_stubs.c b/tools/ocaml/libs/xl/xenlight_stubs.c
index 98b52b9..3ef5a1e 100644
--- a/tools/ocaml/libs/xl/xenlight_stubs.c
+++ b/tools/ocaml/libs/xl/xenlight_stubs.c
@@ -538,7 +538,7 @@ value stub_libxl_domain_create_restore(value ctx, value domain_config, value par
 
 	caml_enter_blocking_section();
 	ret = libxl_domain_create_restore(CTX, &c_dconfig, &c_domid, restore_fd,
-		-1, &c_params, ao_how, NULL);
+		-1, NULL, &c_params, ao_how, NULL);
 	caml_leave_blocking_section();
 
 	free(ao_how);
diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 89c2b25..47ba9f3 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -882,7 +882,7 @@ start:
 
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
-                                          send_back_fd, &params,
+                                          send_back_fd, NULL, &params,
                                           0, autoconnect_console_how);
 
         libxl_domain_restore_params_dispose(&params);
-- 
2.7.4



* [PATCH RFC 20/20] tools: expose postcopy live migration support in libxl and xl
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (18 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 19/20] libxl/migration: implement the receiver " Joshua Otto
@ 2017-03-27  9:06 ` Joshua Otto
  2017-03-28 14:41 ` [PATCH RFC 00/20] Add postcopy live migration support Wei Liu
  2017-03-29 22:50 ` Andrew Cooper
  21 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-27  9:06 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, Joshua Otto,
	imhy.yang, hjarmstr

- Add a 'memory_strategy' parameter to libxl_domain_live_migrate(),
  which specifies how the remainder of the memory migration should be
  approached after the iterative precopy phase completes (see the
  sketch below).
- Plug this parameter into the libxl migration precopy policy
  implementation.
- Add --postcopy to xl migrate, and skip the xl-level handshaking on
  both sides when postcopy migration occurs.
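
For example, a sender requesting the postcopy behaviour might call the
new API along these lines (a minimal sketch - descriptor setup and
error handling are elided, and the variable names are illustrative):

    bool postcopy_transitioned;
    int rc;

    rc = libxl_domain_live_migrate(ctx, domid, send_fd, LIBXL_SUSPEND_LIVE,
                                   LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
                                   LIBXL_LM_DIRTY_THRESHOLD_DEFAULT,
                                   recv_fd, LIBXL_LM_MEMORY_POSTCOPY,
                                   &postcopy_transitioned, NULL);
    if (rc && postcopy_transitioned) {
        /* The guest may already have executed at the destination, so
         * resuming it here is no longer safe. */
    }

From xl, the same behaviour is requested with the new flag:

    xl migrate --postcopy <domain> <host>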

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl.h          |  6 ++++-
 tools/libxl/libxl_dom_save.c | 19 ++++++-------
 tools/libxl/libxl_domain.c   |  9 ++++---
 tools/libxl/libxl_internal.h |  1 +
 tools/xl/xl.h                |  7 ++++-
 tools/xl/xl_cmdtable.c       |  5 +++-
 tools/xl/xl_migrate.c        | 63 +++++++++++++++++++++++++++++++++++++++-----
 tools/xl/xl_vmcontrol.c      |  8 ++++--
 8 files changed, 94 insertions(+), 24 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 51e8760..3a2f7ea 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1401,7 +1401,7 @@ int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
                               int flags, /* LIBXL_SUSPEND_* */
                               unsigned int precopy_iterations,
                               unsigned int precopy_dirty_threshold,
-                              int recv_fd,
+                              int recv_fd, int memory_strategy,
                               bool *postcopy_transitioned, /* OUT */
                               const libxl_asyncop_how *ao_how)
                               LIBXL_EXTERNAL_CALLERS_ONLY;
@@ -1409,6 +1409,10 @@ int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
 #define LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT 5
 #define LIBXL_LM_DIRTY_THRESHOLD_DEFAULT 50
 
+#define LIBXL_LM_MEMORY_STOP_AND_COPY 0
+#define LIBXL_LM_MEMORY_POSTCOPY 1
+#define LIBXL_LM_MEMORY_DEFAULT LIBXL_LM_MEMORY_STOP_AND_COPY
+
 /* @param suspend_cancel [from xenctrl.h:xc_domain_resume( @param fast )]
  *   If this parameter is true, use co-operative resume. The guest
  *   must support this.
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 9e565ae..9d5d435 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -333,18 +333,19 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
  * the precopy phase of live migrations, and is responsible for deciding when
  * the precopy phase should terminate and what should be done next.
  */
-static int libxl__save_live_migration_simple_precopy_policy(
-    struct precopy_stats stats, void *user)
+static int libxl__save_live_migration_precopy_policy(struct precopy_stats stats,
+                                                     void *user)
 {
     libxl__save_helper_state *shs = user;
     libxl__domain_save_state *dss = shs->caller_state;
 
-    if (stats.dirty_count >= 0 &&
-        stats.dirty_count <= dss->precopy_dirty_threshold)
-        return XGS_POLICY_STOP_AND_COPY;
-
-    if (stats.iteration >= dss->precopy_iterations)
-        return XGS_POLICY_STOP_AND_COPY;
+    if ((stats.dirty_count >= 0 &&
+         stats.dirty_count <= dss->precopy_dirty_threshold) ||
+        (stats.iteration >= dss->precopy_iterations)) {
+        return (dss->memory_strategy == LIBXL_LM_MEMORY_POSTCOPY)
+            ? XGS_POLICY_POSTCOPY
+            : XGS_POLICY_STOP_AND_COPY;
+    }
 
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
@@ -452,7 +453,7 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
             libxl__save_live_migration_postcopy_transition_callback;
     }
 
-    callbacks->precopy_policy = libxl__save_live_migration_simple_precopy_policy;
+    callbacks->precopy_policy = libxl__save_live_migration_precopy_policy;
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
 
     dss->sws.ao  = dss->ao;
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index ea778a6..feec293 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -489,6 +489,7 @@ static void domain_suspend_cb(libxl__egc *egc,
 static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
                              unsigned int precopy_iterations,
                              unsigned int precopy_dirty_threshold, int recv_fd,
+                             int memory_strategy,
                              bool *postcopy_transitioned,
                              const libxl_asyncop_how *ao_how)
 {
@@ -510,7 +511,8 @@ static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     dss->domid = domid;
     dss->fd = fd;
     dss->recv_fd = recv_fd;
-    dss->postcopy_transitioned = postcopy_resumed_remotely;
+    dss->memory_strategy = memory_strategy;
+    dss->postcopy_transitioned = postcopy_transitioned;
     dss->type = type;
     dss->live = flags & LIBXL_SUSPEND_LIVE;
     dss->debug = flags & LIBXL_SUSPEND_DEBUG;
@@ -536,12 +538,13 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     return do_domain_suspend(ctx, domid, fd, flags,
                              LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
                              LIBXL_LM_DIRTY_THRESHOLD_DEFAULT, -1,
-                             NULL, ao_how);
+                             LIBXL_LM_MEMORY_DEFAULT, NULL, ao_how);
 }
 
 int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
                               int flags, unsigned int precopy_iterations,
                               unsigned int precopy_dirty_threshold, int recv_fd,
+                              int memory_strategy,
                               bool *postcopy_transitioned,
                               const libxl_asyncop_how *ao_how)
 {
@@ -553,7 +556,7 @@ int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
     flags |= LIBXL_SUSPEND_LIVE;
 
     return do_domain_suspend(ctx, domid, send_fd, flags, precopy_iterations,
-                             precopy_dirty_threshold, recv_fd,
+                             precopy_dirty_threshold, recv_fd, memory_strategy,
                              postcopy_transitioned, ao_how);
 }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 0a7c0d1..209cee5 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3313,6 +3313,7 @@ struct libxl__domain_save_state {
     int fd;
     int fdfl; /* original flags on fd */
     int recv_fd;
+    int memory_strategy;
     bool *postcopy_transitioned;
     libxl_domain_type type;
     int live;
diff --git a/tools/xl/xl.h b/tools/xl/xl.h
index aa95b77..279c716 100644
--- a/tools/xl/xl.h
+++ b/tools/xl/xl.h
@@ -48,6 +48,7 @@ struct domain_create {
     bool userspace_colo_proxy;
     int migrate_fd; /* -1 means none */
     int send_back_fd; /* -1 means none */
+    bool *postcopy_resumed;
     char **migration_domname_r; /* from malloc */
 };
 
@@ -66,7 +67,6 @@ static const char migrate_permission_to_go[]=
     "domain is yours, you are cleared to unpause";
 static const char migrate_report[]=
     "my copy unpause results are as follows";
-#endif
 
   /* followed by one byte:
    *     0: everything went well, domain is running
@@ -76,6 +76,11 @@ static const char migrate_report[]=
    *            from target to source
    */
 
+static const char migrate_postcopy_sync[]=
+    "postcopy migration completed successfully";
+
+#endif
+
 #define XL_MANDATORY_FLAG_JSON (1U << 0) /* config data is in JSON format */
 #define XL_MANDATORY_FLAG_STREAMv2 (1U << 1) /* stream is v2 */
 #define XL_MANDATORY_FLAG_ALL  (XL_MANDATORY_FLAG_JSON |        \
diff --git a/tools/xl/xl_cmdtable.c b/tools/xl/xl_cmdtable.c
index 6df66fb..7bd2d1b 100644
--- a/tools/xl/xl_cmdtable.c
+++ b/tools/xl/xl_cmdtable.c
@@ -169,7 +169,10 @@ struct cmd_spec cmd_table[] = {
       "--precopy-iterations Perform at most this many iterations of the precopy\n"
       "                     memory migration loop before suspending the domain.\n"
       "--precopy-threshold  If fewer than this many pages are dirty at the end of a\n"
-      "                     copy round, exit the precopy loop and suspend the domain."
+      "                     copy round, exit the precopy loop and suspend the domain.\n"
+      "--postcopy           At the end of the iterative precopy phase, transition to a\n"
+      "                     postcopy memory migration rather than performing a stop-and-copy\n"
+      "                     migration of the outstanding dirty pages.\n"
     },
     { "restore",
       &main_restore, 0, 1,
diff --git a/tools/xl/xl_migrate.c b/tools/xl/xl_migrate.c
index 1ffc32b..43c7d8e 100644
--- a/tools/xl/xl_migrate.c
+++ b/tools/xl/xl_migrate.c
@@ -179,7 +179,8 @@ static void migrate_do_preamble(int send_fd, int recv_fd, pid_t child,
 static void migrate_domain(uint32_t domid, const char *rune, int debug,
                            const char *override_config_file,
                            unsigned int precopy_iterations,
-                           unsigned int precopy_dirty_threshold)
+                           unsigned int precopy_dirty_threshold,
+                           int memory_strategy)
 {
     pid_t child = -1;
     int rc;
@@ -210,18 +211,32 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
         flags |= LIBXL_SUSPEND_DEBUG;
     rc = libxl_domain_live_migrate(ctx, domid, send_fd, flags,
                                    precopy_iterations, precopy_dirty_threshold,
-                                   recv_fd, &postcopy_transitioned, NULL);
-    assert(!postcopy_transitioned);
-
+                                   recv_fd, memory_strategy,
+                                   &postcopy_transitioned, NULL);
     if (rc) {
         fprintf(stderr, "migration sender: libxl_domain_suspend failed"
                 " (rc=%d)\n", rc);
-        if (rc == ERROR_GUEST_TIMEDOUT)
+        if (postcopy_transitioned)
+            goto failed_postcopy;
+        else if (rc == ERROR_GUEST_TIMEDOUT)
             goto failed_suspend;
         else
             goto failed_resume;
     }
 
+    /* No need for additional ceremony if we already resumed the guest as part
+     * of a postcopy live migration. */
+    if (postcopy_transitioned) {
+        /* It doesn't matter if something happens to the pipe after we get to
+         * this point - we only bother to synchronize here for tidiness. */
+        migrate_read_fixedmessage(recv_fd, migrate_postcopy_sync,
+                                  sizeof(migrate_postcopy_sync),
+                                  "postcopy sync", rune);
+        libxl_domain_destroy(ctx, domid, 0);
+        fprintf(stderr, "Migration successful.\n");
+        exit(EXIT_SUCCESS);
+    }
+
     //fprintf(stderr, "migration sender: Transfer complete.\n");
     // Should only be printed when debugging as it's a bit messy with
     // progress indication.
@@ -320,6 +335,21 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
     close(send_fd);
     migration_child_report(recv_fd);
     exit(EXIT_FAILURE);
+
+ failed_postcopy:
+    if (common_domname) {
+        xasprintf(&away_domname, "%s--postcopy-inconsistent", common_domname);
+        libxl_domain_rename(ctx, domid, common_domname, away_domname);
+    }
+
+    fprintf(stderr,
+ "** Migration failed during memory postcopy **\n"
+ "It's possible that the guest has executed/is executing at the destination,\n"
+ " so resuming it here now may be unsafe.\n");
+
+    close(send_fd);
+    migration_child_report(recv_fd);
+    exit(EXIT_FAILURE);
 }
 
 static void migrate_receive(int debug, int daemonize, int monitor,
@@ -333,6 +363,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     int rc, rc2;
     char rc_buf;
     char *migration_domname;
+    bool postcopy_resumed;
     struct domain_create dom_info;
 
     signal(SIGPIPE, SIG_IGN);
@@ -352,6 +383,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.paused = 1;
     dom_info.migrate_fd = recv_fd;
     dom_info.send_back_fd = send_fd;
+    dom_info.postcopy_resumed = &postcopy_resumed;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = checkpointed;
     dom_info.colo_proxy_script = colo_proxy_script;
@@ -414,6 +446,18 @@ static void migrate_receive(int debug, int daemonize, int monitor,
         break;
     }
 
+    /* No need for additional ceremony if we already resumed the guest as part
+     * of a postcopy live migration. */
+    if (postcopy_resumed) {
+        libxl_write_exactly(ctx, send_fd, migrate_postcopy_sync,
+                            sizeof(migrate_postcopy_sync),
+                            "migration ack stream", "postcopy sync");
+        fprintf(stderr, "migration target: Domain started successfully.\n");
+        libxl_domain_rename(ctx, domid, migration_domname, common_domname);
+        exit(EXIT_SUCCESS);
+    }
+
+
     fprintf(stderr, "migration target: Transfer complete,"
             " requesting permission to start domain.\n");
 
@@ -545,12 +589,14 @@ int main_migrate(int argc, char **argv)
     char *host;
     int opt, daemonize = 1, monitor = 1, debug = 0, pause_after_migration = 0;
     int precopy_iterations = LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
-        precopy_dirty_threshold = LIBXL_LM_DIRTY_THRESHOLD_DEFAULT;
+        precopy_dirty_threshold = LIBXL_LM_DIRTY_THRESHOLD_DEFAULT,
+        memory_strategy = LIBXL_LM_MEMORY_DEFAULT;
     static struct option opts[] = {
         {"debug", 0, 0, 0x100},
         {"live", 0, 0, 0x200},
         {"precopy-iterations", 1, 0, 'i'},
         {"precopy-threshold", 1, 0, 'd'},
+        {"postcopy", 0, 0, 0x400},
         COMMON_LONG_OPTS
     };
 
@@ -591,6 +637,9 @@ int main_migrate(int argc, char **argv)
     case 0x200: /* --live */
         /* ignored for compatibility with xm */
         break;
+    case 0x400: /* --postcopy */
+        memory_strategy = LIBXL_LM_MEMORY_POSTCOPY;
+        break;
     }
 
     domid = find_domain(argv[optind]);
@@ -622,7 +671,7 @@ int main_migrate(int argc, char **argv)
     }
 
     migrate_domain(domid, rune, debug, config_filename, precopy_iterations,
-                   precopy_dirty_threshold);
+                   precopy_dirty_threshold, memory_strategy);
     return EXIT_SUCCESS;
 }
 
diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 47ba9f3..62e09c1 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -655,6 +655,7 @@ int create_domain(struct domain_create *dom_info)
     const char *config_source = NULL;
     const char *restore_source = NULL;
     int migrate_fd = dom_info->migrate_fd;
+    bool *postcopy_resumed = dom_info->postcopy_resumed;
     bool config_in_json;
 
     int i;
@@ -675,6 +676,9 @@ int create_domain(struct domain_create *dom_info)
 
     int restoring = (restore_file || (migrate_fd >= 0));
 
+    if (postcopy_resumed)
+        *postcopy_resumed = false;
+
     libxl_domain_config_init(&d_config);
 
     if (restoring) {
@@ -882,8 +886,8 @@ start:
 
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
-                                          send_back_fd, NULL, &params,
-                                          0, autoconnect_console_how);
+                                          send_back_fd, postcopy_resumed,
+                                          &params, 0, autoconnect_console_how);
 
         libxl_domain_restore_params_dispose(&params);
 
-- 
2.7.4



* Re: [PATCH RFC 00/20] Add postcopy live migration support
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (19 preceding siblings ...)
  2017-03-27  9:06 ` [PATCH RFC 20/20] tools: expose postcopy live migration support in libxl and xl Joshua Otto
@ 2017-03-28 14:41 ` Wei Liu
  2017-03-30  4:13   ` Joshua Otto
  2017-03-29 22:50 ` Andrew Cooper
  21 siblings, 1 reply; 53+ messages in thread
From: Wei Liu @ 2017-03-28 14:41 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

Hi Harley, Chester and Joshua

This is really nice work. I took a brief look at all the patches, and
they look really high quality.

We're currently approaching freeze for a Xen release. We've got a lot on
our plate. I think maintainers will get to this series at some point.

From the look of things, some patches can go in because they're generally
useful.

On Mon, Mar 27, 2017 at 05:06:12AM -0400, Joshua Otto wrote:
> Hi,
> 
> We're a team of three fourth-year undergraduate software engineering students at
> the University of Waterloo in Canada.  In late 2015 we posted on the list [1] to
> ask for a project to undertake for our program's capstone design project, and
> Andrew Cooper pointed us in the direction of the live migration implementation
> as an area that could use some attention.  We were particularly interested in
> post-copy live migration (as evaluated by [2] and discussed on the list at [3]),
> and have been working on an implementation of this on-and-off since then.
> 
> We now have a working implementation of this scheme, and are submitting it for
> comment.  The changes are also available as the 'postcopy' branch of the GitHub
> repository at [4]
> 
> As a brief overview of our approach:
> - We introduce a mechanism by which libxl can indicate to the libxc stream
>   helper process that the iterative migration precopy loop should be terminated
>   and postcopy should begin.
> - At this point, we suspend the domain, collect the final set of dirty pfns and
>   write these pfns (and _not_ their contents) into the stream.
> - At the destination, the xc restore logic registers itself as a pager for the
>   migrating domain, 'evicts' all of the pfns indicated by the sender as
>   outstanding, and then resumes the domain at the destination.
> - As the domain executes, the migration sender continues to push the remaining
>   oustanding pages to the receiver in the background.  The receiver
>   monitors both the stream for incoming page data and the paging ring event
>   channel for page faults triggered by the guest.  Page faults are forwarded on
>   the back-channel migration stream to the migration sender, which prioritizes
>   these pages for transmission.
> 
> By leveraging the existing paging API, we are able to implement the postcopy
> scheme without any hypervisor modifications - all of our changes are confined to
> the userspace toolstack.  However, we inherit from the paging API the
> requirement that the domains be HVM and that the host have HAP/EPT support.
> 

Please consider writing a design document for this feature and stick it
at the beginning of your series in the future. You can find examples
under docs/designs.

The restriction is a bit unfortunate, but we shouldn't block useful work
because it's incomplete. We just need to make sure that, should someone
decide to implement similar functionality for PV guests, they are able
to do so.

You might want to check if shadow paging can be used with the paging API,
such that you can widen support to all HVM guests.

> We haven't yet had the opportunity to perform a quantitative evaluation of the
> performance trade-offs between the traditional pre-copy and our post-copy
> strategies, but intend to.  Informally, we've been testing our implementation by
> migrating a domain running the x86 memtest program (which is obviously a
> tremendously write-heavy workload), and have observed a substantial reduction in
> total time required for migration completion (at the expense of a visually
> obvious 'slowdown' in the execution of the program).  We've also noticed that,
> when performing a postcopy without any leading precopy iterations, the time
> required at the destination to 'evict' all of the outstanding pages is
> substantial - possibly because there is no batching mechanism by which pages can
> be evicted - so this area in particular might require further attention.
> 

Please do post numbers when you have them. For now, please be patient
and wait for people to comment.

Wei.


* Re: [PATCH RFC 01/20] tools: rename COLO 'postcopy' to 'aftercopy'
  2017-03-27  9:06 ` [PATCH RFC 01/20] tools: rename COLO 'postcopy' to 'aftercopy' Joshua Otto
@ 2017-03-28 16:34   ` Wei Liu
  2017-04-11  6:19     ` Zhang Chen
  0 siblings, 1 reply; 53+ messages in thread
From: Wei Liu @ 2017-03-28 16:34 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, zhangchen.fnst, andrew.cooper3, ian.jackson, czylin,
	imhy.yang, xen-devel, hjarmstr

Cc Chen

On Mon, Mar 27, 2017 at 05:06:13AM -0400, Joshua Otto wrote:
> The COLO xc domain save and restore procedures both make use of a 'postcopy'
> callback to defer part of each checkpoint operation to xl.  In this context, the
> name 'postcopy' is meant as "the callback invoked immediately after this
> checkpoint's memory callback."  This is an unfortunate name collision with the
> other common use of 'postcopy' in the context of live migration, where it is
> used to mean "a memory migration that permits the guest to execute at the
> destination before all of its memory is migrated by servicing accesses to
> unmigrated memory via a network page-fault."
> 
> Mechanically rename 'postcopy' -> 'aftercopy' to free up the postcopy namespace
> while preserving the original intent of the name in the COLO context.
> 
> No functional change.
> 
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> ---
>  tools/libxc/include/xenguest.h     | 4 ++--
>  tools/libxc/xc_sr_restore.c        | 4 ++--
>  tools/libxc/xc_sr_save.c           | 4 ++--
>  tools/libxl/libxl_colo_restore.c   | 2 +-
>  tools/libxl/libxl_colo_save.c      | 2 +-
>  tools/libxl/libxl_remus.c          | 2 +-
>  tools/libxl/libxl_save_msgs_gen.pl | 2 +-
>  7 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index 40902ee..aa8cc8b 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -53,7 +53,7 @@ struct save_callbacks {
>       * xc_domain_save then flushes the output buffer, while the
>       *  guest continues to run.
>       */
> -    int (*postcopy)(void* data);
> +    int (*aftercopy)(void* data);
>  
>      /* Called after the memory checkpoint has been flushed
>       * out into the network. Typical actions performed in this
> @@ -115,7 +115,7 @@ struct restore_callbacks {
>       * Callback function resumes the guest & the device model,
>       * returns to xc_domain_restore.
>       */
> -    int (*postcopy)(void* data);
> +    int (*aftercopy)(void* data);
>  
>      /* A checkpoint record has been found in the stream.
>       * returns: */
> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> index 3549f0a..ee06b3d 100644
> --- a/tools/libxc/xc_sr_restore.c
> +++ b/tools/libxc/xc_sr_restore.c
> @@ -576,7 +576,7 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
>                                                  ctx->restore.callbacks->data);
>  
>          /* Resume secondary vm */
> -        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
> +        ret = ctx->restore.callbacks->aftercopy(ctx->restore.callbacks->data);
>          HANDLE_CALLBACK_RETURN_VALUE(ret);
>  
>          /* Wait for a new checkpoint */
> @@ -855,7 +855,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>      {
>          /* this is COLO restore */
>          assert(callbacks->suspend &&
> -               callbacks->postcopy &&
> +               callbacks->aftercopy &&
>                 callbacks->wait_checkpoint &&
>                 callbacks->restore_results);
>      }
> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> index f98c827..fc63a55 100644
> --- a/tools/libxc/xc_sr_save.c
> +++ b/tools/libxc/xc_sr_save.c
> @@ -863,7 +863,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
>                  }
>              }
>  
> -            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
> +            rc = ctx->save.callbacks->aftercopy(ctx->save.callbacks->data);
>              if ( rc <= 0 )
>                  goto err;
>  
> @@ -951,7 +951,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
>      if ( hvm )
>          assert(callbacks->switch_qemu_logdirty);
>      if ( ctx.save.checkpointed )
> -        assert(callbacks->checkpoint && callbacks->postcopy);
> +        assert(callbacks->checkpoint && callbacks->aftercopy);
>      if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
>          assert(callbacks->wait_checkpoint);
>  
> diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
> index 0c535bd..7d8f9ff 100644
> --- a/tools/libxl/libxl_colo_restore.c
> +++ b/tools/libxl/libxl_colo_restore.c
> @@ -246,7 +246,7 @@ void libxl__colo_restore_setup(libxl__egc *egc,
>      if (init_dsps(&crcs->dsps))
>          goto out;
>  
> -    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
> +    callbacks->aftercopy = libxl__colo_restore_domain_resume_callback;
>      callbacks->wait_checkpoint = libxl__colo_restore_domain_wait_checkpoint_callback;
>      callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
>      callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
> diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
> index f687d5a..5921196 100644
> --- a/tools/libxl/libxl_colo_save.c
> +++ b/tools/libxl/libxl_colo_save.c
> @@ -145,7 +145,7 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
>  
>      callbacks->suspend = libxl__colo_save_domain_suspend_callback;
>      callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
> -    callbacks->postcopy = libxl__colo_save_domain_resume_callback;
> +    callbacks->aftercopy = libxl__colo_save_domain_resume_callback;
>      callbacks->wait_checkpoint = libxl__colo_save_domain_wait_checkpoint_callback;
>  
>      libxl__checkpoint_devices_setup(egc, &dss->cds);
> diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
> index 29a4783..1453365 100644
> --- a/tools/libxl/libxl_remus.c
> +++ b/tools/libxl/libxl_remus.c
> @@ -110,7 +110,7 @@ void libxl__remus_setup(libxl__egc *egc, libxl__remus_state *rs)
>      dss->sws.checkpoint_callback = remus_checkpoint_stream_written;
>  
>      callbacks->suspend = libxl__remus_domain_suspend_callback;
> -    callbacks->postcopy = libxl__remus_domain_resume_callback;
> +    callbacks->aftercopy = libxl__remus_domain_resume_callback;
>      callbacks->checkpoint = libxl__remus_domain_save_checkpoint_callback;
>  
>      libxl__checkpoint_devices_setup(egc, cds);
> diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
> index 3ae7373..27845bb 100755
> --- a/tools/libxl/libxl_save_msgs_gen.pl
> +++ b/tools/libxl/libxl_save_msgs_gen.pl
> @@ -24,7 +24,7 @@ our @msgs = (
>                                                  'unsigned long', 'done',
>                                                  'unsigned long', 'total'] ],
>      [  3, 'srcxA',  "suspend", [] ],
> -    [  4, 'srcxA',  "postcopy", [] ],
> +    [  4, 'srcxA',  "aftercopy", [] ],
>      [  5, 'srcxA',  "checkpoint", [] ],
>      [  6, 'srcxA',  "wait_checkpoint", [] ],
>      [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
> -- 
> 2.7.4
> 


* Re: [PATCH RFC 02/20] libxc/xc_sr: parameterise write_record() on fd
  2017-03-27  9:06 ` [PATCH RFC 02/20] libxc/xc_sr: parameterise write_record() on fd Joshua Otto
@ 2017-03-28 18:53   ` Andrew Cooper
  2017-03-31 14:19   ` Wei Liu
  1 sibling, 0 replies; 53+ messages in thread
From: Andrew Cooper @ 2017-03-28 18:53 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> Right now, write_split_record() - which is delegated to by
> write_record() - implicitly writes to ctx->fd.  This means it can't be
> used with the restore context's send_back_fd, which is unhandy.

Unhelpful?

>
> Add an 'fd' parameter to both write_record() and write_split_record(),
> and mechanically update all existing callsites to pass ctx->fd for it.
>
> No functional change.
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH RFC 03/20] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list()
  2017-03-27  9:06 ` [PATCH RFC 03/20] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list() Joshua Otto
@ 2017-03-28 18:56   ` Andrew Cooper
  2017-03-31 14:19   ` Wei Liu
  1 sibling, 0 replies; 53+ messages in thread
From: Andrew Cooper @ 2017-03-28 18:56 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> Teach send_checkpoint_dirty_pfn_list() to use write_record()'s new fd
> parameter, avoiding the need for a manual writev().
>
> No functional change.
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

Hmm - I could have sworn I objected to the patch which added this code
in the first place, for its opencoded use of writev().

Oh well, thanks for fixing it.

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH RFC 04/20] libxc/xc_sr_save.c: add WRITE_TRIVIAL_RECORD_FN()
  2017-03-27  9:06 ` [PATCH RFC 04/20] libxc/xc_sr_save.c: add WRITE_TRIVIAL_RECORD_FN() Joshua Otto
@ 2017-03-28 19:03   ` Andrew Cooper
  2017-03-30  4:28     ` Joshua Otto
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-28 19:03 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> Writing the libxc save stream requires writing a few 'trivial' records,
> consisting only of a header with a particular type.  As a readability
> aid, it's nice to have obviously-named functions that write these sorts
> of records into the stream - for example, the first such function was
> write_end_record(), which reads much more pleasantly at its call-site
> than write_generic_record(REC_TYPE_END) would.  However, it's tedious
> and error-prone to copy-paste the generic body of such a function for
> each new trivial record type.
>
> Add a helper macro that takes a name base and a record type and declares
> the corresponding trivial record write function.  Use this to re-define
> the two existing trivial record functions, write_end_record() and
> write_checkpoint_record().
>
> No functional change.
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

-1.

This hides the functions from tools like cscope, and makes the code
harder to read.  I also don't really buy the error prone argument.

If you do want to avoid opencoding different functions, how about

static int write_zerolength_record(uint32_t record_type)

and updating the existing callsites to be

write_zerolength_record(REC_TYPE_END); etc.
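A minimal sketch of that helper (assuming it also takes the usual ctx
parameter, and the existing write_record(ctx, &rec) signature):

static int write_zerolength_record(struct xc_sr_context *ctx,
                                   uint32_t record_type)
{
    struct xc_sr_record rec = { .type = record_type };

    return write_record(ctx, &rec);
}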

~Andrew


* Re: [PATCH RFC 05/20] libxc/xc_sr: factor out filter_pages()
  2017-03-27  9:06 ` [PATCH RFC 05/20] libxc/xc_sr: factor out filter_pages() Joshua Otto
@ 2017-03-28 19:27   ` Andrew Cooper
  2017-03-30  4:42     ` Joshua Otto
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-28 19:27 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> index 481a904..8574ee8 100644
> --- a/tools/libxc/xc_sr_restore.c
> +++ b/tools/libxc/xc_sr_restore.c
> @@ -194,6 +194,68 @@ int populate_pfns(struct xc_sr_context *ctx, unsigned count,
>      return rc;
>  }
>  
> +static void set_page_types(struct xc_sr_context *ctx, unsigned count,
> +                           xen_pfn_t *pfns, uint32_t *types)
> +{
> +    unsigned i;

Please use unsigned int rather than just "unsigned" throughout.

> +
> +    for ( i = 0; i < count; ++i )
> +        ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
> +}
> +
> +/*
> + * Given count pfns and their types, allocate and fill in buffer bpfns with only
> + * those pfns that are 'backed' by real page data that needs to be migrated.
> + * The caller must later free() *bpfns.
> + *
> + * Returns 0 on success and non-0 on failure.  *bpfns can be free()ed even after
> + * failure.
> + */
> +static int filter_pages(struct xc_sr_context *ctx,
> +                        unsigned count,
> +                        xen_pfn_t *pfns,
> +                        uint32_t *types,
> +                        /* OUT */ unsigned *nr_pages,
> +                        /* OUT */ xen_pfn_t **bpfns)
> +{
> +    xc_interface *xch = ctx->xch;

Pointers to arrays are very easy to get wrong in C.  This code will be
less error-prone if you use

xen_pfn_t *_pfns;  (variable name subject to improvement)

> +    unsigned i;
> +
> +    *nr_pages = 0;
> +    *bpfns = malloc(count * sizeof(*bpfns));

_pfns = *bpfns = malloc(...).

Then use _pfns in place of (*bpfns) everywhere else.

However, your sizeof has the wrong indirection.  It works on x86
because xen_pfn_t is the same size as a pointer, but it will blow up on
32bit ARM, where a pointer is 4 bytes but xen_pfn_t is 8 bytes.
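
Concretely, a sketch of the combined fix (keeping your bpfns out-parameter):

    xen_pfn_t *_pfns;

    _pfns = *bpfns = malloc(count * sizeof(*_pfns));

so the element size is taken from xen_pfn_t itself rather than from the
pointer, which stays correct on 32bit ARM as well.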

> +    if ( !(*bpfns) )
> +    {
> +        ERROR("Failed to allocate %zu bytes to process page data",
> +              count * (sizeof(*bpfns)));
> +        return -1;
> +    }
> +
> +    for ( i = 0; i < count; ++i )
> +    {
> +        switch ( types[i] )
> +        {
> +        case XEN_DOMCTL_PFINFO_NOTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L1TAB:
> +        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L2TAB:
> +        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L3TAB:
> +        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L4TAB:
> +        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +            (*bpfns)[(*nr_pages)++] = pfns[i];
> +            break;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  /*
>   * Given a list of pfns, their types, and a block of page data from the
>   * stream, populate and record their types, map the relevant subset and copy
> @@ -203,7 +265,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
>                               xen_pfn_t *pfns, uint32_t *types, void *page_data)
>  {
>      xc_interface *xch = ctx->xch;
> -    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
> +    xen_pfn_t *mfns = NULL;

This shows a naming bug, which is my fault.  This should be named gfns,
not mfns.  (It inherits its name from the legacy migration code, but
that was also wrong.)

Please correct it, either in this patch or another; the memory
management terms are hard enough, even when all the code is correct.

~Andrew

>      int *map_errs = malloc(count * sizeof(*map_errs));
>      int rc;
>      void *mapping = NULL, *guest_page = NULL;
> @@ -211,11 +273,11 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
>          j,         /* j indexes the subset of pfns we decide to map. */
>          nr_pages = 0;
>  
> -    if ( !mfns || !map_errs )
> +    if ( !map_errs )
>      {
>          rc = -1;
>          ERROR("Failed to allocate %zu bytes to process page data",
> -              count * (sizeof(*mfns) + sizeof(*map_errs)));
> +              count * sizeof(*map_errs));
>          goto err;
>      }
>  
>



* Re: [PATCH RFC 06/20] libxc/xc_sr: factor helpers out of handle_page_data()
  2017-03-27  9:06 ` [PATCH RFC 06/20] libxc/xc_sr: factor helpers out of handle_page_data() Joshua Otto
@ 2017-03-28 19:52   ` Andrew Cooper
  2017-03-30  4:49     ` Joshua Otto
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-28 19:52 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
> index 3291b25..32400b2 100644
> --- a/tools/libxc/xc_sr_stream_format.h
> +++ b/tools/libxc/xc_sr_stream_format.h
> @@ -80,15 +80,15 @@ struct xc_sr_rhdr
>  #define REC_TYPE_OPTIONAL             0x80000000U
>  
>  /* PAGE_DATA */
> -struct xc_sr_rec_page_data_header
> +struct xc_sr_rec_pages_header
>  {
>      uint32_t count;
>      uint32_t _res1;
>      uint64_t pfn[0];
>  };
>  
> -#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
> -#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
> +#define REC_PFINFO_PFN_MASK  0x000fffffffffffffULL
> +#define REC_PFINFO_TYPE_MASK 0xf000000000000000ULL
>  
>  /* X86_PV_INFO */
>  struct xc_sr_rec_x86_pv_info

What are the purposes of these name changes?

~Andrew



* Re: [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free()
  2017-03-27  9:06 ` [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free() Joshua Otto
@ 2017-03-28 19:59   ` Andrew Cooper
  2017-03-29 17:47     ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-28 19:59 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> colo_merge_secondary_dirty_bitmap() unconditionally free()s the .data
> member of its local xc_sr_record structure rec on its exit path.
> However, if the initial call to read_record() fails then this member is
> uninitialised.  Initialise it.
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

This bugfix should be taken ASAP, and needs backporting to Xen 4.7 and 4.8

> ---
>  tools/libxc/xc_sr_save.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> index ac97d93..6acc8d3 100644
> --- a/tools/libxc/xc_sr_save.c
> +++ b/tools/libxc/xc_sr_save.c
> @@ -681,7 +681,7 @@ static int send_memory_live(struct xc_sr_context *ctx)
>  static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
>  {
>      xc_interface *xch = ctx->xch;
> -    struct xc_sr_record rec;
> +    struct xc_sr_record rec = { 0, 0, NULL };
>      uint64_t *pfns = NULL;
>      uint64_t pfn;
>      unsigned count, i;



* Re: [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free()
  2017-03-28 19:59   ` Andrew Cooper
@ 2017-03-29 17:47     ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-03-29 17:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: wei.liu2, ian.jackson, czylin, Joshua Otto, imhy.yang, xen-devel,
	hjarmstr

On Tue, Mar 28, 2017 at 08:59:09PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > colo_merge_secondary_dirty_bitmap() unconditionally free()s the .data
> > member of its local xc_sr_record structure rec on its exit path.
> > However, if the initial call to read_record() fails then this member is
> > uninitialised.  Initialise it.
> >
> > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> 
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> This bugfix should be taken ASAP, and needs backporting to Xen 4.7 and 4.8

Acked + applied.

> 
> > ---
> >  tools/libxc/xc_sr_save.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> > index ac97d93..6acc8d3 100644
> > --- a/tools/libxc/xc_sr_save.c
> > +++ b/tools/libxc/xc_sr_save.c
> > @@ -681,7 +681,7 @@ static int send_memory_live(struct xc_sr_context *ctx)
> >  static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
> >  {
> >      xc_interface *xch = ctx->xch;
> > -    struct xc_sr_record rec;
> > +    struct xc_sr_record rec = { 0, 0, NULL };
> >      uint64_t *pfns = NULL;
> >      uint64_t pfn;
> >      unsigned count, i;
> 


* Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl
  2017-03-27  9:06 ` [PATCH RFC 07/20] migration: defer precopy policy to libxl Joshua Otto
@ 2017-03-29 18:54   ` Jennifer Herbert
  2017-03-30  5:28     ` Joshua Otto
  2017-03-29 20:18   ` Andrew Cooper
  1 sibling, 1 reply; 53+ messages in thread
From: Jennifer Herbert @ 2017-03-29 18:54 UTC (permalink / raw)
  To: Joshua Otto, xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, imhy.yang, hjarmstr

Hi,

I would like to encourage this patch - as I have use for it outside
of your postcopy work.

Some things people will comment on:
You've used 'unsigned' without the int keyword, which people don't like.
Also on line 324, you're missing a space between 'if (' and
'ctx->save.policy_decision'.

Also, I'm not a fan of your CONSULT_POLICY macro, which you've defined at
an odd point in your function, and I think it could be done more elegantly.
Worst of all ... it's a macro - which I think should generally be avoided
unless there is little choice.  I'm sure you could write a helper function
to replace this.
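
As a sketch of what I mean (the name is purely illustrative):

static bool policy_wants_precopy(struct xc_sr_context *ctx, int *rc)
{
    if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
        *rc = -1;
    else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
        *rc = 0;
    else
        return true;

    return false;
}

and then each CONSULT_POLICY site becomes

    if ( !policy_wants_precopy(ctx, &rc) )
        goto out;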

Cheers,

-jenny

On 27/03/17 10:06, Joshua Otto wrote:
> The precopy phase of the xc_domain_save() live migration algorithm has
> historically been implemented to run until either a) (almost) no pages
> are dirty or b) some fixed, hard-coded maximum number of precopy
> iterations has been exceeded.  This policy and its implementation are
> less than ideal for a few reasons:
> - the logic of the policy is intertwined with the control flow of the
>    mechanism of the precopy stage
> - it can't take into account facts external to the immediate
>    migration context, such as interactive user input or the passage of
>    wall-clock time
> - it does not permit the user to change their mind, over time, about
>    what to do at the end of the precopy (they get an unconditional
>    transition into the stop-and-copy phase of the migration)
>
> To permit users to implement arbitrary higher-level policies governing
> when the live migration precopy phase should end, and what should be
> done next:
> - add a precopy_policy() callback to the xc_domain_save() user-supplied
>    callbacks
> - during the precopy phase of live migrations, consult this policy after
>    each batch of pages transmitted and take the dictated action, which
>    may be to a) abort the migration entirely, b) continue with the
>    precopy, or c) proceed to the stop-and-copy phase.
> - provide an implementation of the old policy as such a callback in
>    libxl and plumb it through the IPC machinery to libxc, effectively
>    maintaining the old policy for now
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> ---
>   tools/libxc/include/xenguest.h     |  23 ++++-
>   tools/libxc/xc_nomigrate.c         |   3 +-
>   tools/libxc/xc_sr_common.h         |   7 +-
>   tools/libxc/xc_sr_save.c           | 194 ++++++++++++++++++++++++++-----------
>   tools/libxl/libxl_dom_save.c       |  20 ++++
>   tools/libxl/libxl_save_callout.c   |   3 +-
>   tools/libxl/libxl_save_helper.c    |   7 +-
>   tools/libxl/libxl_save_msgs_gen.pl |   4 +-
>   8 files changed, 189 insertions(+), 72 deletions(-)
>
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index aa8cc8b..30ffb6f 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -39,6 +39,14 @@
>    */
>   struct xenevtchn_handle;
>   
> +/* For save's precopy_policy(). */
> +struct precopy_stats
> +{
> +    unsigned iteration;
> +    unsigned total_written;
> +    long dirty_count; /* -1 if unknown */
> +};
> +
>   /* callbacks provided by xc_domain_save */
>   struct save_callbacks {
>       /* Called after expiration of checkpoint interval,
> @@ -46,6 +54,17 @@ struct save_callbacks {
>        */
>       int (*suspend)(void* data);
>   
> +    /* Called after every batch of page data sent during the precopy phase of a
> +     * live migration to ask the caller what to do next based on the current
> +     * state of the precopy migration.
> +     */
> +#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
> +                                        * tidy up. */
> +#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
> +#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
> +                                        * remaining dirty pages. */
> +    int (*precopy_policy)(struct precopy_stats stats, void *data);
> +
>       /* Called after the guest's dirty pages have been
>        *  copied into an output buffer.
>        * Callback function resumes the guest & the device model,
> @@ -100,8 +119,8 @@ typedef enum {
>    *        doesn't use checkpointing
>    * @return 0 on success, -1 on failure
>    */
> -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> -                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
> +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> +                   uint32_t flags /* XCFLAGS_xxx */,
>                      struct save_callbacks* callbacks, int hvm,
>                      xc_migration_stream_t stream_type, int recv_fd);
>   
> diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
> index 15c838f..2af64e4 100644
> --- a/tools/libxc/xc_nomigrate.c
> +++ b/tools/libxc/xc_nomigrate.c
> @@ -20,8 +20,7 @@
>   #include <xenctrl.h>
>   #include <xenguest.h>
>   
> -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> -                   uint32_t max_factor, uint32_t flags,
> +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
>                      struct save_callbacks* callbacks, int hvm,
>                      xc_migration_stream_t stream_type, int recv_fd)
>   {
> diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> index b1aa88e..a9160bd 100644
> --- a/tools/libxc/xc_sr_common.h
> +++ b/tools/libxc/xc_sr_common.h
> @@ -198,12 +198,11 @@ struct xc_sr_context
>               /* Further debugging information in the stream. */
>               bool debug;
>   
> -            /* Parameters for tweaking live migration. */
> -            unsigned max_iterations;
> -            unsigned dirty_threshold;
> -
>               unsigned long p2m_size;
>   
> +            struct precopy_stats stats;
> +            int policy_decision;
> +
>               xen_pfn_t *batch_pfns;
>               unsigned nr_batch_pfns;
>               unsigned long *deferred_pages;
> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> index 797aec5..eb95334 100644
> --- a/tools/libxc/xc_sr_save.c
> +++ b/tools/libxc/xc_sr_save.c
> @@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx)
>   }
>   
>   /*
> + * Test if the batch is full.
> + */
> +static bool batch_full(struct xc_sr_context *ctx)
> +{
> +    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
> +}
> +
> +/*
> + * Test if the batch is empty.
> + */
> +static bool batch_empty(struct xc_sr_context *ctx)
> +{
> +    return ctx->save.nr_batch_pfns == 0;
> +}
> +
> +/*
>    * Flush a batch of pfns into the stream.
>    */
>   static int flush_batch(struct xc_sr_context *ctx)
>   {
>       int rc = 0;
>   
> -    if ( ctx->save.nr_batch_pfns == 0 )
> +    if ( batch_empty(ctx) )
>           return rc;
>   
>       rc = write_batch(ctx);
> @@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx)
>   }
>   
>   /*
> - * Add a single pfn to the batch, flushing the batch if full.
> + * Add a single pfn to the batch.
>    */
> -static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> +static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
>   {
> -    int rc = 0;
> -
> -    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
> -        rc = flush_batch(ctx);
> -
> -    if ( rc == 0 )
> -        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> -
> -    return rc;
> +    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
> +    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
>   }
>   
>   /*
> @@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx)
>    * Send a subset of pages in the guests p2m, according to the dirty bitmap.
>    * Used for each subsequent iteration of the live migration loop.
>    *
> + * During the precopy stage of a live migration, test the user-supplied
> + * policy function after each batch of pages and cut off the operation
> + * early if indicated.  Unless aborting, the dirty pages remaining in this round
> + * are transferred into the deferred_pages bitmap.
> + *
>    * Bitmap is bounded by p2m_size.
>    */
>   static int send_dirty_pages(struct xc_sr_context *ctx,
> -                            unsigned long entries)
> +                            unsigned long entries, bool precopy)
>   {
>       xc_interface *xch = ctx->xch;
>       xen_pfn_t p;
> @@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
>       DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
>                                       &ctx->save.dirty_bitmap_hbuf);
>   
> -    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
> +    int (*precopy_policy)(struct precopy_stats, void *) =
> +        ctx->save.callbacks->precopy_policy;
> +    void *data = ctx->save.callbacks->data;
> +
> +    assert(batch_empty(ctx));
> +    for ( p = 0, written = 0; p < ctx->save.p2m_size; )
>       {
> -        if ( !test_bit(p, dirty_bitmap) )
> -            continue;
> +        if ( ctx->save.live && precopy )
> +        {
> +            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
> +            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> +            {
> +                return -1;
> +            }
> +            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> +            {
> +                /* Any outstanding dirty pages are now deferred until the next
> +                 * phase of the migration. */
> +                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> +                          ctx->save.p2m_size);
> +                if ( entries > written )
> +                    ctx->save.nr_deferred_pages += entries - written;
> +
> +                goto done;
> +            }
> +        }
>   
> -        rc = add_to_batch(ctx, p);
> +        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
> +        {
> +            if ( test_and_clear_bit(p, dirty_bitmap) )
> +            {
> +                add_to_batch(ctx, p);
> +                ++written;
> +                ++ctx->save.stats.total_written;
> +            }
> +        }
> +
> +        rc = flush_batch(ctx);
>           if ( rc )
>               return rc;
>   
> -        /* Update progress every 4MB worth of memory sent. */
> -        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
> -            xc_report_progress_step(xch, written, entries);
> -
> -        ++written;
> +        /* Update progress after every batch (4MB) worth of memory sent. */
> +        xc_report_progress_step(xch, written, entries);
>       }
>   
> -    rc = flush_batch(ctx);
> -    if ( rc )
> -        return rc;
> -
>       if ( written > entries )
>           DPRINTF("Bitmap contained more entries than expected...");
>   
>       xc_report_progress_step(xch, entries, entries);
>   
> + done:
>       return ctx->save.ops.check_vm_state(ctx);
>   }
>   
> @@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
>    * Send all pages in the guests p2m.  Used as the first iteration of the live
>    * migration loop, and for a non-live save.
>    */
> -static int send_all_pages(struct xc_sr_context *ctx)
> +static int send_all_pages(struct xc_sr_context *ctx, bool precopy)
>   {
>       DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
>                                       &ctx->save.dirty_bitmap_hbuf);
>   
>       bitmap_set(dirty_bitmap, ctx->save.p2m_size);
>   
> -    return send_dirty_pages(ctx, ctx->save.p2m_size);
> +    return send_dirty_pages(ctx, ctx->save.p2m_size, precopy);
>   }
>   
>   static int enable_logdirty(struct xc_sr_context *ctx)
> @@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
>       xc_interface *xch = ctx->xch;
>       char *new_str = NULL;
>   
> -    if ( asprintf(&new_str, "Frames iteration %u of %u",
> -                  iter, ctx->save.max_iterations) == -1 )
> +    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
>       {
>           PERROR("Unable to allocate new progress string");
>           return -1;
> @@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx)
>       xc_interface *xch = ctx->xch;
>       xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
>       char *progress_str = NULL;
> -    unsigned x;
>       int rc;
>   
> +    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> +                                    &ctx->save.dirty_bitmap_hbuf);
> +
> +    int (*precopy_policy)(struct precopy_stats, void *) =
> +        ctx->save.callbacks->precopy_policy;
> +    void *data = ctx->save.callbacks->data;
> +
>       rc = update_progress_string(ctx, &progress_str, 0);
>       if ( rc )
>           goto out;
>   
> -    rc = send_all_pages(ctx);
> +#define CONSULT_POLICY                                                        \
> +    do {                                                                      \
> +        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )                  \
> +        {                                                                     \
> +            rc = -1;                                                          \
> +            goto out;                                                         \
> +        }                                                                     \
> +        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )  \
> +        {                                                                     \
> +            rc = 0;                                                           \
> +            goto out;                                                         \
> +        }                                                                     \
> +    } while (0)
> +
> +    ctx->save.stats = (struct precopy_stats)
> +        {
> +            .iteration     = 0,
> +            .total_written = 0,
> +            .dirty_count   = -1
> +        };
> +    rc = send_all_pages(ctx, /* precopy */ true);
>       if ( rc )
>           goto out;
>   
> -    for ( x = 1;
> -          ((x < ctx->save.max_iterations) &&
> -           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
> +    /* send_all_pages() has updated the stats */
> +    CONSULT_POLICY;
> +
> +    for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration )
>       {
>           if ( xc_shadow_control(
>                    xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
> @@ -493,18 +559,42 @@ static int send_memory_live(struct xc_sr_context *ctx)
>               goto out;
>           }
>   
> -        if ( stats.dirty_count == 0 )
> -            break;
> +        /* Check the new dirty_count against the policy. */
> +        ctx->save.stats.dirty_count = stats.dirty_count;
> +        ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
> +        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> +        {
> +            rc = -1;
> +            goto out;
> +        }
> +        else if (ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> +        {
> +            bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> +                      ctx->save.p2m_size);
> +            ctx->save.nr_deferred_pages += stats.dirty_count;
> +            rc = 0;
> +            goto out;
> +        }
> +
> +        /* After this point we won't know how many pages are really dirty until
> +         * the next iteration. */
> +        ctx->save.stats.dirty_count = -1;
>   
> -        rc = update_progress_string(ctx, &progress_str, x);
> +        rc = update_progress_string(ctx, &progress_str,
> +                                    ctx->save.stats.iteration);
>           if ( rc )
>               goto out;
>   
> -        rc = send_dirty_pages(ctx, stats.dirty_count);
> +        rc = send_dirty_pages(ctx, stats.dirty_count, /* precopy */ true);
>           if ( rc )
>               goto out;
> +
> +        /* send_dirty_pages() has updated the stats */
> +        CONSULT_POLICY;
>       }
>   
> +#undef CONSULT_POLICY
> +
>    out:
>       xc_set_progress_prefix(xch, NULL);
>       free(progress_str);
> @@ -595,7 +685,7 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
>       if ( ctx->save.live )
>       {
>           rc = update_progress_string(ctx, &progress_str,
> -                                    ctx->save.max_iterations);
> +                                    ctx->save.stats.iteration);
>           if ( rc )
>               goto out;
>       }
> @@ -614,7 +704,8 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
>           }
>       }
>   
> -    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
> +    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages,
> +                          /* precopy */ false);
>       if ( rc )
>           goto out;
>   
> @@ -645,7 +736,7 @@ static int verify_frames(struct xc_sr_context *ctx)
>           goto out;
>   
>       xc_set_progress_prefix(xch, "Frames verify");
> -    rc = send_all_pages(ctx);
> +    rc = send_all_pages(ctx, /* precopy */ false);
>       if ( rc )
>           goto out;
>   
> @@ -719,7 +810,7 @@ static int send_domain_memory_nonlive(struct xc_sr_context *ctx)
>   
>       xc_set_progress_prefix(xch, "Frames");
>   
> -    rc = send_all_pages(ctx);
> +    rc = send_all_pages(ctx, /* precopy */ false);
>       if ( rc )
>           goto err;
>   
> @@ -910,8 +1001,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
>   };
>   
>   int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> -                   uint32_t max_iters, uint32_t max_factor, uint32_t flags,
> -                   struct save_callbacks* callbacks, int hvm,
> +                   uint32_t flags, struct save_callbacks* callbacks, int hvm,
>                      xc_migration_stream_t stream_type, int recv_fd)
>   {
>       struct xc_sr_context ctx =
> @@ -932,25 +1022,17 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
>              stream_type == XC_MIG_STREAM_REMUS ||
>              stream_type == XC_MIG_STREAM_COLO);
>   
> -    /*
> -     * TODO: Find some time to better tweak the live migration algorithm.
> -     *
> -     * These parameters are better than the legacy algorithm especially for
> -     * busy guests.
> -     */
> -    ctx.save.max_iterations = 5;
> -    ctx.save.dirty_threshold = 50;
> -
>       /* Sanity checks for callbacks. */
>       if ( hvm )
>           assert(callbacks->switch_qemu_logdirty);
> +    if ( ctx.save.live )
> +        assert(callbacks->precopy_policy);
>       if ( ctx.save.checkpointed )
>           assert(callbacks->checkpoint && callbacks->aftercopy);
>       if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
>           assert(callbacks->wait_checkpoint);
>   
> -    DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
> -            io_fd, dom, max_iters, max_factor, flags, hvm);
> +    DPRINTF("fd %d, dom %u, flags %u, hvm %d", io_fd, dom, flags, hvm);
>   
>       if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
>       {
> diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
> index 77fe30e..6d28cce 100644
> --- a/tools/libxl/libxl_dom_save.c
> +++ b/tools/libxl/libxl_dom_save.c
> @@ -328,6 +328,25 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
>       return rc;
>   }
>   
> +/*
> + * This is the live migration precopy policy - it's called periodically during
> + * the precopy phase of live migrations, and is responsible for deciding when
> + * the precopy phase should terminate and what should be done next.
> + *
> + * The policy implemented here behaves identically to the policy previously
> + * hard-coded into xc_domain_save() - it proceeds to the stop-and-copy phase of
> + * the live migration when there are either fewer than 50 dirty pages, or more
> + * than 5 precopy rounds have completed.
> + */
> +static int libxl__save_live_migration_simple_precopy_policy(
> +    struct precopy_stats stats, void *user)
> +{
> +    return ((stats.dirty_count >= 0 && stats.dirty_count < 50) ||
> +            stats.iteration >= 5)
> +        ? XGS_POLICY_STOP_AND_COPY
> +        : XGS_POLICY_CONTINUE_PRECOPY;
> +}
> +
>   /*----- main code for saving, in order of execution -----*/
>   
>   void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
> @@ -401,6 +420,7 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
>       if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
>           callbacks->suspend = libxl__domain_suspend_callback;
>   
> +    callbacks->precopy_policy = libxl__save_live_migration_simple_precopy_policy;
>       callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
>   
>       dss->sws.ao  = dss->ao;
> diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
> index 46b892c..026b572 100644
> --- a/tools/libxl/libxl_save_callout.c
> +++ b/tools/libxl/libxl_save_callout.c
> @@ -89,8 +89,7 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_save_state *dss,
>           libxl__srm_callout_enumcallbacks_save(&shs->callbacks.save.a);
>   
>       const unsigned long argnums[] = {
> -        dss->domid, 0, 0, dss->xcflags, dss->hvm,
> -        cbflags, dss->checkpointed_stream,
> +        dss->domid, dss->xcflags, dss->hvm, cbflags, dss->checkpointed_stream,
>       };
>   
>       shs->ao = ao;
> diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c
> index d3def6b..0241a6b 100644
> --- a/tools/libxl/libxl_save_helper.c
> +++ b/tools/libxl/libxl_save_helper.c
> @@ -251,8 +251,6 @@ int main(int argc, char **argv)
>           io_fd =                             atoi(NEXTARG);
>           recv_fd =                           atoi(NEXTARG);
>           uint32_t dom =                      strtoul(NEXTARG,0,10);
> -        uint32_t max_iters =                strtoul(NEXTARG,0,10);
> -        uint32_t max_factor =               strtoul(NEXTARG,0,10);
>           uint32_t flags =                    strtoul(NEXTARG,0,10);
>           int hvm =                           atoi(NEXTARG);
>           unsigned cbflags =                  strtoul(NEXTARG,0,10);
> @@ -264,9 +262,8 @@ int main(int argc, char **argv)
>           startup("save");
>           setup_signals(save_signal_handler);
>   
> -        r = xc_domain_save(xch, io_fd, dom, max_iters, max_factor, flags,
> -                           &helper_save_callbacks, hvm, stream_type,
> -                           recv_fd);
> +        r = xc_domain_save(xch, io_fd, dom, flags, &helper_save_callbacks, hvm,
> +                           stream_type, recv_fd);
>           complete(r);
>   
>       } else if (!strcmp(mode,"--restore-domain")) {
> diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
> index 27845bb..50c97b4 100755
> --- a/tools/libxl/libxl_save_msgs_gen.pl
> +++ b/tools/libxl/libxl_save_msgs_gen.pl
> @@ -33,6 +33,7 @@ our @msgs = (
>                                                 'xen_pfn_t', 'console_gfn'] ],
>       [  9, 'srW',    "complete",              [qw(int retval
>                                                    int errnoval)] ],
> +    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
>   );
>   
>   #----------------------------------------
> @@ -141,7 +142,8 @@ static void bytes_put(unsigned char *const buf, int *len,
>   
>   END
>   
> -foreach my $simpletype (qw(int uint16_t uint32_t unsigned), 'unsigned long', 'xen_pfn_t') {
> +foreach my $simpletype (qw(int uint16_t uint32_t unsigned),
> +                        'unsigned long', 'xen_pfn_t', 'struct precopy_stats') {
>       my $typeid = typeid($simpletype);
>       $out_body{'callout'} .= <<END;
>   static int ${typeid}_get(const unsigned char **msg,



* Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl
  2017-03-27  9:06 ` [PATCH RFC 07/20] migration: defer precopy policy to libxl Joshua Otto
  2017-03-29 18:54   ` Jennifer Herbert
@ 2017-03-29 20:18   ` Andrew Cooper
  2017-03-30  5:19     ` Joshua Otto
  1 sibling, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-29 20:18 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> The precopy phase of the xc_domain_save() live migration algorithm has
> historically been implemented to run until either a) (almost) no pages
> are dirty or b) some fixed, hard-coded maximum number of precopy
> iterations has been exceeded.  This policy and its implementation are
> less than ideal for a few reasons:
> - the logic of the policy is intertwined with the control flow of the
>   mechanism of the precopy stage
> - it can't take into account facts external to the immediate
>   migration context, such as interactive user input or the passage of
>   wall-clock time
> - it does not permit the user to change their mind, over time, about
>   what to do at the end of the precopy (they get an unconditional
>   transition into the stop-and-copy phase of the migration)
>
> To permit users to implement arbitrary higher-level policies governing
> when the live migration precopy phase should end, and what should be
> done next:
> - add a precopy_policy() callback to the xc_domain_save() user-supplied
>   callbacks
> - during the precopy phase of live migrations, consult this policy after
>   each batch of pages transmitted and take the dictated action, which
>   may be to a) abort the migration entirely, b) continue with the
>   precopy, or c) proceed to the stop-and-copy phase.
> - provide an implementation of the old policy as such a callback in
>   libxl and plumb it through the IPC machinery to libxc, effectively
>   maintaining the old policy for now
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

This patch should be split into two: one modifying libxc to use struct
precopy_stats, and a second to wire up the RPC call.

> ---
>  tools/libxc/include/xenguest.h     |  23 ++++-
>  tools/libxc/xc_nomigrate.c         |   3 +-
>  tools/libxc/xc_sr_common.h         |   7 +-
>  tools/libxc/xc_sr_save.c           | 194 ++++++++++++++++++++++++++-----------
>  tools/libxl/libxl_dom_save.c       |  20 ++++
>  tools/libxl/libxl_save_callout.c   |   3 +-
>  tools/libxl/libxl_save_helper.c    |   7 +-
>  tools/libxl/libxl_save_msgs_gen.pl |   4 +-
>  8 files changed, 189 insertions(+), 72 deletions(-)
>
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index aa8cc8b..30ffb6f 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -39,6 +39,14 @@
>   */
>  struct xenevtchn_handle;
>  
> +/* For save's precopy_policy(). */
> +struct precopy_stats
> +{
> +    unsigned iteration;
> +    unsigned total_written;
> +    long dirty_count; /* -1 if unknown */

total_written and dirty_count are liable to be equal, so having them as
different widths of integer clearly can't be correct.
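
For instance (a sketch only):

struct precopy_stats
{
    unsigned int iteration;
    long total_written;
    long dirty_count; /* -1 if unknown */
};

which gives both page counts the same width while keeping -1 available as
the "unknown" sentinel for dirty_count.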

> +};
> +
>  /* callbacks provided by xc_domain_save */
>  struct save_callbacks {
>      /* Called after expiration of checkpoint interval,
> @@ -46,6 +54,17 @@ struct save_callbacks {
>       */
>      int (*suspend)(void* data);
>  
> +    /* Called after every batch of page data sent during the precopy phase of a
> +     * live migration to ask the caller what to do next based on the current
> +     * state of the precopy migration.
> +     */
> +#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
> +                                        * tidy up. */
> +#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
> +#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
> +                                        * remaining dirty pages. */
> +    int (*precopy_policy)(struct precopy_stats stats, void *data);

Structures shouldn't be passed by value like this, as the compiler has
to do a lot of memcpy() work to make it happen.  You should pass by
const pointer, as (as far as I can tell), they are strictly read-only to
the implementation of this hook?
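
i.e. sketching just the signature change:

    int (*precopy_policy)(const struct precopy_stats *stats, void *data);

with the function-pointer locals in xc_sr_save.c updated to match.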

> +
>      /* Called after the guest's dirty pages have been
>       *  copied into an output buffer.
>       * Callback function resumes the guest & the device model,
> @@ -100,8 +119,8 @@ typedef enum {
>   *        doesn't use checkpointing
>   * @return 0 on success, -1 on failure
>   */
> -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> -                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
> +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> +                   uint32_t flags /* XCFLAGS_xxx */,
>                     struct save_callbacks* callbacks, int hvm,
>                     xc_migration_stream_t stream_type, int recv_fd);

It would be cleaner for existing callers, and easier to extend in the
future, to encapsulate all of these parameters in a struct
domain_save_params and pass it here by pointer.

That way, we'd avoid the situation we currently have where some
information is passed in bitfields in a single parameter, whereas other
booleans are passed as integers.
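
Something along these lines, perhaps (the exact field set is illustrative,
not exhaustive):

struct domain_save_params
{
    uint32_t dom;                      /* domid to save */
    int save_fd;                       /* fd to write the stream to */
    int recv_fd;                       /* back-channel, checkpointed streams */
    uint32_t flags;                    /* XCFLAGS_xxx */
    xc_migration_stream_t stream_type;
};

int xc_domain_save(xc_interface *xch,
                   const struct domain_save_params *params,
                   struct save_callbacks *callbacks);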

The hvm parameter specifically is useless, and can be removed by
rearranging the sanity checks until after the xc_domain_getinfo() call.

>  
> diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
> index 15c838f..2af64e4 100644
> --- a/tools/libxc/xc_nomigrate.c
> +++ b/tools/libxc/xc_nomigrate.c
> @@ -20,8 +20,7 @@
>  #include <xenctrl.h>
>  #include <xenguest.h>
>  
> -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> -                   uint32_t max_factor, uint32_t flags,
> +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
>                     struct save_callbacks* callbacks, int hvm,
>                     xc_migration_stream_t stream_type, int recv_fd)
>  {
> diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> index b1aa88e..a9160bd 100644
> --- a/tools/libxc/xc_sr_common.h
> +++ b/tools/libxc/xc_sr_common.h
> @@ -198,12 +198,11 @@ struct xc_sr_context
>              /* Further debugging information in the stream. */
>              bool debug;
>  
> -            /* Parameters for tweaking live migration. */
> -            unsigned max_iterations;
> -            unsigned dirty_threshold;
> -
>              unsigned long p2m_size;
>  
> +            struct precopy_stats stats;
> +            int policy_decision;
> +
>              xen_pfn_t *batch_pfns;
>              unsigned nr_batch_pfns;
>              unsigned long *deferred_pages;
> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> index 797aec5..eb95334 100644
> --- a/tools/libxc/xc_sr_save.c
> +++ b/tools/libxc/xc_sr_save.c
> @@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx)
>  }
>  
>  /*
> + * Test if the batch is full.
> + */
> +static bool batch_full(struct xc_sr_context *ctx)

const struct xc_sr_context *ctx

This is a predicate, after all.

> +{
> +    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
> +}
> +
> +/*
> + * Test if the batch is empty.
> + */
> +static bool batch_empty(struct xc_sr_context *ctx)
> +{
> +    return ctx->save.nr_batch_pfns == 0;
> +}
> +
> +/*
>   * Flush a batch of pfns into the stream.
>   */
>  static int flush_batch(struct xc_sr_context *ctx)
>  {
>      int rc = 0;
>  
> -    if ( ctx->save.nr_batch_pfns == 0 )
> +    if ( batch_empty(ctx) )
>          return rc;
>  
>      rc = write_batch(ctx);
> @@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx)
>  }
>  
>  /*
> - * Add a single pfn to the batch, flushing the batch if full.
> + * Add a single pfn to the batch.
>   */
> -static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> +static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
>  {
> -    int rc = 0;
> -
> -    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
> -        rc = flush_batch(ctx);
> -
> -    if ( rc == 0 )
> -        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> -
> -    return rc;
> +    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
> +    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
>  }
>  
>  /*
> @@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx)
>   * Send a subset of pages in the guests p2m, according to the dirty bitmap.
>   * Used for each subsequent iteration of the live migration loop.
>   *
> + * During the precopy stage of a live migration, test the user-supplied
> + * policy function after each batch of pages and cut off the operation
> + * early if indicated.  Unless aborting, the dirty pages remaining in this round
> + * are transferred into the deferred_pages bitmap.

Is this actually a sensible thing to do?  On iteration 0, this is going
to be a phenomenal number of RPC calls, which are all going to make the
same decision.

> + *
>   * Bitmap is bounded by p2m_size.
>   */
>  static int send_dirty_pages(struct xc_sr_context *ctx,
> -                            unsigned long entries)
> +                            unsigned long entries, bool precopy)

Shouldn't this precopy boolean be some kind of state variable in ctx?

>  {
>      xc_interface *xch = ctx->xch;
>      xen_pfn_t p;
> @@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
>      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
>                                      &ctx->save.dirty_bitmap_hbuf);
>  
> -    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
> +    int (*precopy_policy)(struct precopy_stats, void *) =
> +        ctx->save.callbacks->precopy_policy;
> +    void *data = ctx->save.callbacks->data;
> +
> +    assert(batch_empty(ctx));
> +    for ( p = 0, written = 0; p < ctx->save.p2m_size; )

This looks suspicious without an increment.  Conceptually, it might be
better as a do {} while ( decision == XGS_POLICY_CONTINUE_PRECOPY ); loop?
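
i.e. something of this shape (a sketch only, eliding the batch-filling
body):

    p = written = 0;

    do
    {
        /* consult the policy, then fill and flush one batch, advancing p */
    } while ( p < ctx->save.p2m_size &&
              ctx->save.policy_decision == XGS_POLICY_CONTINUE_PRECOPY );

so that the termination conditions are stated explicitly in one place.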

>      {
> -        if ( !test_bit(p, dirty_bitmap) )
> -            continue;
> +        if ( ctx->save.live && precopy )
> +        {
> +            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);

Newline here please.

> +            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> +            {

Please put a log message here indicating that abort has been requested.
Otherwise, the migration will give up with a failure and no obvious
indication why.
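
e.g. reusing the ERROR() helper already used elsewhere in this file:

            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
            {
                ERROR("Precopy policy requested abort of the migration");
                return -1;
            }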

> +                return -1;
> +            }
> +            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> +            {
> +                /* Any outstanding dirty pages are now deferred until the next
> +                 * phase of the migration. */

/*
 * The comment style for multiline comments
 * is like this.
 */

> +                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> +                          ctx->save.p2m_size);
> +                if ( entries > written )
> +                    ctx->save.nr_deferred_pages += entries - written;
> +
> +                goto done;
> +            }
> +        }
>  
> -        rc = add_to_batch(ctx, p);
> +        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
> +        {
> +            if ( test_and_clear_bit(p, dirty_bitmap) )
> +            {
> +                add_to_batch(ctx, p);
> +                ++written;
> +                ++ctx->save.stats.total_written;
> +            }
> +        }
> +
> +        rc = flush_batch(ctx);
>          if ( rc )
>              return rc;
>  
> -        /* Update progress every 4MB worth of memory sent. */
> -        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
> -            xc_report_progress_step(xch, written, entries);
> -
> -        ++written;
> +        /* Update progress after every batch (4MB) worth of memory sent. */
> +        xc_report_progress_step(xch, written, entries);
>      }
>  
> -    rc = flush_batch(ctx);
> -    if ( rc )
> -        return rc;
> -
>      if ( written > entries )
>          DPRINTF("Bitmap contained more entries than expected...");
>  
>      xc_report_progress_step(xch, entries, entries);
>  
> + done:
>      return ctx->save.ops.check_vm_state(ctx);
>  }
>  
> @@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
>   * Send all pages in the guests p2m.  Used as the first iteration of the live
>   * migration loop, and for a non-live save.
>   */
> -static int send_all_pages(struct xc_sr_context *ctx)
> +static int send_all_pages(struct xc_sr_context *ctx, bool precopy)
>  {
>      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
>                                      &ctx->save.dirty_bitmap_hbuf);
>  
>      bitmap_set(dirty_bitmap, ctx->save.p2m_size);
>  
> -    return send_dirty_pages(ctx, ctx->save.p2m_size);
> +    return send_dirty_pages(ctx, ctx->save.p2m_size, precopy);
>  }
>  
>  static int enable_logdirty(struct xc_sr_context *ctx)
> @@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
>      xc_interface *xch = ctx->xch;
>      char *new_str = NULL;
>  
> -    if ( asprintf(&new_str, "Frames iteration %u of %u",
> -                  iter, ctx->save.max_iterations) == -1 )
> +    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
>      {
>          PERROR("Unable to allocate new progress string");
>          return -1;
> @@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx)
>      xc_interface *xch = ctx->xch;
>      xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
>      char *progress_str = NULL;
> -    unsigned x;
>      int rc;
>  
> +    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> +                                    &ctx->save.dirty_bitmap_hbuf);
> +
> +    int (*precopy_policy)(struct precopy_stats, void *) =
> +        ctx->save.callbacks->precopy_policy;
> +    void *data = ctx->save.callbacks->data;
> +
>      rc = update_progress_string(ctx, &progress_str, 0);
>      if ( rc )
>          goto out;
>  
> -    rc = send_all_pages(ctx);
> +#define CONSULT_POLICY                                                        \

:(

The reason this code is readable and (hopefully) easy to follow is due
in large part to a lack of macros like this trying to hide what is
actually going on.

> +    do {                                                                      \
> +        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )                  \
> +        {                                                                     \
> +            rc = -1;                                                          \
> +            goto out;                                                         \
> +        }                                                                     \
> +        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )  \
> +        {                                                                     \
> +            rc = 0;                                                           \
> +            goto out;                                                         \
> +        }                                                                     \
> +    } while (0)
> +
> +    ctx->save.stats = (struct precopy_stats)
> +        {
> +            .iteration     = 0,
> +            .total_written = 0,
> +            .dirty_count   = -1
> +        };
> +    rc = send_all_pages(ctx, /* precopy */ true);
>      if ( rc )
>          goto out;
>  
> -    for ( x = 1;
> -          ((x < ctx->save.max_iterations) &&
> -           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
> +    /* send_all_pages() has updated the stats */
> +    CONSULT_POLICY;
> +
> +    for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration )

Again, without an exit condition, this looks suspicious.

~Andrew


* Re: [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters
  2017-03-27  9:06 ` [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters Joshua Otto
@ 2017-03-29 21:08   ` Andrew Cooper
  2017-03-30  6:03     ` Joshua Otto
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-29 21:08 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/17 10:06, Joshua Otto wrote:
> In the context of the live migration algorithm, the precopy iteration
> count refers to the number of page-copying iterations performed prior to
> the suspension of the guest and transmission of the final set of dirty
> pages.  Similarly, the precopy dirty threshold refers to the dirty page
> count below which we judge it more profitable to proceed to
> stop-and-copy rather than continue with the precopy.  These would be
> helpful tuning parameters to work with when migrating particularly busy
> guests, as they enable an administrator to reap the available benefits
> of the precopy algorithm (the transmission of guest pages _not_ in the
> writable working set can be completed without guest downtime) while
> reducing the total amount of time required for the migration (as
> iterations of the precopy loop that will certainly be redundant can be
> skipped in favour of an earlier suspension).
>
> To expose these tuning parameters to users:
> - introduce a new libxl API function, libxl_domain_live_migrate(),
>   taking the same parameters as libxl_domain_suspend() _and_
>   precopy_iterations and precopy_dirty_threshold parameters, and
>   consider these parameters in the precopy policy
>
>   (though a pair of new parameters on their own might not warrant an
>   entirely new API function, it is added in anticipation of a number of
>   additional migration-only parameters that would be cumbersome on the
>   whole to tack on to the existing suspend API)
>
> - switch xl migrate to the new libxl_domain_live_migrate() and add new
>   --postcopy-iterations and --postcopy-threshold parameters to pass
>   through
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

This will have to be deferred to the tools maintainers, but I purposefully
didn't expose these knobs to users when rewriting live migration,
because they cannot be meaningfully chosen by anyone outside of a
testing scenario.  (That is not to say they aren't useful for testing
purposes, but I didn't upstream my version of this patch.)

I spent quite a while wondering how best to expose these tunables in a
way that end users could sensibly use them, and the best I came up with
was this:

First, run the guest under logdirty for a period of time to establish
the working set, and how steady it is.  From this, you have a baseline
for the target threshold, and a plausible way of estimating the
downtime.  (Better yet, as XenCenter, XenServer's Windows GUI, has proved
time and time again, users love graphs!  Even if they don't necessarily
understand them.)

From this baseline, the main condition you need to care about is the rate
of convergence.
the measured threshold, although on 5 or fewer iterations, the
asymptotic properties don't appear cleanly.  (Of course, the larger the
VM, the more iterations, and the more likely to spot this.)

Users will either care about the migration completing successfully, or
avoiding interrupting the workload.  The majority case would be both,
but every user will find one of these two options more important than the
other.  As a result, there need to be some options to
cover "if $X happens, do I continue or abort".

The case where the VM becomes more busy is harder however.  For the
users which care about not interrupting the workload, there will be a
point above which they'd prefer to abort the migration rather than
continue it.  For the users which want the migration to complete, they'd
prefer to pause the VM and take a downtime hit, rather than aborting.

Therefore, you really need two thresholds: the one above which you always
abort, and the one at which you would normally choose to pause.  The
decision as to what to do depends on where you are between these
thresholds when the dirty state converges.  (Of course, if the VM
suddenly becomes more idle, it is sensible to continue beyond the lower
threshold, as it will reduce the downtime.)  The absolute number of
iterations on the other hand doesn't actually matter from a users point
of view, so isn't a useful control to have.
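
As a rough sketch (the precopy_stats/XGS_POLICY_* interface is the one
proposed in this series; the tunable names are invented, and the "where
you are between the thresholds when the dirty state converges" part
would need convergence tracking layered on top), such a policy might
look like:

struct two_threshold_tunables {
    long pause_threshold;   /* dirty count below which to pause (suspend) */
    long abort_threshold;   /* dirty count above which to always abort */
};

static int two_threshold_precopy_policy(struct precopy_stats stats,
                                        void *user)
{
    struct two_threshold_tunables *t = user;

    if ( stats.dirty_count < 0 )          /* no estimate this round */
        return XGS_POLICY_CONTINUE_PRECOPY;

    if ( stats.dirty_count > t->abort_threshold )
        return XGS_POLICY_ABORT;          /* too busy - give up */

    if ( stats.dirty_count <= t->pause_threshold )
        return XGS_POLICY_STOP_AND_COPY;  /* good enough - pause now */

    return XGS_POLICY_CONTINUE_PRECOPY;   /* in between - keep trying */
}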

Another thing to be careful with is the measure of convergence with
respect to guest busyness, and other factors influencing the absolute
iteration time, such as congestion of the network between the two
hosts.  I haven't yet come up with a sensible way of reconciling this
with the above, in a way which can be expressed as a useful set of controls.


The plan, following migration v2, was always to come back to this and
see about doing something better than the current hard-coded parameters,
but I am still working on fixing migration in other areas (preventing
VMs from crashing when moved because they observe important differences
in the hardware).

How does your postcopy proposal influence/change the above logic?

~Andrew


* Re: [PATCH RFC 00/20] Add postcopy live migration support
  2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
                   ` (20 preceding siblings ...)
  2017-03-28 14:41 ` [PATCH RFC 00/20] Add postcopy live migration support Wei Liu
@ 2017-03-29 22:50 ` Andrew Cooper
  2017-03-31  4:51   ` Joshua Otto
  21 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2017-03-29 22:50 UTC (permalink / raw)
  To: Joshua Otto, xen-devel; +Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On 27/03/2017 10:06, Joshua Otto wrote:
> Hi,
>
> We're a team of three fourth-year undergraduate software engineering students at
> the University of Waterloo in Canada.  In late 2015 we posted on the list [1] to
> ask for a project to undertake for our program's capstone design project, and
> Andrew Cooper pointed us in the direction of the live migration implementation
> as an area that could use some attention.  We were particularly interested in
> post-copy live migration (as evaluated by [2] and discussed on the list at [3]),
> and have been working on an implementation of this on-and-off since then.
>
> We now have a working implementation of this scheme, and are submitting it for
> comment.  The changes are also available as the 'postcopy' branch of the GitHub
> repository at [4]
>
> As a brief overview of our approach:
> - We introduce a mechanism by which libxl can indicate to the libxc stream
>   helper process that the iterative migration precopy loop should be terminated
>   and postcopy should begin.
> - At this point, we suspend the domain, collect the final set of dirty pfns and
>   write these pfns (and _not_ their contents) into the stream.
> - At the destination, the xc restore logic registers itself as a pager for the
>   migrating domain, 'evicts' all of the pfns indicated by the sender as
>   outstanding, and then resumes the domain at the destination.
> - As the domain executes, the migration sender continues to push the remaining
>   oustanding pages to the receiver in the background.  The receiver
>   monitors both the stream for incoming page data and the paging ring event
>   channel for page faults triggered by the guest.  Page faults are forwarded on
>   the back-channel migration stream to the migration sender, which prioritizes
>   these pages for transmission.
>
> By leveraging the existing paging API, we are able to implement the postcopy
> scheme without any hypervisor modifications - all of our changes are confined to
> the userspace toolstack.  However, we inherit from the paging API the
> requirement that the domains be HVM and that the host have HAP/EPT support.

Wow.  Considering that the paging API has had no in-tree consumers (and
its out-of-tree consumer folded), I am astounded that it hasn't bitrotten.

>
> We haven't yet had the opportunity to perform a quantitative evaluation of the
> performance trade-offs between the traditional pre-copy and our post-copy
> strategies, but intend to.  Informally, we've been testing our implementation by
> migrating a domain running the x86 memtest program (which is obviously a
> tremendously write-heavy workload), and have observed a substantial reduction in
> total time required for migration completion (at the expense of a visually
> obvious 'slowdown' in the execution of the program).

Do you have any numbers, even for this informal testing?

>   We've also noticed that,
> when performing a postcopy without any leading precopy iterations, the time
> required at the destination to 'evict' all of the outstanding pages is
> substantial - possibly because there is no batching mechanism by which pages can
> be evicted - so this area in particular might require further attention.
>
> We're really interested in any feedback you might have!

Do you have a design document for this?  The spec modifications and code
comments are great, but there is no substitute (as far as understanding
goes) for a description in terms of the algorithm and design choices.

~Andrew


* Re: [PATCH RFC 00/20] Add postcopy live migration support
  2017-03-28 14:41 ` [PATCH RFC 00/20] Add postcopy live migration support Wei Liu
@ 2017-03-30  4:13   ` Joshua Otto
  2017-03-31 14:19     ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  4:13 UTC (permalink / raw)
  To: Wei Liu, xen-devel
  Cc: andrew.cooper3, hjarmstr, ian.jackson, czylin, imhy.yang

On Tue, Mar 28, 2017 at 03:41:02PM +0100, Wei Liu wrote:
> Hi Harley, Chester and Joshua
> 
> This is really nice work. I took a brief look at all the patches; they
> look really high quality.

Thank you!

> 
> We're currently approaching freeze for a Xen release. We've got a lot on
> our plate. I think maintainers will get to this series at some point.

Understood.  We're currently approaching our final exams so that's probably for
the best :)

> 
> From the look of things, some patches can go in because they're
> generally useful.
> 
> On Mon, Mar 27, 2017 at 05:06:12AM -0400, Joshua Otto wrote:
> > Hi,
> > 
> > We're a team of three fourth-year undergraduate software engineering students at
> > the University of Waterloo in Canada.  In late 2015 we posted on the list [1] to
> > ask for a project to undertake for our program's capstone design project, and
> > Andrew Cooper pointed us in the direction of the live migration implementation
> > as an area that could use some attention.  We were particularly interested in
> > post-copy live migration (as evaluated by [2] and discussed on the list at [3]),
> > and have been working on an implementation of this on-and-off since then.
> > 
> > We now have a working implementation of this scheme, and are submitting it for
> > comment.  The changes are also available as the 'postcopy' branch of the GitHub
> > repository at [4]
> > 
> > As a brief overview of our approach:
> > - We introduce a mechanism by which libxl can indicate to the libxc stream
> >   helper process that the iterative migration precopy loop should be terminated
> >   and postcopy should begin.
> > - At this point, we suspend the domain, collect the final set of dirty pfns and
> >   write these pfns (and _not_ their contents) into the stream.
> > - At the destination, the xc restore logic registers itself as a pager for the
> >   migrating domain, 'evicts' all of the pfns indicated by the sender as
> >   outstanding, and then resumes the domain at the destination.
> > - As the domain executes, the migration sender continues to push the remaining
> >   oustanding pages to the receiver in the background.  The receiver
> >   monitors both the stream for incoming page data and the paging ring event
> >   channel for page faults triggered by the guest.  Page faults are forwarded on
> >   the back-channel migration stream to the migration sender, which prioritizes
> >   these pages for transmission.
> > 
> > By leveraging the existing paging API, we are able to implement the postcopy
> > scheme without any hypervisor modifications - all of our changes are confined to
> > the userspace toolstack.  However, we inherit from the paging API the
> > requirement that the domains be HVM and that the host have HAP/EPT support.
> > 
> 
> Please consider writing a design document for this feature and stick it
> at the beginning of your series in the future. You can find examples
> under docs/designs.

Absolutely, I'll submit one with v2.

> 
> The restriction is a bit unfortunate, but we shouldn't block useful work
> because it's incomplete. We just need to make sure that, should someone
> decide to implement similar functionality for PV guests, they are able
> to do so.
> 
> You might want to check if shadow paging can be used with the paging
> API, such that support can be widened to all HVM guests.
> 
> > We haven't yet had the opportunity to perform a quantitative evaluation of the
> > performance trade-offs between the traditional pre-copy and our post-copy
> > strategies, but intend to.  Informally, we've been testing our implementation by
> > migrating a domain running the x86 memtest program (which is obviously a
> > tremendously write-heavy workload), and have observed a substantial reduction in
> > total time required for migration completion (at the expense of a visually
> > obvious 'slowdown' in the execution of the program).  We've also noticed that,
> > when performing a postcopy without any leading precopy iterations, the time
> > required at the destination to 'evict' all of the outstanding pages is
> > substantial - possibly because there is no batching mechanism by which pages can
> > be evicted - so this area in particular might require further attention.
> > 
> 
> Please do post numbers when you have them. For now, please be patient
> and wait for people to comment.

Will do.  As a general question for those following the thread, are there any
application workloads/benchmarks that people would find particularly
interesting?

The experiment that we've planned but haven't had the time to follow through
fully is to mount a ramdisk inside the guest and use Axboe's fio to test all of
the entries in the (read/write mix) x (working set size) x (access pattern)
matrix.

Thank you again for your feedback!

Josh


* Re: [PATCH RFC 04/20] libxc/xc_sr_save.c: add WRITE_TRIVIAL_RECORD_FN()
  2017-03-28 19:03   ` Andrew Cooper
@ 2017-03-30  4:28     ` Joshua Otto
  0 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  4:28 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On Tue, Mar 28, 2017 at 08:03:26PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > Writing the libxc save stream requires writing a few 'trivial' records,
> > consisting only of a header with a particular type.  As a readability
> > aid, it's nice to have obviously-named functions that write these sorts
> > of records into the stream - for example, the first such function was
> > write_end_record(), which reads much more pleasantly at its call-site
> > than write_generic_record(REC_TYPE_END) would.  However, it's tedious
> > and error-prone to copy-paste the generic body of such a function for
> > each new trivial record type.
> >
> > Add a helper macro that takes a name base and a record type and declares
> > the corresponding trivial record write function.  Use this to re-define
> > the two existing trivial record functions, write_end_record() and
> > write_checkpoint_record().
> >
> > No functional change.
> >
> > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> 
> -1.
> 
> This hides the functions from tools like cscope, and makes the code
> harder to read.  I also don't really buy the error-prone argument.

Okay, fair enough.

> 
> If you do want to avoid opencoding different functions, how about
> 
> static int write_zerolength_record(uint32_t record_type)
> 
> and updating the existing callsites to be
> 
> write_zerolength_record(REC_TYPE_END); etc.

I really do prefer write_end_record() to write_some_record(REC_TYPE_END),
visually.  I'll fix up the later patches to add the corresponding functions
without the macro.
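
Something like this, perhaps (a sketch, assuming the series' existing
write_record() helper and struct xc_sr_record), which keeps the nice
call sites without hiding anything from cscope:

static int write_zerolength_record(struct xc_sr_context *ctx,
                                   uint32_t record_type)
{
    struct xc_sr_record rec = { .type = record_type };

    return write_record(ctx, &rec);
}

static int write_end_record(struct xc_sr_context *ctx)
{
    return write_zerolength_record(ctx, REC_TYPE_END);
}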

Josh


* Re: [PATCH RFC 05/20] libxc/xc_sr: factor out filter_pages()
  2017-03-28 19:27   ` Andrew Cooper
@ 2017-03-30  4:42     ` Joshua Otto
  0 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  4:42 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On Tue, Mar 28, 2017 at 08:27:48PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> > index 481a904..8574ee8 100644
> > --- a/tools/libxc/xc_sr_restore.c
> > +++ b/tools/libxc/xc_sr_restore.c
> > @@ -194,6 +194,68 @@ int populate_pfns(struct xc_sr_context *ctx, unsigned count,
> >      return rc;
> >  }
> >  
> > +static void set_page_types(struct xc_sr_context *ctx, unsigned count,
> > +                           xen_pfn_t *pfns, uint32_t *types)
> > +{
> > +    unsigned i;
> 
> Please use unsigned int rather than just "unsigned" throughout.

Okay.  (For what it's worth, I chose plain "unsigned" here for consistency with
the rest of xc_sr_save/xc_sr_restore)

> > +
> > +    for ( i = 0; i < count; ++i )
> > +        ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
> > +}
> > +
> > +/*
> > + * Given count pfns and their types, allocate and fill in buffer bpfns with only
> > + * those pfns that are 'backed' by real page data that needs to be migrated.
> > + * The caller must later free() *bpfns.
> > + *
> > + * Returns 0 on success and non-0 on failure.  *bpfns can be free()ed even after
> > + * failure.
> > + */
> > +static int filter_pages(struct xc_sr_context *ctx,
> > +                        unsigned count,
> > +                        xen_pfn_t *pfns,
> > +                        uint32_t *types,
> > +                        /* OUT */ unsigned *nr_pages,
> > +                        /* OUT */ xen_pfn_t **bpfns)
> > +{
> > +    xc_interface *xch = ctx->xch;
> 
> Pointers to arrays are very easy to get wrong in C.  This code will be
> less error-prone if you use
> 
> xen_pfn_t *_pfns;  (variable name subject to improvement)
> 
> > +    unsigned i;
> > +
> > +    *nr_pages = 0;
> > +    *bpfns = malloc(count * sizeof(*bpfns));
> 
> _pfns = *bfns = malloc(...).
> 
> Then use _pfns in place of (*bpfns) everywhere else.
> 
> However,  your sizeof has the wrong indirection.  It works on x86
> because xen_pfn_t is the same size as a pointer, but it will blow up on
> 32bit ARM, where a pointer is 4 bytes but xen_pfn_t is 8 bytes.

Agh!  Oh dear.
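
For the record, the corrected shape (a sketch along the lines you
suggest, sizing the allocation by the element type rather than the
pointer type):

    xen_pfn_t *_pfns;

    _pfns = *bpfns = malloc(count * sizeof(*_pfns));
    if ( !_pfns )
        return -1;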

> > +    if ( !(*bpfns) )
> > +    {
> > +        ERROR("Failed to allocate %zu bytes to process page data",
> > +              count * (sizeof(*bpfns)));
> > +        return -1;
> > +    }
> > +
> > +    for ( i = 0; i < count; ++i )
> > +    {
> > +        switch ( types[i] )
> > +        {
> > +        case XEN_DOMCTL_PFINFO_NOTAB:
> > +
> > +        case XEN_DOMCTL_PFINFO_L1TAB:
> > +        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> > +
> > +        case XEN_DOMCTL_PFINFO_L2TAB:
> > +        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> > +
> > +        case XEN_DOMCTL_PFINFO_L3TAB:
> > +        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> > +
> > +        case XEN_DOMCTL_PFINFO_L4TAB:
> > +        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> > +
> > +            (*bpfns)[(*nr_pages)++] = pfns[i];
> > +            break;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  /*
> >   * Given a list of pfns, their types, and a block of page data from the
> >   * stream, populate and record their types, map the relevant subset and copy
> > @@ -203,7 +265,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
> >                               xen_pfn_t *pfns, uint32_t *types, void *page_data)
> >  {
> >      xc_interface *xch = ctx->xch;
> > -    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
> > +    xen_pfn_t *mfns = NULL;
> 
> This shows a naming bug, which is my fault.  This should be named gfns,
> not mfns.  (It inherits its name from the legacy migration code, but
> that was also wrong.)
> 
> Please correct it, either in this patch or another; the memory
> management terms are hard enough, even when all the code is correct.

Ahhhhhhh - I actually found this desperately confusing when trying to grok the
code originally.  Thanks for clearing that up!

Josh


* Re: [PATCH RFC 06/20] libxc/xc_sr: factor helpers out of handle_page_data()
  2017-03-28 19:52   ` Andrew Cooper
@ 2017-03-30  4:49     ` Joshua Otto
  2017-04-12 15:16       ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  4:49 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On Tue, Mar 28, 2017 at 08:52:26PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
> > index 3291b25..32400b2 100644
> > --- a/tools/libxc/xc_sr_stream_format.h
> > +++ b/tools/libxc/xc_sr_stream_format.h
> > @@ -80,15 +80,15 @@ struct xc_sr_rhdr
> >  #define REC_TYPE_OPTIONAL             0x80000000U
> >  
> >  /* PAGE_DATA */
> > -struct xc_sr_rec_page_data_header
> > +struct xc_sr_rec_pages_header
> >  {
> >      uint32_t count;
> >      uint32_t _res1;
> >      uint64_t pfn[0];
> >  };
> >  
> > -#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
> > -#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
> > +#define REC_PFINFO_PFN_MASK  0x000fffffffffffffULL
> > +#define REC_PFINFO_TYPE_MASK 0xf000000000000000ULL
> >  
> >  /* X86_PV_INFO */
> >  struct xc_sr_rec_x86_pv_info
> 
> What are the purposes of these name changes?

I should definitely have explained this more explicitly, sorry about that.  I
use the same exact structure (a count followed by a list of encoded pfns+types)
for three additional record types (POSTCOPY_PFNS, POSTCOPY_PAGE_DATA, and
POSTCOPY_FAULT) later in the series when postcopy is introduced.  To enable the
generation and validation logic to be shared between all of the code that
processes this sort of record, I renamed the structure and its associated masks
to be more generic.
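
For instance, the shared validation might end up looking something like
this (a sketch only - the helper name is invented, but the header layout
is the one above):

static int validate_pages_record(struct xc_sr_record *rec)
{
    struct xc_sr_rec_pages_header *pages = rec->data;

    if ( rec->length < sizeof(*pages) )
        return -1;    /* too short to contain the header */

    if ( rec->length < sizeof(*pages) + pages->count * sizeof(uint64_t) )
        return -1;    /* count disagrees with the record length */

    return 0;
}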

Josh


* Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl
  2017-03-29 20:18   ` Andrew Cooper
@ 2017-03-30  5:19     ` Joshua Otto
  2017-04-12 15:16       ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  5:19 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On Wed, Mar 29, 2017 at 09:18:10PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > The precopy phase of the xc_domain_save() live migration algorithm has
> > historically been implemented to run until either a) (almost) no pages
> > are dirty or b) some fixed, hard-coded maximum number of precopy
> > iterations has been exceeded.  This policy and its implementation are
> > less than ideal for a few reasons:
> > - the logic of the policy is intertwined with the control flow of the
> >   mechanism of the precopy stage
> > - it can't take into account facts external to the immediate
> >   migration context, such as interactive user input or the passage of
> >   wall-clock time
> > - it does not permit the user to change their mind, over time, about
> >   what to do at the end of the precopy (they get an unconditional
> >   transition into the stop-and-copy phase of the migration)
> >
> > To permit users to implement arbitrary higher-level policies governing
> > when the live migration precopy phase should end, and what should be
> > done next:
> > - add a precopy_policy() callback to the xc_domain_save() user-supplied
> >   callbacks
> > - during the precopy phase of live migrations, consult this policy after
> >   each batch of pages transmitted and take the dictated action, which
> >   may be to a) abort the migration entirely, b) continue with the
> >   precopy, or c) proceed to the stop-and-copy phase.
> > - provide an implementation of the old policy as such a callback in
> >   libxl and plumb it through the IPC machinery to libxc, effectively
> >   maintaining the old policy for now
> >
> > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> 
> This patch should be split into two.  One modifying libxc to use struct
> precopy_stats, and a second to wire up the RPC call.

Will do.

> > ---
> >  tools/libxc/include/xenguest.h     |  23 ++++-
> >  tools/libxc/xc_nomigrate.c         |   3 +-
> >  tools/libxc/xc_sr_common.h         |   7 +-
> >  tools/libxc/xc_sr_save.c           | 194 ++++++++++++++++++++++++++-----------
> >  tools/libxl/libxl_dom_save.c       |  20 ++++
> >  tools/libxl/libxl_save_callout.c   |   3 +-
> >  tools/libxl/libxl_save_helper.c    |   7 +-
> >  tools/libxl/libxl_save_msgs_gen.pl |   4 +-
> >  8 files changed, 189 insertions(+), 72 deletions(-)
> >
> > diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> > index aa8cc8b..30ffb6f 100644
> > --- a/tools/libxc/include/xenguest.h
> > +++ b/tools/libxc/include/xenguest.h
> > @@ -39,6 +39,14 @@
> >   */
> >  struct xenevtchn_handle;
> >  
> > +/* For save's precopy_policy(). */
> > +struct precopy_stats
> > +{
> > +    unsigned iteration;
> > +    unsigned total_written;
> > +    long dirty_count; /* -1 if unknown */
> 
> total_written and dirty_count are liable to be equal, so having them as
> different widths of integer clearly can't be correct.

Hmmm, I could have sworn that I chose the width to match the type of dirty_count
in the shadow op stats, but I've checked again and it's uint32_t there so I'm
not sure what I was thinking.

> 
> > +};
> > +
> >  /* callbacks provided by xc_domain_save */
> >  struct save_callbacks {
> >      /* Called after expiration of checkpoint interval,
> > @@ -46,6 +54,17 @@ struct save_callbacks {
> >       */
> >      int (*suspend)(void* data);
> >  
> > +    /* Called after every batch of page data sent during the precopy phase of a
> > +     * live migration to ask the caller what to do next based on the current
> > +     * state of the precopy migration.
> > +     */
> > +#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
> > +                                        * tidy up. */
> > +#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
> > +#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
> > +                                        * remaining dirty pages. */
> > +    int (*precopy_policy)(struct precopy_stats stats, void *data);
> 
> Structures shouldn't be passed by value like this, as the compiler has
> to do a lot of memcpy() work to make it happen.  You should pass by
> const pointer, as (as far as I can tell), they are strictly read-only to
> the implementation of this hook?

I chose to pass by value to make the IPC plumbing easier -
libxl_save_msgs_gen.pl doesn't know what to do about pointers, and (not being
the strongest Perl programmer...) I didn't want to volunteer to be the one to
teach it.

Is the memcpy() really significant here?  If this were a tight loop, sure, but
every invocation of the policy callback implies both a 4MB network transfer
_and_ a synchronous RPC.

> > +
> >      /* Called after the guest's dirty pages have been
> >       *  copied into an output buffer.
> >       * Callback function resumes the guest & the device model,
> > @@ -100,8 +119,8 @@ typedef enum {
> >   *        doesn't use checkpointing
> >   * @return 0 on success, -1 on failure
> >   */
> > -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> > -                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
> > +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> > +                   uint32_t flags /* XCFLAGS_xxx */,
> >                     struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd);
> 
> It would be cleaner for existing callers, and to extend in the future,
> to encapsulate all of these parameters in a struct domain_save_params
> and pass it by pointer to here.
> 
> That way, we'd avoid the situation we currently have where some
> information is passed in bitfields in a single parameter, whereas other
> booleans are passed as integers.
> 
> The hvm parameter specifically is useless, and can be removed by
> rearranging the sanity checks until after the xc_domain_getinfo() call.
> 
> >  
> > diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
> > index 15c838f..2af64e4 100644
> > --- a/tools/libxc/xc_nomigrate.c
> > +++ b/tools/libxc/xc_nomigrate.c
> > @@ -20,8 +20,7 @@
> >  #include <xenctrl.h>
> >  #include <xenguest.h>
> >  
> > -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> > -                   uint32_t max_factor, uint32_t flags,
> > +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
> >                     struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd)
> >  {
> > diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> > index b1aa88e..a9160bd 100644
> > --- a/tools/libxc/xc_sr_common.h
> > +++ b/tools/libxc/xc_sr_common.h
> > @@ -198,12 +198,11 @@ struct xc_sr_context
> >              /* Further debugging information in the stream. */
> >              bool debug;
> >  
> > -            /* Parameters for tweaking live migration. */
> > -            unsigned max_iterations;
> > -            unsigned dirty_threshold;
> > -
> >              unsigned long p2m_size;
> >  
> > +            struct precopy_stats stats;
> > +            int policy_decision;
> > +
> >              xen_pfn_t *batch_pfns;
> >              unsigned nr_batch_pfns;
> >              unsigned long *deferred_pages;
> > diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> > index 797aec5..eb95334 100644
> > --- a/tools/libxc/xc_sr_save.c
> > +++ b/tools/libxc/xc_sr_save.c
> > @@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx)
> >  }
> >  
> >  /*
> > + * Test if the batch is full.
> > + */
> > +static bool batch_full(struct xc_sr_context *ctx)
> 
> const struct xc_sr_context *ctx
> 
> This is a predicate, after all.
> 
> > +{
> > +    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
> > +}
> > +
> > +/*
> > + * Test if the batch is empty.
> > + */
> > +static bool batch_empty(struct xc_sr_context *ctx)
> > +{
> > +    return ctx->save.nr_batch_pfns == 0;
> > +}
> > +
> > +/*
> >   * Flush a batch of pfns into the stream.
> >   */
> >  static int flush_batch(struct xc_sr_context *ctx)
> >  {
> >      int rc = 0;
> >  
> > -    if ( ctx->save.nr_batch_pfns == 0 )
> > +    if ( batch_empty(ctx) )
> >          return rc;
> >  
> >      rc = write_batch(ctx);
> > @@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx)
> >  }
> >  
> >  /*
> > - * Add a single pfn to the batch, flushing the batch if full.
> > + * Add a single pfn to the batch.
> >   */
> > -static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> > +static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> >  {
> > -    int rc = 0;
> > -
> > -    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
> > -        rc = flush_batch(ctx);
> > -
> > -    if ( rc == 0 )
> > -        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> > -
> > -    return rc;
> > +    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
> > +    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> >  }
> >  
> >  /*
> > @@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx)
> >   * Send a subset of pages in the guests p2m, according to the dirty bitmap.
> >   * Used for each subsequent iteration of the live migration loop.
> >   *
> > + * During the precopy stage of a live migration, test the user-supplied
> > + * policy function after each batch of pages and cut off the operation
> > + * early if indicated.  Unless aborting, the dirty pages remaining in this round
> > + * are transferred into the deferred_pages bitmap.
> 
> Is this actually a sensible thing to do?  On iteration 0, this is going
> to be a phenomenal number of RPC calls, which are all going to make the
> same decision.

With the existing policy?  No.  However, the grand idea is to permit other
policies where this does make sense.  As an example, I think it would be really
useful for users to be able to specify a timeout, in seconds, for the precopy
phase, after which the migration advances to its next phase (I'll elaborate more
on this in the other discussion thread).
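
As a concrete sketch of the timeout idea (entirely hypothetical - none
of these names are in the series, and time(NULL) stands in for a proper
clock), layered on top of the existing 5-iteration/50-page rule:

struct timeout_policy_data {
    time_t deadline;    /* time(NULL) + user-specified seconds */
};

static int timeout_precopy_policy(struct precopy_stats stats, void *user)
{
    struct timeout_policy_data *tp = user;

    /* Out of time: move on to the next phase regardless of progress. */
    if ( time(NULL) >= tp->deadline )
        return XGS_POLICY_STOP_AND_COPY;

    /* Otherwise fall back to the old hard-coded policy. */
    if ( (stats.dirty_count >= 0 && stats.dirty_count < 50) ||
         stats.iteration >= 5 )
        return XGS_POLICY_STOP_AND_COPY;

    return XGS_POLICY_CONTINUE_PRECOPY;
}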

It's true that this means a lot of RPC.  The hope is that the cost of each RPC
should be negligible in comparison to a 4MB synchronous network copy.

> 
> > + *
> >   * Bitmap is bounded by p2m_size.
> >   */
> >  static int send_dirty_pages(struct xc_sr_context *ctx,
> > -                            unsigned long entries)
> > +                            unsigned long entries, bool precopy)
> 
> Shouldn't this precopy boolean be some kind of state variable in ctx ?

I suppose it could be.  I was a bit worried that there would be objections to
piling too many additional variables into the context, because each is
essentially an implicit extra parameter to every function here.

> 
> >  {
> >      xc_interface *xch = ctx->xch;
> >      xen_pfn_t p;
> > @@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
> >      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >                                      &ctx->save.dirty_bitmap_hbuf);
> >  
> > -    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
> > +    int (*precopy_policy)(struct precopy_stats, void *) =
> > +        ctx->save.callbacks->precopy_policy;
> > +    void *data = ctx->save.callbacks->data;
> > +
> > +    assert(batch_empty(ctx));
> > +    for ( p = 0, written = 0; p < ctx->save.p2m_size; )
> 
> This looks suspicious without an increment.  Conceptually, it might be
> better as a do {} while ( decision == XGS_POLICY_CONTINUE_PRECOPY ); loop?

Sure, I think that would read just fine too.
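
i.e. something shaped like (sketch):

    do
    {
        /* ... batch up and send dirty pages, consulting the policy ... */
    } while ( ctx->save.policy_decision == XGS_POLICY_CONTINUE_PRECOPY );
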
> 
> >      {
> > -        if ( !test_bit(p, dirty_bitmap) )
> > -            continue;
> > +        if ( ctx->save.live && precopy )
> > +        {
> > +            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
> 
> Newline here please.
> 
> > +            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> > +            {
> 
> Please put a log message here indicating that abort has been requested. 
> Otherwise, the migration will give up with a failure and no obvious
> indication why.
> 
> > +                return -1;
> > +            }
> > +            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> > +            {
> > +                /* Any outstanding dirty pages are now deferred until the next
> > +                 * phase of the migration. */
> 
> /*
>  * The comment style for multiline comments
>  * is like this.
>  */
> 
> > +                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> > +                          ctx->save.p2m_size);
> > +                if ( entries > written )
> > +                    ctx->save.nr_deferred_pages += entries - written;
> > +
> > +                goto done;
> > +            }
> > +        }
> >  
> > -        rc = add_to_batch(ctx, p);
> > +        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
> > +        {
> > +            if ( test_and_clear_bit(p, dirty_bitmap) )
> > +            {
> > +                add_to_batch(ctx, p);
> > +                ++written;
> > +                ++ctx->save.stats.total_written;
> > +            }
> > +        }
> > +
> > +        rc = flush_batch(ctx);
> >          if ( rc )
> >              return rc;
> >  
> > -        /* Update progress every 4MB worth of memory sent. */
> > -        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
> > -            xc_report_progress_step(xch, written, entries);
> > -
> > -        ++written;
> > +        /* Update progress after every batch (4MB) worth of memory sent. */
> > +        xc_report_progress_step(xch, written, entries);
> >      }
> >  
> > -    rc = flush_batch(ctx);
> > -    if ( rc )
> > -        return rc;
> > -
> >      if ( written > entries )
> >          DPRINTF("Bitmap contained more entries than expected...");
> >  
> >      xc_report_progress_step(xch, entries, entries);
> >  
> > + done:
> >      return ctx->save.ops.check_vm_state(ctx);
> >  }
> >  
> > @@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
> >   * Send all pages in the guests p2m.  Used as the first iteration of the live
> >   * migration loop, and for a non-live save.
> >   */
> > -static int send_all_pages(struct xc_sr_context *ctx)
> > +static int send_all_pages(struct xc_sr_context *ctx, bool precopy)
> >  {
> >      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >                                      &ctx->save.dirty_bitmap_hbuf);
> >  
> >      bitmap_set(dirty_bitmap, ctx->save.p2m_size);
> >  
> > -    return send_dirty_pages(ctx, ctx->save.p2m_size);
> > +    return send_dirty_pages(ctx, ctx->save.p2m_size, precopy);
> >  }
> >  
> >  static int enable_logdirty(struct xc_sr_context *ctx)
> > @@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
> >      xc_interface *xch = ctx->xch;
> >      char *new_str = NULL;
> >  
> > -    if ( asprintf(&new_str, "Frames iteration %u of %u",
> > -                  iter, ctx->save.max_iterations) == -1 )
> > +    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
> >      {
> >          PERROR("Unable to allocate new progress string");
> >          return -1;
> > @@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx)
> >      xc_interface *xch = ctx->xch;
> >      xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
> >      char *progress_str = NULL;
> > -    unsigned x;
> >      int rc;
> >  
> > +    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> > +                                    &ctx->save.dirty_bitmap_hbuf);
> > +
> > +    int (*precopy_policy)(struct precopy_stats, void *) =
> > +        ctx->save.callbacks->precopy_policy;
> > +    void *data = ctx->save.callbacks->data;
> > +
> >      rc = update_progress_string(ctx, &progress_str, 0);
> >      if ( rc )
> >          goto out;
> >  
> > -    rc = send_all_pages(ctx);
> > +#define CONSULT_POLICY                                                        \
> 
> :(
> 
> The reason this code is readable and (hopefully) easy to follow, is due
> in large part to a lack of macros like this trying to hide what is
> actually going on.

Okay, I'll inline it.  I guess I thought I might get away with it because it's
never more than a screen buffer away from its callsites.

> 
> > +    do {                                                                      \
> > +        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )                  \
> > +        {                                                                     \
> > +            rc = -1;                                                          \
> > +            goto out;                                                         \
> > +        }                                                                     \
> > +        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )  \
> > +        {                                                                     \
> > +            rc = 0;                                                           \
> > +            goto out;                                                         \
> > +        }                                                                     \
> > +    } while (0)
> > +
> > +    ctx->save.stats = (struct precopy_stats)
> > +        {
> > +            .iteration     = 0,
> > +            .total_written = 0,
> > +            .dirty_count   = -1
> > +        };
> > +    rc = send_all_pages(ctx, /* precopy */ true);
> >      if ( rc )
> >          goto out;
> >  
> > -    for ( x = 1;
> > -          ((x < ctx->save.max_iterations) &&
> > -           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
> > +    /* send_all_pages() has updated the stats */
> > +    CONSULT_POLICY;
> > +
> > +    for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration )
> 
> Again, without an exit condition, this looks suspicious.

Sure, I'll turn this one into a do {} while() too.

Thank you for the review!

Josh


* Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl
  2017-03-29 18:54   ` Jennifer Herbert
@ 2017-03-30  5:28     ` Joshua Otto
  0 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  5:28 UTC (permalink / raw)
  To: Jennifer Herbert, xen-devel
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, imhy.yang, hjarmstr

On Wed, Mar 29, 2017 at 07:54:15PM +0100, Jennifer Herbert wrote:
> I would like to encourage this patch - as I have use for it outside
> of your postcopy work.

Glad to hear that!

> Some things people will comment on:
> You've used 'unsigned' without the int keyword, which people don't like.
> Also on line 324, you're missing a space between 'if (' and
> 'ctx->save.policy_decision'.

Ack.  All of the existing code in xc_sr_save/xc_sr_restore uses plain "unsigned"
so I tried to be consistent.

> 
> Also, I'm not a fan of your CONSULT_POLICY macro, which you've defined
> at an odd point in your function, and which I think could be done more
> elegantly.  Worst of all ... it's a macro - which I think should
> generally be avoided unless there is little choice.  I'm sure you could
> write a helper function to replace this.

Yes, you're right, will fix.
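
Roughly like this, I imagine (a sketch - since a helper can't 'goto out'
on the caller's behalf, it maps the stashed decision to an rc and tells
the caller whether to bail):

static bool policy_says_stop(struct xc_sr_context *ctx, int *rc)
{
    if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
    {
        *rc = -1;
        return true;
    }

    if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
    {
        *rc = 0;
        return true;
    }

    return false;
}

with call sites of the form:

    if ( policy_says_stop(ctx, &rc) )
        goto out;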

Thank you for the review!

Josh

> 
> Cheers,
> 
> -jenny
> 
> On 27/03/17 10:06, Joshua Otto wrote:
> >The precopy phase of the xc_domain_save() live migration algorithm has
> >historically been implemented to run until either a) (almost) no pages
> >are dirty or b) some fixed, hard-coded maximum number of precopy
> >iterations has been exceeded.  This policy and its implementation are
> >less than ideal for a few reasons:
> >- the logic of the policy is intertwined with the control flow of the
> >   mechanism of the precopy stage
> >- it can't take into account facts external to the immediate
> >   migration context, such as interactive user input or the passage of
> >   wall-clock time
> >- it does not permit the user to change their mind, over time, about
> >   what to do at the end of the precopy (they get an unconditional
> >   transition into the stop-and-copy phase of the migration)
> >
> >To permit users to implement arbitrary higher-level policies governing
> >when the live migration precopy phase should end, and what should be
> >done next:
> >- add a precopy_policy() callback to the xc_domain_save() user-supplied
> >   callbacks
> >- during the precopy phase of live migrations, consult this policy after
> >   each batch of pages transmitted and take the dictated action, which
> >   may be to a) abort the migration entirely, b) continue with the
> >   precopy, or c) proceed to the stop-and-copy phase.
> >- provide an implementation of the old policy as such a callback in
> >   libxl and plumb it through the IPC machinery to libxc, effectively
> >   maintaining the old policy for now
> >
> >Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> >---
> >  tools/libxc/include/xenguest.h     |  23 ++++-
> >  tools/libxc/xc_nomigrate.c         |   3 +-
> >  tools/libxc/xc_sr_common.h         |   7 +-
> >  tools/libxc/xc_sr_save.c           | 194 ++++++++++++++++++++++++++-----------
> >  tools/libxl/libxl_dom_save.c       |  20 ++++
> >  tools/libxl/libxl_save_callout.c   |   3 +-
> >  tools/libxl/libxl_save_helper.c    |   7 +-
> >  tools/libxl/libxl_save_msgs_gen.pl |   4 +-
> >  8 files changed, 189 insertions(+), 72 deletions(-)
> >
> >diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> >index aa8cc8b..30ffb6f 100644
> >--- a/tools/libxc/include/xenguest.h
> >+++ b/tools/libxc/include/xenguest.h
> >@@ -39,6 +39,14 @@
> >   */
> >  struct xenevtchn_handle;
> >+/* For save's precopy_policy(). */
> >+struct precopy_stats
> >+{
> >+    unsigned iteration;
> >+    unsigned total_written;
> >+    long dirty_count; /* -1 if unknown */
> >+};
> >+
> >  /* callbacks provided by xc_domain_save */
> >  struct save_callbacks {
> >      /* Called after expiration of checkpoint interval,
> >@@ -46,6 +54,17 @@ struct save_callbacks {
> >       */
> >      int (*suspend)(void* data);
> >+    /* Called after every batch of page data sent during the precopy phase of a
> >+     * live migration to ask the caller what to do next based on the current
> >+     * state of the precopy migration.
> >+     */
> >+#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
> >+                                        * tidy up. */
> >+#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
> >+#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
> >+                                        * remaining dirty pages. */
> >+    int (*precopy_policy)(struct precopy_stats stats, void *data);
> >+
> >      /* Called after the guest's dirty pages have been
> >       *  copied into an output buffer.
> >       * Callback function resumes the guest & the device model,
> >@@ -100,8 +119,8 @@ typedef enum {
> >   *        doesn't use checkpointing
> >   * @return 0 on success, -1 on failure
> >   */
> >-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> >-                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
> >+int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> >+                   uint32_t flags /* XCFLAGS_xxx */,
> >                     struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd);
> >diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
> >index 15c838f..2af64e4 100644
> >--- a/tools/libxc/xc_nomigrate.c
> >+++ b/tools/libxc/xc_nomigrate.c
> >@@ -20,8 +20,7 @@
> >  #include <xenctrl.h>
> >  #include <xenguest.h>
> >-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> >-                   uint32_t max_factor, uint32_t flags,
> >+int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
> >                     struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd)
> >  {
> >diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> >index b1aa88e..a9160bd 100644
> >--- a/tools/libxc/xc_sr_common.h
> >+++ b/tools/libxc/xc_sr_common.h
> >@@ -198,12 +198,11 @@ struct xc_sr_context
> >              /* Further debugging information in the stream. */
> >              bool debug;
> >-            /* Parameters for tweaking live migration. */
> >-            unsigned max_iterations;
> >-            unsigned dirty_threshold;
> >-
> >              unsigned long p2m_size;
> >+            struct precopy_stats stats;
> >+            int policy_decision;
> >+
> >              xen_pfn_t *batch_pfns;
> >              unsigned nr_batch_pfns;
> >              unsigned long *deferred_pages;
> >diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> >index 797aec5..eb95334 100644
> >--- a/tools/libxc/xc_sr_save.c
> >+++ b/tools/libxc/xc_sr_save.c
> >@@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx)
> >  }
> >  /*
> >+ * Test if the batch is full.
> >+ */
> >+static bool batch_full(struct xc_sr_context *ctx)
> >+{
> >+    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
> >+}
> >+
> >+/*
> >+ * Test if the batch is empty.
> >+ */
> >+static bool batch_empty(struct xc_sr_context *ctx)
> >+{
> >+    return ctx->save.nr_batch_pfns == 0;
> >+}
> >+
> >+/*
> >   * Flush a batch of pfns into the stream.
> >   */
> >  static int flush_batch(struct xc_sr_context *ctx)
> >  {
> >      int rc = 0;
> >-    if ( ctx->save.nr_batch_pfns == 0 )
> >+    if ( batch_empty(ctx) )
> >          return rc;
> >      rc = write_batch(ctx);
> >@@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx)
> >  }
> >  /*
> >- * Add a single pfn to the batch, flushing the batch if full.
> >+ * Add a single pfn to the batch.
> >   */
> >-static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> >+static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> >  {
> >-    int rc = 0;
> >-
> >-    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
> >-        rc = flush_batch(ctx);
> >-
> >-    if ( rc == 0 )
> >-        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> >-
> >-    return rc;
> >+    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
> >+    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> >  }
> >  /*
> >@@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx)
> >   * Send a subset of pages in the guests p2m, according to the dirty bitmap.
> >   * Used for each subsequent iteration of the live migration loop.
> >   *
> >+ * During the precopy stage of a live migration, test the user-supplied
> >+ * policy function after each batch of pages and cut off the operation
> >+ * early if indicated.  Unless aborting, the dirty pages remaining in this round
> >+ * are transferred into the deferred_pages bitmap.
> >+ *
> >   * Bitmap is bounded by p2m_size.
> >   */
> >  static int send_dirty_pages(struct xc_sr_context *ctx,
> >-                            unsigned long entries)
> >+                            unsigned long entries, bool precopy)
> >  {
> >      xc_interface *xch = ctx->xch;
> >      xen_pfn_t p;
> >@@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
> >      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >                                      &ctx->save.dirty_bitmap_hbuf);
> >-    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
> >+    int (*precopy_policy)(struct precopy_stats, void *) =
> >+        ctx->save.callbacks->precopy_policy;
> >+    void *data = ctx->save.callbacks->data;
> >+
> >+    assert(batch_empty(ctx));
> >+    for ( p = 0, written = 0; p < ctx->save.p2m_size; )
> >      {
> >-        if ( !test_bit(p, dirty_bitmap) )
> >-            continue;
> >+        if ( ctx->save.live && precopy )
> >+        {
> >+            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
> >+            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> >+            {
> >+                return -1;
> >+            }
> >+            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> >+            {
> >+                /* Any outstanding dirty pages are now deferred until the next
> >+                 * phase of the migration. */
> >+                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> >+                          ctx->save.p2m_size);
> >+                if ( entries > written )
> >+                    ctx->save.nr_deferred_pages += entries - written;
> >+
> >+                goto done;
> >+            }
> >+        }
> >-        rc = add_to_batch(ctx, p);
> >+        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
> >+        {
> >+            if ( test_and_clear_bit(p, dirty_bitmap) )
> >+            {
> >+                add_to_batch(ctx, p);
> >+                ++written;
> >+                ++ctx->save.stats.total_written;
> >+            }
> >+        }
> >+
> >+        rc = flush_batch(ctx);
> >          if ( rc )
> >              return rc;
> >-        /* Update progress every 4MB worth of memory sent. */
> >-        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
> >-            xc_report_progress_step(xch, written, entries);
> >-
> >-        ++written;
> >+        /* Update progress after every batch (4MB) worth of memory sent. */
> >+        xc_report_progress_step(xch, written, entries);
> >      }
> >-    rc = flush_batch(ctx);
> >-    if ( rc )
> >-        return rc;
> >-
> >      if ( written > entries )
> >          DPRINTF("Bitmap contained more entries than expected...");
> >      xc_report_progress_step(xch, entries, entries);
> >+ done:
> >      return ctx->save.ops.check_vm_state(ctx);
> >  }
> >@@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
> >   * Send all pages in the guests p2m.  Used as the first iteration of the live
> >   * migration loop, and for a non-live save.
> >   */
> >-static int send_all_pages(struct xc_sr_context *ctx)
> >+static int send_all_pages(struct xc_sr_context *ctx, bool precopy)
> >  {
> >      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >                                      &ctx->save.dirty_bitmap_hbuf);
> >      bitmap_set(dirty_bitmap, ctx->save.p2m_size);
> >-    return send_dirty_pages(ctx, ctx->save.p2m_size);
> >+    return send_dirty_pages(ctx, ctx->save.p2m_size, precopy);
> >  }
> >  static int enable_logdirty(struct xc_sr_context *ctx)
> >@@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
> >      xc_interface *xch = ctx->xch;
> >      char *new_str = NULL;
> >-    if ( asprintf(&new_str, "Frames iteration %u of %u",
> >-                  iter, ctx->save.max_iterations) == -1 )
> >+    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
> >      {
> >          PERROR("Unable to allocate new progress string");
> >          return -1;
> >@@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx)
> >      xc_interface *xch = ctx->xch;
> >      xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
> >      char *progress_str = NULL;
> >-    unsigned x;
> >      int rc;
> >+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >+                                    &ctx->save.dirty_bitmap_hbuf);
> >+
> >+    int (*precopy_policy)(struct precopy_stats, void *) =
> >+        ctx->save.callbacks->precopy_policy;
> >+    void *data = ctx->save.callbacks->data;
> >+
> >      rc = update_progress_string(ctx, &progress_str, 0);
> >      if ( rc )
> >          goto out;
> >-    rc = send_all_pages(ctx);
> >+#define CONSULT_POLICY                                                        \
> >+    do {                                                                      \
> >+        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )                  \
> >+        {                                                                     \
> >+            rc = -1;                                                          \
> >+            goto out;                                                         \
> >+        }                                                                     \
> >+        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )  \
> >+        {                                                                     \
> >+            rc = 0;                                                           \
> >+            goto out;                                                         \
> >+        }                                                                     \
> >+    } while (0)
> >+
> >+    ctx->save.stats = (struct precopy_stats)
> >+        {
> >+            .iteration     = 0,
> >+            .total_written = 0,
> >+            .dirty_count   = -1
> >+        };
> >+    rc = send_all_pages(ctx, /* precopy */ true);
> >      if ( rc )
> >          goto out;
> >-    for ( x = 1;
> >-          ((x < ctx->save.max_iterations) &&
> >-           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
> >+    /* send_all_pages() has updated the stats */
> >+    CONSULT_POLICY;
> >+
> >+    for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration )
> >      {
> >          if ( xc_shadow_control(
> >                   xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
> >@@ -493,18 +559,42 @@ static int send_memory_live(struct xc_sr_context *ctx)
> >              goto out;
> >          }
> >-        if ( stats.dirty_count == 0 )
> >-            break;
> >+        /* Check the new dirty_count against the policy. */
> >+        ctx->save.stats.dirty_count = stats.dirty_count;
> >+        ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
> >+        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> >+        {
> >+            rc = -1;
> >+            goto out;
> >+        }
> >+        else if (ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> >+        {
> >+            bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> >+                      ctx->save.p2m_size);
> >+            ctx->save.nr_deferred_pages += stats.dirty_count;
> >+            rc = 0;
> >+            goto out;
> >+        }
> >+
> >+        /* After this point we won't know how many pages are really dirty until
> >+         * the next iteration. */
> >+        ctx->save.stats.dirty_count = -1;
> >-        rc = update_progress_string(ctx, &progress_str, x);
> >+        rc = update_progress_string(ctx, &progress_str,
> >+                                    ctx->save.stats.iteration);
> >          if ( rc )
> >              goto out;
> >-        rc = send_dirty_pages(ctx, stats.dirty_count);
> >+        rc = send_dirty_pages(ctx, stats.dirty_count, /* precopy */ true);
> >          if ( rc )
> >              goto out;
> >+
> >+        /* send_dirty_pages() has updated the stats */
> >+        CONSULT_POLICY;
> >      }
> >+#undef CONSULT_POLICY
> >+
> >   out:
> >      xc_set_progress_prefix(xch, NULL);
> >      free(progress_str);
> >@@ -595,7 +685,7 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
> >      if ( ctx->save.live )
> >      {
> >          rc = update_progress_string(ctx, &progress_str,
> >-                                    ctx->save.max_iterations);
> >+                                    ctx->save.stats.iteration);
> >          if ( rc )
> >              goto out;
> >      }
> >@@ -614,7 +704,8 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
> >          }
> >      }
> >-    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
> >+    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages,
> >+                          /* precopy */ false);
> >      if ( rc )
> >          goto out;
> >@@ -645,7 +736,7 @@ static int verify_frames(struct xc_sr_context *ctx)
> >          goto out;
> >      xc_set_progress_prefix(xch, "Frames verify");
> >-    rc = send_all_pages(ctx);
> >+    rc = send_all_pages(ctx, /* precopy */ false);
> >      if ( rc )
> >          goto out;
> >@@ -719,7 +810,7 @@ static int send_domain_memory_nonlive(struct xc_sr_context *ctx)
> >      xc_set_progress_prefix(xch, "Frames");
> >-    rc = send_all_pages(ctx);
> >+    rc = send_all_pages(ctx, /* precopy */ false);
> >      if ( rc )
> >          goto err;
> >@@ -910,8 +1001,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
> >  };
> >  int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> >-                   uint32_t max_iters, uint32_t max_factor, uint32_t flags,
> >-                   struct save_callbacks* callbacks, int hvm,
> >+                   uint32_t flags, struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd)
> >  {
> >      struct xc_sr_context ctx =
> >@@ -932,25 +1022,17 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> >             stream_type == XC_MIG_STREAM_REMUS ||
> >             stream_type == XC_MIG_STREAM_COLO);
> >-    /*
> >-     * TODO: Find some time to better tweak the live migration algorithm.
> >-     *
> >-     * These parameters are better than the legacy algorithm especially for
> >-     * busy guests.
> >-     */
> >-    ctx.save.max_iterations = 5;
> >-    ctx.save.dirty_threshold = 50;
> >-
> >      /* Sanity checks for callbacks. */
> >      if ( hvm )
> >          assert(callbacks->switch_qemu_logdirty);
> >+    if ( ctx.save.live )
> >+        assert(callbacks->precopy_policy);
> >      if ( ctx.save.checkpointed )
> >          assert(callbacks->checkpoint && callbacks->aftercopy);
> >      if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
> >          assert(callbacks->wait_checkpoint);
> >-    DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
> >-            io_fd, dom, max_iters, max_factor, flags, hvm);
> >+    DPRINTF("fd %d, dom %u, flags %u, hvm %d", io_fd, dom, flags, hvm);
> >      if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
> >      {
> >diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
> >index 77fe30e..6d28cce 100644
> >--- a/tools/libxl/libxl_dom_save.c
> >+++ b/tools/libxl/libxl_dom_save.c
> >@@ -328,6 +328,25 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
> >      return rc;
> >  }
> >+/*
> >+ * This is the live migration precopy policy - it's called periodically during
> >+ * the precopy phase of live migrations, and is responsible for deciding when
> >+ * the precopy phase should terminate and what should be done next.
> >+ *
> >+ * The policy implemented here behaves identically to the policy previously
> >+ * hard-coded into xc_domain_save() - it proceeds to the stop-and-copy phase of
> >+ * the live migration when there are either fewer than 50 dirty pages, or more
> >+ * than 5 precopy rounds have completed.
> >+ */
> >+static int libxl__save_live_migration_simple_precopy_policy(
> >+    struct precopy_stats stats, void *user)
> >+{
> >+    return ((stats.dirty_count >= 0 && stats.dirty_count < 50) ||
> >+            stats.iteration >= 5)
> >+        ? XGS_POLICY_STOP_AND_COPY
> >+        : XGS_POLICY_CONTINUE_PRECOPY;
> >+}
> >+
> >  /*----- main code for saving, in order of execution -----*/
> >  void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
> >@@ -401,6 +420,7 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
> >      if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
> >          callbacks->suspend = libxl__domain_suspend_callback;
> >+    callbacks->precopy_policy = libxl__save_live_migration_simple_precopy_policy;
> >      callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
> >      dss->sws.ao  = dss->ao;
> >diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
> >index 46b892c..026b572 100644
> >--- a/tools/libxl/libxl_save_callout.c
> >+++ b/tools/libxl/libxl_save_callout.c
> >@@ -89,8 +89,7 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_save_state *dss,
> >          libxl__srm_callout_enumcallbacks_save(&shs->callbacks.save.a);
> >      const unsigned long argnums[] = {
> >-        dss->domid, 0, 0, dss->xcflags, dss->hvm,
> >-        cbflags, dss->checkpointed_stream,
> >+        dss->domid, dss->xcflags, dss->hvm, cbflags, dss->checkpointed_stream,
> >      };
> >      shs->ao = ao;
> >diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c
> >index d3def6b..0241a6b 100644
> >--- a/tools/libxl/libxl_save_helper.c
> >+++ b/tools/libxl/libxl_save_helper.c
> >@@ -251,8 +251,6 @@ int main(int argc, char **argv)
> >          io_fd =                             atoi(NEXTARG);
> >          recv_fd =                           atoi(NEXTARG);
> >          uint32_t dom =                      strtoul(NEXTARG,0,10);
> >-        uint32_t max_iters =                strtoul(NEXTARG,0,10);
> >-        uint32_t max_factor =               strtoul(NEXTARG,0,10);
> >          uint32_t flags =                    strtoul(NEXTARG,0,10);
> >          int hvm =                           atoi(NEXTARG);
> >          unsigned cbflags =                  strtoul(NEXTARG,0,10);
> >@@ -264,9 +262,8 @@ int main(int argc, char **argv)
> >          startup("save");
> >          setup_signals(save_signal_handler);
> >-        r = xc_domain_save(xch, io_fd, dom, max_iters, max_factor, flags,
> >-                           &helper_save_callbacks, hvm, stream_type,
> >-                           recv_fd);
> >+        r = xc_domain_save(xch, io_fd, dom, flags, &helper_save_callbacks, hvm,
> >+                           stream_type, recv_fd);
> >          complete(r);
> >      } else if (!strcmp(mode,"--restore-domain")) {
> >diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
> >index 27845bb..50c97b4 100755
> >--- a/tools/libxl/libxl_save_msgs_gen.pl
> >+++ b/tools/libxl/libxl_save_msgs_gen.pl
> >@@ -33,6 +33,7 @@ our @msgs = (
> >                                                'xen_pfn_t', 'console_gfn'] ],
> >      [  9, 'srW',    "complete",              [qw(int retval
> >                                                   int errnoval)] ],
> >+    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
> >  );
> >  #----------------------------------------
> >@@ -141,7 +142,8 @@ static void bytes_put(unsigned char *const buf, int *len,
> >  END
> >-foreach my $simpletype (qw(int uint16_t uint32_t unsigned), 'unsigned long', 'xen_pfn_t') {
> >+foreach my $simpletype (qw(int uint16_t uint32_t unsigned),
> >+                        'unsigned long', 'xen_pfn_t', 'struct precopy_stats') {
> >      my $typeid = typeid($simpletype);
> >      $out_body{'callout'} .= <<END;
> >  static int ${typeid}_get(const unsigned char **msg,
> 


* Re: [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters
  2017-03-29 21:08   ` Andrew Cooper
@ 2017-03-30  6:03     ` Joshua Otto
  2017-04-12 15:37       ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-30  6:03 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On Wed, Mar 29, 2017 at 10:08:02PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > In the context of the live migration algorithm, the precopy iteration
> > count refers to the number of page-copying iterations performed prior to
> > the suspension of the guest and transmission of the final set of dirty
> > pages.  Similarly, the precopy dirty threshold refers to the dirty page
> > count below which we judge it more profitable to proceed to
> > stop-and-copy rather than continue with the precopy.  These would be
> > helpful tuning parameters to work with when migrating particularly busy
> > guests, as they enable an administrator to reap the available benefits
> > of the precopy algorithm (the transmission of guest pages _not_ in the
> > writable working set can be completed without guest downtime) while
> > reducing the total amount of time required for the migration (as
> > iterations of the precopy loop that will certainly be redundant can be
> > skipped in favour of an earlier suspension).
> >
> > To expose these tuning parameters to users:
> > - introduce a new libxl API function, libxl_domain_live_migrate(),
> >   taking the same parameters as libxl_domain_suspend() _and_
> >   precopy_iterations and precopy_dirty_threshold parameters, and
> >   consider these parameters in the precopy policy
> >
> >   (though a pair of new parameters on their own might not warrant an
> >   entirely new API function, it is added in anticipation of a number of
> >   additional migration-only parameters that would be cumbersome on the
> >   whole to tack on to the existing suspend API)
> >
> > - switch xl migrate to the new libxl_domain_live_migrate() and add new
> >   --precopy-iterations and --precopy-threshold parameters to pass
> >   through
> >
> > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> 
> This will have to defer to the tools maintainers, but I purposefully
> didn't expose these knobs to users when rewriting live migration,
> because they cannot be meaningfully chosen by anyone outside of a
> testing scenario.  (That is not to say they aren't useful for testing
> purposes, but I didn't upstream my version of this patch.)

Ahhh, I wondered why those parameters to xc_domain_save() were present
but ignored.  That's reasonable.

I guess the way I had imagined an administrator using them would be in a
non-production/test environment - if they could run workloads
representative of their production application in this environment, they
could experiment with different --precopy-iterations and
--precopy-threshold values (having just a high-level understanding of
what they control) and choose the ones that result in the best outcome
for later use in production.

> I spent quite a while wondering how best to expose these tunables in a
> way that end users could sensibly use them, and the best I came up with
> was this:
> 
> First, run the guest under logdirty for a period of time to establish
> the working set, and how steady it is.  From this, you have a baseline
> for the target threshold, and a plausible way of estimating the
> downtime.  (Better yet, as XenCenter, XenServer's Windows GUI, has proved
> time and time again, users love graphs!  Even if they don't necessarily
> understand them.)
> 
> From this baseline, the condition you need to care about is the rate
> of convergence.  On a steady VM, you should converge asymptotically to
> the measured threshold, although on 5 or fewer iterations, the
> asymptotic properties don't appear cleanly.  (Of course, the larger the
> VM, the more iterations, and the more likely to spot this.)
> 
> Users will either care about the migration completing successfully, or
> avoiding interrupting the workload.  The majority case would be both,
> but every user will have one of these two options which is more
> important than the other.  As a result, there need to be some options to
> cover "if $X happens, do I continue or abort".
> 
> The case where the VM becomes more busy is harder however.  For the
> users which care about not interrupting the workload, there will be a
> point above which they'd prefer to abort the migration rather than
> continue it.  For the users which want the migration to complete, they'd
> prefer to pause the VM and take a downtime hit, rather than aborting.
> 
> Therefore, you really need two thresholds; the one above which you
> always abort, the one where you would normally choose to pause.  The
> decision as to what to do depends on where you are between these
> thresholds when the dirty state converges.  (Of course, if the VM
> suddenly becomes more idle, it is sensible to continue beyond the lower
> threshold, as it will reduce the downtime.)  The absolute number of
> iterations on the other hand doesn't actually matter from a users point
> of view, so isn't a useful control to have.
> 
> Another thing to be careful with is the measure of convergence with
> respect to guest busyness, and other factors influencing the absolute
> iteration time, such as congestion of the network between the two
> hosts.  I haven't yet come up with a sensible way of reconciling this
> with the above, in a way which can be expressed as a useful set of controls.
> 
> 
> The plan, following migration v2, was always to come back to this and
> see about doing something better than the current hard coded parameters,
> but I am still working on fixing migration in other areas (not having
> VMs crash when moving, because they observe important differences in the
> hardware).

I think a good strategy would be to solicit three parameters from the
user:
- the precopy duration they're willing to tolerate
- the downtime duration they're willing to tolerate
- the bandwidth of the link between the hosts (we could try and estimate
  it for them but I'd rather just make them run iperf)

Then, after applying this patch, alter the policy so that precopy simply
runs for the duration that the user is willing to wait.  After that,
using the bandwidth estimate, compute the approximate downtime required
to transfer the final set of dirty-pages.  If this is less than what the
user indicated is acceptable, proceed with the stop-and-copy - otherwise
abort.

This still requires the user to figure out for themselves how long their
workload can really wait, but hopefully they already had some idea
before deciding to attempt live migration in the first place.
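
To make that concrete, here is a rough sketch of such a policy against
the precopy_policy callback from patch 07 (the params struct, its
fields, and the start-time bookkeeping are all hypothetical - only
precopy_stats and the XGS_POLICY_* constants come from the series):

#include <stdint.h>
#include <time.h>

/* Hypothetical knobs gathered from the user up front. */
struct migration_policy_params
{
    unsigned max_precopy_duration_s;  /* precopy time the user tolerates */
    unsigned max_downtime_s;          /* downtime the user tolerates */
    uint64_t link_bandwidth_Bps;      /* link bandwidth, e.g. from iperf */
    time_t migration_start;           /* recorded when precopy began */
};

static int duration_bounded_precopy_policy(struct precopy_stats stats,
                                           void *user)
{
    struct migration_policy_params *p = user;
    time_t elapsed_s = time(NULL) - p->migration_start;

    /* Run precopy for as long as the user is willing to wait, and keep
     * going while this round's dirty count is still unknown (-1). */
    if ( elapsed_s < p->max_precopy_duration_s || stats.dirty_count < 0 )
        return XGS_POLICY_CONTINUE_PRECOPY;

    /* Estimate the stop-and-copy downtime from the remaining dirty
     * pages (4096-byte pages assumed) and the measured bandwidth. */
    uint64_t downtime_s = ((uint64_t)stats.dirty_count * 4096) /
                          p->link_bandwidth_Bps;

    return downtime_s <= p->max_downtime_s ? XGS_POLICY_STOP_AND_COPY
                                           : XGS_POLICY_ABORT;
}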

> How does your postcopy proposal influence/change the above logic?

Well, the 'downtime' phase of the migration becomes a very short, fixed
interval, regardless of guest busyness, so you can't ask the user 'how
much downtime can you tolerate?'  Instead, the question becomes the
murkier 'how much memory performance degradation can your guest
tolerate?'  I.e. is the postcopy migration going to essentially be
downtime, or can useful work get done between faults? (for example,
guests that are I/O bound would do much better with postcopy than they
would with a long stop-and-copy)

To answer that question, they're back to the approach I outlined at the
beginning - they'd have to experiment in a test environment and observe
their workload's response to the alternatives to make an informed
choice.

Cheers,

Josh


* Re: [PATCH RFC 00/20] Add postcopy live migration support
  2017-03-29 22:50 ` Andrew Cooper
@ 2017-03-31  4:51   ` Joshua Otto
  2017-04-12 15:38     ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Joshua Otto @ 2017-03-31  4:51 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: ian.jackson, hjarmstr, wei.liu2, czylin, imhy.yang

On Wed, Mar 29, 2017 at 11:50:52PM +0100, Andrew Cooper wrote:
> On 27/03/2017 10:06, Joshua Otto wrote:
> > Hi,
> >
> > We're a team of three fourth-year undergraduate software engineering students at
> > the University of Waterloo in Canada.  In late 2015 we posted on the list [1] to
> > ask for a project to undertake for our program's capstone design project, and
> > Andrew Cooper pointed us in the direction of the live migration implementation
> > as an area that could use some attention.  We were particularly interested in
> > post-copy live migration (as evaluated by [2] and discussed on the list at [3]),
> > and have been working on an implementation of this on-and-off since then.
> >
> > We now have a working implementation of this scheme, and are submitting it for
> > comment.  The changes are also available as the 'postcopy' branch of the GitHub
> > repository at [4]
> >
> > As a brief overview of our approach:
> > - We introduce a mechanism by which libxl can indicate to the libxc stream
> >   helper process that the iterative migration precopy loop should be terminated
> >   and postcopy should begin.
> > - At this point, we suspend the domain, collect the final set of dirty pfns and
> >   write these pfns (and _not_ their contents) into the stream.
> > - At the destination, the xc restore logic registers itself as a pager for the
> >   migrating domain, 'evicts' all of the pfns indicated by the sender as
> >   outstanding, and then resumes the domain at the destination.
> > - As the domain executes, the migration sender continues to push the remaining
> >   oustanding pages to the receiver in the background.  The receiver
> >   monitors both the stream for incoming page data and the paging ring event
> >   channel for page faults triggered by the guest.  Page faults are forwarded on
> >   the back-channel migration stream to the migration sender, which prioritizes
> >   these pages for transmission.
> >
> > By leveraging the existing paging API, we are able to implement the postcopy
> > scheme without any hypervisor modifications - all of our changes are confined to
> > the userspace toolstack.  However, we inherit from the paging API the
> > requirement that the domains be HVM and that the host have HAP/EPT support.
> 
> Wow.  Considering that the paging API has had no in-tree consumers (and
> its out-of-tree consumer folded), I am astounded that it hasn't bitrotten.

Well, there's tools/xenpaging, which was a helpful reference when
putting this together.  The user-space pager actually has rotted a bit
(I'm fairly certain the VM event ring protocol has changed subtly under
its feet), so I also needed to consult tools/xen-access to get things
right.

> 
> >
> > We haven't yet had the opportunity to perform a quantitative evaluation of the
> > performance trade-offs between the traditional pre-copy and our post-copy
> > strategies, but intend to.  Informally, we've been testing our implementation by
> > migrating a domain running the x86 memtest program (which is obviously a
> > tremendously write-heavy workload), and have observed a substantial reduction in
> > total time required for migration completion (at the expense of a visually
> > obvious 'slowdown' in the execution of the program).
> 
> Do you have any numbers, even for this informal testing?

We have a much more ambitious test matrix planned, but sure, here's an
early encouraging set of measurements - for a domain with 2GB of memory
and a 256MB writable working set (the application driving the writes
being fio submitting writes against a ramdisk), we measured these times:

                    Pre-copy + Stop-and-copy |  1 precopy iteration +
                             (s)             |       postcopy (s)
                   --------------------------+-------------------------
 Precopy Duration:           66.97           |         44.44
 Suspend Duration:            6.807          |          3.23
Postcopy Duration:            N/A            |          4.83

However...

That 3.23s suspend for the hybrid migration seems too high, doesn't it?

There's currently a serious performance bug that we're still trying to
work out in the case of pure-postcopy migrations, with no leading
precopy.  Attempting a pure postcopy migration when running the
experiment above yields:

                     Pure postcopy (s)
                   ----------------------
 Precopy Duration:           0
 Suspend Duration:          21.93
Postcopy Duration:          44.22

Although the postcopy scheme clearly works, it takes 21.93s (!) to
unpause the guest at the destination.  The eviction of the unmigrated
pages completes in a second or two because of the lack of batching
support (still bad, but not this bad) - the holdup is somewhere on the
domain creation sequence between domcreate_stream_done() and
domcreate_complete().

I suspect that this is the result of a bad interaction between QEMU's
startup sequence (its foreign memory mapping behaviour in particular)
and the postcopy paging.  Specifically: the paging ring has room only
for 8 requests at a time.  When QEMU attempts to map a large range, the
range gets postcopy-faulted over synchronously in batches of 8 pages at
a time, and each such batch implies a synchronous copy of its pages
over the network (and the 100us xenforeignmemory_map() retry timer)
before the next batch can begin.

If I am able to confirm that this is the case, a sensible solution would
seem to be supporting paging range-population requests (i.e. a new
paging ring request type for a _range_ of gfns).  In the mean time, you
should expect to observe this effect as well in experiments.  It appears
to be largely (but not completely) mitigated by performing a single
pre-copy iteration first.
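
(Back-of-envelope: with room for only 8 requests in the ring, populating
N pages this way costs on the order of N/8 serialised round trips, each
also paying the 100us retry timer, so large mappings rack up seconds
very quickly.)

The range-request idea would amount to a new ring message along these
lines - purely illustrative, since no such request type exists in the
vm_event ABI or in this series:

/* Hypothetical 'populate range' paging request, letting one ring slot
 * ask the pager for a whole run of contiguous gfns. */
struct vm_event_paging_range
{
    uint64_t gfn_start;   /* first gfn of the faulting mapping */
    uint64_t nr_frames;   /* number of contiguous frames to populate */
    uint32_t flags;
    uint32_t _pad;
};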

> 
> >   We've also noticed that,
> > when performing a postcopy without any leading precopy iterations, the time
> > required at the destination to 'evict' all of the outstanding pages is
> > substantial - possibly because there is no batching mechanism by which pages can
> > be evicted - so this area in particular might require further attention.
> >
> > We're really interested in any feedback you might have!
> 
> Do you have a design document for this?  The spec modifications and code
> comments are great, but there is no substitute (as far as understanding
> goes) for a description in terms of the algorithm and design choices.

As I replied to Wei, not yet, but we'd happily prepare one for v2.

Thanks!

Josh


* Re: [PATCH RFC 00/20] Add postcopy live migration support
  2017-03-30  4:13   ` Joshua Otto
@ 2017-03-31 14:19     ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-03-31 14:19 UTC (permalink / raw)
  To: Joshua Otto
  Cc: Wei Liu, andrew.cooper3, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Thu, Mar 30, 2017 at 12:13:51AM -0400, Joshua Otto wrote:
[...]
> Will do.  As a general question for those following the thread, are there any
> application workloads/benchmarks that people would find particularly
> interesting?
> 

I think any memory-intensive workload will do. Others might have their
preferences.

> The experiment that we've planned but haven't had the time to follow through
> fully is to mount a ramdisk inside the guest and use Axboe's fio to test all of
> the entries in the (read/write mix) x (working set size) x (access pattern)
> matrix.

This sounds reasonable.

> 
> Thank you again for your feedback!
> 
> Josh


* Re: [PATCH RFC 02/20] libxc/xc_sr: parameterise write_record() on fd
  2017-03-27  9:06 ` [PATCH RFC 02/20] libxc/xc_sr: parameterise write_record() on fd Joshua Otto
  2017-03-28 18:53   ` Andrew Cooper
@ 2017-03-31 14:19   ` Wei Liu
  1 sibling, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-03-31 14:19 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Mon, Mar 27, 2017 at 05:06:14AM -0400, Joshua Otto wrote:
> Right now, write_split_record() - which is delegated to by
> write_record() - implicitly writes to ctx->fd.  This means it can't be
> used with the restore context's send_back_fd, which is unhandy.
> 
> Add an 'fd' parameter to both write_record() and write_split_record(),
> and mechanically update all existing callsites to pass ctx->fd for it.
> 
> No functional change.
> 
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

Acked-by: Wei Liu <wei.liu2@citrix.com>


* Re: [PATCH RFC 03/20] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list()
  2017-03-27  9:06 ` [PATCH RFC 03/20] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list() Joshua Otto
  2017-03-28 18:56   ` Andrew Cooper
@ 2017-03-31 14:19   ` Wei Liu
  1 sibling, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-03-31 14:19 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Mon, Mar 27, 2017 at 05:06:15AM -0400, Joshua Otto wrote:
> Teach send_checkpoint_dirty_pfn_list() to use write_record()'s new fd
> parameter, avoiding the need for a manual writev().
> 
> No functional change.
> 
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

Acked-by: Wei Liu <wei.liu2@citrix.com>


* Re: [PATCH RFC 01/20] tools: rename COLO 'postcopy' to 'aftercopy'
  2017-03-28 16:34   ` Wei Liu
@ 2017-04-11  6:19     ` Zhang Chen
  0 siblings, 0 replies; 53+ messages in thread
From: Zhang Chen @ 2017-04-11  6:19 UTC (permalink / raw)
  To: Wei Liu, Joshua Otto
  Cc: zhangchen.fnst, andrew.cooper3, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr



On 03/29/2017 12:34 AM, Wei Liu wrote:
> Cc Chen
>
> On Mon, Mar 27, 2017 at 05:06:13AM -0400, Joshua Otto wrote:
>> The COLO xc domain save and restore procedures both make use of a 'postcopy'
>> callback to defer part of each checkpoint operation to xl.  In this context, the
>> name 'postcopy' is meant as "the callback invoked immediately after this
>> checkpoint's memory callback."  This is an unfortunate name collision with the
>> other common use of 'postcopy' in the context of live migration, where it is
>> used to mean "a memory migration that permits the guest to execute at the
>> destination before all of its memory is migrated by servicing accesses to
>> unmigrated memory via a network page-fault."
>>
>> Mechanically rename 'postcopy' -> 'aftercopy' to free up the postcopy namespace
>> while preserving the original intent of the name in the COLO context.
>>
>> No functional change.
>>
>> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

Acked-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>

>> ---
>>   tools/libxc/include/xenguest.h     | 4 ++--
>>   tools/libxc/xc_sr_restore.c        | 4 ++--
>>   tools/libxc/xc_sr_save.c           | 4 ++--
>>   tools/libxl/libxl_colo_restore.c   | 2 +-
>>   tools/libxl/libxl_colo_save.c      | 2 +-
>>   tools/libxl/libxl_remus.c          | 2 +-
>>   tools/libxl/libxl_save_msgs_gen.pl | 2 +-
>>   7 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
>> index 40902ee..aa8cc8b 100644
>> --- a/tools/libxc/include/xenguest.h
>> +++ b/tools/libxc/include/xenguest.h
>> @@ -53,7 +53,7 @@ struct save_callbacks {
>>        * xc_domain_save then flushes the output buffer, while the
>>        *  guest continues to run.
>>        */
>> -    int (*postcopy)(void* data);
>> +    int (*aftercopy)(void* data);
>>   
>>       /* Called after the memory checkpoint has been flushed
>>        * out into the network. Typical actions performed in this
>> @@ -115,7 +115,7 @@ struct restore_callbacks {
>>        * Callback function resumes the guest & the device model,
>>        * returns to xc_domain_restore.
>>        */
>> -    int (*postcopy)(void* data);
>> +    int (*aftercopy)(void* data);
>>   
>>       /* A checkpoint record has been found in the stream.
>>        * returns: */
>> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
>> index 3549f0a..ee06b3d 100644
>> --- a/tools/libxc/xc_sr_restore.c
>> +++ b/tools/libxc/xc_sr_restore.c
>> @@ -576,7 +576,7 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
>>                                                   ctx->restore.callbacks->data);
>>   
>>           /* Resume secondary vm */
>> -        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
>> +        ret = ctx->restore.callbacks->aftercopy(ctx->restore.callbacks->data);
>>           HANDLE_CALLBACK_RETURN_VALUE(ret);
>>   
>>           /* Wait for a new checkpoint */
>> @@ -855,7 +855,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>>       {
>>           /* this is COLO restore */
>>           assert(callbacks->suspend &&
>> -               callbacks->postcopy &&
>> +               callbacks->aftercopy &&
>>                  callbacks->wait_checkpoint &&
>>                  callbacks->restore_results);
>>       }
>> diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
>> index f98c827..fc63a55 100644
>> --- a/tools/libxc/xc_sr_save.c
>> +++ b/tools/libxc/xc_sr_save.c
>> @@ -863,7 +863,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
>>                   }
>>               }
>>   
>> -            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
>> +            rc = ctx->save.callbacks->aftercopy(ctx->save.callbacks->data);
>>               if ( rc <= 0 )
>>                   goto err;
>>   
>> @@ -951,7 +951,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
>>       if ( hvm )
>>           assert(callbacks->switch_qemu_logdirty);
>>       if ( ctx.save.checkpointed )
>> -        assert(callbacks->checkpoint && callbacks->postcopy);
>> +        assert(callbacks->checkpoint && callbacks->aftercopy);
>>       if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
>>           assert(callbacks->wait_checkpoint);
>>   
>> diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
>> index 0c535bd..7d8f9ff 100644
>> --- a/tools/libxl/libxl_colo_restore.c
>> +++ b/tools/libxl/libxl_colo_restore.c
>> @@ -246,7 +246,7 @@ void libxl__colo_restore_setup(libxl__egc *egc,
>>       if (init_dsps(&crcs->dsps))
>>           goto out;
>>   
>> -    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
>> +    callbacks->aftercopy = libxl__colo_restore_domain_resume_callback;
>>       callbacks->wait_checkpoint = libxl__colo_restore_domain_wait_checkpoint_callback;
>>       callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
>>       callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
>> diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
>> index f687d5a..5921196 100644
>> --- a/tools/libxl/libxl_colo_save.c
>> +++ b/tools/libxl/libxl_colo_save.c
>> @@ -145,7 +145,7 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
>>   
>>       callbacks->suspend = libxl__colo_save_domain_suspend_callback;
>>       callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
>> -    callbacks->postcopy = libxl__colo_save_domain_resume_callback;
>> +    callbacks->aftercopy = libxl__colo_save_domain_resume_callback;
>>       callbacks->wait_checkpoint = libxl__colo_save_domain_wait_checkpoint_callback;
>>   
>>       libxl__checkpoint_devices_setup(egc, &dss->cds);
>> diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
>> index 29a4783..1453365 100644
>> --- a/tools/libxl/libxl_remus.c
>> +++ b/tools/libxl/libxl_remus.c
>> @@ -110,7 +110,7 @@ void libxl__remus_setup(libxl__egc *egc, libxl__remus_state *rs)
>>       dss->sws.checkpoint_callback = remus_checkpoint_stream_written;
>>   
>>       callbacks->suspend = libxl__remus_domain_suspend_callback;
>> -    callbacks->postcopy = libxl__remus_domain_resume_callback;
>> +    callbacks->aftercopy = libxl__remus_domain_resume_callback;
>>       callbacks->checkpoint = libxl__remus_domain_save_checkpoint_callback;
>>   
>>       libxl__checkpoint_devices_setup(egc, cds);
>> diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
>> index 3ae7373..27845bb 100755
>> --- a/tools/libxl/libxl_save_msgs_gen.pl
>> +++ b/tools/libxl/libxl_save_msgs_gen.pl
>> @@ -24,7 +24,7 @@ our @msgs = (
>>                                                   'unsigned long', 'done',
>>                                                   'unsigned long', 'total'] ],
>>       [  3, 'srcxA',  "suspend", [] ],
>> -    [  4, 'srcxA',  "postcopy", [] ],
>> +    [  4, 'srcxA',  "aftercopy", [] ],
>>       [  5, 'srcxA',  "checkpoint", [] ],
>>       [  6, 'srcxA',  "wait_checkpoint", [] ],
>>       [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
>> -- 
>> 2.7.4
>>
>
> .
>

-- 
Thanks
Zhang Chen





* Re: [PATCH RFC 06/20] libxc/xc_sr: factor helpers out of handle_page_data()
  2017-03-30  4:49     ` Joshua Otto
@ 2017-04-12 15:16       ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-04-12 15:16 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, Andrew Cooper, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Thu, Mar 30, 2017 at 12:49:07AM -0400, Joshua Otto wrote:
> On Tue, Mar 28, 2017 at 08:52:26PM +0100, Andrew Cooper wrote:
> > On 27/03/17 10:06, Joshua Otto wrote:
> > > diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
> > > index 3291b25..32400b2 100644
> > > --- a/tools/libxc/xc_sr_stream_format.h
> > > +++ b/tools/libxc/xc_sr_stream_format.h
> > > @@ -80,15 +80,15 @@ struct xc_sr_rhdr
> > >  #define REC_TYPE_OPTIONAL             0x80000000U
> > >  
> > >  /* PAGE_DATA */
> > > -struct xc_sr_rec_page_data_header
> > > +struct xc_sr_rec_pages_header
> > >  {
> > >      uint32_t count;
> > >      uint32_t _res1;
> > >      uint64_t pfn[0];
> > >  };
> > >  
> > > -#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
> > > -#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
> > > +#define REC_PFINFO_PFN_MASK  0x000fffffffffffffULL
> > > +#define REC_PFINFO_TYPE_MASK 0xf000000000000000ULL
> > >  
> > >  /* X86_PV_INFO */
> > >  struct xc_sr_rec_x86_pv_info
> > 
> > What are the purposes of these name changes?
> 
> I should definitely have explained this more explicitly, sorry about that.  I
> use the same exact structure (a count followed by a list of encoded pfns+types)
> for three additional record types (POSTCOPY_PFNS, POSTCOPY_PAGE_DATA, and
> POSTCOPY_FAULT) later in the series when postcopy is introduced.  To enable the
> generation and validation logic to be shared between all of the code that
> processes this sort of record, I renamed the structure and its associated masks
> to be more generic.
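
For example, every record of this shape can then share a single
validation helper along these lines (helper name illustrative;
stdint.h/stdbool.h assumed):

/* An entry is well-formed iff no bits outside the pfn and type fields
 * are set. */
static inline bool rec_pfinfo_entry_valid(uint64_t entry)
{
    return !(entry & ~(REC_PFINFO_PFN_MASK | REC_PFINFO_TYPE_MASK));
}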

This should be part of the commit message.

Wei.

> 
> Josh


* Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl
  2017-03-30  5:19     ` Joshua Otto
@ 2017-04-12 15:16       ` Wei Liu
  2017-04-18 17:56         ` Ian Jackson
  0 siblings, 1 reply; 53+ messages in thread
From: Wei Liu @ 2017-04-12 15:16 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, Andrew Cooper, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Thu, Mar 30, 2017 at 01:19:41AM -0400, Joshua Otto wrote:
> > 
> > > +};
> > > +
> > >  /* callbacks provided by xc_domain_save */
> > >  struct save_callbacks {
> > >      /* Called after expiration of checkpoint interval,
> > > @@ -46,6 +54,17 @@ struct save_callbacks {
> > >       */
> > >      int (*suspend)(void* data);
> > >  
> > > +    /* Called after every batch of page data sent during the precopy phase of a
> > > +     * live migration to ask the caller what to do next based on the current
> > > +     * state of the precopy migration.
> > > +     */
> > > +#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
> > > +                                        * tidy up. */
> > > +#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
> > > +#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
> > > +                                        * remaining dirty pages. */
> > > +    int (*precopy_policy)(struct precopy_stats stats, void *data);
> > 
> > Structures shouldn't be passed by value like this, as the compiler has
> > to do a lot of memcpy() work to make it happen.  You should pass by
> > const pointer, as (as far as I can tell), they are strictly read-only to
> > the implementation of this hook?
> 
> I chose to pass by value to make the IPC plumbing easier -
> libxl_save_msgs_gen.pl doesn't know what to do about pointers, and (not being
> the strongest Perl programmer...) I didn't want to volunteer to be the one to
> teach it.
> 
> Is the memcpy() really significant here?  If this were a tight loop, sure, but
> every invocation of the policy callback implies both a 4MB network transfer
> _and_ a synchronous RPC.

Ian, how can Joshua pass a pointer across the RPC boundary to avoid excessive
copying?

Wei.


* Re: [PATCH RFC 13/20] libxc/migration: add try_read_record()
  2017-03-27  9:06 ` [PATCH RFC 13/20] libxc/migration: add try_read_record() Joshua Otto
@ 2017-04-12 15:16   ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-04-12 15:16 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, andrew.cooper3, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Mon, Mar 27, 2017 at 05:06:25AM -0400, Joshua Otto wrote:
[...]
>  
> +int try_read_record(struct xc_sr_read_record_context *rrctx, int fd,
> +                    struct xc_sr_record *rec)
> +{
> +    int rc;
> +    xc_interface *xch = rrctx->ctx->xch;
> +    size_t offset_out, dataoff, datasz;
> +
> +    /* If the header isn't yet complete, attempt to finish it first. */
> +    if ( rrctx->offset < sizeof(rrctx->rhdr) )
> +    {
> +        rc = try_read_exact(fd, (char *)&rrctx->rhdr + rrctx->offset,
> +                            sizeof(rrctx->rhdr) - rrctx->offset, &offset_out);
> +        rrctx->offset += offset_out;
> +
> +        if ( rc )
> +            return rc;
> +        else
> +            assert(rrctx->offset == sizeof(rrctx->rhdr));

No need to have the "else" branch.
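
That is, letting the success path fall through to the assertion:

        if ( rc )
            return rc;

        assert(rrctx->offset == sizeof(rrctx->rhdr));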


* Re: [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters
  2017-03-30  6:03     ` Joshua Otto
@ 2017-04-12 15:37       ` Wei Liu
  2017-04-27 22:51         ` Joshua Otto
  0 siblings, 1 reply; 53+ messages in thread
From: Wei Liu @ 2017-04-12 15:37 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, Andrew Cooper, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Thu, Mar 30, 2017 at 02:03:29AM -0400, Joshua Otto wrote:
> On Wed, Mar 29, 2017 at 10:08:02PM +0100, Andrew Cooper wrote:
> > On 27/03/17 10:06, Joshua Otto wrote:
> > > In the context of the live migration algorithm, the precopy iteration
> > > count refers to the number of page-copying iterations performed prior to
> > > the suspension of the guest and transmission of the final set of dirty
> > > pages.  Similarly, the precopy dirty threshold refers to the dirty page
> > > count below which we judge it more profitable to proceed to
> > > stop-and-copy rather than continue with the precopy.  These would be
> > > helpful tuning parameters to work with when migrating particularly busy
> > > guests, as they enable an administrator to reap the available benefits
> > > of the precopy algorithm (the transmission of guest pages _not_ in the
> > > writable working set can be completed without guest downtime) while
> > > reducing the total amount of time required for the migration (as
> > > iterations of the precopy loop that will certainly be redundant can be
> > > skipped in favour of an earlier suspension).
> > >
> > > To expose these tuning parameters to users:
> > > - introduce a new libxl API function, libxl_domain_live_migrate(),
> > >   taking the same parameters as libxl_domain_suspend() _and_
> > >   precopy_iterations and precopy_dirty_threshold parameters, and
> > >   consider these parameters in the precopy policy
> > >
> > >   (though a pair of new parameters on their own might not warrant an
> > >   entirely new API function, it is added in anticipation of a number of
> > >   additional migration-only parameters that would be cumbersome on the
> > >   whole to tack on to the existing suspend API)
> > >
> > > - switch xl migrate to the new libxl_domain_live_migrate() and add new
> > >   --precopy-iterations and --precopy-threshold parameters to pass
> > >   through
> > >
> > > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> > 
> > This will have to defer to the tools maintainers, but I purposefully
> > didn't expose these knobs to users when rewriting live migration,
> > because they cannot be meaningfully chosen by anyone outside of a
> > testing scenario.  (That is not to say they aren't useful for testing
> > purposes, but I didn't upstream my version of this patch.)
> 
> Ahhh, I wondered why those parameters to xc_domain_save() were present
> but ignored.  That's reasonable.
> 
> I guess the way I had imagined an administrator using them would be in a
> non-production/test environment - if they could run workloads
> representative of their production application in this environment, they
> could experiment with different --precopy-iterations and
> --precopy-threshold values (having just a high-level understanding of
> what they control) and choose the ones that result in the best outcome
> for later use in production.
> 

Running in a test environment isn't always an option -- think about
public cloud providers who don't have control over the VMs or the
workload.

> > I spent quite a while wondering how best to expose these tunables in a
> > way that end users could sensibly use them, and the best I came up with
> > was this:
> > 
> > First, run the guest under logdirty for a period of time to establish
> > the working set, and how steady it is.  From this, you have a baseline
> > for the target threshold, and a plausible way of estimating the
> > downtime.  (Better yet, as XenCenter, XenServer's Windows GUI, has proved
> > time and time again, users love graphs!  Even if they don't necessarily
> > understand them.)
> > 
> > From this baseline, the condition you need to care about is the rate
> > of convergence.  On a steady VM, you should converge asymptotically to
> > the measured threshold, although on 5 or fewer iterations, the
> > asymptotic properties don't appear cleanly.  (Of course, the larger the
> > VM, the more iterations, and the more likely to spot this.)
> > 
> > Users will either care about the migration completing successfully, or
> > avoiding interrupting the workload.  The majority case would be both,
> > but every user will have one of these two options which is more
> > important than the other.  As a result, there need to be some options to
> > cover "if $X happens, do I continue or abort".
> > 
> > The case where the VM becomes more busy is harder however.  For the
> > users which care about not interrupting the workload, there will be a
> > point above which they'd prefer to abort the migration rather than
> > continue it.  For the users which want the migration to complete, they'd
> > prefer to pause the VM and take a downtime hit, rather than aborting.
> > 
> > Therefore, you really need two thresholds; the one above which you
> > always abort, the one where you would normally choose to pause.  The
> > decision as to what to do depends on where you are between these
> > thresholds when the dirty state converges.  (Of course, if the VM
> > suddenly becomes more idle, it is sensible to continue beyond the lower
> > threshold, as it will reduce the downtime.)  The absolute number of
> > iterations on the other hand doesn't actually matter from a users point
> > of view, so isn't a useful control to have.
> > 
> > Another thing to be careful with is the measure of convergence with
> > respect to guest busyness, and other factors influencing the absolute
> > iteration time, such as congestion of the network between the two
> > hosts.  I haven't yet come up with a sensible way of reconciling this
> > with the above, in a way which can be expressed as a useful set of controls.
> > 

My thought as well.

> > 
> > The plan, following migration v2, was always to come back to this and
> > see about doing something better than the current hard coded parameters,
> > but I am still working on fixing migration in other areas (not having
> > VMs crash when moving, because they observe important differences in the
> > hardware).
> 
> I think a good strategy would be to solicit three parameters from the
> user:
> - the precopy duration they're willing to tolerate
> - the downtime duration they're willing to tolerate
> - the bandwidth of the link between the hosts (we could try and estimate
>   it for them but I'd rather just make them run iperf)
> 
> Then, after applying this patch, alter the policy so that precopy simply
> runs for the duration that the user is willing to wait.  After that,
> using the bandwidth estimate, compute the approximate downtime required
> to transfer the final set of dirty-pages.  If this is less than what the
> user indicated is acceptable, proceed with the stop-and-copy - otherwise
> abort.
> 
> This still requires the user to figure out for themselves how long their
> workload can really wait, but hopefully they already had some idea
> before deciding to attempt live migration in the first place.
> 

I am not entirely sure what to make of this. I'm not convinced using
durations would cover all cases, but I can't come up with a
counterexample that doesn't sound contrived.

Given this series is already complex enough, I think we should set this
aside for another day.

How hard would it be to _not_ include all the knobs in this series?

Wei.


* Re: [PATCH RFC 00/20] Add postcopy live migration support
  2017-03-31  4:51   ` Joshua Otto
@ 2017-04-12 15:38     ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2017-04-12 15:38 UTC (permalink / raw)
  To: Joshua Otto
  Cc: wei.liu2, Andrew Cooper, ian.jackson, czylin, imhy.yang,
	xen-devel, hjarmstr

On Fri, Mar 31, 2017 at 12:51:46AM -0400, Joshua Otto wrote:
> > 
> > >   We've also noticed that,
> > > when performing a postcopy without any leading precopy iterations, the time
> > > required at the destination to 'evict' all of the outstanding pages is
> > > substantial - possibly because there is no batching mechanism by which pages can
> > > be evicted - so this area in particular might require further attention.
> > >
> > > We're really interested in any feedback you might have!
> > 
> > Do you have a design document for this?  The spec modifications and code
> > comments are great, but there is no substitute (as far as understanding
> > goes) for a description in terms of the algorithm and design choices.
> 
> As I replied to Wei, not yet, but we'd happily prepare one for v2.
> 

I've gone through most of the refactoring patches. They look fine to me. I
stopped before the actual implementations because I would like to read
your design doc first.

Wei.

> Thanks!
> 
> Josh


* Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl
  2017-04-12 15:16       ` Wei Liu
@ 2017-04-18 17:56         ` Ian Jackson
  0 siblings, 0 replies; 53+ messages in thread
From: Ian Jackson @ 2017-04-18 17:56 UTC (permalink / raw)
  To: Wei Liu
  Cc: Andrew Cooper, czylin, Joshua Otto, imhy.yang, xen-devel, hjarmstr

Wei Liu writes ("Re: [PATCH RFC 07/20] migration: defer precopy policy to libxl"):
> On Thu, Mar 30, 2017 at 01:19:41AM -0400, Joshua Otto wrote:
> > Is the memcpy() really significant here?  If this were a tight
> > loop, sure, but every invocation of the policy callback implies
> > both a 4MB network transfer _and_ a synchronous RPC.
> 
> Ian, How can Joshua pass a pointer across RPC boundary to avoid excessive
> copying?

You can't pass a pointer across the IPC boundary.  The two bits of
code run in different processes, with different address spaces.

Also, this precopy stats struct is tiny: two unsigned and a long.
Joshua is entirely right to ask whether the overhead is significant.
I think it isn't.
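
For reference, that struct ('iteration' and 'dirty_count' are the names
visible elsewhere in this thread; the middle field's name is assumed, so
treat the exact layout as illustrative):

struct precopy_stats
{
    unsigned int iteration;
    unsigned int total_written;
    long dirty_count;   /* -1 while the current round's count is unknown */
};

On LP64 that is 16 bytes per call - noise next to the 4MB batch transfer
that each invocation implies.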

If the performance of the proposed arrangements is inadequate, the
whole design needs reconsideration - the synchronous callback is more
of a concern, as Joshua suggests.  But I assume it's not, or Joshua
would have noticed!

Thanks,
Ian.


* Re: [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters
  2017-04-12 15:37       ` Wei Liu
@ 2017-04-27 22:51         ` Joshua Otto
  0 siblings, 0 replies; 53+ messages in thread
From: Joshua Otto @ 2017-04-27 22:51 UTC (permalink / raw)
  To: Wei Liu
  Cc: Andrew Cooper, ian.jackson, czylin, imhy.yang, xen-devel, hjarmstr

On Wed, Apr 12, 2017 at 04:37:16PM +0100, Wei Liu wrote:
> On Thu, Mar 30, 2017 at 02:03:29AM -0400, Joshua Otto wrote:
> > I guess the way I had imagined an administrator using them would be in a
> > non-production/test environment - if they could run workloads
> > representative of their production application in this environment, they
> > could experiment with different --precopy-iterations and
> > --precopy-threshold values (having just a high-level understanding of
> > what they control) and choose the ones that result in the best outcome
> > for later use in production.
> > 
> 
> Running in a test environment isn't always an option -- think about
> public cloud providers who don't have control over the VMs or the
> workload.

Sure, it definitely won't always be an option, but sometimes it might.
The question is whether or not the benefit in the cases where it can be
used justifies the added complexity to the interface.  I think so, but
that's just my intuition.

> > > 
> > > The plan, following migration v2, was always to come back to this and
> > > see about doing something better than the current hard coded parameters,
> > > but I am still working on fixing migration in other areas (not having
> > > VMs crash when moving, because they observe important differences in the
> > > hardware).
> > 
> > I think a good strategy would be to solicit three parameters from the
> > user:
> > - the precopy duration they're willing to tolerate
> > - the downtime duration they're willing to tolerate
> > - the bandwidth of the link between the hosts (we could try and estimate
> >   it for them but I'd rather just make them run iperf)
> > 
> > Then, after applying this patch, alter the policy so that precopy simply
> > runs for the duration that the user is willing to wait.  After that,
> > using the bandwidth estimate, compute the approximate downtime required
> > to transfer the final set of dirty-pages.  If this is less than what the
> > user indicated is acceptable, proceed with the stop-and-copy - otherwise
> > abort.
> > 
> > This still requires the user to figure out for themselves how long their
> > workload can really wait, but hopefully they already had some idea
> > before deciding to attempt live migration in the first place.
> > 
> 
> I am not entirely sure what to make of this. I'm not convinced using
> durations would cover all cases, but I can't come up with a
> counterexample that doesn't sound contrived.
> 
> Given this series is already complex enough, I think we should set this
> aside for another day.
> 
> How hard would it be to _not_ include all the knobs in this series?

Fair enough.  It wouldn't be much trouble, so I'll drop it for now.

As a general comment on the patch series for anyone following: I've just
finished with the last of my academic commitments and now have time to
pick this back up.  I'll follow up in the next few weeks with the
suggested revisions, the design document and the quantitative
performance evaluation.

Thanks!

Josh


Thread overview: 53+ messages
2017-03-27  9:06 [PATCH RFC 00/20] Add postcopy live migration support Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 01/20] tools: rename COLO 'postcopy' to 'aftercopy' Joshua Otto
2017-03-28 16:34   ` Wei Liu
2017-04-11  6:19     ` Zhang Chen
2017-03-27  9:06 ` [PATCH RFC 02/20] libxc/xc_sr: parameterise write_record() on fd Joshua Otto
2017-03-28 18:53   ` Andrew Cooper
2017-03-31 14:19   ` Wei Liu
2017-03-27  9:06 ` [PATCH RFC 03/20] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list() Joshua Otto
2017-03-28 18:56   ` Andrew Cooper
2017-03-31 14:19   ` Wei Liu
2017-03-27  9:06 ` [PATCH RFC 04/20] libxc/xc_sr_save.c: add WRITE_TRIVIAL_RECORD_FN() Joshua Otto
2017-03-28 19:03   ` Andrew Cooper
2017-03-30  4:28     ` Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 05/20] libxc/xc_sr: factor out filter_pages() Joshua Otto
2017-03-28 19:27   ` Andrew Cooper
2017-03-30  4:42     ` Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 06/20] libxc/xc_sr: factor helpers out of handle_page_data() Joshua Otto
2017-03-28 19:52   ` Andrew Cooper
2017-03-30  4:49     ` Joshua Otto
2017-04-12 15:16       ` Wei Liu
2017-03-27  9:06 ` [PATCH RFC 07/20] migration: defer precopy policy to libxl Joshua Otto
2017-03-29 18:54   ` Jennifer Herbert
2017-03-30  5:28     ` Joshua Otto
2017-03-29 20:18   ` Andrew Cooper
2017-03-30  5:19     ` Joshua Otto
2017-04-12 15:16       ` Wei Liu
2017-04-18 17:56         ` Ian Jackson
2017-03-27  9:06 ` [PATCH RFC 08/20] libxl/migration: add precopy tuning parameters Joshua Otto
2017-03-29 21:08   ` Andrew Cooper
2017-03-30  6:03     ` Joshua Otto
2017-04-12 15:37       ` Wei Liu
2017-04-27 22:51         ` Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 09/20] libxc/xc_sr_save: introduce save batch types Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 10/20] libxc/xc_sr_save.c: initialise rec.data before free() Joshua Otto
2017-03-28 19:59   ` Andrew Cooper
2017-03-29 17:47     ` Wei Liu
2017-03-27  9:06 ` [PATCH RFC 11/20] libxc/migration: correct hvm record ordering specification Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 12/20] libxc/migration: specify postcopy live migration Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 13/20] libxc/migration: add try_read_record() Joshua Otto
2017-04-12 15:16   ` Wei Liu
2017-03-27  9:06 ` [PATCH RFC 14/20] libxc/migration: implement the sender side of postcopy live migration Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 15/20] libxc/migration: implement the receiver " Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 16/20] libxl/libxl_stream_write.c: track callback chains with an explicit phase Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 17/20] libxl/libxl_stream_read.c: " Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 18/20] libxl/migration: implement the sender side of postcopy live migration Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 19/20] libxl/migration: implement the receiver " Joshua Otto
2017-03-27  9:06 ` [PATCH RFC 20/20] tools: expose postcopy live migration support in libxl and xl Joshua Otto
2017-03-28 14:41 ` [PATCH RFC 00/20] Add postcopy live migration support Wei Liu
2017-03-30  4:13   ` Joshua Otto
2017-03-31 14:19     ` Wei Liu
2017-03-29 22:50 ` Andrew Cooper
2017-03-31  4:51   ` Joshua Otto
2017-04-12 15:38     ` Wei Liu
