* [PATCH v10 00/15] Migration v2 (libxc)
@ 2015-04-23 11:48 Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 01/15] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
                   ` (15 more replies)
  0 siblings, 16 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel
  Cc: Wei Liu, Ian Campbell, Wen Congyang, Andrew Cooper, Ian Jackson,
	Tim Deegan, Ross Lagerwall, David Vrabel, Shriram Rajagopalan,
	Hongyang Yang, Vijay Kilari

Presented here is v10 of the Migration v2 series (libxc subset), which is able
to function when transparently inserted under an unmodified xl/libxl.

There are no major changes from v9.  A docs subseries has been completed in
parallel and removed from this series, and liberal quantities of spell
checking have occurred.  A summary of changes: (M = modified, A = acked)

 A tools/libxc: Implement writev_exact() in the same style as write_exact()
M  libxc/progress: Extend the progress interface
 A tools/libxc: Migration v2 framework
 A tools/libxc: C implementation of stream format
 A tools/libxc: generic common code
 A tools/libxc: x86 common code
 A tools/libxc: x86 PV common code
MA tools/libxc: x86 PV save code
M  tools/libxc: x86 PV restore code
 A tools/libxc: x86 HVM save code
MA tools/libxc: x86 HVM restore code
 A tools/libxc: common save code
 A tools/libxc: common restore code
 A docs: libxc migration stream specification
 A tools/libxc: Migration v2 compatibility for unmodified libxl


This series can be found on the 'saverestore2-v10' branch at
  http://xenbits.xen.org/git-http/people/andrewcoop/xen.git

To experiment, simply set XG_MIGRATION_V2 in xl's environment.  For migration,
the easiest way is to tweak libxl-save-helper into a shell script wrapper:

  root@vitruvias:/home# cat /usr/lib/xen/bin/libxl-save-helper
  #!/bin/bash
  export XG_MIGRATION_V2=x
  exec /usr/lib/xen/bin/libxl-save-helper.bin "$@"

which will ensure that XG_MIGRATION_V2 gets set in the environment for both
the source and destination of migration.

Please experiment!

~Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v10 01/15] tools/libxc: Implement writev_exact() in the same style as write_exact()
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 02/15] libxc/progress: Extend the progress interface Andrew Cooper
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

This implementation of writev_exact() will cope with an iovcnt greater than
IOV_MAX, because glibc will actually let this work anyway, and it is very
useful not to have to worry about this in the callers of writev_exact().  The
caller is still required to ensure that the sum of iov_len values doesn't
overflow a ssize_t.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
v8: Extra comments indicating why we loop when we do
---
 tools/libxc/xc_private.c |   85 ++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_private.h |   14 ++++++++
 2 files changed, 99 insertions(+)

diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 2eb44b6..83ead5e 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -867,6 +867,91 @@ int write_exact(int fd, const void *data, size_t size)
     return 0;
 }
 
+#if defined(__MINIOS__)
+/*
+ * MiniOS's libc doesn't know about writev(). Implement it as multiple write()s.
+ */
+int writev_exact(int fd, const struct iovec *iov, int iovcnt)
+{
+    int rc, i;
+
+    for ( i = 0; i < iovcnt; ++i )
+    {
+        rc = write_exact(fd, iov[i].iov_base, iov[i].iov_len);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+#else
+int writev_exact(int fd, const struct iovec *iov, int iovcnt)
+{
+    struct iovec *local_iov = NULL;
+    int rc = 0, iov_idx = 0, saved_errno = 0;
+    ssize_t len;
+
+    while ( iov_idx < iovcnt )
+    {
+        /*
+         * Skip over iov[] entries with 0 length.
+         *
+         * This is needed to cover the case where we took a partial write and
+         * all remaining vectors are of 0 length.  In such a case, the results
+         * from writev() are indistinguishable from EOF.
+         */
+        while ( iov[iov_idx].iov_len == 0 )
+            if ( ++iov_idx == iovcnt )
+                goto out;
+
+        len = writev(fd, &iov[iov_idx], min(iovcnt - iov_idx, IOV_MAX));
+        saved_errno = errno;
+
+        if ( (len == -1) && (errno == EINTR) )
+            continue;
+        if ( len <= 0 )
+        {
+            rc = -1;
+            goto out;
+        }
+
+        /* Check iov[] to see whether we had a partial or complete write. */
+        while ( (len > 0) && (iov_idx < iovcnt) )
+        {
+            if ( len >= iov[iov_idx].iov_len )
+                len -= iov[iov_idx++].iov_len;
+            else
+            {
+                /* Partial write of iov[iov_idx]. Copy iov so we can adjust
+                 * element iov_idx and resubmit the rest. */
+                if ( !local_iov )
+                {
+                    local_iov = malloc(iovcnt * sizeof(*iov));
+                    if ( !local_iov )
+                    {
+                        saved_errno = ENOMEM;
+                        goto out;
+                    }
+
+                    iov = memcpy(local_iov, iov, iovcnt * sizeof(*iov));
+                }
+
+                local_iov[iov_idx].iov_base += len;
+                local_iov[iov_idx].iov_len  -= len;
+                break;
+            }
+        }
+    }
+
+    saved_errno = 0;
+
+ out:
+    free(local_iov);
+    errno = saved_errno;
+    return rc;
+}
+#endif
+
 int xc_ffs8(uint8_t x)
 {
     int i;
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 9f55309..b45b079 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -42,6 +42,19 @@
 #define VALGRIND_MAKE_MEM_UNDEFINED(addr, len) /* addr, len */
 #endif
 
+#if defined(__MINIOS__)
+/*
+ * MiniOS's libc doesn't know about sys/uio.h or writev().
+ * Declare enough of sys/uio.h to compile.
+ */
+struct iovec {
+    void *iov_base;
+    size_t iov_len;
+};
+#else
+#include <sys/uio.h>
+#endif
+
 #define DECLARE_HYPERCALL privcmd_hypercall_t hypercall
 #define DECLARE_DOMCTL struct xen_domctl domctl
 #define DECLARE_SYSCTL struct xen_sysctl sysctl
@@ -395,6 +408,7 @@ int xc_add_mmu_update(xc_interface *xch, struct xc_mmu *mmu,
 /* Return 0 on success; -1 on error setting errno. */
 int read_exact(int fd, void *data, size_t size); /* EOF => -1, errno=0 */
 int write_exact(int fd, const void *data, size_t size);
+int writev_exact(int fd, const struct iovec *iov, int iovcnt);
 
 int xc_ffs8(uint8_t x);
 int xc_ffs16(uint16_t x);
-- 
1.7.10.4


* [PATCH v10 02/15] libxc/progress: Extend the progress interface
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 01/15] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-05-05 12:56   ` Ian Campbell
  2015-04-23 11:48 ` [PATCH v10 03/15] tools/libxc: Migration v2 framework Andrew Cooper
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Wei Liu

Progress information is logged via a different logger to regular libxc log
messages, and currently can only express a range.  However, not everything
which needs reporting as progress comes with a range.  Extend the interface to
allow reporting of a single statement.

The programming interface now looks like:
  xc_set_progress_prefix()
    set the prefix string to be used
  xc_report_progress_single()
    report a single action
  xc_report_progress_step()
    report $X of $Y

The new programming interface is implemented compatibly with the existing
caller interface (a single action is reported as "0 of 0").
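For illustration, the percentage calculation after this change can be
sketched as follows (progress_percent is a hypothetical name; the real logic
lives in xtl_progress()).  The key point is that a "single" report arrives as
done=0, total=0, and total==0 must not be used as a divisor:

```c
#include <assert.h>
#include <limits.h>

/*
 * Sketch of xtl_progress()'s percentage computation.  A total of 0
 * (a "single" report) yields 0% rather than dividing by zero; the
 * two-branch form avoids overflow when total is very large.
 */
static int progress_percent(unsigned long done, unsigned long total)
{
    if ( total == 0 )
        return 0;

    return (total < LONG_MAX / 100)
        ? (int)((done * 100) / total)
        : (int)(done / ((total + 99) / 100));
}
```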

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
v10: Expand commit message more, and add some asserts()
v8:  Expand commit message
---
 tools/libxc/xc_domain_restore.c |    3 ++-
 tools/libxc/xc_domain_save.c    |    3 ++-
 tools/libxc/xc_private.c        |   22 +++++++++++++++-------
 tools/libxc/xc_private.h        |    4 ++--
 tools/libxc/xtl_core.c          |    9 +++++----
 5 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 2ab9f46..e9c036f 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1622,7 +1622,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         goto out;
     }
 
-    xc_report_progress_start(xch, "Reloading memory pages", dinfo->p2m_size);
+    xc_set_progress_prefix(xch, "Reloading memory pages");
+    xc_report_progress_step(xch, 0, dinfo->p2m_size);
 
     /*
      * Now simply read each saved frame into its new machine frame.
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 59323b8..a71442e 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1131,7 +1131,8 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                  "Saving memory: iter %d (last sent %u skipped %u)",
                  iter, sent_this_iter, skip_this_iter);
 
-        xc_report_progress_start(xch, reportbuf, dinfo->p2m_size);
+        xc_set_progress_prefix(xch, reportbuf);
+        xc_report_progress_step(xch, 0, dinfo->p2m_size);
 
         iter++;
         sent_this_iter = 0;
diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 83ead5e..2ffebd9 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -388,18 +388,26 @@ void xc_osdep_log(xc_interface *xch, xentoollog_level level, int code, const cha
     va_end(args);
 }
 
-void xc_report_progress_start(xc_interface *xch, const char *doing,
-                              unsigned long total) {
+const char *xc_set_progress_prefix(xc_interface *xch, const char *doing)
+{
+    const char *old = xch->currently_progress_reporting;
+
     xch->currently_progress_reporting = doing;
-    xtl_progress(xch->error_handler, "xc", xch->currently_progress_reporting,
-                 0, total);
+    return old;
+}
+
+void xc_report_progress_single(xc_interface *xch, const char *doing)
+{
+    assert(doing);
+    xtl_progress(xch->error_handler, "xc", doing, 0, 0);
 }
 
 void xc_report_progress_step(xc_interface *xch,
-                             unsigned long done, unsigned long total) {
+                             unsigned long done, unsigned long total)
+{
     assert(xch->currently_progress_reporting);
-    xtl_progress(xch->error_handler, "xc", xch->currently_progress_reporting,
-                 done, total);
+    xtl_progress(xch->error_handler, "xc",
+                 xch->currently_progress_reporting, done, total);
 }
 
 int xc_get_pfn_type_batch(xc_interface *xch, uint32_t dom,
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index b45b079..247a408 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -133,8 +133,8 @@ void xc_report(xc_interface *xch, xentoollog_logger *lg, xentoollog_level,
                int code, const char *fmt, ...)
      __attribute__((format(printf,5,6)));
 
-void xc_report_progress_start(xc_interface *xch, const char *doing,
-                              unsigned long total);
+const char *xc_set_progress_prefix(xc_interface *xch, const char *doing);
+void xc_report_progress_single(xc_interface *xch, const char *doing);
 void xc_report_progress_step(xc_interface *xch,
                              unsigned long done, unsigned long total);
 
diff --git a/tools/libxc/xtl_core.c b/tools/libxc/xtl_core.c
index 326b97e..73add92 100644
--- a/tools/libxc/xtl_core.c
+++ b/tools/libxc/xtl_core.c
@@ -66,13 +66,14 @@ void xtl_log(struct xentoollog_logger *logger,
 void xtl_progress(struct xentoollog_logger *logger,
                   const char *context, const char *doing_what,
                   unsigned long done, unsigned long total) {
-    int percent;
+    int percent = 0;
 
     if (!logger->progress) return;
 
-    percent = (total < LONG_MAX/100)
-        ? (done * 100) / total
-        : done / ((total + 99) / 100);
+    if ( total )
+        percent = (total < LONG_MAX/100)
+            ? (done * 100) / total
+            : done / ((total + 99) / 100);
 
     logger->progress(logger, context, doing_what, percent, done, total);
 }
-- 
1.7.10.4


* [PATCH v10 03/15] tools/libxc: Migration v2 framework
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 01/15] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 02/15] libxc/progress: Extend the progress interface Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 04/15] tools/libxc: C implementation of stream format Andrew Cooper
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

For testing purposes, the environment variable "XG_MIGRATION_V2" allows the
two save/restore codepaths to coexist, with a runtime switch between them.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
v8: Adjust layout to compile with MiniOS
---
 tools/libxc/Makefile            |    2 ++
 tools/libxc/include/xenguest.h  |   13 +++++++++++++
 tools/libxc/xc_domain_restore.c |    8 ++++++++
 tools/libxc/xc_domain_save.c    |    6 ++++++
 tools/libxc/xc_sr_common.h      |   15 +++++++++++++++
 tools/libxc/xc_sr_restore.c     |   23 +++++++++++++++++++++++
 tools/libxc/xc_sr_save.c        |   19 +++++++++++++++++++
 7 files changed, 86 insertions(+)
 create mode 100644 tools/libxc/xc_sr_common.h
 create mode 100644 tools/libxc/xc_sr_restore.c
 create mode 100644 tools/libxc/xc_sr_save.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 8b609cf..dfd3445 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -54,6 +54,8 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c xc_suspend.c
 ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += xc_sr_restore.c
+GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
 else
 GUEST_SRCS-y += xc_nomigrate.c
diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 601b108..8e39075 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -90,6 +90,10 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                    uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
                    struct save_callbacks* callbacks, int hvm);
 
+/* Domain Save v2 */
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm);
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
@@ -126,6 +130,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int hvm, unsigned int pae, int superpages,
                       int checkpointed_stream,
                       struct restore_callbacks *callbacks);
+
+/* Domain Restore v2 */
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks);
 /**
  * xc_domain_restore writes a file to disk that contains the device
  * model saved state.
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index e9c036f..b8a8ef4 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1502,6 +1502,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     struct restore_ctx *ctx = &_ctx;
     struct domain_info_context *dinfo = &ctx->dinfo;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_restore2(
+            xch, io_fd, dom, store_evtchn, store_mfn,
+            store_domid, console_evtchn, console_mfn, console_domid,
+            hvm,  pae,  superpages, checkpointed_stream, callbacks);
+    }
+
     DPRINTF("%s: starting restore of new domid %u", __func__, dom);
 
     pagebuf_init(&pagebuf);
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index a71442e..301e770 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -894,6 +894,12 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     int completed = 0;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_save2(xch, io_fd, dom, max_iters,
+                               max_factor, flags, callbacks, hvm);
+    }
+
     DPRINTF("%s: starting save of domid %u", __func__, dom);
 
     if ( hvm && !callbacks->switch_qemu_logdirty )
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
new file mode 100644
index 0000000..ab7cb46
--- /dev/null
+++ b/tools/libxc/xc_sr_common.h
@@ -0,0 +1,15 @@
+#ifndef __COMMON__H
+#define __COMMON__H
+
+#include "xg_private.h"
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
new file mode 100644
index 0000000..cdebc69
--- /dev/null
+++ b/tools/libxc/xc_sr_restore.c
@@ -0,0 +1,23 @@
+#include "xc_sr_common.h"
+
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
new file mode 100644
index 0000000..3c34a6b
--- /dev/null
+++ b/tools/libxc/xc_sr_save.c
@@ -0,0 +1,19 @@
+#include "xc_sr_common.h"
+
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom,
+                    uint32_t max_iters, uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v10 04/15] tools/libxc: C implementation of stream format
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (2 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 03/15] tools/libxc: Migration v2 framework Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 05/15] tools/libxc: generic common code Andrew Cooper
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Provide the C structures matching the binary (wire) format of the new
stream format.  All header/record fields are naturally aligned and
explicit padding fields are used to ensure the correct layout (i.e.,
there is no need for any non-standard structure packing pragma or
attribute).

Provide some helper functions for converting types to strings for
diagnostic purposes.
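As a sketch of the natural-alignment claim, the image header (copied from the
patch below) compiles to the specified 24-byte wire layout on common ABIs
without any packing pragma, because every field sits on its natural boundary:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Image Header, reproduced from xc_sr_stream_format.h in this patch.
 * Explicit _res fields provide the padding, so no non-standard
 * packing attribute is needed to match the wire format.
 */
struct xc_sr_ihdr
{
    uint64_t marker;
    uint32_t id;
    uint32_t version;
    uint16_t options;
    uint16_t _res1;
    uint32_t _res2;
};
```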

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/Makefile              |    1 +
 tools/libxc/xc_sr_common.c        |   72 ++++++++++++++++++
 tools/libxc/xc_sr_common.h        |    8 ++
 tools/libxc/xc_sr_stream_format.h |  148 +++++++++++++++++++++++++++++++++++++
 4 files changed, 229 insertions(+)
 create mode 100644 tools/libxc/xc_sr_common.c
 create mode 100644 tools/libxc/xc_sr_stream_format.h

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index dfd3445..803db59 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -54,6 +54,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c xc_suspend.c
 ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += xc_sr_common.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
new file mode 100644
index 0000000..294a626
--- /dev/null
+++ b/tools/libxc/xc_sr_common.c
@@ -0,0 +1,72 @@
+#include "xc_sr_common.h"
+
+static const char *dhdr_types[] =
+{
+    [DHDR_TYPE_X86_PV]  = "x86 PV",
+    [DHDR_TYPE_X86_HVM] = "x86 HVM",
+    [DHDR_TYPE_X86_PVH] = "x86 PVH",
+    [DHDR_TYPE_ARM]     = "ARM",
+};
+
+const char *dhdr_type_to_str(uint32_t type)
+{
+    if ( type < ARRAY_SIZE(dhdr_types) && dhdr_types[type] )
+        return dhdr_types[type];
+
+    return "Reserved";
+}
+
+static const char *mandatory_rec_types[] =
+{
+    [REC_TYPE_END]                  = "End",
+    [REC_TYPE_PAGE_DATA]            = "Page data",
+    [REC_TYPE_X86_PV_INFO]          = "x86 PV info",
+    [REC_TYPE_X86_PV_P2M_FRAMES]    = "x86 PV P2M frames",
+    [REC_TYPE_X86_PV_VCPU_BASIC]    = "x86 PV vcpu basic",
+    [REC_TYPE_X86_PV_VCPU_EXTENDED] = "x86 PV vcpu extended",
+    [REC_TYPE_X86_PV_VCPU_XSAVE]    = "x86 PV vcpu xsave",
+    [REC_TYPE_SHARED_INFO]          = "Shared info",
+    [REC_TYPE_TSC_INFO]             = "TSC info",
+    [REC_TYPE_HVM_CONTEXT]          = "HVM context",
+    [REC_TYPE_HVM_PARAMS]           = "HVM params",
+    [REC_TYPE_TOOLSTACK]            = "Toolstack",
+    [REC_TYPE_X86_PV_VCPU_MSRS]     = "x86 PV vcpu msrs",
+    [REC_TYPE_VERIFY]               = "Verify",
+};
+
+const char *rec_type_to_str(uint32_t type)
+{
+    if ( !(type & REC_TYPE_OPTIONAL) )
+    {
+        if ( (type < ARRAY_SIZE(mandatory_rec_types)) &&
+             (mandatory_rec_types[type]) )
+            return mandatory_rec_types[type];
+    }
+
+    return "Reserved";
+}
+
+static void __attribute__((unused)) build_assertions(void)
+{
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_dhdr) != 16);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rhdr) != 8);
+
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_page_data_header)  != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_info)       != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_p2m_frames) != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_vcpu_hdr)   != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_tsc_info)          != 24);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_hvm_params_entry)  != 16);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_hvm_params)        != 8);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index ab7cb46..b65e52b 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -3,6 +3,14 @@
 
 #include "xg_private.h"
 
+#include "xc_sr_stream_format.h"
+
+/* String representation of Domain Header types. */
+const char *dhdr_type_to_str(uint32_t type);
+
+/* String representation of Record types. */
+const char *rec_type_to_str(uint32_t type);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
new file mode 100644
index 0000000..d116ca6
--- /dev/null
+++ b/tools/libxc/xc_sr_stream_format.h
@@ -0,0 +1,148 @@
+#ifndef __STREAM_FORMAT__H
+#define __STREAM_FORMAT__H
+
+/*
+ * C structures for the Migration v2 stream format.
+ * See docs/specs/libxc-migration-stream.pandoc
+ */
+
+#include <inttypes.h>
+
+/*
+ * Image Header
+ */
+struct xc_sr_ihdr
+{
+    uint64_t marker;
+    uint32_t id;
+    uint32_t version;
+    uint16_t options;
+    uint16_t _res1;
+    uint32_t _res2;
+};
+
+#define IHDR_MARKER  0xffffffffffffffffULL
+#define IHDR_ID      0x58454E46U
+#define IHDR_VERSION 2
+
+#define _IHDR_OPT_ENDIAN 0
+#define IHDR_OPT_LITTLE_ENDIAN (0 << _IHDR_OPT_ENDIAN)
+#define IHDR_OPT_BIG_ENDIAN    (1 << _IHDR_OPT_ENDIAN)
+
+/*
+ * Domain Header
+ */
+struct xc_sr_dhdr
+{
+    uint32_t type;
+    uint16_t page_shift;
+    uint16_t _res1;
+    uint32_t xen_major;
+    uint32_t xen_minor;
+};
+
+#define DHDR_TYPE_X86_PV  0x00000001U
+#define DHDR_TYPE_X86_HVM 0x00000002U
+#define DHDR_TYPE_X86_PVH 0x00000003U
+#define DHDR_TYPE_ARM     0x00000004U
+
+/*
+ * Record Header
+ */
+struct xc_sr_rhdr
+{
+    uint32_t type;
+    uint32_t length;
+};
+
+/* All records must be aligned up to an 8 octet boundary */
+#define REC_ALIGN_ORDER               (3U)
+/* Somewhat arbitrary - 8MB */
+#define REC_LENGTH_MAX                (8U << 20)
+
+#define REC_TYPE_END                  0x00000000U
+#define REC_TYPE_PAGE_DATA            0x00000001U
+#define REC_TYPE_X86_PV_INFO          0x00000002U
+#define REC_TYPE_X86_PV_P2M_FRAMES    0x00000003U
+#define REC_TYPE_X86_PV_VCPU_BASIC    0x00000004U
+#define REC_TYPE_X86_PV_VCPU_EXTENDED 0x00000005U
+#define REC_TYPE_X86_PV_VCPU_XSAVE    0x00000006U
+#define REC_TYPE_SHARED_INFO          0x00000007U
+#define REC_TYPE_TSC_INFO             0x00000008U
+#define REC_TYPE_HVM_CONTEXT          0x00000009U
+#define REC_TYPE_HVM_PARAMS           0x0000000aU
+#define REC_TYPE_TOOLSTACK            0x0000000bU
+#define REC_TYPE_X86_PV_VCPU_MSRS     0x0000000cU
+#define REC_TYPE_VERIFY               0x0000000dU
+
+#define REC_TYPE_OPTIONAL             0x80000000U
+
+/* PAGE_DATA */
+struct xc_sr_rec_page_data_header
+{
+    uint32_t count;
+    uint32_t _res1;
+    uint64_t pfn[0];
+};
+
+#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
+#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
+
+/* X86_PV_INFO */
+struct xc_sr_rec_x86_pv_info
+{
+    uint8_t guest_width;
+    uint8_t pt_levels;
+    uint8_t _res[6];
+};
+
+/* X86_PV_P2M_FRAMES */
+struct xc_sr_rec_x86_pv_p2m_frames
+{
+    uint32_t start_pfn;
+    uint32_t end_pfn;
+    uint64_t p2m_pfns[0];
+};
+
+/* X86_PV_VCPU_{BASIC,EXTENDED,XSAVE,MSRS} */
+struct xc_sr_rec_x86_pv_vcpu_hdr
+{
+    uint32_t vcpu_id;
+    uint32_t _res1;
+    uint8_t context[0];
+};
+
+/* TSC_INFO */
+struct xc_sr_rec_tsc_info
+{
+    uint32_t mode;
+    uint32_t khz;
+    uint64_t nsec;
+    uint32_t incarnation;
+    uint32_t _res1;
+};
+
+/* HVM_PARAMS */
+struct xc_sr_rec_hvm_params_entry
+{
+    uint64_t index;
+    uint64_t value;
+};
+
+struct xc_sr_rec_hvm_params
+{
+    uint32_t count;
+    uint32_t _res1;
+    struct xc_sr_rec_hvm_params_entry param[0];
+};
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v10 05/15] tools/libxc: generic common code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (3 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 04/15] tools/libxc: C implementation of stream format Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 06/15] tools/libxc: x86 " Andrew Cooper
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Add the context structure used to keep state during the save/restore
process.

Define the set of architecture- or domain-type-specific operations as a
set of callbacks (save_ops and restore_ops).

Add common functions for writing records.
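For illustration, the 8-octet record padding those write helpers rely on can
be sketched as below (rec_roundup is an invented name; the patch itself uses
the ROUNDUP() macro with REC_ALIGN_ORDER):

```c
#include <assert.h>
#include <stddef.h>

/* All records are padded up to an 8-octet (1 << 3) boundary. */
#define REC_ALIGN_ORDER 3U

/*
 * Round a record body length up to the next aligned boundary.
 * The writer emits (rec_roundup(len) - len) zero bytes after the body.
 */
static size_t rec_roundup(size_t len)
{
    size_t mask = (1u << REC_ALIGN_ORDER) - 1;

    return (len + mask) & ~mask;
}
```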

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/xc_sr_common.c |   41 ++++++
 tools/libxc/xc_sr_common.h |  309 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 350 insertions(+)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 294a626..59e0c5d 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -1,3 +1,5 @@
+#include <assert.h>
+
 #include "xc_sr_common.h"
 
 static const char *dhdr_types[] =
@@ -46,6 +48,45 @@ const char *rec_type_to_str(uint32_t type)
     return "Reserved";
 }
 
+int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                       void *buf, size_t sz)
+{
+    static const char zeroes[(1u << REC_ALIGN_ORDER) - 1] = { 0 };
+
+    xc_interface *xch = ctx->xch;
+    typeof(rec->length) combined_length = rec->length + sz;
+    size_t record_length = ROUNDUP(combined_length, REC_ALIGN_ORDER);
+    struct iovec parts[] =
+    {
+        { &rec->type,       sizeof(rec->type) },
+        { &combined_length, sizeof(combined_length) },
+        { rec->data,        rec->length },
+        { buf,              sz },
+        { (void*)zeroes,    record_length - combined_length },
+    };
+
+    if ( record_length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rec->type,
+              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    if ( rec->length )
+        assert(rec->data);
+    if ( sz )
+        assert(buf);
+
+    if ( writev_exact(ctx->fd, parts, ARRAY_SIZE(parts)) )
+        goto err;
+
+    return 0;
+
+ err:
+    PERROR("Unable to write record to stream");
+    return -1;
+}
+
 static void __attribute__((unused)) build_assertions(void)
 {
     XC_BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index b65e52b..b71b532 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -1,7 +1,12 @@
 #ifndef __COMMON__H
 #define __COMMON__H
 
+#include <stdbool.h>
+
 #include "xg_private.h"
+#include "xg_save_restore.h"
+#include "xc_dom.h"
+#include "xc_bitops.h"
 
 #include "xc_sr_stream_format.h"
 
@@ -11,6 +16,310 @@
 /* String representation of Record types. */
 const char *rec_type_to_str(uint32_t type);
 
+struct xc_sr_context;
+struct xc_sr_record;
+
+/**
+ * Save operations.  To be implemented for each type of guest, for use by the
+ * common save algorithm.
+ *
+ * Every function must be implemented, even if only with a no-op stub.
+ */
+struct xc_sr_save_ops
+{
+    /* Convert a PFN to GFN.  May return ~0UL for an invalid mapping. */
+    xen_pfn_t (*pfn_to_gfn)(const struct xc_sr_context *ctx, xen_pfn_t pfn);
+
+    /**
+     * Optionally transform the contents of a page from being specific to the
+     * sending environment, to being generic for the stream.
+     *
+     * The page of data at the end of 'page' may be a read-only mapping of a
+     * running guest; it must not be modified.  If no transformation is
+     * required, the callee should leave '*page' untouched.
+     *
+     * If a transformation is required, the callee should allocate themselves
+     * a local page using malloc() and return it via '*page'.
+     *
+     * The caller shall free() '*page' in all cases.  In the case that the
+     * callee encounters an error, it should *NOT* free() the memory it
+     * allocated for '*page'.
+     *
+     * It is valid to fail with EAGAIN if the transformation cannot be
+     * completed at this point.  The page shall be retried later.
+     *
+     * @returns 0 for success, -1 for failure, with errno appropriately set.
+     */
+    int (*normalise_page)(struct xc_sr_context *ctx, xen_pfn_t type,
+                          void **page);
+
+    /**
+     * Set up local environment to save a domain.  This is called before
+     * any records are written to the stream.  (Typically querying running
+     * domain state, setting up mappings etc.)
+     */
+    int (*setup)(struct xc_sr_context *ctx);
+
+    /**
+     * Write records which need to be at the start of the stream.  This is
+     * called after the Image and Domain headers are written.  (Any records
+     * which need to be ahead of the memory.)
+     */
+    int (*start_of_stream)(struct xc_sr_context *ctx);
+
+    /**
+     * Write records which need to be at the end of the stream, following the
+     * complete memory contents.  The caller shall handle writing the END
+     * record into the stream.  (Any records which need to be after the memory
+     * is complete.)
+     */
+    int (*end_of_stream)(struct xc_sr_context *ctx);
+
+    /**
+     * Clean up the local environment.  Will be called exactly once, either
+     * after a successful save, or upon encountering an error.
+     */
+    int (*cleanup)(struct xc_sr_context *ctx);
+};
+
+
+/**
+ * Restore operations.  To be implemented for each type of guest, for use by
+ * the common restore algorithm.
+ *
+ * Every function must be implemented, even if only with a no-op stub.
+ */
+struct xc_sr_restore_ops
+{
+    /* Convert a PFN to GFN.  May return ~0UL for an invalid mapping. */
+    xen_pfn_t (*pfn_to_gfn)(const struct xc_sr_context *ctx, xen_pfn_t pfn);
+
+    /* Check to see whether a PFN is valid. */
+    bool (*pfn_is_valid)(const struct xc_sr_context *ctx, xen_pfn_t pfn);
+
+    /* Set the GFN of a PFN. */
+    void (*set_gfn)(struct xc_sr_context *ctx, xen_pfn_t pfn, xen_pfn_t gfn);
+
+    /* Set the type of a PFN. */
+    void (*set_page_type)(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                          xen_pfn_t type);
+
+    /**
+     * Optionally transform the contents of a page from being generic in the
+     * stream, to being specific to the restoring environment.
+     *
+     * 'page' is expected to be modified in-place if a transformation is
+     * required.
+     *
+     * @returns 0 for success, -1 for failure, with errno appropriately set.
+     */
+    int (*localise_page)(struct xc_sr_context *ctx, uint32_t type, void *page);
+
+    /**
+     * Set up local environment to restore a domain.  This is called before
+     * any records are read from the stream.
+     */
+    int (*setup)(struct xc_sr_context *ctx);
+
+    /**
+     * Process an individual record from the stream.  The caller shall take
+     * care of processing common records (e.g. END, PAGE_DATA).
+     *
+     * @return 0 for success, -1 for failure, or the sentinel value
+     * RECORD_NOT_PROCESSED.
+     */
+#define RECORD_NOT_PROCESSED 1
+    int (*process_record)(struct xc_sr_context *ctx, struct xc_sr_record *rec);
+
+    /**
+     * Perform any actions required after the stream has been finished. Called
+     * after the END record has been received.
+     */
+    int (*stream_complete)(struct xc_sr_context *ctx);
+
+    /**
+     * Clean up the local environment.  Will be called exactly once, either
+     * after a successful restore, or upon encountering an error.
+     */
+    int (*cleanup)(struct xc_sr_context *ctx);
+};
+
+/* x86 PV per-vcpu storage structure for blobs heading Xen-wards. */
+struct xc_sr_x86_pv_restore_vcpu
+{
+    void *basic, *extd, *xsave, *msr;
+    size_t basicsz, extdsz, xsavesz, msrsz;
+};
+
+struct xc_sr_context
+{
+    xc_interface *xch;
+    uint32_t domid;
+    int fd;
+
+    xc_dominfo_t dominfo;
+
+    union /* Common save or restore data. */
+    {
+        struct /* Save data. */
+        {
+            struct xc_sr_save_ops ops;
+            struct save_callbacks *callbacks;
+
+            /* Live migrate vs non live suspend. */
+            bool live;
+
+            /* Further debugging information in the stream. */
+            bool debug;
+
+            /* Parameters for tweaking live migration. */
+            unsigned max_iterations;
+            unsigned dirty_threshold;
+
+            unsigned long p2m_size;
+
+            xen_pfn_t *batch_pfns;
+            unsigned nr_batch_pfns;
+            unsigned long *deferred_pages;
+            unsigned long nr_deferred_pages;
+        } save;
+
+        struct /* Restore data. */
+        {
+            struct xc_sr_restore_ops ops;
+            struct restore_callbacks *callbacks;
+
+            /* From Image Header. */
+            uint32_t format_version;
+
+            /* From Domain Header. */
+            uint32_t guest_type;
+            uint32_t guest_page_size;
+
+            /*
+             * Xenstore and Console parameters.
+             * INPUT:  evtchn & domid
+             * OUTPUT: gfn
+             */
+            xen_pfn_t    xenstore_gfn,    console_gfn;
+            unsigned int xenstore_evtchn, console_evtchn;
+            domid_t      xenstore_domid,  console_domid;
+
+            /* Bitmap of currently populated PFNs during restore. */
+            unsigned long *populated_pfns;
+            xen_pfn_t max_populated_pfn;
+
+            /* Sender has invoked verify mode on the stream. */
+            bool verify;
+        } restore;
+    };
+
+    union /* Guest-arch specific data. */
+    {
+        struct /* x86 PV guest. */
+        {
+            /* 4 or 8; 32 or 64 bit domain */
+            unsigned int width;
+            /* 3 or 4 pagetable levels */
+            unsigned int levels;
+
+            /* Maximum Xen frame */
+            xen_pfn_t max_mfn;
+            /* Read-only machine to phys map */
+            xen_pfn_t *m2p;
+            /* first mfn of the compat m2p (Only needed for 32bit PV guests) */
+            xen_pfn_t compat_m2p_mfn0;
+            /* Number of m2p frames mapped */
+            unsigned long nr_m2p_frames;
+
+            /* Maximum guest frame */
+            xen_pfn_t max_pfn;
+
+            /* Number of frames making up the p2m */
+            unsigned int p2m_frames;
+            /* Guest's phys to machine map.  Mapped read-only (save) or
+             * allocated locally (restore).  Uses guest unsigned longs. */
+            void *p2m;
+            /* The guest pfns containing the p2m leaves */
+            xen_pfn_t *p2m_pfns;
+
+            /* Read-only mapping of guests shared info page */
+            shared_info_any_t *shinfo;
+
+            union
+            {
+                struct
+                {
+                    /* State machine for the order of received records. */
+                    bool seen_pv_info;
+
+                    /* Types for each page (bounded by max_pfn). */
+                    uint32_t *pfn_types;
+
+                    /* Vcpu context blobs. */
+                    struct xc_sr_x86_pv_restore_vcpu *vcpus;
+                    unsigned nr_vcpus;
+                } restore;
+            };
+        } x86_pv;
+
+        struct /* x86 HVM guest. */
+        {
+            union
+            {
+                struct
+                {
+                    /* Whether qemu enabled logdirty mode, and we should
+                     * disable on cleanup. */
+                    bool qemu_enabled_logdirty;
+                } save;
+
+                struct
+                {
+                    /* HVM context blob. */
+                    void *context;
+                    size_t contextsz;
+                } restore;
+            };
+        } x86_hvm;
+    };
+};
+
+
+struct xc_sr_record
+{
+    uint32_t type;
+    uint32_t length;
+    void *data;
+};
+
+/*
+ * Writes a split record to the stream, applying correct padding where
+ * appropriate.  It is common when sending records containing blobs from Xen
+ * that the header and blob data are separate.  This function accepts a second
+ * buffer and length, and will merge it with the main record when sending.
+ *
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non-zero on failure.
+ */
+int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                       void *buf, size_t sz);
+
+/*
+ * Writes a record to the stream, applying correct padding where appropriate.
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non-zero on failure.
+ */
+static inline int write_record(struct xc_sr_context *ctx,
+                               struct xc_sr_record *rec)
+{
+    return write_split_record(ctx, rec, NULL, 0);
+}
+
 #endif
 /*
  * Local variables:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 06/15] tools/libxc: x86 common code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (4 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 05/15] tools/libxc: generic common code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 07/15] tools/libxc: x86 PV " Andrew Cooper
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Save/restore records common to all x86 domain types (HVM, PV).

This is only the TSC_INFO record.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/Makefile           |    1 +
 tools/libxc/xc_sr_common_x86.c |   54 ++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_common_x86.h |   26 +++++++++++++++++++
 3 files changed, 81 insertions(+)
 create mode 100644 tools/libxc/xc_sr_common_x86.c
 create mode 100644 tools/libxc/xc_sr_common_x86.h

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 803db59..5e12e9a 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -55,6 +55,7 @@ GUEST_SRCS-y += xg_private.c xc_suspend.c
 ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
 GUEST_SRCS-y += xc_sr_common.c
+GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
diff --git a/tools/libxc/xc_sr_common_x86.c b/tools/libxc/xc_sr_common_x86.c
new file mode 100644
index 0000000..98f1cef
--- /dev/null
+++ b/tools/libxc/xc_sr_common_x86.c
@@ -0,0 +1,54 @@
+#include "xc_sr_common_x86.h"
+
+int write_tsc_info(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_tsc_info tsc = { 0 };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_TSC_INFO,
+        .length = sizeof(tsc),
+        .data = &tsc
+    };
+
+    if ( xc_domain_get_tsc_info(xch, ctx->domid, &tsc.mode,
+                                &tsc.nsec, &tsc.khz, &tsc.incarnation) < 0 )
+    {
+        PERROR("Unable to obtain TSC information");
+        return -1;
+    }
+
+    return write_record(ctx, &rec);
+}
+
+int handle_tsc_info(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_tsc_info *tsc = rec->data;
+
+    if ( rec->length != sizeof(*tsc) )
+    {
+        ERROR("TSC_INFO record wrong size: length %u, expected %zu",
+              rec->length, sizeof(*tsc));
+        return -1;
+    }
+
+    if ( xc_domain_set_tsc_info(xch, ctx->domid, tsc->mode,
+                                tsc->nsec, tsc->khz, tsc->incarnation) )
+    {
+        PERROR("Unable to set TSC information");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_sr_common_x86.h b/tools/libxc/xc_sr_common_x86.h
new file mode 100644
index 0000000..1d42da9
--- /dev/null
+++ b/tools/libxc/xc_sr_common_x86.h
@@ -0,0 +1,26 @@
+#ifndef __COMMON_X86__H
+#define __COMMON_X86__H
+
+#include "xc_sr_common.h"
+
+/*
+ * Obtains a domain's TSC information from Xen and writes a TSC_INFO record
+ * into the stream.
+ */
+int write_tsc_info(struct xc_sr_context *ctx);
+
+/*
+ * Parses a TSC_INFO record and applies the result to the domain.
+ */
+int handle_tsc_info(struct xc_sr_context *ctx, struct xc_sr_record *rec);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v10 07/15] tools/libxc: x86 PV common code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (5 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 06/15] tools/libxc: x86 " Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 08/15] tools/libxc: x86 PV save code Andrew Cooper
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Add functions common to save and restore of x86 PV guests.  This includes
functions for dealing with the P2M, the M2P, and the VCPU context.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/Makefile              |    1 +
 tools/libxc/xc_sr_common_x86_pv.c |  210 +++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_common_x86_pv.h |  102 ++++++++++++++++++
 3 files changed, 313 insertions(+)
 create mode 100644 tools/libxc/xc_sr_common_x86_pv.c
 create mode 100644 tools/libxc/xc_sr_common_x86_pv.h

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 5e12e9a..c2e62a8 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -56,6 +56,7 @@ ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
 GUEST_SRCS-y += xc_sr_common.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86.c
+GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86_pv.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
diff --git a/tools/libxc/xc_sr_common_x86_pv.c b/tools/libxc/xc_sr_common_x86_pv.c
new file mode 100644
index 0000000..eb68c07
--- /dev/null
+++ b/tools/libxc/xc_sr_common_x86_pv.c
@@ -0,0 +1,210 @@
+#include <assert.h>
+
+#include "xc_sr_common_x86_pv.h"
+
+xen_pfn_t mfn_to_pfn(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    assert(mfn <= ctx->x86_pv.max_mfn);
+    return ctx->x86_pv.m2p[mfn];
+}
+
+bool mfn_in_pseudophysmap(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    return ( (mfn <= ctx->x86_pv.max_mfn) &&
+             (mfn_to_pfn(ctx, mfn) <= ctx->x86_pv.max_pfn) &&
+             (xc_pfn_to_mfn(mfn_to_pfn(ctx, mfn), ctx->x86_pv.p2m,
+                            ctx->x86_pv.width) == mfn) );
+}
+
+void dump_bad_pseudophysmap_entry(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn = ~0UL;
+
+    ERROR("mfn %#lx, max %#lx", mfn, ctx->x86_pv.max_mfn);
+
+    if ( (mfn != ~0UL) && (mfn <= ctx->x86_pv.max_mfn) )
+    {
+        pfn = ctx->x86_pv.m2p[mfn];
+        ERROR("  m2p[%#lx] = %#lx, max_pfn %#lx",
+              mfn, pfn, ctx->x86_pv.max_pfn);
+    }
+
+    if ( (pfn != ~0UL) && (pfn <= ctx->x86_pv.max_pfn) )
+        ERROR("  p2m[%#lx] = %#lx",
+              pfn, xc_pfn_to_mfn(pfn, ctx->x86_pv.p2m, ctx->x86_pv.width));
+}
+
+xen_pfn_t cr3_to_mfn(struct xc_sr_context *ctx, uint64_t cr3)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return cr3 >> 12;
+    else
+    {
+        /* 32bit guests can't represent mfns wider than 32 bits */
+        if ( cr3 & 0xffffffff00000000UL )
+            return ~0UL;
+        else
+            return (uint32_t)((cr3 >> 12) | (cr3 << 20));
+    }
+}
+
+uint64_t mfn_to_cr3(struct xc_sr_context *ctx, xen_pfn_t _mfn)
+{
+    uint64_t mfn = _mfn;
+
+    if ( ctx->x86_pv.width == 8 )
+        return mfn << 12;
+    else
+    {
+        /* 32bit guests can't represent mfns wider than 32 bits */
+        if ( mfn & 0xffffffff00000000UL )
+            return ~0UL;
+        else
+            return (uint32_t)((mfn << 12) | (mfn >> 20));
+    }
+}
+
+int x86_pv_domain_info(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned int guest_width, guest_levels, fpp;
+    xen_pfn_t max_pfn;
+
+    /* Get the domain width */
+    if ( xc_domain_get_guest_width(xch, ctx->domid, &guest_width) )
+    {
+        PERROR("Unable to determine dom%d's width", ctx->domid);
+        return -1;
+    }
+
+    if ( guest_width == 4 )
+        guest_levels = 3;
+    else if ( guest_width == 8 )
+        guest_levels = 4;
+    else
+    {
+        ERROR("Invalid guest width %d.  Expected 32 or 64", guest_width * 8);
+        return -1;
+    }
+    ctx->x86_pv.width = guest_width;
+    ctx->x86_pv.levels = guest_levels;
+    fpp = PAGE_SIZE / ctx->x86_pv.width;
+
+    DPRINTF("%d bits, %d levels", guest_width * 8, guest_levels);
+
+    /* Get the domain's size */
+    if ( xc_domain_maximum_gpfn(xch, ctx->domid, &max_pfn) < 0 )
+    {
+        PERROR("Unable to obtain guest's max pfn");
+        return -1;
+    }
+
+    if ( max_pfn > 0 )
+    {
+        ctx->x86_pv.max_pfn = max_pfn;
+        ctx->x86_pv.p2m_frames = (ctx->x86_pv.max_pfn + fpp) / fpp;
+
+        DPRINTF("max_pfn %#lx, p2m_frames %d", max_pfn, ctx->x86_pv.p2m_frames);
+    }
+
+    return 0;
+}
+
+int x86_pv_map_m2p(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t m2p_chunks, m2p_size, max_page;
+    privcmd_mmap_entry_t *entries = NULL;
+    xen_pfn_t *extents_start = NULL;
+    int rc = -1, i;
+
+    if ( xc_maximum_ram_page(xch, &max_page) < 0 )
+    {
+        PERROR("Failed to get maximum ram page");
+        goto err;
+    }
+
+    ctx->x86_pv.max_mfn = max_page;
+    m2p_size   = M2P_SIZE(ctx->x86_pv.max_mfn);
+    m2p_chunks = M2P_CHUNKS(ctx->x86_pv.max_mfn);
+
+    extents_start = malloc(m2p_chunks * sizeof(xen_pfn_t));
+    if ( !extents_start )
+    {
+        ERROR("Unable to allocate %lu bytes for m2p mfns",
+              m2p_chunks * sizeof(xen_pfn_t));
+        goto err;
+    }
+
+    if ( xc_machphys_mfn_list(xch, m2p_chunks, extents_start) )
+    {
+        PERROR("Failed to get m2p mfn list");
+        goto err;
+    }
+
+    entries = malloc(m2p_chunks * sizeof(privcmd_mmap_entry_t));
+    if ( !entries )
+    {
+        ERROR("Unable to allocate %lu bytes for m2p mapping mfns",
+              m2p_chunks * sizeof(privcmd_mmap_entry_t));
+        goto err;
+    }
+
+    for ( i = 0; i < m2p_chunks; ++i )
+        entries[i].mfn = extents_start[i];
+
+    ctx->x86_pv.m2p = xc_map_foreign_ranges(
+        xch, DOMID_XEN, m2p_size, PROT_READ,
+        M2P_CHUNK_SIZE, entries, m2p_chunks);
+
+    if ( !ctx->x86_pv.m2p )
+    {
+        PERROR("Failed to mmap() m2p ranges");
+        goto err;
+    }
+
+    ctx->x86_pv.nr_m2p_frames = (M2P_CHUNK_SIZE >> PAGE_SHIFT) * m2p_chunks;
+
+#ifdef __i386__
+    /* 32 bit toolstacks automatically get the compat m2p */
+    ctx->x86_pv.compat_m2p_mfn0 = entries[0].mfn;
+#else
+    /* 64 bit toolstacks need to ask Xen specially for it */
+    {
+        struct xen_machphys_mfn_list xmml = {
+            .max_extents = 1,
+            .extent_start = { &ctx->x86_pv.compat_m2p_mfn0 }
+        };
+
+        rc = do_memory_op(xch, XENMEM_machphys_compat_mfn_list,
+                          &xmml, sizeof(xmml));
+        if ( rc || xmml.nr_extents != 1 )
+        {
+            PERROR("Failed to get compat mfn list from Xen");
+            rc = -1;
+            goto err;
+        }
+    }
+#endif
+
+    /* All Done */
+    rc = 0;
+    DPRINTF("max_mfn %#lx", ctx->x86_pv.max_mfn);
+
+err:
+    free(entries);
+    free(extents_start);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_sr_common_x86_pv.h b/tools/libxc/xc_sr_common_x86_pv.h
new file mode 100644
index 0000000..3234944
--- /dev/null
+++ b/tools/libxc/xc_sr_common_x86_pv.h
@@ -0,0 +1,102 @@
+#ifndef __COMMON_X86_PV_H
+#define __COMMON_X86_PV_H
+
+#include "xc_sr_common_x86.h"
+
+/*
+ * Convert an mfn to a pfn, given Xen's m2p table.
+ *
+ * Caller must ensure that the requested mfn is in range.
+ */
+xen_pfn_t mfn_to_pfn(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/*
+ * Query whether a particular mfn is valid in the physmap of a guest.
+ */
+bool mfn_in_pseudophysmap(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/*
+ * Debug a particular mfn by walking the p2m and m2p.
+ */
+void dump_bad_pseudophysmap_entry(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/*
+ * Convert a PV cr3 field to an mfn.
+ *
+ * Adjusts for Xen's extended-cr3 format to pack a 44bit physical address into
+ * a 32bit architectural cr3.
+ */
+xen_pfn_t cr3_to_mfn(struct xc_sr_context *ctx, uint64_t cr3);
+
+/*
+ * Convert an mfn to a PV cr3 field.
+ *
+ * Adjusts for Xen's extended-cr3 format to pack a 44bit physical address into
+ * a 32bit architectural cr3.
+ */
+uint64_t mfn_to_cr3(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/* Bits 12 through 51 of a PTE point at the frame */
+#define PTE_FRAME_MASK 0x000ffffffffff000ULL
+
+/*
+ * Extract an mfn from a Pagetable Entry.  May return INVALID_MFN if the pte
+ * would overflow a 32bit xen_pfn_t.
+ */
+static inline xen_pfn_t pte_to_frame(uint64_t pte)
+{
+    uint64_t frame = (pte & PTE_FRAME_MASK) >> PAGE_SHIFT;
+
+#ifdef __i386__
+    if ( frame >= INVALID_MFN )
+        return INVALID_MFN;
+#endif
+
+    return frame;
+}
+
+/*
+ * Change the frame in a Pagetable Entry while leaving the flags alone.
+ */
+static inline uint64_t merge_pte(uint64_t pte, xen_pfn_t mfn)
+{
+    return (pte & ~PTE_FRAME_MASK) | ((uint64_t)mfn << PAGE_SHIFT);
+}
+
+/*
+ * Get current domain information.
+ *
+ * Fills ctx->x86_pv
+ * - .width
+ * - .levels
+ * - .max_pfn
+ * - .p2m_frames
+ *
+ * Used by the save side to create the X86_PV_INFO record, and by the restore
+ * side to verify the incoming stream.
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_domain_info(struct xc_sr_context *ctx);
+
+/*
+ * Maps the Xen M2P.
+ *
+ * Fills ctx->x86_pv.
+ * - .max_mfn
+ * - .m2p
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_map_m2p(struct xc_sr_context *ctx);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v10 08/15] tools/libxc: x86 PV save code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (6 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 07/15] tools/libxc: x86 PV " Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 09/15] tools/libxc: x86 PV restore code Andrew Cooper
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Save the x86 PV specific parts of a domain.  This is the X86_PV_INFO record,
the P2M_FRAMES, the X86_PV_SHARED_INFO, the three different VCPU context
records, and the MSR records.

The normalise_page callback, used by the common code when writing the
PAGE_DATA records, converts MFNs in page tables to PFNs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
v10: Bail with a hard error if pfn truncation would occur
---
 tools/libxc/Makefile            |    1 +
 tools/libxc/xc_sr_common.h      |    1 +
 tools/libxc/xc_sr_save_x86_pv.c |  885 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 887 insertions(+)
 create mode 100644 tools/libxc/xc_sr_save_x86_pv.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index c2e62a8..40ba91c 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -57,6 +57,7 @@ GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
 GUEST_SRCS-y += xc_sr_common.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86_pv.c
+GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_pv.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index b71b532..2f0f00b 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -285,6 +285,7 @@ struct xc_sr_context
     };
 };
 
+extern struct xc_sr_save_ops save_ops_x86_pv;
 
 struct xc_sr_record
 {
diff --git a/tools/libxc/xc_sr_save_x86_pv.c b/tools/libxc/xc_sr_save_x86_pv.c
new file mode 100644
index 0000000..a668221
--- /dev/null
+++ b/tools/libxc/xc_sr_save_x86_pv.c
@@ -0,0 +1,885 @@
+#include <assert.h>
+#include <limits.h>
+
+#include "xc_sr_common_x86_pv.h"
+
+/*
+ * Maps the guest's shared info page.
+ */
+static int map_shinfo(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    ctx->x86_pv.shinfo = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ, ctx->dominfo.shared_info_frame);
+    if ( !ctx->x86_pv.shinfo )
+    {
+        PERROR("Failed to map shared info frame at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Copy a list of mfns from a guest, accounting for differences between guest
+ * and toolstack width.  Can fail if truncation would occur.
+ */
+static int copy_mfns_from_guest(const struct xc_sr_context *ctx,
+                                xen_pfn_t *dst, const void *src, size_t count)
+{
+    size_t x;
+
+    if ( ctx->x86_pv.width == sizeof(unsigned long) )
+        memcpy(dst, src, count * sizeof(*dst));
+    else
+    {
+        for ( x = 0; x < count; ++x )
+        {
+#ifdef __x86_64__
+            /* 64bit toolstack, 32bit guest.  Expand any INVALID_MFN. */
+            uint32_t s = ((uint32_t *)src)[x];
+
+            dst[x] = s == ~0U ? INVALID_MFN : s;
+#else
+            /*
+             * 32bit toolstack, 64bit guest.  Truncate INVALID_MFN, but bail
+             * if any other truncation would occur.
+             *
+             * This will only occur on hosts where a PV guest has ram above
+             * the 16TB boundary.  A 32bit dom0 is unlikely to have
+             * successfully booted on a system this large.
+             */
+            uint64_t s = ((uint64_t *)src)[x];
+
+            if ( (s != ~0ULL) && ((s >> 32) != 0) )
+            {
+                errno = E2BIG;
+                return -1;
+            }
+
+            dst[x] = s;
+#endif
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Walk the guest's frame list list and frame list to identify and map the
+ * frames making up the guest's p2m table.  Construct a list of pfns making up
+ * the table.
+ */
+static int map_p2m(struct xc_sr_context *ctx)
+{
+    /* Terminology:
+     *
+     * fll   - frame list list, top level p2m, list of fl mfns
+     * fl    - frame list, mid level p2m, list of leaf mfns
+     * local - own allocated buffers, adjusted for bitness
+     * guest - mappings into the domain
+     */
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    unsigned x, fpp, fll_entries, fl_entries;
+    xen_pfn_t fll_mfn;
+
+    xen_pfn_t *local_fll = NULL;
+    void *guest_fll = NULL;
+    size_t local_fll_size;
+
+    xen_pfn_t *local_fl = NULL;
+    void *guest_fl = NULL;
+    size_t local_fl_size;
+
+    fpp = PAGE_SIZE / ctx->x86_pv.width;
+    fll_entries = (ctx->x86_pv.max_pfn / (fpp * fpp)) + 1;
+    fl_entries  = (ctx->x86_pv.max_pfn / fpp) + 1;
+
+    fll_mfn = GET_FIELD(ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list,
+                        ctx->x86_pv.width);
+    if ( fll_mfn == 0 || fll_mfn > ctx->x86_pv.max_mfn )
+    {
+        ERROR("Bad mfn %#lx for p2m frame list list", fll_mfn);
+        goto err;
+    }
+
+    /* Map the guest top p2m. */
+    guest_fll = xc_map_foreign_range(xch, ctx->domid, PAGE_SIZE,
+                                     PROT_READ, fll_mfn);
+    if ( !guest_fll )
+    {
+        PERROR("Failed to map p2m frame list list at %#lx", fll_mfn);
+        goto err;
+    }
+
+    local_fll_size = fll_entries * sizeof(*local_fll);
+    local_fll = malloc(local_fll_size);
+    if ( !local_fll )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list list",
+              local_fll_size);
+        goto err;
+    }
+
+    if ( copy_mfns_from_guest(ctx, local_fll, guest_fll, fll_entries) )
+    {
+        ERROR("Truncation detected copying p2m frame list list");
+        goto err;
+    }
+
+    /* Check for bad mfns in frame list list. */
+    for ( x = 0; x < fll_entries; ++x )
+    {
+        if ( local_fll[x] == 0 || local_fll[x] > ctx->x86_pv.max_mfn )
+        {
+            ERROR("Bad mfn %#lx at index %u (of %u) in p2m frame list list",
+                  local_fll[x], x, fll_entries);
+            goto err;
+        }
+    }
+
+    /* Map the guest mid p2m frames. */
+    guest_fl = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                    local_fll, fll_entries);
+    if ( !guest_fl )
+    {
+        PERROR("Failed to map p2m frame list");
+        goto err;
+    }
+
+    local_fl_size = fl_entries * sizeof(*local_fl);
+    local_fl = malloc(local_fl_size);
+    if ( !local_fl )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list",
+              local_fl_size);
+        goto err;
+    }
+
+    if ( copy_mfns_from_guest(ctx, local_fl, guest_fl, fl_entries) )
+    {
+        ERROR("Truncation detected copying p2m frame list");
+        goto err;
+    }
+
+    for ( x = 0; x < fl_entries; ++x )
+    {
+        if ( local_fl[x] == 0 || local_fl[x] > ctx->x86_pv.max_mfn )
+        {
+            ERROR("Bad mfn %#lx at index %u (of %u) in p2m frame list",
+                  local_fl[x], x, fl_entries);
+            goto err;
+        }
+    }
+
+    /* Map the p2m leaves themselves. */
+    ctx->x86_pv.p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                           local_fl, fl_entries);
+    if ( !ctx->x86_pv.p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    ctx->x86_pv.p2m_frames = fl_entries;
+    ctx->x86_pv.p2m_pfns = malloc(local_fl_size);
+    if ( !ctx->x86_pv.p2m_pfns )
+    {
+        ERROR("Cannot allocate %zu bytes for p2m pfns list",
+              local_fl_size);
+        goto err;
+    }
+
+    /* Convert leaf frames from mfns to pfns. */
+    for ( x = 0; x < fl_entries; ++x )
+    {
+        if ( !mfn_in_pseudophysmap(ctx, local_fl[x]) )
+        {
+            ERROR("Bad mfn in p2m_frame_list[%u]", x);
+            dump_bad_pseudophysmap_entry(ctx, local_fl[x]);
+            errno = ERANGE;
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[x] = mfn_to_pfn(ctx, local_fl[x]);
+    }
+
+    rc = 0;
+err:
+
+    free(local_fl);
+    if ( guest_fl )
+        munmap(guest_fl, fll_entries * PAGE_SIZE);
+
+    free(local_fll);
+    if ( guest_fll )
+        munmap(guest_fll, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Obtain a specific vcpu's basic state and write an X86_PV_VCPU_BASIC record
+ * into the stream.  Performs mfn->pfn conversion on architectural state.
+ */
+static int write_one_vcpu_basic(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn;
+    unsigned i, gdt_count;
+    int rc = -1;
+    vcpu_guest_context_any_t vcpu;
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_BASIC,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+
+    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
+    {
+        PERROR("Failed to get vcpu%u context", id);
+        goto err;
+    }
+
+    /* Vcpu0 is special: Convert the suspend record to a pfn. */
+    if ( id == 0 )
+    {
+        mfn = GET_FIELD(&vcpu, user_regs.edx, ctx->x86_pv.width);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for suspend record");
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(&vcpu, user_regs.edx, mfn_to_pfn(ctx, mfn),
+                  ctx->x86_pv.width);
+    }
+
+    gdt_count = GET_FIELD(&vcpu, gdt_ents, ctx->x86_pv.width);
+    if ( gdt_count > FIRST_RESERVED_GDT_ENTRY )
+    {
+        ERROR("GDT entry count (%u) out of range (max %u)",
+              gdt_count, FIRST_RESERVED_GDT_ENTRY);
+        errno = ERANGE;
+        goto err;
+    }
+    gdt_count = (gdt_count + 511) / 512; /* gdt_count now in units of frames. */
+
+    /* Convert GDT frames to pfns. */
+    for ( i = 0; i < gdt_count; ++i )
+    {
+        mfn = GET_FIELD(&vcpu, gdt_frames[i], ctx->x86_pv.width);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for frame %u of vcpu%u's GDT", i, id);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(&vcpu, gdt_frames[i], mfn_to_pfn(ctx, mfn),
+                  ctx->x86_pv.width);
+    }
+
+    /* Convert CR3 to a pfn. */
+    mfn = cr3_to_mfn(ctx, GET_FIELD(&vcpu, ctrlreg[3], ctx->x86_pv.width));
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Bad mfn for vcpu%u's cr3", id);
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        errno = ERANGE;
+        goto err;
+    }
+    pfn = mfn_to_pfn(ctx, mfn);
+    SET_FIELD(&vcpu, ctrlreg[3], mfn_to_cr3(ctx, pfn), ctx->x86_pv.width);
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to pfn. */
+    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
+    {
+        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for vcpu%u's cr1", id);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        pfn = mfn_to_pfn(ctx, mfn);
+        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
+    }
+
+    if ( ctx->x86_pv.width == 8 )
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x64));
+    else
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x32));
+
+ err:
+    return rc;
+}
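The GDT handling above converts a descriptor count into a frame count with a
ceiling division: each 4KiB frame holds 512 eight-byte descriptors.  A
standalone sketch of that arithmetic (helper name hypothetical, not a libxc
function):

```c
#include <assert.h>

/* Convert a GDT descriptor count to the number of 4KiB frames backing
 * it.  Each frame holds 512 eight-byte descriptors, so this is a
 * ceiling division by 512. */
static unsigned int gdt_frames(unsigned int gdt_ents)
{
    return (gdt_ents + 511) / 512;
}
```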
+
+/*
+ * Obtain a specific vcpu's extended state and write an X86_PV_VCPU_EXTENDED
+ * record into the stream.
+ */
+static int write_one_vcpu_extended(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_EXTENDED,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
+        .domain = ctx->domid,
+        .u.ext_vcpucontext.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u extended context", id);
+        return -1;
+    }
+
+    return write_split_record(ctx, &rec, &domctl.u.ext_vcpucontext,
+                              domctl.u.ext_vcpucontext.size);
+}
+
+/*
+ * Query to see whether a specific vcpu has xsave state and if so, write an
+ * X86_PV_VCPU_XSAVE record into the stream.
+ */
+static int write_one_vcpu_xsave(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_XSAVE,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_getvcpuextstate,
+        .domain = ctx->domid,
+        .u.vcpuextstate.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's xsave context", id);
+        goto err;
+    }
+
+    /* No xsave state?  Skip this record. */
+    if ( !domctl.u.vcpuextstate.xfeature_mask )
+        goto out;
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, domctl.u.vcpuextstate.size);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %"PRIu64" bytes for vcpu%u's xsave context",
+              domctl.u.vcpuextstate.size, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's xsave context", id);
+        goto err;
+    }
+
+    rc = write_split_record(ctx, &rec, buffer, domctl.u.vcpuextstate.size);
+    if ( rc )
+        goto err;
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Query to see whether a specific vcpu has msr state and if so, write an
+ * X86_PV_VCPU_MSRS record into the stream.
+ */
+static int write_one_vcpu_msrs(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    size_t buffersz;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_MSRS,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_vcpu_msrs,
+        .domain = ctx->domid,
+        .u.vcpu_msrs.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's msrs", id);
+        goto err;
+    }
+
+    /* No MSRs?  Skip this record. */
+    if ( !domctl.u.vcpu_msrs.msr_count )
+        goto out;
+
+    buffersz = domctl.u.vcpu_msrs.msr_count * sizeof(xen_domctl_vcpu_msr_t);
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, buffersz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for vcpu%u's msrs",
+              buffersz, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's msrs", id);
+        goto err;
+    }
+
+    rc = write_split_record(ctx, &rec, buffer,
+                            domctl.u.vcpu_msrs.msr_count *
+                            sizeof(xen_domctl_vcpu_msr_t));
+    if ( rc )
+        goto err;
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * For each vcpu, if it is online, write its state into the stream.
+ */
+static int write_all_vcpu_information(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xc_vcpuinfo_t vinfo;
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
+    {
+        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
+        if ( rc )
+        {
+            PERROR("Failed to get vcpu%u information", i);
+            return rc;
+        }
+
+        /* Vcpu offline?  Skip all of its records. */
+        if ( !vinfo.online )
+            continue;
+
+        rc = write_one_vcpu_basic(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_extended(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_xsave(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_msrs(ctx, i);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
+/*
+ * Writes an X86_PV_INFO record into the stream.
+ */
+static int write_x86_pv_info(struct xc_sr_context *ctx)
+{
+    struct xc_sr_rec_x86_pv_info info =
+        {
+            .guest_width = ctx->x86_pv.width,
+            .pt_levels = ctx->x86_pv.levels,
+        };
+    struct xc_sr_record rec =
+        {
+            .type = REC_TYPE_X86_PV_INFO,
+            .length = sizeof(info),
+            .data = &info
+        };
+
+    return write_record(ctx, &rec);
+}
+
+/*
+ * Writes an X86_PV_P2M_FRAMES record into the stream.  This contains the list
+ * of pfns making up the p2m table.
+ */
+static int write_x86_pv_p2m_frames(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+    unsigned i;
+    size_t datasz = ctx->x86_pv.p2m_frames * sizeof(uint64_t);
+    uint64_t *data = NULL;
+    struct xc_sr_rec_x86_pv_p2m_frames hdr =
+        {
+            .start_pfn = 0,
+            .end_pfn = ctx->x86_pv.max_pfn,
+        };
+    struct xc_sr_record rec =
+        {
+            .type = REC_TYPE_X86_PV_P2M_FRAMES,
+            .length = sizeof(hdr),
+            .data = &hdr,
+        };
+
+    /* No need to translate if sizeof(uint64_t) == sizeof(xen_pfn_t). */
+    if ( sizeof(uint64_t) != sizeof(*ctx->x86_pv.p2m_pfns) )
+    {
+        if ( !(data = malloc(datasz)) )
+        {
+            ERROR("Cannot allocate %zu bytes for X86_PV_P2M_FRAMES data",
+                  datasz);
+            return -1;
+        }
+
+        for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+            data[i] = ctx->x86_pv.p2m_pfns[i];
+    }
+    else
+        data = (uint64_t *)ctx->x86_pv.p2m_pfns;
+
+    rc = write_split_record(ctx, &rec, data, datasz);
+
+    if ( data != (uint64_t *)ctx->x86_pv.p2m_pfns )
+        free(data);
+
+    return rc;
+}
+
+/*
+ * Writes a SHARED_INFO record into the stream.
+ */
+static int write_shared_info(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_SHARED_INFO,
+        .length = PAGE_SIZE,
+        .data = ctx->x86_pv.shinfo,
+    };
+
+    return write_record(ctx, &rec);
+}
+
+/*
+ * Normalise a pagetable for the migration stream.  Performs pfn->mfn
+ * conversions on the ptes.
+ */
+static int normalise_pagetable(struct xc_sr_context *ctx, const uint64_t *src,
+                               uint64_t *dst, unsigned long type)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t pte;
+    unsigned i, xen_first = -1, xen_last = -1; /* Indices of Xen mappings. */
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( ctx->x86_pv.levels == 4 )
+    {
+        /* 64bit guests only have Xen mappings in their L4 tables. */
+        if ( type == XEN_DOMCTL_PFINFO_L4TAB )
+        {
+            xen_first = 256;
+            xen_last = 271;
+        }
+    }
+    else
+    {
+        switch ( type )
+        {
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            ERROR("??? Found L4 table for 32bit guest");
+            errno = EINVAL;
+            return -1;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            /* 32bit guests can only use the first 4 entries of their L3
+             * tables.  All others are potentially used by Xen. */
+            xen_first = 4;
+            xen_last = 512;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
+             * L2 tables are normal, but a few will contain Xen mappings.
+             *
+             * 428 = (HYPERVISOR_VIRT_START_PAE >> L2_PAGETABLE_SHIFT_PAE)&0x1ff
+             *
+             * ...which is conveniently unavailable to us in a 64bit build.
+             */
+            if ( pte_to_frame(src[428]) == ctx->x86_pv.compat_m2p_mfn0 )
+            {
+                xen_first = 428;
+                xen_last = 512;
+            }
+            break;
+        }
+    }
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        xen_pfn_t mfn;
+
+        pte = src[i];
+
+        /* Remove Xen mappings: Xen will reconstruct on the other side. */
+        if ( i >= xen_first && i <= xen_last )
+            pte = 0;
+
+        /*
+         * Errors during the live part of migration are expected as a result
+         * of split pagetable updates, page type changes, active grant
+         * mappings etc.  The pagetable will need to be resent after pausing.
+         * In such cases we fail with EAGAIN.
+         *
+         * For domains which are already paused, errors are fatal.
+         */
+        if ( pte & _PAGE_PRESENT )
+        {
+            mfn = pte_to_frame(pte);
+
+#ifdef __i386__
+            if ( mfn == INVALID_MFN )
+            {
+                ERROR("PTE truncation detected.  L%lu[%u] = %016"PRIx64,
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                errno = E2BIG;
+                return -1;
+            }
+#endif
+
+            if ( (type > XEN_DOMCTL_PFINFO_L1TAB) && (pte & _PAGE_PSE) )
+            {
+                if ( !ctx->dominfo.paused )
+                    errno = EAGAIN;
+                else
+                {
+                    ERROR("Cannot migrate superpage (L%lu[%u]: 0x%016"PRIx64")",
+                          type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                    errno = E2BIG;
+                }
+                return -1;
+            }
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                if ( !ctx->dominfo.paused )
+                    errno = EAGAIN;
+                else
+                {
+                    ERROR("Bad mfn for L%lu[%u]",
+                          type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i);
+                    dump_bad_pseudophysmap_entry(ctx, mfn);
+                    errno = ERANGE;
+                }
+                return -1;
+            }
+
+            pte = merge_pte(pte, mfn_to_pfn(ctx, mfn));
+        }
+
+        dst[i] = pte;
+    }
+
+    return 0;
+}
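normalise_pagetable() relies on pte_to_frame() and merge_pte() to split a
64bit pte into its frame number and flag bits, and to rejoin them after the
mfn->pfn conversion.  A simplified standalone sketch of that bit
manipulation (mask value and helper names are illustrative, not the libxc
definitions):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
/* Bits 12..51 of an x86-64 pte hold the frame number; the remaining
 * bits are present/access/protection flags. */
#define PTE_FRAME_MASK UINT64_C(0x000ffffffffff000)

static uint64_t pte_to_frame(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) >> PAGE_SHIFT;
}

/* Replace the frame number in a pte, preserving the flag bits. */
static uint64_t merge_frame(uint64_t pte, uint64_t frame)
{
    return (pte & ~PTE_FRAME_MASK) | ((frame << PAGE_SHIFT) & PTE_FRAME_MASK);
}
```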
+
+/* save_ops function. */
+static xen_pfn_t x86_pv_pfn_to_gfn(const struct xc_sr_context *ctx,
+                                   xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    return xc_pfn_to_mfn(pfn, ctx->x86_pv.p2m, ctx->x86_pv.width);
+}
+
+/*
+ * save_ops function.  Performs pagetable normalisation on appropriate pages.
+ */
+static int x86_pv_normalise_page(struct xc_sr_context *ctx, xen_pfn_t type,
+                                 void **page)
+{
+    xc_interface *xch = ctx->xch;
+    void *local_page;
+    int rc;
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    local_page = malloc(PAGE_SIZE);
+    if ( !local_page )
+    {
+        ERROR("Unable to allocate scratch page");
+        rc = -1;
+        goto out;
+    }
+
+    rc = normalise_pagetable(ctx, *page, local_page, type);
+    *page = local_page;
+
+  out:
+    return rc;
+}
+
+/*
+ * save_ops function.  Queries domain information and maps the Xen m2p and the
+ * guest's shinfo and p2m table.
+ */
+static int x86_pv_setup(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        return rc;
+
+    rc = map_shinfo(ctx);
+    if ( rc )
+        return rc;
+
+    rc = map_p2m(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Writes PV header records into the stream.
+ */
+static int x86_pv_start_of_stream(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = write_x86_pv_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_x86_pv_p2m_frames(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Writes the tail records into the stream.
+ */
+static int x86_pv_end_of_stream(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_shared_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_all_vcpu_information(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Cleanup.
+ */
+static int x86_pv_cleanup(struct xc_sr_context *ctx)
+{
+    free(ctx->x86_pv.p2m_pfns);
+
+    if ( ctx->x86_pv.p2m )
+        munmap(ctx->x86_pv.p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    if ( ctx->x86_pv.shinfo )
+        munmap(ctx->x86_pv.shinfo, PAGE_SIZE);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return 0;
+}
+
+struct xc_sr_save_ops save_ops_x86_pv =
+{
+    .pfn_to_gfn      = x86_pv_pfn_to_gfn,
+    .normalise_page  = x86_pv_normalise_page,
+    .setup           = x86_pv_setup,
+    .start_of_stream = x86_pv_start_of_stream,
+    .end_of_stream   = x86_pv_end_of_stream,
+    .cleanup         = x86_pv_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v10 09/15] tools/libxc: x86 PV restore code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (7 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 08/15] tools/libxc: x86 PV save code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-05-05 13:00   ` Ian Campbell
  2015-04-23 11:48 ` [PATCH v10 10/15] tools/libxc: x86 HVM save code Andrew Cooper
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Wei Liu

Restore the x86 PV specific parts: the X86_PV_INFO, X86_PV_P2M_FRAMES,
SHARED_INFO, and VCPU context records.

The localise_page callback is called from the common PAGE_DATA code to convert
PFNs in page tables to MFNs.

Page tables are pinned and the guest's P2M is updated when the stream is
complete.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/Makefile               |    1 +
 tools/libxc/xc_sr_common.h         |   10 +
 tools/libxc/xc_sr_restore.c        |  125 ++++
 tools/libxc/xc_sr_restore_x86_pv.c | 1165 ++++++++++++++++++++++++++++++++++++
 4 files changed, 1301 insertions(+)
 create mode 100644 tools/libxc/xc_sr_restore_x86_pv.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 40ba91c..b301432 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -57,6 +57,7 @@ GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
 GUEST_SRCS-y += xc_sr_common.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86_pv.c
+GUEST_SRCS-$(CONFIG_X86) += xc_sr_restore_x86_pv.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_pv.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 2f0f00b..7d3d54f 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -287,6 +287,8 @@ struct xc_sr_context
 
 extern struct xc_sr_save_ops save_ops_x86_pv;
 
+extern struct xc_sr_restore_ops restore_ops_x86_pv;
+
 struct xc_sr_record
 {
     uint32_t type;
@@ -321,6 +323,14 @@ static inline int write_record(struct xc_sr_context *ctx,
     return write_split_record(ctx, rec, NULL, 0);
 }
 
+/*
+ * This would ideally be private in restore.c, but is needed by
+ * x86_pv_localise_page() if we receive pagetable frames ahead of the
+ * contents of the frames they point at.
+ */
+int populate_pfns(struct xc_sr_context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index cdebc69..045f486 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -1,5 +1,130 @@
 #include "xc_sr_common.h"
 
+/*
+ * Is a pfn populated?
+ */
+static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    if ( pfn > ctx->restore.max_populated_pfn )
+        return false;
+    return test_bit(pfn, ctx->restore.populated_pfns);
+}
+
+/*
+ * Set a pfn as populated, expanding the tracking structures if needed.  To
+ * avoid realloc()ing too excessively, the size is increased to the nearest
+ * power of two large enough to contain the required pfn.
+ */
+static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( pfn > ctx->restore.max_populated_pfn )
+    {
+        xen_pfn_t new_max;
+        size_t old_sz, new_sz;
+        unsigned long *p;
+
+        /* Round up to the nearest power of two larger than pfn, less 1. */
+        new_max = pfn;
+        new_max |= new_max >> 1;
+        new_max |= new_max >> 2;
+        new_max |= new_max >> 4;
+        new_max |= new_max >> 8;
+        new_max |= new_max >> 16;
+#ifdef __x86_64__
+        new_max |= new_max >> 32;
+#endif
+
+        old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
+        new_sz = bitmap_size(new_max + 1);
+        p = realloc(ctx->restore.populated_pfns, new_sz);
+        if ( !p )
+        {
+            ERROR("Failed to realloc populated bitmap");
+            errno = ENOMEM;
+            return -1;
+        }
+
+        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+
+        ctx->restore.populated_pfns    = p;
+        ctx->restore.max_populated_pfn = new_max;
+    }
+
+    set_bit(pfn, ctx->restore.populated_pfns);
+
+    return 0;
+}
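pfn_set_populated() grows its bitmap geometrically rather than per-pfn: the
shift-and-or cascade smears the highest set bit rightwards, yielding the
smallest value of the form 2^k - 1 that is >= pfn.  A standalone sketch of
the rounding (64bit, helper name hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Smear the highest set bit rightwards so every lower bit becomes set.
 * The result is the smallest 2^k - 1 value >= v, giving power-of-two
 * growth that amortises the cost of realloc(). */
static uint64_t round_up_to_pow2_minus_1(uint64_t v)
{
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v;
}
```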
+
+/*
+ * Given a set of pfns, obtain memory from Xen to fill the physmap for the
+ * unpopulated subset.  If types is NULL, no page typechecking is performed
+ * and all unpopulated pfns are populated.
+ */
+int populate_pfns(struct xc_sr_context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof(*mfns)),
+        *pfns = malloc(count * sizeof(*pfns));
+    unsigned i, nr_pfns = 0;
+    int rc = -1;
+
+    if ( !mfns || !pfns )
+    {
+        ERROR("Failed to allocate %zu bytes for populating the physmap",
+              2 * count * sizeof(*mfns));
+        goto err;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        if ( (!types ||
+              (types[i] != XEN_DOMCTL_PFINFO_XTAB &&
+               types[i] != XEN_DOMCTL_PFINFO_BROKEN)) &&
+             !pfn_is_populated(ctx, original_pfns[i]) )
+        {
+            pfns[nr_pfns] = mfns[nr_pfns] = original_pfns[i];
+            ++nr_pfns;
+        }
+    }
+
+    if ( nr_pfns )
+    {
+        rc = xc_domain_populate_physmap_exact(
+            xch, ctx->domid, nr_pfns, 0, 0, mfns);
+        if ( rc )
+        {
+            PERROR("Failed to populate physmap");
+            goto err;
+        }
+
+        for ( i = 0; i < nr_pfns; ++i )
+        {
+            if ( mfns[i] == INVALID_MFN )
+            {
+                ERROR("Populate physmap failed at index %u", i);
+                rc = -1;
+                goto err;
+            }
+
+            rc = pfn_set_populated(ctx, pfns[i]);
+            if ( rc )
+                goto err;
+            ctx->restore.ops.set_gfn(ctx, pfns[i], mfns[i]);
+        }
+    }
+
+    rc = 0;
+
+ err:
+    free(pfns);
+    free(mfns);
+
+    return rc;
+}
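populate_pfns() deduplicates against the populated bitmap via test_bit()
and set_bit(), which libxc supplies elsewhere; their behaviour matches this
minimal sketch (illustrative only, not the libxc definitions):

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Query one bit of a bitmap stored as an array of unsigned longs. */
static int test_bit(unsigned long bit, const unsigned long *map)
{
    return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
}

/* Set one bit of the bitmap. */
static void set_bit(unsigned long bit, unsigned long *map)
{
    map[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}
```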
+
 int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        unsigned int store_evtchn, unsigned long *store_mfn,
                        domid_t store_domid, unsigned int console_evtchn,
diff --git a/tools/libxc/xc_sr_restore_x86_pv.c b/tools/libxc/xc_sr_restore_x86_pv.c
new file mode 100644
index 0000000..ef26c64
--- /dev/null
+++ b/tools/libxc/xc_sr_restore_x86_pv.c
@@ -0,0 +1,1165 @@
+#include <assert.h>
+
+#include "xc_sr_common_x86_pv.h"
+
+static xen_pfn_t pfn_to_mfn(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    return xc_pfn_to_mfn(pfn, ctx->x86_pv.p2m, ctx->x86_pv.width);
+}
+
+/*
+ * Expand our local tracking information for the p2m table and the domain's
+ * maximum size.  Normally this will be called once to expand from 0 to
+ * max_pfn, but it is liable to expand multiple times if the domain grows on
+ * the sending side after migration has started.
+ */
+static int expand_p2m(struct xc_sr_context *ctx, unsigned long max_pfn)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long old_max = ctx->x86_pv.max_pfn, i;
+    unsigned int fpp = PAGE_SIZE / ctx->x86_pv.width;
+    unsigned long end_frame = (max_pfn / fpp) + 1;
+    unsigned long old_end_frame = (old_max / fpp) + 1;
+    xen_pfn_t *p2m = NULL, *p2m_pfns = NULL;
+    uint32_t *pfn_types = NULL;
+    size_t p2msz, p2m_pfnsz, pfn_typesz;
+
+    assert(max_pfn > old_max);
+
+    p2msz = (max_pfn + 1) * ctx->x86_pv.width;
+    p2m = realloc(ctx->x86_pv.p2m, p2msz);
+    if ( !p2m )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m", p2msz);
+        return -1;
+    }
+    ctx->x86_pv.p2m = p2m;
+
+    pfn_typesz = (max_pfn + 1) * sizeof(*pfn_types);
+    pfn_types = realloc(ctx->x86_pv.restore.pfn_types, pfn_typesz);
+    if ( !pfn_types )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for pfn_types", pfn_typesz);
+        return -1;
+    }
+    ctx->x86_pv.restore.pfn_types = pfn_types;
+
+    p2m_pfnsz = (end_frame + 1) * sizeof(*p2m_pfns);
+    p2m_pfns = realloc(ctx->x86_pv.p2m_pfns, p2m_pfnsz);
+    if ( !p2m_pfns )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m frame list", p2m_pfnsz);
+        return -1;
+    }
+    ctx->x86_pv.p2m_frames = end_frame;
+    ctx->x86_pv.p2m_pfns = p2m_pfns;
+
+    ctx->x86_pv.max_pfn = max_pfn;
+    for ( i = (old_max ? old_max + 1 : 0); i <= max_pfn; ++i )
+    {
+        ctx->restore.ops.set_gfn(ctx, i, INVALID_MFN);
+        ctx->restore.ops.set_page_type(ctx, i, 0);
+    }
+
+    for ( i = (old_end_frame ? old_end_frame + 1 : 0); i <= end_frame; ++i )
+        ctx->x86_pv.p2m_pfns[i] = INVALID_MFN;
+
+    DPRINTF("Expanded p2m from %#lx to %#lx", old_max, max_pfn);
+    return 0;
+}
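expand_p2m() derives how many p2m frames cover a given max_pfn from fpp,
the number of p2m entries per frame (guest width 4 or 8 bytes).  A
standalone sketch of that sizing, assuming a 4096-byte page (helper name
hypothetical):

```c
#include <assert.h>

/* Number of p2m frames needed to cover pfns 0..max_pfn inclusive, for
 * a p2m entry width of 4 (32bit guest) or 8 (64bit guest) bytes.
 * Mirrors expand_p2m()'s end_frame computation: (max_pfn / fpp) + 1. */
static unsigned long p2m_frame_count(unsigned long max_pfn,
                                     unsigned int width)
{
    unsigned long fpp = 4096 / width;  /* p2m entries per frame */

    return (max_pfn / fpp) + 1;
}
```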
+
+/*
+ * Pin all of the pagetables.
+ */
+static int pin_pagetables(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long i, nr_pins;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+
+    for ( i = nr_pins = 0; i <= ctx->x86_pv.max_pfn; ++i )
+    {
+        if ( (ctx->x86_pv.restore.pfn_types[i] &
+              XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( (ctx->x86_pv.restore.pfn_types[i] &
+                  XEN_DOMCTL_PFINFO_LTABTYPE_MASK) )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = pfn_to_mfn(ctx, i);
+        nr_pins++;
+
+        if ( nr_pins == MAX_PIN_BATCH )
+        {
+            if ( xc_mmuext_op(xch, pin, nr_pins, ctx->domid) != 0 )
+            {
+                PERROR("Failed to pin batch of pagetables");
+                return -1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    if ( (nr_pins > 0) && (xc_mmuext_op(xch, pin, nr_pins, ctx->domid) < 0) )
+    {
+        PERROR("Failed to pin batch of pagetables");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Update details in a guest's start_info structure.
+ */
+static int process_start_info(struct xc_sr_context *ctx,
+                              vcpu_guest_context_any_t *vcpu)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn, mfn;
+    start_info_any_t *guest_start_info = NULL;
+    int rc = -1;
+
+    pfn = GET_FIELD(vcpu, user_regs.edx, ctx->x86_pv.width);
+
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Start Info pfn %#lx out of range", pfn);
+        goto err;
+    }
+    else if ( ctx->x86_pv.restore.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+    {
+        ERROR("Start Info pfn %#lx has bad type %u", pfn,
+              (ctx->x86_pv.restore.pfn_types[pfn] >>
+               XEN_DOMCTL_PFINFO_LTAB_SHIFT));
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Start Info has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    SET_FIELD(vcpu, user_regs.edx, mfn, ctx->x86_pv.width);
+    guest_start_info = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn);
+    if ( !guest_start_info )
+    {
+        PERROR("Failed to map Start Info at mfn %#lx", mfn);
+        goto err;
+    }
+
+    /* Deal with xenstore stuff */
+    pfn = GET_FIELD(guest_start_info, store_mfn, ctx->x86_pv.width);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("XenStore pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("XenStore pfn has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.xenstore_gfn = mfn;
+    SET_FIELD(guest_start_info, store_mfn, mfn, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, store_evtchn,
+              ctx->restore.xenstore_evtchn, ctx->x86_pv.width);
+
+    /* Deal with console stuff */
+    pfn = GET_FIELD(guest_start_info, console.domU.mfn, ctx->x86_pv.width);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Console pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Console pfn has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.console_gfn = mfn;
+    SET_FIELD(guest_start_info, console.domU.mfn, mfn, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, console.domU.evtchn,
+              ctx->restore.console_evtchn, ctx->x86_pv.width);
+
+    /* Set other information */
+    SET_FIELD(guest_start_info, nr_pages,
+              ctx->x86_pv.max_pfn + 1, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, shared_info,
+              ctx->dominfo.shared_info_frame << PAGE_SHIFT, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, flags, 0, ctx->x86_pv.width);
+
+    rc = 0;
+
+err:
+    if ( guest_start_info )
+        munmap(guest_start_info, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Process one stashed vcpu worth of basic state and send to Xen.
+ */
+static int process_vcpu_basic(struct xc_sr_context *ctx,
+                              unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    vcpu_guest_context_any_t vcpu;
+    xen_pfn_t pfn, mfn;
+    unsigned i, gdt_count;
+    int rc = -1;
+
+    memcpy(&vcpu, ctx->x86_pv.restore.vcpus[vcpuid].basic,
+           ctx->x86_pv.restore.vcpus[vcpuid].basicsz);
+
+    /* Vcpu 0 is special: Convert the suspend record to an mfn. */
+    if ( vcpuid == 0 )
+    {
+        rc = process_start_info(ctx, &vcpu);
+        if ( rc )
+            return rc;
+        rc = -1;
+    }
+
+    SET_FIELD(&vcpu, flags,
+              GET_FIELD(&vcpu, flags, ctx->x86_pv.width) | VGCF_online,
+              ctx->x86_pv.width);
+
+    gdt_count = GET_FIELD(&vcpu, gdt_ents, ctx->x86_pv.width);
+    if ( gdt_count > FIRST_RESERVED_GDT_ENTRY )
+    {
+        ERROR("GDT entry count (%u) out of range (max %u)",
+              gdt_count, FIRST_RESERVED_GDT_ENTRY);
+        errno = ERANGE;
+        goto err;
+    }
+    gdt_count = (gdt_count + 511) / 512; /* gdt_count now in units of frames. */
+
+    /* Convert GDT frames to mfns. */
+    for ( i = 0; i < gdt_count; ++i )
+    {
+        pfn = GET_FIELD(&vcpu, gdt_frames[i], ctx->x86_pv.width);
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("GDT frame %u (pfn %#lx) out of range", i, pfn);
+            goto err;
+        }
+        else if ( (ctx->x86_pv.restore.pfn_types[pfn] !=
+                   XEN_DOMCTL_PFINFO_NOTAB) )
+        {
+            ERROR("GDT frame %u (pfn %#lx) has bad type %u", i, pfn,
+                  (ctx->x86_pv.restore.pfn_types[pfn] >>
+                   XEN_DOMCTL_PFINFO_LTAB_SHIFT));
+            goto err;
+        }
+
+        mfn = pfn_to_mfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("GDT frame %u has bad mfn", i);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        SET_FIELD(&vcpu, gdt_frames[i], mfn, ctx->x86_pv.width);
+    }
+
+    /* Convert CR3 to an mfn. */
+    pfn = cr3_to_mfn(ctx, GET_FIELD(&vcpu, ctrlreg[3], ctx->x86_pv.width));
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("cr3 (pfn %#lx) out of range", pfn);
+        goto err;
+    }
+    else if ( (ctx->x86_pv.restore.pfn_types[pfn] &
+                XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
+              (((xen_pfn_t)ctx->x86_pv.levels) <<
+               XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+    {
+        ERROR("cr3 (pfn %#lx) has bad type %u, expected %u", pfn,
+              (ctx->x86_pv.restore.pfn_types[pfn] >>
+               XEN_DOMCTL_PFINFO_LTAB_SHIFT),
+              ctx->x86_pv.levels);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("cr3 has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    SET_FIELD(&vcpu, ctrlreg[3], mfn_to_cr3(ctx, mfn), ctx->x86_pv.width);
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to mfn. */
+    if ( ctx->x86_pv.levels == 4 && (vcpu.x64.ctrlreg[1] & 1) )
+    {
+        pfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("cr1 (pfn %#lx) out of range", pfn);
+            goto err;
+        }
+        else if ( (ctx->x86_pv.restore.pfn_types[pfn] &
+                   XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
+                  (((xen_pfn_t)ctx->x86_pv.levels) <<
+                   XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+        {
+            ERROR("cr1 (pfn %#lx) has bad type %u, expected %u", pfn,
+                  (ctx->x86_pv.restore.pfn_types[pfn] >>
+                   XEN_DOMCTL_PFINFO_LTAB_SHIFT),
+                  ctx->x86_pv.levels);
+            goto err;
+        }
+
+        mfn = pfn_to_mfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("cr1 has bad mfn");
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        vcpu.x64.ctrlreg[1] = (uint64_t)mfn << PAGE_SHIFT;
+    }
+
+    if ( xc_vcpu_setcontext(xch, ctx->domid, vcpuid, &vcpu) )
+    {
+        PERROR("Failed to set vcpu%u's basic info", vcpuid);
+        goto err;
+    }
+
+    rc = 0;
+
+ err:
+    return rc;
+}
+
+/*
+ * Process one stashed vcpu worth of extended state and send to Xen.
+ */
+static int process_vcpu_extended(struct xc_sr_context *ctx,
+                                 unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu =
+        &ctx->x86_pv.restore.vcpus[vcpuid];
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_set_ext_vcpucontext;
+    domctl.domain = ctx->domid;
+    memcpy(&domctl.u.ext_vcpucontext, vcpu->extd, vcpu->extdsz);
+
+    if ( xc_domctl(xch, &domctl) != 0 )
+    {
+        PERROR("Failed to set vcpu%u's extended info", vcpuid);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Process one stashed vcpu worth of xsave state and send to Xen.
+ */
+static int process_vcpu_xsave(struct xc_sr_context *ctx,
+                              unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu =
+        &ctx->x86_pv.restore.vcpus[vcpuid];
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, vcpu->xsavesz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for xsave hypercall buffer",
+              vcpu->xsavesz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setvcpuextstate;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpuextstate.vcpu = vcpuid;
+    domctl.u.vcpuextstate.size = vcpu->xsavesz;
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+
+    memcpy(buffer, vcpu->xsave, vcpu->xsavesz);
+
+    rc = xc_domctl(xch, &domctl);
+    if ( rc )
+        PERROR("Failed to set vcpu%u's xsave info", vcpuid);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Process one stashed vcpu worth of msr state and send to Xen.
+ */
+static int process_vcpu_msrs(struct xc_sr_context *ctx,
+                             unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu =
+        &ctx->x86_pv.restore.vcpus[vcpuid];
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, vcpu->msrsz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for msr hypercall buffer",
+              vcpu->msrsz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_set_vcpu_msrs;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpu_msrs.vcpu = vcpuid;
+    domctl.u.vcpu_msrs.msr_count = vcpu->msrsz / sizeof(xen_domctl_vcpu_msr_t);
+    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+
+    memcpy(buffer, vcpu->msr, vcpu->msrsz);
+
+    rc = xc_domctl(xch, &domctl);
+    if ( rc )
+        PERROR("Failed to set vcpu%u's msrs", vcpuid);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Process all stashed vcpu context and send to Xen.
+ */
+static int update_vcpu_context(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu;
+    unsigned i;
+    int rc = 0;
+
+    for ( i = 0; i < ctx->x86_pv.restore.nr_vcpus; ++i )
+    {
+        vcpu = &ctx->x86_pv.restore.vcpus[i];
+
+        if ( vcpu->basic )
+        {
+            rc = process_vcpu_basic(ctx, i);
+            if ( rc )
+                return rc;
+        }
+        else if ( i == 0 )
+        {
+            ERROR("Sender didn't send vcpu0's basic state");
+            return -1;
+        }
+
+        if ( vcpu->extd )
+        {
+            rc = process_vcpu_extended(ctx, i);
+            if ( rc )
+                return rc;
+        }
+
+        if ( vcpu->xsave )
+        {
+            rc = process_vcpu_xsave(ctx, i);
+            if ( rc )
+                return rc;
+        }
+
+        if ( vcpu->msr )
+        {
+            rc = process_vcpu_msrs(ctx, i);
+            if ( rc )
+                return rc;
+        }
+    }
+
+    return rc;
+}
+
+/*
+ * Copy the p2m which has been constructed locally as memory has been
+ * allocated, over the p2m in the guest, so the guest can find its memory
+ * again on resume.
+ */
+static int update_guest_p2m(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn, *guest_p2m = NULL;
+    unsigned i;
+    int rc = -1;
+
+    for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+    {
+        pfn = ctx->x86_pv.p2m_pfns[i];
+
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] out of range",
+                  pfn, i);
+            goto err;
+        }
+        else if ( (ctx->x86_pv.restore.pfn_types[pfn] !=
+                   XEN_DOMCTL_PFINFO_NOTAB) )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] has bad type %u", pfn, i,
+                  (ctx->x86_pv.restore.pfn_types[pfn] >>
+                   XEN_DOMCTL_PFINFO_LTAB_SHIFT));
+            goto err;
+        }
+
+        mfn = pfn_to_mfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("p2m_frame_list[%u] has bad mfn", i);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[i] = mfn;
+    }
+
+    guest_p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_WRITE,
+                                     ctx->x86_pv.p2m_pfns,
+                                     ctx->x86_pv.p2m_frames);
+    if ( !guest_p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    memcpy(guest_p2m, ctx->x86_pv.p2m,
+           (ctx->x86_pv.max_pfn + 1) * ctx->x86_pv.width);
+    rc = 0;
+ err:
+    if ( guest_p2m )
+        munmap(guest_p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Process an X86_PV_INFO record.
+ */
+static int handle_x86_pv_info(struct xc_sr_context *ctx,
+                              struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_info *info = rec->data;
+
+    if ( ctx->x86_pv.restore.seen_pv_info )
+    {
+        ERROR("Already received X86_PV_INFO record");
+        return -1;
+    }
+
+    if ( rec->length < sizeof(*info) )
+    {
+        ERROR("X86_PV_INFO record truncated: length %u, expected %zu",
+              rec->length, sizeof(*info));
+        return -1;
+    }
+    else if ( info->guest_width != 4 &&
+              info->guest_width != 8 )
+    {
+        ERROR("Unexpected guest width %u, expected 4 or 8",
+              info->guest_width);
+        return -1;
+    }
+    else if ( info->guest_width != ctx->x86_pv.width )
+    {
+        int rc;
+        struct xen_domctl domctl;
+
+        /* Try to set address size, domain is always created 64 bit. */
+        memset(&domctl, 0, sizeof(domctl));
+        domctl.domain = ctx->domid;
+        domctl.cmd    = XEN_DOMCTL_set_address_size;
+        domctl.u.address_size.size = info->guest_width * 8;
+        rc = do_domctl(xch, &domctl);
+        if ( rc != 0 )
+        {
+            ERROR("Width of guest in stream (%u"
+                  " bits) differs from existing domain (%u bits)",
+                  info->guest_width * 8, ctx->x86_pv.width * 8);
+            return -1;
+        }
+
+        /* Domain's information changed, better to refresh. */
+        rc = x86_pv_domain_info(ctx);
+        if ( rc != 0 )
+        {
+            ERROR("Unable to refresh guest information");
+            return -1;
+        }
+    }
+    if ( info->pt_levels != 3 &&
+         info->pt_levels != 4 )
+    {
+        ERROR("Unexpected guest levels %u, expected 3 or 4",
+              info->pt_levels);
+        return -1;
+    }
+    else if ( info->pt_levels != ctx->x86_pv.levels )
+    {
+        ERROR("Levels of guest in stream (%u"
+              ") differs from existing domain (%u)",
+              info->pt_levels, ctx->x86_pv.levels);
+        return -1;
+    }
+
+    ctx->x86_pv.restore.seen_pv_info = true;
+    return 0;
+}
+
+/*
+ * Process an X86_PV_P2M_FRAMES record.  Takes care of expanding the local p2m
+ * state if needed.
+ */
+static int handle_x86_pv_p2m_frames(struct xc_sr_context *ctx,
+                                    struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_p2m_frames *data = rec->data;
+    unsigned start, end, x, fpp = PAGE_SIZE / ctx->x86_pv.width;
+    int rc;
+
+    if ( !ctx->x86_pv.restore.seen_pv_info )
+    {
+        ERROR("Not yet received X86_PV_INFO record");
+        return -1;
+    }
+
+    if ( rec->length < sizeof(*data) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record truncated: length %u, min %zu",
+              rec->length, sizeof(*data));
+        return -1;
+    }
+    else if ( data->start_pfn > data->end_pfn )
+    {
+        ERROR("Start pfn in stream (%#x) exceeds End (%#x)",
+              data->start_pfn, data->end_pfn);
+        return -1;
+    }
+
+    start = data->start_pfn / fpp;
+    end = data->end_pfn / fpp + 1;
+
+    if ( rec->length != sizeof(*data) + ((end - start) * sizeof(uint64_t)) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record wrong size: start_pfn %#x"
+              ", end_pfn %#x, length %u, expected %zu + (%u - %u) * %zu",
+              data->start_pfn, data->end_pfn, rec->length,
+              sizeof(*data), end, start, sizeof(uint64_t));
+        return -1;
+    }
+
+    if ( data->end_pfn > ctx->x86_pv.max_pfn )
+    {
+        rc = expand_p2m(ctx, data->end_pfn);
+        if ( rc )
+            return rc;
+    }
+
+    for ( x = 0; x < (end - start); ++x )
+        ctx->x86_pv.p2m_pfns[start + x] = data->p2m_pfns[x];
+
+    return 0;
+}
+
+/*
+ * Processes X86_PV_VCPU_{BASIC,EXTENDED,XSAVE,MSRS} records from the stream.
+ * The blobs are all stashed to one side as they need to be deferred until the
+ * very end of the stream, rather than being sent to Xen at the point they
+ * arrive in the stream.  It performs all pre-hypercall size validation.
+ */
+static int handle_x86_pv_vcpu_blob(struct xc_sr_context *ctx,
+                                   struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_vcpu_hdr *vhdr = rec->data;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu;
+    const char *rec_name;
+    size_t blobsz;
+    void *blob;
+    int rc = -1;
+
+    switch ( rec->type )
+    {
+    case REC_TYPE_X86_PV_VCPU_BASIC:
+        rec_name = "X86_PV_VCPU_BASIC";
+        break;
+
+    case REC_TYPE_X86_PV_VCPU_EXTENDED:
+        rec_name = "X86_PV_VCPU_EXTENDED";
+        break;
+
+    case REC_TYPE_X86_PV_VCPU_XSAVE:
+        rec_name = "X86_PV_VCPU_XSAVE";
+        break;
+
+    case REC_TYPE_X86_PV_VCPU_MSRS:
+        rec_name = "X86_PV_VCPU_MSRS";
+        break;
+
+    default:
+        ERROR("Unrecognised vcpu blob record %s (%u)",
+              rec_type_to_str(rec->type), rec->type);
+        goto out;
+    }
+
+    /* Confirm that there is a complete header. */
+    if ( rec->length <= sizeof(*vhdr) )
+    {
+        ERROR("%s record truncated: length %u, min %zu",
+              rec_name, rec->length, sizeof(*vhdr) + 1);
+        goto out;
+    }
+
+    blobsz = rec->length - sizeof(*vhdr);
+
+    /* Check that the vcpu id is within range. */
+    if ( vhdr->vcpu_id >= ctx->x86_pv.restore.nr_vcpus )
+    {
+        ERROR("%s record vcpu_id (%u) exceeds domain max (%u)",
+              rec_name, vhdr->vcpu_id, ctx->x86_pv.restore.nr_vcpus - 1);
+        goto out;
+    }
+
+    vcpu = &ctx->x86_pv.restore.vcpus[vhdr->vcpu_id];
+
+    /* Further per-record checks, where possible. */
+    switch ( rec->type )
+    {
+    case REC_TYPE_X86_PV_VCPU_BASIC:
+    {
+        size_t vcpusz = ctx->x86_pv.width == 8 ?
+            sizeof(vcpu_guest_context_x86_64_t) :
+            sizeof(vcpu_guest_context_x86_32_t);
+
+        if ( blobsz != vcpusz )
+        {
+            ERROR("%s record wrong size: expected %zu, got %u",
+                  rec_name, sizeof(*vhdr) + vcpusz, rec->length);
+            goto out;
+        }
+        break;
+    }
+
+    case REC_TYPE_X86_PV_VCPU_EXTENDED:
+        if ( blobsz > 128 )
+        {
+            ERROR("%s record too long: max %zu, got %u",
+                  rec_name, sizeof(*vhdr) + 128, rec->length);
+            goto out;
+        }
+        break;
+
+    case REC_TYPE_X86_PV_VCPU_MSRS:
+        if ( blobsz % sizeof(xen_domctl_vcpu_msr_t) != 0 )
+        {
+            ERROR("%s record payload size %zu expected to be a multiple of %zu",
+                  rec_name, blobsz, sizeof(xen_domctl_vcpu_msr_t));
+            goto out;
+        }
+        break;
+    }
+
+    /* Allocate memory. */
+    blob = malloc(blobsz);
+    if ( !blob )
+    {
+        ERROR("Unable to allocate %zu bytes for vcpu%u %s blob",
+              blobsz, vhdr->vcpu_id, rec_name);
+        goto out;
+    }
+
+    memcpy(blob, &vhdr->context, blobsz);
+
+    /* Stash sideways for later. */
+    switch ( rec->type )
+    {
+#define RECSTORE(x, y) case REC_TYPE_X86_PV_ ## x: \
+        free(y); (y) = blob; (y ## sz) = blobsz; break
+
+        RECSTORE(VCPU_BASIC,    vcpu->basic);
+        RECSTORE(VCPU_EXTENDED, vcpu->extd);
+        RECSTORE(VCPU_XSAVE,    vcpu->xsave);
+        RECSTORE(VCPU_MSRS,     vcpu->msr);
+#undef RECSTORE
+    }
+
+    rc = 0;
+
+ out:
+    return rc;
+}
+
+/*
+ * Process a SHARED_INFO record from the stream.
+ */
+static int handle_shared_info(struct xc_sr_context *ctx,
+                              struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned i;
+    int rc = -1;
+    shared_info_any_t *guest_shinfo = NULL;
+    const shared_info_any_t *old_shinfo = rec->data;
+
+    if ( !ctx->x86_pv.restore.seen_pv_info )
+    {
+        ERROR("Not yet received X86_PV_INFO record");
+        return -1;
+    }
+
+    if ( rec->length != PAGE_SIZE )
+    {
+        ERROR("X86_PV_SHARED_INFO record wrong size: length %u"
+              ", expected 4096", rec->length);
+        goto err;
+    }
+
+    guest_shinfo = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE,
+        ctx->dominfo.shared_info_frame);
+    if ( !guest_shinfo )
+    {
+        PERROR("Failed to map Shared Info at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        goto err;
+    }
+
+    MEMCPY_FIELD(guest_shinfo, old_shinfo, vcpu_info, ctx->x86_pv.width);
+    MEMCPY_FIELD(guest_shinfo, old_shinfo, arch, ctx->x86_pv.width);
+
+    SET_FIELD(guest_shinfo, arch.pfn_to_mfn_frame_list_list,
+              0, ctx->x86_pv.width);
+
+    MEMSET_ARRAY_FIELD(guest_shinfo, evtchn_pending, 0, ctx->x86_pv.width);
+    for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
+        SET_FIELD(guest_shinfo, vcpu_info[i].evtchn_pending_sel,
+                  0, ctx->x86_pv.width);
+
+    MEMSET_ARRAY_FIELD(guest_shinfo, evtchn_mask, 0xff, ctx->x86_pv.width);
+
+    rc = 0;
+ err:
+
+    if ( guest_shinfo )
+        munmap(guest_shinfo, PAGE_SIZE);
+
+    return rc;
+}
+
+/* restore_ops function. */
+static bool x86_pv_pfn_is_valid(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    return pfn <= ctx->x86_pv.max_pfn;
+}
+
+/* restore_ops function. */
+static void x86_pv_set_page_type(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                                 unsigned long type)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    ctx->x86_pv.restore.pfn_types[pfn] = type;
+}
+
+/* restore_ops function. */
+static void x86_pv_set_gfn(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                           xen_pfn_t mfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    if ( ctx->x86_pv.width == sizeof(uint64_t) )
+        /* 64 bit guest.  Need to expand INVALID_MFN for 32 bit toolstacks. */
+        ((uint64_t *)ctx->x86_pv.p2m)[pfn] = mfn == INVALID_MFN ? ~0ULL : mfn;
+    else
+        /* 32 bit guest.  Can truncate INVALID_MFN for 64 bit toolstacks. */
+        ((uint32_t *)ctx->x86_pv.p2m)[pfn] = mfn;
+}
+
+/*
+ * restore_ops function.  Convert pfns back to mfns in pagetables.  Possibly
+ * needs to populate new frames if a PTE is found referring to a frame which
+ * hasn't yet been seen from PAGE_DATA records.
+ */
+static int x86_pv_localise_page(struct xc_sr_context *ctx,
+                                uint32_t type, void *page)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t *table = page;
+    uint64_t pte;
+    unsigned i, to_populate;
+    xen_pfn_t pfns[(PAGE_SIZE / sizeof(uint64_t))];
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    /* Only page tables need localisation. */
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    /* Check to see whether we need to populate any new frames. */
+    for ( i = 0, to_populate = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        pte = table[i];
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            xen_pfn_t pfn = pte_to_frame(pte);
+
+#ifdef __i386__
+            if ( pfn == INVALID_MFN )
+            {
+                ERROR("PTE truncation detected.  L%u[%u] = %016"PRIx64,
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                errno = E2BIG;
+                return -1;
+            }
+#endif
+
+            if ( pfn_to_mfn(ctx, pfn) == INVALID_MFN )
+                pfns[to_populate++] = pfn;
+        }
+    }
+
+    if ( to_populate && populate_pfns(ctx, to_populate, pfns, NULL) )
+        return -1;
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        pte = table[i];
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            xen_pfn_t mfn, pfn;
+
+            pfn = pte_to_frame(pte);
+            mfn = pfn_to_mfn(ctx, pfn);
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                ERROR("Bad mfn for L%u[%u] - pte %"PRIx64,
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                dump_bad_pseudophysmap_entry(ctx, mfn);
+                errno = ERANGE;
+                return -1;
+            }
+
+            table[i] = merge_pte(pte, mfn);
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * restore_ops function.  Confirm that the incoming stream matches the type of
+ * domain we are attempting to restore into.
+ */
+static int x86_pv_setup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_X86_PV )
+    {
+        ERROR("Unable to restore %s domain into an x86_pv domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != PAGE_SIZE )
+    {
+        ERROR("Invalid page size %d for x86_pv domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        return rc;
+
+    ctx->x86_pv.restore.nr_vcpus = ctx->dominfo.max_vcpu_id + 1;
+    ctx->x86_pv.restore.vcpus = calloc(sizeof(struct xc_sr_x86_pv_restore_vcpu),
+                                       ctx->x86_pv.restore.nr_vcpus);
+    if ( !ctx->x86_pv.restore.vcpus )
+    {
+        errno = ENOMEM;
+        return -1;
+    }
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        return rc;
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_pv_process_record(struct xc_sr_context *ctx,
+                                 struct xc_sr_record *rec)
+{
+    switch ( rec->type )
+    {
+    case REC_TYPE_X86_PV_INFO:
+        return handle_x86_pv_info(ctx, rec);
+
+    case REC_TYPE_X86_PV_P2M_FRAMES:
+        return handle_x86_pv_p2m_frames(ctx, rec);
+
+    case REC_TYPE_X86_PV_VCPU_BASIC:
+    case REC_TYPE_X86_PV_VCPU_EXTENDED:
+    case REC_TYPE_X86_PV_VCPU_XSAVE:
+    case REC_TYPE_X86_PV_VCPU_MSRS:
+        return handle_x86_pv_vcpu_blob(ctx, rec);
+
+    case REC_TYPE_SHARED_INFO:
+        return handle_shared_info(ctx, rec);
+
+    case REC_TYPE_TSC_INFO:
+        return handle_tsc_info(ctx, rec);
+
+    default:
+        return RECORD_NOT_PROCESSED;
+    }
+}
+
+/*
+ * restore_ops function.  Update the vcpu context in Xen, pin the pagetables,
+ * rewrite the p2m and seed the grant table.
+ */
+static int x86_pv_stream_complete(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = update_vcpu_context(ctx);
+    if ( rc )
+        return rc;
+
+    rc = pin_pagetables(ctx);
+    if ( rc )
+        return rc;
+
+    rc = update_guest_p2m(ctx);
+    if ( rc )
+        return rc;
+
+    rc = xc_dom_gnttab_seed(xch, ctx->domid,
+                            ctx->restore.console_gfn,
+                            ctx->restore.xenstore_gfn,
+                            ctx->restore.console_domid,
+                            ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        return rc;
+    }
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_pv_cleanup(struct xc_sr_context *ctx)
+{
+    free(ctx->x86_pv.p2m);
+    free(ctx->x86_pv.p2m_pfns);
+
+    if ( ctx->x86_pv.restore.vcpus )
+    {
+        unsigned i;
+
+        for ( i = 0; i < ctx->x86_pv.restore.nr_vcpus; ++i )
+        {
+            struct xc_sr_x86_pv_restore_vcpu *vcpu =
+                &ctx->x86_pv.restore.vcpus[i];
+
+            free(vcpu->basic);
+            free(vcpu->extd);
+            free(vcpu->xsave);
+            free(vcpu->msr);
+        }
+
+        free(ctx->x86_pv.restore.vcpus);
+    }
+
+    free(ctx->x86_pv.restore.pfn_types);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return 0;
+}
+
+struct xc_sr_restore_ops restore_ops_x86_pv =
+{
+    .pfn_is_valid    = x86_pv_pfn_is_valid,
+    .pfn_to_gfn      = pfn_to_mfn,
+    .set_page_type   = x86_pv_set_page_type,
+    .set_gfn         = x86_pv_set_gfn,
+    .localise_page   = x86_pv_localise_page,
+    .setup           = x86_pv_setup,
+    .process_record  = x86_pv_process_record,
+    .stream_complete = x86_pv_stream_complete,
+    .cleanup         = x86_pv_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 10/15] tools/libxc: x86 HVM save code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (8 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 09/15] tools/libxc: x86 PV restore code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 11/15] tools/libxc: x86 HVM restore code Andrew Cooper
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Save the x86 HVM specific parts of the domain.  This is considerably simpler
than an x86 PV domain.  Only the HVM_CONTEXT and HVM_PARAMS records are
needed.

There is no need for any page normalisation.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/Makefile             |    1 +
 tools/libxc/xc_sr_common.h       |    1 +
 tools/libxc/xc_sr_save_x86_hvm.c |  220 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 222 insertions(+)
 create mode 100644 tools/libxc/xc_sr_save_x86_hvm.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index b301432..936512d 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -59,6 +59,7 @@ GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86_pv.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_restore_x86_pv.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_pv.c
+GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_hvm.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 7d3d54f..a8b5568 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -286,6 +286,7 @@ struct xc_sr_context
 };
 
 extern struct xc_sr_save_ops save_ops_x86_pv;
+extern struct xc_sr_save_ops save_ops_x86_hvm;
 
 extern struct xc_sr_restore_ops restore_ops_x86_pv;
 
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
new file mode 100644
index 0000000..0928f19
--- /dev/null
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -0,0 +1,220 @@
+#include <assert.h>
+
+#include "xc_sr_common_x86.h"
+
+#include <xen/hvm/params.h>
+
+/*
+ * Query for the HVM context and write an HVM_CONTEXT record into the stream.
+ */
+static int write_hvm_context(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc, hvm_buf_size;
+    struct xc_sr_record hvm_rec =
+    {
+        .type = REC_TYPE_HVM_CONTEXT,
+    };
+
+    hvm_buf_size = xc_domain_hvm_getcontext(xch, ctx->domid, 0, 0);
+    if ( hvm_buf_size < 0 )
+    {
+        PERROR("Couldn't get HVM context size from Xen");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_rec.data = malloc(hvm_buf_size);
+    if ( !hvm_rec.data )
+    {
+        PERROR("Couldn't allocate memory");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_buf_size = xc_domain_hvm_getcontext(xch, ctx->domid,
+                                            hvm_rec.data, hvm_buf_size);
+    if ( hvm_buf_size < 0 )
+    {
+        PERROR("Couldn't get HVM context from Xen");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_rec.length = hvm_buf_size;
+    rc = write_record(ctx, &hvm_rec);
+    if ( rc < 0 )
+    {
+        PERROR("Failed to write HVM_CONTEXT record");
+        goto out;
+    }
+
+ out:
+    free(hvm_rec.data);
+    return rc;
+}
+
+/*
+ * Query for a range of HVM parameters and write an HVM_PARAMS record into the
+ * stream.
+ */
+static int write_hvm_params(struct xc_sr_context *ctx)
+{
+    static const unsigned int params[] = {
+        HVM_PARAM_STORE_PFN,
+        HVM_PARAM_IOREQ_PFN,
+        HVM_PARAM_BUFIOREQ_PFN,
+        HVM_PARAM_PAGING_RING_PFN,
+        HVM_PARAM_MONITOR_RING_PFN,
+        HVM_PARAM_SHARING_RING_PFN,
+        HVM_PARAM_VM86_TSS,
+        HVM_PARAM_CONSOLE_PFN,
+        HVM_PARAM_ACPI_IOPORTS_LOCATION,
+        HVM_PARAM_VIRIDIAN,
+        HVM_PARAM_IDENT_PT,
+        HVM_PARAM_PAE_ENABLED,
+        HVM_PARAM_VM_GENERATION_ID_ADDR,
+        HVM_PARAM_IOREQ_SERVER_PFN,
+        HVM_PARAM_NR_IOREQ_SERVER_PAGES,
+    };
+
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_hvm_params_entry entries[ARRAY_SIZE(params)];
+    struct xc_sr_rec_hvm_params hdr = {
+        .count = 0,
+    };
+    struct xc_sr_record rec = {
+        .type   = REC_TYPE_HVM_PARAMS,
+        .length = sizeof(hdr),
+        .data   = &hdr,
+    };
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < ARRAY_SIZE(params); i++ )
+    {
+        uint32_t index = params[i];
+        uint64_t value;
+
+        rc = xc_hvm_param_get(xch, ctx->domid, index, &value);
+        if ( rc )
+        {
+            PERROR("Failed to get HVMPARAM at index %u", index);
+            return rc;
+        }
+
+        if ( value != 0 )
+        {
+            entries[hdr.count].index = index;
+            entries[hdr.count].value = value;
+            hdr.count++;
+        }
+    }
+
+    rc = write_split_record(ctx, &rec, entries, hdr.count * sizeof(*entries));
+    if ( rc )
+        PERROR("Failed to write HVM_PARAMS record");
+
+    return rc;
+}
+
+static xen_pfn_t x86_hvm_pfn_to_gfn(const struct xc_sr_context *ctx,
+                                    xen_pfn_t pfn)
+{
+    /* identity map */
+    return pfn;
+}
+
+static int x86_hvm_normalise_page(struct xc_sr_context *ctx,
+                                  xen_pfn_t type, void **page)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_setup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( !ctx->save.callbacks->switch_qemu_logdirty )
+    {
+        ERROR("No switch_qemu_logdirty callback provided");
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ( ctx->save.callbacks->switch_qemu_logdirty(
+             ctx->domid, 1, ctx->save.callbacks->data) )
+    {
+        PERROR("Couldn't enable qemu log-dirty mode");
+        return -1;
+    }
+
+    ctx->x86_hvm.save.qemu_enabled_logdirty = true;
+
+    return 0;
+}
+
+static int x86_hvm_start_of_stream(struct xc_sr_context *ctx)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_end_of_stream(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    /* Write the TSC record. */
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        return rc;
+
+    /* Write the HVM_CONTEXT record. */
+    rc = write_hvm_context(ctx);
+    if ( rc )
+        return rc;
+
+    /* Write an HVM_PARAMS record containing the applicable HVM params. */
+    rc = write_hvm_params(ctx);
+    if ( rc )
+        return rc;
+
+    return rc;
+}
+
+static int x86_hvm_cleanup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    /* If qemu successfully enabled logdirty mode, attempt to disable. */
+    if ( ctx->x86_hvm.save.qemu_enabled_logdirty &&
+         ctx->save.callbacks->switch_qemu_logdirty(
+             ctx->domid, 0, ctx->save.callbacks->data) )
+    {
+        PERROR("Couldn't disable qemu log-dirty mode");
+        return -1;
+    }
+
+    return 0;
+}
+
+struct xc_sr_save_ops save_ops_x86_hvm =
+{
+    .pfn_to_gfn      = x86_hvm_pfn_to_gfn,
+    .normalise_page  = x86_hvm_normalise_page,
+    .setup           = x86_hvm_setup,
+    .start_of_stream = x86_hvm_start_of_stream,
+    .end_of_stream   = x86_hvm_end_of_stream,
+    .cleanup         = x86_hvm_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4
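The write_hvm_params() loop above only emits parameters whose value is non-zero, so the restore side never sets parameters the domain was not using. A rough sketch of just that filtering step in isolation (the struct and function names here are invented for illustration; only the entry layout mirrors xc_sr_rec_hvm_params_entry):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors xc_sr_rec_hvm_params_entry: one (index, value) pair per param. */
struct params_entry {
    uint64_t index;
    uint64_t value;
};

/*
 * Copy only the non-zero parameters from values[] into entries[],
 * returning the number of entries written (the record's hdr.count).
 * 'indices' and 'values' are parallel arrays of length n; entries[]
 * must have room for n elements.
 */
static size_t filter_hvm_params(const uint32_t *indices,
                                const uint64_t *values,
                                size_t n, struct params_entry *entries)
{
    size_t count = 0, i;

    for ( i = 0; i < n; i++ )
    {
        if ( values[i] != 0 )
        {
            entries[count].index = indices[i];
            entries[count].value = values[i];
            count++;
        }
    }

    return count;
}
```

Because hdr.count is only incremented for non-zero values, the record length written later (count * sizeof entry) shrinks to match, so no zero padding entries travel in the stream.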

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 11/15] tools/libxc: x86 HVM restore code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (9 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 10/15] tools/libxc: x86 HVM save code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 12/15] tools/libxc: common save code Andrew Cooper
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Restore the x86 HVM specific parts of a domain.  These are the HVM_CONTEXT and
HVM_PARAMS records.

There is no need for any page localisation.

This also includes writing the trailing qemu save record to a file because
this is what libxc currently does.  This is intended to be moved into libxl
proper in the future.
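Both of these records travel in the generic framing that write_record() emits: a 4-byte type, a 4-byte length, then the body padded with zeroes to an 8 octet boundary (per the stream specification introduced earlier in this series). A small sketch of the on-the-wire size calculation, with the macro and function names invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define REC_HDR_SIZE 8u  /* 4-byte type + 4-byte length */

/* Body is zero-padded up to the next multiple of 8 octets. */
static uint64_t rec_padded_body(uint64_t length)
{
    return (length + 7) & ~(uint64_t)7;
}

/* Total bytes a record occupies in the stream. */
static uint64_t rec_wire_size(uint64_t length)
{
    return REC_HDR_SIZE + rec_padded_body(length);
}
```

So an HVM_PARAMS record with a 13-byte body would occupy 24 bytes on the wire: 8 bytes of header plus the body rounded up to 16.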

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
v10: Set CONSOLE_EVTCHN parameter
---
 tools/libxc/Makefile                |    1 +
 tools/libxc/xc_sr_common.h          |    1 +
 tools/libxc/xc_sr_restore_x86_hvm.c |  233 +++++++++++++++++++++++++++++++++++
 3 files changed, 235 insertions(+)
 create mode 100644 tools/libxc/xc_sr_restore_x86_hvm.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 936512d..76f63a1 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -58,6 +58,7 @@ GUEST_SRCS-y += xc_sr_common.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_common_x86_pv.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_restore_x86_pv.c
+GUEST_SRCS-$(CONFIG_X86) += xc_sr_restore_x86_hvm.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_pv.c
 GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_hvm.c
 GUEST_SRCS-y += xc_sr_restore.c
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index a8b5568..ef42412 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -289,6 +289,7 @@ struct xc_sr_context
 extern struct xc_sr_save_ops save_ops_x86_hvm;
 
 extern struct xc_sr_restore_ops restore_ops_x86_pv;
+extern struct xc_sr_restore_ops restore_ops_x86_hvm;
 
 struct xc_sr_record
 {
diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
new file mode 100644
index 0000000..49d22c7
--- /dev/null
+++ b/tools/libxc/xc_sr_restore_x86_hvm.c
@@ -0,0 +1,233 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
+#include "xc_sr_common_x86.h"
+
+/*
+ * Process an HVM_CONTEXT record from the stream.
+ */
+static int handle_hvm_context(struct xc_sr_context *ctx,
+                              struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    void *p;
+
+    p = malloc(rec->length);
+    if ( !p )
+    {
+        ERROR("Unable to allocate %u bytes for hvm context", rec->length);
+        return -1;
+    }
+
+    free(ctx->x86_hvm.restore.context);
+
+    ctx->x86_hvm.restore.context = memcpy(p, rec->data, rec->length);
+    ctx->x86_hvm.restore.contextsz = rec->length;
+
+    return 0;
+}
+
+/*
+ * Process an HVM_PARAMS record from the stream.
+ */
+static int handle_hvm_params(struct xc_sr_context *ctx,
+                             struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_hvm_params *hdr = rec->data;
+    struct xc_sr_rec_hvm_params_entry *entry = hdr->param;
+    unsigned int i;
+    int rc;
+
+    if ( rec->length < sizeof(*hdr)
+         || rec->length < sizeof(*hdr) + hdr->count * sizeof(*entry) )
+    {
+        ERROR("hvm_params record is too short");
+        return -1;
+    }
+
+    for ( i = 0; i < hdr->count; i++, entry++ )
+    {
+        switch ( entry->index )
+        {
+        case HVM_PARAM_CONSOLE_PFN:
+            ctx->restore.console_gfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_STORE_PFN:
+            ctx->restore.xenstore_gfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_IOREQ_PFN:
+        case HVM_PARAM_BUFIOREQ_PFN:
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        }
+
+        rc = xc_hvm_param_set(xch, ctx->domid, entry->index, entry->value);
+        if ( rc < 0 )
+        {
+            PERROR("set HVM param %"PRIu64" = 0x%016"PRIx64,
+                   entry->index, entry->value);
+            return rc;
+        }
+    }
+    return 0;
+}
+
+/* restore_ops function. */
+static bool x86_hvm_pfn_is_valid(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    return true;
+}
+
+/* restore_ops function. */
+static xen_pfn_t x86_hvm_pfn_to_gfn(const struct xc_sr_context *ctx,
+                                    xen_pfn_t pfn)
+{
+    return pfn;
+}
+
+/* restore_ops function. */
+static void x86_hvm_set_gfn(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                            xen_pfn_t gfn)
+{
+    /* no op */
+}
+
+/* restore_ops function. */
+static void x86_hvm_set_page_type(struct xc_sr_context *ctx,
+                                  xen_pfn_t pfn, xen_pfn_t type)
+{
+    /* no-op */
+}
+
+/* restore_ops function. */
+static int x86_hvm_localise_page(struct xc_sr_context *ctx,
+                                 uint32_t type, void *page)
+{
+    /* no-op */
+    return 0;
+}
+
+/*
+ * restore_ops function. Confirms the stream matches the domain.
+ */
+static int x86_hvm_setup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_X86_HVM )
+    {
+        ERROR("Unable to restore %s domain into an x86_hvm domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != PAGE_SIZE )
+    {
+        ERROR("Invalid page size %u for x86_hvm domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_hvm_process_record(struct xc_sr_context *ctx,
+                                  struct xc_sr_record *rec)
+{
+    switch ( rec->type )
+    {
+    case REC_TYPE_TSC_INFO:
+        return handle_tsc_info(ctx, rec);
+
+    case REC_TYPE_HVM_CONTEXT:
+        return handle_hvm_context(ctx, rec);
+
+    case REC_TYPE_HVM_PARAMS:
+        return handle_hvm_params(ctx, rec);
+
+    default:
+        return RECORD_NOT_PROCESSED;
+    }
+}
+
+/*
+ * restore_ops function.  Sets extra hvm parameters and seeds the grant table.
+ */
+static int x86_hvm_stream_complete(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = xc_hvm_param_set(xch, ctx->domid, HVM_PARAM_STORE_EVTCHN,
+                          ctx->restore.xenstore_evtchn);
+    if ( rc )
+    {
+        PERROR("Failed to set HVM_PARAM_STORE_EVTCHN");
+        return rc;
+    }
+
+    rc = xc_hvm_param_set(xch, ctx->domid, HVM_PARAM_CONSOLE_EVTCHN,
+                          ctx->restore.console_evtchn);
+    if ( rc )
+    {
+        PERROR("Failed to set HVM_PARAM_CONSOLE_EVTCHN");
+        return rc;
+    }
+
+    rc = xc_domain_hvm_setcontext(xch, ctx->domid,
+                                  ctx->x86_hvm.restore.context,
+                                  ctx->x86_hvm.restore.contextsz);
+    if ( rc < 0 )
+    {
+        PERROR("Unable to restore HVM context");
+        return rc;
+    }
+
+    rc = xc_dom_gnttab_hvm_seed(xch, ctx->domid,
+                                ctx->restore.console_gfn,
+                                ctx->restore.xenstore_gfn,
+                                ctx->restore.console_domid,
+                                ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        return rc;
+    }
+
+    return rc;
+}
+
+static int x86_hvm_cleanup(struct xc_sr_context *ctx)
+{
+    free(ctx->x86_hvm.restore.context);
+
+    return 0;
+}
+
+struct xc_sr_restore_ops restore_ops_x86_hvm =
+{
+    .pfn_is_valid    = x86_hvm_pfn_is_valid,
+    .pfn_to_gfn      = x86_hvm_pfn_to_gfn,
+    .set_gfn         = x86_hvm_set_gfn,
+    .set_page_type   = x86_hvm_set_page_type,
+    .localise_page   = x86_hvm_localise_page,
+    .setup           = x86_hvm_setup,
+    .process_record  = x86_hvm_process_record,
+    .stream_complete = x86_hvm_stream_complete,
+    .cleanup         = x86_hvm_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4
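handle_hvm_params() above validates the record length in two steps: first that the fixed header fits (so hdr->count can safely be read at all), and only then that count entries also fit. The same shape in isolation, with illustrative stand-in sizes rather than the real sizeof values:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_SIZE   8u   /* stand-in for sizeof(struct xc_sr_rec_hvm_params) */
#define ENTRY_SIZE 16u  /* stand-in for sizeof the params entry struct */

/*
 * Returns true if a record of 'length' bytes can hold the fixed header
 * plus 'count' entries.  The header check must come first: 'count' is
 * read out of the record itself, so it is only trustworthy once we know
 * the header is fully present.
 */
static bool params_record_fits(uint64_t length, uint64_t count)
{
    if ( length < HDR_SIZE )
        return false;

    return length >= HDR_SIZE + count * ENTRY_SIZE;
}
```

This ordering is the reason the patch uses two comparisons joined by || rather than a single combined inequality.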


* [PATCH v10 12/15] tools/libxc: common save code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (10 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 11/15] tools/libxc: x86 HVM restore code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 13/15] tools/libxc: common restore code Andrew Cooper
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Save a domain, calling domain type specific functions at the appropriate
points.  This implements the xc_domain_save2() API function, which is
equivalent to the existing xc_domain_save().

This writes the image and domain headers, and writes all the PAGE_DATA records
using a "live" process.
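The domain header written by write_headers() in this patch encodes the Xen version obtained from xc_version(XENVER_version, NULL), which packs major and minor into one 32-bit value; the patch splits it with a shift and mask. That split in isolation (helper names invented here):

```c
#include <assert.h>
#include <stdint.h>

/* Split the packed XENVER_version value, as write_headers() does. */
static uint16_t xen_major(int32_t xen_version)
{
    return (xen_version >> 16) & 0xffff;
}

static uint16_t xen_minor(int32_t xen_version)
{
    return xen_version & 0xffff;
}
```

For example, a Xen 4.5 hypervisor reports 0x00040005, which splits into major 4 and minor 5 for the dhdr.xen_major/xen_minor fields.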

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/xc_sr_save.c |  777 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 776 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 3c34a6b..5d9c267 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -1,11 +1,786 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
 #include "xc_sr_common.h"
 
+/*
+ * Writes an Image header and Domain header into the stream.
+ */
+static int write_headers(struct xc_sr_context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
+    struct xc_sr_ihdr ihdr =
+        {
+            .marker  = IHDR_MARKER,
+            .id      = htonl(IHDR_ID),
+            .version = htonl(IHDR_VERSION),
+            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
+        };
+    struct xc_sr_dhdr dhdr =
+        {
+            .type       = guest_type,
+            .page_shift = XC_PAGE_SHIFT,
+            .xen_major  = (xen_version >> 16) & 0xffff,
+            .xen_minor  = (xen_version)       & 0xffff,
+        };
+
+    if ( xen_version < 0 )
+    {
+        PERROR("Unable to obtain Xen Version");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
+    {
+        PERROR("Unable to write Image Header to stream");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
+    {
+        PERROR("Unable to write Domain Header to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Writes an END record into the stream.
+ */
+static int write_end_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record end = { REC_TYPE_END, 0, NULL };
+
+    return write_record(ctx, &end);
+}
+
+/*
+ * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
+ * is constructed in ctx->save.batch_pfns.
+ *
+ * This function:
+ * - gets the types for each pfn in the batch.
+ * - for each pfn with real data:
+ *   - maps and attempts to localise the pages.
+ * - construct and writes a PAGE_DATA record into the stream.
+ */
+static int write_batch(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = NULL, *types = NULL;
+    void *guest_mapping = NULL;
+    void **guest_data = NULL;
+    void **local_pages = NULL;
+    int *errors = NULL, rc = -1;
+    unsigned i, p, nr_pages = 0;
+    unsigned nr_pfns = ctx->save.nr_batch_pfns;
+    void *page, *orig_page;
+    uint64_t *rec_pfns = NULL;
+    struct iovec *iov = NULL;
+    int iovcnt = 0;
+    struct xc_sr_rec_page_data_header hdr = { 0 };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_PAGE_DATA,
+    };
+
+    assert(nr_pfns != 0);
+
+    /* Mfns of the batch pfns. */
+    mfns = malloc(nr_pfns * sizeof(*mfns));
+    /* Types of the batch pfns. */
+    types = malloc(nr_pfns * sizeof(*types));
+    /* Errors from attempting to map the gfns. */
+    errors = malloc(nr_pfns * sizeof(*errors));
+    /* Pointers to page data to send.  Mapped gfns or local allocations. */
+    guest_data = calloc(nr_pfns, sizeof(*guest_data));
+    /* Pointers to locally allocated pages.  Need freeing. */
+    local_pages = calloc(nr_pfns, sizeof(*local_pages));
+    /* iovec[] for writev(). */
+    iov = malloc((nr_pfns + 4) * sizeof(*iov));
+
+    if ( !mfns || !types || !errors || !guest_data || !local_pages || !iov )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
+                                                      ctx->save.batch_pfns[i]);
+
+        /* Likely a ballooned page. */
+        if ( mfns[i] == INVALID_MFN )
+        {
+            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+            ++ctx->save.nr_deferred_pages;
+        }
+    }
+
+    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
+    if ( rc )
+    {
+        PERROR("Failed to get types for pfn batch");
+        goto err;
+    }
+    rc = -1;
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        mfns[nr_pages++] = mfns[i];
+    }
+
+    if ( nr_pages > 0 )
+    {
+        guest_mapping = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
+        if ( !guest_mapping )
+        {
+            PERROR("Failed to map guest pages");
+            goto err;
+        }
+
+        for ( i = 0, p = 0; i < nr_pfns; ++i )
+        {
+            switch ( types[i] )
+            {
+            case XEN_DOMCTL_PFINFO_BROKEN:
+            case XEN_DOMCTL_PFINFO_XALLOC:
+            case XEN_DOMCTL_PFINFO_XTAB:
+                continue;
+            }
+
+            if ( errors[p] )
+            {
+                ERROR("Mapping of pfn %#lx (mfn %#lx) failed %d",
+                      ctx->save.batch_pfns[i], mfns[p], errors[p]);
+                goto err;
+            }
+
+            orig_page = page = guest_mapping + (p * PAGE_SIZE);
+            rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
+
+            if ( orig_page != page )
+                local_pages[i] = page;
+
+            if ( rc )
+            {
+                if ( rc == -1 && errno == EAGAIN )
+                {
+                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+                    ++ctx->save.nr_deferred_pages;
+                    types[i] = XEN_DOMCTL_PFINFO_XTAB;
+                    --nr_pages;
+                }
+                else
+                    goto err;
+            }
+            else
+                guest_data[i] = page;
+
+            rc = -1;
+            ++p;
+        }
+    }
+
+    rec_pfns = malloc(nr_pfns * sizeof(*rec_pfns));
+    if ( !rec_pfns )
+    {
+        ERROR("Unable to allocate %zu bytes of memory for page data pfn list",
+              nr_pfns * sizeof(*rec_pfns));
+        goto err;
+    }
+
+    hdr.count = nr_pfns;
+
+    rec.length = sizeof(hdr);
+    rec.length += nr_pfns * sizeof(*rec_pfns);
+    rec.length += nr_pages * PAGE_SIZE;
+
+    for ( i = 0; i < nr_pfns; ++i )
+        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.batch_pfns[i];
+
+    iov[0].iov_base = &rec.type;
+    iov[0].iov_len = sizeof(rec.type);
+
+    iov[1].iov_base = &rec.length;
+    iov[1].iov_len = sizeof(rec.length);
+
+    iov[2].iov_base = &hdr;
+    iov[2].iov_len = sizeof(hdr);
+
+    iov[3].iov_base = rec_pfns;
+    iov[3].iov_len = nr_pfns * sizeof(*rec_pfns);
+
+    iovcnt = 4;
+
+    if ( nr_pages )
+    {
+        for ( i = 0; i < nr_pfns; ++i )
+        {
+            if ( guest_data[i] )
+            {
+                iov[iovcnt].iov_base = guest_data[i];
+                iov[iovcnt].iov_len = PAGE_SIZE;
+                iovcnt++;
+                --nr_pages;
+            }
+        }
+    }
+
+    if ( writev_exact(ctx->fd, iov, iovcnt) )
+    {
+        PERROR("Failed to write page data to stream");
+        goto err;
+    }
+
+    /* Sanity check we have sent all the pages we expected to. */
+    assert(nr_pages == 0);
+    rc = ctx->save.nr_batch_pfns = 0;
+
+ err:
+    free(rec_pfns);
+    if ( guest_mapping )
+        munmap(guest_mapping, nr_pages * PAGE_SIZE);
+    for ( i = 0; local_pages && i < nr_pfns; ++i )
+        free(local_pages[i]);
+    free(iov);
+    free(local_pages);
+    free(guest_data);
+    free(errors);
+    free(types);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Flush a batch of pfns into the stream.
+ */
+static int flush_batch(struct xc_sr_context *ctx)
+{
+    int rc = 0;
+
+    if ( ctx->save.nr_batch_pfns == 0 )
+        return rc;
+
+    rc = write_batch(ctx);
+
+    if ( !rc )
+    {
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+                                    MAX_BATCH_SIZE *
+                                    sizeof(*ctx->save.batch_pfns));
+    }
+
+    return rc;
+}
+
+/*
+ * Add a single pfn to the batch, flushing the batch if full.
+ */
+static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    int rc = 0;
+
+    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
+        rc = flush_batch(ctx);
+
+    if ( rc == 0 )
+        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
+
+    return rc;
+}
+
+/*
+ * Pause/suspend the domain, and refresh ctx->dominfo if required.
+ */
+static int suspend_domain(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    /* TODO: Properly specify the return value from this callback.  All
+     * implementations currently appear to return 1 for success, whereas
+     * the legacy code checks for != 0. */
+    int cb_rc = ctx->save.callbacks->suspend(ctx->save.callbacks->data);
+
+    if ( cb_rc == 0 )
+    {
+        ERROR("save callback suspend() failed: %d", cb_rc);
+        return -1;
+    }
+
+    /* Refresh domain information. */
+    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
+         (ctx->dominfo.domid != ctx->domid) )
+    {
+        PERROR("Unable to refresh domain information");
+        return -1;
+    }
+
+    /* Confirm the domain has actually been paused. */
+    if ( !ctx->dominfo.shutdown ||
+         (ctx->dominfo.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain has not been suspended: shutdown %d, reason %d",
+              ctx->dominfo.shutdown, ctx->dominfo.shutdown_reason);
+        return -1;
+    }
+
+    xc_report_progress_single(xch, "Domain now suspended");
+
+    return 0;
+}
+
+/*
+ * Send all pages in the guest's p2m.  Used as the first iteration of the live
+ * migration loop, and for a non-live save.
+ */
+static int send_all_pages(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t p;
+    int rc;
+
+    for ( p = 0; p < ctx->save.p2m_size; ++p )
+    {
+        rc = add_to_batch(ctx, p);
+        if ( rc )
+            return rc;
+
+        /* Update progress every 4MB worth of memory sent. */
+        if ( (p & ((1U << (22 - 12)) - 1)) == 0 )
+            xc_report_progress_step(xch, p, ctx->save.p2m_size);
+    }
+
+    rc = flush_batch(ctx);
+    if ( rc )
+        return rc;
+
+    xc_report_progress_step(xch, ctx->save.p2m_size,
+                            ctx->save.p2m_size);
+    return 0;
+}
+
+/*
+ * Send a subset of pages in the guest's p2m, according to the provided bitmap.
+ * Used for each subsequent iteration of the live migration loop.
+ *
+ * Bitmap is bounded by p2m_size.
+ */
+static int send_some_pages(struct xc_sr_context *ctx,
+                           unsigned long *bitmap,
+                           unsigned long entries)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t p;
+    unsigned long written;
+    int rc;
+
+    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
+    {
+        if ( !test_bit(p, bitmap) )
+            continue;
+
+        rc = add_to_batch(ctx, p);
+        if ( rc )
+            return rc;
+
+        /* Update progress every 4MB worth of memory sent. */
+        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
+            xc_report_progress_step(xch, written, entries);
+
+        ++written;
+    }
+
+    rc = flush_batch(ctx);
+    if ( rc )
+        return rc;
+
+    if ( written > entries )
+        DPRINTF("Bitmap contained more entries than expected...");
+
+    xc_report_progress_step(xch, entries, entries);
+    return 0;
+}
+
+static int enable_logdirty(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int on1 = 0, off = 0, on2 = 0;
+    int rc;
+
+    /*
+     * This juggling is required if logdirty is already enabled for VRAM
+     * tracking.
+     */
+    rc = xc_shadow_control(xch, ctx->domid,
+                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                           NULL, 0, NULL, 0, NULL);
+    if ( rc < 0 )
+    {
+        on1 = errno;
+        rc = xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                               NULL, 0, NULL, 0, NULL);
+        if ( rc < 0 )
+            off = errno;
+        else
+        {
+            rc = xc_shadow_control(xch, ctx->domid,
+                                   XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                                   NULL, 0, NULL, 0, NULL);
+            if ( rc < 0 )
+                on2 = errno;
+        }
+        if ( rc < 0 )
+        {
+            PERROR("Failed to enable logdirty: %d,%d,%d", on1, off, on2);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+static int update_progress_string(struct xc_sr_context *ctx,
+                                  char **str, unsigned iter)
+{
+    xc_interface *xch = ctx->xch;
+    char *new_str = NULL;
+
+    if ( asprintf(&new_str, "Memory iteration %u of %u",
+                  iter, ctx->save.max_iterations) == -1 )
+    {
+        PERROR("Unable to allocate new progress string");
+        return -1;
+    }
+
+    free(*str);
+    *str = new_str;
+
+    xc_set_progress_prefix(xch, *str);
+    return 0;
+}
+
+/*
+ * Send all domain memory.  This is the heart of the live migration loop.
+ */
+static int send_domain_memory_live(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
+    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
+    char *progress_str = NULL;
+    unsigned x;
+    int rc = -1;
+
+    to_send = xc_hypercall_buffer_alloc_pages(
+        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
+
+    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
+                                  sizeof(*ctx->save.batch_pfns));
+    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
+
+    if ( !ctx->save.batch_pfns || !to_send || !ctx->save.deferred_pages )
+    {
+        ERROR("Unable to allocate memory for to_send/deferred/batch bitmaps");
+        goto out;
+    }
+
+    rc = enable_logdirty(ctx);
+    if ( rc )
+        goto out;
+
+    rc = update_progress_string(ctx, &progress_str, 0);
+    if ( rc )
+        goto out;
+
+    rc = send_all_pages(ctx);
+    if ( rc )
+        goto out;
+
+    for ( x = 1;
+          ((x < ctx->save.max_iterations) &&
+           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
+    {
+        if ( xc_shadow_control(
+                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                 HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+                 NULL, 0, &stats) != ctx->save.p2m_size )
+        {
+            PERROR("Failed to retrieve logdirty bitmap");
+            rc = -1;
+            goto out;
+        }
+
+        if ( stats.dirty_count == 0 )
+            break;
+
+        rc = update_progress_string(ctx, &progress_str, x);
+        if ( rc )
+            goto out;
+
+        rc = send_some_pages(ctx, to_send, stats.dirty_count);
+        if ( rc )
+            goto out;
+    }
+
+    rc = suspend_domain(ctx);
+    if ( rc )
+        goto out;
+
+    if ( xc_shadow_control(
+             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+             NULL, 0, &stats) != ctx->save.p2m_size )
+    {
+        PERROR("Failed to retrieve logdirty bitmap");
+        rc = -1;
+        goto out;
+    }
+
+    rc = update_progress_string(ctx, &progress_str, ctx->save.max_iterations);
+    if ( rc )
+        goto out;
+
+    bitmap_or(to_send, ctx->save.deferred_pages, ctx->save.p2m_size);
+
+    rc = send_some_pages(ctx, to_send,
+                         stats.dirty_count + ctx->save.nr_deferred_pages);
+    if ( rc )
+        goto out;
+
+    if ( ctx->save.debug )
+    {
+        struct xc_sr_record rec =
+        {
+            .type = REC_TYPE_VERIFY,
+            .length = 0,
+        };
+
+        DPRINTF("Enabling verify mode");
+
+        rc = write_record(ctx, &rec);
+        if ( rc )
+            goto out;
+
+        xc_set_progress_prefix(xch, "Memory verify");
+        rc = send_all_pages(ctx);
+        if ( rc )
+            goto out;
+
+        if ( xc_shadow_control(
+                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_PEEK,
+                 HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+                 NULL, 0, &stats) != ctx->save.p2m_size )
+        {
+            PERROR("Failed to retrieve logdirty bitmap");
+            rc = -1;
+            goto out;
+        }
+
+        DPRINTF("  Further stats: faults %u, dirty %u",
+                stats.fault_count, stats.dirty_count);
+    }
+
+  out:
+    xc_set_progress_prefix(xch, NULL);
+    free(progress_str);
+    xc_hypercall_buffer_free_pages(xch, to_send,
+                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
+    free(ctx->save.deferred_pages);
+    free(ctx->save.batch_pfns);
+    return rc;
+}
+
+/*
+ * Send all domain memory, pausing the domain first.  Generally used for
+ * suspend-to-file.
+ */
+static int send_domain_memory_nonlive(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+
+    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
+                                  sizeof(*ctx->save.batch_pfns));
+    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
+
+    if ( !ctx->save.batch_pfns || !ctx->save.deferred_pages )
+    {
+        PERROR("Failed to allocate memory for nonlive tracking structures");
+        errno = ENOMEM;
+        goto err;
+    }
+
+    rc = suspend_domain(ctx);
+    if ( rc )
+        goto err;
+
+    xc_set_progress_prefix(xch, "Memory");
+
+    rc = send_all_pages(ctx);
+    if ( rc )
+        goto err;
+
+ err:
+    free(ctx->save.deferred_pages);
+    free(ctx->save.batch_pfns);
+
+    return rc;
+}
+
+/*
+ * Save a domain.
+ */
+static int save(struct xc_sr_context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int rc, saved_rc = 0, saved_errno = 0;
+
+    IPRINTF("Saving domain %d, type %s",
+            ctx->domid, dhdr_type_to_str(guest_type));
+
+    rc = ctx->save.ops.setup(ctx);
+    if ( rc )
+        goto err;
+
+    xc_report_progress_single(xch, "Start of stream");
+
+    rc = write_headers(ctx, guest_type);
+    if ( rc )
+        goto err;
+
+    rc = ctx->save.ops.start_of_stream(ctx);
+    if ( rc )
+        goto err;
+
+    if ( ctx->save.live )
+        rc = send_domain_memory_live(ctx);
+    else
+        rc = send_domain_memory_nonlive(ctx);
+
+    if ( rc )
+        goto err;
+
+    if ( !ctx->dominfo.shutdown ||
+         (ctx->dominfo.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain has not been suspended");
+        rc = -1;
+        goto err;
+    }
+
+    xc_report_progress_single(xch, "End of stream");
+
+    rc = ctx->save.ops.end_of_stream(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_end_record(ctx);
+    if ( rc )
+        goto err;
+
+    xc_report_progress_single(xch, "Complete");
+    goto done;
+
+ err:
+    saved_errno = errno;
+    saved_rc = rc;
+    PERROR("Save failed");
+
+ done:
+    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                      NULL, 0, NULL, 0, NULL);
+
+    rc = ctx->save.ops.cleanup(ctx);
+    if ( rc )
+        PERROR("Failed to clean up");
+
+    if ( saved_rc )
+    {
+        rc = saved_rc;
+        errno = saved_errno;
+    }
+
+    return rc;
+}
+
 int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom,
                     uint32_t max_iters, uint32_t max_factor, uint32_t flags,
                     struct save_callbacks* callbacks, int hvm)
 {
+    xen_pfn_t nr_pfns;
+    struct xc_sr_context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions. */
+    ctx.save.callbacks = callbacks;
+    ctx.save.live  = !!(flags & XCFLAGS_LIVE);
+    ctx.save.debug = !!(flags & XCFLAGS_DEBUG);
+
+    /*
+     * TODO: Find some time to better tweak the live migration algorithm.
+     *
+     * These parameters are better than the legacy algorithm especially for
+     * busy guests.
+     */
+    ctx.save.max_iterations = 5;
+    ctx.save.dirty_threshold = 50;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+    DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
+            io_fd, dom, max_iters, max_factor, flags, hvm);
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %u does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+
+    if ( xc_domain_nr_gpfns(xch, dom, &nr_pfns) < 0 )
+    {
+        PERROR("Unable to obtain the guest p2m size");
+        return -1;
+    }
+
+    ctx.save.p2m_size = nr_pfns;
+
+    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
+    {
+        errno = E2BIG;
+        ERROR("Cannot save this big a guest");
+        return -1;
+    }
+
+    if ( ctx.dominfo.hvm )
+    {
+        ctx.save.ops = save_ops_x86_hvm;
+        return save(&ctx, DHDR_TYPE_X86_HVM);
+    }
+    else
+    {
+        ctx.save.ops = save_ops_x86_pv;
+        return save(&ctx, DHDR_TYPE_X86_PV);
+    }
 }
 
 /*
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 13/15] tools/libxc: common restore code
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (11 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 12/15] tools/libxc: common save code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-04-23 11:48 ` [PATCH v10 14/15] docs: libxc migration stream specification Andrew Cooper
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

Restore a domain from the new format.  This reads and validates the domain and
image header and loads the guest memory from the PAGE_DATA records, populating
the p2m as it does so.

This provides the xc_domain_restore2() function as an alternative to the
existing xc_domain_restore().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/xc_sr_restore.c |  507 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 503 insertions(+), 4 deletions(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 045f486..0bf4bae 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -1,6 +1,133 @@
+#include <arpa/inet.h>
+
 #include "xc_sr_common.h"
 
 /*
+ * Read and validate the Image and Domain headers.
+ */
+static int read_headers(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_ihdr ihdr;
+    struct xc_sr_dhdr dhdr;
+
+    if ( read_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
+    {
+        PERROR("Failed to read Image Header from stream");
+        return -1;
+    }
+
+    ihdr.id      = ntohl(ihdr.id);
+    ihdr.version = ntohl(ihdr.version);
+    ihdr.options = ntohs(ihdr.options);
+
+    if ( ihdr.marker != IHDR_MARKER )
+    {
+        ERROR("Invalid marker: Got 0x%016"PRIx64, ihdr.marker);
+        return -1;
+    }
+    else if ( ihdr.id != IHDR_ID )
+    {
+        ERROR("Invalid ID: Expected 0x%08x, Got 0x%08x", IHDR_ID, ihdr.id);
+        return -1;
+    }
+    else if ( ihdr.version != IHDR_VERSION )
+    {
+        ERROR("Invalid Version: Expected %d, Got %d",
+              IHDR_VERSION, ihdr.version);
+        return -1;
+    }
+    else if ( ihdr.options & IHDR_OPT_BIG_ENDIAN )
+    {
+        ERROR("Unable to handle big endian streams");
+        return -1;
+    }
+
+    ctx->restore.format_version = ihdr.version;
+
+    if ( read_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
+    {
+        PERROR("Failed to read Domain Header from stream");
+        return -1;
+    }
+
+    ctx->restore.guest_type = dhdr.type;
+    ctx->restore.guest_page_size = (1U << dhdr.page_shift);
+
+    if ( dhdr.xen_major == 0 )
+    {
+        IPRINTF("Found %s domain, converted from legacy stream format",
+                dhdr_type_to_str(dhdr.type));
+        DPRINTF("  Legacy conversion script version %u", dhdr.xen_minor);
+    }
+    else
+        IPRINTF("Found %s domain from Xen %u.%u",
+                dhdr_type_to_str(dhdr.type), dhdr.xen_major, dhdr.xen_minor);
+    return 0;
+}
+
+/*
+ * Reads a record from the stream, and fills in the record structure.
+ *
+ * Returns 0 on success and non-0 on failure.
+ *
+ * On success, the record's type and size shall be valid.
+ * - If size is 0, data shall be NULL.
+ * - If size is non-0, data shall be a buffer allocated by malloc() which must
+ *   be passed to free() by the caller.
+ *
+ * On failure, the contents of the record structure are undefined.
+ */
+static int read_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rhdr rhdr;
+    size_t datasz;
+
+    if ( read_exact(ctx->fd, &rhdr, sizeof(rhdr)) )
+    {
+        PERROR("Failed to read Record Header from stream");
+        return -1;
+    }
+    else if ( rhdr.length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rhdr.type,
+              rec_type_to_str(rhdr.type), rhdr.length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
+
+    if ( datasz )
+    {
+        rec->data = malloc(datasz);
+
+        if ( !rec->data )
+        {
+            ERROR("Unable to allocate %zu bytes for record data (0x%08x, %s)",
+                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+
+        if ( read_exact(ctx->fd, rec->data, datasz) )
+        {
+            free(rec->data);
+            rec->data = NULL;
+            PERROR("Failed to read %zu bytes of data for record (0x%08x, %s)",
+                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+    }
+    else
+        rec->data = NULL;
+
+    rec->type   = rhdr.type;
+    rec->length = rhdr.length;
+
+    return 0;
+}
+
+/*
  * Is a pfn populated?
  */
 static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
@@ -12,7 +139,7 @@ static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
 
 /*
  * Set a pfn as populated, expanding the tracking structures if needed. To
- * avoid realloc()ing too excessivly, the size increased to the nearest power
+ * avoid realloc()ing too excessively, the size is increased to the nearest power
  * of two large enough to contain the required pfn.
  */
 static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
@@ -59,7 +186,7 @@ static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
 
 /*
  * Given a set of pfns, obtain memory from Xen to fill the physmap for the
- * unpopulated subset.  If types is NULL, no page typechecking is performed
+ * unpopulated subset.  If types is NULL, no page type checking is performed
  * and all unpopulated pfns are populated.
  */
 int populate_pfns(struct xc_sr_context *ctx, unsigned count,
@@ -125,16 +252,388 @@ int populate_pfns(struct xc_sr_context *ctx, unsigned count,
     return rc;
 }
 
+/*
+ * Given a list of pfns, their types, and a block of page data from the
+ * stream, populate and record their types, map the relevant subset and copy
+ * the data into the guest.
+ */
+static int process_page_data(struct xc_sr_context *ctx, unsigned count,
+                             xen_pfn_t *pfns, uint32_t *types, void *page_data)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
+    int *map_errs = malloc(count * sizeof(*map_errs));
+    int rc;
+    void *mapping = NULL, *guest_page = NULL;
+    unsigned i,    /* i indexes the pfns from the record. */
+        j,         /* j indexes the subset of pfns we decide to map. */
+        nr_pages = 0;
+
+    if ( !mfns || !map_errs )
+    {
+        rc = -1;
+        ERROR("Failed to allocate %zu bytes to process page data",
+              count * (sizeof(*mfns) + sizeof(*map_errs)));
+        goto err;
+    }
+
+    rc = populate_pfns(ctx, count, pfns, types);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", count);
+        goto err;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
+
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_NOTAB:
+
+        case XEN_DOMCTL_PFINFO_L1TAB:
+        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+            mfns[nr_pages++] = ctx->restore.ops.pfn_to_gfn(ctx, pfns[i]);
+            break;
+        }
+    }
+
+    /* Nothing to do? */
+    if ( nr_pages == 0 )
+        goto done;
+
+    mapping = guest_page = xc_map_foreign_bulk(
+        xch, ctx->domid, PROT_READ | PROT_WRITE,
+        mfns, map_errs, nr_pages);
+    if ( !mapping )
+    {
+        rc = -1;
+        PERROR("Unable to map %u mfns for %u pages of data",
+               nr_pages, count);
+        goto err;
+    }
+
+    for ( i = 0, j = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_XTAB:
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+            /* No page data to deal with. */
+            continue;
+        }
+
+        if ( map_errs[j] )
+        {
+            rc = -1;
+            ERROR("Mapping pfn %lx (mfn %lx, type %#x) failed with %d",
+                  pfns[i], mfns[j], types[i], map_errs[j]);
+            goto err;
+        }
+
+        /* Undo page normalisation done by the saver. */
+        rc = ctx->restore.ops.localise_page(ctx, types[i], page_data);
+        if ( rc )
+        {
+            ERROR("Failed to localise pfn %lx (type %#x)",
+                  pfns[i], types[i] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        if ( ctx->restore.verify )
+        {
+            /* Verify mode - compare incoming data to what we already have. */
+            if ( memcmp(guest_page, page_data, PAGE_SIZE) )
+                ERROR("verify pfn %lx failed (type %#x)",
+                      pfns[i], types[i] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+        }
+        else
+        {
+            /* Regular mode - copy incoming data into place. */
+            memcpy(guest_page, page_data, PAGE_SIZE);
+        }
+
+        ++j;
+        guest_page += PAGE_SIZE;
+        page_data += PAGE_SIZE;
+    }
+
+ done:
+    rc = 0;
+
+ err:
+    if ( mapping )
+        munmap(mapping, nr_pages * PAGE_SIZE);
+
+    free(map_errs);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Validate a PAGE_DATA record from the stream, and pass the results to
+ * process_page_data() to actually perform the legwork.
+ */
+static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_page_data_header *pages = rec->data;
+    unsigned i, pages_of_data = 0;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL, pfn;
+    uint32_t *types = NULL, type;
+
+    if ( rec->length < sizeof(*pages) )
+    {
+        ERROR("PAGE_DATA record truncated: length %u, min %zu",
+              rec->length, sizeof(*pages));
+        goto err;
+    }
+    else if ( pages->count < 1 )
+    {
+        ERROR("Expected at least 1 pfn in PAGE_DATA record");
+        goto err;
+    }
+    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
+    {
+        ERROR("PAGE_DATA record (length %u) too short to contain %u"
+              " pfns worth of information", rec->length, pages->count);
+        goto err;
+    }
+
+    pfns = malloc(pages->count * sizeof(*pfns));
+    types = malloc(pages->count * sizeof(*types));
+    if ( !pfns || !types )
+    {
+        ERROR("Unable to allocate enough memory for %u pfns",
+              pages->count);
+        goto err;
+    }
+
+    for ( i = 0; i < pages->count; ++i )
+    {
+        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
+        if ( !ctx->restore.ops.pfn_is_valid(ctx, pfn) )
+        {
+            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
+            goto err;
+        }
+
+        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
+        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
+             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
+        {
+            ERROR("Invalid type %#x for pfn %#lx (index %u)", type, pfn, i);
+            goto err;
+        }
+        else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
+            /* NOTAB and all L1 through L4 tables (including pinned) should
+             * have a page worth of data in the record. */
+            pages_of_data++;
+
+        pfns[i] = pfn;
+        types[i] = type;
+    }
+
+    if ( rec->length != (sizeof(*pages) +
+                         (sizeof(uint64_t) * pages->count) +
+                         (PAGE_SIZE * pages_of_data)) )
+    {
+        ERROR("PAGE_DATA record wrong size: length %u, expected "
+              "%zu + %zu + %lu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
+        goto err;
+    }
+
+    rc = process_page_data(ctx, pages->count, pfns, types,
+                           &pages->pfn[pages->count]);
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+/*
+ * Restore a domain.
+ */
+static int restore(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_record rec;
+    int rc, saved_rc = 0, saved_errno = 0;
+
+    IPRINTF("Restoring domain");
+
+    rc = ctx->restore.ops.setup(ctx);
+    if ( rc )
+        goto err;
+
+    ctx->restore.max_populated_pfn = (32 * 1024 / 4) - 1;
+    ctx->restore.populated_pfns = bitmap_alloc(
+        ctx->restore.max_populated_pfn + 1);
+    if ( !ctx->restore.populated_pfns )
+    {
+        ERROR("Unable to allocate memory for populated_pfns bitmap");
+        rc = -1;
+        goto err;
+    }
+
+    do
+    {
+        rc = read_record(ctx, &rec);
+        if ( rc )
+            goto err;
+
+        switch ( rec.type )
+        {
+        case REC_TYPE_END:
+            break;
+
+        case REC_TYPE_PAGE_DATA:
+            rc = handle_page_data(ctx, &rec);
+            break;
+
+        case REC_TYPE_VERIFY:
+            DPRINTF("Verify mode enabled");
+            ctx->restore.verify = true;
+            break;
+
+        default:
+            rc = ctx->restore.ops.process_record(ctx, &rec);
+            break;
+        }
+
+        free(rec.data);
+
+        if ( rc == RECORD_NOT_PROCESSED )
+        {
+            if ( rec.type & REC_TYPE_OPTIONAL )
+                DPRINTF("Ignoring optional record %#x (%s)",
+                        rec.type, rec_type_to_str(rec.type));
+            else
+            {
+                ERROR("Mandatory record %#x (%s) not handled",
+                      rec.type, rec_type_to_str(rec.type));
+                rc = -1;
+            }
+        }
+
+        if ( rc )
+            goto err;
+
+    } while ( rec.type != REC_TYPE_END );
+
+    rc = ctx->restore.ops.stream_complete(ctx);
+    if ( rc )
+        goto err;
+
+    IPRINTF("Restore successful");
+    goto done;
+
+ err:
+    saved_errno = errno;
+    saved_rc = rc;
+    PERROR("Restore failed");
+
+ done:
+    free(ctx->restore.populated_pfns);
+    rc = ctx->restore.ops.cleanup(ctx);
+    if ( rc )
+        PERROR("Failed to clean up");
+
+    if ( saved_rc )
+    {
+        rc = saved_rc;
+        errno = saved_errno;
+    }
+
+    return rc;
+}
+
 int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        unsigned int store_evtchn, unsigned long *store_mfn,
                        domid_t store_domid, unsigned int console_evtchn,
-                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned long *console_gfn, domid_t console_domid,
                        unsigned int hvm, unsigned int pae, int superpages,
                        int checkpointed_stream,
                        struct restore_callbacks *callbacks)
 {
+    struct xc_sr_context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions. */
+    ctx.restore.console_evtchn = console_evtchn;
+    ctx.restore.console_domid = console_domid;
+    ctx.restore.xenstore_evtchn = store_evtchn;
+    ctx.restore.xenstore_domid = store_domid;
+    ctx.restore.callbacks = callbacks;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+    DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
+            ", checkpointed_stream %d", io_fd, dom, hvm, pae,
+            superpages, checkpointed_stream);
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %u does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+
+    if ( read_headers(&ctx) )
+        return -1;
+
+    if ( ctx.dominfo.hvm )
+    {
+        ctx.restore.ops = restore_ops_x86_hvm;
+        if ( restore(&ctx) )
+            return -1;
+    }
+    else
+    {
+        ctx.restore.ops = restore_ops_x86_pv;
+        if ( restore(&ctx) )
+            return -1;
+    }
+
+    IPRINTF("XenStore: mfn %#lx, dom %d, evt %u",
+            ctx.restore.xenstore_gfn,
+            ctx.restore.xenstore_domid,
+            ctx.restore.xenstore_evtchn);
+
+    IPRINTF("Console: mfn %#lx, dom %d, evt %u",
+            ctx.restore.console_gfn,
+            ctx.restore.console_domid,
+            ctx.restore.console_evtchn);
+
+    *console_gfn = ctx.restore.console_gfn;
+    *store_mfn = ctx.restore.xenstore_gfn;
+
+    return 0;
 }
 
 /*
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 14/15] docs: libxc migration stream specification
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (12 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 13/15] tools/libxc: common restore code Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-05-05 13:03   ` Ian Campbell
  2015-04-23 11:48 ` [PATCH v10 15/15] tools/libxc: Migration v2 compatibility for unmodified libxl Andrew Cooper
  2015-05-05 14:02 ` [PATCH v10 00/15] Migration v2 (libxc) Ian Campbell
  15 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Ian Jackson, David Vrabel, Ian Campbell

From: David Vrabel <david.vrabel@citrix.com>

Add the specification for a new migration stream format.  The document
includes all the details but to summarize:

The existing (legacy) format is dependent on the word size of the
toolstack.  This prevents domains from migrating from hosts running
32-bit toolstacks to hosts running 64-bit toolstacks (and vice-versa).

The legacy format lacks any version information, making it difficult to
extend in a compatible way.

The new format has a header (the image header) with version information,
a domain header with basic information of the domain and a stream of
records for the image data.

The format will be used for future domain types (such as on ARM).

The specification is pandoc format (an extended markdown format) and the
documentation build system is extended to support pandoc format documents.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
---
 docs/specs/libxc-migration-stream.pandoc |  672 ++++++++++++++++++++++++++++++
 1 file changed, 672 insertions(+)
 create mode 100644 docs/specs/libxc-migration-stream.pandoc

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
new file mode 100644
index 0000000..455d1ce
--- /dev/null
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -0,0 +1,672 @@
+% LibXenCtrl Domain Image Format
+% David Vrabel <<david.vrabel@citrix.com>>
+  Andrew Cooper <<andrew.cooper3@citrix.com>>
+% Draft G
+
+Introduction
+============
+
+Purpose
+-------
+
+The _domain save image_ is the context of a running domain used for
+snapshots of a domain or for transferring domains between hosts during
+migration.
+
+There are a number of problems with the format of the domain save
+image used in Xen 4.4 and earlier (the _legacy format_).
+
+* Dependent on toolstack word size.  A number of fields within the
+  image are native types such as `unsigned long` which have different
+  sizes between 32-bit and 64-bit toolstacks.  This prevents domains
+  from being migrated between hosts running 32-bit and 64-bit
+  toolstacks.
+
+* There is no header identifying the image.
+
+* The image has no version information.
+
+A new format that addresses the above is required.
+
+ARM does not yet have a domain save image format specified, and
+the format described in this specification should be suitable.
+
+Not Yet Included
+----------------
+
+The following features are not yet fully specified and will be
+included in a future draft.
+
+* Remus
+
+* Page data compression.
+
+* ARM
+
+
+Overview
+========
+
+The image format consists of two main sections:
+
+* _Headers_
+* _Records_
+
+Headers
+-------
+
+There are two headers: the _image header_, and the _domain header_.
+The image header describes the format of the image (version etc.).
+The _domain header_ contains general information about the domain
+(architecture, type etc.).
+
+Records
+-------
+
+The main part of the format is a sequence of different _records_.
+Each record type contains information about the domain context.  At a
+minimum there is an END record marking the end of the records section.
+
+
+Fields
+------
+
+All the fields within the headers and records have a fixed width.
+
+Fields are always aligned to their size.
+
+Padding and reserved fields are set to zero on save and must be
+ignored during restore.
+
+Integer (numeric) fields in the image header are always in big-endian
+byte order.
+
+Integer fields in the domain header and in the records are in the
+endianness described in the image header (which will typically be the
+native ordering).
+
+\clearpage
+
+Headers
+=======
+
+Image Header
+------------
+
+The image header identifies an image as a Xen domain save image.  It
+includes the version of this specification that the image complies
+with.
+
+Tools supporting version _V_ of the specification shall always save
+images using version _V_.  Tools shall support restoring from version
+_V_.  If the previous Xen release produced version _V_ - 1 images,
+tools shall support restoring from these.  Tools may additionally
+support restoring from earlier versions.
+
+The marker field can be used to distinguish between legacy images and
+those corresponding to this specification.  Legacy images will have
+one or more zero bits within the first 8 octets of the image.
+
+Fields within the image header are always in _big-endian_ byte order,
+regardless of the setting of the endianness bit.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | marker                                          |
+    +-----------------------+-------------------------+
+    | id                    | version                 |
+    +-----------+-----------+-------------------------+
+    | options   | (reserved)                          |
+    +-----------+-------------------------------------+
+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+marker      0xFFFFFFFFFFFFFFFF.
+
+id          0x58454E46 ("XENF" in ASCII).
+
+version     0x00000002.  The version of this specification.
+
+options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
+
+            bit 1-15: Reserved.
+--------------------------------------------------------------------
+
+The endianness shall be 0 (little-endian) for images generated on an
+i386, x86_64, or arm host.
+
+\clearpage
+
+Domain Header
+-------------
+
+The domain header includes general properties of the domain.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-----------+-------------+
+    | type                  | page_shift| (reserved)  |
+    +-----------------------+-----------+-------------+
+    | xen_major             | xen_minor               |
+    +-----------------------+-------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+type        0x0000: Reserved.
+
+            0x0001: x86 PV.
+
+            0x0002: x86 HVM.
+
+            0x0003: x86 PVH.
+
+            0x0004: ARM.
+
+            0x0005 - 0xFFFFFFFF: Reserved.
+
+page_shift  Size of a guest page as a power of two.
+
+            i.e., page size = 2^page_shift^.
+
+xen_major   The Xen major version when this image was saved.
+
+xen_minor   The Xen minor version when this image was saved.
+--------------------------------------------------------------------
+
+The legacy stream conversion tool writes a `xen_major` version of 0, and sets
+`xen_minor` to the version of itself.
+
+\clearpage
+
+Records
+=======
+
+A record has a record header, type specific data and a trailing
+footer.  If `body_length` is not a multiple of 8, the body is padded
+with zeroes to align the end of the record on an 8 octet boundary.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x00000000: END
+
+             0x00000001: PAGE_DATA
+
+             0x00000002: X86_PV_INFO
+
+             0x00000003: X86_PV_P2M_FRAMES
+
+             0x00000004: X86_PV_VCPU_BASIC
+
+             0x00000005: X86_PV_VCPU_EXTENDED
+
+             0x00000006: X86_PV_VCPU_XSAVE
+
+             0x00000007: SHARED_INFO
+
+             0x00000008: TSC_INFO
+
+             0x00000009: HVM_CONTEXT
+
+             0x0000000A: HVM_PARAMS
+
+             0x0000000B: TOOLSTACK
+
+             0x0000000C: X86_PV_VCPU_MSRS
+
+             0x0000000D: VERIFY
+
+             0x0000000E - 0x7FFFFFFF: Reserved for future _mandatory_
+             records.
+
+             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
+             records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a multiple
+             of 8 octets.
+--------------------------------------------------------------------
+
+Records may be _mandatory_ or _optional_.  Optional records have bit
+31 set in their type.  Restoring an image that has an unrecognised or
+unsupported mandatory record must fail.  The contents of optional
+records may be ignored during a restore.
+
+The following sub-sections specify the record body format for each of
+the record types.
+
+\clearpage
+
+END
+----
+
+An end record marks the end of the image, and shall be the final record
+in the stream.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The end record contains no fields; its body_length is 0.
+
+\clearpage
+
+PAGE_DATA
+---------
+
+The bulk of an image consists of many PAGE_DATA records containing the
+memory contents.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+    | page_data[0]...                                 |
+    ...
+    +-------------------------------------------------+
+    | page_data[N-1]...                               |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+count       Number of pages described in this record.
+
+pfn         An array of count PFNs and their types.
+
+            Bit 63-60: XEN\_DOMCTL\_PFINFO\_* type (from
+            `public/domctl.h` but shifted by 32 bits)
+
+            Bit 59-52: Reserved.
+
+            Bit 51-0: PFN.
+
+page\_data  page\_size octets of uncompressed page contents for each
+            page set as present in the pfn array.
+--------------------------------------------------------------------
+
+Note: Count is strictly > 0.  N is strictly <= C and it is possible for there
+to be no page_data in the record if all pfns are of invalid types.
+
+--------------------------------------------------------------------
+PFINFO type    Value      Description
+-------------  ---------  ------------------------------------------
+NOTAB          0x0        Normal page.
+
+L1TAB          0x1        L1 page table page.
+
+L2TAB          0x2        L2 page table page.
+
+L3TAB          0x3        L3 page table page.
+
+L4TAB          0x4        L4 page table page.
+
+               0x5-0x8    Reserved.
+
+L1TAB_PIN      0x9        L1 page table page (pinned).
+
+L2TAB_PIN      0xA        L2 page table page (pinned).
+
+L3TAB_PIN      0xB        L3 page table page (pinned).
+
+L4TAB_PIN      0xC        L4 page table page (pinned).
+
+BROKEN         0xD        Broken page.
+
+XALLOC         0xE        Allocate only.
+
+XTAB           0xF        Invalid page.
+--------------------------------------------------------------------
+
+Table: XEN\_DOMCTL\_PFINFO\_* Page Types.
+
+PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
+corresponding `page_data`.
+
+The saver uses the `XTAB` type for PFNs that become invalid in the
+guest's P2M table during a live migration[^2].
+
+Restoring an image with unrecognised page types shall fail.
+
+[^2]: In the legacy format, this is the list of unmapped PFNs in the
+tail.
+
+\clearpage
+
+X86_PV_INFO
+-----------
+
+     0     1     2     3     4     5     6     7 octet
+    +-----+-----+-----------+-------------------------+
+    | w   | ptl | (reserved)                          |
+    +-----+-----+-----------+-------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+guest_width (w)  Guest width in octets (either 4 or 8).
+
+pt_levels (ptl)  Number of page table levels (either 3 or 4).
+--------------------------------------------------------------------
+
+\clearpage
+
+X86_PV_P2M_FRAMES
+-----------------
+
+     0     1     2     3     4     5     6     7 octet
+    +-----+-----+-----+-----+-------------------------+
+    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
+    +-----+-----+-----+-----+-------------------------+
+    | p2m_pfn[p2m frame containing pfn S]             |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | p2m_pfn[p2m frame containing pfn E]             |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-------------    ---------------------------------------------------
+p2m_start_pfn    First pfn index in the p2m_pfn array.
+
+p2m_end_pfn      Last pfn index in the p2m_pfn array.
+
+p2m_pfn          Array of PFNs containing the guest's P2M table, for
+                 the PFN frames containing the PFN range S to E
+                 (inclusive).
+
+--------------------------------------------------------------------
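The number of p2m_pfn entries is implied rather than stated: it is the number of p2m frames spanned by pfns S to E. A hedged sketch of the arithmetic, assuming 4KiB pages and the guest_width (4 or 8 octets per p2m entry) from a preceding X86_PV_INFO record:

```c
#include <stdint.h>

/* Illustrative only: number of p2m_pfn[] entries in an
 * X86_PV_P2M_FRAMES record covering pfns start_pfn..end_pfn. */
static uint32_t p2m_frame_count(uint32_t start_pfn, uint32_t end_pfn,
                                unsigned int guest_width)
{
    unsigned int fpp = 4096 / guest_width;  /* p2m entries per frame */

    return (end_pfn / fpp) - (start_pfn / fpp) + 1;
}
```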
+
+\clearpage
+
+X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
+----------------------------------------
+
+The format of these records is identical.  They are all binary blobs
+of data which are accessed using specific pairs of domctl hypercalls.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | vcpu_id               | (reserved)              |
+    +-----------------------+-------------------------+
+    | context...                                      |
+    ...
+    +-------------------------------------------------+
+
+---------------------------------------------------------------------
+Field            Description
+-----------      ----------------------------------------------------
+vcpu_id          The VCPU ID.
+
+context          Binary data for this VCPU.
+---------------------------------------------------------------------
+
+---------------------------------------------------------------------
+Record type                  Accessor hypercalls
+-----------------------      ----------------------------------------
+X86\_PV\_VCPU\_BASIC         XEN\_DOMCTL\_{get,set}vcpucontext
+
+X86\_PV\_VCPU\_EXTENDED      XEN\_DOMCTL\_{get,set}\_ext\_vcpucontext
+
+X86\_PV\_VCPU\_XSAVE         XEN\_DOMCTL\_{get,set}vcpuextstate
+
+X86\_PV\_VCPU\_MSRS          XEN\_DOMCTL\_{get,set}\_vcpu\_msrs
+---------------------------------------------------------------------
+
+\clearpage
+
+SHARED_INFO
+-----------
+
+The content of the Shared Info page.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | shared_info                                     |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+shared_info      Contents of the shared info page.  This record
+                 should be exactly 1 page long.
+--------------------------------------------------------------------
+
+\clearpage
+
+TSC_INFO
+--------
+
+Domain TSC information, as accessed by the
+XEN\_DOMCTL\_{get,set}tscinfo hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | mode                   | khz                    |
+    +------------------------+------------------------+
+    | nsec                                            |
+    +------------------------+------------------------+
+    | incarnation            | (reserved)             |
+    +------------------------+------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+mode             TSC mode, TSC\_MODE\_* constant.
+
+khz              TSC frequency, in kHz.
+
+nsec             Elapsed time, in nanoseconds.
+
+incarnation      Incarnation.
+--------------------------------------------------------------------
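The record body maps onto a fixed 24-octet layout. The struct below is purely illustrative (names mirror the table above, not any Xen header):

```c
#include <stdint.h>

/* Wire layout of the TSC_INFO record body, as described above. */
struct tsc_info_record {
    uint32_t mode;         /* TSC_MODE_* constant */
    uint32_t khz;          /* TSC frequency, in kHz */
    uint64_t nsec;         /* elapsed time, in nanoseconds */
    uint32_t incarnation;
    uint32_t reserved;     /* must be zero on the wire */
};
```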
+
+\clearpage
+
+HVM_CONTEXT
+-----------
+
+HVM Domain context, as accessed by the
+XEN\_DOMCTL\_{get,set}hvmcontext hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | hvm_ctx                                         |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+hvm_ctx          The HVM Context blob from Xen.
+--------------------------------------------------------------------
+
+\clearpage
+
+HVM_PARAMS
+----------
+
+HVM Domain parameters, as accessed by the
+HVMOP\_{get,set}\_param hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | count (C)              | (reserved)             |
+    +------------------------+------------------------+
+    | param[0].index                                  |
+    +-------------------------------------------------+
+    | param[0].value                                  |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | param[C-1].index                                |
+    +-------------------------------------------------+
+    | param[C-1].value                                |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+count            The number of parameters contained in this record.
+                 Each parameter in the record contains an index and
+                 value.
+
+param index      Parameter index.
+
+param value      Parameter value.
+--------------------------------------------------------------------
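Each parameter occupies 16 octets on the wire: an 8-octet index followed by an 8-octet value. A sketch of one entry (local struct, for illustration only):

```c
#include <stdint.h>

/* One (index, value) pair from an HVM_PARAMS record.  The record
 * header (count + reserved) is followed by 'count' of these. */
struct hvm_param_pair {
    uint64_t index;   /* HVM_PARAM_* index */
    uint64_t value;   /* parameter value */
};
```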
+
+\clearpage
+
+TOOLSTACK
+---------
+
+An opaque blob provided by and supplied to the higher layers of the
+toolstack (e.g., libxl) during save and restore.
+
+> This is only temporary -- the intention is the toolstack takes care
+> of this itself.  This record is only present for early development
+> purposes and will be removed before submissions, along with changes
+> to libxl which cause libxl to handle this data itself.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | data                                            |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+data             Blob of toolstack-specific data.
+--------------------------------------------------------------------
+
+\clearpage
+
+VERIFY
+------
+
+A verify record indicates that, while all memory has now been sent, the sender
+shall send further memory records for debugging purposes.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The verify record contains no fields; its body_length is 0.
+
+\clearpage
+
+Layout
+======
+
+The set of valid records depends on the guest architecture and type.  No
+assumptions should be made about the ordering or interleaving of
+independent records.  Record dependencies are noted below.
+
+x86 PV Guest
+------------
+
+A typical save image for an x86 PV guest would look like:
+
+1. Image header
+2. Domain header
+3. X86\_PV\_INFO record
+4. X86\_PV\_P2M\_FRAMES record
+5. Many PAGE\_DATA records
+6. TSC\_INFO
+7. SHARED\_INFO record
+8. VCPU context records for each online VCPU
+    a. X86\_PV\_VCPU\_BASIC record
+    b. X86\_PV\_VCPU\_EXTENDED record
+    c. X86\_PV\_VCPU\_XSAVE record
+    d. X86\_PV\_VCPU\_MSRS record
+9. END record
+
+There are some strict ordering requirements.  The following records must
+be present in the following order as each of them depends on information
+present in the preceding ones.
+
+1. X86\_PV\_INFO record
+2. X86\_PV\_P2M\_FRAMES record
+3. PAGE\_DATA records
+4. VCPU records
+
+x86 HVM Guest
+-------------
+
+A typical save image for an x86 HVM guest would look like:
+
+1. Image header
+2. Domain header
+3. Many PAGE\_DATA records
+4. TSC\_INFO
+5. HVM\_PARAMS
+6. HVM\_CONTEXT
+7. END record
+
+HVM\_PARAMS must precede HVM\_CONTEXT, as certain parameters can affect
+the validity of architectural state in the context.
+
+
+Legacy Images (x86 only)
+========================
+
+Restoring legacy images from older tools shall be handled by
+translating the legacy format image into this new format.
+
+It shall not be possible to save in the legacy format.
+
+There are two different legacy image formats, depending on whether the
+image was generated by a 32-bit or a 64-bit toolstack.  These shall be
+distinguished by inspecting octets 4-7 in the image: if they are all
+zero, it is a 64-bit image.
+
+Toolstack  Field                            Value
+---------  -----                            -----
+64-bit     Bits 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
+32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
+32-bit     Chunk type (HVM)                 < 0
+32-bit     Page count (HVM)                 > 0
+
+Table: Possible values for octet 4-7 in legacy images
+
+This assumes the presence of the extended-info chunk which was
+introduced in Xen 3.0.
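The disambiguation rule above reduces to a zero test on octets 4-7 of the image; a minimal sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Returns non-zero iff the first 8 octets of a legacy image indicate a
 * 64-bit toolstack: octets 4-7 are the high half of the little-endian
 * 64-bit p2m_size, which is zero since p2m_size < 2^32. */
static int legacy_image_is_64bit(const uint8_t first8[8])
{
    return (first8[4] | first8[5] | first8[6] | first8[7]) == 0;
}
```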
+
+
+Future Extensions
+=================
+
+All changes to the image or domain headers require the image version
+to be increased.
+
+The format may be extended by adding additional record types.
+
+Extending an existing record type must be done by adding a new record
+type.  This allows old images with the old record to still be
+restored.
+
+The image header may only be extended by _appending_ additional
+fields.  In particular, the `marker`, `id` and `version` fields must
+never change size or location.
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v10 15/15] tools/libxc: Migration v2 compatibility for unmodified libxl
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (13 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 14/15] docs: libxc migration stream specification Andrew Cooper
@ 2015-04-23 11:48 ` Andrew Cooper
  2015-05-05 14:02 ` [PATCH v10 00/15] Migration v2 (libxc) Ian Campbell
  15 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-04-23 11:48 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Wei Liu

These changes cause migration v2 to behave similarly enough to legacy
migration to function for HVM guests under an unmodified xl/libxl.

The migration v2 work for libxl will fix the layering issues with the
toolstack and qemu records, at which point this patch will be unneeded.

It is however included here for people wishing to experiment with migration v2
ahead of the libxl work.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>

---
v10: s/dump_qemu/handle_qemu/ for consistency
---
 tools/libxc/Makefile                |    2 +
 tools/libxc/xc_sr_restore_x86_hvm.c |  102 +++++++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_save_x86_hvm.c    |   36 +++++++++++++
 3 files changed, 140 insertions(+)

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 76f63a1..63878ec 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -64,6 +64,8 @@ GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_hvm.c
 GUEST_SRCS-y += xc_sr_restore.c
 GUEST_SRCS-y += xc_sr_save.c
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
+$(patsubst %.c,%.o,$(GUEST_SRCS-y)): CFLAGS += -DXG_LIBXL_HVM_COMPAT
+$(patsubst %.c,%.opic,$(GUEST_SRCS-y)): CFLAGS += -DXG_LIBXL_HVM_COMPAT
 else
 GUEST_SRCS-y += xc_nomigrate.c
 endif
diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
index 49d22c7..6e9b318 100644
--- a/tools/libxc/xc_sr_restore_x86_hvm.c
+++ b/tools/libxc/xc_sr_restore_x86_hvm.c
@@ -3,6 +3,24 @@
 
 #include "xc_sr_common_x86.h"
 
+#ifdef XG_LIBXL_HVM_COMPAT
+static int handle_toolstack(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore )
+        return 0;
+
+    rc = ctx->restore.callbacks->toolstack_restore(
+        ctx->domid, rec->data, rec->length, ctx->restore.callbacks->data);
+
+    if ( rc < 0 )
+        PERROR("restoring toolstack");
+    return rc;
+}
+#endif
+
 /*
  * Process an HVM_CONTEXT record from the stream.
  */
@@ -75,6 +93,76 @@ static int handle_hvm_params(struct xc_sr_context *ctx,
     return 0;
 }
 
+#ifdef XG_LIBXL_HVM_COMPAT
+static int handle_qemu(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    char qemusig[21], path[256];
+    uint32_t qlen;
+    void *qbuf = NULL;
+    int rc = -1;
+    FILE *fp = NULL;
+
+    if ( read_exact(ctx->fd, qemusig, sizeof(qemusig)) )
+    {
+        PERROR("Error reading QEMU signature");
+        goto out;
+    }
+
+    if ( !memcmp(qemusig, "DeviceModelRecord0002", sizeof(qemusig)) )
+    {
+        if ( read_exact(ctx->fd, &qlen, sizeof(qlen)) )
+        {
+            PERROR("Error reading QEMU record length");
+            goto out;
+        }
+
+        qbuf = malloc(qlen);
+        if ( !qbuf )
+        {
+            PERROR("no memory for device model state");
+            goto out;
+        }
+
+        if ( read_exact(ctx->fd, qbuf, qlen) )
+        {
+            PERROR("Error reading device model state");
+            goto out;
+        }
+    }
+    else
+    {
+        ERROR("Invalid device model state signature '%*.*s'",
+              (int)sizeof(qemusig), (int)sizeof(qemusig), qemusig);
+        goto out;
+    }
+
+    sprintf(path, XC_DEVICE_MODEL_RESTORE_FILE".%u", ctx->domid);
+    fp = fopen(path, "wb");
+    if ( !fp )
+    {
+        PERROR("Failed to open '%s' for writing", path);
+        goto out;
+    }
+
+    DPRINTF("Writing %u bytes of QEMU data", qlen);
+    if ( fwrite(qbuf, 1, qlen, fp) != qlen )
+    {
+        PERROR("Failed to write %u bytes of QEMU data", qlen);
+        goto out;
+    }
+
+    rc = 0;
+
+ out:
+    if ( fp )
+        fclose(fp);
+    free(qbuf);
+
+    return rc;
+}
+#endif
+
 /* restore_ops function. */
 static bool x86_hvm_pfn_is_valid(const struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
@@ -150,6 +238,11 @@ static int x86_hvm_process_record(struct xc_sr_context *ctx,
     case REC_TYPE_HVM_PARAMS:
         return handle_hvm_params(ctx, rec);
 
+#ifdef XG_LIBXL_HVM_COMPAT
+    case REC_TYPE_TOOLSTACK:
+        return handle_toolstack(ctx, rec);
+#endif
+
     default:
         return RECORD_NOT_PROCESSED;
     }
@@ -199,6 +292,15 @@ static int x86_hvm_stream_complete(struct xc_sr_context *ctx)
         return rc;
     }
 
+#ifdef XG_LIBXL_HVM_COMPAT
+    rc = handle_qemu(ctx);
+    if ( rc )
+    {
+        ERROR("Failed to dump qemu");
+        return rc;
+    }
+#endif
+
     return rc;
 }
 
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
index 0928f19..8baa104 100644
--- a/tools/libxc/xc_sr_save_x86_hvm.c
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -118,6 +118,36 @@ static int write_hvm_params(struct xc_sr_context *ctx)
     return rc;
 }
 
+#ifdef XG_LIBXL_HVM_COMPAT
+static int write_toolstack(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_record rec = {
+        .type = REC_TYPE_TOOLSTACK,
+        .length = 0,
+    };
+    uint8_t *buf;
+    uint32_t len;
+    int rc;
+
+    if ( !ctx->save.callbacks || !ctx->save.callbacks->toolstack_save )
+        return 0;
+
+    if ( ctx->save.callbacks->toolstack_save(
+             ctx->domid, &buf, &len, ctx->save.callbacks->data) < 0 )
+    {
+        PERROR("Error calling toolstack_save");
+        return -1;
+    }
+
+    rc = write_split_record(ctx, &rec, buf, len);
+    if ( rc < 0 )
+        PERROR("Error writing TOOLSTACK record");
+    free(buf);
+    return rc;
+}
+#endif
+
 static xen_pfn_t x86_hvm_pfn_to_gfn(const struct xc_sr_context *ctx,
                                     xen_pfn_t pfn)
 {
@@ -170,6 +200,12 @@ static int x86_hvm_end_of_stream(struct xc_sr_context *ctx)
     if ( rc )
         return rc;
 
+#ifdef XG_LIBXL_HVM_COMPAT
+    rc = write_toolstack(ctx);
+    if ( rc )
+        return rc;
+#endif
+
     /* Write the HVM_CONTEXT record. */
     rc = write_hvm_context(ctx);
     if ( rc )
-- 
1.7.10.4


* Re: [PATCH v10 02/15] libxc/progress: Extend the progress interface
  2015-04-23 11:48 ` [PATCH v10 02/15] libxc/progress: Extend the progress interface Andrew Cooper
@ 2015-05-05 12:56   ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-05-05 12:56 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Wei Liu, Ian Jackson, Xen-devel

On Thu, 2015-04-23 at 12:48 +0100, Andrew Cooper wrote:
> Progress information is logged via a different logger to regular libxc log
> messages, and currently can only express a range.  However, not everything
> which needs reporting as progress comes with a range.  Extend the interface to
> allow reporting of a single statement.
> 
> The programming interface now looks like:
>   xc_set_progress_prefix()
>     set the prefix string to be used
>   xc_report_progress_single()
>     report a single action
>   xc_report_progress_step()
>     report $X of $Y
> 
> The new programming interface is implemented in a compatible way with the
> existing caller interface (by reporting a single action as "0 of 0").
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Ian Campbell <Ian.Campbell@citrix.com>


* Re: [PATCH v10 09/15] tools/libxc: x86 PV restore code
  2015-04-23 11:48 ` [PATCH v10 09/15] tools/libxc: x86 PV restore code Andrew Cooper
@ 2015-05-05 13:00   ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-05-05 13:00 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Wei Liu, Ian Jackson, Xen-devel

On Thu, 2015-04-23 at 12:48 +0100, Andrew Cooper wrote:

> + * Set a pfn as populated, expanding the tracking structures if needed. To
> + * avoid realloc()ing too excessivly, the size increased to the nearest power

"excessively".

Otherwise: Acked-by: Ian Campbell <ian.campbell@citrix.com>


* Re: [PATCH v10 14/15] docs: libxc migration stream specification
  2015-04-23 11:48 ` [PATCH v10 14/15] docs: libxc migration stream specification Andrew Cooper
@ 2015-05-05 13:03   ` Ian Campbell
  2015-05-05 13:04     ` Andrew Cooper
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-05-05 13:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Wei Liu, Ian Jackson, David Vrabel, Xen-devel

On Thu, 2015-04-23 at 12:48 +0100, Andrew Cooper wrote:
> From: David Vrabel <david.vrabel@citrix.com>
> 
> Add the specification for a new migration stream format.  The document
> includes all the details but to summarize:
> 
> The existing (legacy) format is dependant on the word size of the
> toolstack.  This prevents domains from migrating from hosts running
> 32-bit toolstacks to hosts running 64-bit toolstacks (and vice-versa).
> 
> The legacy format lacks any version information making it difficult to
> extend in compatible way.
> 
> The new format has a header (the image header) with version information,
> a domain header with basic information of the domain and a stream of
> records for the image data.
> 
> The format will be used for future domain types (such as on ARM).
> 
> The specification is pandoc format (an extended markdown format) and the
> documentation build system is extended to support pandoc format documents.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Without reading:

Acked-by: Ian Campbell <Ian.Campbell@citrix.com>

With two minor comments:

> diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
> new file mode 100644
> index 0000000..455d1ce
> --- /dev/null
> +++ b/docs/specs/libxc-migration-stream.pandoc
> @@ -0,0 +1,672 @@
> +% LibXenCtrl Domain Image Format
> +% David Vrabel <<david.vrabel@citrix.com>>
> +  Andrew Cooper <<andrew.cooper3@citrix.com>>
> +% Draft G

No longer a draft? Perhaps s/Draft/Version/?

> +Integer fields in the domain header and in the records are in the
> +endianess described in the image header (which will typically be the

endianness.

Ian.


* Re: [PATCH v10 14/15] docs: libxc migration stream specification
  2015-05-05 13:03   ` Ian Campbell
@ 2015-05-05 13:04     ` Andrew Cooper
  2015-05-05 13:25       ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2015-05-05 13:04 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, Ian Jackson, David Vrabel, Xen-devel

On 05/05/15 14:03, Ian Campbell wrote:
> On Thu, 2015-04-23 at 12:48 +0100, Andrew Cooper wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> Add the specification for a new migration stream format.  The document
>> includes all the details but to summarize:
>>
>> The existing (legacy) format is dependant on the word size of the
>> toolstack.  This prevents domains from migrating from hosts running
>> 32-bit toolstacks to hosts running 64-bit toolstacks (and vice-versa).
>>
>> The legacy format lacks any version information making it difficult to
>> extend in compatible way.
>>
>> The new format has a header (the image header) with version information,
>> a domain header with basic information of the domain and a stream of
>> records for the image data.
>>
>> The format will be used for future domain types (such as on ARM).
>>
>> The specification is pandoc format (an extended markdown format) and the
>> documentation build system is extended to support pandoc format documents.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Without reading:
>
> Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
>
> With two minor comments:
>
>> diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
>> new file mode 100644
>> index 0000000..455d1ce
>> --- /dev/null
>> +++ b/docs/specs/libxc-migration-stream.pandoc
>> @@ -0,0 +1,672 @@
>> +% LibXenCtrl Domain Image Format
>> +% David Vrabel <<david.vrabel@citrix.com>>
>> +  Andrew Cooper <<andrew.cooper3@citrix.com>>
>> +% Draft G
> No longer a draft? Perhaps s/Draft/Version/?

Strictly speaking it is still a draft until migration v2 becomes enabled
by default.  This is one of the areas I was going to cover with my
post-libxl cleanup patches.

~Andrew


* Re: [PATCH v10 14/15] docs: libxc migration stream specification
  2015-05-05 13:04     ` Andrew Cooper
@ 2015-05-05 13:25       ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-05-05 13:25 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Wei Liu, Ian Jackson, David Vrabel, Xen-devel

On Tue, 2015-05-05 at 14:04 +0100, Andrew Cooper wrote:
> On 05/05/15 14:03, Ian Campbell wrote:
> > On Thu, 2015-04-23 at 12:48 +0100, Andrew Cooper wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> Add the specification for a new migration stream format.  The document
> >> includes all the details but to summarize:
> >>
> >> The existing (legacy) format is dependant on the word size of the
> >> toolstack.  This prevents domains from migrating from hosts running
> >> 32-bit toolstacks to hosts running 64-bit toolstacks (and vice-versa).
> >>
> >> The legacy format lacks any version information making it difficult to
> >> extend in compatible way.
> >>
> >> The new format has a header (the image header) with version information,
> >> a domain header with basic information of the domain and a stream of
> >> records for the image data.
> >>
> >> The format will be used for future domain types (such as on ARM).
> >>
> >> The specification is pandoc format (an extended markdown format) and the
> >> documentation build system is extended to support pandoc format documents.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > Without reading:
> >
> > Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
> >
> > With two minor comments:
> >
> >> diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
> >> new file mode 100644
> >> index 0000000..455d1ce
> >> --- /dev/null
> >> +++ b/docs/specs/libxc-migration-stream.pandoc
> >> @@ -0,0 +1,672 @@
> >> +% LibXenCtrl Domain Image Format
> >> +% David Vrabel <<david.vrabel@citrix.com>>
> >> +  Andrew Cooper <<andrew.cooper3@citrix.com>>
> >> +% Draft G
> > No longer a draft? Perhaps s/Draft/Version/?
> 
> Strictly speaking it is still a draft until migration v2 becomes enabled
> by default.  This is one of the areas I was going to cover with my
> post-libxl cleanup patches.

Good point & sounds good, thanks.

Ian.


* Re: [PATCH v10 00/15] Migration v2 (libxc)
  2015-04-23 11:48 [PATCH v10 00/15] Migration v2 (libxc) Andrew Cooper
                   ` (14 preceding siblings ...)
  2015-04-23 11:48 ` [PATCH v10 15/15] tools/libxc: Migration v2 compatibility for unmodified libxl Andrew Cooper
@ 2015-05-05 14:02 ` Ian Campbell
  15 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-05-05 14:02 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Vijay Kilari, Wen Congyang, Ian Jackson, Tim Deegan,
	Xen-devel, Ross Lagerwall, David Vrabel, Shriram Rajagopalan,
	Hongyang Yang

On Thu, 2015-04-23 at 12:48 +0100, Andrew Cooper wrote:
> Presented here is v10 of the Migration v2 series (libxc subset), which is able
> to function when transparently inserted under an unmodified xl/libxl.

I've applied the v11 branch with my minor comments addressed which you
provided privately.

> To experiment, simply set XG_MIGRATION_V2 in xl's environment.  For migration,
> the easiest way is to tweak libxl-save-helper to be a shell script
> 
>   root@vitruvias:/home# cat /usr/lib/xen/bin/libxl-save-helper
>   #!/bin/bash
>   export XG_MIGRATION_V2=x
>   exec /usr/lib/xen/bin/libxl-save-helper.bin "$@"
> 
> which will ensure that XG_MIGRATION_V2 gets set in the environment for both
> the source and destination of migration.
> 
> Please experiment!

Seconded!

Cheers,
Ian.

