* [PATCH v7 0/29] Migration Stream v2
@ 2014-09-10 17:10 Andrew Cooper
  2014-09-10 17:10 ` [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging Andrew Cooper
                   ` (29 more replies)
  0 siblings, 30 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Ian Jackson,
	Tim Deegan, Ross Lagerwall, David Vrabel, Jan Beulich

Hello,

Presented here for review is v7 of the Migration Stream v2 work, which
contains a v1 for xl/libxl support, including on-the-fly conversion of legacy
streams.

There are some TODOs and questions in the final section of the series, but the
earlier sections are considered complete, having undergone extensive testing
as part of XenServer.

A git version is available from the saverestore2-v7 branch of:
  git://xenbits.xen.org/people/andrewcoop/xen.git
  http://xenbits.xen.org/git-http/people/andrewcoop/xen.git

In summary,

Patches 1 to 5 are misc bugfixes/improvements
Patches 6 and 7 are specification documents for the new stream
Patch 8 contains python infrastructure for conversion and verification
Patches 9 to 19 form the libxc changes
Patches 20 to 29 form the libxl and xl changes

~Andrew


* [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:18   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 02/29] tools/[lib]xl: Correct use of init/dispose for libxl_domain_restore_params Andrew Cooper
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

LOG() automatically adds a newline.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
 tools/libxl/libxl_create.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index b36c719..a5e185e 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -883,7 +883,7 @@ static void initiate_domain_create(libxl__egc *egc,
     }
 
     if (restore_fd >= 0) {
-        LOG(DEBUG, "restoring, not running bootloader\n");
+        LOG(DEBUG, "restoring, not running bootloader");
         domcreate_bootloader_done(egc, &dcs->bl, 0);
     } else  {
         LOG(DEBUG, "running bootloader");
-- 
1.7.10.4


* [PATCH 02/29] tools/[lib]xl: Correct use of init/dispose for libxl_domain_restore_params
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
  2014-09-10 17:10 ` [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:19   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
 tools/libxl/libxl.h      |    9 +++++++--
 tools/libxl/xl_cmdimpl.c |    6 ++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index dab3a67..5136d02 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -847,10 +847,15 @@ int static inline libxl_domain_create_restore_0x040200(
     LIBXL_EXTERNAL_CALLERS_ONLY
 {
     libxl_domain_restore_params params;
-    params.checkpointed_stream = 0;
+    int ret;
 
-    return libxl_domain_create_restore(
+    libxl_domain_restore_params_init(&params);
+
+    ret = libxl_domain_create_restore(
         ctx, d_config, domid, restore_fd, &params, ao_how, aop_console_how);
+
+    libxl_domain_restore_params_dispose(&params);
+    return ret;
 }
 
 #define libxl_domain_create_restore libxl_domain_create_restore_0x040200
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 8a38077..26492fc 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -2246,11 +2246,17 @@ start:
 
     if ( restoring ) {
         libxl_domain_restore_params params;
+
+        libxl_domain_restore_params_init(&params);
+
         params.checkpointed_stream = dom_info->checkpointed_stream;
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
                                           &params,
                                           0, autoconnect_console_how);
+
+        libxl_domain_restore_params_dispose(&params);
+
         /*
          * On subsequent reboot etc we should create the domain, not
          * restore/migrate-receive it again.
-- 
1.7.10.4


* [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact()
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
  2014-09-10 17:10 ` [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging Andrew Cooper
  2014-09-10 17:10 ` [PATCH 02/29] tools/[lib]xl: Correct use of init/dispose for libxl_domain_restore_params Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:19   ` Ian Campbell
  2014-09-11 10:57   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 04/29] libxc/bitops: Add or() to the available bitmap operations Andrew Cooper
                   ` (26 subsequent siblings)
  29 siblings, 2 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

This implementation of writev_exact() will cope with an iovcnt greater than
IOV_MAX because glibc will actually let this work anyway, and it is very
useful not to have to worry about this in the caller of writev_exact().  The
caller is still required to ensure that the sum of iov_len's doesn't overflow
a ssize_t.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>

---
v3:
 * Re-add adjustment for partial writes.
 * Split min/max adjustment into separate patch.

v2:
 * Remove adjustment for partial writes of a specific iov[] entry.
---
 tools/libxc/xc_private.c |   60 ++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_private.h |    2 ++
 2 files changed, 62 insertions(+)

diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 1c214dd..0941b06 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -858,6 +858,66 @@ int write_exact(int fd, const void *data, size_t size)
     return 0;
 }
 
+int writev_exact(int fd, const struct iovec *iov, int iovcnt)
+{
+    struct iovec *local_iov = NULL;
+    int rc = 0, iov_idx = 0, saved_errno = 0;
+    ssize_t len;
+
+    while ( iov_idx < iovcnt )
+    {
+        /* Skip over iov[] entries with 0 length. */
+        while ( iov[iov_idx].iov_len == 0 )
+            if ( ++iov_idx == iovcnt )
+                goto out;
+
+        len = writev(fd, &iov[iov_idx], min(iovcnt - iov_idx, IOV_MAX));
+        saved_errno = errno;
+
+        if ( (len == -1) && (errno == EINTR) )
+            continue;
+        if ( len <= 0 )
+        {
+            rc = -1;
+            goto out;
+        }
+
+        /* Check iov[] to see whether we had a partial or complete write. */
+        while ( len > 0 && (iov_idx < iovcnt) )
+        {
+            if ( len >= iov[iov_idx].iov_len )
+                len -= iov[iov_idx++].iov_len;
+            else
+            {
+                /* Partial write of iov[iov_idx]. Copy iov so we can adjust
+                 * element iov_idx and resubmit the rest. */
+                if ( !local_iov )
+                {
+                    local_iov = malloc(iovcnt * sizeof(*iov));
+                    if ( !local_iov )
+                    {
+                        saved_errno = ENOMEM;
+                        goto out;
+                    }
+
+                    iov = memcpy(local_iov, iov, iovcnt * sizeof(*iov));
+                }
+
+                local_iov[iov_idx].iov_base += len;
+                local_iov[iov_idx].iov_len  -= len;
+                break;
+            }
+        }
+    }
+
+    saved_errno = 0;
+
+ out:
+    free(local_iov);
+    errno = saved_errno;
+    return rc;
+}
+
 int xc_ffs8(uint8_t x)
 {
     int i;
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index c50a7c9..97e4a56 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -28,6 +28,7 @@
 #include <sys/stat.h>
 #include <stdlib.h>
 #include <sys/ioctl.h>
+#include <sys/uio.h>
 
 #include "xenctrl.h"
 #include "xenctrlosdep.h"
@@ -343,6 +344,7 @@ int xc_flush_mmu_updates(xc_interface *xch, struct xc_mmu *mmu);
 /* Return 0 on success; -1 on error setting errno. */
 int read_exact(int fd, void *data, size_t size); /* EOF => -1, errno=0 */
 int write_exact(int fd, const void *data, size_t size);
+int writev_exact(int fd, const struct iovec *iov, int iovcnt);
 
 int xc_ffs8(uint8_t x);
 int xc_ffs16(uint16_t x);
-- 
1.7.10.4
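
For illustration only (not part of the patch): a caller-side sketch of why an
"exact" vectored write is convenient.  A fixed-size header and a payload are
submitted together, and the helper retries on EINTR and on short writes.  The
names writev_all() and send_framed() are hypothetical.  writev_all() is a
simplified stand-in for writev_exact() which, unlike the patch, adjusts the
caller's iov[] in place rather than copying it.  The IOV_MAX fallback of 16 is
an assumption based on the POSIX minimum.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef IOV_MAX
#define IOV_MAX 16 /* Assumed fallback: the POSIX minimum (_XOPEN_IOV_MAX). */
#endif

/* Hypothetical stand-in for writev_exact(): write every octet described by
 * iov[0..iovcnt-1].  Returns 0 on success, -1 on error with errno set. */
static int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    int idx = 0;

    assert(iov != NULL || iovcnt == 0);

    while ( idx < iovcnt )
    {
        ssize_t len;
        int batch = iovcnt - idx;

        /* Skip empty entries so that a 0 return from writev() is an error. */
        if ( iov[idx].iov_len == 0 )
        {
            ++idx;
            continue;
        }

        if ( batch > IOV_MAX )
            batch = IOV_MAX;

        len = writev(fd, &iov[idx], batch);
        if ( len == -1 && errno == EINTR )
            continue;
        if ( len <= 0 )
            return -1;

        /* Consume fully written entries, then trim a partially written one. */
        while ( len > 0 )
        {
            if ( (size_t)len >= iov[idx].iov_len )
                len -= iov[idx++].iov_len;
            else
            {
                iov[idx].iov_base = (char *)iov[idx].iov_base + len;
                iov[idx].iov_len -= len;
                len = 0;
            }
        }
    }

    return 0;
}

/* Hypothetical caller: send a two-word header (native byte order) followed
 * by the payload, without first copying them into one buffer. */
static int send_framed(int fd, uint32_t type, const void *body, uint32_t len)
{
    uint32_t hdr[2] = { type, len };
    struct iovec iov[2] = {
        { .iov_base = hdr,          .iov_len = sizeof(hdr) },
        { .iov_base = (void *)body, .iov_len = len },
    };

    return writev_all(fd, iov, 2);
}
```

With a helper of this shape, the caller only checks a single return value,
e.g. send_framed(fd, 1, "abcd", 4), rather than looping over partial writes
itself.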


* [PATCH 04/29] libxc/bitops: Add or() to the available bitmap operations
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (2 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:21   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure Andrew Cooper
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
 tools/libxc/xc_bitops.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/libxc/xc_bitops.h b/tools/libxc/xc_bitops.h
index d8e0c16..dfce3b8 100644
--- a/tools/libxc/xc_bitops.h
+++ b/tools/libxc/xc_bitops.h
@@ -60,4 +60,12 @@ static inline int test_and_set_bit(int nr, unsigned long *addr)
     return oldbit;
 }
 
+static inline void bitmap_or(unsigned long *dst, const unsigned long *other,
+                             int nr_bits)
+{
+    int i, nr_longs = (bitmap_size(nr_bits) / sizeof(unsigned long));
+    for ( i = 0; i < nr_longs; ++i )
+        dst[i] |= other[i];
+}
+
 #endif  /* XC_BITOPS_H */
-- 
1.7.10.4
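
For illustration only (not part of the patch): a small sketch of how
bitmap_or() might be used to merge one bitmap into another, for example
folding a freshly read dirty bitmap into a set of pages still to be sent.
bitmap_or() is reproduced from the patch.  BITS_PER_LONG, bitmap_size(),
set_bit() and test_bit() are simplified local helpers written for this
example and are only assumed to behave like their xc_bitops.h counterparts.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Assumed helper: octets needed for nr_bits, rounded up to whole longs. */
static inline int bitmap_size(int nr_bits)
{
    int nr_longs = (nr_bits + BITS_PER_LONG - 1) / BITS_PER_LONG;
    return nr_longs * (int)sizeof(unsigned long);
}

/* Assumed helper: set bit nr in the bitmap at addr. */
static inline void set_bit(int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Assumed helper: return 1 if bit nr is set in the bitmap at addr. */
static inline int test_bit(int nr, const unsigned long *addr)
{
    return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* As added by the patch: dst |= other, one unsigned long at a time. */
static inline void bitmap_or(unsigned long *dst, const unsigned long *other,
                             int nr_bits)
{
    int i, nr_longs = (bitmap_size(nr_bits) / sizeof(unsigned long));
    for ( i = 0; i < nr_longs; ++i )
        dst[i] |= other[i];
}
```

Because the loop works on whole longs, any bits beyond nr_bits in the final
long of `other` are merged as well, so both bitmaps need to be allocated with
bitmap_size(nr_bits) and should keep their tail bits clear.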


* [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (3 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 04/29] libxc/bitops: Add or() to the available bitmap operations Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:32   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 06/29] docs: libxc migration stream specification Andrew Cooper
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Not everything which needs reporting as progress comes with a range.  Allow
reporting "0 of 0" for a single progress statement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
 tools/libxc/xc_domain_restore.c |    2 +-
 tools/libxc/xc_domain_save.c    |    2 +-
 tools/libxc/xc_private.c        |   22 ++++++++++++++--------
 tools/libxc/xc_private.h        |    4 ++--
 tools/libxc/xtl_core.c          |    9 +++++----
 5 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index b9a56d5..b411126 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1610,7 +1610,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         goto out;
     }
 
-    xc_report_progress_start(xch, "Reloading memory pages", dinfo->p2m_size);
+    xc_report_progress_set(xch, "Reloading memory pages");
 
     /*
      * Now simply read each saved frame into its new machine frame.
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 254fdb3..02544f8 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1127,7 +1127,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                  "Saving memory: iter %d (last sent %u skipped %u)",
                  iter, sent_this_iter, skip_this_iter);
 
-        xc_report_progress_start(xch, reportbuf, dinfo->p2m_size);
+        xc_report_progress_set(xch, reportbuf);
 
         iter++;
         sent_this_iter = 0;
diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index 0941b06..45537af 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -388,18 +388,24 @@ void xc_osdep_log(xc_interface *xch, xentoollog_level level, int code, const cha
     va_end(args);
 }
 
-void xc_report_progress_start(xc_interface *xch, const char *doing,
-                              unsigned long total) {
+const char *xc_report_progress_set(xc_interface *xch, const char *doing)
+{
+    const char *old = xch->currently_progress_reporting;
+
     xch->currently_progress_reporting = doing;
-    xtl_progress(xch->error_handler, "xc", xch->currently_progress_reporting,
-                 0, total);
+    return old;
+}
+
+void xc_report_progress_single(xc_interface *xch, const char *doing)
+{
+    xtl_progress(xch->error_handler, "xc", doing, 0, 0);
 }
 
 void xc_report_progress_step(xc_interface *xch,
-                             unsigned long done, unsigned long total) {
-    assert(xch->currently_progress_reporting);
-    xtl_progress(xch->error_handler, "xc", xch->currently_progress_reporting,
-                 done, total);
+                             unsigned long done, unsigned long total)
+{
+    xtl_progress(xch->error_handler, "xc",
+                 xch->currently_progress_reporting ?: "???", done, total);
 }
 
 int xc_get_pfn_type_batch(xc_interface *xch, uint32_t dom,
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 97e4a56..22021e9 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -119,8 +119,8 @@ void xc_report(xc_interface *xch, xentoollog_logger *lg, xentoollog_level,
                int code, const char *fmt, ...)
      __attribute__((format(printf,5,6)));
 
-void xc_report_progress_start(xc_interface *xch, const char *doing,
-                              unsigned long total);
+const char *xc_report_progress_set(xc_interface *xch, const char *doing);
+void xc_report_progress_single(xc_interface *xch, const char *doing);
 void xc_report_progress_step(xc_interface *xch,
                              unsigned long done, unsigned long total);
 
diff --git a/tools/libxc/xtl_core.c b/tools/libxc/xtl_core.c
index 326b97e..73add92 100644
--- a/tools/libxc/xtl_core.c
+++ b/tools/libxc/xtl_core.c
@@ -66,13 +66,14 @@ void xtl_log(struct xentoollog_logger *logger,
 void xtl_progress(struct xentoollog_logger *logger,
                   const char *context, const char *doing_what,
                   unsigned long done, unsigned long total) {
-    int percent;
+    int percent = 0;
 
     if (!logger->progress) return;
 
-    percent = (total < LONG_MAX/100)
-        ? (done * 100) / total
-        : done / ((total + 99) / 100);
+    if ( total )
+        percent = (total < LONG_MAX/100)
+            ? (done * 100) / total
+            : done / ((total + 99) / 100);
 
     logger->progress(logger, context, doing_what, percent, done, total);
 }
-- 
1.7.10.4
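
For illustration only (not part of the patch): the percentage calculation
from xtl_progress() pulled out into a standalone function, showing why the
new guard matters.  With the "0 of 0" reports that
xc_report_progress_single() produces, the unguarded expression would divide
by zero.  The function name progress_percent() is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper mirroring the arithmetic in xtl_progress(). */
static int progress_percent(unsigned long done, unsigned long total)
{
    /* A report without a range ("0 of 0") is shown as 0%. */
    if ( total == 0 )
        return 0;

    /* Multiply first while done * 100 cannot overflow for done <= total;
     * for very large totals divide first, at some cost in precision. */
    return (total < LONG_MAX / 100)
        ? (int)((done * 100) / total)
        : (int)(done / ((total + 99) / 100));
}
```

For example, progress_percent(50, 200) evaluates to 25 and
progress_percent(0, 0) evaluates to 0.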


* [PATCH 06/29] docs: libxc migration stream specification
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (4 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-10 17:10 ` [PATCH 07/29] docs: libxl " Andrew Cooper
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Add the specification for a new migration stream format.  The document
includes all the details but to summarize:

The existing (legacy) format is dependent on the word size of the
toolstack.  This prevents domains from migrating from hosts running
32-bit toolstacks to hosts running 64-bit toolstacks (and vice-versa).

The legacy format lacks any version information making it difficult to
extend in a compatible way.

The new format has a header (the image header) with version information,
a domain header with basic information of the domain and a stream of
records for the image data.

The format will be used for future domain types (such as on ARM).

The specification is in pandoc format (an extended markdown format) and the
documentation build system is extended to support pandoc format documents.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>

---
v7:
 * Remove SAVING_CPU.
 * More clear statements regarding record order and interleaving.
v6:
 * Add pandoc -> txt/html conversion.
 * Spelling fixes.
 * Introduce SAVING_CPU record.
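
For illustration only (not part of the patch): a sketch of how a restorer
might recognise the image header defined by the specification added in this
patch.  It checks the all-ones marker, the "XENF" id and version 2, reading
every field as big-endian.  The struct and function names are hypothetical,
and treating "bit 0" of the options field as the least significant bit is an
assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IHDR_SIZE           24          /* marker, id, version, options, reserved */
#define IHDR_ID             0x58454E46u /* "XENF" in ASCII */
#define IHDR_VERSION        2u
#define IHDR_OPT_BIG_ENDIAN 0x0001u     /* Assumed: bit 0 is the LSB. */

/* Hypothetical in-memory form of the fields a restorer cares about. */
struct image_header {
    uint32_t id;
    uint32_t version;
    uint16_t options;
};

/* Read big-endian integers regardless of host byte order. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 if buf holds a version 2 image header, -1 otherwise.  A legacy
 * image fails the marker check because it has a zero bit in the first 8
 * octets.  The reserved octets are ignored, as the specification requires. */
static int parse_image_header(const uint8_t buf[IHDR_SIZE],
                              struct image_header *out)
{
    static const uint8_t marker[8] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    if ( memcmp(buf, marker, sizeof(marker)) != 0 )
        return -1;

    out->id      = be32(buf + 8);
    out->version = be32(buf + 12);
    out->options = be16(buf + 16);

    if ( out->id != IHDR_ID || out->version != IHDR_VERSION )
        return -1;

    return 0;
}
```

After a successful parse, (options & IHDR_OPT_BIG_ENDIAN) tells the restorer
which byte order to use for the domain header and the records that follow.
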
---
 docs/Makefile                            |   11 +-
 docs/specs/libxc-migration-stream.pandoc |  672 ++++++++++++++++++++++++++++++
 2 files changed, 682 insertions(+), 1 deletion(-)
 create mode 100644 docs/specs/libxc-migration-stream.pandoc
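
For illustration only (not part of the patch): helpers for the record framing
and the PAGE_DATA pfn[] encoding described in the specification below.  They
compute the zero padding after a record body, test the optional-record bit,
and split a 64-bit pfn[] entry into its type (bits 63-60) and PFN (bits 51-0).
All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REC_TYPE_OPTIONAL 0x80000000u /* bit 31 set: optional record */

/* Octets of zero padding that follow a body of body_length octets, so that
 * the record ends on an 8 octet boundary. */
static uint32_t record_padding(uint32_t body_length)
{
    return (8 - (body_length & 7)) & 7;
}

/* An unrecognised optional record may be skipped; an unrecognised mandatory
 * record must cause the restore to fail. */
static int record_is_optional(uint32_t type)
{
    return (type & REC_TYPE_OPTIONAL) != 0;
}

/* PAGE_DATA pfn[] entry: bits 63-60 hold the page type. */
static unsigned int pfn_entry_type(uint64_t entry)
{
    return (unsigned int)(entry >> 60);
}

/* PAGE_DATA pfn[] entry: bits 51-0 hold the PFN. */
static uint64_t pfn_entry_pfn(uint64_t entry)
{
    return entry & ((UINT64_C(1) << 52) - 1);
}

/* Returns 1 if a page of this type is followed by page_data, 0 if it is not
 * (BROKEN, XALLOC, XTAB), and -1 for the reserved types 0x5-0x8, which a
 * restorer has to reject. */
static int page_type_has_data(unsigned int type)
{
    if ( type <= 0x4 || (type >= 0x9 && type <= 0xC) )
        return 1;
    if ( type >= 0xD && type <= 0xF )
        return 0;
    return -1;
}
```

Counting the entries for which page_type_has_data() returns 1 gives N, the
number of page_data blocks that follow the pfn[] array in a PAGE_DATA record.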

diff --git a/docs/Makefile b/docs/Makefile
index 2c0903b..777998d 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -14,21 +14,26 @@ MAN5SRC-y := $(wildcard man/xl*.pod.5)
 
 MARKDOWNSRC-y := $(wildcard misc/*.markdown)
 
+PANDOCSRC-y += $(wildcard specs/*.pandoc)
+
 TXTSRC-y := $(wildcard misc/*.txt)
 
 
 DOC_MAN1 := $(patsubst man/%.pod.1,man1/%.1,$(MAN1SRC-y))
 DOC_MAN5 := $(patsubst man/%.pod.5,man5/%.5,$(MAN5SRC-y))
 DOC_HTML := $(patsubst %.markdown,html/%.html,$(MARKDOWNSRC-y)) \
+            $(patsubst %.pandoc,html/%.html,$(PANDOCSRC-y)) \
             $(patsubst man/%.pod.1,html/man/%.1.html,$(MAN1SRC-y)) \
             $(patsubst man/%.pod.5,html/man/%.5.html,$(MAN5SRC-y)) \
             $(patsubst %.txt,html/%.txt,$(TXTSRC-y)) \
             $(patsubst %,html/hypercall/%/index.html,$(DOC_ARCHES))
 DOC_TXT  := $(patsubst %.txt,txt/%.txt,$(TXTSRC-y)) \
             $(patsubst %.markdown,txt/%.txt,$(MARKDOWNSRC-y)) \
+            $(patsubst %.pandoc,txt/%.txt,$(PANDOCSRC-y)) \
             $(patsubst man/%.pod.1,txt/man/%.1.txt,$(MAN1SRC-y)) \
             $(patsubst man/%.pod.5,txt/man/%.5.txt,$(MAN5SRC-y))
-DOC_PDF  := $(patsubst %.markdown,pdf/%.pdf,$(MARKDOWNSRC-y))
+DOC_PDF  := $(patsubst %.markdown,pdf/%.pdf,$(MARKDOWNSRC-y)) \
+            $(patsubst %.pandoc,pdf/%.pdf,$(PANDOCSRC-y))
 
 .PHONY: all
 all: build
@@ -191,6 +196,10 @@ pdf/%.pdf: %.markdown
 	$(INSTALL_DIR) $(@D)
 	pandoc -N --toc --standalone $< --output $@
 
+pdf/%.pdf txt/%.txt html/%.html: %.pandoc
+	$(INSTALL_DIR) $(@D)
+	pandoc --number-sections --toc --standalone $< --output $@
+
 ifeq (,$(findstring clean,$(MAKECMDGOALS)))
 $(XEN_ROOT)/config/Docs.mk:
 	$(error You have to run ./configure before building docs)
diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
new file mode 100644
index 0000000..c061186
--- /dev/null
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -0,0 +1,672 @@
+% LibXenCtrl Domain Image Format
+% David Vrabel <<david.vrabel@citrix.com>>
+  Andrew Cooper <<andrew.cooper3@citrix.com>>
+% Draft G
+
+Introduction
+============
+
+Purpose
+-------
+
+The _domain save image_ is the context of a running domain used for
+snapshots of a domain or for transferring domains between hosts during
+migration.
+
+There are a number of problems with the format of the domain save
+image used in Xen 4.4 and earlier (the _legacy format_).
+
+* Dependent on toolstack word size.  A number of fields within the
+  image are native types such as `unsigned long` which have different
+  sizes between 32-bit and 64-bit toolstacks.  This prevents domains
+  from being migrated between hosts running 32-bit and 64-bit
+  toolstacks.
+
+* There is no header identifying the image.
+
+* The image has no version information.
+
+A new format that addresses the above is required.
+
+ARM does not yet have a domain save image format specified and
+the format described in this specification should be suitable.
+
+Not Yet Included
+----------------
+
+The following features are not yet fully specified and will be
+included in a future draft.
+
+* Remus
+
+* Page data compression.
+
+* ARM
+
+
+Overview
+========
+
+The image format consists of two main sections:
+
+* _Headers_
+* _Records_
+
+Headers
+-------
+
+There are two headers: the _image header_, and the _domain header_.
+The image header describes the format of the image (version etc.).
+The _domain header_ contains general information about the domain
+(architecture, type etc.).
+
+Records
+-------
+
+The main part of the format is a sequence of different _records_.
+Each record type contains information about the domain context.  At a
+minimum there is an END record marking the end of the records section.
+
+
+Fields
+------
+
+All the fields within the headers and records have a fixed width.
+
+Fields are always aligned to their size.
+
+Padding and reserved fields are set to zero on save and must be
+ignored during restore.
+
+Integer (numeric) fields in the image header are always in big-endian
+byte order.
+
+Integer fields in the domain header and in the records are in the
+endianness described in the image header (which will typically be the
+native ordering).
+
+\clearpage
+
+Headers
+=======
+
+Image Header
+------------
+
+The image header identifies an image as a Xen domain save image.  It
+includes the version of this specification that the image complies
+with.
+
+Tools supporting version _V_ of the specification shall always save
+images using version _V_.  Tools shall support restoring from version
+_V_.  If the previous Xen release produced version _V_ - 1 images,
+tools shall support restoring from these.  Tools may additionally
+support restoring from earlier versions.
+
+The marker field can be used to distinguish between legacy images and
+those corresponding to this specification.  Legacy images will have
+one or more zero bits within the first 8 octets of the image.
+
+Fields within the image header are always in _big-endian_ byte order,
+regardless of the setting of the endianness bit.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | marker                                          |
+    +-----------------------+-------------------------+
+    | id                    | version                 |
+    +-----------+-----------+-------------------------+
+    | options   | (reserved)                          |
+    +-----------+-------------------------------------+
+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+marker      0xFFFFFFFFFFFFFFFF.
+
+id          0x58454E46 ("XENF" in ASCII).
+
+version     0x00000002.  The version of this specification.
+
+options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
+
+            bit 1-15: Reserved.
+--------------------------------------------------------------------
+
+The endianness shall be 0 (little-endian) for images generated on an
+i386, x86_64, or arm host.
+
+\clearpage
+
+Domain Header
+-------------
+
+The domain header includes general properties of the domain.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-----------+-------------+
+    | type                  | page_shift| (reserved)  |
+    +-----------------------+-----------+-------------+
+    | xen_major             | xen_minor               |
+    +-----------------------+-------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+type        0x0000: Reserved.
+
+            0x0001: x86 PV.
+
+            0x0002: x86 HVM.
+
+            0x0003: x86 PVH.
+
+            0x0004: ARM.
+
+            0x0005 - 0xFFFFFFFF: Reserved.
+
+page_shift  Size of a guest page as a power of two.
+
+            i.e., page size = 2 ^page_shift^.
+
+xen_major   The Xen major version when this image was saved.
+
+xen_minor   The Xen minor version when this image was saved.
+--------------------------------------------------------------------
+
+The legacy stream conversion tool writes a `xen_major` version of 0, and sets
+`xen_minor` to the version of itself.
+
+\clearpage
+
+Records
+=======
+
+A record has a record header, type specific data and a trailing
+footer.  If `body_length` is not a multiple of 8, the body is padded
+with zeroes to align the end of the record on an 8 octet boundary.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x00000000: END
+
+             0x00000001: PAGE_DATA
+
+             0x00000002: X86_PV_INFO
+
+             0x00000003: X86_PV_P2M_FRAMES
+
+             0x00000004: X86_PV_VCPU_BASIC
+
+             0x00000005: X86_PV_VCPU_EXTENDED
+
+             0x00000006: X86_PV_VCPU_XSAVE
+
+             0x00000007: SHARED_INFO
+
+             0x00000008: TSC_INFO
+
+             0x00000009: HVM_CONTEXT
+
+             0x0000000A: HVM_PARAMS
+
+             0x0000000B: TOOLSTACK
+
+             0x0000000C: X86_PV_VCPU_MSRS
+
+             0x0000000D: VERIFY
+
+             0x0000000E - 0x7FFFFFFF: Reserved for future _mandatory_
+             records.
+
+             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
+             records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a multiple
+             of 8 octets.
+--------------------------------------------------------------------
+
+Records may be _mandatory_ or _optional_.  Optional records have bit
+31 set in their type.  Restoring an image that has an unrecognised or
+unsupported mandatory record must fail.  The contents of optional
+records may be ignored during a restore.
+
+The following sub-sections specify the record body format for each of
+the record types.
+
+\clearpage
+
+END
+----
+
+An end record marks the end of the image, and shall be the final record
+in the stream.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The end record contains no fields; its body_length is 0.
+
+\clearpage
+
+PAGE_DATA
+---------
+
+The bulk of an image consists of many PAGE_DATA records containing the
+memory contents.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+    | page_data[0]...                                 |
+    ...
+    +-------------------------------------------------+
+    | page_data[N-1]...                               |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+count       Number of pages described in this record.
+
+pfn         An array of count PFNs and their types.
+
+            Bit 63-60: XEN\_DOMCTL\_PFINFO\_* type (from
+            `public/domctl.h` but shifted by 32 bits)
+
+            Bit 59-52: Reserved.
+
+            Bit 51-0: PFN.
+
+page\_data  page\_size octets of uncompressed page contents for each
+            page set as present in the pfn array.
+--------------------------------------------------------------------
+
+Note: Count is strictly > 0.  N is <= C, and it is possible for there
+to be no page_data in the record if all pfns are of invalid types.
+
+--------------------------------------------------------------------
+PFINFO type    Value      Description
+-------------  ---------  ------------------------------------------
+NOTAB          0x0        Normal page.
+
+L1TAB          0x1        L1 page table page.
+
+L2TAB          0x2        L2 page table page.
+
+L3TAB          0x3        L3 page table page.
+
+L4TAB          0x4        L4 page table page.
+
+               0x5-0x8    Reserved.
+
+L1TAB_PIN      0x9        L1 page table page (pinned).
+
+L2TAB_PIN      0xA        L2 page table page (pinned).
+
+L3TAB_PIN      0xB        L3 page table page (pinned).
+
+L4TAB_PIN      0xC        L4 page table page (pinned).
+
+BROKEN         0xD        Broken page.
+
+XALLOC         0xE        Allocate only.
+
+XTAB           0xF        Invalid page.
+--------------------------------------------------------------------
+
+Table: XEN\_DOMCTL\_PFINFO\_* Page Types.
+
+PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
+corresponding `page_data`.
+
+The saver uses the `XTAB` type for PFNs that become invalid in the
+guest's P2M table during a live migration[^2].
+
+Restoring an image with unrecognised page types shall fail.
+
+[^2]: In the legacy format, this is the list of unmapped PFNs in the
+tail.
+
+\clearpage
+
+X86_PV_INFO
+-----------
+
+     0     1     2     3     4     5     6     7 octet
+    +-----+-----+-----------+-------------------------+
+    | w   | ptl | (reserved)                          |
+    +-----+-----+-----------+-------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+guest_width (w)  Guest width in octets (either 4 or 8).
+
+pt_levels (ptl)  Number of page table levels (either 3 or 4).
+--------------------------------------------------------------------
+
+\clearpage
+
+X86_PV_P2M_FRAMES
+-----------------
+
+     0     1     2     3     4     5     6     7 octet
+    +-----+-----+-----+-----+-------------------------+
+    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
+    +-----+-----+-----+-----+-------------------------+
+    | p2m_pfn[p2m frame containing pfn S]             |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | p2m_pfn[p2m frame containing pfn E]             |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-------------    ---------------------------------------------------
+p2m_start_pfn    First pfn index in the p2m_pfn array.
+
+p2m_end_pfn      Last pfn index in the p2m_pfn array.
+
+p2m_pfn          Array of PFNs containing the guest's P2M table, for
+                 the PFN frames containing the PFN range S to E
+                 (inclusive).
+
+--------------------------------------------------------------------
+
+\clearpage
+
+X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
+----------------------------------------
+
+The formats of these records are identical.  They are all binary blobs
+of data which are accessed using specific pairs of domctl hypercalls.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | vcpu_id               | (reserved)              |
+    +-----------------------+-------------------------+
+    | context...                                      |
+    ...
+    +-------------------------------------------------+
+
+---------------------------------------------------------------------
+Field            Description
+-----------      ----------------------------------------------------
+vcpu_id          The VCPU ID.
+
+context          Binary data for this VCPU.
+---------------------------------------------------------------------
+
+---------------------------------------------------------------------
+Record type                  Accessor hypercalls
+-----------------------      ----------------------------------------
+X86\_PV\_VCPU\_BASIC         XEN\_DOMCTL\_{get,set}vcpucontext
+
+X86\_PV\_VCPU\_EXTENDED      XEN\_DOMCTL\_{get,set}\_ext\_vcpucontext
+
+X86\_PV\_VCPU\_XSAVE         XEN\_DOMCTL\_{get,set}vcpuextstate
+
+X86\_PV\_VCPU\_MSRS          XEN\_DOMCTL\_{get,set}\_vcpu\_msrs
+---------------------------------------------------------------------
+
+\clearpage
+
+SHARED_INFO
+-----------
+
+The content of the Shared Info page.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | shared_info                                     |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+shared_info      Contents of the shared info page.  This record
+                 should be exactly 1 page long.
+--------------------------------------------------------------------
+
+\clearpage
+
+TSC_INFO
+--------
+
+Domain TSC information, as accessed by the
+XEN\_DOMCTL\_{get,set}tscinfo hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | mode                   | khz                    |
+    +------------------------+------------------------+
+    | nsec                                            |
+    +------------------------+------------------------+
+    | incarnation            | (reserved)             |
+    +------------------------+------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+mode             TSC mode, TSC\_MODE\_* constant.
+
+khz              TSC frequency, in kHz.
+
+nsec             Elapsed time, in nanoseconds.
+
+incarnation      Incarnation.
+--------------------------------------------------------------------
+
+\clearpage
+
+HVM_CONTEXT
+-----------
+
+HVM Domain context, as accessed by the
+XEN\_DOMCTL\_{get,set}hvmcontext hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | hvm_ctx                                         |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+hvm_ctx          The HVM Context blob from Xen.
+--------------------------------------------------------------------
+
+\clearpage
+
+HVM_PARAMS
+----------
+
+HVM Domain parameters, as accessed by the
+HVMOP\_{get,set}\_param hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | count (C)              | (reserved)             |
+    +------------------------+------------------------+
+    | param[0].index                                  |
+    +-------------------------------------------------+
+    | param[0].value                                  |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | param[C-1].index                                |
+    +-------------------------------------------------+
+    | param[C-1].value                                |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+count            The number of parameters contained in this record.
+                 Each parameter in the record contains an index and
+                 value.
+
+param index      Parameter index.
+
+param value      Parameter value.
+--------------------------------------------------------------------
+
+\clearpage
+
+TOOLSTACK
+---------
+
+An opaque blob provided by and supplied to the higher layers of the
+toolstack (e.g., libxl) during save and restore.
+
+> This is only temporary -- the intention is the toolstack takes care
+> of this itself.  This record is only present for early development
+> purposes and will be removed before submissions, along with changes
+> to libxl which cause libxl to handle this data itself.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | data                                            |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+data             Blob of toolstack-specific data.
+--------------------------------------------------------------------
+
+\clearpage
+
+VERIFY
+------
+
+A verify record indicates that, while all memory has now been sent, the sender
+shall send further memory records for debugging purposes.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The verify record contains no fields; its body_length is 0.
+
+\clearpage
+
+Layout
+======
+
+The set of valid records depends on the guest architecture and type.  No
+assumptions should be made about the ordering or interleaving of
+independent records.  Record dependencies are noted below.
+
+x86 PV Guest
+------------
+
+A typical save record for an x86 PV guest image would look like:
+
+1. Image header
+2. Domain header
+3. X86\_PV\_INFO record
+4. X86\_PV\_P2M\_FRAMES record
+5. Many PAGE\_DATA records
+6. TSC\_INFO
+7. SHARED\_INFO record
+8. VCPU context records for each online VCPU
+    a. X86\_PV\_VCPU\_BASIC record
+    b. X86\_PV\_VCPU\_EXTENDED record
+    c. X86\_PV\_VCPU\_XSAVE record
+    d. X86\_PV\_VCPU\_MSRS record
+9. END record
+
+There are some strict ordering requirements.  The following records must
+be present in the following order as each of them depends on information
+present in the preceding ones.
+
+1. X86\_PV\_INFO record
+2. X86\_PV\_P2M\_FRAMES record
+3. PAGE\_DATA records
+4. VCPU records
+
+x86 HVM Guest
+-------------
+
+A typical save record for an x86 HVM guest image would look like:
+
+1. Image header
+2. Domain header
+3. Many PAGE\_DATA records
+4. TSC\_INFO
+5. HVM\_PARAMS
+6. HVM\_CONTEXT
+
+HVM\_PARAMS must precede HVM\_CONTEXT, as certain parameters can affect
+the validity of architectural state in the context.
+
+
+Legacy Images (x86 only)
+========================
+
+Restoring legacy images from older tools shall be handled by
+translating the legacy format image into this new format.
+
+It shall not be possible to save in the legacy format.
+
+There are two different legacy images depending on whether they were
+generated by a 32-bit or a 64-bit toolstack. These shall be
+distinguished by inspecting octets 4-7 in the image.  If these are
+zero then it is a 64-bit image.
+
+Toolstack  Field                            Value
+---------  -----                            -----
+64-bit     Bits 32-63 of the p2m_size field 0 (since p2m_size < 2^32^)
+32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
+32-bit     Chunk type (HVM)                 < 0
+32-bit     Page count (HVM)                 > 0
+
+Table: Possible values for octets 4-7 in legacy images
+
+This assumes the presence of the extended-info chunk which was
+introduced in Xen 3.0.
+
+
+Future Extensions
+=================
+
+All changes to the image or domain headers require the image version
+to be increased.
+
+The format may be extended by adding additional record types.
+
+Extending an existing record type must be done by adding a new record
+type.  This allows old images with the old record to still be
+restored.
+
+The image header may only be extended by _appending_ additional
+fields.  In particular, the `marker`, `id` and `version` fields must
+never change size or location.
-- 
1.7.10.4
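
For reviewers: the legacy-image width check described in the "Legacy Images"
section can be sketched as below.  This is only an illustration of the rule
in the spec (octets 4-7 being zero means a 64-bit toolstack), not the code
from this series.  The function name `legacy_toolstack_width` is made up for
the example, and little-endian is assumed because legacy images are x86 only.

```python
import struct

def legacy_toolstack_width(first8):
    """Guess the legacy toolstack bitness from the first 8 octets.

    Hypothetical helper, not part of the series.  Assumes a little-endian
    (x86) legacy image, as the spec restricts legacy images to x86.
    """
    if len(first8) < 8:
        raise ValueError("need at least 8 octets of the image")

    # Octets 4-7: the upper half of a 64-bit p2m_size (always zero, since
    # p2m_size < 2^32), or the next 32-bit field from a 32-bit toolstack.
    (upper,) = struct.unpack_from("<I", first8, 4)

    return 64 if upper == 0 else 32

# 64-bit toolstack: p2m_size written as one 64-bit value.
img64 = struct.pack("<Q", 0x40000)

# 32-bit toolstack, PV guest: 32-bit p2m_size, then the extended-info
# chunk ID of 0xFFFFFFFF.
img32 = struct.pack("<II", 0x40000, 0xFFFFFFFF)

width64 = legacy_toolstack_width(img64)
width32 = legacy_toolstack_width(img32)
```

The check relies on the extended-info chunk being present for 32-bit PV
images, which is the same assumption the spec itself states.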


* [PATCH 07/29] docs: libxl migration stream specification
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (5 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 06/29] docs: libxc migration stream specification Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:45   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 08/29] tools/python: Infrastructure relating to migration v2 streams Andrew Cooper
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 docs/specs/libxl-migration-stream.pandoc |  223 ++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100644 docs/specs/libxl-migration-stream.pandoc

diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
new file mode 100644
index 0000000..88b9fca
--- /dev/null
+++ b/docs/specs/libxl-migration-stream.pandoc
@@ -0,0 +1,223 @@
+% LibXenLight Domain Image Format
+% Andrew Cooper <<andrew.cooper3@citrix.com>>
+% Draft A
+
+Introduction
+============
+
+For the purposes of this document, `xl` is used as a representation of any
+implementer of the `libxl` API.  `xl` should be considered completely
+interchangeable with alternates, such as `libvirt` or `xenopsd-xl`.
+
+Purpose
+-------
+
+The _domain image format_ is the context of a running domain used for
+snapshots of a domain or for transferring domains between hosts during
+migration.
+
+There are a number of problems with the domain image format used in Xen 4.4
+and earlier (the _legacy format_):
+
+* There is no `libxl` context information.  `xl` is required to send certain
+  pieces of `libxl` context itself.
+
+* The contents of the stream are passed directly through `libxl` to `libxc`.
+  The legacy `libxc` format contained some information which belonged at the
+  `libxl` level, resulting in awkward layer violation to return the
+  information back to `libxl`.
+
+* The legacy `libxc` format was inextensible, causing inextensibility in the
+  legacy `libxl` handling.
+
+This design addresses the above points, allowing for a completely
+self-contained, extensible stream with each layer responsible for its own
+appropriate information.
+
+
+Not Yet Included
+----------------
+
+The following features are not yet fully specified and will be
+included in a future draft.
+
+* Remus
+
+* ARM
+
+
+Overview
+========
+
+The image format consists of a _Header_, followed by 1 or more _Records_.
+Each record consists of a type and length field, followed by any type-specific
+data.
+
+\clearpage
+
+Header
+======
+
+The header identifies the stream as a `libxl` stream, including the version of
+this specification that it complies with.
+
+All fields in this header shall be in _big-endian_ byte order, regardless of
+the setting of the endianness bit.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | ident                                           |
+    +-----------------------+-------------------------+
+    | version               | options                 |
+    +-----------------------+-------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+ident       0x4c6962786c466d74 ("LibxlFmt" in ASCII).
+
+version     0x00000002.  The version of this specification.
+
+options     bit 0: Endianness.    0 = little-endian, 1 = big-endian.
+
+            bit 1: Legacy Format. If set, this stream was created by
+                                  the legacy conversion tool.
+
+            bits 2-31: Reserved.
+--------------------------------------------------------------------
+
+The endianness shall be 0 (little-endian) for images generated on an
+i386, x86_64, or arm host.
+
+\clearpage
+
+
+Records
+=======
+
+A record has a record header, type-specific data and trailing padding.  If
+`body_length` is not a multiple of 8, the body is padded with zeroes to
+align the end of the record on an 8 octet boundary.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x00000000: END
+
+             0x00000001: DOMAIN_JSON
+
+             0x00000002: LIBXC_CONTEXT
+
+             0x00000003: XENSTORE_DATA
+
+             0x00000004: EMULATOR_CONTEXT
+
+             0x00000005 - 0x7FFFFFFF: Reserved for future _mandatory_
+             records.
+
+             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
+             records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a multiple
+             of 8 octets.
+--------------------------------------------------------------------
+
+\clearpage
+
+END
+----
+
+An end record marks the end of the image, and shall be the final record
+in the stream.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The end record contains no fields; its body_length is 0.
+
+DOMAIN\_JSON
+-----------
+
+The JSON string containing a description of the entire domain.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | domain json                                     |
+    ...
+    +-------------------------------------------------+
+
+This record is expected to be present for non-legacy streams.  Legacy streams
+did not contain any such information, and the conversion script will not be
+able to make it appear.  `xl` shall be responsible for appropriately handling
+the out-of-band context for legacy streams.
+
+LIBXC\_CONTEXT
+--------------
+
+A libxc context record is a marker, indicating that the stream should be
+handed to `xc_domain_restore()`.  `libxc` shall be responsible for reading its
+own image format from the stream.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The libxc context record contains no fields; its body_length is 0[^1].
+
+
+[^1]: The sending side cannot calculate ahead of time how much data `libxc`
+might write into the stream, especially for live migration where the quantity
+of data is partially proportional to the elapsed time.
+
+XENSTORE\_DATA
+-------------
+
+A record containing xenstore key/value pairs of data.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | xenstore key/value pairs                        |
+    ...
+    +-------------------------------------------------+
+
+EMULATOR\_CONTEXT
+----------------
+
+A context blob for a specific emulator associated with the domain.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | emulator_id            | index                  |
+    +------------------------+------------------------+
+    | emulator_ctx                                    |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+------------     ---------------------------------------------------
+emulator_id      0x00000000: Unknown (In the case of a legacy stream)
+
+                 0x00000001: Qemu Traditional
+
+                 0x00000002: Qemu Upstream
+
+                 0x00000003 - 0xFFFFFFFF: Reserved for future emulators.
+
+index            Index of this emulator for the domain, if multiple
+                 emulators are in use.
+
+emulator_ctx     Emulator context blob.
+--------------------------------------------------------------------
-- 
1.7.10.4
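
For reviewers: a minimal sketch of the libxl header and record framing as
specified above, using Python's struct module.  The constant and function
names here are assumptions made for the example and may not match those in
xen/migration/libxl.py.  Record headers are packed little-endian on the
assumption that the endianness bit is 0, as the spec requires for x86 and
arm hosts.

```python
import struct

# Header: ident, version, options.  Always big-endian per the spec.
HDR_FORMAT = "!QII"
HDR_IDENT = 0x4c6962786c466d74       # "LibxlFmt" in ASCII
HDR_VERSION = 2
HDR_OPT_LEGACY = 1 << 1              # options bit 1: legacy conversion

# Record header: type, body_length.  Assumed little-endian (options bit 0
# clear).
RH_FORMAT = "<II"
REC_TYPE_END = 0x00000000
REC_TYPE_EMULATOR_CONTEXT = 0x00000004

def pack_header(options=0):
    """Return the 16 octet stream header."""
    return struct.pack(HDR_FORMAT, HDR_IDENT, HDR_VERSION, options)

def pack_record(rec_type, body=b""):
    """Return a record: header, body, then 0 to 7 octets of zero padding."""
    padding = -len(body) % 8         # octets needed to reach a multiple of 8
    return (struct.pack(RH_FORMAT, rec_type, len(body))
            + body + b"\x00" * padding)

header = pack_header(HDR_OPT_LEGACY)
end_record = pack_record(REC_TYPE_END)
ctx_record = pack_record(REC_TYPE_EMULATOR_CONTEXT, b"abc")
```

Note that body_length records the unpadded length (3 in the last example),
so a reader has to round up to the next multiple of 8 to find the start of
the following record.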


* [PATCH 08/29] tools/python: Infrastructure relating to migration v2 streams
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (6 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 07/29] docs: libxl " Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-10 17:10 ` [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

* libxc.py and libxl.py contain stream structure and constants, as well as
  verification functions as per the specifications.

* tests.py contains some basic unit tests, usable by the existing unittest
  infrastructure.

* convert-legacy-stream.py will take a legacy migration stream as an input,
  and produce a v2 stream as an output, including reframing a legacy stream
  for libxl to consume.

* verify-stream-v2.py will verify a stream according to the v2
  specification(s), including a libxc stream contained inside a libxl stream.

The files from xen/migration live as part of the regular xen library, while
convert-legacy-stream and verify-stream-v2 are installed as standalone scripts
into PRIVATE_BINDIR.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>

---
v7:
 * Change to xen/migration namespace
 * Libxl stream support
v6:
 * Move to be part of tools/python and installed in proper locations
---
 tools/python/Makefile                         |    5 +
 tools/python/scripts/convert-legacy-stream.py |  671 +++++++++++++++++++++++++
 tools/python/scripts/verify-stream-v2.py      |  173 +++++++
 tools/python/setup.py                         |    1 +
 tools/python/xen/migration/libxc.py           |  434 ++++++++++++++++
 tools/python/xen/migration/libxl.py           |  190 +++++++
 tools/python/xen/migration/tests.py           |   55 ++
 tools/python/xen/migration/verify.py          |   37 ++
 8 files changed, 1566 insertions(+)
 create mode 100755 tools/python/scripts/convert-legacy-stream.py
 create mode 100755 tools/python/scripts/verify-stream-v2.py
 create mode 100644 tools/python/xen/migration/__init__.py
 create mode 100644 tools/python/xen/migration/libxc.py
 create mode 100644 tools/python/xen/migration/libxl.py
 create mode 100644 tools/python/xen/migration/tests.py
 create mode 100644 tools/python/xen/migration/verify.py

diff --git a/tools/python/Makefile b/tools/python/Makefile
index c914332..f1038a1 100644
--- a/tools/python/Makefile
+++ b/tools/python/Makefile
@@ -20,9 +20,14 @@ build: genpath genwrap.py $(XEN_ROOT)/tools/libxl/libxl_types.idl \
 
 .PHONY: install
 install:
+	$(INSTALL_DIR) $(DESTDIR)$(PRIVATE_BINDIR)
+
 	CC="$(CC)" CFLAGS="$(CFLAGS) $(APPEND_LDFLAGS)" $(PYTHON) setup.py install \
 		$(PYTHON_PREFIX_ARG) --root="$(DESTDIR)" --force
 
+	$(INSTALL_PROG) scripts/convert-legacy-stream.py $(DESTDIR)$(PRIVATE_BINDIR)/convert-legacy-stream
+	$(INSTALL_PROG) scripts/verify-stream-v2.py $(DESTDIR)$(PRIVATE_BINDIR)/verify-stream-v2
+
 .PHONY: test
 test:
 	export LD_LIBRARY_PATH=$$(readlink -f ../libxc):$$(readlink -f ../xenstore); $(PYTHON) test.py -b -u
diff --git a/tools/python/scripts/convert-legacy-stream.py b/tools/python/scripts/convert-legacy-stream.py
new file mode 100755
index 0000000..bb54bd5
--- /dev/null
+++ b/tools/python/scripts/convert-legacy-stream.py
@@ -0,0 +1,671 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import sys
+import os, os.path
+import syslog
+import traceback
+
+from struct import calcsize, unpack, pack
+
+from xen.migration import libxc, libxl
+
+__version__ = 1
+
+fin = None             # Input file/fd
+fout = None            # Output file/fd
+twidth = 0             # Legacy toolstack bitness (32 or 64)
+pv = None              # Boolean (pv or hvm)
+qemu = True            # Boolean - process qemu record?
+log_to_syslog = False  # Boolean - Log to syslog instead of stdout/err?
+verbose = False        # Boolean - Summarise stream contents
+
+def stream_read(_ = None):
+    return fin.read(_)
+
+def stream_write(_):
+    return fout.write(_)
+
+def info(msg):
+    """Info message, routed to appropriate destination"""
+    if verbose:
+        if log_to_syslog:
+            for line in msg.split("\n"):
+                syslog.syslog(syslog.LOG_INFO, line)
+        else:
+            print msg
+
+def err(msg):
+    """Error message, routed to appropriate destination"""
+    if log_to_syslog:
+        for line in msg.split("\n"):
+            syslog.syslog(syslog.LOG_ERR, line)
+    print >> sys.stderr, msg
+
+class StreamError(StandardError):
+    pass
+
+class VM(object):
+
+    def __init__(self, fmt):
+        # Common
+        self.p2m_size = 0
+
+        # PV
+        self.max_vcpu_id = 0
+        self.online_vcpu_map = []
+        self.width = 0
+        self.levels = 0
+        self.basic_len = 0
+        self.extd = False
+        self.xsave_len = 0
+
+        # libxl
+        self.libxl = fmt == "libxl"
+        self.xenstore = [] # Deferred "toolstack" records
+
+def write_libxc_ihdr():
+    stream_write(pack(libxc.IHDR_FORMAT,
+                      libxc.IHDR_MARKER,  # Marker
+                      libxc.IHDR_IDENT,   # Ident
+                      libxc.IHDR_VERSION, # Version
+                      libxc.IHDR_OPT_LE,  # Options
+                      0, 0))              # Reserved
+
+def write_libxc_dhdr():
+    if pv:
+        dtype = libxc.DHDR_TYPE_x86_pv
+    else:
+        dtype = libxc.DHDR_TYPE_x86_hvm
+
+    stream_write(pack(libxc.DHDR_FORMAT,
+                      dtype,        # Type
+                      12,           # Page size
+                      0,            # Reserved
+                      0,            # Xen major (converted)
+                      __version__)) # Xen minor (converted)
+
+def write_libxl_hdr():
+    stream_write(pack(libxl.HDR_FORMAT,
+                      libxl.HDR_IDENT,     # Ident
+                      libxl.HDR_VERSION,   # Version 2
+                      libxl.HDR_OPT_LE |   # Options
+                      libxl.HDR_OPT_LEGACY # Little Endian and Legacy
+                      ))
+
+def write_record(rt, *argl):
+    alldata = ''.join(argl)
+    length = len(alldata)
+
+    record = pack(libxc.RH_FORMAT, rt, length) + alldata
+    plen = (8 - (length & 7)) & 7
+    record += '\x00' * plen
+
+    stream_write(record)
+
+def write_libxc_pv_info(vm):
+    write_record(libxc.REC_TYPE_x86_pv_info,
+                 pack(libxc.X86_PV_INFO_FORMAT,
+                      vm.width, vm.levels, 0, 0))
+
+def write_libxc_pv_p2m_frames(vm, pfns):
+    write_record(libxc.REC_TYPE_x86_pv_p2m_frames,
+                 pack(libxc.X86_PV_P2M_FRAMES_FORMAT,
+                      0, vm.p2m_size - 1),
+                 pack("Q" * len(pfns), *pfns))
+
+def write_libxc_pv_vcpu_basic(vcpu_id, data):
+    write_record(libxc.REC_TYPE_x86_pv_vcpu_basic,
+                 pack(libxc.X86_PV_VCPU_HDR_FORMAT, vcpu_id, 0), data)
+
+def write_libxc_pv_vcpu_extd(vcpu_id, data):
+    write_record(libxc.REC_TYPE_x86_pv_vcpu_extended,
+                 pack(libxc.X86_PV_VCPU_HDR_FORMAT, vcpu_id, 0), data)
+
+def write_libxc_pv_vcpu_xsave(vcpu_id, data):
+    write_record(libxc.REC_TYPE_x86_pv_vcpu_xsave,
+                 pack(libxc.X86_PV_VCPU_HDR_FORMAT, vcpu_id, 0), data)
+
+def write_page_data(pfns, pages):
+    if fout is None: # Save copying 1M buffers around for no reason
+        return
+
+    new_pfns = [(((x & 0xf0000000) << 32) | (x & 0x0fffffff)) for x in pfns]
+
+    # Optimise the needless buffer copying in write_record()
+    stream_write(pack(libxc.RH_FORMAT,
+                             libxc.REC_TYPE_page_data,
+                             8 + (len(new_pfns) * 8) + len(pages)))
+    stream_write(pack(libxc.PAGE_DATA_FORMAT, len(new_pfns), 0))
+    stream_write(pack("Q" * len(new_pfns), *new_pfns))
+    stream_write(pages)
+
+def write_libxc_tsc_info(mode, khz, nsec, incarn):
+    write_record(libxc.REC_TYPE_tsc_info,
+                 pack(libxc.TSC_INFO_FORMAT,
+                      mode, khz, nsec, incarn, 0))
+
+def write_libxc_hvm_params(params):
+    if pv:
+        raise StreamError("HVM-only param in PV stream")
+    elif len(params) % 2:
+        raise RuntimeError("Expected even length list of hvm parameters")
+
+    write_record(libxc.REC_TYPE_hvm_params,
+                 pack(libxc.HVM_PARAMS_FORMAT, len(params) / 2, 0),
+                 pack("Q" * len(params), *params))
+
+def write_libxl_end():
+    write_record(libxl.REC_TYPE_end, "")
+
+def write_libxl_libxc_context():
+    write_record(libxl.REC_TYPE_libxc_context, "")
+
+def write_libxl_xenstore_data(data):
+    write_record(libxl.REC_TYPE_xenstore_data, data)
+
+def write_libxl_emulator_context(blob):
+    write_record(libxl.REC_TYPE_emulator_context,
+                 pack(libxl.EMULATOR_CONTEXT_FORMAT,
+                      libxl.EMULATOR_ID_unknown, 0) + blob)
+
+def rdexact(nr_bytes):
+    """Read exactly nr_bytes from fin"""
+    _ = stream_read(nr_bytes)
+    if len(_) != nr_bytes:
+        raise IOError("Stream truncated")
+    return _
+
+def unpack_exact(fmt):
+    """Unpack a format from fin"""
+    sz = calcsize(fmt)
+    return unpack(fmt, rdexact(sz))
+
+def unpack_ulongs(nr_ulongs):
+    if twidth == 32:
+        return unpack_exact("I" * nr_ulongs)
+    else:
+        return unpack_exact("Q" * nr_ulongs)
+
+def read_pv_extended_info(vm):
+
+    marker, = unpack_ulongs(1)
+
+    if twidth == 32:
+        expected = 0xffffffff
+    else:
+        expected = 0xffffffffffffffff
+
+    if marker != expected:
+        raise StreamError("Unexpected extended info marker 0x%x" % (marker, ))
+
+    total_length, = unpack_exact("I")
+    so_far = 0
+
+    info("Extended Info: length 0x%x" % (total_length, ))
+
+    while so_far < total_length:
+
+        blkid, datasz = unpack_exact("=4sI")
+        so_far += 8
+
+        info("  Record type: %s, size 0x%x" % (blkid, datasz))
+
+        data = rdexact(datasz)
+        so_far += datasz
+
+        # Eww, but this is how it is done :(
+        if blkid == "vcpu":
+
+            vm.basic_len = datasz
+
+            if datasz == 0x1430:
+                vm.width = 8
+                vm.levels = 4
+                info("    64bit domain, 4 levels")
+            elif datasz == 0xaf0:
+                vm.width = 4
+                vm.levels = 3
+                info("    32bit domain, 3 levels")
+            else:
+                raise StreamError("Unable to determine guest width/level")
+
+            write_libxc_pv_info(vm)
+
+        elif blkid == "extv":
+            vm.extd = True
+
+        elif blkid == "xcnt":
+            vm.xsave_len, = unpack("I", data[:4])
+            info("xcnt sz 0x%x" % (vm.xsave_len, ))
+
+        else:
+            raise StreamError("Unrecognised extended block")
+
+
+    if so_far != total_length:
+        raise StreamError("Overshot Extended Info size by %d bytes"
+                          % (so_far - total_length,))
+
+def read_pv_p2m_frames(vm):
+    fpp = 4096 / vm.width
+    p2m_frame_len = (vm.p2m_size - 1) / fpp + 1
+
+    info("P2M frames: fpp %d, p2m_frame_len %d" % (fpp, p2m_frame_len))
+    write_libxc_pv_p2m_frames(vm, unpack_ulongs(p2m_frame_len))
+
+def read_pv_tail(vm):
+
+    nr_unmapped_pfns, = unpack_exact("I")
+
+    if nr_unmapped_pfns != 0:
+        # "Unmapped" pfns are bogus
+        _ = unpack_ulongs(nr_unmapped_pfns)
+        info("discarding %d bogus 'unmapped pfns'" % (nr_unmapped_pfns, ))
+        #raise StreamError("Found bogus 'unmapped pfns'")
+
+    for vcpu_id in vm.online_vcpu_map:
+
+        basic = rdexact(vm.basic_len)
+        info("Got VCPU basic (size 0x%x)" % (vm.basic_len, ))
+        write_libxc_pv_vcpu_basic(vcpu_id, basic)
+
+        if vm.extd:
+            extd = rdexact(128)
+            info("Got VCPU extd (size 0x%x)" % (128, ))
+            write_libxc_pv_vcpu_extd(vcpu_id, extd)
+
+        if vm.xsave_len:
+            mask, size = unpack_exact("QQ")
+            assert vm.xsave_len - 16 == size
+
+            xsave = rdexact(size)
+            info("Got VCPU xsave (mask 0x%x, size 0x%x)" % (mask, size))
+            write_libxc_pv_vcpu_xsave(vcpu_id, xsave)
+
+    shinfo = rdexact(4096)
+    info("Got shinfo")
+
+    write_record(libxc.REC_TYPE_shared_info, shinfo)
+    write_record(libxc.REC_TYPE_end, "")
+
+
+def read_chunks(vm):
+
+    hvm_params = []
+
+    while True:
+
+        marker, = unpack_exact("=i")
+        if marker <= 0:
+            info("Chunk: type 0x%x" % (marker, ))
+
+        if marker == 0:
+            info("  End")
+
+            if hvm_params:
+                write_libxc_hvm_params(hvm_params)
+
+            return
+
+        elif marker > 0:
+
+            if marker > 1024:
+                raise StreamError("Page batch (%d) exceeded MAX_BATCH"
+                                  % (marker, ))
+            pfns = unpack_ulongs(marker)
+
+            # xc_domain_save() leaves many XEN_DOMCTL_PFINFO_XTAB records for
+            # sequences of pfns it can't map.  Drop these.
+            pfns = [ x for x in pfns if x != 0xf0000000 ]
+
+            if len(set(pfns)) != len(pfns):
+                raise StreamError("Duplicate pfns in batch")
+
+                # print "0x[",
+                # for pfn in pfns:
+                #     print "%x" % (pfn, ),
+                # print "]"
+
+            nr_pages = len([x for x in pfns if (x & 0xf0000000) < 0xd0000000])
+
+            #print "  Page Batch, %d PFNs, %d pages" % (marker, nr_pages)
+            pages = rdexact(nr_pages * 4096)
+
+            write_page_data(pfns, pages)
+
+        elif marker == -1: # XC_SAVE_ID_ENABLE_VERIFY_MODE
+            # Verify mode... Seemingly nothing to do...
+            pass
+
+        elif marker == -2: # XC_SAVE_ID_VCPU_INFO
+            max_id, = unpack_exact("i")
+
+            if max_id > 4095:
+                raise StreamError("Vcpu max_id out of range: %d > 4095"
+                                  % (max_id, ) )
+
+            vm.max_vcpu_id = max_id
+            bitmap = unpack_exact("Q" * ((max_id // 64) + 1))
+
+            for idx, word in enumerate(bitmap):
+                bit_idx = 0
+
+                while word > 0:
+                    if word & 1:
+                        vm.online_vcpu_map.append((idx * 64) + bit_idx)
+
+                    bit_idx += 1
+                    word >>= 1
+
+            info("  Vcpu info: max_id %d, online map %s"
+                 % (vm.max_vcpu_id, vm.online_vcpu_map))
+
+        elif marker == -3: # XC_SAVE_ID_HVM_IDENT_PT
+            _, ident_pt = unpack_exact("=IQ")
+            info("  EPT Identity Pagetable 0x%x" % (ident_pt, ))
+            hvm_params.extend([12, # HVM_PARAM_IDENT_PT
+                               ident_pt])
+
+        elif marker == -4: # XC_SAVE_ID_HVM_VM86_TSS
+            _, vm86_tss = unpack_exact("=IQ")
+            info("  VM86 TSS: 0x%x" % (vm86_tss, ))
+            hvm_params.extend([15, # HVM_PARAM_VM86_TSS
+                               vm86_tss])
+
+        elif marker == -5: # XC_SAVE_ID_TMEM
+            raise RuntimeError("todo")
+
+        elif marker == -6: # XC_SAVE_ID_TMEM_EXTRA
+            raise RuntimeError("todo")
+
+        elif marker == -7: # XC_SAVE_ID_TSC_INFO
+            mode, nsec, khz, incarn = unpack_exact("=IQII")
+            info("  TSC_INFO: mode %s, %d ns, %d khz, %d incarn"
+                 % (mode, nsec, khz, incarn))
+            write_libxc_tsc_info(mode, khz, nsec, incarn)
+
+        elif marker == -8: # XC_SAVE_ID_HVM_CONSOLE_PFN
+            _, console_pfn = unpack_exact("=IQ")
+            info("  Console pfn 0x%x" % (console_pfn, ))
+            hvm_params.extend([17, # HVM_PARAM_CONSOLE_PFN
+                               console_pfn])
+
+        elif marker == -9: # XC_SAVE_ID_LAST_CHECKPOINT
+            info("  Last Checkpoint")
+            # Nothing to do
+
+        elif marker == -10: # XC_SAVE_ID_HVM_ACPI_IOPORTS_LOCATION
+            _, loc = unpack_exact("=IQ")
+            info("  ACPI ioport location 0x%x" % (loc, ))
+            hvm_params.extend([19, # HVM_PARAM_ACPI_IOPORTS_LOCATION
+                               loc])
+
+        elif marker == -11: # XC_SAVE_ID_HVM_VIRIDIAN
+            _, loc = unpack_exact("=IQ")
+            info("  Viridian location 0x%x" % (loc, ))
+            hvm_params.extend([9, # HVM_PARAM_VIRIDIAN
+                               loc])
+
+        elif marker == -12: # XC_SAVE_ID_COMPRESSED_DATA
+            sz, = unpack_exact("I")
+            data = rdexact(sz)
+            info("  Compressed Data: sz 0x%x" % (sz, ))
+            raise RuntimeError("todo")
+
+        elif marker == -13: # XC_SAVE_ID_ENABLE_COMPRESSION
+            raise RuntimeError("todo")
+
+        elif marker == -14: # XC_SAVE_ID_HVM_GENERATION_ID_ADDR
+            _, genid_loc = unpack_exact("=IQ")
+            info("  Generation ID Address 0x%x" % (genid_loc, ))
+            hvm_params.extend([32, # HVM_PARAM_VM_GENERATION_ID_ADDR
+                               genid_loc])
+
+        elif marker == -15: # XC_SAVE_ID_HVM_PAGING_RING_PFN
+            _, paging_ring_pfn = unpack_exact("=IQ")
+            info("  Paging ring pfn 0x%x" % (paging_ring_pfn, ))
+            hvm_params.extend([27, # HVM_PARAM_PAGING_RING_PFN
+                               paging_ring_pfn])
+
+        elif marker == -16: # XC_SAVE_ID_HVM_ACCESS_RING_PFN
+            _, access_ring_pfn = unpack_exact("=IQ")
+            info("  Access ring pfn 0x%x" % (access_ring_pfn, ))
+            hvm_params.extend([28, # HVM_PARAM_ACCESS_RING_PFN
+                               access_ring_pfn])
+
+        elif marker == -17: # XC_SAVE_ID_HVM_SHARING_RING_PFN
+            _, sharing_ring_pfn = unpack_exact("=IQ")
+            info("  Sharing ring pfn 0x%x" % (sharing_ring_pfn, ))
+            hvm_params.extend([29, # HVM_PARAM_SHARING_RING_PFN
+                               sharing_ring_pfn])
+
+        elif marker == -18: # XC_SAVE_ID_TOOLSTACK
+            sz, = unpack_exact("I")
+
+            if sz:
+                data = rdexact(sz)
+                info("  Toolstack Data: sz 0x%x" % (sz, ))
+
+                if vm.libxl:
+                    vm.xenstore.append(data)
+                else:
+                    info("    Discarding")
+
+        else:
+            raise StreamError("Unrecognised chunk %d" % (marker, ))
+
+def read_hvm_tail(vm):
+
+    io, bufio, store = unpack_exact("QQQ")
+    info("Magic pfns: 0x%x 0x%x 0x%x" % (io, bufio, store))
+    write_libxc_hvm_params([5, io,     # HVM_PARAM_IOREQ_PFN
+                            6, bufio,  # HVM_PARAM_BUFIOREQ_PFN
+                            1, store]) # HVM_PARAM_STORE_PFN
+
+    blobsz, = unpack_exact("I")
+    info("Got HVM Context (0x%x bytes)" % (blobsz, ))
+    blob = rdexact(blobsz)
+
+    write_record(libxc.REC_TYPE_hvm_context, blob)
+    write_record(libxc.REC_TYPE_end, "")
+
+
+
+def read_qemu(vm):
+
+    rawsig = rdexact(21)
+    sig, = unpack("21s", rawsig)
+    info("Qemu signature: %s" % (sig, ))
+
+    if sig == "DeviceModelRecord0002":
+        rawsz = rdexact(4)
+        sz, = unpack("I", rawsz)
+        qdata = rdexact(sz)
+
+        if vm.libxl:
+            write_libxl_emulator_context(qdata)
+        else:
+            stream_write(rawsig)
+            stream_write(rawsz)
+            stream_write(qdata)
+
+    else:
+        raise RuntimeError("Unrecognised Qemu sig '%s'" % (sig, ))
+
+
+def skip_xl_header(fmt):
+    """Skip over an xl header in the stream"""
+
+    hdr = rdexact(32)
+    if hdr != "Xen saved domain, xl format\n \0 \r":
+        raise StreamError("No xl header")
+
+    end, mflags, oflags, optlen = unpack_exact("=IIII")
+
+    if fmt == "libxl":
+        mflags |= 1
+
+    opts = pack("=IIII", end, mflags, oflags, optlen)
+
+    optdata = rdexact(optlen)
+
+    info("Processed xl header")
+
+    stream_write(hdr)
+    stream_write(opts)
+    stream_write(optdata)
+
+def read_legacy_stream(vm):
+
+    try:
+        vm.p2m_size, = unpack_ulongs(1)
+        info("P2M Size: 0x%x" % (vm.p2m_size,))
+
+        if vm.libxl:
+            write_libxl_hdr()
+            write_libxl_libxc_context()
+
+        write_libxc_ihdr()
+        write_libxc_dhdr()
+
+        if pv:
+            read_pv_extended_info(vm)
+            read_pv_p2m_frames(vm)
+
+        read_chunks(vm)
+
+        if pv:
+            read_pv_tail(vm)
+        else:
+            read_hvm_tail(vm)
+
+        if vm.libxl:
+            for x in vm.xenstore:
+                write_libxl_xenstore_data(x)
+
+        if not pv and (vm.libxl or qemu):
+            read_qemu(vm)
+
+        if vm.libxl:
+            write_libxl_end()
+
+    except (IOError, StreamError, ):
+        err("Stream Error:")
+        err(traceback.format_exc())
+        return 1
+
+    except RuntimeError:
+        err("Script Error:")
+        err(traceback.format_exc())
+        err("Please fix me")
+        return 2
+    return 0
+
+def open_file_or_fd(val, mode):
+    """
+    If 'val' looks like a decimal integer, open it as an fd.  If not, try to
+    open it as a regular file.
+    """
+
+    fd = -1
+    try:
+        # Does it look like an integer?
+        try:
+            fd = int(val, 10)
+        except ValueError:
+            pass
+
+        # Try to open it...
+        if fd != -1:
+            return os.fdopen(fd, mode, 0)
+        else:
+            return open(val, mode, 0)
+
+    except StandardError, e:
+        if fd != -1:
+            err("Unable to open fd %d: %s" % (fd, e))
+        else:
+            err("Unable to open file '%s': %s" % (val, e))
+
+    raise SystemExit(1)
+
+
+def main(argv):
+    from optparse import OptionParser
+    global fin, fout, twidth, pv, qemu, verbose
+
+    # Change stdout to be line-buffered.
+    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 1)
+
+    parser = OptionParser(version = __version__,
+                          usage = ("%prog [options] -i INPUT -o OUTPUT"
+                                   " -w WIDTH -g GUEST"),
+                          description =
+                          "Convert a legacy stream to a v2 stream")
+
+    # Required options
+    parser.add_option("-i", "--in", dest = "fin", metavar = "<FD or FILE>",
+                      help = "Legacy input to convert")
+    parser.add_option("-o", "--out", dest = "fout", metavar = "<FD or FILE>",
+                      help = "v2 format output")
+    parser.add_option("-w", "--width", dest = "twidth",
+                      metavar = "<32/64>", choices = ["32", "64"],
+                      help = "Legacy toolstack bitness")
+    parser.add_option("-g", "--guest-type", dest = "gtype",
+                      metavar = "<pv/hvm>", choices = ["pv", "hvm"],
+                      help = "Type of guest in stream")
+
+    # Optional options
+    parser.add_option("-f", "--format", dest = "format",
+                      metavar = "<libxc|libxl>", default = "libxc",
+                      choices = ["libxc", "libxl"],
+                      help = "Desired format of the outgoing stream (defaults to libxc)")
+    parser.add_option("-v", "--verbose", action = "store_true", default = False,
+                      help = "Summarise stream contents")
+    parser.add_option("-x", "--xl", action = "store_true", default = False,
+                      help = ("Is an `xl` header present in the stream?"
+                              " (default no)"))
+    parser.add_option("--skip-qemu", action = "store_true", default = False,
+                      help = ("Skip processing of the qemu tail?"
+                              " (default no)"))
+    parser.add_option("--syslog", action = "store_true", default = False,
+                      help = "Log to syslog instead of stdout")
+
+    opts, _ = parser.parse_args()
+
+    if (opts.fin is None or opts.fout is None or
+        opts.twidth is None or opts.gtype is None):
+
+        parser.print_help(sys.stderr)
+        raise SystemExit(1)
+
+    if opts.syslog:
+        global log_to_syslog
+
+        syslog.openlog("convert-legacy-stream", syslog.LOG_PID)
+        log_to_syslog = True
+
+    fin     = open_file_or_fd(opts.fin,  "rb")
+    fout    = open_file_or_fd(opts.fout, "wb")
+    twidth  = int(opts.twidth)
+    pv      = opts.gtype == "pv"
+    verbose = opts.verbose
+    if opts.skip_qemu:
+        qemu = False
+
+    if opts.xl:
+        skip_xl_header(opts.format)
+
+    rc = read_legacy_stream(VM(opts.format))
+    fout.close()
+
+    return rc
+
+if __name__ == "__main__":
+    try:
+        sys.exit(main(sys.argv))
+    except SystemExit, e:
+        sys.exit(e.code)
+    except KeyboardInterrupt:
+        sys.exit(1)
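
Note (not part of the patch): the XC_SAVE_ID_VCPU_INFO branch of
read_chunks() above turns a bitmap of 64-bit words into a list of online
vcpu ids.  Below is a minimal standalone sketch of that decode.  The helper
name online_vcpus() is hypothetical, and the explicit "=" byte-order prefix
is an assumption made so the example has a fixed layout (the patch itself
uses native "Q").  It is written to run on both Python 2 and Python 3.

```python
from struct import pack, unpack

def online_vcpus(max_id, raw):
    """Return the ids of the bits set in a bitmap of 64-bit words.

    Hypothetical helper for illustration; mirrors the loop in read_chunks().
    """
    nr_words = (max_id // 64) + 1
    words = unpack("=%dQ" % (nr_words, ), raw[:nr_words * 8])

    ids = []
    for idx, word in enumerate(words):
        bit_idx = 0
        while word > 0:
            if word & 1:
                ids.append((idx * 64) + bit_idx)
            bit_idx += 1
            word >>= 1
    return ids

# vcpus 0, 1 and 65 online with max_id 65, so two bitmap words are needed
example = pack("=QQ", 0x3, 0x2)
print(online_vcpus(65, example))
```

Word 0 carries ids 0..63 and word 1 carries ids 64..127, which is why bit 1
of the second word decodes to vcpu 65.
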
diff --git a/tools/python/scripts/verify-stream-v2.py b/tools/python/scripts/verify-stream-v2.py
new file mode 100755
index 0000000..203883e
--- /dev/null
+++ b/tools/python/scripts/verify-stream-v2.py
@@ -0,0 +1,173 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+""" Verify a v2 format migration stream """
+
+import sys
+import struct
+import os, os.path
+import syslog
+import traceback
+
+from xen.migration.verify import StreamError, RecordError
+from xen.migration.libxc import VerifyLibxc
+from xen.migration.libxl import VerifyLibxl
+
+fin = None             # Input file/fd
+log_to_syslog = False  # Boolean - Log to syslog instead of stdout/err?
+verbose = False        # Boolean - Summarise stream contents
+quiet = False          # Boolean - Suppress error printing
+
+def info(msg):
+    """Info message, routed to appropriate destination"""
+    if not quiet and verbose:
+        if log_to_syslog:
+            for line in msg.split("\n"):
+                syslog.syslog(syslog.LOG_INFO, line)
+        else:
+            print msg
+
+def err(msg):
+    """Error message, routed to appropriate destination"""
+    if not quiet:
+        if log_to_syslog:
+            for line in msg.split("\n"):
+                syslog.syslog(syslog.LOG_ERR, line)
+        print >> sys.stderr, msg
+
+def stream_read(_ = None):
+    """Read from input"""
+    return fin.read(_)
+
+def rdexact(nr_bytes):
+    """Read exactly nr_bytes from fin"""
+    _ = stream_read(nr_bytes)
+    if len(_) != nr_bytes:
+        raise IOError("Stream truncated")
+    return _
+
+def unpack_exact(fmt):
+    """Unpack a format from fin"""
+    sz = struct.calcsize(fmt)
+    return struct.unpack(fmt, rdexact(sz))
+
+
+def skip_xl_header():
+    """Skip over an xl header in the stream"""
+
+    hdr = rdexact(32)
+    if hdr != "Xen saved domain, xl format\n \0 \r":
+        raise StreamError("No xl header")
+
+    _, mflags, _, optlen = unpack_exact("=IIII")
+    _ = rdexact(optlen)
+
+    info("Processed xl header")
+
+    if mflags & 1: # Bottom bit in mandatory flags indicates a libxl v2 stream
+        return "libxl"
+    else:
+        return "libxc"
+
+def read_stream(fmt):
+    """ Read an entire stream """
+
+    try:
+        if fmt == "xl":
+            fmt = skip_xl_header()
+
+        if fmt == "libxc":
+            VerifyLibxc(info, stream_read).verify()
+        else:
+            VerifyLibxl(info, stream_read).verify()
+
+    except (IOError, StreamError, RecordError):
+        err("Stream Error:")
+        err(traceback.format_exc())
+        return 1
+
+    except StandardError:
+        err("Script Error:")
+        err(traceback.format_exc())
+        err("Please fix me")
+        return 2
+
+    return 0
+
+def open_file_or_fd(val, mode, buffering):
+    """
+    If 'val' looks like a decimal integer, open it as an fd.  If not, try to
+    open it as a regular file.
+    """
+
+    fd = -1
+    try:
+        # Does it look like an integer?
+        try:
+            fd = int(val, 10)
+        except ValueError:
+            pass
+
+        # Try to open it...
+        if fd != -1:
+            return os.fdopen(fd, mode, buffering)
+        else:
+            return open(val, mode, buffering)
+
+    except StandardError, e:
+        if fd != -1:
+            err("Unable to open fd %d: %s: %s" %
+                (fd, e.__class__.__name__, e))
+        else:
+            err("Unable to open file '%s': %s: %s" %
+                (val, e.__class__.__name__, e))
+
+    raise SystemExit(2)
+
+def main(argv):
+    from optparse import OptionParser
+    global fin, quiet, verbose
+
+    # Change stdout to be line-buffered.
+    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 1)
+
+    parser = OptionParser(usage = "%prog [options]",
+                          description =
+                          "Verify a stream according to the v2 spec")
+
+    # Optional options
+    parser.add_option("-i", "--in", dest = "fin", metavar = "<FD or FILE>",
+                      default = "0",
+                      help = "Stream to verify (defaults to stdin)")
+    parser.add_option("-v", "--verbose", action = "store_true", default = False,
+                      help = "Summarise stream contents")
+    parser.add_option("-q", "--quiet", action = "store_true", default = False,
+                      help = "Suppress all logging/errors")
+    parser.add_option("-f", "--format", dest = "format",
+                      metavar = "<libxc|libxl|xl>", default = "libxc",
+                      choices = ["libxc", "libxl", "xl"],
+                      help = "Format of the incoming stream (defaults to libxc)")
+    parser.add_option("--syslog", action = "store_true", default = False,
+                      help = "Log to syslog instead of stdout")
+
+    opts, _ = parser.parse_args()
+
+    if opts.syslog:
+        global log_to_syslog
+
+        syslog.openlog("verify-stream-v2", syslog.LOG_PID)
+        log_to_syslog = True
+
+    verbose = opts.verbose
+    quiet = opts.quiet
+    fin = open_file_or_fd(opts.fin, "rb", 0)
+
+    return read_stream(opts.format)
+
+if __name__ == "__main__":
+    try:
+        sys.exit(main(sys.argv))
+    except SystemExit, e:
+        sys.exit(e.code)
+    except KeyboardInterrupt:
+        sys.exit(2)
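
Note (not part of the patch): skip_xl_header() above reads a 32-byte magic
string, four 32-bit fields, and then "optlen" bytes of optional data, and
uses bit 0 of the mandatory flags to choose between the libxc and libxl
verifiers.  Below is a rough sketch of the same parse against an in-memory
stream.  classify_xl_stream(), XL_MAGIC and XL_HDR_FORMAT are illustrative
names only, and reading the first field as a byte-order marker is an
assumption based on the name "end" in the conversion script.

```python
import io
from struct import calcsize, pack, unpack

XL_MAGIC = b"Xen saved domain, xl format\n \0 \r"  # 32 bytes, as in the patch
XL_HDR_FORMAT = "=IIII"  # first field, mandatory flags, optional flags, optlen

def classify_xl_stream(stream):
    """Consume an xl header from 'stream'; return "libxl" or "libxc"."""
    magic = stream.read(len(XL_MAGIC))
    if magic != XL_MAGIC:
        raise ValueError("No xl header")

    raw = stream.read(calcsize(XL_HDR_FORMAT))
    if len(raw) != calcsize(XL_HDR_FORMAT):
        raise IOError("Stream truncated")
    _, mflags, _, optlen = unpack(XL_HDR_FORMAT, raw)

    # The optional data is skipped, but it must be fully present
    if len(stream.read(optlen)) != optlen:
        raise IOError("Stream truncated")

    # Bottom bit of the mandatory flags marks a libxl v2 stream
    return "libxl" if mflags & 1 else "libxc"

hdr = XL_MAGIC + pack(XL_HDR_FORMAT, 0x01020304, 1, 0, 4) + b"opts"
print(classify_xl_stream(io.BytesIO(hdr)))
```

After the call the stream is positioned at the first byte following the
optional data, which is where the libxc or libxl stream proper begins.
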
diff --git a/tools/python/setup.py b/tools/python/setup.py
index 17ebb4a..74822f4 100644
--- a/tools/python/setup.py
+++ b/tools/python/setup.py
@@ -43,6 +43,7 @@ setup(name            = 'xen',
       version         = '3.0',
       description     = 'Xen',
       packages        = ['xen',
+                         'xen.migration',
                          'xen.lowlevel',
                         ],
       ext_package = "xen.lowlevel",
diff --git a/tools/python/xen/migration/__init__.py b/tools/python/xen/migration/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tools/python/xen/migration/libxc.py b/tools/python/xen/migration/libxc.py
new file mode 100644
index 0000000..ea74882
--- /dev/null
+++ b/tools/python/xen/migration/libxc.py
@@ -0,0 +1,434 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Libxc Migration v2 streams
+
+Record structures as per docs/specs/libxc-migration-stream.pandoc, and
+verification routines.
+"""
+
+import sys
+
+from struct import calcsize, unpack
+
+from xen.migration.verify import StreamError, RecordError, VerifyBase
+
+# Image Header
+IHDR_FORMAT = "!QIIHHI"
+
+IHDR_MARKER  = 0xffffffffffffffff
+IHDR_IDENT   = 0x58454E46 # "XENF" in ASCII
+IHDR_VERSION = 2
+
+IHDR_OPT_BIT_ENDIAN = 0
+IHDR_OPT_LE = (0 << IHDR_OPT_BIT_ENDIAN)
+IHDR_OPT_BE = (1 << IHDR_OPT_BIT_ENDIAN)
+
+IHDR_OPT_RESZ_MASK = 0xfffe
+
+# Domain Header
+DHDR_FORMAT = "IHHII"
+
+DHDR_TYPE_x86_pv  = 0x00000001
+DHDR_TYPE_x86_hvm = 0x00000002
+DHDR_TYPE_x86_pvh = 0x00000003
+DHDR_TYPE_arm     = 0x00000004
+
+dhdr_type_to_str = {
+    DHDR_TYPE_x86_pv  : "x86 PV",
+    DHDR_TYPE_x86_hvm : "x86 HVM",
+    DHDR_TYPE_x86_pvh : "x86 PVH",
+    DHDR_TYPE_arm     : "ARM",
+}
+
+# Records
+RH_FORMAT = "II"
+
+REC_TYPE_end                  = 0x00000000
+REC_TYPE_page_data            = 0x00000001
+REC_TYPE_x86_pv_info          = 0x00000002
+REC_TYPE_x86_pv_p2m_frames    = 0x00000003
+REC_TYPE_x86_pv_vcpu_basic    = 0x00000004
+REC_TYPE_x86_pv_vcpu_extended = 0x00000005
+REC_TYPE_x86_pv_vcpu_xsave    = 0x00000006
+REC_TYPE_shared_info          = 0x00000007
+REC_TYPE_tsc_info             = 0x00000008
+REC_TYPE_hvm_context          = 0x00000009
+REC_TYPE_hvm_params           = 0x0000000a
+REC_TYPE_toolstack            = 0x0000000b
+REC_TYPE_x86_pv_vcpu_msrs     = 0x0000000c
+REC_TYPE_verify               = 0x0000000d
+
+rec_type_to_str = {
+    REC_TYPE_end                  : "End",
+    REC_TYPE_page_data            : "Page data",
+    REC_TYPE_x86_pv_info          : "x86 PV info",
+    REC_TYPE_x86_pv_p2m_frames    : "x86 PV P2M frames",
+    REC_TYPE_x86_pv_vcpu_basic    : "x86 PV vcpu basic",
+    REC_TYPE_x86_pv_vcpu_extended : "x86 PV vcpu extended",
+    REC_TYPE_x86_pv_vcpu_xsave    : "x86 PV vcpu xsave",
+    REC_TYPE_shared_info          : "Shared info",
+    REC_TYPE_tsc_info             : "TSC info",
+    REC_TYPE_hvm_context          : "HVM context",
+    REC_TYPE_hvm_params           : "HVM params",
+    REC_TYPE_toolstack            : "Toolstack",
+    REC_TYPE_x86_pv_vcpu_msrs     : "x86 PV vcpu msrs",
+    REC_TYPE_verify               : "Verify",
+}
+
+# page_data
+PAGE_DATA_FORMAT             = "II"
+PAGE_DATA_PFN_MASK           = (1L << 52) - 1
+PAGE_DATA_PFN_RESZ_MASK      = ((1L << 60) - 1) & ~((1L << 52) - 1)
+
+# flags from xen/public/domctl.h: XEN_DOMCTL_PFINFO_* shifted by 32 bits
+PAGE_DATA_TYPE_SHIFT         = 60
+PAGE_DATA_TYPE_LTABTYPE_MASK = (0x7L << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LTAB_MASK     = (0xfL << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LPINTAB       = (0x8L << PAGE_DATA_TYPE_SHIFT) # Pinned pagetable
+
+PAGE_DATA_TYPE_NOTAB         = (0x0L << PAGE_DATA_TYPE_SHIFT) # Regular page
+PAGE_DATA_TYPE_L1TAB         = (0x1L << PAGE_DATA_TYPE_SHIFT) # L1 pagetable
+PAGE_DATA_TYPE_L2TAB         = (0x2L << PAGE_DATA_TYPE_SHIFT) # L2 pagetable
+PAGE_DATA_TYPE_L3TAB         = (0x3L << PAGE_DATA_TYPE_SHIFT) # L3 pagetable
+PAGE_DATA_TYPE_L4TAB         = (0x4L << PAGE_DATA_TYPE_SHIFT) # L4 pagetable
+PAGE_DATA_TYPE_BROKEN        = (0xdL << PAGE_DATA_TYPE_SHIFT) # Broken
+PAGE_DATA_TYPE_XALLOC        = (0xeL << PAGE_DATA_TYPE_SHIFT) # Allocate-only
+PAGE_DATA_TYPE_XTAB          = (0xfL << PAGE_DATA_TYPE_SHIFT) # Invalid
+
+# x86_pv_info
+X86_PV_INFO_FORMAT        = "BBHI"
+
+X86_PV_P2M_FRAMES_FORMAT  = "II"
+
+# x86_pv_vcpu_{basic,extended,xsave,msrs}
+X86_PV_VCPU_HDR_FORMAT    = "II"
+
+# tsc_info
+TSC_INFO_FORMAT           = "IIQII"
+
+# hvm_params
+HVM_PARAMS_ENTRY_FORMAT   = "QQ"
+HVM_PARAMS_FORMAT         = "II"
+
+#
+# libxl format
+#
+
+LIBXL_QEMU_SIGNATURE = "DeviceModelRecord0002"
+LIBXL_QEMU_RECORD_HDR = "=%dsI" % (len(LIBXL_QEMU_SIGNATURE), )
+
+class VerifyLibxc(VerifyBase):
+    """ Verify a Libxc v2 stream """
+
+    def __init__(self, info, read):
+        VerifyBase.__init__(self, info, read)
+
+        self.squashed_pagedata_records = 0
+
+
+    def verify(self):
+
+        self.verify_ihdr()
+        self.verify_dhdr()
+
+        while self.verify_record() != REC_TYPE_end:
+            pass
+
+
+    def verify_ihdr(self):
+        """ Verify an Image Header """
+        marker, ident, version, options, res1, res2 = \
+            self.unpack_exact(IHDR_FORMAT)
+
+        if marker != IHDR_MARKER:
+            raise StreamError("Bad image marker: Expected 0x%x, got 0x%x"
+                              % (IHDR_MARKER, marker))
+
+        if ident != IHDR_IDENT:
+            raise StreamError("Bad image id: Expected 0x%x, got 0x%x"
+                              % (IHDR_IDENT, ident))
+
+        if version != IHDR_VERSION:
+            raise StreamError("Unknown image version: Expected %d, got %d"
+                              % (IHDR_VERSION, version))
+
+        if options & IHDR_OPT_RESZ_MASK:
+            raise StreamError("Reserved bits set in image options field: 0x%x"
+                              % (options & IHDR_OPT_RESZ_MASK))
+
+        if res1 != 0 or res2 != 0:
+            raise StreamError("Reserved bits set in image header: 0x%04x:0x%08x"
+                              % (res1, res2))
+
+        if ( (sys.byteorder == "little") and
+             ((options & IHDR_OPT_BE) != IHDR_OPT_LE) ):
+            raise StreamError("Stream is not native endianness - unable to validate")
+
+        endian = ["little", "big"][options & IHDR_OPT_BE]
+        self.info("Libxc Image Header: %s endian" % (endian, ))
+
+
+    def verify_dhdr(self):
+        """ Verify a domain header """
+
+        gtype, page_shift, res1, major, minor = \
+            self.unpack_exact(DHDR_FORMAT)
+
+        if gtype not in dhdr_type_to_str:
+            raise StreamError("Unrecognised domain type 0x%x" % (gtype, ))
+
+        if res1 != 0:
+            raise StreamError("Reserved bits set in domain header 0x%04x"
+                              % (res1, ))
+
+        if page_shift != 12:
+            raise StreamError("Page shift expected to be 12.  Got %d"
+                              % (page_shift, ))
+
+        if major == 0:
+            self.info("Domain Header: legacy converted %s"
+                      % (dhdr_type_to_str[gtype], ))
+        else:
+            self.info("Domain Header: %s from Xen %d.%d"
+                      % (dhdr_type_to_str[gtype], major, minor))
+
+
+    def verify_record(self):
+        """ Verify an individual record """
+
+        rtype, length = self.unpack_exact(RH_FORMAT)
+
+        if rtype not in rec_type_to_str:
+            raise StreamError("Unrecognised record type 0x%x" % (rtype, ))
+
+        contentsz = (length + 7) & ~7
+        content = self.rdexact(contentsz)
+
+        if rtype != REC_TYPE_page_data:
+
+            if self.squashed_pagedata_records > 0:
+                self.info("Squashed %d Page Data records together"
+                          % (self.squashed_pagedata_records, ))
+                self.squashed_pagedata_records = 0
+
+            self.info("Libxc Record: %s, length %d"
+                      % (rec_type_to_str[rtype], length))
+
+        else:
+            self.squashed_pagedata_records += 1
+
+        padding = content[length:]
+        if padding != "\x00" * len(padding):
+            raise StreamError("Padding contains non-zero bytes")
+
+        if rtype not in record_verifiers:
+            raise RuntimeError("No verification function for libxc record '%s'"
+                               % rec_type_to_str[rtype])
+        else:
+            record_verifiers[rtype](self, content[:length])
+
+        return rtype
+
+
+    def verify_record_end(self, content):
+        """ End record """
+
+        if len(content) != 0:
+            raise RecordError("End record with non-zero length")
+
+
+    def verify_record_page_data(self, content):
+        """ Page Data record """
+        minsz = calcsize(PAGE_DATA_FORMAT)
+
+        if len(content) <= minsz:
+            raise RecordError("PAGE_DATA record must be at least %d bytes long"
+                              % (minsz, ))
+
+        count, res1 = unpack(PAGE_DATA_FORMAT, content[:minsz])
+
+        if res1 != 0:
+            raise StreamError("Reserved bits set in PAGE_DATA record 0x%04x"
+                              % (res1, ))
+
+        pfnsz = count * 8
+        if (len(content) - minsz) < pfnsz:
+            raise RecordError("PAGE_DATA record must contain a pfn record for "
+                              "each count")
+
+        pfns = list(unpack("=%dQ" % (count,), content[minsz:minsz + pfnsz]))
+
+        nr_pages = 0
+        for idx, pfn in enumerate(pfns):
+
+            if pfn & PAGE_DATA_PFN_RESZ_MASK:
+                raise RecordError("Reserved bits set in pfn[%d]: 0x%016x"
+                                  % (idx, pfn & PAGE_DATA_PFN_RESZ_MASK))
+
+            if pfn >> PAGE_DATA_TYPE_SHIFT in (5, 6, 7, 8):
+                raise RecordError("Invalid type value in pfn[%d]: 0x%016x"
+                                  % (idx, pfn & PAGE_DATA_TYPE_LTAB_MASK))
+
+            # We expect page data for each normal page or pagetable
+            if PAGE_DATA_TYPE_NOTAB <= (pfn & PAGE_DATA_TYPE_LTABTYPE_MASK) \
+                    <= PAGE_DATA_TYPE_L4TAB:
+                nr_pages += 1
+
+        pagesz = nr_pages * 4096
+        if len(content) != minsz + pfnsz + pagesz:
+            raise RecordError("Expected %u + %u + %u, got %u"
+                              % (minsz, pfnsz, pagesz, len(content)))
+
+
+    def verify_record_x86_pv_info(self, content):
+        """ x86 PV Info record """
+
+        expectedsz = calcsize(X86_PV_INFO_FORMAT)
+        if len(content) != expectedsz:
+            raise RecordError("x86_pv_info: expected length of %d, got %d"
+                              % (expectedsz, len(content)))
+
+        width, levels, res1, res2 = unpack(X86_PV_INFO_FORMAT, content)
+
+        if width not in (4, 8):
+            raise RecordError("Expected width of 4 or 8, got %d" % (width, ))
+
+        if levels not in (3, 4):
+            raise RecordError("Expected levels of 3 or 4, got %d" % (levels, ))
+
+        if res1 != 0 or res2 != 0:
+            raise StreamError("Reserved bits set in X86_PV_INFO: 0x%04x 0x%08x"
+                              % (res1, res2))
+
+        bitness = {4:32, 8:64}[width]
+        self.info("  %sbit guest, %d levels of pagetables" % (bitness, levels))
+
+
+    def verify_record_x86_pv_p2m_frames(self, content):
+        """ x86 PV p2m frames record """
+
+        if len(content) % 8 != 0:
+            raise RecordError("Length expected to be a multiple of 8, not %d"
+                              % (len(content), ))
+
+        start, end = unpack("=II", content[:8])
+        self.info("  Start pfn 0x%x, End 0x%x" % (start, end))
+
+
+    def verify_record_x86_pv_vcpu_generic(self, content, name):
+        """ Generic for all REC_TYPE_x86_pv_vcpu_{basic,extended,xsave,msrs} """
+        minsz = calcsize(X86_PV_VCPU_HDR_FORMAT)
+
+        if len(content) <= minsz:
+            raise RecordError("X86_PV_VCPU_%s record length must be at least %d"
+                              " bytes long" % (name, minsz))
+
+        vcpuid, res1 = unpack(X86_PV_VCPU_HDR_FORMAT, content[:minsz])
+
+        if res1 != 0:
+            raise StreamError("Reserved bits set in x86_pv_vcpu_%s record 0x%04x"
+                              % (name, res1))
+
+        self.info("  vcpu%d %s context, %d bytes"
+                  % (vcpuid, name, len(content) - minsz))
+
+
+    def verify_record_shared_info(self, content):
+        """ shared info record """
+
+        if len(content) != 4096:
+            raise RecordError("Length expected to be 4096 bytes, not %d"
+                              % (len(content), ))
+
+
+    def verify_record_tsc_info(self, content):
+        """ tsc info record """
+
+        sz = calcsize(TSC_INFO_FORMAT)
+
+        if len(content) != sz:
+            raise RecordError("Length should be %u bytes" % (sz, ))
+
+        mode, khz, nsec, incarn, res1 = unpack(TSC_INFO_FORMAT, content)
+
+        if res1 != 0:
+            raise StreamError("Reserved bits set in TSC_INFO: 0x%08x"
+                              % (res1, ))
+
+        self.info("  Mode %u, %u kHz, %u ns, incarnation %d"
+                  % (mode, khz, nsec, incarn))
+
+
+    def verify_record_hvm_context(self, content):
+        """ hvm context record """
+
+        if len(content) == 0:
+            raise RecordError("Zero length HVM context")
+
+
+    def verify_record_hvm_params(self, content):
+        """ hvm params record """
+
+        sz = calcsize(HVM_PARAMS_FORMAT)
+
+        if len(content) < sz:
+            raise RecordError("Length should be at least %u bytes" % (sz, ))
+
+        count, rsvd = unpack(HVM_PARAMS_FORMAT, content[:sz])
+
+        if rsvd != 0:
+            raise RecordError("Reserved field not zero (0x%04x)" % (rsvd, ))
+
+        sz += count * calcsize(HVM_PARAMS_ENTRY_FORMAT)
+
+        if len(content) != sz:
+            raise RecordError("Length should be %u bytes" % (sz, ))
+
+
+    def verify_record_toolstack(self, _):
+        """ toolstack record """
+        self.info("  TODO: remove")
+
+    def verify_record_verify(self, content):
+        """ verify record """
+
+        if len(content) != 0:
+            raise RecordError("Verify record with non-zero length")
+
+
+record_verifiers = {
+    REC_TYPE_end:
+        VerifyLibxc.verify_record_end,
+    REC_TYPE_page_data:
+        VerifyLibxc.verify_record_page_data,
+
+    REC_TYPE_x86_pv_info:
+        VerifyLibxc.verify_record_x86_pv_info,
+    REC_TYPE_x86_pv_p2m_frames:
+        VerifyLibxc.verify_record_x86_pv_p2m_frames,
+
+    REC_TYPE_x86_pv_vcpu_basic:
+        lambda s, x: VerifyLibxc.verify_record_x86_pv_vcpu_generic(s, x, "basic"),
+    REC_TYPE_x86_pv_vcpu_extended:
+        lambda s, x: VerifyLibxc.verify_record_x86_pv_vcpu_generic(s, x, "extended"),
+    REC_TYPE_x86_pv_vcpu_xsave:
+        lambda s, x: VerifyLibxc.verify_record_x86_pv_vcpu_generic(s, x, "xsave"),
+    REC_TYPE_x86_pv_vcpu_msrs:
+        lambda s, x: VerifyLibxc.verify_record_x86_pv_vcpu_generic(s, x, "msrs"),
+
+    REC_TYPE_shared_info:
+        VerifyLibxc.verify_record_shared_info,
+    REC_TYPE_tsc_info:
+        VerifyLibxc.verify_record_tsc_info,
+
+    REC_TYPE_hvm_context:
+        VerifyLibxc.verify_record_hvm_context,
+    REC_TYPE_hvm_params:
+        VerifyLibxc.verify_record_hvm_params,
+    REC_TYPE_toolstack:
+        VerifyLibxc.verify_record_toolstack,
+    REC_TYPE_verify:
+        VerifyLibxc.verify_record_verify,
+    }
diff --git a/tools/python/xen/migration/libxl.py b/tools/python/xen/migration/libxl.py
new file mode 100644
index 0000000..b2d6dcb
--- /dev/null
+++ b/tools/python/xen/migration/libxl.py
@@ -0,0 +1,190 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Libxl Migration v2 streams
+
+Record structures as per docs/specs/libxl-migration-stream.pandoc, and
+verification routines.
+"""
+
+import sys
+
+from struct import calcsize, unpack
+from xen.migration.verify import StreamError, RecordError, VerifyBase
+from xen.migration.libxc import VerifyLibxc
+
+# Header
+HDR_FORMAT = "!QII"
+
+HDR_IDENT = 0x4c6962786c466d74 # "LibxlFmt" in ASCII
+HDR_VERSION = 2
+
+HDR_OPT_BIT_ENDIAN = 0
+HDR_OPT_BIT_LEGACY = 1
+
+HDR_OPT_LE     = (0 << HDR_OPT_BIT_ENDIAN)
+HDR_OPT_BE     = (1 << HDR_OPT_BIT_ENDIAN)
+HDR_OPT_LEGACY = (1 << HDR_OPT_BIT_LEGACY)
+
+HDR_OPT_RESZ_MASK = 0xfffc
+
+# Records
+RH_FORMAT = "II"
+
+REC_TYPE_end              = 0x00000000
+REC_TYPE_domain_json      = 0x00000001
+REC_TYPE_libxc_context    = 0x00000002
+REC_TYPE_xenstore_data    = 0x00000003
+REC_TYPE_emulator_context = 0x00000004
+
+rec_type_to_str = {
+    REC_TYPE_end              : "End",
+    REC_TYPE_domain_json      : "Domain JSON",
+    REC_TYPE_libxc_context    : "Libxc context",
+    REC_TYPE_xenstore_data    : "Xenstore data",
+    REC_TYPE_emulator_context : "Emulator context",
+}
+
+# emulator_context
+EMULATOR_CONTEXT_FORMAT = "II"
+
+EMULATOR_ID_unknown       = 0x00000000
+EMULATOR_ID_qemu_trad     = 0x00000001
+EMULATOR_ID_qemu_upstream = 0x00000002
+
+emulator_id_to_str = {
+    EMULATOR_ID_unknown       : "Unknown",
+    EMULATOR_ID_qemu_trad     : "Qemu Traditional",
+    EMULATOR_ID_qemu_upstream : "Qemu Upstream",
+}
+
+
+class VerifyLibxl(VerifyBase):
+    """ Verify a Libxl v2 stream """
+
+    def __init__(self, info, read):
+        VerifyBase.__init__(self, info, read)
+
+
+    def verify(self):
+
+        self.verify_hdr()
+
+        while self.verify_record() != REC_TYPE_end:
+            pass
+
+
+    def verify_hdr(self):
+        """ Verify a Header """
+        ident, version, options = self.unpack_exact(HDR_FORMAT)
+
+        if ident != HDR_IDENT:
+            raise StreamError("Bad image id: Expected 0x%x, got 0x%x"
+                              % (HDR_IDENT, ident))
+
+        if version != HDR_VERSION:
+            raise StreamError("Unknown image version: Expected %d, got %d"
+                              % (HDR_VERSION, version))
+
+        if options & HDR_OPT_RESZ_MASK:
+            raise StreamError("Reserved bits set in image options field: 0x%x"
+                              % (options & HDR_OPT_RESZ_MASK))
+
+        if ( (sys.byteorder == "little") and
+             ((options & HDR_OPT_BE) != HDR_OPT_LE) ):
+            raise StreamError("Stream is not native endianness - unable to validate")
+
+        endian = ["little", "big"][options & HDR_OPT_BE]
+
+        if options & HDR_OPT_LEGACY:
+            self.info("Libxl Header: %s endian, legacy converted" % (endian, ))
+        else:
+            self.info("Libxl Header: %s endian" % (endian, ))
+
+
+    def verify_record(self):
+        """ Verify an individual record """
+        rtype, length = self.unpack_exact(RH_FORMAT)
+
+        if rtype not in rec_type_to_str:
+            raise StreamError("Unrecognised record type %x" % (rtype, ))
+
+        self.info("Libxl Record: %s, length %d"
+                  % (rec_type_to_str[rtype], length))
+
+        contentsz = (length + 7) & ~7
+        content = self.rdexact(contentsz)
+
+        padding = content[length:]
+        if padding != "\x00" * len(padding):
+            raise StreamError("Padding containing non-zero bytes found")
+
+        if rtype not in record_verifiers:
+            raise RuntimeError("No verification function for libxl record '%s'"
+                               % rec_type_to_str[rtype])
+        else:
+            record_verifiers[rtype](self, content[:length])
+
+        return rtype
+
+
+    def verify_record_end(self, content):
+        """ End record """
+
+        if len(content) != 0:
+            raise RecordError("End record with non-zero length")
+
+
+    def verify_record_domain_json(self, content):
+        """ Domain JSON record """
+
+        if len(content) == 0:
+            raise RecordError("Domain JSON record with zero length")
+
+
+    def verify_record_libxc_context(self, content):
+        """ Libxc context record """
+
+        if len(content) != 0:
+            raise RecordError("Libxc context record with non-zero length")
+
+        # Verify the libxc stream, as we can't seek forwards through it
+        VerifyLibxc(self.info, self.read).verify()
+
+
+    def verify_record_xenstore_data(self, content):
+        """ Xenstore Data record """
+
+        if len(content) == 0:
+            raise RecordError("Xenstore data record with zero length")
+
+
+    def verify_record_emulator_context(self, content):
+        """ Emulator Context record """
+        minsz = calcsize(EMULATOR_CONTEXT_FORMAT)
+
+        if len(content) < minsz:
+            raise RecordError("Length must be at least %d bytes, got %d"
+                              % (minsz, len(content)))
+
+        emu_id, emu_idx = unpack(EMULATOR_CONTEXT_FORMAT, content[:minsz])
+
+        if emu_id not in emulator_id_to_str:
+            raise RecordError("Unrecognised emulator id 0x%x" % (emu_id, ))
+
+        self.info("  Index %d, type %s" % (emu_idx, emulator_id_to_str[emu_id]))
+
+
+record_verifiers = {
+    REC_TYPE_end:
+        VerifyLibxl.verify_record_end,
+    REC_TYPE_domain_json:
+        VerifyLibxl.verify_record_domain_json,
+    REC_TYPE_libxc_context:
+        VerifyLibxl.verify_record_libxc_context,
+    REC_TYPE_xenstore_data:
+        VerifyLibxl.verify_record_xenstore_data,
+    REC_TYPE_emulator_context:
+        VerifyLibxl.verify_record_emulator_context,
+}
diff --git a/tools/python/xen/migration/tests.py b/tools/python/xen/migration/tests.py
new file mode 100644
index 0000000..ab2924f
--- /dev/null
+++ b/tools/python/xen/migration/tests.py
@@ -0,0 +1,55 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Unit tests for migration v2 streams
+"""
+
+import unittest
+
+from struct import calcsize
+
+from xen.migration import libxc, libxl
+
+class TestLibxc(unittest.TestCase):
+
+    def test_format_sizes(self):
+
+        for fmt, sz in ( (libxc.IHDR_FORMAT, 24),
+                         (libxc.DHDR_FORMAT, 16),
+                         (libxc.RH_FORMAT, 8),
+
+                         (libxc.PAGE_DATA_FORMAT, 8),
+                         (libxc.X86_PV_INFO_FORMAT, 8),
+                         (libxc.X86_PV_P2M_FRAMES_FORMAT, 8),
+                         (libxc.X86_PV_VCPU_HDR_FORMAT, 8),
+                         (libxc.TSC_INFO_FORMAT, 24),
+                         (libxc.HVM_PARAMS_ENTRY_FORMAT, 16),
+                         (libxc.HVM_PARAMS_FORMAT, 8),
+                         ):
+            self.assertEqual(calcsize(fmt), sz)
+
+
+class TestLibxl(unittest.TestCase):
+
+    def test_format_sizes(self):
+
+        for fmt, sz in ( (libxl.HDR_FORMAT, 16),
+                         (libxl.RH_FORMAT, 8),
+
+                         (libxl.EMULATOR_CONTEXT_FORMAT, 8),
+                         ):
+            self.assertEqual(calcsize(fmt), sz)
+
+
+
+def test_suite():
+    suite = unittest.TestSuite()
+
+    suite.addTest(unittest.makeSuite(TestLibxc))
+    suite.addTest(unittest.makeSuite(TestLibxl))
+
+    return suite
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/tools/python/xen/migration/verify.py b/tools/python/xen/migration/verify.py
new file mode 100644
index 0000000..7a42dbf
--- /dev/null
+++ b/tools/python/xen/migration/verify.py
@@ -0,0 +1,37 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Common verification infrastructure for v2 streams
+"""
+
+from struct import calcsize, unpack
+
+class StreamError(StandardError):
+    """Error with the stream"""
+    pass
+
+class RecordError(StandardError):
+    """Error with a record in the stream"""
+    pass
+
+
+class VerifyBase(object):
+
+    def __init__(self, info, read):
+
+        self.info = info
+        self.read = read
+
+    def rdexact(self, nr_bytes):
+        """Read exactly nr_bytes from the stream"""
+        _ = self.read(nr_bytes)
+        if len(_) != nr_bytes:
+            raise IOError("Stream truncated")
+        return _
+
+    def unpack_exact(self, fmt):
+        """Unpack a struct format string from the stream"""
+        sz = calcsize(fmt)
+        return unpack(fmt, self.rdexact(sz))
+
-- 
1.7.10.4
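
The record framing checked by verify_record() above (an "II" header of type
and length, followed by a body zero-padded to an 8 octet boundary) can be
sketched in isolation as follows.  This is a minimal illustration written for
Python 3 bytes; pack_record() and read_record() are hypothetical helpers and
are not part of the patch.

```python
import io
from struct import calcsize, pack, unpack

RH_FORMAT = "II"   # record header: type, length (same format as the patch)
REC_ALIGN = 8      # record bodies are padded up to an 8 octet boundary


def pack_record(rtype, body):
    """Frame body as header + body + zero padding (hypothetical helper)."""
    padded = (len(body) + REC_ALIGN - 1) & ~(REC_ALIGN - 1)
    return pack(RH_FORMAT, rtype, len(body)) + body.ljust(padded, b"\x00")


def read_record(read):
    """Read one record through read(n) and return (type, body).

    Raises IOError on a truncated stream and ValueError on bad padding.
    """
    hdrsz = calcsize(RH_FORMAT)
    hdr = read(hdrsz)
    if len(hdr) != hdrsz:
        raise IOError("Stream truncated")

    rtype, length = unpack(RH_FORMAT, hdr)

    # The length field excludes padding, so round up before reading.
    contentsz = (length + REC_ALIGN - 1) & ~(REC_ALIGN - 1)
    content = read(contentsz)
    if len(content) != contentsz:
        raise IOError("Stream truncated")

    padding = content[length:]
    if padding != b"\x00" * len(padding):
        raise ValueError("Padding containing non-zero bytes found")

    return rtype, content[:length]


if __name__ == "__main__":
    stream = io.BytesIO(pack_record(1, b"hello") + pack_record(0, b""))
    print(read_record(stream.read))
    print(read_record(stream.read))
```

The length in the header deliberately excludes the padding, which is why the
reader rounds up before pulling the body from the stream.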

* [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (7 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 08/29] tools/python: Infrastructure relating to migration v2 streams Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:34   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 10/29] tools/libxc: C implementation of stream format Andrew Cooper
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

For testing purposes, the environment variable "XG_MIGRATION_V2" provides a
runtime switch, allowing the two save/restore codepaths to coexist.

It is intended that once this series is less RFC, the v2 framework will
completely replace v1.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
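The switch tests getenv() for a non-NULL result, so it is the presence of the
variable that matters and not its value: XG_MIGRATION_V2=0, or even an empty
value, still selects the v2 codepath.  A small model of that behaviour, for
illustration only (use_migration_v2() and domain_save() are hypothetical names
and do not exist in the tree):

```python
import os


def use_migration_v2(environ=os.environ):
    """Mirror of 'if ( getenv("XG_MIGRATION_V2") )' in the C code.

    True whenever the variable is set at all, including "" and "0".
    """
    return environ.get("XG_MIGRATION_V2") is not None


def domain_save(save_v1, save_v2, environ=os.environ):
    """Dispatch to the v2 implementation when the switch is present."""
    if use_migration_v2(environ):
        return save_v2()
    return save_v1()


if __name__ == "__main__":
    print(domain_save(lambda: "v1", lambda: "v2", environ={}))
    print(domain_save(lambda: "v1", lambda: "v2",
                      environ={"XG_MIGRATION_V2": "0"}))
```

To return to the legacy path the variable has to be unset, not set to 0.
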
 tools/libxc/Makefile              |    1 +
 tools/libxc/saverestore/common.h  |   15 +++++++++++++++
 tools/libxc/saverestore/restore.c |   23 +++++++++++++++++++++++
 tools/libxc/saverestore/save.c    |   19 +++++++++++++++++++
 tools/libxc/xc_domain_restore.c   |    8 ++++++++
 tools/libxc/xc_domain_save.c      |    6 ++++++
 tools/libxc/xenguest.h            |   13 +++++++++++++
 7 files changed, 85 insertions(+)
 create mode 100644 tools/libxc/saverestore/common.h
 create mode 100644 tools/libxc/saverestore/restore.c
 create mode 100644 tools/libxc/saverestore/save.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 3b04027..564e805 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -45,6 +45,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c xc_suspend.c
 ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += $(wildcard saverestore/*.c)
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
 else
 GUEST_SRCS-y += xc_nomigrate.c
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
new file mode 100644
index 0000000..f1aff44
--- /dev/null
+++ b/tools/libxc/saverestore/common.h
@@ -0,0 +1,15 @@
+#ifndef __COMMON__H
+#define __COMMON__H
+
+#include "../xg_private.h"
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
new file mode 100644
index 0000000..6624baa
--- /dev/null
+++ b/tools/libxc/saverestore/restore.c
@@ -0,0 +1,23 @@
+#include "common.h"
+
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
new file mode 100644
index 0000000..f6ad734
--- /dev/null
+++ b/tools/libxc/saverestore/save.c
@@ -0,0 +1,19 @@
+#include "common.h"
+
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index b411126..cec2dfc 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1490,6 +1490,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     struct restore_ctx *ctx = &_ctx;
     struct domain_info_context *dinfo = &ctx->dinfo;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_restore2(
+            xch, io_fd, dom, store_evtchn, store_mfn,
+            store_domid, console_evtchn, console_mfn, console_domid,
+            hvm,  pae,  superpages, checkpointed_stream, callbacks);
+    }
+
     DPRINTF("%s: starting restore of new domid %u", __func__, dom);
 
     pagebuf_init(&pagebuf);
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 02544f8..a23ed68 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -894,6 +894,12 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     int completed = 0;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_save2(xch, io_fd, dom, max_iters,
+                               max_factor, flags, callbacks, hvm);
+    }
+
     DPRINTF("%s: starting save of domid %u", __func__, dom);
 
     if ( hvm && !callbacks->switch_qemu_logdirty )
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 40bbac8..55755cf 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -88,6 +88,10 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                    uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
                    struct save_callbacks* callbacks, int hvm);
 
+/* Domain Save v2 */
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm);
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
@@ -124,6 +128,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int hvm, unsigned int pae, int superpages,
                       int checkpointed_stream,
                       struct restore_callbacks *callbacks);
+
+/* Domain Restore v2 */
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks);
 /**
  * xc_domain_restore writes a file to disk that contains the device
  * model saved state.
-- 
1.7.10.4

* [PATCH 10/29] tools/libxc: C implementation of stream format
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (8 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:48   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 11/29] tools/libxc: noarch common code Andrew Cooper
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Provide the C structures matching the binary (wire) format of the new
stream format.  All header/record fields are naturally aligned and
explicit padding fields are used to ensure the correct layout (i.e.,
there is no need for any non-standard structure packing pragma or
attribute).

Provide some helper functions for converting types to string for
diagnostic purposes.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
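The claim that natural alignment removes the need for packing attributes can
be cross-checked from Python.  In the sketch below the format strings are
transcribed by hand from the C structures in this patch (they are assumptions
made for illustration, not necessarily the strings used by libxc.py), and the
sizes are the ones asserted in build_assertions().  Flexible array members are
left out.

```python
from struct import calcsize

# struct name -> (hand-written format string, size expected by the patch)
LAYOUTS = {
    "xc_sr_ihdr":                 ("QIIHHI", 24),
    "xc_sr_dhdr":                 ("IHHII", 16),
    "xc_sr_rhdr":                 ("II", 8),
    "xc_sr_rec_tsc_info":         ("IIQII", 24),
    "xc_sr_rec_hvm_params_entry": ("QQ", 16),
}


def check_layouts():
    """Return {name: size}, asserting that no hidden padding is needed."""
    sizes = {}
    for name, (fmt, expected) in LAYOUTS.items():
        packed = calcsize("=" + fmt)    # standard sizes, no alignment padding
        aligned = calcsize("@" + fmt)   # native sizes and alignment
        assert packed == expected, name
        # Every field is naturally aligned, so on common ABIs the native
        # layout matches the packed one byte for byte.
        assert aligned == packed, name
        sizes[name] = packed
    return sizes


if __name__ == "__main__":
    for name, size in sorted(check_layouts().items()):
        print("%-28s %2d bytes" % (name, size))
```

If a field were ever added out of order, the "@" and "=" sizes would diverge
and the second assertion would fail.
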
 tools/libxc/saverestore/common.c        |   72 +++++++++++++++
 tools/libxc/saverestore/common.h        |    8 ++
 tools/libxc/saverestore/stream_format.h |  148 +++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+)
 create mode 100644 tools/libxc/saverestore/common.c
 create mode 100644 tools/libxc/saverestore/stream_format.h

diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
new file mode 100644
index 0000000..d9e47ef
--- /dev/null
+++ b/tools/libxc/saverestore/common.c
@@ -0,0 +1,72 @@
+#include "common.h"
+
+static const char *dhdr_types[] =
+{
+    [DHDR_TYPE_X86_PV]  = "x86 PV",
+    [DHDR_TYPE_X86_HVM] = "x86 HVM",
+    [DHDR_TYPE_X86_PVH] = "x86 PVH",
+    [DHDR_TYPE_ARM]     = "ARM",
+};
+
+const char *dhdr_type_to_str(uint32_t type)
+{
+    if ( type < ARRAY_SIZE(dhdr_types) && dhdr_types[type] )
+        return dhdr_types[type];
+
+    return "Reserved";
+}
+
+static const char *mandatory_rec_types[] =
+{
+    [REC_TYPE_END]                  = "End",
+    [REC_TYPE_PAGE_DATA]            = "Page data",
+    [REC_TYPE_X86_PV_INFO]          = "x86 PV info",
+    [REC_TYPE_X86_PV_P2M_FRAMES]    = "x86 PV P2M frames",
+    [REC_TYPE_X86_PV_VCPU_BASIC]    = "x86 PV vcpu basic",
+    [REC_TYPE_X86_PV_VCPU_EXTENDED] = "x86 PV vcpu extended",
+    [REC_TYPE_X86_PV_VCPU_XSAVE]    = "x86 PV vcpu xsave",
+    [REC_TYPE_SHARED_INFO]          = "Shared info",
+    [REC_TYPE_TSC_INFO]             = "TSC info",
+    [REC_TYPE_HVM_CONTEXT]          = "HVM context",
+    [REC_TYPE_HVM_PARAMS]           = "HVM params",
+    [REC_TYPE_TOOLSTACK]            = "Toolstack",
+    [REC_TYPE_X86_PV_VCPU_MSRS]     = "x86 PV vcpu msrs",
+    [REC_TYPE_VERIFY]               = "Verify",
+};
+
+const char *rec_type_to_str(uint32_t type)
+{
+    if ( !(type & REC_TYPE_OPTIONAL) )
+    {
+        if ( (type < ARRAY_SIZE(mandatory_rec_types)) &&
+             (mandatory_rec_types[type]) )
+            return mandatory_rec_types[type];
+    }
+
+    return "Reserved";
+}
+
+static void __attribute__((unused)) build_assertions(void)
+{
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_dhdr) != 16);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rhdr) != 8);
+
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_page_data_header)  != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_info)       != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_p2m_frames) != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_vcpu_hdr)   != 8);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_tsc_info)          != 24);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_hvm_params_entry)  != 16);
+    XC_BUILD_BUG_ON(sizeof(struct xc_sr_rec_hvm_params)        != 8);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index f1aff44..cbecf0a 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -3,6 +3,14 @@
 
 #include "../xg_private.h"
 
+#include "stream_format.h"
+
+/* String representation of Domain Header types. */
+const char *dhdr_type_to_str(uint32_t type);
+
+/* String representation of Record types. */
+const char *rec_type_to_str(uint32_t type);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
new file mode 100644
index 0000000..d116ca6
--- /dev/null
+++ b/tools/libxc/saverestore/stream_format.h
@@ -0,0 +1,148 @@
+#ifndef __STREAM_FORMAT__H
+#define __STREAM_FORMAT__H
+
+/*
+ * C structures for the Migration v2 stream format.
+ * See docs/specs/libxc-migration-stream.pandoc
+ */
+
+#include <inttypes.h>
+
+/*
+ * Image Header
+ */
+struct xc_sr_ihdr
+{
+    uint64_t marker;
+    uint32_t id;
+    uint32_t version;
+    uint16_t options;
+    uint16_t _res1;
+    uint32_t _res2;
+};
+
+#define IHDR_MARKER  0xffffffffffffffffULL
+#define IHDR_ID      0x58454E46U
+#define IHDR_VERSION 2
+
+#define _IHDR_OPT_ENDIAN 0
+#define IHDR_OPT_LITTLE_ENDIAN (0 << _IHDR_OPT_ENDIAN)
+#define IHDR_OPT_BIG_ENDIAN    (1 << _IHDR_OPT_ENDIAN)
+
+/*
+ * Domain Header
+ */
+struct xc_sr_dhdr
+{
+    uint32_t type;
+    uint16_t page_shift;
+    uint16_t _res1;
+    uint32_t xen_major;
+    uint32_t xen_minor;
+};
+
+#define DHDR_TYPE_X86_PV  0x00000001U
+#define DHDR_TYPE_X86_HVM 0x00000002U
+#define DHDR_TYPE_X86_PVH 0x00000003U
+#define DHDR_TYPE_ARM     0x00000004U
+
+/*
+ * Record Header
+ */
+struct xc_sr_rhdr
+{
+    uint32_t type;
+    uint32_t length;
+};
+
+/* All records must be aligned up to an 8 octet boundary */
+#define REC_ALIGN_ORDER               (3U)
+/* Somewhat arbitrary - 8MB */
+#define REC_LENGTH_MAX                (8U << 20)
+
+#define REC_TYPE_END                  0x00000000U
+#define REC_TYPE_PAGE_DATA            0x00000001U
+#define REC_TYPE_X86_PV_INFO          0x00000002U
+#define REC_TYPE_X86_PV_P2M_FRAMES    0x00000003U
+#define REC_TYPE_X86_PV_VCPU_BASIC    0x00000004U
+#define REC_TYPE_X86_PV_VCPU_EXTENDED 0x00000005U
+#define REC_TYPE_X86_PV_VCPU_XSAVE    0x00000006U
+#define REC_TYPE_SHARED_INFO          0x00000007U
+#define REC_TYPE_TSC_INFO             0x00000008U
+#define REC_TYPE_HVM_CONTEXT          0x00000009U
+#define REC_TYPE_HVM_PARAMS           0x0000000aU
+#define REC_TYPE_TOOLSTACK            0x0000000bU
+#define REC_TYPE_X86_PV_VCPU_MSRS     0x0000000cU
+#define REC_TYPE_VERIFY               0x0000000dU
+
+#define REC_TYPE_OPTIONAL             0x80000000U
+
+/* PAGE_DATA */
+struct xc_sr_rec_page_data_header
+{
+    uint32_t count;
+    uint32_t _res1;
+    uint64_t pfn[0];
+};
+
+#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
+#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
+
+/* X86_PV_INFO */
+struct xc_sr_rec_x86_pv_info
+{
+    uint8_t guest_width;
+    uint8_t pt_levels;
+    uint8_t _res[6];
+};
+
+/* X86_PV_P2M_FRAMES */
+struct xc_sr_rec_x86_pv_p2m_frames
+{
+    uint32_t start_pfn;
+    uint32_t end_pfn;
+    uint64_t p2m_pfns[0];
+};
+
+/* X86_PV_VCPU_{BASIC,EXTENDED,XSAVE,MSRS} */
+struct xc_sr_rec_x86_pv_vcpu_hdr
+{
+    uint32_t vcpu_id;
+    uint32_t _res1;
+    uint8_t context[0];
+};
+
+/* TSC_INFO */
+struct xc_sr_rec_tsc_info
+{
+    uint32_t mode;
+    uint32_t khz;
+    uint64_t nsec;
+    uint32_t incarnation;
+    uint32_t _res1;
+};
+
+/* HVM_PARAMS */
+struct xc_sr_rec_hvm_params_entry
+{
+    uint64_t index;
+    uint64_t value;
+};
+
+struct xc_sr_rec_hvm_params
+{
+    uint32_t count;
+    uint32_t _res1;
+    struct xc_sr_rec_hvm_params_entry param[0];
+};
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4
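
For reference, the lookup performed by rec_type_to_str() above can be modelled
in a few lines of Python.  The table is copied from the REC_TYPE_* values in
stream_format.h; the Python names are illustrative only.  Any type with the
optional bit set, and any unknown mandatory type, is reported as "Reserved".

```python
REC_TYPE_OPTIONAL = 0x80000000

# Values and strings copied from stream_format.h and common.c in this patch.
MANDATORY_REC_TYPES = {
    0x00000000: "End",
    0x00000001: "Page data",
    0x00000002: "x86 PV info",
    0x00000003: "x86 PV P2M frames",
    0x00000004: "x86 PV vcpu basic",
    0x00000005: "x86 PV vcpu extended",
    0x00000006: "x86 PV vcpu xsave",
    0x00000007: "Shared info",
    0x00000008: "TSC info",
    0x00000009: "HVM context",
    0x0000000a: "HVM params",
    0x0000000b: "Toolstack",
    0x0000000c: "x86 PV vcpu msrs",
    0x0000000d: "Verify",
}


def rec_type_to_str(rtype):
    """String for a record type; "Reserved" for optional or unknown types."""
    if not (rtype & REC_TYPE_OPTIONAL):
        return MANDATORY_REC_TYPES.get(rtype, "Reserved")
    return "Reserved"


if __name__ == "__main__":
    for rtype in (0x8, 0xe, REC_TYPE_OPTIONAL | 0x1):
        print("0x%08x -> %s" % (rtype, rec_type_to_str(rtype)))
```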

* [PATCH 11/29] tools/libxc: noarch common code
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (9 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 10/29] tools/libxc: C implementation of stream format Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 10:52   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 12/29] tools/libxc: x86 " Andrew Cooper
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Add the context structure used to keep state during the save/restore
process.

Define the set of architecture or domain type specific operations with a
set of callbacks (save_ops, and restore_ops).

Add common functions for writing records.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

---
v6:
 * Use writev() instead of write()
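
The gather-write in write_split_record() can be sketched with os.writev(),
which is available on Unix with Python 3.3 and later.  This is an illustrative
model only: split_record_parts() and the Python writev_exact() below are
hypothetical stand-ins, and unlike the C code the type and length are packed
into a single buffer.

```python
import os
from struct import pack

REC_ALIGN_ORDER = 3           # records are aligned to 1 << 3 = 8 octets
REC_LENGTH_MAX = 8 << 20      # 8MB limit, as in stream_format.h


def split_record_parts(rec_type, data, buf):
    """Build the buffers for one record whose body is data followed by buf."""
    combined = len(data) + len(buf)
    align = 1 << REC_ALIGN_ORDER
    record_length = (combined + align - 1) & ~(align - 1)

    if record_length > REC_LENGTH_MAX:
        raise ValueError("Record length %#x exceeds max (%#x)"
                         % (record_length, REC_LENGTH_MAX))

    # The header carries the unpadded length; the padding is written after.
    return [pack("II", rec_type, combined), data, buf,
            bytes(record_length - combined)]


def writev_exact(fd, parts):
    """Write every buffer in parts, retrying after short writes."""
    parts = [memoryview(p) for p in parts if len(p)]
    while parts:
        written = os.writev(fd, parts)
        # Drop the buffers which were written completely ...
        while parts and written >= len(parts[0]):
            written -= len(parts[0])
            parts.pop(0)
        # ... and trim the one which was only partially written.
        if parts and written:
            parts[0] = parts[0][written:]


def write_split_record(fd, rec_type, data, buf):
    writev_exact(fd, split_record_parts(rec_type, data, buf))
```

Submitting header, both body pieces and padding in one call avoids copying
them into a temporary buffer first.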
---
 tools/libxc/saverestore/common.c |   41 +++++
 tools/libxc/saverestore/common.h |  306 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 347 insertions(+)

diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
index d9e47ef..cacca17 100644
--- a/tools/libxc/saverestore/common.c
+++ b/tools/libxc/saverestore/common.c
@@ -1,3 +1,5 @@
+#include <assert.h>
+
 #include "common.h"
 
 static const char *dhdr_types[] =
@@ -46,6 +48,45 @@ const char *rec_type_to_str(uint32_t type)
     return "Reserved";
 }
 
+int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                       void *buf, size_t sz)
+{
+    static const char zeroes[(1u << REC_ALIGN_ORDER) - 1] = { 0 };
+
+    xc_interface *xch = ctx->xch;
+    typeof(rec->length) combined_length = rec->length + sz;
+    size_t record_length = ROUNDUP(combined_length, REC_ALIGN_ORDER);
+    struct iovec parts[] =
+    {
+        { &rec->type, sizeof(rec->type) },
+        { &combined_length, sizeof(combined_length) },
+        { rec->data, rec->length },
+        { buf, sz },
+        { (void*)zeroes, record_length - combined_length },
+    };
+
+    if ( record_length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rec->type,
+              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    if ( rec->length )
+        assert(rec->data);
+    if ( sz )
+        assert(buf);
+
+    if ( writev_exact(ctx->fd, parts, ARRAY_SIZE(parts)) )
+        goto err;
+
+    return 0;
+
+ err:
+    PERROR("Unable to write record to stream");
+    return -1;
+}
+
 static void __attribute__((unused)) build_assertions(void)
 {
     XC_BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index cbecf0a..0ded85d 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -1,7 +1,12 @@
 #ifndef __COMMON__H
 #define __COMMON__H
 
+#include <stdbool.h>
+
 #include "../xg_private.h"
+#include "../xg_save_restore.h"
+#include "../xc_dom.h"
+#include "../xc_bitops.h"
 
 #include "stream_format.h"
 
@@ -11,6 +16,307 @@ const char *dhdr_type_to_str(uint32_t type);
 /* String representation of Record types. */
 const char *rec_type_to_str(uint32_t type);
 
+struct xc_sr_context;
+struct xc_sr_record;
+
+/**
+ * Save operations.  To be implemented for each type of guest, for use by the
+ * common save algorithm.
+ *
+ * Every function must be implemented, even if only with a no-op stub.
+ */
+struct xc_sr_save_ops
+{
+    /* Convert a PFN to GFN.  May return ~0UL for an invalid mapping. */
+    xen_pfn_t (*pfn_to_gfn)(const struct xc_sr_context *ctx, xen_pfn_t pfn);
+
+    /**
+     * Optionally transform the contents of a page from being specific to the
+     * sending environment, to being generic for the stream.
+     *
+     * The page of data referenced by '*page' may be a read-only mapping of a
+     * running guest; it must not be modified.  If no transformation is
+     * required, the callee should leave '*page' untouched.
+     *
+     * If a transformation is required, the callee should allocate themselves
+     * a local page using malloc() and return it via '*page'.
+     *
+     * The caller shall free() '*page' in all cases.  In the case that the
+     * callee encounters an error, it should *NOT* free() the memory it
+     * allocated for '*page'.
+     *
+     * It is valid to fail with EAGAIN if the transformation is not able to be
+     * completed at this point.  The page shall be retried later.
+     *
+     * @returns 0 for success, -1 for failure, with errno appropriately set.
+     */
+    int (*normalise_page)(struct xc_sr_context *ctx, xen_pfn_t type,
+                          void **page);
+
+    /**
+     * Set up local environment to save a domain.  This is called before
+     * any records are written to the stream.  (Typically querying running
+     * domain state, setting up mappings etc.)
+     */
+    int (*setup)(struct xc_sr_context *ctx);
+
+    /**
+     * Write records which need to be at the start of the stream.  This is
+     * called after the Image and Domain headers are written.  (Any records
+     * which need to be ahead of the memory.)
+     */
+    int (*start_of_stream)(struct xc_sr_context *ctx);
+
+    /**
+     * Write records which need to be at the end of the stream, following the
+     * complete memory contents.  The caller shall handle writing the END
+     * record into the stream.  (Any records which need to be after the memory
+     * is complete.)
+     */
+    int (*end_of_stream)(struct xc_sr_context *ctx);
+
+    /**
+     * Clean up the local environment.  Will be called exactly once, either
+     * after a successful save, or upon encountering an error.
+     */
+    int (*cleanup)(struct xc_sr_context *ctx);
+};
+
+
+/**
+ * Restore operations.  To be implemented for each type of guest, for use by
+ * the common restore algorithm.
+ *
+ * Every function must be implemented, even if only with a no-op stub.
+ */
+struct xc_sr_restore_ops
+{
+    /* Convert a PFN to GFN.  May return ~0UL for an invalid mapping. */
+    xen_pfn_t (*pfn_to_gfn)(const struct xc_sr_context *ctx, xen_pfn_t pfn);
+
+    /* Check to see whether a PFN is valid. */
+    bool (*pfn_is_valid)(const struct xc_sr_context *ctx, xen_pfn_t pfn);
+
+    /* Set the GFN of a PFN. */
+    void (*set_gfn)(struct xc_sr_context *ctx, xen_pfn_t pfn, xen_pfn_t gfn);
+
+    /* Set the type of a PFN. */
+    void (*set_page_type)(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                          xen_pfn_t type);
+
+    /**
+     * Optionally transform the contents of a page from being generic in the
+     * stream, to being specific to the restoring environment.
+     *
+     * 'page' is expected to be modified in-place if a transformation is
+     * required.
+     *
+     * @returns 0 for success, -1 for failure, with errno appropriately set.
+     */
+    int (*localise_page)(struct xc_sr_context *ctx, uint32_t type, void *page);
+
+    /**
+     * Set up local environment to restore a domain.  This is called before
+     * any records are read from the stream.
+     */
+    int (*setup)(struct xc_sr_context *ctx);
+
+    /**
+     * Process an individual record from the stream.  The caller shall take
+     * care of processing common records (e.g. END, PAGE_DATA).
+     *
+     * @return 0 for success, -1 for failure, or the sentinel value
+     * RECORD_NOT_PROCESSED.
+     */
+#define RECORD_NOT_PROCESSED 1
+    int (*process_record)(struct xc_sr_context *ctx, struct xc_sr_record *rec);
+
+    /**
+     * Perform any actions required after the stream has been finished. Called
+     * after the END record has been received.
+     */
+    int (*stream_complete)(struct xc_sr_context *ctx);
+
+    /**
+     * Clean up the local environment.  Will be called exactly once, either
+     * after a successful restore, or upon encountering an error.
+     */
+    int (*cleanup)(struct xc_sr_context *ctx);
+};
+
+/* x86 PV per-vcpu storage structure for blobs heading Xen-wards. */
+struct xc_sr_x86_pv_restore_vcpu
+{
+    void *basic, *extd, *xsave, *msr;
+    size_t basicsz, extdsz, xsavesz, msrsz;
+};
+
+struct xc_sr_context
+{
+    xc_interface *xch;
+    uint32_t domid;
+    int fd;
+
+    xc_dominfo_t dominfo;
+
+    union
+    {
+        struct
+        {
+            struct xc_sr_save_ops ops;
+            struct save_callbacks *callbacks;
+
+            /* Live migrate vs nonlive suspend. */
+            bool live;
+
+            /* Further debugging information in the stream. */
+            bool debug;
+
+            /* Parameters for tweaking live migration. */
+            unsigned max_iterations;
+            unsigned dirty_threshold;
+
+            unsigned long p2m_size;
+
+            xen_pfn_t *batch_pfns;
+            unsigned nr_batch_pfns;
+            unsigned long *deferred_pages;
+            unsigned long nr_deferred_pages;
+        } save;
+
+        struct
+        {
+            struct xc_sr_restore_ops ops;
+            struct restore_callbacks *callbacks;
+
+            /* From Image Header. */
+            uint32_t format_version;
+
+            /* From Domain Header. */
+            uint32_t guest_type;
+            uint32_t guest_page_size;
+
+            /* Xenstore and Console parameters. Some as input from caller,
+             * some as input from stream, some output. */
+            unsigned long xenstore_mfn, console_mfn;
+            unsigned int xenstore_evtchn, console_evtchn;
+            domid_t xenstore_domid, console_domid;
+
+            /* Bitmap of currently populated PFNs during restore. */
+            unsigned long *populated_pfns;
+            xen_pfn_t max_populated_pfn;
+
+            /* Sender has invoked verify mode on the stream. */
+            bool verify;
+        } restore;
+    };
+
+    union
+    {
+        struct
+        {
+            /* 4 or 8; 32 or 64 bit domain */
+            unsigned int width;
+            /* 3 or 4 pagetable levels */
+            unsigned int levels;
+
+            /* Maximum Xen frame */
+            unsigned long max_mfn;
+            /* Read-only machine to phys map */
+            xen_pfn_t *m2p;
+            /* first mfn of the compat m2p (Only needed for 32bit PV guests) */
+            xen_pfn_t compat_m2p_mfn0;
+            /* Number of m2p frames mapped */
+            unsigned long nr_m2p_frames;
+
+            /* Maximum guest frame */
+            unsigned long max_pfn;
+
+            /* Number of frames making up the p2m */
+            unsigned int p2m_frames;
+            /* Guest's phys to machine map.  Mapped read-only (save) or
+             * allocated locally (restore).  Uses guest unsigned longs. */
+            void *p2m;
+            /* The guest pfns containing the p2m leaves */
+            xen_pfn_t *p2m_pfns;
+
+            /* Read-only mapping of guest's shared info page */
+            shared_info_any_t *shinfo;
+
+            union
+            {
+                struct
+                {
+                    /* State machine for the order of received records. */
+                    bool seen_pv_info;
+
+                    /* Types for each page (bounded by max_pfn). */
+                    uint32_t *pfn_types;
+
+                    /* Vcpu context blobs. */
+                    struct xc_sr_x86_pv_restore_vcpu *vcpus;
+                    unsigned nr_vcpus;
+                } restore;
+            };
+        } x86_pv;
+
+        struct
+        {
+            union
+            {
+                struct
+                {
+                    /* Whether qemu enabled logdirty mode, and we should
+                     * disable on cleanup. */
+                    bool qemu_enabled_logdirty;
+                } save;
+
+                struct
+                {
+                    /* HVM context blob. */
+                    void *context;
+                    size_t contextsz;
+                } restore;
+            };
+        } x86_hvm;
+    };
+};
+
+
+struct xc_sr_record
+{
+    uint32_t type;
+    uint32_t length;
+    void *data;
+};
+
+/*
+ * Writes a split record to the stream, applying correct padding where
+ * appropriate.  It is common when sending records containing blobs from Xen
+ * that the header and blob data are separate.  This function accepts a second
+ * buffer and length, and will merge it with the main record when sending.
+ *
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non-zero on failure.
+ */
+int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                       void *buf, size_t sz);
+
+/*
+ * Writes a record to the stream, applying correct padding where appropriate.
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non-zero on failure.
+ */
+static inline int write_record(struct xc_sr_context *ctx,
+                               struct xc_sr_record *rec)
+{
+    return write_split_record(ctx, rec, NULL, 0);
+}
+
 #endif
 /*
  * Local variables:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread
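
For readers following the write_split_record() declaration in the patch above, here is a minimal sketch of how a header and a separate blob might be merged into one padded record. It is not the code from this series. It assumes the record body is padded with zeros to a multiple of 8 octets and that the on-wire header is a 32-bit type followed by a 32-bit length; it also writes into a memory buffer rather than ctx->fd. Every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed alignment of record bodies in the stream (not stated in this patch). */
#define SKETCH_REC_ALIGN 8u

/* Hypothetical on-wire record header: type, then body length excluding padding. */
struct sketch_rhdr
{
    uint32_t type;
    uint32_t length;
};

/* Number of zero bytes needed to round len up to the assumed alignment. */
static size_t sketch_pad(size_t len)
{
    return (SKETCH_REC_ALIGN - (len % SKETCH_REC_ALIGN)) % SKETCH_REC_ALIGN;
}

/*
 * Serialise "hdr" followed by "blob" as a single record into "out".
 * Returns the number of bytes written, or 0 if "out" is too small.
 * A zero hdr_len or blob_len means the matching pointer is ignored.
 */
static size_t sketch_write_split(uint8_t *out, size_t cap, uint32_t type,
                                 const void *hdr, uint32_t hdr_len,
                                 const void *blob, size_t blob_len)
{
    size_t body = (size_t)hdr_len + blob_len;
    size_t total = sizeof(struct sketch_rhdr) + body + sketch_pad(body);
    struct sketch_rhdr rhdr = { type, (uint32_t)body };

    assert(out != NULL);
    if ( total > cap )
        return 0;

    memset(out, 0, total);              /* padding bytes stay zero */
    memcpy(out, &rhdr, sizeof(rhdr));
    if ( hdr_len )
        memcpy(out + sizeof(rhdr), hdr, hdr_len);
    if ( blob_len )
        memcpy(out + sizeof(rhdr) + hdr_len, blob, blob_len);

    return total;
}
```

With an 8 byte header and a 5 byte blob, the length field holds 13 and three zero bytes of padding follow, so the receiver can always find the next record header on an aligned boundary.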

* [PATCH 12/29] tools/libxc: x86 common code
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Save/restore records common to all x86 domain types (HVM, PV).

This is only the TSC_INFO record.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxc/saverestore/common_x86.c |   54 ++++++++++++++++++++++++++++++++++
 tools/libxc/saverestore/common_x86.h |   26 ++++++++++++++++
 2 files changed, 80 insertions(+)
 create mode 100644 tools/libxc/saverestore/common_x86.c
 create mode 100644 tools/libxc/saverestore/common_x86.h

diff --git a/tools/libxc/saverestore/common_x86.c b/tools/libxc/saverestore/common_x86.c
new file mode 100644
index 0000000..d76c11f
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86.c
@@ -0,0 +1,54 @@
+#include "common_x86.h"
+
+int write_tsc_info(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_tsc_info tsc = { 0 };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_TSC_INFO,
+        .length = sizeof(tsc),
+        .data = &tsc
+    };
+
+    if ( xc_domain_get_tsc_info(xch, ctx->domid, &tsc.mode,
+                                &tsc.nsec, &tsc.khz, &tsc.incarnation) < 0 )
+    {
+        PERROR("Unable to obtain TSC information");
+        return -1;
+    }
+
+    return write_record(ctx, &rec);
+}
+
+int handle_tsc_info(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_tsc_info *tsc = rec->data;
+
+    if ( rec->length != sizeof(*tsc) )
+    {
+        ERROR("TSC_INFO record wrong size: length %u, expected %zu",
+              rec->length, sizeof(*tsc));
+        return -1;
+    }
+
+    if ( xc_domain_set_tsc_info(xch, ctx->domid, tsc->mode,
+                                tsc->nsec, tsc->khz, tsc->incarnation) )
+    {
+        PERROR("Unable to set TSC information");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86.h b/tools/libxc/saverestore/common_x86.h
new file mode 100644
index 0000000..5971bc5
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86.h
@@ -0,0 +1,26 @@
+#ifndef __COMMON_X86__H
+#define __COMMON_X86__H
+
+#include "common.h"
+
+/*
+ * Obtains a domain's TSC information from Xen and writes a TSC_INFO record
+ * into the stream.
+ */
+int write_tsc_info(struct xc_sr_context *ctx);
+
+/*
+ * Parses a TSC_INFO record and applies the result to the domain.
+ */
+int handle_tsc_info(struct xc_sr_context *ctx, struct xc_sr_record *rec);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread
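
As an illustration of how handle_tsc_info() above could slot into the process_record() contract from the previous patch (0 on success, -1 on failure, RECORD_NOT_PROCESSED for records the handler does not own), here is a small standalone sketch. It is not code from this series. The record type value and the field layout of the TSC structure are placeholders chosen for the example, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder value; the real one is defined by the stream specification. */
#define SKETCH_REC_TYPE_TSC_INFO 0x8u
/* Mirrors the sentinel in the xc_sr_restore_ops comment. */
#define SKETCH_RECORD_NOT_PROCESSED 1

struct sketch_record
{
    uint32_t type;
    uint32_t length;
    void *data;
};

/* Assumed layout, for illustration only. */
struct sketch_tsc_info
{
    uint32_t mode;
    uint32_t khz;
    uint64_t nsec;
    uint32_t incarnation;
    uint32_t reserved;
};

/* Validate the length before trusting the payload, as handle_tsc_info() does. */
static int sketch_handle_tsc_info(const struct sketch_record *rec,
                                  struct sketch_tsc_info *out)
{
    if ( rec->length != sizeof(*out) || rec->data == NULL )
        return -1;

    memcpy(out, rec->data, sizeof(*out));
    return 0;
}

/* Dispatch on type; unknown types are handed back to the caller. */
static int sketch_process_record(const struct sketch_record *rec,
                                 struct sketch_tsc_info *out)
{
    assert(rec != NULL && out != NULL);

    switch ( rec->type )
    {
    case SKETCH_REC_TYPE_TSC_INFO:
        return sketch_handle_tsc_info(rec, out);

    default:
        return SKETCH_RECORD_NOT_PROCESSED;
    }
}
```

Keeping the sentinel distinct from both 0 and -1 lets the common restore loop tell "malformed record" apart from "not mine", so it can decide for itself what to do with an unrecognised record type.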

* [PATCH 13/29] tools/libxc: x86 PV common code
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Add functions common to save and restore of x86 PV guests.  This includes
functions for dealing with the P2M and M2P and the VCPU context.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/saverestore/common_x86_pv.c |  198 +++++++++++++++++++++++++++++++
 tools/libxc/saverestore/common_x86_pv.h |  102 ++++++++++++++++
 2 files changed, 300 insertions(+)
 create mode 100644 tools/libxc/saverestore/common_x86_pv.c
 create mode 100644 tools/libxc/saverestore/common_x86_pv.h

diff --git a/tools/libxc/saverestore/common_x86_pv.c b/tools/libxc/saverestore/common_x86_pv.c
new file mode 100644
index 0000000..d82d7e1
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_pv.c
@@ -0,0 +1,198 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+xen_pfn_t mfn_to_pfn(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    assert(mfn <= ctx->x86_pv.max_mfn);
+    return ctx->x86_pv.m2p[mfn];
+}
+
+bool mfn_in_pseudophysmap(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    return ( (mfn <= ctx->x86_pv.max_mfn) &&
+             (mfn_to_pfn(ctx, mfn) <= ctx->x86_pv.max_pfn) &&
+             (xc_pfn_to_mfn(mfn_to_pfn(ctx, mfn), ctx->x86_pv.p2m,
+                            ctx->x86_pv.width) == mfn) );
+}
+
+void dump_bad_pseudophysmap_entry(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn = ~0UL;
+
+    ERROR("mfn %#lx, max %#lx", mfn, ctx->x86_pv.max_mfn);
+
+    if ( (mfn != ~0UL) && (mfn <= ctx->x86_pv.max_mfn) )
+    {
+        pfn = ctx->x86_pv.m2p[mfn];
+        ERROR("  m2p[%#lx] = %#lx, max_pfn %#lx",
+              mfn, pfn, ctx->x86_pv.max_pfn);
+    }
+
+    if ( (pfn != ~0UL) && (pfn <= ctx->x86_pv.max_pfn) )
+        ERROR("  p2m[%#lx] = %#lx",
+              pfn, xc_pfn_to_mfn(pfn, ctx->x86_pv.p2m, ctx->x86_pv.width));
+}
+
+xen_pfn_t cr3_to_mfn(struct xc_sr_context *ctx, uint64_t cr3)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return cr3 >> 12;
+    else
+        return (((uint32_t)cr3 >> 12) | ((uint32_t)cr3 << 20));
+}
+
+uint64_t mfn_to_cr3(struct xc_sr_context *ctx, xen_pfn_t mfn)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return ((uint64_t)mfn) << 12;
+    else
+        return (((uint32_t)mfn << 12) | ((uint32_t)mfn >> 20));
+}
+
+int x86_pv_domain_info(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned int guest_width, guest_levels, fpp;
+    int max_pfn;
+
+    /* Get the domain width */
+    if ( xc_domain_get_guest_width(xch, ctx->domid, &guest_width) )
+    {
+        PERROR("Unable to determine dom%d's width", ctx->domid);
+        return -1;
+    }
+
+    if ( guest_width == 4 )
+        guest_levels = 3;
+    else if ( guest_width == 8 )
+        guest_levels = 4;
+    else
+    {
+        ERROR("Invalid guest width %u.  Expected 32 or 64", guest_width * 8);
+        return -1;
+    }
+    ctx->x86_pv.width = guest_width;
+    ctx->x86_pv.levels = guest_levels;
+    fpp = PAGE_SIZE / ctx->x86_pv.width;
+
+    DPRINTF("%u bits, %u levels", guest_width * 8, guest_levels);
+
+    /* Get the domain's size */
+    max_pfn = xc_domain_maximum_gpfn(xch, ctx->domid);
+    if ( max_pfn < 0 )
+    {
+        PERROR("Unable to obtain guest's max pfn");
+        return -1;
+    }
+
+    if ( max_pfn > 0 )
+    {
+        ctx->x86_pv.max_pfn = max_pfn;
+        ctx->x86_pv.p2m_frames = (ctx->x86_pv.max_pfn + fpp) / fpp;
+
+        DPRINTF("max_pfn %#x, p2m_frames %d", max_pfn, ctx->x86_pv.p2m_frames);
+    }
+
+    return 0;
+}
+
+int x86_pv_map_m2p(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    long max_page = xc_maximum_ram_page(xch);
+    unsigned long m2p_chunks, m2p_size;
+    privcmd_mmap_entry_t *entries = NULL;
+    xen_pfn_t *extents_start = NULL;
+    int rc = -1, i;
+
+    if ( max_page < 0 )
+    {
+        PERROR("Failed to get maximum ram page");
+        goto err;
+    }
+
+    ctx->x86_pv.max_mfn = max_page;
+    m2p_size   = M2P_SIZE(ctx->x86_pv.max_mfn);
+    m2p_chunks = M2P_CHUNKS(ctx->x86_pv.max_mfn);
+
+    extents_start = malloc(m2p_chunks * sizeof(xen_pfn_t));
+    if ( !extents_start )
+    {
+        ERROR("Unable to allocate %lu bytes for m2p mfns",
+              m2p_chunks * sizeof(xen_pfn_t));
+        goto err;
+    }
+
+    if ( xc_machphys_mfn_list(xch, m2p_chunks, extents_start) )
+    {
+        PERROR("Failed to get m2p mfn list");
+        goto err;
+    }
+
+    entries = malloc(m2p_chunks * sizeof(privcmd_mmap_entry_t));
+    if ( !entries )
+    {
+        ERROR("Unable to allocate %lu bytes for m2p mapping mfns",
+              m2p_chunks * sizeof(privcmd_mmap_entry_t));
+        goto err;
+    }
+
+    for ( i = 0; i < m2p_chunks; ++i )
+        entries[i].mfn = extents_start[i];
+
+    ctx->x86_pv.m2p = xc_map_foreign_ranges(
+        xch, DOMID_XEN, m2p_size, PROT_READ,
+        M2P_CHUNK_SIZE, entries, m2p_chunks);
+
+    if ( !ctx->x86_pv.m2p )
+    {
+        PERROR("Failed to mmap m2p ranges");
+        goto err;
+    }
+
+    ctx->x86_pv.nr_m2p_frames = (M2P_CHUNK_SIZE >> PAGE_SHIFT) * m2p_chunks;
+
+#ifdef __i386__
+    /* 32 bit toolstacks automatically get the compat m2p */
+    ctx->x86_pv.compat_m2p_mfn0 = entries[0].mfn;
+#else
+    /* 64 bit toolstacks need to ask Xen specially for it */
+    {
+        struct xen_machphys_mfn_list xmml = {
+            .max_extents = 1,
+            .extent_start = { &ctx->x86_pv.compat_m2p_mfn0 }
+        };
+
+        rc = do_memory_op(xch, XENMEM_machphys_compat_mfn_list,
+                          &xmml, sizeof(xmml));
+        if ( rc || xmml.nr_extents != 1 )
+        {
+            PERROR("Failed to get compat mfn list from Xen");
+            rc = -1;
+            goto err;
+        }
+    }
+#endif
+
+    /* All Done */
+    rc = 0;
+    DPRINTF("max_mfn %#lx", ctx->x86_pv.max_mfn);
+
+err:
+    free(entries);
+    free(extents_start);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86_pv.h b/tools/libxc/saverestore/common_x86_pv.h
new file mode 100644
index 0000000..c36927a
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_pv.h
@@ -0,0 +1,102 @@
+#ifndef __COMMON_X86_PV_H
+#define __COMMON_X86_PV_H
+
+#include "common_x86.h"
+
+/*
+ * Convert an mfn to a pfn, given Xen's m2p table.
+ *
+ * Caller must ensure that the requested mfn is in range.
+ */
+xen_pfn_t mfn_to_pfn(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/*
+ * Query whether a particular mfn is valid in the physmap of a guest.
+ */
+bool mfn_in_pseudophysmap(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/*
+ * Debug a particular mfn by walking the p2m and m2p.
+ */
+void dump_bad_pseudophysmap_entry(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/*
+ * Convert a PV cr3 field to an mfn.
+ *
+ * Adjusts for Xen's extended-cr3 format to pack a 44bit physical address into
+ * a 32bit architectural cr3.
+ */
+xen_pfn_t cr3_to_mfn(struct xc_sr_context *ctx, uint64_t cr3);
+
+/*
+ * Convert an mfn to a PV cr3 field.
+ *
+ * Adjusts for Xen's extended-cr3 format to pack a 44bit physical address into
+ * a 32bit architectural cr3.
+ */
+uint64_t mfn_to_cr3(struct xc_sr_context *ctx, xen_pfn_t mfn);
+
+/* Bits 12 thru 51 of a PTE point at the frame */
+#define PTE_FRAME_MASK 0x000ffffffffff000ULL
+
+/*
+ * Extract an mfn from a Pagetable Entry.  May return INVALID_MFN if the pte
+ * would overflow a 32bit xen_pfn_t.
+ */
+static inline xen_pfn_t pte_to_frame(uint64_t pte)
+{
+    uint64_t frame = (pte & PTE_FRAME_MASK) >> PAGE_SHIFT;
+
+#ifdef __i386__
+    if ( frame >= INVALID_MFN )
+        return INVALID_MFN;
+#endif
+
+    return frame;
+}
+
+/*
+ * Change the frame in a Pagetable Entry while leaving the flags alone.
+ */
+static inline uint64_t merge_pte(uint64_t pte, xen_pfn_t mfn)
+{
+    return (pte & ~PTE_FRAME_MASK) | ((uint64_t)mfn << PAGE_SHIFT);
+}
+
+/*
+ * Get current domain information.
+ *
+ * Fills ctx->x86_pv
+ * - .width
+ * - .levels
+ * - .max_pfn
+ * - .p2m_frames
+ *
+ * Used by the save side to create the X86_PV_INFO record, and by the restore
+ * side to verify the incoming stream.
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_domain_info(struct xc_sr_context *ctx);
+
+/*
+ * Maps the Xen M2P.
+ *
+ * Fills ctx->x86_pv.
+ * - .max_mfn
+ * - .m2p, .nr_m2p_frames and .compat_m2p_mfn0
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_map_m2p(struct xc_sr_context *ctx);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread
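
To make the cr3 and PTE helpers in the patch above easier to follow, here is a standalone sketch of the same arithmetic. It takes the guest width as a plain argument instead of reading it from struct xc_sr_context, and uses uint64_t where the patch uses xen_pfn_t. Those simplifications and all sketch_* names are assumptions for the example; the shifts and mask are the ones shown in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
/* Bits 12 thru 51 of a PTE point at the frame (same mask as the patch). */
#define SKETCH_PTE_FRAME_MASK 0x000ffffffffff000ULL

/*
 * width is 8 for a 64bit guest and 4 for a 32bit guest.  For 32bit guests
 * the extended-cr3 format is a 12 bit rotate within 32 bits, so the upper
 * bits of the frame number end up in the low 12 bits of cr3.
 */
static uint64_t sketch_cr3_to_mfn(unsigned int width, uint64_t cr3)
{
    uint32_t c = (uint32_t)cr3;

    assert(width == 4 || width == 8);
    if ( width == 8 )
        return cr3 >> SKETCH_PAGE_SHIFT;

    return (uint32_t)((c >> 12) | (c << 20));
}

static uint64_t sketch_mfn_to_cr3(unsigned int width, uint64_t mfn)
{
    uint32_t m = (uint32_t)mfn;

    assert(width == 4 || width == 8);
    if ( width == 8 )
        return mfn << SKETCH_PAGE_SHIFT;

    return (uint32_t)((m << 12) | (m >> 20));
}

/* Extract the frame number from a pagetable entry. */
static uint64_t sketch_pte_to_frame(uint64_t pte)
{
    return (pte & SKETCH_PTE_FRAME_MASK) >> SKETCH_PAGE_SHIFT;
}

/* Swap the frame in a pagetable entry while leaving the flag bits alone. */
static uint64_t sketch_merge_pte(uint64_t pte, uint64_t mfn)
{
    return (pte & ~SKETCH_PTE_FRAME_MASK) | (mfn << SKETCH_PAGE_SHIFT);
}
```

For example, a 32bit guest frame 0x123456 packs to cr3 0x23456001 and unpacks back to 0x123456. This is the property the save side relies on when it rewrites ctrlreg[3] from an mfn to a pfn.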

* [PATCH 14/29] tools/libxc: x86 PV save code
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Save the x86 PV specific parts of a domain.  This is the X86_PV_INFO record,
the P2M_FRAMES, the X86_PV_SHARED_INFO, the three different VCPU context
records, and the MSR records.

The normalise_page callback, used by the common code when writing the PAGE_DATA
records, converts MFNs in page tables to PFNs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

---
v6:
 * L1's don't have PSE.  Bit 7 of an L1 is a PAT bit.
 * Type-change failures are not fatal if live.
---
 tools/libxc/saverestore/common.h      |    1 +
 tools/libxc/saverestore/save_x86_pv.c |  850 +++++++++++++++++++++++++++++++++
 2 files changed, 851 insertions(+)
 create mode 100644 tools/libxc/saverestore/save_x86_pv.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 0ded85d..dbb7f8d 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -282,6 +282,7 @@ struct xc_sr_context
     };
 };
 
+extern struct xc_sr_save_ops save_ops_x86_pv;
 
 struct xc_sr_record
 {
diff --git a/tools/libxc/saverestore/save_x86_pv.c b/tools/libxc/saverestore/save_x86_pv.c
new file mode 100644
index 0000000..288ec64
--- /dev/null
+++ b/tools/libxc/saverestore/save_x86_pv.c
@@ -0,0 +1,850 @@
+#include <assert.h>
+#include <limits.h>
+
+#include "common_x86_pv.h"
+
+/*
+ * Maps the guest's shared info page.
+ */
+static int map_shinfo(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    ctx->x86_pv.shinfo = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ, ctx->dominfo.shared_info_frame);
+    if ( !ctx->x86_pv.shinfo )
+    {
+        PERROR("Failed to map shared info frame at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Copy a list of mfns from a guest, accounting for differences between guest
+ * and toolstack width.
+ */
+static void copy_mfns_from_guest(const struct xc_sr_context *ctx, xen_pfn_t *dst,
+                                 const void *src, size_t count)
+{
+    size_t x;
+
+    if ( ctx->x86_pv.width == sizeof(unsigned long) )
+        memcpy(dst, src, count * sizeof(*dst));
+    else
+    {
+        for ( x = 0; x < count; ++x )
+        {
+#ifdef __x86_64__
+            /* 64bit toolstack, 32bit guest.  Expand any INVALID_MFN. */
+            uint32_t s = ((uint32_t *)src)[x];
+
+            dst[x] = s == ~0U ? INVALID_MFN : s;
+#else
+            /* 32bit toolstack, 64bit guest.  Truncate their pointers */
+            dst[x] = ((uint64_t *)src)[x];
+#endif
+        }
+    }
+}
+
+/*
+ * Walk the guest's frame list list and frame list to identify and map the
+ * frames making up the guest's p2m table.  Construct a list of pfns making up
+ * the table.
+ */
+static int map_p2m(struct xc_sr_context *ctx)
+{
+    /* Terminology:
+     *
+     * fll   - frame list list, top level p2m, list of fl mfns
+     * fl    - frame list, mid level p2m, list of leaf mfns
+     * local - own allocated buffers, adjusted for bitness
+     * guest - mappings into the domain
+     */
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    unsigned x, fpp, fll_entries, fl_entries;
+    xen_pfn_t fll_mfn;
+
+    xen_pfn_t *local_fll = NULL;
+    void *guest_fll = NULL;
+    size_t local_fll_size;
+
+    xen_pfn_t *local_fl = NULL;
+    void *guest_fl = NULL;
+    size_t local_fl_size;
+
+    fpp = PAGE_SIZE / ctx->x86_pv.width;
+    fll_entries = (ctx->x86_pv.max_pfn / (fpp * fpp)) + 1;
+    fl_entries  = (ctx->x86_pv.max_pfn / fpp) + 1;
+
+    fll_mfn = GET_FIELD(ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list,
+                        ctx->x86_pv.width);
+    if ( fll_mfn == 0 || fll_mfn > ctx->x86_pv.max_mfn )
+    {
+        ERROR("Bad mfn %#lx for p2m frame list list", fll_mfn);
+        goto err;
+    }
+
+    /* Map the guest top p2m. */
+    guest_fll = xc_map_foreign_range(xch, ctx->domid, PAGE_SIZE,
+                                     PROT_READ, fll_mfn);
+    if ( !guest_fll )
+    {
+        PERROR("Failed to map p2m frame list list at %#lx", fll_mfn);
+        goto err;
+    }
+
+    local_fll_size = fll_entries * sizeof(*local_fll);
+    local_fll = malloc(local_fll_size);
+    if ( !local_fll )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list list",
+              local_fll_size);
+        goto err;
+    }
+
+    copy_mfns_from_guest(ctx, local_fll, guest_fll, fll_entries);
+
+    /* Check for bad mfns in frame list list. */
+    for ( x = 0; x < fll_entries; ++x )
+    {
+        if ( local_fll[x] == 0 || local_fll[x] > ctx->x86_pv.max_mfn )
+        {
+            ERROR("Bad mfn %#lx at index %u (of %u) in p2m frame list list",
+                  local_fll[x], x, fll_entries);
+            goto err;
+        }
+    }
+
+    /* Map the guest mid p2m frames. */
+    guest_fl = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                    local_fll, fll_entries);
+    if ( !guest_fl )
+    {
+        PERROR("Failed to map p2m frame list");
+        goto err;
+    }
+
+    local_fl_size = fl_entries * sizeof(*local_fl);
+    local_fl = malloc(local_fl_size);
+    if ( !local_fl )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list",
+              local_fl_size);
+        goto err;
+    }
+
+    copy_mfns_from_guest(ctx, local_fl, guest_fl, fl_entries);
+
+    for ( x = 0; x < fl_entries; ++x )
+    {
+        if ( local_fl[x] == 0 || local_fl[x] > ctx->x86_pv.max_mfn )
+        {
+            ERROR("Bad mfn %#lx at index %u (of %u) in p2m frame list",
+                  local_fl[x], x, fl_entries);
+            goto err;
+        }
+    }
+
+    /* Map the p2m leaves themselves. */
+    ctx->x86_pv.p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                           local_fl, fl_entries);
+    if ( !ctx->x86_pv.p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    ctx->x86_pv.p2m_frames = fl_entries;
+    ctx->x86_pv.p2m_pfns = malloc(local_fl_size);
+    if ( !ctx->x86_pv.p2m_pfns )
+    {
+        ERROR("Cannot allocate %zu bytes for p2m pfns list",
+              local_fl_size);
+        goto err;
+    }
+
+    /* Convert leaf frames from mfns to pfns. */
+    for ( x = 0; x < fl_entries; ++x )
+    {
+        if ( !mfn_in_pseudophysmap(ctx, local_fl[x]) )
+        {
+            ERROR("Bad mfn in p2m_frame_list[%u]", x);
+            dump_bad_pseudophysmap_entry(ctx, local_fl[x]);
+            errno = ERANGE;
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[x] = mfn_to_pfn(ctx, local_fl[x]);
+    }
+
+    rc = 0;
+err:
+
+    free(local_fl);
+    if ( guest_fl )
+        munmap(guest_fl, fll_entries * PAGE_SIZE);
+
+    free(local_fll);
+    if ( guest_fll )
+        munmap(guest_fll, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Obtain a specific vcpu's basic state and write an X86_PV_VCPU_BASIC record
+ * into the stream.  Performs mfn->pfn conversion on architectural state.
+ */
+static int write_one_vcpu_basic(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn;
+    unsigned i;
+    int rc = -1;
+    vcpu_guest_context_any_t vcpu;
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_BASIC,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+
+    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
+    {
+        PERROR("Failed to get vcpu%u context", id);
+        goto err;
+    }
+
+    /* Vcpu0 is special: Convert the suspend record to a pfn. */
+    if ( id == 0 )
+    {
+        mfn = GET_FIELD(&vcpu, user_regs.edx, ctx->x86_pv.width);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for suspend record");
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(&vcpu, user_regs.edx, mfn_to_pfn(ctx, mfn),
+                  ctx->x86_pv.width);
+    }
+
+    /* Convert GDT frames to pfns. */
+    for ( i = 0; (i * 512) < GET_FIELD(&vcpu, gdt_ents, ctx->x86_pv.width);
+          ++i )
+    {
+        mfn = GET_FIELD(&vcpu, gdt_frames[i], ctx->x86_pv.width);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for frame %u of vcpu%u's GDT", i, id);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(&vcpu, gdt_frames[i], mfn_to_pfn(ctx, mfn),
+                  ctx->x86_pv.width);
+    }
+
+    /* Convert CR3 to a pfn. */
+    mfn = cr3_to_mfn(ctx, GET_FIELD(&vcpu, ctrlreg[3], ctx->x86_pv.width));
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Bad mfn for vcpu%u's cr3", id);
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        errno = ERANGE;
+        goto err;
+    }
+    pfn = mfn_to_pfn(ctx, mfn);
+    SET_FIELD(&vcpu, ctrlreg[3], mfn_to_cr3(ctx, pfn), ctx->x86_pv.width);
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to pfn. */
+    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
+    {
+        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for vcpu%u's cr1", id);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        pfn = mfn_to_pfn(ctx, mfn);
+        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
+    }
+
+    if ( ctx->x86_pv.width == 8 )
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x64));
+    else
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x32));
+
+ err:
+    return rc;
+}
+
+/*
+ * Obtain a specific vcpu's extended state and write an X86_PV_VCPU_EXTENDED
+ * record into the stream.
+ */
+static int write_one_vcpu_extended(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_EXTENDED,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
+        .domain = ctx->domid,
+        .u.ext_vcpucontext.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u extended context", id);
+        return -1;
+    }
+
+    return write_split_record(ctx, &rec, &domctl.u.ext_vcpucontext,
+                              domctl.u.ext_vcpucontext.size);
+}
+
+/*
+ * Query to see whether a specific vcpu has xsave state and if so, write an
+ * X86_PV_VCPU_XSAVE record into the stream.
+ */
+static int write_one_vcpu_xsave(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_XSAVE,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_getvcpuextstate,
+        .domain = ctx->domid,
+        .u.vcpuextstate.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's xsave context", id);
+        goto err;
+    }
+
+    /* No xsave state? skip this record. */
+    if ( !domctl.u.vcpuextstate.xfeature_mask )
+        goto out;
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, domctl.u.vcpuextstate.size);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %"PRIx64" bytes for vcpu%u's xsave context",
+              domctl.u.vcpuextstate.size, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's xsave context", id);
+        goto err;
+    }
+
+    rc = write_split_record(ctx, &rec, buffer, domctl.u.vcpuextstate.size);
+    if ( rc )
+        goto err;
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Query to see whether a specific vcpu has msr state and if so, write an
+ * X86_PV_VCPU_MSRS record into the stream.
+ */
+static int write_one_vcpu_msrs(struct xc_sr_context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    size_t buffersz;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct xc_sr_rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_MSRS,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_vcpu_msrs,
+        .domain = ctx->domid,
+        .u.vcpu_msrs.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's msrs", id);
+        goto err;
+    }
+
+    /* No MSRs? skip this record. */
+    if ( !domctl.u.vcpu_msrs.msr_count )
+        goto out;
+
+    buffersz = domctl.u.vcpu_msrs.msr_count * sizeof(xen_domctl_vcpu_msr_t);
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, buffersz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for vcpu%u's msrs",
+              buffersz, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's msrs", id);
+        goto err;
+    }
+
+    rc = write_split_record(ctx, &rec, buffer,
+                            domctl.u.vcpu_msrs.msr_count *
+                            sizeof(xen_domctl_vcpu_msr_t));
+    if ( rc )
+        goto err;
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * For each vcpu, if it is online, write its state into the stream.
+ */
+static int write_all_vcpu_information(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xc_vcpuinfo_t vinfo;
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
+    {
+        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
+        if ( rc )
+        {
+            PERROR("Failed to get vcpu%u information", i);
+            return rc;
+        }
+
+        /* Vcpu offline? skip all these records. */
+        if ( !vinfo.online )
+            continue;
+
+        rc = write_one_vcpu_basic(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_extended(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_xsave(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_msrs(ctx, i);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
+/*
+ * Writes an X86_PV_INFO record into the stream.
+ */
+static int write_x86_pv_info(struct xc_sr_context *ctx)
+{
+    struct xc_sr_rec_x86_pv_info info =
+        {
+            .guest_width = ctx->x86_pv.width,
+            .pt_levels = ctx->x86_pv.levels,
+        };
+    struct xc_sr_record rec =
+        {
+            .type = REC_TYPE_X86_PV_INFO,
+            .length = sizeof(info),
+            .data = &info
+        };
+
+    return write_record(ctx, &rec);
+}
+
+/*
+ * Writes an X86_PV_P2M_FRAMES record into the stream.  This contains the list
+ * of pfns making up the p2m table.
+ */
+static int write_x86_pv_p2m_frames(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc; unsigned i;
+    size_t datasz = ctx->x86_pv.p2m_frames * sizeof(uint64_t);
+    uint64_t *data = NULL;
+    struct xc_sr_rec_x86_pv_p2m_frames hdr =
+        {
+            .start_pfn = 0,
+            .end_pfn = ctx->x86_pv.max_pfn,
+        };
+    struct xc_sr_record rec =
+        {
+            .type = REC_TYPE_X86_PV_P2M_FRAMES,
+            .length = sizeof(hdr),
+            .data = &hdr,
+        };
+
+    /* No need to translate if sizeof(uint64_t) == sizeof(xen_pfn_t). */
+    if ( sizeof(uint64_t) != sizeof(*ctx->x86_pv.p2m_pfns) )
+    {
+        if ( !(data = malloc(datasz)) )
+        {
+            ERROR("Cannot allocate %zu bytes for X86_PV_P2M_FRAMES data",
+                  datasz);
+            return -1;
+        }
+
+        for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+            data[i] = ctx->x86_pv.p2m_pfns[i];
+    }
+    else
+        data = (uint64_t *)ctx->x86_pv.p2m_pfns;
+
+    rc = write_split_record(ctx, &rec, data, datasz);
+
+    if ( data != (uint64_t *)ctx->x86_pv.p2m_pfns )
+        free(data);
+
+    return rc;
+}
+
+/*
+ * Writes a SHARED_INFO record into the stream.
+ */
+static int write_shared_info(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_SHARED_INFO,
+        .length = PAGE_SIZE,
+        .data = ctx->x86_pv.shinfo,
+    };
+
+    return write_record(ctx, &rec);
+}
+
+/*
+ * Normalise a pagetable for the migration stream.  Performs mfn->pfn
+ * conversions on the ptes.
+ */
+static int normalise_pagetable(struct xc_sr_context *ctx, const uint64_t *src,
+                               uint64_t *dst, unsigned long type)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t pte;
+    unsigned i, xen_first = -1, xen_last = -1; /* Indices of Xen mappings. */
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( ctx->x86_pv.levels == 4 )
+    {
+        /* 64bit guests only have Xen mappings in their L4 tables. */
+        if ( type == XEN_DOMCTL_PFINFO_L4TAB )
+        {
+            xen_first = 256;
+            xen_last = 271;
+        }
+    }
+    else
+    {
+        switch ( type )
+        {
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            ERROR("??? Found L4 table for 32bit guest");
+            errno = EINVAL;
+            return -1;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            /* 32bit guests can only use the first 4 entries of their L3 tables.
+             * All others are potentially used by Xen. */
+            xen_first = 4;
+            xen_last = 512;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
+             * are normal but only a few will have Xen mappings.
+             *
+             * 428 = (HYPERVISOR_VIRT_START_PAE >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff
+             *
+             * ...which is conveniently unavailable to us in a 64bit build.
+             */
+            if ( pte_to_frame(src[428]) == ctx->x86_pv.compat_m2p_mfn0 )
+            {
+                xen_first = 428;
+                xen_last = 512;
+            }
+            break;
+        }
+    }
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        xen_pfn_t mfn;
+
+        pte = src[i];
+
+        /* Remove Xen mappings: Xen will reconstruct on the other side. */
+        if ( i >= xen_first && i <= xen_last )
+            pte = 0;
+
+        /*
+         * Errors during the live part of migration are expected as a result
+         * of split pagetable updates, page type changes, active grant
+         * mappings etc.  The pagetable will need to be resent after pausing.
+         * In such cases we fail with EAGAIN.
+         *
+         * For domains which are already paused, errors are fatal.
+         */
+        if ( pte & _PAGE_PRESENT )
+        {
+            mfn = pte_to_frame(pte);
+
+#ifdef __i386__
+            if ( mfn == INVALID_MFN )
+            {
+                ERROR("PTE truncation detected.  L%lu[%u] = %016"PRIx64,
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                errno = E2BIG;
+                return -1;
+            }
+#endif
+
+            if ( (type > XEN_DOMCTL_PFINFO_L1TAB) && (pte & _PAGE_PSE) )
+            {
+                if ( !ctx->dominfo.paused )
+                    errno = EAGAIN;
+                else
+                {
+                    ERROR("Cannot migrate superpage (L%lu[%u]: 0x%016"PRIx64")",
+                          type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                    errno = E2BIG;
+                }
+                return -1;
+            }
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                if ( !ctx->dominfo.paused )
+                    errno = EAGAIN;
+                else
+                {
+                    ERROR("Bad mfn for L%lu[%u]",
+                          type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i);
+                    dump_bad_pseudophysmap_entry(ctx, mfn);
+                    errno = ERANGE;
+                }
+                return -1;
+            }
+
+            pte = merge_pte(pte, mfn_to_pfn(ctx, mfn));
+        }
+
+        dst[i] = pte;
+    }
+
+    return 0;
+}
+
+/* save_ops function. */
+static xen_pfn_t x86_pv_pfn_to_gfn(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    return xc_pfn_to_mfn(pfn, ctx->x86_pv.p2m, ctx->x86_pv.width);
+}
+
+
+/*
+ * save_ops function.  Performs pagetable normalisation on appropriate pages.
+ */
+static int x86_pv_normalise_page(struct xc_sr_context *ctx, xen_pfn_t type,
+                                 void **page)
+{
+    xc_interface *xch = ctx->xch;
+    void *local_page;
+    int rc;
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    local_page = malloc(PAGE_SIZE);
+    if ( !local_page )
+    {
+        ERROR("Unable to allocate scratch page");
+        rc = -1;
+        goto out;
+    }
+
+    rc = normalise_pagetable(ctx, *page, local_page, type);
+    *page = local_page;
+
+  out:
+    return rc;
+}
+
+/*
+ * save_ops function.  Queries domain information and maps the Xen m2p and the
+ * guest's shinfo and p2m table.
+ */
+static int x86_pv_setup(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        return rc;
+
+    rc = map_shinfo(ctx);
+    if ( rc )
+        return rc;
+
+    rc = map_p2m(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Writes PV header records into the stream.
+ */
+static int x86_pv_start_of_stream(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = write_x86_pv_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_x86_pv_p2m_frames(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Writes the tail records into the stream.
+ */
+static int x86_pv_end_of_stream(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_shared_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_all_vcpu_information(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Cleanup.
+ */
+static int x86_pv_cleanup(struct xc_sr_context *ctx)
+{
+    free(ctx->x86_pv.p2m_pfns);
+
+    if ( ctx->x86_pv.p2m )
+        munmap(ctx->x86_pv.p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    if ( ctx->x86_pv.shinfo )
+        munmap(ctx->x86_pv.shinfo, PAGE_SIZE);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return 0;
+}
+
+struct xc_sr_save_ops save_ops_x86_pv =
+{
+    .pfn_to_gfn      = x86_pv_pfn_to_gfn,
+    .normalise_page  = x86_pv_normalise_page,
+    .setup           = x86_pv_setup,
+    .start_of_stream = x86_pv_start_of_stream,
+    .end_of_stream   = x86_pv_end_of_stream,
+    .cleanup         = x86_pv_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH 15/29] tools/libxc: x86 PV restore code
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (13 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 14/29] tools/libxc: x86 PV save code Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-10 17:10 ` [PATCH 16/29] tools/libxc: x86 HVM save code Andrew Cooper
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Restore the x86 PV specific parts of the stream: the X86_PV_INFO,
X86_PV_P2M_FRAMES, SHARED_INFO, and VCPU context records.

The localise_page callback is called from the common PAGE_DATA code to convert
PFNs in page tables to MFNs.
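
For illustration only, the conversion can be sketched as below.  This is not
the code in this patch: the sketch_* names, the flat uint64_t p2m array and
the hard-coded 64bit PTE layout (present bit 0, frame in bits 12..51) are all
assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 4K pages, 64bit PTEs, frame number in bits 12..51. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_PAGE_PRESENT 0x1ULL
#define SKETCH_FRAME_MASK   0x000ffffffffff000ULL
#define SKETCH_INVALID_MFN  (~0ULL)

/*
 * Hypothetical helper: rewrite the pfn held in one PTE to the mfn found in
 * a locally built p2m.  Returns 0 on success, -1 on a bad entry.
 */
static int sketch_localise_pte(uint64_t *pte, const uint64_t *p2m,
                               uint64_t max_pfn)
{
    uint64_t pfn, mfn;

    /* Not-present entries carry no frame and are left untouched. */
    if ( !(*pte & SKETCH_PAGE_PRESENT) )
        return 0;

    pfn = (*pte & SKETCH_FRAME_MASK) >> SKETCH_PAGE_SHIFT;
    if ( pfn > max_pfn )
        return -1;

    mfn = p2m[pfn];
    if ( mfn == SKETCH_INVALID_MFN )
        return -1;

    /* Keep the flag bits, replace only the frame bits. */
    *pte = (*pte & ~SKETCH_FRAME_MASK) |
           ((mfn << SKETCH_PAGE_SHIFT) & SKETCH_FRAME_MASK);
    return 0;
}

/* Hypothetical helper: localise every entry of one pagetable. */
static int sketch_localise_table(uint64_t *table, size_t nr,
                                 const uint64_t *p2m, uint64_t max_pfn)
{
    size_t i;

    for ( i = 0; i < nr; ++i )
        if ( sketch_localise_pte(&table[i], p2m, max_pfn) )
            return -1;

    return 0;
}
```

Failing on a pfn with no backing mfn, rather than writing a stale frame, is
a choice made for the sketch; it assumes the frame has to be populated before
the pagetable referencing it can be localised.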

Page tables are pinned and the guest's P2M is updated when the stream is
complete.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

---
v7:
 * Fix possible integer overflow in expand_p2m()
 * Batch population of frames from normalisation.  Reduces the number of
   populate physmap hypercalls by several orders of magnitude.
v6:
 * Batch pagetable pinning.
 * Fix fencepost error validating basic vcpu state.
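
A rough sketch of the batching pattern used for pinning follows, for
reviewers unfamiliar with it.  It is not the code in this patch: the
sketch_* names are hypothetical, the batch size is arbitrary, and the flush
callback merely stands in for the hypercall that submits a batch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH 4 /* Arbitrary batch size for the example. */

/* Stand-in for the hypercall which submits one batch of operations. */
typedef int (*sketch_flush_fn)(const unsigned long *ops, size_t nr,
                               void *priv);

/*
 * Walk all pfns, queue the ones flagged as pinned, and flush whenever the
 * batch fills.  A final flush submits any partial batch.
 */
static int sketch_pin_batched(const unsigned char *is_pinned, size_t nr_pfns,
                              sketch_flush_fn flush, void *priv)
{
    unsigned long batch[SKETCH_BATCH];
    size_t i, nr = 0;

    assert(flush != NULL);

    for ( i = 0; i < nr_pfns; ++i )
    {
        if ( !is_pinned[i] )
            continue;

        batch[nr++] = i;

        if ( nr == SKETCH_BATCH )
        {
            if ( flush(batch, nr, priv) )
                return -1;
            nr = 0;
        }
    }

    if ( nr > 0 && flush(batch, nr, priv) )
        return -1;

    return 0;
}

/* Example callback: counts how many flushes and operations were seen. */
struct sketch_stats { size_t calls, ops; };

static int sketch_count_flush(const unsigned long *ops, size_t nr, void *priv)
{
    struct sketch_stats *s = priv;

    (void)ops;
    s->calls++;
    s->ops += nr;
    return 0;
}
```

With six pinned frames and a batch size of four, the callback runs twice
(one full batch, one partial) instead of six times.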
---
 tools/libxc/saverestore/common.h         |    2 +
 tools/libxc/saverestore/restore_x86_pv.c | 1150 ++++++++++++++++++++++++++++++
 2 files changed, 1152 insertions(+)
 create mode 100644 tools/libxc/saverestore/restore_x86_pv.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index dbb7f8d..b72711c 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -284,6 +284,8 @@ struct xc_sr_context
 
 extern struct xc_sr_save_ops save_ops_x86_pv;
 
+extern struct xc_sr_restore_ops restore_ops_x86_pv;
+
 struct xc_sr_record
 {
     uint32_t type;
diff --git a/tools/libxc/saverestore/restore_x86_pv.c b/tools/libxc/saverestore/restore_x86_pv.c
new file mode 100644
index 0000000..85216e5
--- /dev/null
+++ b/tools/libxc/saverestore/restore_x86_pv.c
@@ -0,0 +1,1150 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+static xen_pfn_t pfn_to_mfn(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    return xc_pfn_to_mfn(pfn, ctx->x86_pv.p2m, ctx->x86_pv.width);
+}
+
+/*
+ * Expand our local tracking information for the p2m table and domain's maximum
+ * size.  Normally this will be called once to expand from 0 to max_pfn, but
+ * is liable to expand multiple times if the domain grows on the sending side
+ * after migration has started.
+ */
+static int expand_p2m(struct xc_sr_context *ctx, unsigned long max_pfn)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long old_max = ctx->x86_pv.max_pfn, i;
+    unsigned int fpp = PAGE_SIZE / ctx->x86_pv.width;
+    unsigned long end_frame = (max_pfn / fpp) + 1;
+    unsigned long old_end_frame = (old_max / fpp) + 1;
+    xen_pfn_t *p2m = NULL, *p2m_pfns = NULL;
+    uint32_t *pfn_types = NULL;
+    size_t p2msz, p2m_pfnsz, pfn_typesz;
+
+    assert(max_pfn > old_max);
+
+    p2msz = (max_pfn + 1) * ctx->x86_pv.width;
+    p2m = realloc(ctx->x86_pv.p2m, p2msz);
+    if ( !p2m )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m", p2msz);
+        return -1;
+    }
+    ctx->x86_pv.p2m = p2m;
+
+    pfn_typesz = (max_pfn + 1) * sizeof(*pfn_types);
+    pfn_types = realloc(ctx->x86_pv.restore.pfn_types, pfn_typesz);
+    if ( !pfn_types )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for pfn_types", pfn_typesz);
+        return -1;
+    }
+    ctx->x86_pv.restore.pfn_types = pfn_types;
+
+    p2m_pfnsz = (end_frame + 1) * sizeof(*p2m_pfns);
+    p2m_pfns = realloc(ctx->x86_pv.p2m_pfns, p2m_pfnsz);
+    if ( !p2m_pfns )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m frame list", p2m_pfnsz);
+        return -1;
+    }
+    ctx->x86_pv.p2m_frames = end_frame;
+    ctx->x86_pv.p2m_pfns = p2m_pfns;
+
+    ctx->x86_pv.max_pfn = max_pfn;
+    for ( i = (old_max ? old_max + 1 : 0); i <= max_pfn; ++i )
+    {
+        ctx->restore.ops.set_gfn(ctx, i, INVALID_MFN);
+        ctx->restore.ops.set_page_type(ctx, i, 0);
+    }
+
+    for ( i = (old_end_frame ? old_end_frame + 1 : 0); i <= end_frame; ++i )
+        ctx->x86_pv.p2m_pfns[i] = INVALID_MFN;
+
+    DPRINTF("Expanded p2m from %#lx to %#lx", old_max, max_pfn);
+    return 0;
+}
+
+/*
+ * Pin all of the pagetables.
+ */
+static int pin_pagetables(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long i, nr_pins;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+
+    for ( i = nr_pins = 0; i <= ctx->x86_pv.max_pfn; ++i )
+    {
+        if ( (ctx->x86_pv.restore.pfn_types[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( ctx->x86_pv.restore.pfn_types[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = pfn_to_mfn(ctx, i);
+        nr_pins++;
+
+        if ( nr_pins == MAX_PIN_BATCH )
+        {
+            if ( xc_mmuext_op(xch, pin, nr_pins, ctx->domid) != 0 )
+            {
+                PERROR("Failed to pin batch of pagetables");
+                return -1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    if ( (nr_pins > 0) && (xc_mmuext_op(xch, pin, nr_pins, ctx->domid) < 0) )
+    {
+        PERROR("Failed to pin batch of pagetables");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Update details in a guest's start_info structure.
+ */
+static int process_start_info(struct xc_sr_context *ctx, vcpu_guest_context_any_t *vcpu)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn, mfn;
+    start_info_any_t *guest_start_info = NULL;
+    int rc = -1;
+
+    pfn = GET_FIELD(vcpu, user_regs.edx, ctx->x86_pv.width);
+
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Start Info pfn %#lx out of range", pfn);
+        goto err;
+    }
+    else if ( ctx->x86_pv.restore.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+    {
+        ERROR("Start Info pfn %#lx has bad type %u", pfn,
+              ctx->x86_pv.restore.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Start Info has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    SET_FIELD(vcpu, user_regs.edx, mfn, ctx->x86_pv.width);
+    guest_start_info = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn);
+    if ( !guest_start_info )
+    {
+        PERROR("Failed to map Start Info at mfn %#lx", mfn);
+        goto err;
+    }
+
+    /* Deal with xenstore stuff */
+    pfn = GET_FIELD(guest_start_info, store_mfn, ctx->x86_pv.width);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("XenStore pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("XenStore pfn has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.xenstore_mfn = mfn;
+    SET_FIELD(guest_start_info, store_mfn, mfn, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, store_evtchn, ctx->restore.xenstore_evtchn, ctx->x86_pv.width);
+
+    /* Deal with console stuff */
+    pfn = GET_FIELD(guest_start_info, console.domU.mfn, ctx->x86_pv.width);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Console pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Console pfn has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.console_mfn = mfn;
+    SET_FIELD(guest_start_info, console.domU.mfn, mfn, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, console.domU.evtchn, ctx->restore.console_evtchn, ctx->x86_pv.width);
+
+    /* Set other information */
+    SET_FIELD(guest_start_info, nr_pages, ctx->x86_pv.max_pfn + 1, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, shared_info,
+              ctx->dominfo.shared_info_frame << PAGE_SHIFT, ctx->x86_pv.width);
+    SET_FIELD(guest_start_info, flags, 0, ctx->x86_pv.width);
+
+    rc = 0;
+
+err:
+    if ( guest_start_info )
+        munmap(guest_start_info, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Process one stashed vcpu worth of basic state and send to Xen.
+ */
+static int process_vcpu_basic(struct xc_sr_context *ctx,
+                              unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    vcpu_guest_context_any_t vcpu;
+    xen_pfn_t pfn, mfn;
+    unsigned long tmp;
+    unsigned i;
+    int rc = -1;
+
+    memcpy(&vcpu, ctx->x86_pv.restore.vcpus[vcpuid].basic,
+           ctx->x86_pv.restore.vcpus[vcpuid].basicsz);
+
+    /* Vcpu 0 is special: Convert the suspend record to an mfn. */
+    if ( vcpuid == 0 )
+    {
+        rc = process_start_info(ctx, &vcpu);
+        if ( rc )
+            return rc;
+        rc = -1;
+    }
+
+    SET_FIELD(&vcpu, flags,
+              GET_FIELD(&vcpu, flags, ctx->x86_pv.width) | VGCF_online,
+              ctx->x86_pv.width);
+
+    tmp = GET_FIELD(&vcpu, gdt_ents, ctx->x86_pv.width);
+    if ( tmp > 8192 )
+    {
+        ERROR("GDT entry count (%lu) out of range", tmp);
+        errno = ERANGE;
+        goto err;
+    }
+
+    /* Convert GDT frames to mfns. */
+    for ( i = 0; (i * 512) < tmp; ++i )
+    {
+        pfn = GET_FIELD(&vcpu, gdt_frames[i], ctx->x86_pv.width);
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("GDT frame %u (pfn %#lx) out of range", i, pfn);
+            goto err;
+        }
+        else if ( ctx->x86_pv.restore.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+        {
+            ERROR("GDT frame %u (pfn %#lx) has bad type %u", i, pfn,
+                  ctx->x86_pv.restore.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        mfn = pfn_to_mfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("GDT frame %u has bad mfn", i);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        SET_FIELD(&vcpu, gdt_frames[i], mfn, ctx->x86_pv.width);
+    }
+
+    /* Convert CR3 to an mfn. */
+    pfn = cr3_to_mfn(ctx, GET_FIELD(&vcpu, ctrlreg[3], ctx->x86_pv.width));
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("cr3 (pfn %#lx) out of range", pfn);
+        goto err;
+    }
+    else if ( (ctx->x86_pv.restore.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) !=
+              (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+    {
+        ERROR("cr3 (pfn %#lx) has bad type %u, expected %u", pfn,
+              ctx->x86_pv.restore.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT,
+              ctx->x86_pv.levels);
+        goto err;
+    }
+
+    mfn = pfn_to_mfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("cr3 has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    SET_FIELD(&vcpu, ctrlreg[3], mfn_to_cr3(ctx, mfn), ctx->x86_pv.width);
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to mfn. */
+    if ( ctx->x86_pv.levels == 4 && (vcpu.x64.ctrlreg[1] & 1) )
+    {
+        pfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("cr1 (pfn %#lx) out of range", pfn);
+            goto err;
+        }
+        else if ( (ctx->x86_pv.restore.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
+                  (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+        {
+            ERROR("cr1 (pfn %#lx) has bad type %u, expected %u", pfn,
+                  ctx->x86_pv.restore.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT,
+                  ctx->x86_pv.levels);
+            goto err;
+        }
+
+        mfn = pfn_to_mfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("cr1 has bad mfn");
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        vcpu.x64.ctrlreg[1] = (uint64_t)mfn << PAGE_SHIFT;
+    }
+
+    if ( xc_vcpu_setcontext(xch, ctx->domid, vcpuid, &vcpu) )
+    {
+        PERROR("Failed to set vcpu%u's basic info", vcpuid);
+        goto err;
+    }
+
+    rc = 0;
+
+ err:
+    return rc;
+}
+
+/*
+ * Process one stashed vcpu worth of extended state and send to Xen.
+ */
+static int process_vcpu_extended(struct xc_sr_context *ctx,
+                                 unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu =
+        &ctx->x86_pv.restore.vcpus[vcpuid];
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_set_ext_vcpucontext;
+    domctl.domain = ctx->domid;
+    memcpy(&domctl.u.ext_vcpucontext, vcpu->extd, vcpu->extdsz);
+
+    if ( xc_domctl(xch, &domctl) != 0 )
+    {
+        PERROR("Failed to set vcpu%u's extended info", vcpuid);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Process one stashed vcpu worth of xsave state and send to Xen.
+ */
+static int process_vcpu_xsave(struct xc_sr_context *ctx,
+                              unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu =
+        &ctx->x86_pv.restore.vcpus[vcpuid];
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, vcpu->xsavesz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for xsave hypercall buffer",
+              vcpu->xsavesz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setvcpuextstate;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpuextstate.vcpu = vcpuid;
+    domctl.u.vcpuextstate.size = vcpu->xsavesz;
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+
+    memcpy(buffer, vcpu->xsave, vcpu->xsavesz);
+
+    rc = xc_domctl(xch, &domctl);
+    if ( rc )
+        PERROR("Failed to set vcpu%u's xsave info", vcpuid);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Process one stashed vcpu worth of msr state and send to Xen.
+ */
+static int process_vcpu_msrs(struct xc_sr_context *ctx,
+                             unsigned int vcpuid)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu =
+        &ctx->x86_pv.restore.vcpus[vcpuid];
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, vcpu->msrsz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for msr hypercall buffer",
+              vcpu->msrsz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_set_vcpu_msrs;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpu_msrs.vcpu = vcpuid;
+    domctl.u.vcpu_msrs.msr_count = vcpu->msrsz / sizeof(xen_domctl_vcpu_msr_t);
+    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+
+    memcpy(buffer, vcpu->msr, vcpu->msrsz);
+
+    rc = xc_domctl(xch, &domctl);
+    if ( rc )
+        PERROR("Failed to set vcpu%u's msrs", vcpuid);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Process all stashed vcpu context and send to Xen.
+ */
+static int update_vcpu_context(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu;
+    unsigned i;
+    int rc = 0;
+
+    for ( i = 0; i < ctx->x86_pv.restore.nr_vcpus; ++i )
+    {
+        vcpu = &ctx->x86_pv.restore.vcpus[i];
+
+        if ( vcpu->basic )
+        {
+            rc = process_vcpu_basic(ctx, i);
+            if ( rc )
+                return rc;
+        }
+        else if ( i == 0 )
+        {
+            ERROR("Sender didn't send vcpu0's basic state");
+            return -1;
+        }
+
+        if ( vcpu->extd )
+        {
+            rc = process_vcpu_extended(ctx, i);
+            if ( rc )
+                return rc;
+        }
+
+        if ( vcpu->xsave )
+        {
+            rc = process_vcpu_xsave(ctx, i);
+            if ( rc )
+                return rc;
+        }
+
+        if ( vcpu->msr )
+        {
+            rc = process_vcpu_msrs(ctx, i);
+            if ( rc )
+                return rc;
+        }
+    }
+
+    return rc;
+}
+
+/*
+ * Copy the p2m which has been constructed locally as memory has been
+ * allocated, over the p2m in guest, so the guest can find its memory again on
+ * resume.
+ */
+static int update_guest_p2m(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn, *guest_p2m = NULL;
+    unsigned i;
+    int rc = -1;
+
+    for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+    {
+        pfn = ctx->x86_pv.p2m_pfns[i];
+
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] out of range",
+                  pfn, i);
+            goto err;
+        }
+        else if ( ctx->x86_pv.restore.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] has bad type %u", pfn, i,
+                  ctx->x86_pv.restore.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        mfn = pfn_to_mfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("p2m_frame_list[%u] has bad mfn", i);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[i] = mfn;
+    }
+
+    guest_p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_WRITE,
+                                     ctx->x86_pv.p2m_pfns,
+                                     ctx->x86_pv.p2m_frames );
+    if ( !guest_p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    memcpy(guest_p2m, ctx->x86_pv.p2m,
+           (ctx->x86_pv.max_pfn + 1) * ctx->x86_pv.width);
+    rc = 0;
+ err:
+    if ( guest_p2m )
+        munmap(guest_p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Process a toolstack record.  TODO - remove from spec and code once libxl
+ * framing is sorted.
+ */
+static int handle_toolstack(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore )
+        return 0;
+
+    rc = ctx->restore.callbacks->toolstack_restore(ctx->domid, rec->data, rec->length,
+                                                   ctx->restore.callbacks->data);
+    if ( rc < 0 )
+        PERROR("restoring toolstack");
+    return rc;
+}
+
+/*
+ * Process an X86_PV_INFO record.
+ */
+static int handle_x86_pv_info(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_info *info = rec->data;
+
+    if ( ctx->x86_pv.restore.seen_pv_info )
+    {
+        ERROR("Already received X86_PV_INFO record");
+        return -1;
+    }
+
+    if ( rec->length < sizeof(*info) )
+    {
+        ERROR("X86_PV_INFO record truncated: length %u, expected %zu",
+              rec->length, sizeof(*info));
+        return -1;
+    }
+    else if ( info->guest_width != 4 &&
+              info->guest_width != 8 )
+    {
+        ERROR("Unexpected guest width %u, Expected 4 or 8",
+              info->guest_width);
+        return -1;
+    }
+    else if ( info->guest_width != ctx->x86_pv.width )
+    {
+        int rc;
+        struct xen_domctl domctl;
+
+        /* Try to set address size, domain is always created 64 bit. */
+        memset(&domctl, 0, sizeof(domctl));
+        domctl.domain = ctx->domid;
+        domctl.cmd    = XEN_DOMCTL_set_address_size;
+        domctl.u.address_size.size = info->guest_width * 8;
+        rc = do_domctl(xch, &domctl);
+        if ( rc != 0 )
+        {
+            ERROR("Width of guest in stream (%u"
+                  " bits) differs with existing domain (%u bits)",
+                  info->guest_width * 8, ctx->x86_pv.width * 8);
+            return -1;
+        }
+
+        /* Domain information has changed; refresh it. */
+        rc = x86_pv_domain_info(ctx);
+        if ( rc != 0 )
+        {
+            ERROR("Unable to refresh guest information");
+            return -1;
+        }
+    }
+
+    if ( info->pt_levels != 3 && info->pt_levels != 4 )
+    {
+        ERROR("Unexpected guest levels %u, Expected 3 or 4",
+              info->pt_levels);
+        return -1;
+    }
+    else if ( info->pt_levels != ctx->x86_pv.levels )
+    {
+        ERROR("Levels of guest in stream (%u"
+              ") differ from existing domain (%u)",
+              info->pt_levels, ctx->x86_pv.levels);
+        return -1;
+    }
+
+    ctx->x86_pv.restore.seen_pv_info = true;
+    return 0;
+}
+
+/*
+ * Process an X86_PV_P2M_FRAMES record.  Takes care of expanding the local p2m
+ * state if needed.
+ */
+static int handle_x86_pv_p2m_frames(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_p2m_frames *data = rec->data;
+    unsigned start, end, x, fpp = PAGE_SIZE / ctx->x86_pv.width;
+    int rc;
+
+    if ( !ctx->x86_pv.restore.seen_pv_info )
+    {
+        ERROR("Not yet received X86_PV_INFO record");
+        return -1;
+    }
+
+    if ( rec->length < sizeof(*data) + sizeof(uint64_t) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record truncated: length %u, min %zu",
+              rec->length, sizeof(*data) + sizeof(uint64_t));
+        return -1;
+    }
+    else if ( data->start_pfn > data->end_pfn )
+    {
+        ERROR("Start pfn in stream (%#x) exceeds End (%#x)",
+              data->start_pfn, data->end_pfn);
+        return -1;
+    }
+
+    start = data->start_pfn / fpp;
+    end = data->end_pfn / fpp + 1;
+
+    if ( rec->length != sizeof(*data) + ((end - start) * sizeof(uint64_t)) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record wrong size: start_pfn %#x"
+              ", end_pfn %#x, length %u, expected %zu + (%u - %u) * %zu",
+              data->start_pfn, data->end_pfn, rec->length,
+              sizeof(*data), end, start, sizeof(uint64_t));
+        return -1;
+    }
+
+    if ( data->end_pfn > ctx->x86_pv.max_pfn )
+    {
+        rc = expand_p2m(ctx, data->end_pfn);
+        if ( rc )
+            return rc;
+    }
+
+    for ( x = 0; x < (end - start); ++x )
+        ctx->x86_pv.p2m_pfns[start + x] = data->p2m_pfns[x];
+
+    return 0;
+}
+
+/*
+ * Processes X86_PV_VCPU_{BASIC,EXTENDED,XSAVE,MSRS} records from the stream.
+ * The blobs are all stashed to one side as they need to be deferred until the
+ * very end of the stream, rather than being sent to Xen at the point they
+ * arrive in the stream.  It performs all pre-hypercall size validation.
+ */
+static int handle_x86_pv_vcpu_blob(struct xc_sr_context *ctx,
+                                   struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_x86_pv_vcpu_hdr *vhdr = rec->data;
+    struct xc_sr_x86_pv_restore_vcpu *vcpu;
+    const char *rec_name;
+    size_t blobsz;
+    void *blob;
+    int rc = -1;
+
+    switch ( rec->type )
+    {
+#define RECNAME(x, y) case (x): rec_name = (y) ; break
+        RECNAME(REC_TYPE_X86_PV_VCPU_BASIC,    "X86_PV_VCPU_BASIC");
+        RECNAME(REC_TYPE_X86_PV_VCPU_EXTENDED, "X86_PV_VCPU_EXTENDED");
+        RECNAME(REC_TYPE_X86_PV_VCPU_XSAVE,    "X86_PV_VCPU_XSAVE");
+        RECNAME(REC_TYPE_X86_PV_VCPU_MSRS,     "X86_PV_VCPU_MSRS");
+#undef RECNAME
+    default:
+        ERROR("Unrecognised vcpu blob record %s (%u)",
+              rec_type_to_str(rec->type), rec->type);
+        goto out;
+    }
+
+    /* Confirm that there is a complete header. */
+    if ( rec->length <= sizeof(*vhdr) )
+    {
+        ERROR("%s record truncated: length %u, min %zu",
+              rec_name, rec->length, sizeof(*vhdr) + 1);
+        goto out;
+    }
+
+    blobsz = rec->length - sizeof(*vhdr);
+
+    /* Check that the vcpu id is within range. */
+    if ( vhdr->vcpu_id >= ctx->x86_pv.restore.nr_vcpus )
+    {
+        ERROR("%s record vcpu_id (%u) exceeds domain max (%u)",
+              rec_name, vhdr->vcpu_id, ctx->x86_pv.restore.nr_vcpus - 1);
+        goto out;
+    }
+
+    vcpu = &ctx->x86_pv.restore.vcpus[vhdr->vcpu_id];
+
+    /* Further per-record checks, where possible. */
+    switch ( rec->type )
+    {
+    case REC_TYPE_X86_PV_VCPU_BASIC:
+    {
+        size_t vcpusz = ctx->x86_pv.width == 8 ?
+            sizeof(vcpu_guest_context_x86_64_t) :
+            sizeof(vcpu_guest_context_x86_32_t);
+
+        if ( blobsz != vcpusz )
+        {
+            ERROR("%s record wrong size: expected %zu, got %u",
+                  rec_name, sizeof(*vhdr) + vcpusz, rec->length);
+            goto out;
+        }
+        break;
+    }
+
+    case REC_TYPE_X86_PV_VCPU_EXTENDED:
+        if ( blobsz > 128 )
+        {
+            ERROR("%s record too long: max %zu, got %u",
+                  rec_name, sizeof(*vhdr) + 128, rec->length);
+            goto out;
+        }
+        break;
+
+    case REC_TYPE_X86_PV_VCPU_MSRS:
+        if ( blobsz % sizeof(xen_domctl_vcpu_msr_t) != 0 )
+        {
+            ERROR("%s record payload size %zu expected to be a multiple of %zu",
+                  rec_name, blobsz, sizeof(xen_domctl_vcpu_msr_t));
+            goto out;
+        }
+        break;
+    }
+
+    /* Allocate memory. */
+    blob = malloc(blobsz);
+    if ( !blob )
+    {
+        ERROR("Unable to allocate %zu bytes for vcpu%u %s blob",
+              blobsz, vhdr->vcpu_id, rec_name);
+        goto out;
+    }
+
+    memcpy(blob, &vhdr->context, blobsz);
+
+    /* Stash sideways for later. */
+    switch ( rec->type )
+    {
+#define RECSTORE(x, y) case (x): free(y); (y) = blob; (y ## sz) = blobsz; break
+        RECSTORE(REC_TYPE_X86_PV_VCPU_BASIC,    vcpu->basic);
+        RECSTORE(REC_TYPE_X86_PV_VCPU_EXTENDED, vcpu->extd);
+        RECSTORE(REC_TYPE_X86_PV_VCPU_XSAVE,    vcpu->xsave);
+        RECSTORE(REC_TYPE_X86_PV_VCPU_MSRS,     vcpu->msr);
+#undef RECSTORE
+    }
+
+    rc = 0;
+
+ out:
+    return rc;
+}
+
+/*
+ * Process a SHARED_INFO record from the stream.
+ */
+static int handle_shared_info(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned i;
+    int rc = -1;
+    shared_info_any_t *guest_shinfo = NULL;
+    const shared_info_any_t *old_shinfo = rec->data;
+
+    if ( !ctx->x86_pv.restore.seen_pv_info )
+    {
+        ERROR("Not yet received X86_PV_INFO record");
+        return -1;
+    }
+
+    if ( rec->length != PAGE_SIZE )
+    {
+        ERROR("SHARED_INFO record wrong size: length %u"
+              ", expected 4096", rec->length);
+        goto err;
+    }
+
+    guest_shinfo = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE,
+        ctx->dominfo.shared_info_frame);
+    if ( !guest_shinfo )
+    {
+        PERROR("Failed to map Shared Info at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        goto err;
+    }
+
+    MEMCPY_FIELD(guest_shinfo, old_shinfo, vcpu_info, ctx->x86_pv.width);
+    MEMCPY_FIELD(guest_shinfo, old_shinfo, arch, ctx->x86_pv.width);
+
+    SET_FIELD(guest_shinfo, arch.pfn_to_mfn_frame_list_list, 0, ctx->x86_pv.width);
+
+    MEMSET_ARRAY_FIELD(guest_shinfo, evtchn_pending, 0, ctx->x86_pv.width);
+    for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
+        SET_FIELD(guest_shinfo, vcpu_info[i].evtchn_pending_sel, 0, ctx->x86_pv.width);
+
+    MEMSET_ARRAY_FIELD(guest_shinfo, evtchn_mask, 0xff, ctx->x86_pv.width);
+
+    rc = 0;
+ err:
+
+    if ( guest_shinfo )
+        munmap(guest_shinfo, PAGE_SIZE);
+
+    return rc;
+}
+
+/* restore_ops function. */
+static bool x86_pv_pfn_is_valid(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    return pfn <= ctx->x86_pv.max_pfn;
+}
+
+/* restore_ops function. */
+static void x86_pv_set_page_type(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                                 unsigned long type)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    ctx->x86_pv.restore.pfn_types[pfn] = type;
+}
+
+/* restore_ops function. */
+static void x86_pv_set_gfn(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                           xen_pfn_t mfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    if ( ctx->x86_pv.width == sizeof(uint64_t) )
+        /* 64 bit guest.  Need to expand INVALID_MFN for 32 bit toolstacks. */
+        ((uint64_t *)ctx->x86_pv.p2m)[pfn] = mfn == INVALID_MFN ? ~0ULL : mfn;
+    else
+        /* 32 bit guest.  Can safely truncate INVALID_MFN for 64 bit toolstacks. */
+        ((uint32_t *)ctx->x86_pv.p2m)[pfn] = mfn;
+}
+
+/*
+ * restore_ops function.  Convert pfns back to mfns in pagetables.  Possibly
+ * needs to populate new frames if a PTE is found referring to a frame which
+ * hasn't yet been seen from PAGE_DATA records.
+ */
+static int x86_pv_localise_page(struct xc_sr_context *ctx, uint32_t type, void *page)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t *table = page;
+    uint64_t pte;
+    unsigned i, to_populate;
+    xen_pfn_t pfns[(PAGE_SIZE / sizeof(uint64_t))];
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    /* Only page tables need localisation. */
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    /* Check to see whether we need to populate any new frames. */
+    for ( i = 0, to_populate = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        pte = table[i];
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            xen_pfn_t pfn = pte_to_frame(pte);
+
+#ifdef __i386__
+            if ( pfn == INVALID_MFN )
+            {
+                ERROR("PTE truncation detected.  L%u[%u] = %016"PRIx64,
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                errno = E2BIG;
+                return -1;
+            }
+#endif
+
+            if ( pfn_to_mfn(ctx, pfn) == INVALID_MFN )
+                pfns[to_populate++] = pfn;
+        }
+    }
+
+    if ( to_populate && populate_pfns(ctx, to_populate, pfns, NULL) )
+        return -1;
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        pte = table[i];
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            xen_pfn_t mfn, pfn;
+
+            pfn = pte_to_frame(pte);
+            mfn = pfn_to_mfn(ctx, pfn);
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                ERROR("Bad mfn for L%u[%u] - pte %"PRIx64,
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                dump_bad_pseudophysmap_entry(ctx, mfn);
+                errno = ERANGE;
+                return -1;
+            }
+
+            table[i] = merge_pte(pte, mfn);
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * restore_ops function.  Confirm that the incoming stream matches the type of
+ * domain we are attempting to restore into.
+ */
+static int x86_pv_setup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_X86_PV )
+    {
+        ERROR("Unable to restore %s domain into an x86_pv domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != PAGE_SIZE )
+    {
+        ERROR("Invalid page size %d for x86_pv domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        return rc;
+
+    ctx->x86_pv.restore.nr_vcpus = ctx->dominfo.max_vcpu_id + 1;
+    ctx->x86_pv.restore.vcpus = calloc(ctx->x86_pv.restore.nr_vcpus,
+                                       sizeof(struct xc_sr_x86_pv_restore_vcpu));
+    if ( !ctx->x86_pv.restore.vcpus )
+    {
+        errno = ENOMEM;
+        return -1;
+    }
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        return rc;
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_pv_process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    switch ( rec->type )
+    {
+    case REC_TYPE_X86_PV_INFO:
+        return handle_x86_pv_info(ctx, rec);
+
+    case REC_TYPE_X86_PV_P2M_FRAMES:
+        return handle_x86_pv_p2m_frames(ctx, rec);
+
+    case REC_TYPE_X86_PV_VCPU_BASIC:
+    case REC_TYPE_X86_PV_VCPU_EXTENDED:
+    case REC_TYPE_X86_PV_VCPU_XSAVE:
+    case REC_TYPE_X86_PV_VCPU_MSRS:
+        return handle_x86_pv_vcpu_blob(ctx, rec);
+
+    case REC_TYPE_SHARED_INFO:
+        return handle_shared_info(ctx, rec);
+
+    case REC_TYPE_TOOLSTACK:
+        return handle_toolstack(ctx, rec);
+
+    case REC_TYPE_TSC_INFO:
+        return handle_tsc_info(ctx, rec);
+
+    default:
+        return RECORD_NOT_PROCESSED;
+    }
+}
+
+/*
+ * restore_ops function.  Update the vcpu context in Xen, pin the pagetables,
+ * rewrite the p2m and seed the grant table.
+ */
+static int x86_pv_stream_complete(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = update_vcpu_context(ctx);
+    if ( rc )
+        return rc;
+
+    rc = pin_pagetables(ctx);
+    if ( rc )
+        return rc;
+
+    rc = update_guest_p2m(ctx);
+    if ( rc )
+        return rc;
+
+    rc = xc_dom_gnttab_seed(xch, ctx->domid,
+                            ctx->restore.console_mfn,
+                            ctx->restore.xenstore_mfn,
+                            ctx->restore.console_domid,
+                            ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        return rc;
+    }
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_pv_cleanup(struct xc_sr_context *ctx)
+{
+    free(ctx->x86_pv.p2m);
+    free(ctx->x86_pv.p2m_pfns);
+
+    if ( ctx->x86_pv.restore.vcpus )
+    {
+        unsigned i;
+
+        for ( i = 0; i < ctx->x86_pv.restore.nr_vcpus; ++i )
+        {
+            struct xc_sr_x86_pv_restore_vcpu *vcpu =
+                &ctx->x86_pv.restore.vcpus[i];
+
+            free(vcpu->basic);
+            free(vcpu->extd);
+            free(vcpu->xsave);
+            free(vcpu->msr);
+        }
+
+        free(ctx->x86_pv.restore.vcpus);
+    }
+
+    free(ctx->x86_pv.restore.pfn_types);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return 0;
+}
+
+struct xc_sr_restore_ops restore_ops_x86_pv =
+{
+    .pfn_is_valid    = x86_pv_pfn_is_valid,
+    .pfn_to_gfn      = pfn_to_mfn,
+    .set_page_type   = x86_pv_set_page_type,
+    .set_gfn         = x86_pv_set_gfn,
+    .localise_page   = x86_pv_localise_page,
+    .setup           = x86_pv_setup,
+    .process_record  = x86_pv_process_record,
+    .stream_complete = x86_pv_stream_complete,
+    .cleanup         = x86_pv_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

* [PATCH 16/29] tools/libxc: x86 HVM save code
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (14 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 15/29] tools/libxc: x86 PV restore code Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-10 17:10 ` [PATCH 17/29] tools/libxc: x86 HVM restore code Andrew Cooper
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Save the x86 HVM specific parts of the domain.  This is considerably simpler
than an x86 PV domain.  Only the HVM_CONTEXT and HVM_PARAMS records are
needed.

There is no need for any page normalisation.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
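A minimal sketch, for review purposes, of how the HVM_PARAMS payload in this
patch is put together: a small header carrying a count, followed by one
(index, value) entry per parameter whose value is non-zero.  The struct names,
the `reserved` field and `pack_params()` are hypothetical stand-ins, not the
definitions from the stream format header, and the sketch leaves out the
hypercalls and the actual write to the stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the stream structures; the layout is assumed. */
struct params_hdr   { uint32_t count; uint32_t reserved; };
struct params_entry { uint64_t index; uint64_t value; };

/*
 * Pack the non-zero (index, value) pairs into 'out', which must have room
 * for 'nr' entries.  Returns the total record length: the header plus
 * 'count' entries.
 */
static size_t pack_params(const uint32_t *indices, const uint64_t *values,
                          size_t nr, struct params_hdr *hdr,
                          struct params_entry *out)
{
    size_t i;

    hdr->count = 0;
    hdr->reserved = 0;

    for ( i = 0; i < nr; i++ )
    {
        if ( values[i] == 0 )
            continue; /* Zero-valued params are left out of the record. */

        out[hdr->count].index = indices[i];
        out[hdr->count].value = values[i];
        hdr->count++;
    }

    return sizeof(*hdr) + hdr->count * sizeof(*out);
}
```

Because zero-valued parameters are skipped, the record length varies from
domain to domain, which is why the count travels in the header.
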
 tools/libxc/saverestore/common.h       |    1 +
 tools/libxc/saverestore/save_x86_hvm.c |  259 ++++++++++++++++++++++++++++++++
 2 files changed, 260 insertions(+)
 create mode 100644 tools/libxc/saverestore/save_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index b72711c..de7063e 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -283,6 +283,7 @@ struct xc_sr_context
 };
 
 extern struct xc_sr_save_ops save_ops_x86_pv;
+extern struct xc_sr_save_ops save_ops_x86_hvm;
 
 extern struct xc_sr_restore_ops restore_ops_x86_pv;
 
diff --git a/tools/libxc/saverestore/save_x86_hvm.c b/tools/libxc/saverestore/save_x86_hvm.c
new file mode 100644
index 0000000..11cace2
--- /dev/null
+++ b/tools/libxc/saverestore/save_x86_hvm.c
@@ -0,0 +1,259 @@
+#include <assert.h>
+
+#include "common_x86.h"
+
+#include <xen/hvm/params.h>
+
+/*
+ * Query for the HVM context and write an HVM_CONTEXT record into the stream.
+ */
+static int write_hvm_context(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long hvm_buf_size;
+    int rc;
+    struct xc_sr_record hvm_rec =
+    {
+        .type = REC_TYPE_HVM_CONTEXT,
+    };
+
+    hvm_buf_size = xc_domain_hvm_getcontext(xch, ctx->domid, 0, 0);
+    if ( hvm_buf_size == -1 )
+    {
+        PERROR("Couldn't get HVM context size from Xen");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_rec.data = malloc(hvm_buf_size);
+    if ( !hvm_rec.data )
+    {
+        PERROR("Couldn't allocate memory");
+        rc = -1;
+        goto out;
+    }
+
+    rc = xc_domain_hvm_getcontext(xch, ctx->domid,
+                                  hvm_rec.data, hvm_buf_size);
+    if ( rc < 0 )
+    {
+        PERROR("Couldn't get HVM context from Xen");
+        goto out;
+    }
+    hvm_rec.length = rc;
+
+    rc = write_record(ctx, &hvm_rec);
+    if ( rc < 0 )
+    {
+        PERROR("Failed to write HVM_CONTEXT record");
+        goto out;
+    }
+
+ out:
+    free(hvm_rec.data);
+    return rc;
+}
+
+/*
+ * Query for a range of HVM parameters and write an HVM_PARAMS record into the
+ * stream.
+ */
+static int write_hvm_params(struct xc_sr_context *ctx)
+{
+    static const unsigned int params[] = {
+        HVM_PARAM_STORE_PFN,
+        HVM_PARAM_IOREQ_PFN,
+        HVM_PARAM_BUFIOREQ_PFN,
+        HVM_PARAM_PAGING_RING_PFN,
+        HVM_PARAM_ACCESS_RING_PFN,
+        HVM_PARAM_SHARING_RING_PFN,
+        HVM_PARAM_VM86_TSS,
+        HVM_PARAM_CONSOLE_PFN,
+        HVM_PARAM_ACPI_IOPORTS_LOCATION,
+        HVM_PARAM_VIRIDIAN,
+        HVM_PARAM_IDENT_PT,
+        HVM_PARAM_PAE_ENABLED,
+        HVM_PARAM_VM_GENERATION_ID_ADDR,
+        HVM_PARAM_IOREQ_SERVER_PFN,
+        HVM_PARAM_NR_IOREQ_SERVER_PAGES,
+    };
+
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_hvm_params_entry entries[ARRAY_SIZE(params)];
+    struct xc_sr_rec_hvm_params hdr = {
+        .count = 0,
+    };
+    struct xc_sr_record rec = {
+        .type   = REC_TYPE_HVM_PARAMS,
+        .length = sizeof(hdr),
+        .data   = &hdr,
+    };
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < ARRAY_SIZE(params); i++ )
+    {
+        uint32_t index = params[i];
+        unsigned long value;
+
+        rc = xc_get_hvm_param(xch, ctx->domid, index, &value);
+        if ( rc )
+        {
+            /* Gross XenServer hack. Consider HVM_PARAM_CONSOLE_PFN failure
+             * nonfatal. This is related to the fact it is impossible to
+             * distinguish "no console" from a console at pfn/evtchn 0.
+             *
+             * TODO - find a compatible way to fix this.
+             */
+            if ( index == HVM_PARAM_CONSOLE_PFN )
+                continue;
+
+            PERROR("Failed to get HVMPARAM at index %u", index);
+            return rc;
+        }
+
+        if ( value != 0 )
+        {
+            entries[hdr.count].index = index;
+            entries[hdr.count].value = value;
+            hdr.count++;
+        }
+    }
+
+    rc = write_split_record(ctx, &rec, entries, hdr.count * sizeof(*entries));
+    if ( rc )
+        PERROR("Failed to write HVM_PARAMS record");
+
+    return rc;
+}
+
+/* TODO - remove. */
+static int write_toolstack(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_record rec = {
+        .type = REC_TYPE_TOOLSTACK,
+        .length = 0,
+    };
+    uint8_t *buf;
+    uint32_t len;
+    int rc;
+
+    if ( !ctx->save.callbacks || !ctx->save.callbacks->toolstack_save )
+        return 0;
+
+    if ( ctx->save.callbacks->toolstack_save(ctx->domid, &buf, &len, ctx->save.callbacks->data) < 0 )
+    {
+        PERROR("Error calling toolstack_save");
+        return -1;
+    }
+
+    rc = write_split_record(ctx, &rec, buf, len);
+    if ( rc < 0 )
+        PERROR("Error writing TOOLSTACK record");
+    free(buf);
+    return rc;
+}
+
+static xen_pfn_t x86_hvm_pfn_to_gfn(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    /* identity map */
+    return pfn;
+}
+
+static int x86_hvm_normalise_page(struct xc_sr_context *ctx, xen_pfn_t type, void **page)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_setup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( !ctx->save.callbacks->switch_qemu_logdirty )
+    {
+        ERROR("No switch_qemu_logdirty callback provided");
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ( ctx->save.callbacks->switch_qemu_logdirty(
+             ctx->domid, 1, ctx->save.callbacks->data) )
+    {
+        PERROR("Couldn't enable qemu log-dirty mode");
+        return -1;
+    }
+
+    ctx->x86_hvm.save.qemu_enabled_logdirty = true;
+
+    return 0;
+}
+
+static int x86_hvm_start_of_stream(struct xc_sr_context *ctx)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_end_of_stream(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        return rc;
+
+    /* TODO - remove. */
+    rc = write_toolstack(ctx);
+    if ( rc )
+        return rc;
+
+    /* Write the HVM_CONTEXT record. */
+    rc = write_hvm_context(ctx);
+    if ( rc )
+        return rc;
+
+    /* Write an HVM_PARAMS record containing the applicable HVM params. */
+    rc = write_hvm_params(ctx);
+    if ( rc )
+        return rc;
+
+    return rc;
+}
+
+static int x86_hvm_cleanup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    /* If qemu successfully enabled logdirty mode, attempt to disable. */
+    if ( ctx->x86_hvm.save.qemu_enabled_logdirty &&
+         ctx->save.callbacks->switch_qemu_logdirty(
+             ctx->domid, 0, ctx->save.callbacks->data) )
+    {
+        PERROR("Couldn't disable qemu log-dirty mode");
+        return -1;
+    }
+
+    return 0;
+}
+
+struct xc_sr_save_ops save_ops_x86_hvm =
+{
+    .pfn_to_gfn      = x86_hvm_pfn_to_gfn,
+    .normalise_page  = x86_hvm_normalise_page,
+    .setup           = x86_hvm_setup,
+    .start_of_stream = x86_hvm_start_of_stream,
+    .end_of_stream   = x86_hvm_end_of_stream,
+    .cleanup         = x86_hvm_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

* [PATCH 17/29] tools/libxc: x86 HVM restore code
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (15 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 16/29] tools/libxc: x86 HVM save code Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-10 17:10 ` [PATCH 18/29] tools/libxc: noarch save code Andrew Cooper
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Restore the x86 HVM specific parts of a domain.  This is the HVM_CONTEXT and
HVM_PARAMS records.

There is no need for any page localisation.

This also includes writing the trailing qemu save record to a file because
this is what libxc currently does.  This is intended to be moved into libxl
proper in the future.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
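A minimal sketch of the length check a consumer of an HVM_PARAMS record needs
before walking the entries: the record must hold a full header, and the
advertised count must fit in the bytes that remain.  Dividing the remaining
length by the entry size, rather than multiplying the count, avoids overflow
where size_t is 32 bits wide.  The struct names and `params_record_ok()` are
hypothetical, and the layout is an assumption rather than the definition from
the stream format header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the stream structures; the layout is assumed. */
struct params_hdr   { uint32_t count; uint32_t reserved; };
struct params_entry { uint64_t index; uint64_t value; };

/* Return true if 'length' bytes can hold the header and all of its entries. */
static bool params_record_ok(const void *data, uint32_t length)
{
    const struct params_hdr *hdr = data;

    /* Do not read the count unless the whole header is present. */
    if ( length < sizeof(*hdr) )
        return false;

    /* Overflow-safe form of: sizeof(*hdr) + count * sizeof(entry) <= length */
    return hdr->count <= (length - sizeof(*hdr)) / sizeof(struct params_entry);
}
```
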
 tools/libxc/saverestore/common.h          |    1 +
 tools/libxc/saverestore/restore_x86_hvm.c |  355 +++++++++++++++++++++++++++++
 2 files changed, 356 insertions(+)
 create mode 100644 tools/libxc/saverestore/restore_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index de7063e..193dd0e 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -286,6 +286,7 @@ extern struct xc_sr_save_ops save_ops_x86_pv;
 extern struct xc_sr_save_ops save_ops_x86_hvm;
 
 extern struct xc_sr_restore_ops restore_ops_x86_pv;
+extern struct xc_sr_restore_ops restore_ops_x86_hvm;
 
 struct xc_sr_record
 {
diff --git a/tools/libxc/saverestore/restore_x86_hvm.c b/tools/libxc/saverestore/restore_x86_hvm.c
new file mode 100644
index 0000000..c65f9ac
--- /dev/null
+++ b/tools/libxc/saverestore/restore_x86_hvm.c
@@ -0,0 +1,355 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
+#include "common_x86.h"
+
+/* TODO: remove */
+static int handle_toolstack(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore )
+        return 0;
+
+    rc = ctx->restore.callbacks->toolstack_restore(ctx->domid, rec->data, rec->length,
+                                                   ctx->restore.callbacks->data);
+    if ( rc < 0 )
+        PERROR("restoring toolstack");
+    return rc;
+}
+
+/*
+ * Process an HVM_CONTEXT record from the stream.
+ */
+static int handle_hvm_context(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    void *p;
+
+    p = malloc(rec->length);
+    if ( !p )
+    {
+        ERROR("Unable to allocate %u bytes for hvm context", rec->length);
+        return -1;
+    }
+
+    free(ctx->x86_hvm.restore.context);
+
+    ctx->x86_hvm.restore.context = memcpy(p, rec->data, rec->length);
+    ctx->x86_hvm.restore.contextsz = rec->length;
+
+    return 0;
+}
+
+/*
+ * Process an HVM_PARAMS record from the stream.
+ */
+static int handle_hvm_params(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_hvm_params *hdr = rec->data;
+    struct xc_sr_rec_hvm_params_entry *entry = hdr->param;
+    unsigned int i;
+    int rc;
+
+    if ( rec->length < sizeof(*hdr)
+         || (rec->length - sizeof(*hdr)) / sizeof(*entry) < hdr->count )
+    {
+        ERROR("HVM_PARAMS record is too short");
+        return -1;
+    }
+
+    for ( i = 0; i < hdr->count; i++, entry++ )
+    {
+        switch ( entry->index )
+        {
+        case HVM_PARAM_CONSOLE_PFN:
+            ctx->restore.console_mfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_STORE_PFN:
+            ctx->restore.xenstore_mfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_IOREQ_PFN:
+        case HVM_PARAM_BUFIOREQ_PFN:
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        }
+
+        rc = xc_set_hvm_param(xch, ctx->domid, entry->index, entry->value);
+        if ( rc < 0 )
+        {
+            PERROR("Failed to set HVM param %"PRIu64" = 0x%016"PRIx64,
+                   entry->index, entry->value);
+            return rc;
+        }
+    }
+    return 0;
+}
+
+/* TODO: remove */
+static __attribute__((unused)) int dump_qemu(struct xc_sr_context *ctx)
+{
+#ifdef XG_DUMP_QEMU
+    xc_interface *xch = ctx->xch;
+    char qemusig[21], path[256];
+    uint32_t qlen;
+    void *qbuf = NULL;
+    int rc = -1;
+    FILE *fp = NULL;
+
+    if ( read_exact(ctx->fd, qemusig, sizeof(qemusig)) )
+    {
+        PERROR("Error reading QEMU signature");
+        goto out;
+    }
+
+    if ( !memcmp(qemusig, "DeviceModelRecord0002", sizeof(qemusig)) )
+    {
+        if ( read_exact(ctx->fd, &qlen, sizeof(qlen)) )
+        {
+            PERROR("Error reading QEMU record length");
+            goto out;
+        }
+
+        qbuf = malloc(qlen);
+        if ( !qbuf )
+        {
+            PERROR("no memory for device model state");
+            goto out;
+        }
+
+        if ( read_exact(ctx->fd, qbuf, qlen) )
+        {
+            PERROR("Error reading device model state");
+            goto out;
+        }
+    }
+    /* XenServer hack, until Xapi compatibility code is written. */
+    else if ( !memcmp(qemusig, "QemuDeviceModelRecord", sizeof(qemusig)) )
+    {
+        char head[9];
+
+        if ( read_exact(ctx->fd, &head, sizeof(head)) )
+        {
+            PERROR("Error reading QEMU record head");
+            goto out;
+        }
+
+        if ( head[0] == '\n' && !memcmp(&head[5], "QEVM", 4) )
+            qlen = ntohl(*((uint32_t*)&head[1])) - 4;
+        else
+        {
+            ERROR("Unrecognised data following %s", qemusig);
+            goto out;
+        }
+
+        qbuf = malloc(qlen);
+        if ( !qbuf )
+        {
+            PERROR("no memory for device model state");
+            goto out;
+        }
+
+        memcpy(qbuf, "QEVM", 4);
+        if ( read_exact(ctx->fd, &((uint8_t*)qbuf)[4], qlen - 4) )
+        {
+            PERROR("Error reading device model state");
+            goto out;
+        }
+    }
+    else
+    {
+        ERROR("Invalid device model state signature '%*.*s'",
+              (int)sizeof(qemusig), (int)sizeof(qemusig), qemusig);
+        goto out;
+    }
+
+    snprintf(path, sizeof(path), XC_DEVICE_MODEL_RESTORE_FILE".%u", ctx->domid);
+    fp = fopen(path, "wb");
+    if ( !fp )
+    {
+        PERROR("Failed to open '%s' for writing", path);
+        goto out;
+    }
+
+    DPRINTF("Writing %u bytes of QEMU data", qlen);
+    if ( fwrite(qbuf, 1, qlen, fp) != qlen )
+    {
+        PERROR("Failed to write %u bytes of QEMU data", qlen);
+        goto out;
+    }
+
+    rc = 0;
+
+ out:
+    if ( fp )
+        fclose(fp);
+    free(qbuf);
+
+    return rc;
+#else
+    return 0;
+#endif
+}
+
+/* restore_ops function. */
+static bool x86_hvm_pfn_is_valid(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    return true;
+}
+
+/* restore_ops function. */
+static xen_pfn_t x86_hvm_pfn_to_gfn(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    return pfn;
+}
+
+/* restore_ops function. */
+static void x86_hvm_set_gfn(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                            xen_pfn_t gfn)
+{
+    /* no-op */
+}
+
+/* restore_ops function. */
+static void x86_hvm_set_page_type(struct xc_sr_context *ctx, xen_pfn_t pfn, xen_pfn_t type)
+{
+    /* no-op */
+}
+
+/* restore_ops function. */
+static int x86_hvm_localise_page(struct xc_sr_context *ctx, uint32_t type, void *page)
+{
+    /* no-op */
+    return 0;
+}
+
+/*
+ * restore_ops function. Confirms the stream matches the domain.
+ */
+static int x86_hvm_setup(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_X86_HVM )
+    {
+        ERROR("Unable to restore %s domain into an x86_hvm domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != PAGE_SIZE )
+    {
+        ERROR("Invalid page size %d for x86_hvm domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_hvm_process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    switch ( rec->type )
+    {
+    case REC_TYPE_TSC_INFO:
+        return handle_tsc_info(ctx, rec);
+
+    case REC_TYPE_HVM_CONTEXT:
+        return handle_hvm_context(ctx, rec);
+
+    case REC_TYPE_HVM_PARAMS:
+        return handle_hvm_params(ctx, rec);
+
+    case REC_TYPE_TOOLSTACK:
+        return handle_toolstack(ctx, rec);
+
+    default:
+        return RECORD_NOT_PROCESSED;
+    }
+}
+
+/*
+ * restore_ops function.  Sets extra hvm parameters and seeds the grant table.
+ */
+static int x86_hvm_stream_complete(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = xc_set_hvm_param(xch, ctx->domid, HVM_PARAM_STORE_EVTCHN,
+                          ctx->restore.xenstore_evtchn);
+    if ( rc )
+    {
+        PERROR("Failed to set HVM_PARAM_STORE_EVTCHN");
+        return rc;
+    }
+
+    rc = xc_domain_hvm_setcontext(xch, ctx->domid,
+                                  ctx->x86_hvm.restore.context,
+                                  ctx->x86_hvm.restore.contextsz);
+    if ( rc < 0 )
+    {
+        PERROR("Unable to restore HVM context");
+        return rc;
+    }
+
+    rc = xc_dom_gnttab_hvm_seed(xch, ctx->domid,
+                                ctx->restore.console_mfn,
+                                ctx->restore.xenstore_mfn,
+                                ctx->restore.console_domid,
+                                ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        return rc;
+    }
+
+    /*
+     * FIXME: reading the device model state from the stream should be
+     * done by libxl.
+     */
+    rc = dump_qemu(ctx);
+    if ( rc )
+    {
+        ERROR("Failed to dump qemu");
+        return rc;
+    }
+
+    return rc;
+}
+
+static int x86_hvm_cleanup(struct xc_sr_context *ctx)
+{
+    free(ctx->x86_hvm.restore.context);
+
+    return 0;
+}
+
+struct xc_sr_restore_ops restore_ops_x86_hvm =
+{
+    .pfn_is_valid    = x86_hvm_pfn_is_valid,
+    .pfn_to_gfn      = x86_hvm_pfn_to_gfn,
+    .set_gfn         = x86_hvm_set_gfn,
+    .set_page_type   = x86_hvm_set_page_type,
+    .localise_page   = x86_hvm_localise_page,
+    .setup           = x86_hvm_setup,
+    .process_record  = x86_hvm_process_record,
+    .stream_complete = x86_hvm_stream_complete,
+    .cleanup         = x86_hvm_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH 18/29] tools/libxc: noarch save code
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Save a domain, calling domain-type-specific functions at the appropriate
points.  This implements the xc_domain_save2() API function which is
equivalent to the existing xc_domain_save().

This writes the image and domain headers, and writes all the PAGE_DATA records
using a "live" process.
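
As a rough illustration of what the "live" process amounts to, the sketch
below models only the loop bounds.  It is not the libxc code:
demo_live_rounds() and the dirty[] array are hypothetical stand-ins for the
logdirty statistics, and only the two limits (an iteration cap and a
dirty-page threshold, set to 5 and 50 in this patch) come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the live loop.  dirty[i] is the number of pages the
 * guest dirtied while pass i was being sent.  Returns the number of memory
 * passes made before the domain would be suspended for the final pass.
 */
static unsigned demo_live_rounds(const unsigned *dirty, size_t nr_dirty,
                                 unsigned p2m_size,
                                 unsigned max_iterations,
                                 unsigned dirty_threshold)
{
    unsigned dirty_count = p2m_size;  /* Initially everything is "dirty". */
    unsigned x, passes = 1;           /* Pass 0 sends every page. */

    for ( x = 1; x < max_iterations && dirty_count > dirty_threshold; ++x )
    {
        /* Stand-in for fetching and clearing the logdirty bitmap. */
        dirty_count = (x - 1 < nr_dirty) ? dirty[x - 1] : 0;

        if ( dirty_count == 0 )
            break;

        ++passes;                     /* Resend only the dirtied pages. */
    }

    assert(passes <= max_iterations || max_iterations == 0);
    return passes;
}
```

A guest which keeps dirtying more pages than the threshold stops at the
iteration cap; a quiet guest drops out early and is suspended sooner.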

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

---
v6:
 * Use writev() instead of write() for PAGE_DATA.
 * Introduce separate non-live suspend which avoids playing with logdirty.
 * Fix a bug which would clobber the penultimate logdirty bitmap.
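
For anyone unfamiliar with the writev() change, here is a minimal sketch of
a gathered write.  demo_write_gathered() is a hypothetical helper written
for this note, not the writev_exact() used by the patch; it assumes a
blocking fd and only shows how a header and a payload held in separate
buffers can be sent without first copying them into one buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a header followed by a payload using writev(),
 * retrying on EINTR and on short writes.
 * Returns 0 on success, -1 on error with errno set by writev().
 */
static int demo_write_gathered(int fd, const void *hdr, size_t hdr_len,
                               const void *payload, size_t payload_len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = hdr_len },
        { .iov_base = (void *)payload, .iov_len = payload_len },
    };
    struct iovec *cur = iov;
    int cnt = 2;

    while ( cnt > 0 )
    {
        ssize_t len = writev(fd, cur, cnt);

        if ( len < 0 )
        {
            if ( errno == EINTR )
                continue;
            return -1;
        }

        /* Skip fully written entries, then trim a partially written one. */
        while ( cnt > 0 && (size_t)len >= cur->iov_len )
        {
            len -= cur->iov_len;
            ++cur;
            --cnt;
        }
        if ( cnt > 0 )
        {
            cur->iov_base = (char *)cur->iov_base + len;
            cur->iov_len -= len;
        }
    }

    assert(cnt == 0);
    return 0;
}
```

In write_batch() the same idea is applied with one iovec entry per page, so
mapped guest pages go straight to the stream.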
---
 tools/libxc/saverestore/save.c |  737 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 736 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index f6ad734..1517759 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -1,11 +1,746 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
 #include "common.h"
 
+/*
+ * Writes an Image header and Domain header into the stream.
+ */
+static int write_headers(struct xc_sr_context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
+    struct xc_sr_ihdr ihdr =
+        {
+            .marker  = IHDR_MARKER,
+            .id      = htonl(IHDR_ID),
+            .version = htonl(IHDR_VERSION),
+            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
+        };
+    struct xc_sr_dhdr dhdr =
+        {
+            .type       = guest_type,
+            .page_shift = XC_PAGE_SHIFT,
+            .xen_major  = (xen_version >> 16) & 0xffff,
+            .xen_minor  = (xen_version)       & 0xffff,
+        };
+
+    if ( xen_version < 0 )
+    {
+        PERROR("Unable to obtain Xen Version");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
+    {
+        PERROR("Unable to write Image Header to stream");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
+    {
+        PERROR("Unable to write Domain Header to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Writes an END record into the stream.
+ */
+static int write_end_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record end = { REC_TYPE_END, 0, NULL };
+
+    return write_record(ctx, &end);
+}
+
+/*
+ * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
+ * is constructed in ctx->save.batch_pfns.
+ *
+ * This function:
+ * - gets the types for each pfn in the batch.
+ * - for each pfn with real data:
+ *   - maps and attempts to localise the pages.
+ * - constructs and writes a PAGE_DATA record into the stream.
+ */
+static int write_batch(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = NULL, *types = NULL;
+    void *guest_mapping = NULL;
+    void **guest_data = NULL;
+    void **local_pages = NULL;
+    int *errors = NULL, rc = -1;
+    unsigned i, p, nr_pages = 0;
+    unsigned nr_pfns = ctx->save.nr_batch_pfns;
+    void *page, *orig_page;
+    uint64_t *rec_pfns = NULL;
+    struct iovec *iov = NULL; int iovcnt = 0;
+    struct xc_sr_rec_page_data_header hdr = { 0 };
+    struct xc_sr_record rec =
+    {
+        .type = REC_TYPE_PAGE_DATA,
+    };
+
+    assert(nr_pfns != 0);
+
+    /* Mfns of the batch pfns. */
+    mfns = malloc(nr_pfns * sizeof(*mfns));
+    /* Types of the batch pfns. */
+    types = malloc(nr_pfns * sizeof(*types));
+    /* Errors from attempting to map the mfns. */
+    errors = malloc(nr_pfns * sizeof(*errors));
+    /* Pointers to page data to send.  Either mapped mfns or local allocations. */
+    guest_data = calloc(nr_pfns, sizeof(*guest_data));
+    /* Pointers to locally allocated pages.  Need freeing. */
+    local_pages = calloc(nr_pfns, sizeof(*local_pages));
+    /* iovec[] for writev(). */
+    iov = malloc((nr_pfns + 4) * sizeof(*iov));
+
+    if ( !mfns || !types || !errors || !guest_data || !local_pages || !iov )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx, ctx->save.batch_pfns[i]);
+
+        /* Likely a ballooned page. */
+        if ( mfns[i] == INVALID_MFN )
+        {
+            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+            ++ctx->save.nr_deferred_pages;
+        }
+    }
+
+    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
+    if ( rc )
+    {
+        PERROR("Failed to get types for pfn batch");
+        goto err;
+    }
+    rc = -1;
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        mfns[nr_pages++] = mfns[i];
+    }
+
+    if ( nr_pages > 0 )
+    {
+        guest_mapping = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
+        if ( !guest_mapping )
+        {
+            PERROR("Failed to map guest pages");
+            goto err;
+        }
+    }
+
+    for ( i = 0, p = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        if ( errors[p] )
+        {
+            ERROR("Mapping of pfn %#lx (mfn %#lx) failed %d",
+                  ctx->save.batch_pfns[i], mfns[p], errors[p]);
+            goto err;
+        }
+
+        orig_page = page = guest_mapping + (p * PAGE_SIZE);
+        rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
+        if ( rc )
+        {
+            if ( rc == -1 && errno == EAGAIN )
+            {
+                set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+                ++ctx->save.nr_deferred_pages;
+                types[i] = XEN_DOMCTL_PFINFO_XTAB;
+                --nr_pages;
+            }
+            else
+                goto err;
+        }
+        else
+            guest_data[i] = page;
+
+        if ( page != orig_page )
+            local_pages[i] = page;
+        rc = -1;
+
+        ++p;
+    }
+
+    rec_pfns = malloc(nr_pfns * sizeof(*rec_pfns));
+    if ( !rec_pfns )
+    {
+        ERROR("Unable to allocate %zu bytes of memory for page data pfn list",
+              nr_pfns * sizeof(*rec_pfns));
+        goto err;
+    }
+
+    hdr.count = nr_pfns;
+
+    rec.length = sizeof(hdr);
+    rec.length += nr_pfns * sizeof(*rec_pfns);
+    rec.length += nr_pages * PAGE_SIZE;
+
+    for ( i = 0; i < nr_pfns; ++i )
+        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.batch_pfns[i];
+
+    iov[0].iov_base = &rec.type;
+    iov[0].iov_len = sizeof(rec.type);
+
+    iov[1].iov_base = &rec.length;
+    iov[1].iov_len = sizeof(rec.length);
+
+    iov[2].iov_base = &hdr;
+    iov[2].iov_len = sizeof(hdr);
+
+    iov[3].iov_base = rec_pfns;
+    iov[3].iov_len = nr_pfns * sizeof(*rec_pfns);
+
+    for ( i = 0, iovcnt = 4; i < nr_pfns; ++i )
+    {
+        if ( guest_data[i] )
+        {
+            iov[iovcnt].iov_base = guest_data[i];
+            iov[iovcnt].iov_len = PAGE_SIZE;
+            iovcnt++;
+            --nr_pages;
+        }
+    }
+
+    if ( writev_exact(ctx->fd, iov, iovcnt) )
+    {
+        PERROR("Failed to write page data to stream");
+        goto err;
+    }
+
+    /* Sanity check we have sent all the pages we expected to. */
+    assert(nr_pages == 0);
+    rc = ctx->save.nr_batch_pfns = 0;
+
+ err:
+    free(rec_pfns);
+    if ( guest_mapping )
+        munmap(guest_mapping, nr_pages * PAGE_SIZE);
+    for ( i = 0; local_pages && i < nr_pfns; ++i )
+        free(local_pages[i]);
+    free(iov);
+    free(local_pages);
+    free(guest_data);
+    free(errors);
+    free(types);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Flush a batch of pfns into the stream.
+ */
+static int flush_batch(struct xc_sr_context *ctx)
+{
+    int rc = 0;
+
+    if ( ctx->save.nr_batch_pfns == 0 )
+        return rc;
+
+    rc = write_batch(ctx);
+
+    if ( !rc )
+    {
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+                                    MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
+    }
+
+    return rc;
+}
+
+/*
+ * Add a single pfn to the batch, flushing the batch if full.
+ */
+static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    int rc = 0;
+
+    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
+        rc = flush_batch(ctx);
+
+    if ( rc == 0 )
+        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
+
+    return rc;
+}
+
+/*
+ * Pause/suspend the domain, and refresh ctx->dominfo if required.
+ */
+static int suspend_domain(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    /* TODO: Properly specify the return value from this callback.  All
+     * implementations currently appear to return 1 for success, whereas
+     * the legacy code checks for != 0. */
+    int cb_rc = ctx->save.callbacks->suspend(ctx->save.callbacks->data);
+
+    if ( cb_rc == 0 )
+    {
+        ERROR("save callback suspend() failed: %d", cb_rc);
+        return -1;
+    }
+
+    /* Refresh domain information. */
+    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
+         (ctx->dominfo.domid != ctx->domid) )
+    {
+        PERROR("Unable to refresh domain information");
+        return -1;
+    }
+
+    /* Confirm the domain has actually been paused. */
+    if ( !ctx->dominfo.shutdown ||
+         (ctx->dominfo.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain has not been suspended: shutdown %d, reason %d",
+              ctx->dominfo.shutdown, ctx->dominfo.shutdown_reason);
+        return -1;
+    }
+
+    xc_report_progress_single(xch, "Domain now suspended");
+
+    return 0;
+}
+
+/*
+ * Send all pages in the guest's p2m.  Used as the first iteration of the live
+ * migration loop, and for a non-live save.
+ */
+static int send_all_pages(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t p;
+    int rc;
+
+    for ( p = 0; p < ctx->save.p2m_size; ++p )
+    {
+        rc = add_to_batch(ctx, p);
+        if ( rc )
+            return rc;
+
+        /* Update progress every 4MB worth of memory sent. */
+        if ( (p & ((1U << (22 - 12)) - 1)) == 0 )
+            xc_report_progress_step(xch, p, ctx->save.p2m_size);
+    }
+
+    rc = flush_batch(ctx);
+    if ( rc )
+        return rc;
+
+    xc_report_progress_step(xch, ctx->save.p2m_size,
+                            ctx->save.p2m_size);
+    return 0;
+}
+
+/*
+ * Send a subset of pages in the guest's p2m, according to the provided bitmap.
+ * Used for each subsequent iteration of the live migration loop.
+ *
+ * Bitmap is bounded by p2m_size.
+ */
+static int send_some_pages(struct xc_sr_context *ctx,
+                           unsigned long *bitmap,
+                           unsigned long entries)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t p;
+    unsigned long written;
+    int rc;
+
+    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
+    {
+        if ( !test_bit(p, bitmap) )
+            continue;
+
+        rc = add_to_batch(ctx, p);
+        if ( rc )
+            return rc;
+
+        /* Update progress every 4MB worth of memory sent. */
+        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
+            xc_report_progress_step(xch, written, entries);
+
+        ++written;
+    }
+
+    rc = flush_batch(ctx);
+    if ( rc )
+        return rc;
+
+    if ( written > entries )
+        DPRINTF("Bitmap contained more entries than expected...");
+
+    xc_report_progress_step(xch, entries, entries);
+    return 0;
+}
+
+static int update_progress_string(struct xc_sr_context *ctx,
+                                  char **str, unsigned iter)
+{
+    xc_interface *xch = ctx->xch;
+
+    free(*str);
+    if ( asprintf(str, "Memory iteration %u of %u",
+                  iter, ctx->save.max_iterations) == -1 )
+    {
+        ERROR("Unable to allocate progress string");
+        return -1;
+    }
+
+    xc_report_progress_set(xch, *str);
+    return 0;
+}
+
+/*
+ * Send all domain memory.  This is the heart of the live migration loop.
+ */
+static int send_domain_memory_live(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
+    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
+    char *progress_str = NULL;
+    unsigned x;
+    int rc = -1;
+
+    to_send = xc_hypercall_buffer_alloc_pages(
+        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
+
+    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
+    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
+
+    if ( !ctx->save.batch_pfns || !to_send || !ctx->save.deferred_pages )
+    {
+        ERROR("Unable to allocate memory for to_send/deferred/batch bitmaps");
+        goto out;
+    }
+
+    /* This juggling is required if logdirty is enabled for VRAM tracking. */
+    if ( xc_shadow_control(xch, ctx->domid,
+                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                           NULL, 0, NULL, 0, NULL) < 0 )
+    {
+        int frc = xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                                    NULL, 0, NULL, 0, NULL);
+        if ( frc >= 0 )
+            frc = xc_shadow_control(xch, ctx->domid,
+                                    XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                                    NULL, 0, NULL, 0, NULL);
+        if ( frc < 0 )
+        {
+            PERROR("Failed to enable logdirty: rc %d", frc);
+            goto out;
+        }
+    }
+
+    rc = update_progress_string(ctx, &progress_str, 0);
+    if ( rc )
+        goto out;
+
+    rc = send_all_pages(ctx);
+    if ( rc )
+        goto out;
+
+    for ( x = 1;
+          ((x < ctx->save.max_iterations) &&
+           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
+    {
+        if ( xc_shadow_control(
+                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                 HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+                 NULL, 0, &stats) != ctx->save.p2m_size )
+        {
+            PERROR("Failed to retrieve logdirty bitmap");
+            rc = -1;
+            goto out;
+        }
+
+        if ( stats.dirty_count == 0 )
+            break;
+
+        rc = update_progress_string(ctx, &progress_str, x);
+        if ( rc )
+            goto out;
+
+        rc = send_some_pages(ctx, to_send, stats.dirty_count);
+        if ( rc )
+            goto out;
+    }
+
+    rc = suspend_domain(ctx);
+    if ( rc )
+        goto out;
+
+    if ( xc_shadow_control(
+             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+             NULL, 0, &stats) != ctx->save.p2m_size )
+    {
+        PERROR("Failed to retrieve logdirty bitmap");
+        rc = -1;
+        goto out;
+    }
+
+    rc = update_progress_string(ctx, &progress_str, ctx->save.max_iterations);
+    if ( rc )
+        goto out;
+
+    bitmap_or(to_send, ctx->save.deferred_pages, ctx->save.p2m_size);
+
+    rc = send_some_pages(ctx, to_send,
+                         stats.dirty_count + ctx->save.nr_deferred_pages);
+    if ( rc )
+        goto out;
+
+    if ( ctx->save.debug )
+    {
+        struct xc_sr_record rec =
+        {
+            .type = REC_TYPE_VERIFY,
+            .length = 0,
+        };
+
+        DPRINTF("Enabling verify mode");
+
+        rc = write_record(ctx, &rec);
+        if ( rc )
+            goto out;
+
+        xc_report_progress_set(xch, "Memory verify");
+        rc = send_all_pages(ctx);
+        if ( rc )
+            goto out;
+
+        if ( xc_shadow_control(
+                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_PEEK,
+                 HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+                 NULL, 0, &stats) != ctx->save.p2m_size )
+        {
+            PERROR("Failed to retrieve logdirty bitmap");
+            rc = -1;
+            goto out;
+        }
+
+        DPRINTF("  Further stats: faults %u, dirty %u",
+                stats.fault_count, stats.dirty_count);
+    }
+
+ out:
+    xc_report_progress_set(xch, NULL);
+    free(progress_str);
+    xc_hypercall_buffer_free_pages(xch, to_send,
+                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
+    free(ctx->save.deferred_pages);
+    free(ctx->save.batch_pfns);
+    return rc;
+}
+
+/*
+ * Send all domain memory, pausing the domain first.  Generally used for
+ * suspend-to-file.
+ */
+static int send_domain_memory_nonlive(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+
+    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
+                                  sizeof(*ctx->save.batch_pfns));
+    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
+
+    if ( !ctx->save.batch_pfns || !ctx->save.deferred_pages )
+    {
+        PERROR("Failed to allocate memory for nonlive tracking structures");
+        errno = ENOMEM;
+        goto err;
+    }
+
+    rc = suspend_domain(ctx);
+    if ( rc )
+        goto err;
+
+    xc_report_progress_set(xch, "Memory");
+
+    rc = send_all_pages(ctx);
+    if ( rc )
+        goto err;
+
+ err:
+    free(ctx->save.deferred_pages);
+    free(ctx->save.batch_pfns);
+
+    return rc;
+}
+
+/*
+ * Save a domain.
+ */
+static int save(struct xc_sr_context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int rc, saved_rc = 0, saved_errno = 0;
+
+    IPRINTF("Saving domain %d, type %s",
+            ctx->domid, dhdr_type_to_str(guest_type));
+
+    rc = ctx->save.ops.setup(ctx);
+    if ( rc )
+        goto err;
+
+    xc_report_progress_single(xch, "Start of stream");
+
+    rc = write_headers(ctx, guest_type);
+    if ( rc )
+        goto err;
+
+    rc = ctx->save.ops.start_of_stream(ctx);
+    if ( rc )
+        goto err;
+
+    if ( ctx->save.live )
+        rc = send_domain_memory_live(ctx);
+    else
+        rc = send_domain_memory_nonlive(ctx);
+
+    if ( rc )
+        goto err;
+
+    if ( !ctx->dominfo.shutdown ||
+         (ctx->dominfo.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain has not been suspended");
+        rc = -1;
+        goto err;
+    }
+
+    xc_report_progress_single(xch, "End of stream");
+
+    rc = ctx->save.ops.end_of_stream(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_end_record(ctx);
+    if ( rc )
+        goto err;
+
+    xc_report_progress_single(xch, "Complete");
+    goto done;
+
+ err:
+    saved_errno = errno;
+    saved_rc = rc;
+    PERROR("Save failed");
+
+ done:
+    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                      NULL, 0, NULL, 0, NULL);
+
+    rc = ctx->save.ops.cleanup(ctx);
+    if ( rc )
+        PERROR("Failed to clean up");
+
+    if ( saved_rc )
+    {
+        rc = saved_rc;
+        errno = saved_errno;
+    }
+
+    return rc;
+}
+
 int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
                     uint32_t max_factor, uint32_t flags,
                     struct save_callbacks* callbacks, int hvm)
 {
+    struct xc_sr_context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions :( */
+    ctx.save.callbacks = callbacks;
+    ctx.save.live  = !!(flags & XCFLAGS_LIVE);
+    ctx.save.debug = !!(flags & XCFLAGS_DEBUG);
+
+    /*
+     * TODO: Find some time to better tweak the live migration algorithm.
+     *
+     * These parameters perform better than the legacy algorithm, especially
+     * for busy guests.
+     */
+    ctx.save.max_iterations = 5;
+    ctx.save.dirty_threshold = 50;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+    DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
+            io_fd, dom, max_iters, max_factor, flags, hvm);
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %u does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+
+    ctx.save.p2m_size = xc_domain_maximum_gpfn(xch, dom) + 1;
+    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
+    {
+        errno = E2BIG;
+        ERROR("Cannot save this big a guest");
+        return -1;
+    }
+
+    if ( ctx.dominfo.hvm )
+    {
+        ctx.save.ops = save_ops_x86_hvm;
+        return save(&ctx, DHDR_TYPE_X86_HVM);
+    }
+    else
+    {
+        ctx.save.ops = save_ops_x86_pv;
+        return save(&ctx, DHDR_TYPE_X86_PV);
+    }
 }
 
 /*
-- 
1.7.10.4


* [PATCH 19/29] tools/libxc: noarch restore code
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Restore a domain from the new format.  This reads and validates the Image and
Domain headers, then loads the guest memory from the PAGE_DATA records,
populating the p2m as it does so.

This provides the xc_domain_restore2() function as an alternative to the
existing xc_domain_restore().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

---
v7:
 * Support population of pfns with no typedata.
v6:
 * Fix error path with rc = 0.
 * Fix undefined memory issue with creating the populated_pfns array.
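
To illustrate the class of problem behind the populated_pfns fix, here is a
self-contained sketch of a bitmap which grows to the next power of two and
zeroes the newly added tail, since realloc() leaves that memory
uninitialised.  The demo_ names are hypothetical and the bookkeeping differs
from pfn_set_populated(), which tracks a maximum pfn rather than a bit
count.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct demo_bitmap {
    unsigned long *bits;    /* NULL while nr_bits is 0. */
    unsigned long nr_bits;  /* Number of bits currently representable. */
};

/* Bytes needed for nr_bits bits, rounded up to whole unsigned longs. */
static size_t demo_bitmap_size(unsigned long nr_bits)
{
    return (nr_bits + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG
           * sizeof(unsigned long);
}

/* Smallest (2^n - 1) which is >= v: smear the top set bit downwards. */
static unsigned long demo_round_up_mask(unsigned long v)
{
    unsigned shift;

    for ( shift = 1; shift < DEMO_BITS_PER_LONG; shift <<= 1 )
        v |= v >> shift;

    return v;
}

/* Set a bit, growing the bitmap if needed.  Returns 0 or -1. */
static int demo_set_bit(struct demo_bitmap *bm, unsigned long bit)
{
    if ( bit >= bm->nr_bits )
    {
        unsigned long mask = demo_round_up_mask(bit);
        size_t old_sz, new_sz;
        unsigned long *p;

        if ( mask == ULONG_MAX )
            return -1;              /* mask + 1 would overflow. */

        old_sz = demo_bitmap_size(bm->nr_bits);
        new_sz = demo_bitmap_size(mask + 1);
        p = realloc(bm->bits, new_sz);
        if ( !p )
            return -1;

        /* Only the old part has defined contents; zero the rest. */
        memset((unsigned char *)p + old_sz, 0, new_sz - old_sz);

        bm->bits = p;
        bm->nr_bits = mask + 1;
    }

    assert(bit < bm->nr_bits);
    bm->bits[bit / DEMO_BITS_PER_LONG] |= 1UL << (bit % DEMO_BITS_PER_LONG);
    return 0;
}

/* Bits beyond the current size read as clear. */
static int demo_test_bit(const struct demo_bitmap *bm, unsigned long bit)
{
    if ( bit >= bm->nr_bits )
        return 0;
    return !!(bm->bits[bit / DEMO_BITS_PER_LONG] &
              (1UL << (bit % DEMO_BITS_PER_LONG)));
}
```

Starting from an empty bitmap ({ NULL, 0 }), setting bit 1000 grows it to
1024 bits, and every bit other than 1000 reads back as clear.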
---
 tools/libxc/saverestore/common.h  |    6 +
 tools/libxc/saverestore/restore.c |  623 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 628 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 193dd0e..35e5011 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -322,6 +322,12 @@ static inline int write_record(struct xc_sr_context *ctx,
     return write_split_record(ctx, rec, NULL, 0);
 }
 
+/* TODO - find a better way of hiding this.  It should be private to
+ * restore.c, but is needed by x86_pv_localise_page()
+ */
+int populate_pfns(struct xc_sr_context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
index 6624baa..2dea120 100644
--- a/tools/libxc/saverestore/restore.c
+++ b/tools/libxc/saverestore/restore.c
@@ -1,5 +1,566 @@
+#include <arpa/inet.h>
+
 #include "common.h"
 
+/*
+ * Read and validate the Image and Domain headers.
+ */
+static int read_headers(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_ihdr ihdr;
+    struct xc_sr_dhdr dhdr;
+
+    if ( read_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
+    {
+        PERROR("Failed to read Image Header from stream");
+        return -1;
+    }
+
+    ihdr.id      = ntohl(ihdr.id);
+    ihdr.version = ntohl(ihdr.version);
+    ihdr.options = ntohs(ihdr.options);
+
+    if ( ihdr.marker != IHDR_MARKER )
+    {
+        ERROR("Invalid marker: Got 0x%016"PRIx64, ihdr.marker);
+        return -1;
+    }
+    else if ( ihdr.id != IHDR_ID )
+    {
+        ERROR("Invalid ID: Expected 0x%08x, Got 0x%08x", IHDR_ID, ihdr.id);
+        return -1;
+    }
+    else if ( ihdr.version != IHDR_VERSION )
+    {
+        ERROR("Invalid Version: Expected %d, Got %d", IHDR_VERSION, ihdr.version);
+        return -1;
+    }
+    else if ( ihdr.options & IHDR_OPT_BIG_ENDIAN )
+    {
+        ERROR("Unable to handle big endian streams");
+        return -1;
+    }
+
+    ctx->restore.format_version = ihdr.version;
+
+    if ( read_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
+    {
+        PERROR("Failed to read Domain Header from stream");
+        return -1;
+    }
+
+    ctx->restore.guest_type = dhdr.type;
+    ctx->restore.guest_page_size = (1U << dhdr.page_shift);
+
+    if ( dhdr.xen_major == 0 )
+    {
+        IPRINTF("Found %s domain, converted from legacy stream format",
+                dhdr_type_to_str(dhdr.type));
+        DPRINTF("  Legacy conversion script version %u", dhdr.xen_minor);
+    }
+    else
+        IPRINTF("Found %s domain from Xen %u.%u",
+                dhdr_type_to_str(dhdr.type), dhdr.xen_major, dhdr.xen_minor);
+    return 0;
+}
+
+/**
+ * Reads a record from the stream, and fills in the record structure.
+ *
+ * Returns 0 on success and non-0 on failure.
+ *
+ * On success, the record's type and size shall be valid.
+ * - If size is 0, data shall be NULL.
+ * - If size is non-0, data shall be a buffer allocated by malloc() which must
+ *   be passed to free() by the caller.
+ *
+ * On failure, the contents of the record structure are undefined.
+ */
+static int read_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rhdr rhdr;
+    size_t datasz;
+
+    if ( read_exact(ctx->fd, &rhdr, sizeof(rhdr)) )
+    {
+        PERROR("Failed to read Record Header from stream");
+        return -1;
+    }
+    else if ( rhdr.length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08x, %s) length %#x exceeds max (%#x)", rhdr.type,
+              rec_type_to_str(rhdr.type), rhdr.length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
+
+    if ( datasz )
+    {
+        rec->data = malloc(datasz);
+
+        if ( !rec->data )
+        {
+            ERROR("Unable to allocate %zu bytes for record data (0x%08x, %s)",
+                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+
+        if ( read_exact(ctx->fd, rec->data, datasz) )
+        {
+            free(rec->data);
+            rec->data = NULL;
+            PERROR("Failed to read %zu bytes of data for record (0x%08x, %s)",
+                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+    }
+    else
+        rec->data = NULL;
+
+    rec->type   = rhdr.type;
+    rec->length = rhdr.length;
+
+    return 0;
+}
+
+/*
+ * Is a pfn populated?
+ */
+static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    if ( pfn > ctx->restore.max_populated_pfn )
+        return false;
+    return test_bit(pfn, ctx->restore.populated_pfns);
+}
+
+/*
+ * Set a pfn as populated, expanding the tracking structures if needed. To
+ * avoid realloc()ing too often, the size is increased to the nearest power
+ * of two large enough to contain the required pfn.
+ */
+static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( pfn > ctx->restore.max_populated_pfn )
+    {
+        xen_pfn_t new_max;
+        size_t old_sz, new_sz;
+        unsigned long *p;
+
+        /* Round up to the nearest power of two larger than pfn, less 1. */
+        new_max = pfn;
+        new_max |= new_max >> 1;
+        new_max |= new_max >> 2;
+        new_max |= new_max >> 4;
+        new_max |= new_max >> 8;
+        new_max |= new_max >> 16;
+#ifdef __x86_64__
+        new_max |= new_max >> 32;
+#endif
+
+        old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
+        new_sz = bitmap_size(new_max + 1);
+        p = realloc(ctx->restore.populated_pfns, new_sz);
+        if ( !p )
+        {
+            ERROR("Failed to realloc populated bitmap");
+            errno = ENOMEM;
+            return -1;
+        }
+
+        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+
+        ctx->restore.populated_pfns    = p;
+        ctx->restore.max_populated_pfn = new_max;
+    }
+
+    set_bit(pfn, ctx->restore.populated_pfns);
+
+    return 0;
+}
+
+/*
+ * Given a set of pfns, obtain memory from Xen to fill the physmap for the
+ * unpopulated subset.  If types is NULL, no page typechecking is performed
+ * and all unpopulated pfns are populated.
+ */
+int populate_pfns(struct xc_sr_context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof(*mfns)),
+        *pfns = malloc(count * sizeof(*pfns));
+    unsigned i, nr_pfns = 0;
+    int rc = -1;
+
+    if ( !mfns || !pfns )
+    {
+        ERROR("Failed to allocate %zu bytes for populating the physmap",
+              2 * count * sizeof(*mfns));
+        goto err;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        if ( (!types || (types &&
+                         (types[i] != XEN_DOMCTL_PFINFO_XTAB &&
+                          types[i] != XEN_DOMCTL_PFINFO_BROKEN))) &&
+             !pfn_is_populated(ctx, original_pfns[i]) )
+        {
+            pfns[nr_pfns] = mfns[nr_pfns] = original_pfns[i];
+            ++nr_pfns;
+        }
+    }
+
+    if ( nr_pfns )
+    {
+        rc = xc_domain_populate_physmap_exact(xch, ctx->domid, nr_pfns, 0, 0, mfns);
+        if ( rc )
+        {
+            PERROR("Failed to populate physmap");
+            goto err;
+        }
+
+        for ( i = 0; i < nr_pfns; ++i )
+        {
+            if ( mfns[i] == INVALID_MFN )
+            {
+                ERROR("Populate physmap failed for pfn %#lx", pfns[i]);
+                rc = -1;
+                goto err;
+            }
+
+            rc = pfn_set_populated(ctx, pfns[i]);
+            if ( rc )
+                goto err;
+            ctx->restore.ops.set_gfn(ctx, pfns[i], mfns[i]);
+        }
+    }
+
+    rc = 0;
+
+ err:
+    free(pfns);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Given a list of pfns, their types, and a block of page data from the
+ * stream, populate and record their types, map the relevant subset and copy
+ * the data into the guest.
+ */
+static int process_page_data(struct xc_sr_context *ctx, unsigned count,
+                             xen_pfn_t *pfns, uint32_t *types, void *page_data)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
+    int *map_errs = malloc(count * sizeof(*map_errs));
+    int rc;
+    void *mapping = NULL, *guest_page = NULL;
+    unsigned i,    /* i indexes the pfns from the record. */
+        j,         /* j indexes the subset of pfns we decide to map. */
+        nr_pages;
+
+    if ( !mfns || !map_errs )
+    {
+        rc = -1;
+        ERROR("Failed to allocate %zu bytes to process page data",
+              count * (sizeof(*mfns) + sizeof(*map_errs)));
+        goto err;
+    }
+
+    rc = populate_pfns(ctx, count, pfns, types);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", count);
+        goto err;
+    }
+
+    for ( i = 0, nr_pages = 0; i < count; ++i )
+    {
+        ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
+
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_NOTAB:
+
+        case XEN_DOMCTL_PFINFO_L1TAB:
+        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+            mfns[nr_pages++] = ctx->restore.ops.pfn_to_gfn(ctx, pfns[i]);
+            break;
+        }
+
+    }
+
+    if ( nr_pages > 0 )
+    {
+        mapping = guest_page = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ | PROT_WRITE,
+            mfns, map_errs, nr_pages);
+        if ( !mapping )
+        {
+            rc = -1;
+            PERROR("Unable to map %u mfns for %u pages of data",
+                   nr_pages, count);
+            goto err;
+        }
+    }
+
+    for ( i = 0, j = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_XTAB:
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+            /* No page data to deal with. */
+            continue;
+        }
+
+        if ( map_errs[j] )
+        {
+            rc = -1;
+            ERROR("Mapping pfn %lx (mfn %lx, type %#x) failed with %d",
+                  pfns[i], mfns[j], types[i], map_errs[j]);
+            goto err;
+        }
+
+        /* Undo page normalisation done by the saver. */
+        rc = ctx->restore.ops.localise_page(ctx, types[i], page_data);
+        if ( rc )
+        {
+            ERROR("Failed to localise pfn %lx (type %#x)",
+                  pfns[i], types[i] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        if ( ctx->restore.verify )
+        {
+            /* Verify mode - compare incoming data to what we already have. */
+            if ( memcmp(guest_page, page_data, PAGE_SIZE) )
+                ERROR("verify pfn %lx failed (type %#x)",
+                      pfns[i], types[i] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+        }
+        else
+        {
+            /* Regular mode - copy incoming data into place. */
+            memcpy(guest_page, page_data, PAGE_SIZE);
+        }
+
+        ++j;
+        guest_page += PAGE_SIZE;
+        page_data += PAGE_SIZE;
+    }
+
+    rc = 0;
+
+ err:
+    if ( mapping )
+        munmap(mapping, nr_pages * PAGE_SIZE);
+
+    free(map_errs);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Validate a PAGE_DATA record from the stream, and pass the results to
+ * process_page_data() to actually perform the legwork.
+ */
+static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_page_data_header *pages = rec->data;
+    unsigned i, pages_of_data = 0;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL, pfn;
+    uint32_t *types = NULL, type;
+
+    if ( rec->length < sizeof(*pages) )
+    {
+        ERROR("PAGE_DATA record truncated: length %u, min %zu",
+              rec->length, sizeof(*pages));
+        goto err;
+    }
+    else if ( pages->count < 1 )
+    {
+        ERROR("Expected at least 1 pfn in PAGE_DATA record");
+        goto err;
+    }
+    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
+    {
+        ERROR("PAGE_DATA record (length %u) too short to contain %u"
+              " pfns worth of information", rec->length, pages->count);
+        goto err;
+    }
+
+    pfns = malloc(pages->count * sizeof(*pfns));
+    types = malloc(pages->count * sizeof(*types));
+    if ( !pfns || !types )
+    {
+        ERROR("Unable to allocate enough memory for %u pfns",
+              pages->count);
+        goto err;
+    }
+
+    for ( i = 0; i < pages->count; ++i )
+    {
+        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
+        if ( !ctx->restore.ops.pfn_is_valid(ctx, pfn) )
+        {
+            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
+            goto err;
+        }
+
+        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
+        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
+             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
+        {
+            ERROR("Invalid type %#x for pfn %#lx (index %u)", type, pfn, i);
+            goto err;
+        }
+        else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
+            /* NOTAB and all L1 thru L4 tables (including pinned) should have
+             * a page worth of data in the record. */
+            pages_of_data++;
+
+        pfns[i] = pfn;
+        types[i] = type;
+    }
+
+    if ( rec->length != (sizeof(*pages) +
+                         (sizeof(uint64_t) * pages->count) +
+                         (PAGE_SIZE * pages_of_data)) )
+    {
+        ERROR("PAGE_DATA record wrong size: length %u, expected "
+              "%zu + %zu + %zu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
+        goto err;
+    }
+
+    rc = process_page_data(ctx, pages->count, pfns, types,
+                           &pages->pfn[pages->count]);
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+/*
+ * Restore a domain.
+ */
+static int restore(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_record rec;
+    int rc, saved_rc = 0, saved_errno = 0;
+
+    IPRINTF("Restoring domain");
+
+    rc = ctx->restore.ops.setup(ctx);
+    if ( rc )
+        goto err;
+
+    ctx->restore.max_populated_pfn = (32 * 1024 / 4) - 1;
+    ctx->restore.populated_pfns = bitmap_alloc(
+        ctx->restore.max_populated_pfn + 1);
+    if ( !ctx->restore.populated_pfns )
+    {
+        ERROR("Unable to allocate memory for populated_pfns bitmap");
+        goto err;
+    }
+
+    do
+    {
+        rc = read_record(ctx, &rec);
+        if ( rc )
+            goto err;
+
+        switch ( rec.type )
+        {
+        case REC_TYPE_END:
+            break;
+
+        case REC_TYPE_PAGE_DATA:
+            rc = handle_page_data(ctx, &rec);
+            break;
+
+        case REC_TYPE_VERIFY:
+            DPRINTF("Verify mode enabled");
+            ctx->restore.verify = true;
+            break;
+
+        default:
+            rc = ctx->restore.ops.process_record(ctx, &rec);
+            break;
+        }
+
+        free(rec.data);
+
+        if ( rc == RECORD_NOT_PROCESSED )
+        {
+            if ( rec.type & REC_TYPE_OPTIONAL )
+                DPRINTF("Ignoring optional record %#x (%s)",
+                        rec.type, rec_type_to_str(rec.type));
+            else
+            {
+                ERROR("Mandatory record %#x (%s) not handled",
+                      rec.type, rec_type_to_str(rec.type));
+                rc = -1;
+            }
+        }
+
+        if ( rc )
+            goto err;
+
+    } while ( rec.type != REC_TYPE_END );
+
+    rc = ctx->restore.ops.stream_complete(ctx);
+    if ( rc )
+        goto err;
+
+    IPRINTF("Restore successful");
+    goto done;
+
+ err:
+    saved_errno = errno;
+    saved_rc = rc;
+    PERROR("Restore failed");
+
+ done:
+    free(ctx->restore.populated_pfns);
+    rc = ctx->restore.ops.cleanup(ctx);
+    if ( rc )
+        PERROR("Failed to clean up");
+
+    if ( saved_rc )
+    {
+        rc = saved_rc;
+        errno = saved_errno;
+    }
+
+    return rc;
+}
+
 int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        unsigned int store_evtchn, unsigned long *store_mfn,
                        domid_t store_domid, unsigned int console_evtchn,
@@ -8,8 +569,68 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        int checkpointed_stream,
                        struct restore_callbacks *callbacks)
 {
+    struct xc_sr_context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions :( */
+    ctx.restore.console_evtchn = console_evtchn;
+    ctx.restore.console_domid = console_domid;
+    ctx.restore.xenstore_evtchn = store_evtchn;
+    ctx.restore.xenstore_domid = store_domid;
+    ctx.restore.callbacks = callbacks;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+    DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
+            ", checkpointed_stream %d", io_fd, dom, hvm, pae,
+            superpages, checkpointed_stream);
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %u does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+
+    if ( read_headers(&ctx) )
+        return -1;
+
+    if ( ctx.dominfo.hvm )
+    {
+        ctx.restore.ops = restore_ops_x86_hvm;
+        if ( restore(&ctx) )
+            return -1;
+    }
+    else
+    {
+        ctx.restore.ops = restore_ops_x86_pv;
+        if ( restore(&ctx) )
+            return -1;
+    }
+
+    IPRINTF("XenStore: mfn %#lx, dom %d, evt %u",
+            ctx.restore.xenstore_mfn,
+            ctx.restore.xenstore_domid,
+            ctx.restore.xenstore_evtchn);
+
+    IPRINTF("Console: mfn %#lx, dom %d, evt %u",
+            ctx.restore.console_mfn,
+            ctx.restore.console_domid,
+            ctx.restore.console_evtchn);
+
+    *console_mfn = ctx.restore.console_mfn;
+    *store_mfn = ctx.restore.xenstore_mfn;
+
+    return 0;
 }
 
 /*
-- 
1.7.10.4
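
A note on the bitmap growth in pfn_set_populated() in this patch: the chain of
shifts smears the highest set bit of pfn into every lower bit, so new_max
becomes the smallest value of the form 2^n - 1 that is >= pfn.  A minimal
standalone sketch of that step, for illustration only (smear_to_mask() and
grown_max_pfn() are hypothetical names, not part of the patch, and a 64-bit
type is assumed so that the final shift is always valid):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smear the highest set bit of x into every lower bit position, giving the
 * smallest value of the form 2^n - 1 that is >= x.  Hypothetical helper; the
 * patch open-codes these shifts inside pfn_set_populated().
 */
static uint64_t smear_to_mask(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x;
}

/*
 * New maximum pfn the bitmap should cover once a pfn beyond the current
 * maximum is seen.  Hypothetical wrapper mirroring the "pfn > max" check.
 */
static uint64_t grown_max_pfn(uint64_t old_max, uint64_t pfn)
{
    assert(pfn > old_max);
    return smear_to_mask(pfn);
}
```

With the initial max_populated_pfn of 8191 used in restore(), the first pfn
beyond it (8192) grows the bitmap to cover pfns up to 16383.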

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 20/29] tools/libxl: Update datacopier to support sending data only
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (18 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 19/29] tools/libxc: noarch restore code Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 11:56   ` Ian Campbell
  2014-09-10 17:10 ` [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier Andrew Cooper
                   ` (9 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Wen Congyang

From: Wen Congyang <wency@cn.fujitsu.com>

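A datacopier reads data from one fd and writes it out to another.  If we
already hold the data and only need to send it (for example over the
network), there is no fd to read from, so datacopier cannot be used.  Update
it to support this case by only registering the read event when readfd is
valid (>= 0).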

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_aoutils.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index b10d2e1..3e0c0ae 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -309,9 +309,11 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
 
     libxl__datacopier_init(dc);
 
-    rc = libxl__ev_fd_register(gc, &dc->toread, datacopier_readable,
-                               dc->readfd, POLLIN);
-    if (rc) goto out;
+    if (dc->readfd >= 0) {
+        rc = libxl__ev_fd_register(gc, &dc->toread, datacopier_readable,
+                                   dc->readfd, POLLIN);
+        if (rc) goto out;
+    }
 
     rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
                                dc->writefd, POLLOUT);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (19 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 20/29] tools/libxl: Update datacopier to support sending data only Andrew Cooper
@ 2014-09-10 17:10 ` Andrew Cooper
  2014-09-11 12:01   ` Ian Campbell
  2014-09-10 17:11 ` [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier Andrew Cooper
                   ` (8 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:10 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Ross Lagerwall

From: Ross Lagerwall <ross.lagerwall@citrix.com>

Previously, adding more than 1000 bytes of data would cause a segfault.
Now, the maximum amount of data that can be added is limited by maxsz.
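
For illustration, the chunking this introduces can be sketched outside libxl
roughly as follows.  This is a standalone approximation, not the libxl code:
struct chunk and append_chunked() are hypothetical names, a plain singly
linked list stands in for LIBXL_TAILQ, and the CHUNK_SIZE of 1000 is an
assumption taken from the limit quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 1000  /* assumed per-buffer capacity */

struct chunk {
    struct chunk *next;
    size_t used;
    char buf[CHUNK_SIZE];
};

/*
 * Append len bytes from data to the list whose tail pointer is *tail,
 * splitting the input across as many fixed-size chunks as needed.  Returns
 * the number of chunks added, or -1 on allocation failure (chunks already
 * linked in remain on the list for the caller to free).
 */
static int append_chunked(struct chunk **tail, const void *data, size_t len)
{
    const unsigned char *ptr = data;
    int added = 0;

    assert(tail != NULL);

    while (len) {
        struct chunk *c = calloc(1, sizeof(*c));
        if (!c)
            return -1;

        /* Never copy more than one chunk can hold. */
        c->used = len < sizeof(c->buf) ? len : sizeof(c->buf);
        memcpy(c->buf, ptr, c->used);

        *tail = c;
        tail = &c->next;

        ptr += c->used;
        len -= c->used;
        ++added;
    }

    return added;
}
```

The pre-patch behaviour corresponds to a single memcpy() of len bytes into one
buf, which overruns it as soon as len exceeds the buffer capacity.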

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 tools/libxl/libxl_aoutils.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index 3e0c0ae..caba637 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -160,6 +160,8 @@ void libxl__datacopier_prefixdata(libxl__egc *egc, libxl__datacopier_state *dc,
 {
     EGC_GC;
     libxl__datacopier_buf *buf;
+    const uint8_t *ptr;
+
     /*
      * It is safe for this to be called immediately after _start, as
      * is documented in the public comment.  _start's caller must have
@@ -170,12 +172,14 @@ void libxl__datacopier_prefixdata(libxl__egc *egc, libxl__datacopier_state *dc,
 
     assert(len < dc->maxsz - dc->used);
 
-    buf = libxl__zalloc(NOGC, sizeof(*buf));
-    buf->used = len;
-    memcpy(buf->buf, data, len);
+    for (ptr = data; len; len -= buf->used, ptr += buf->used) {
+        buf = libxl__zalloc(NOGC, sizeof(*buf));
+        buf->used = len < sizeof(buf->buf) ? len : sizeof(buf->buf);
+        memcpy(buf->buf, ptr, buf->used);
 
-    dc->used += len;
-    LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
+        dc->used += buf->used;
+        LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
+    }
 }
 
 static int datacopier_pollhup_handled(libxl__egc *egc,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (20 preceding siblings ...)
  2014-09-10 17:10 ` [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 12:02   ` Ian Campbell
  2014-09-12  8:36   ` Wen Congyang
  2014-09-10 17:11 ` [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer Andrew Cooper
                   ` (7 subsequent siblings)
  29 siblings, 2 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Ross Lagerwall

From: Ross Lagerwall <ross.lagerwall@citrix.com>

Add a parameter, maxread, to limit the amount of data read from the
source fd of a datacopier.
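
As a rough illustration of the intended behaviour (not the libxl
implementation), a bounded read loop might look like the sketch below.
read_limited() is a hypothetical helper; it uses blocking read() on a plain fd
rather than libxl's event machinery, and stops at whichever comes first of the
limit, a full buffer, or EOF.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read from fd into buf (capacity bufsz) until *maxread bytes have been
 * consumed, the buffer is full, or EOF is reached.  *maxread is decremented
 * by the number of bytes read.  Returns the byte count, or -1 on read error.
 */
static ssize_t read_limited(int fd, char *buf, size_t bufsz, ssize_t *maxread)
{
    size_t used = 0;

    assert(*maxread >= 0);

    while (*maxread > 0 && used < bufsz) {
        size_t space = bufsz - used;
        /* Ask for no more than the remaining space or the remaining limit. */
        size_t want = space < (size_t)*maxread ? space : (size_t)*maxread;
        ssize_t r = read(fd, buf + used, want);

        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break;              /* EOF before the limit was reached */

        used += (size_t)r;
        *maxread -= r;
    }

    return (ssize_t)used;
}
```

Callers which do not want a limit can pass a very large value, which is what
the INT_MAX assignments in this patch do for the existing users.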

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 tools/libxl/libxl_aoutils.c    |    9 +++++++--
 tools/libxl/libxl_bootloader.c |    2 ++
 tools/libxl/libxl_dom.c        |    1 +
 tools/libxl/libxl_internal.h   |    1 +
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index caba637..6502325 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -145,7 +145,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
                 return;
             }
         }
-    } else if (!libxl__ev_fd_isregistered(&dc->toread)) {
+    } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
         /* we have had eof */
         datacopier_callback(egc, dc, 0, 0);
         return;
@@ -233,7 +233,8 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
         }
         int r = read(ev->fd,
                      buf->buf + buf->used,
-                     sizeof(buf->buf) - buf->used);
+                     (sizeof(buf->buf) - buf->used) < dc->maxread ?
+                         (sizeof(buf->buf) - buf->used) : dc->maxread);
         if (r < 0) {
             if (errno == EINTR) continue;
             if (errno == EWOULDBLOCK) break;
@@ -258,7 +259,11 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
         }
         buf->used += r;
         dc->used += r;
+        dc->maxread -= r;
         assert(buf->used <= sizeof(buf->buf));
+        assert(dc->maxread >= 0);
+        if (dc->maxread == 0)
+            break;
     }
     datacopier_check_state(egc, dc);
 }
diff --git a/tools/libxl/libxl_bootloader.c b/tools/libxl/libxl_bootloader.c
index 79947d4..1503101 100644
--- a/tools/libxl/libxl_bootloader.c
+++ b/tools/libxl/libxl_bootloader.c
@@ -516,6 +516,7 @@ static void bootloader_gotptys(libxl__egc *egc, libxl__openpty_state *op)
 
     bl->keystrokes.ao = ao;
     bl->keystrokes.maxsz = BOOTLOADER_BUF_OUT;
+    bl->keystrokes.maxread = INT_MAX;
     bl->keystrokes.copywhat =
         GCSPRINTF("bootloader input for domain %"PRIu32, bl->domid);
     bl->keystrokes.callback =         bootloader_keystrokes_copyfail;
@@ -527,6 +528,7 @@ static void bootloader_gotptys(libxl__egc *egc, libxl__openpty_state *op)
 
     bl->display.ao = ao;
     bl->display.maxsz = BOOTLOADER_BUF_IN;
+    bl->display.maxread = INT_MAX;
     bl->display.copywhat =
         GCSPRINTF("bootloader output for domain %"PRIu32, bl->domid);
     bl->display.callback =         bootloader_display_copyfail;
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 0dfdb08..2f74341 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -1717,6 +1717,7 @@ void libxl__domain_save_device_model(libxl__egc *egc,
     dc->readfd = -1;
     dc->writefd = fd;
     dc->maxsz = INT_MAX;
+    dc->maxread = INT_MAX;
     dc->copywhat = GCSPRINTF("qemu save file for domain %"PRIu32, dss->domid);
     dc->writewhat = "save/migration stream";
     dc->callback = save_device_model_datacopier_done;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 03e9978..d93a6ee 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2425,6 +2425,7 @@ struct libxl__datacopier_state {
     libxl__ao *ao;
     int readfd, writefd;
     ssize_t maxsz;
+    ssize_t maxread;
     const char *copywhat, *readwhat, *writewhat; /* for error msgs */
     FILE *log; /* gets a copy of everything */
     libxl__datacopier_callback *callback;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (21 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 12:03   ` Ian Campbell
  2014-09-12  8:49   ` Wen Congyang
  2014-09-10 17:11 ` [PATCH 24/29] tools/libxl: Allow suppression of POLLHUP for datacopiers Andrew Cooper
                   ` (6 subsequent siblings)
  29 siblings, 2 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Ross Lagerwall

From: Ross Lagerwall <ross.lagerwall@citrix.com>

Implement a read-only mode for libxl__datacopier.  The mode is invoked
when readbuf is set and writefd is < 0.  On success, the callback passes
the number of bytes read.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 tools/libxl/libxl_aoutils.c  |   59 +++++++++++++++++++++++++-----------------
 tools/libxl/libxl_internal.h |    4 ++-
 2 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index 6502325..9183716 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -134,7 +134,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
     STATE_AO_GC(dc->ao);
     int rc;
     
-    if (dc->used) {
+    if (dc->used && !dc->readbuf) {
         if (!libxl__ev_fd_isregistered(&dc->towrite)) {
             rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
                                        dc->writefd, POLLOUT);
@@ -147,7 +147,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
         }
     } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
         /* we have had eof */
-        datacopier_callback(egc, dc, 0, 0);
+        datacopier_callback(egc, dc, 0, dc->readbuf ? dc->used : 0);
         return;
     } else {
         /* nothing buffered, but still reading */
@@ -215,26 +215,31 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
     }
     assert(revents & POLLIN);
     for (;;) {
-        while (dc->used >= dc->maxsz) {
-            libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
-            dc->used -= rm->used;
-            assert(dc->used >= 0);
-            LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
-            free(rm);
-        }
+        libxl__datacopier_buf *buf = NULL;
+        int r;
+
+        if (dc->readbuf) {
+            r = read(ev->fd, dc->readbuf + dc->used, dc->maxread);
+        } else {
+            while (dc->used >= dc->maxsz) {
+                libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
+                dc->used -= rm->used;
+                assert(dc->used >= 0);
+                LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
+                free(rm);
+            }
 
-        libxl__datacopier_buf *buf =
-            LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
-        if (!buf || buf->used >= sizeof(buf->buf)) {
-            buf = malloc(sizeof(*buf));
-            if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
-            buf->used = 0;
-            LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
-        }
-        int r = read(ev->fd,
-                     buf->buf + buf->used,
+            buf = LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
+            if (!buf || buf->used >= sizeof(buf->buf)) {
+                buf = malloc(sizeof(*buf));
+                if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
+                buf->used = 0;
+                LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
+            }
+            r = read(ev->fd, buf->buf + buf->used,
                      (sizeof(buf->buf) - buf->used) < dc->maxread ?
                          (sizeof(buf->buf) - buf->used) : dc->maxread);
+        }
         if (r < 0) {
             if (errno == EINTR) continue;
             if (errno == EWOULDBLOCK) break;
@@ -257,10 +262,12 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
                 return;
             }
         }
-        buf->used += r;
+        if (!dc->readbuf) {
+            buf->used += r;
+            assert(buf->used <= sizeof(buf->buf));
+        }
         dc->used += r;
         dc->maxread -= r;
-        assert(buf->used <= sizeof(buf->buf));
         assert(dc->maxread >= 0);
         if (dc->maxread == 0)
             break;
@@ -324,9 +331,13 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
         if (rc) goto out;
     }
 
-    rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
-                               dc->writefd, POLLOUT);
-    if (rc) goto out;
+    if (dc->writefd >= 0) {
+        rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
+                                   dc->writefd, POLLOUT);
+        if (rc) goto out;
+    }
+
+    assert(dc->readfd >= 0 || dc->writefd >= 0);
 
     return 0;
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index d93a6ee..056843a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2404,7 +2404,8 @@ typedef struct libxl__datacopier_buf libxl__datacopier_buf;
 
 /* onwrite==1 means failure happened when writing, logged, errnoval is valid
  * onwrite==0 means failure happened when reading
- *     errnoval==0 means we got eof and all data was written
+ *     errnoval>=0 means we got eof and all data was written or number of bytes
+ *                 read when in read mode
  *     errnoval!=0 means we had a read error, logged
  * onwrite==-1 means some other internal failure, errnoval not valid, logged
  * If we get POLLHUP, we call callback_pollhup(..., onwrite, -1);
@@ -2433,6 +2434,7 @@ struct libxl__datacopier_state {
     /* remaining fields are private to datacopier */
     libxl__ev_fd toread, towrite;
     ssize_t used;
+    void *readbuf;
     LIBXL_TAILQ_HEAD(libxl__datacopier_bufs, libxl__datacopier_buf) bufs;
 };
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 24/29] tools/libxl: Allow suppression of POLLHUP for datacopiers
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (22 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 12:05   ` Ian Campbell
  2014-09-10 17:11 ` [PATCH 25/29] tools/libxl: Stream v2 format Andrew Cooper
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

If the far end of a pipe has been closed, poll() will set POLLHUP.  When
reading from a pipe, POLLIN|POLLHUP is a valid state, even when there is still
data to be read.

Currently, datacopier will bail because of POLLHUP before discovering that
there is valid data to be read.

Add an option to ignore POLLHUP for consumers who would prefer to read to EOF.
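
The POLLIN|POLLHUP state is easy to reproduce with a plain pipe.  The sketch
below is illustrative only and not part of this patch;
poll_after_writer_closed() and drain() are hypothetical helpers.  On Linux the
returned revents is expected to contain both POLLIN and POLLHUP, although the
exact combination is platform dependent.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a pipe, write len bytes into it, close the write end, and return
 * the revents that poll() reports for the read end.  The read end is handed
 * back via *readfd so the caller can drain it.  Returns -1 on failure.
 */
static int poll_after_writer_closed(const char *data, size_t len, int *readfd)
{
    int fds[2];
    struct pollfd pfd;

    assert(readfd != NULL);

    if (pipe(fds))
        return -1;

    if (write(fds[1], data, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);      /* far end gone, but the data is still queued */

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) != 1) {
        close(fds[0]);
        return -1;
    }

    *readfd = fds[0];
    return pfd.revents;
}

/* Read to EOF, returning the total number of bytes seen, or -1 on error. */
static long drain(int fd)
{
    char buf[64];
    long total = 0;
    ssize_t r;

    while ((r = read(fd, buf, sizeof(buf))) != 0) {
        if (r < 0)
            return -1;
        total += r;
    }

    return total;
}
```

A reader which bails as soon as it sees POLLHUP would never call drain() here
and would lose the queued bytes.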

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>

---

It might be easier/better to alter the POLLHUP handling, but I am struggling
to work out what effect that would have on the bootloader pty handling.
---
 tools/libxl/libxl_aoutils.c  |    2 +-
 tools/libxl/libxl_internal.h |    1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index 9183716..2b39432 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -207,7 +207,7 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
     if (datacopier_pollhup_handled(egc, dc, revents, 0))
         return;
 
-    if (revents & ~POLLIN) {
+    if (revents & ~(POLLIN | (dc->suppress_pollhup ? POLLHUP : 0))) {
         LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)"
             " on %s during copy of %s", revents, dc->readwhat, dc->copywhat);
         datacopier_callback(egc, dc, -1, 0);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 056843a..537b523 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2431,6 +2431,7 @@ struct libxl__datacopier_state {
     FILE *log; /* gets a copy of everything */
     libxl__datacopier_callback *callback;
     libxl__datacopier_callback *callback_pollhup;
+    int suppress_pollhup;
     /* remaining fields are private to datacopier */
     libxl__ev_fd toread, towrite;
     ssize_t used;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 25/29] tools/libxl: Stream v2 format
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (23 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 24/29] tools/libxl: Allow suppression of POLLHUP for datacopiers Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 12:06   ` Ian Campbell
  2014-09-10 17:11 ` [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams Andrew Cooper
                   ` (4 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Ross Lagerwall

From: Ross Lagerwall <ross.lagerwall@citrix.com>

TODO: Namespacing

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxl/libxl_saverestore.h |   47 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 tools/libxl/libxl_saverestore.h

diff --git a/tools/libxl/libxl_saverestore.h b/tools/libxl/libxl_saverestore.h
new file mode 100644
index 0000000..31a22d6
--- /dev/null
+++ b/tools/libxl/libxl_saverestore.h
@@ -0,0 +1,47 @@
+#ifndef LIBXL_SAVERESTORE_H
+#define LIBXL_SAVERESTORE_H
+
+#include <stdint.h>
+
+/* Definitions for LibXenLight Domain Image Format version 2 */
+struct restore_hdr
+{
+    uint64_t ident;
+    uint32_t version;
+    uint32_t options;
+};
+
+struct restore_rec_hdr
+{
+    uint32_t type;
+    uint32_t length;
+};
+
+struct restore_emulator_hdr
+{
+    uint32_t id;
+    uint32_t index;
+};
+
+#define REC_TYPE_END                 0x00000000U
+#define REC_TYPE_DOMAIN_JSON         0x00000001U
+#define REC_TYPE_LIBXC_CONTEXT       0x00000002U
+#define REC_TYPE_XENSTORE_DATA       0x00000003U
+#define REC_TYPE_EMULATOR_CONTEXT    0x00000004U
+
+/* All records must be aligned up to an 8 octet boundary */
+#define REC_ALIGN_ORDER              3U
+
+#define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
+
+#define EMULATOR_UNKNOWN             0x00000000U
+#define EMULATOR_QEMU_TRADITIONAL    0x00000001U
+#define EMULATOR_QEMU_UPSTREAM       0x00000002U
+
+#define RESTORE_STREAM_IDENT         0x4c6962786c466d74UL
+#define RESTORE_STREAM_VERSION       0x00000002U
+
+#define RESTORE_OPT_BIG_ENDIAN       (1 << 0)
+#define RESTORE_OPT_LEGACY           (1 << 1)
+
+#endif
-- 
1.7.10.4


* [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (24 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 25/29] tools/libxl: Stream v2 format Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 12:35   ` Ian Campbell
  2014-09-10 17:11 ` [PATCH 27/29] [VERY RFC] tools/libxl: Support restoring legacy streams Andrew Cooper
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Ross Lagerwall

TODO:
 * Integrate with the json series

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxl/libxl_create.c   |  310 ++++++++++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_dom.c      |    4 +-
 tools/libxl/libxl_internal.h |   10 +-
 3 files changed, 310 insertions(+), 14 deletions(-)
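
For illustration only (not part of the patch): a standalone sketch of the
checks that read_restore_hdr_cb() below applies to the 16 octet stream
header, assuming all three fields are big-endian on the wire as the
be64toh()/be32toh() calls imply. The names load_be() and
check_stream_header() and the error codes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from libxl_saverestore.h; the ident spells "LibxlFmt". */
#define RESTORE_STREAM_IDENT    UINT64_C(0x4c6962786c466d74)
#define RESTORE_STREAM_VERSION  2U
#define RESTORE_OPT_BIG_ENDIAN  (1U << 0)

/* Hypothetical result codes for this sketch only. */
enum { HDR_OK = 0, HDR_BAD_IDENT = -1, HDR_BAD_VERSION = -2,
       HDR_BIG_ENDIAN = -3 };

/* Decode an n-octet big-endian integer without relying on <endian.h>. */
static uint64_t load_be(const uint8_t *p, unsigned n)
{
    uint64_t v = 0;

    while (n--)
        v = (v << 8) | *p++;
    return v;
}

/* Validate a raw header: ident (8 octets), version (4), options (4). */
static int check_stream_header(const uint8_t raw[16])
{
    uint64_t ident   = load_be(raw, 8);
    uint32_t version = (uint32_t)load_be(raw + 8, 4);
    uint32_t options = (uint32_t)load_be(raw + 12, 4);

    if (ident != RESTORE_STREAM_IDENT)
        return HDR_BAD_IDENT;
    if (version != RESTORE_STREAM_VERSION)
        return HDR_BAD_VERSION;
    if (options & RESTORE_OPT_BIG_ENDIAN)
        return HDR_BIG_ENDIAN;   /* the patch rejects these streams too */
    return HDR_OK;
}
```

Decoding byte by byte keeps the sketch independent of host endianness; the
patch itself uses the be*toh() helpers instead.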

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index a5e185e..9661f78 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -19,6 +19,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_saverestore.h"
 
 #include <xc_dom.h>
 #include <xenguest.h>
@@ -934,8 +935,6 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     libxl_domain_build_info *const info = &d_config->b_info;
     const int restore_fd = dcs->restore_fd;
     libxl__domain_build_state *const state = &dcs->build_state;
-    libxl__srm_restore_autogen_callbacks *const callbacks =
-        &dcs->shs.callbacks.restore.a;
 
     if (rc) {
         domcreate_rebuild_done(egc, dcs, rc);
@@ -975,7 +974,6 @@ static void domcreate_bootloader_done(libxl__egc *egc,
         hvm = 1;
         superpages = 1;
         pae = libxl_defbool_val(info->u.hvm.pae);
-        callbacks->toolstack_restore = libxl__toolstack_restore;
         break;
     case LIBXL_DOMAIN_TYPE_PV:
         hvm = 0;
@@ -1105,12 +1103,16 @@ static void domcreate_rebuild_done(libxl__egc *egc,
         goto error_out;
     }
 
-    store_libxl_entry(gc, domid, &d_config->b_info);
+    if (dcs->rebuild_callback) {
+        dcs->rebuild_callback(dcs);
+    } else {
+        store_libxl_entry(gc, domid, &d_config->b_info);
 
-    libxl__multidev_begin(ao, &dcs->multidev);
-    dcs->multidev.callback = domcreate_launch_dm;
-    libxl__add_disks(egc, ao, domid, d_config, &dcs->multidev);
-    libxl__multidev_prepared(egc, &dcs->multidev, 0);
+        libxl__multidev_begin(ao, &dcs->multidev);
+        dcs->multidev.callback = domcreate_launch_dm;
+        libxl__add_disks(egc, ao, domid, d_config, &dcs->multidev);
+        libxl__multidev_prepared(egc, &dcs->multidev, 0);
+    }
 
     return;
 
@@ -1454,6 +1456,11 @@ static void domcreate_destruction_cb(libxl__egc *egc,
 typedef struct {
     libxl__domain_create_state dcs;
     uint32_t *domid_out;
+    libxl__datacopier_state dc;
+    libxl__datacopier_state stream_dc;
+    int expected_len;
+    struct restore_hdr hdr;
+    struct restore_rec_hdr rechdr;
 } libxl__app_domain_create_state;
 
 static void domain_create_cb(libxl__egc *egc,
@@ -1517,6 +1524,293 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                             params->checkpointed_stream, ao_how, aop_console_how);
 }
 
+static void read_restore_rec_hdr(libxl__app_domain_create_state *cdcs);
+
+static void read_padding_cb(libxl__egc *egc, libxl__datacopier_state *dc,
+        int onwrite, int errnoval)
+{
+    libxl__app_domain_create_state *cdcs = CONTAINER_OF(dc, *cdcs, stream_dc);
+    STATE_AO_GC(cdcs->dcs.ao);
+    int ret = 0;
+
+    if (onwrite != 0 || errnoval != cdcs->expected_len) {
+        ret = ERROR_FAIL;
+        goto out;
+    }
+
+    read_restore_rec_hdr(cdcs);
+
+out:
+    if (ret)
+        libxl__ao_complete(egc, ao, ret);
+}
+
+static void restore_write_dm_cb(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
+{
+    libxl__app_domain_create_state *cdcs = CONTAINER_OF(dc, *cdcs, dc);
+    STATE_AO_GC(cdcs->dcs.ao);
+    int ret = 0, padding;
+    libxl__datacopier_state *stream_dc = &cdcs->stream_dc;
+    struct restore_rec_hdr *rechdr = &cdcs->rechdr;
+
+    if (onwrite || errnoval) {
+        ret = ERROR_FAIL;
+        goto out;
+    }
+
+    padding = ROUNDUP(rechdr->length - sizeof(struct restore_emulator_hdr),
+                  REC_ALIGN_ORDER) - (rechdr->length - sizeof(struct restore_emulator_hdr));
+    if (padding > 0) {
+        stream_dc->readwhat = "padding";
+        stream_dc->maxread = padding;
+        cdcs->expected_len = stream_dc->maxread;
+        stream_dc->callback = read_padding_cb;
+        stream_dc->used = 0;
+        stream_dc->readbuf = libxl__malloc(gc, padding);
+        libxl__datacopier_start(stream_dc);
+    } else {
+        read_restore_rec_hdr(cdcs);
+    }
+
+out:
+    if (ret)
+        libxl__ao_complete(egc, ao, ret);
+}
+
+static void read_restore_record_complete(libxl__egc *egc,
+                                         libxl__app_domain_create_state *cdcs,
+                                         void *buf)
+{
+    STATE_AO_GC(cdcs->dcs.ao);
+    int ret = 0;
+
+    /* convenience aliases */
+    libxl_ctx *ctx = CTX;
+    libxl__domain_create_state *dcs = &cdcs->dcs;
+    struct restore_rec_hdr *rechdr = &cdcs->rechdr;
+
+    LIBXL__LOG(ctx, LIBXL__LOG_DEBUG,
+               "Record: 0x%08x, length %u", rechdr->type, rechdr->length);
+    switch (rechdr->type) {
+    case REC_TYPE_END:
+        if (rechdr->length != 0) {
+            LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
+                       "Encountered END record with non-zero length");
+            ret = ERROR_FAIL;
+        } else {
+            /* complete restore */
+            store_libxl_entry(gc, dcs->guest_domid, &dcs->guest_config->b_info);
+
+            libxl__multidev_begin(ao, &dcs->multidev);
+            dcs->multidev.callback = domcreate_launch_dm;
+            libxl__add_disks(egc, ao, dcs->guest_domid, dcs->guest_config, &dcs->multidev);
+            libxl__multidev_prepared(egc, &dcs->multidev, 0);
+        }
+        break;
+    case REC_TYPE_DOMAIN_JSON:
+        /* XXX handle domain JSON */
+        break;
+    case REC_TYPE_LIBXC_CONTEXT:
+        initiate_domain_create(egc, &cdcs->dcs);
+        break;
+    case REC_TYPE_XENSTORE_DATA:
+        ret = libxl__toolstack_restore(cdcs->dcs.guest_domid, buf,
+                                       rechdr->length, &cdcs->dcs);
+        if (!ret)
+            read_restore_rec_hdr(cdcs);
+        break;
+    case REC_TYPE_EMULATOR_CONTEXT: {
+        struct restore_emulator_hdr *emuhdr = buf;
+        char path[256];
+        libxl__datacopier_state *dc = &cdcs->dc;
+
+        if (emuhdr->id == EMULATOR_UNKNOWN) {
+            ret = ERROR_FAIL;
+            goto out;
+        }
+
+        sprintf(path, XC_DEVICE_MODEL_RESTORE_FILE".%u", cdcs->dcs.guest_domid);
+        memset(dc, 0, sizeof(*dc));
+        dc->ao = ao;
+        dc->readwhat = "save/migration stream";
+        dc->copywhat = "emulator context";
+        dc->writewhat = "qemu save file";
+        dc->readfd = cdcs->dcs.restore_fd;
+        dc->writefd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
+        if (dc->writefd == -1) {
+            ret = ERROR_FAIL;
+            goto out;
+        }
+        dc->maxsz = INT_MAX;
+        dc->maxread = rechdr->length - sizeof(*emuhdr);
+        dc->callback = restore_write_dm_cb;
+
+        ret = libxl__datacopier_start(dc);
+        if (ret)
+            goto out;
+        break;
+    }
+    default:
+        ret = ERROR_FAIL;
+        break;
+    }
+
+    free(buf);
+
+out:
+    if (ret)
+        libxl__ao_complete(egc, ao, ret);
+}
+
+static void read_restore_rec_body_cb(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
+{
+    libxl__app_domain_create_state *cdcs = CONTAINER_OF(dc, *cdcs, stream_dc);
+    STATE_AO_GC(cdcs->dcs.ao);
+    int ret = 0;
+
+    if (onwrite != 0 || errnoval != cdcs->expected_len) {
+        ret = ERROR_FAIL;
+        free(dc->readbuf);
+        goto out;
+    }
+
+    read_restore_record_complete(egc, cdcs, dc->readbuf);
+
+out:
+    if (ret)
+        libxl__ao_complete(egc, ao, ret);
+}
+
+static void read_restore_rec_hdr_cb(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
+{
+    libxl__app_domain_create_state *cdcs = CONTAINER_OF(dc, *cdcs, stream_dc);
+    STATE_AO_GC(cdcs->dcs.ao);
+    int ret = 0;
+
+    /* convenience aliases */
+    struct restore_rec_hdr *rechdr = &cdcs->rechdr;
+
+    if (onwrite != 0 || errnoval != cdcs->expected_len) {
+        ret = ERROR_FAIL;
+        goto out;
+    }
+
+    if (rechdr->length > 0) {
+        dc->readwhat = "read record body";
+        if (rechdr->type == REC_TYPE_EMULATOR_CONTEXT)
+            dc->maxread = sizeof(struct restore_emulator_hdr);
+        else
+            dc->maxread = ROUNDUP(rechdr->length, REC_ALIGN_ORDER);
+        cdcs->expected_len = dc->maxread;
+        dc->callback = read_restore_rec_body_cb;
+        dc->used = 0;
+        dc->readbuf = libxl__malloc(NOGC, dc->maxread);
+        libxl__datacopier_start(dc);
+    } else {
+        read_restore_record_complete(egc, cdcs, NULL);
+    }
+
+out:
+    if (ret)
+        libxl__ao_complete(egc, ao, ret);
+}
+
+static void read_restore_rec_hdr(libxl__app_domain_create_state *cdcs)
+{
+    STATE_AO_GC(cdcs->dcs.ao);
+
+    libxl__datacopier_state *dc = &cdcs->stream_dc;
+    dc->readwhat = "read record header";
+    dc->maxread = sizeof(cdcs->rechdr);
+    cdcs->expected_len = dc->maxread;
+    dc->used = 0;
+    dc->callback = read_restore_rec_hdr_cb;
+    dc->readbuf = &cdcs->rechdr;
+    dc->suppress_pollhup = 1;
+    libxl__datacopier_start(dc);
+}
+
+static void read_restore_hdr_cb(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
+{
+    libxl__app_domain_create_state *cdcs = CONTAINER_OF(dc, *cdcs, stream_dc);
+    STATE_AO_GC(cdcs->dcs.ao);
+    int ret = 0;
+
+    /* convenience aliases */
+    libxl_ctx *ctx = CTX;
+    struct restore_hdr *hdr = &cdcs->hdr;
+
+    if (onwrite != 0 || errnoval != sizeof(*hdr)) {
+        ret = ERROR_FAIL;
+        goto out;
+    }
+
+    hdr->ident = be64toh(hdr->ident);
+    hdr->version = be32toh(hdr->version);
+    hdr->options = be32toh(hdr->options);
+
+    if (hdr->ident != RESTORE_STREAM_IDENT) {
+        LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
+                   "Invalid ident: Got 0x%016"PRIx64, hdr->ident);
+        ret = ERROR_FAIL;
+        goto out;
+    }
+    if (hdr->version != RESTORE_STREAM_VERSION) {
+        LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
+                   "Invalid Version: Expected %u, Got %u",
+                   RESTORE_STREAM_VERSION, hdr->version);
+        ret = ERROR_FAIL;
+        goto out;
+    }
+    if (hdr->options & RESTORE_OPT_BIG_ENDIAN) {
+        LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
+                   "Unable to handle big endian streams");
+        ret = ERROR_FAIL;
+        goto out;
+    }
+
+    read_restore_rec_hdr(cdcs);
+
+out:
+    if (ret)
+        libxl__ao_complete(egc, ao, ret);
+}
+
+static void restore_rebuild_complete(libxl__domain_create_state *dcs)
+{
+    libxl__app_domain_create_state *cdcs = CONTAINER_OF(dcs, *cdcs, dcs);
+    STATE_AO_GC(cdcs->dcs.ao);
+
+    read_restore_rec_hdr(cdcs);
+}
+
+static void libxl__domain_restore(libxl__egc *egc,
+                                  libxl__app_domain_create_state *cdcs)
+{
+    libxl__datacopier_state *dc = &cdcs->stream_dc;
+
+    memset(dc, 0, sizeof(*dc));
+    dc->readwhat = "read header";
+    dc->copywhat = "";
+    dc->writewhat = "";
+    dc->ao = cdcs->dcs.ao;
+    dc->readfd = cdcs->dcs.restore_fd;
+    dc->maxread = sizeof(cdcs->hdr);
+    cdcs->expected_len = dc->maxread;
+    dc->maxsz = INT_MAX;
+    dc->used = 0;
+    dc->callback = read_restore_hdr_cb;
+    dc->writefd = -1;
+    dc->readbuf = &cdcs->hdr;
+
+    libxl__datacopier_start(dc);
+}
+
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 2f74341..4160695 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -780,10 +780,8 @@ static inline char *restore_helper(libxl__gc *gc, uint32_t domid,
 }
 
 int libxl__toolstack_restore(uint32_t domid, const uint8_t *buf,
-                             uint32_t size, void *user)
+                             uint32_t size, libxl__domain_create_state *dcs)
 {
-    libxl__save_helper_state *shs = user;
-    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
     STATE_AO_GC(dcs->ao);
     int i, ret;
     const uint8_t *ptr = buf;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 537b523..3964009 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -991,8 +991,11 @@ _hidden int libxl__domain_rename(libxl__gc *gc, uint32_t domid,
                                  const char *old_name, const char *new_name,
                                  xs_transaction_t trans);
 
+typedef struct libxl__domain_create_state libxl__domain_create_state;
+
 _hidden int libxl__toolstack_restore(uint32_t domid, const uint8_t *buf,
-                                     uint32_t size, void *data);
+                                     uint32_t size,
+                                     libxl__domain_create_state *dcs);
 _hidden int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid);
 
 _hidden const char *libxl__userdata_path(libxl__gc *gc, uint32_t domid,
@@ -2780,12 +2783,12 @@ _hidden int libxl__destroy_qdisk_backend(libxl__gc *gc, uint32_t domid);
 
 /*----- Domain creation -----*/
 
-typedef struct libxl__domain_create_state libxl__domain_create_state;
-
 typedef void libxl__domain_create_cb(libxl__egc *egc,
                                      libxl__domain_create_state*,
                                      int rc, uint32_t domid);
 
+typedef void libxl__domain_rebuild_cb(libxl__domain_create_state *dcs);
+
 struct libxl__domain_create_state {
     /* filled in by user */
     libxl__ao *ao;
@@ -2806,6 +2809,7 @@ struct libxl__domain_create_state {
     /* necessary if the domain creation failed and we have to destroy it */
     libxl__domain_destroy_state dds;
     libxl__multidev multidev;
+    libxl__domain_rebuild_cb *rebuild_callback;
 };
 
 /*----- Domain suspend (save) functions -----*/
-- 
1.7.10.4


* [PATCH 27/29] [VERY RFC] tools/libxl: Support restoring legacy streams
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (25 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 12:36   ` Ian Campbell
  2014-09-10 17:11 ` [PATCH 28/29] tools/xl: Restore v2 streams using new interface Andrew Cooper
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

This WorksForMe in the success case, but the error handling is certainly lacking.

Specifically, the conversion script's output fd can't be closed until the v2
read loop has exited (cleanly or otherwise), without risking a close()/open()
race silently replacing the fd behind the loop's back.

However, it can't be closed when the read loop exits, as the conversion script
child might still be alive, and would prefer terminating cleanly to failing
with a bad FD.

Obviously, having one error handler block for the success/failure of the other
side is a no-go, and would still involve preselecting which was expected to
exit first.

Does anyone have any clever ideas of how to asynchronously collect the events
"the conversion script has exited", "the save helper has exited" and "the v2
read loop has finished" given the available infrastructure, to kick off a
combined cleanup of all 3?

(I also need to fix the conversion script info/error logging, but that is a
distinctly more minor problem.)
---
 tools/libxl/Makefile                |    1 +
 tools/libxl/libxl_convert_callout.c |  127 +++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_create.c          |   43 +++++++++---
 tools/libxl/libxl_internal.h        |   16 +++++
 tools/libxl/libxl_types.idl         |    1 +
 5 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 tools/libxl/libxl_convert_callout.c
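
For illustration only (not part of the patch, and not an existing libxl
facility): one possible shape for collecting the three events named in the
commit message is a small "join" object that records each event as it
arrives and runs a single cleanup once all of them have been seen, in
whatever order. Every name here (struct restore_join, the EV_* bits,
record_cleanup) is hypothetical.

```c
#include <assert.h>

/* One bit per event that must be observed before cleanup may run. */
enum {
    EV_CONVERTER_EXITED   = 1 << 0,
    EV_SAVE_HELPER_EXITED = 1 << 1,
    EV_READ_LOOP_DONE     = 1 << 2,
    EV_ALL                = (1 << 3) - 1,
};

struct restore_join {
    unsigned pending;     /* events still outstanding */
    int rc;               /* first non-zero result reported by any event */
    void (*cleanup)(struct restore_join *j, int rc);
    int cleanups_run;     /* bookkeeping for the example callback */
    int final_rc;
};

static void restore_join_init(struct restore_join *j,
                              void (*cleanup)(struct restore_join *, int))
{
    j->pending = EV_ALL;
    j->rc = 0;
    j->cleanup = cleanup;
    j->cleanups_run = 0;
    j->final_rc = 0;
}

/* Called from each completion path; runs cleanup after the last event. */
static void restore_join_event(struct restore_join *j, unsigned ev, int rc)
{
    assert((j->pending & ev) == ev);  /* each event reported exactly once */

    if (rc && !j->rc)
        j->rc = rc;                   /* remember the first failure */
    j->pending &= ~ev;

    if (!j->pending)
        j->cleanup(j, j->rc);         /* e.g. close fds, complete the ao */
}

/* Example callback: a real one would release the shared resources. */
static void record_cleanup(struct restore_join *j, int rc)
{
    j->cleanups_run++;
    j->final_rc = rc;
}
```

With this shape no handler has to guess which side finishes first; the fd
would only be closed inside the cleanup callback, after all three paths
have reported in.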

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 4bee4af..d7b5036 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -80,6 +80,7 @@ LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_internal.o libxl_utils.o libxl_uuid.o \
 			libxl_json.o libxl_aoutils.o libxl_numa.o \
 			libxl_save_callout.o _libxl_save_msgs_callout.o \
+			libxl_convert_callout.o \
 			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
 LIBXL_OBJS += libxl_genid.o
 LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
diff --git a/tools/libxl/libxl_convert_callout.c b/tools/libxl/libxl_convert_callout.c
new file mode 100644
index 0000000..9d31b91
--- /dev/null
+++ b/tools/libxl/libxl_convert_callout.c
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2014      Citrix Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h"
+
+#include "libxl_internal.h"
+
+static void stream_convert_done(libxl__egc *egc, libxl__conversion_helper_state *chs)
+{
+    STATE_AO_GC(chs->ao);
+
+    if (chs->rc)
+        libxl__ao_complete(egc, ao, chs->rc);
+}
+
+static void helper_done(libxl__egc *egc, libxl__conversion_helper_state *chs)
+{
+    STATE_AO_GC(chs->ao);
+
+    if (chs->rc)
+        LOG(ERROR, "conversion script failed");
+    else
+        LOG(INFO, "conversion script succeeded");
+
+    chs->completion_callback(egc, chs);
+}
+
+static void helper_exited(libxl__egc *egc, libxl__ev_child *ch,
+                          pid_t pid, int status)
+{
+    libxl__conversion_helper_state *chs = CONTAINER_OF(ch, *chs, child);
+    STATE_AO_GC(chs->ao);
+
+    if (status) {
+        libxl_report_child_exitstatus(CTX, XTL_INFO, "conversion", pid, status);
+        chs->rc = ERROR_FAIL;
+    }
+
+    helper_done(egc, chs);
+}
+
+/*----- entrypoints -----*/
+int libxl__convert_legacy_stream(libxl__egc *egc,
+                                 libxl__domain_create_state *dcs)
+{
+    STATE_AO_GC(dcs->ao);
+    libxl__conversion_helper_state *chs = &dcs->chs;
+    int rc = 0;
+
+    chs->ao = ao;
+    chs->completion_callback = stream_convert_done;
+
+    chs->rc = 0;
+    libxl__ev_child_init(&chs->child);
+
+    int legacy_fd = dcs->restore_fd;
+    libxl__carefd *child_in = NULL;
+
+    libxl__carefd_begin();
+    int fds[2];
+    if (libxl_pipe(CTX, fds)) {
+        rc = ERROR_FAIL;
+        libxl__carefd_unlock();
+        goto err;
+
+    }
+    chs->child_out = libxl__carefd_record(CTX, fds[0]);
+    child_in       = libxl__carefd_record(CTX, fds[1]);
+    libxl__carefd_unlock();
+
+    pid_t pid = libxl__ev_child_fork(gc, &chs->child, helper_exited);
+    if (!pid) {
+        LOG(INFO, "In child!");
+
+        char * const args[] =
+        {
+            getenv("LIBXL_CONVERT_HELPER") ?: PRIVATE_BINDIR "/convert-legacy-stream",
+            "--in", GCSPRINTF("%d", legacy_fd),
+            "--out", GCSPRINTF("%d", fds[1]),
+            "--width",
+#ifdef __i386__
+            "32",
+#else
+            "64",
+#endif
+            "--guest", "pv",
+            "--format", "libxl",
+            /* "--verbose", */
+            NULL,
+        };
+
+        libxl_fd_set_cloexec(CTX, legacy_fd, 0);
+        libxl_fd_set_cloexec(CTX, libxl__carefd_fd(child_in), 0);
+
+        libxl__exec(gc,
+                    -1, -1, -1,
+                    args[0], args, NULL);
+    }
+
+    LOG(INFO, "In parent!");
+
+    libxl__carefd_close(child_in);
+
+    LOG(INFO, "Updating restore_fd %d -> %d",
+        dcs->restore_fd, libxl__carefd_fd(chs->child_out));
+    dcs->restore_fd = libxl__carefd_fd(chs->child_out);
+
+    /* TODO - how to libxl__carefd_close(chs->child_out) ? */
+
+    assert(!rc);
+    return rc;
+
+ err:
+    assert(rc);
+    return rc;
+}
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 9661f78..a1935ec 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1514,16 +1514,6 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             ao_how, aop_console_how);
 }
 
-int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
-                                uint32_t *domid, int restore_fd,
-                                const libxl_domain_restore_params *params,
-                                const libxl_asyncop_how *ao_how,
-                                const libxl_asyncprogress_how *aop_console_how)
-{
-    return do_domain_create(ctx, d_config, domid, restore_fd,
-                            params->checkpointed_stream, ao_how, aop_console_how);
-}
-
 static void read_restore_rec_hdr(libxl__app_domain_create_state *cdcs);
 
 static void read_padding_cb(libxl__egc *egc, libxl__datacopier_state *dc,
@@ -1810,6 +1800,39 @@ static void libxl__domain_restore(libxl__egc *egc,
     libxl__datacopier_start(dc);
 }
 
+int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
+                                uint32_t *domid, int restore_fd,
+                                const libxl_domain_restore_params *params,
+                                const libxl_asyncop_how *ao_how,
+                                const libxl_asyncprogress_how *aop_console_how)
+{
+    AO_CREATE(ctx, 0, ao_how);
+    libxl__app_domain_create_state *cdcs;
+    int ret = 0;
+
+    GCNEW(cdcs);
+    cdcs->dcs.ao = ao;
+    cdcs->dcs.guest_config = d_config;
+    cdcs->dcs.restore_fd = restore_fd;
+    cdcs->dcs.callback = domain_create_cb;
+    cdcs->dcs.checkpointed_stream = 0;
+    cdcs->dcs.rebuild_callback = restore_rebuild_complete;
+    libxl__ao_progress_gethow(&cdcs->dcs.aop_console_how, aop_console_how);
+    cdcs->domid_out = domid;
+
+    if (params->stream_version == 1) {
+        LOG(DEBUG, "Converting legacy stream");
+        ret = libxl__convert_legacy_stream(egc, &cdcs->dcs);
+    }
+
+    if (!ret)
+        libxl__domain_restore(egc, cdcs);
+
+    if (ret)
+        return AO_ABORT(ret);
+    else
+        return AO_INPROGRESS;
+}
 
 /*
  * Local variables:
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 3964009..c56a167 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2497,6 +2497,19 @@ typedef struct libxl__save_helper_state {
                       * marshalling and xc callback functions */
 } libxl__save_helper_state;
 
+/*----- Legacy conversion helper -----*/
+typedef struct libxl__conversion_helper_state libxl__conversion_helper_state;
+
+struct libxl__conversion_helper_state {
+    /* public */
+    libxl__ao *ao;
+    void (*completion_callback)(libxl__egc *egc,
+                                libxl__conversion_helper_state *chs);
+    /* private */
+    int rc;
+    libxl__ev_child child;
+    libxl__carefd *child_out;
+};
 
 /*----- Domain suspend (save) state structure -----*/
 
@@ -2806,6 +2819,7 @@ struct libxl__domain_create_state {
         /* If we're not doing stubdom, we use only dmss.dm,
          * for the non-stubdom device model. */
     libxl__save_helper_state shs;
+    libxl__conversion_helper_state chs;
     /* necessary if the domain creation failed and we have to destroy it */
     libxl__domain_destroy_state dds;
     libxl__multidev multidev;
@@ -3283,6 +3297,8 @@ static inline void libxl__update_config_vtpm(libxl__gc *gc,
     libxl_uuid_copy(CTX, &dst->uuid, &src->uuid);
 }
 
+int libxl__convert_legacy_stream(libxl__egc *egc,
+                                 libxl__domain_create_state *dcs);
 #endif
 
 /*
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index f1fcbc3..fdd3781 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -308,6 +308,7 @@ libxl_domain_create_info = Struct("domain_create_info",[
 
 libxl_domain_restore_params = Struct("domain_restore_params", [
     ("checkpointed_stream", integer),
+    ("stream_version", uint32, {'init_val': '1'}),
     ])
 
 libxl_domain_sched_params = Struct("domain_sched_params",[
-- 
1.7.10.4


* [PATCH 28/29] tools/xl: Restore v2 streams using new interface
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (26 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 27/29] [VERY RFC] tools/libxl: Support restoring legacy streams Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-10 17:11 ` [PATCH 29/29] tools/[lib]xl: Alter libxl_domain_suspend() to write a v2 stream Andrew Cooper
  2014-09-11 11:50 ` [PATCH v7 0/29] Migration Stream v2 Ian Campbell
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxl/xl_cmdimpl.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
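
For illustration only (not part of the patch): a sketch of how the new
mandatory flag is interpreted by the two hunks below. The helper names
savefile_bad_flags() and savefile_stream_version() are hypothetical; the
flag value is the one the patch defines.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit as SAVEFILE_MANDATORY_STREAMV2 in the patch. */
#define SAVEFILE_MANDATORY_STREAMV2 (UINT32_C(1) << 0)

/* Mandatory flags this xl does not understand; non-zero means refuse. */
static uint32_t savefile_bad_flags(uint32_t mandatory_flags)
{
    return mandatory_flags & ~SAVEFILE_MANDATORY_STREAMV2;
}

/* Stream version to pass on: 2 if the flag is set, otherwise legacy (1). */
static uint32_t savefile_stream_version(uint32_t mandatory_flags)
{
    return (mandatory_flags & SAVEFILE_MANDATORY_STREAMV2) ? 2 : 1;
}
```

A savefile without the flag is thus treated as a legacy stream, which
matches the default of 1 given to stream_version in the IDL.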

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 26492fc..d17e333 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -136,6 +136,8 @@ static const char *action_on_shutdown_names[] = {
 
 #define SAVEFILE_BYTEORDER_VALUE ((uint32_t)0x01020304UL)
 
+#define SAVEFILE_MANDATORY_STREAMV2 (1 << 0)
+
 struct domain_create {
     int debug;
     int daemonize;
@@ -2115,7 +2117,7 @@ static uint32_t create_domain(struct domain_create *dom_info)
                 restore_source, hdr.mandatory_flags, hdr.optional_flags,
                 hdr.optional_data_len);
 
-        badflags = hdr.mandatory_flags & ~( 0 /* none understood yet */ );
+        badflags = hdr.mandatory_flags & ~SAVEFILE_MANDATORY_STREAMV2;
         if (badflags) {
             fprintf(stderr, "Savefile has mandatory flag(s) 0x%"PRIx32" "
                     "which are not supported; need newer xl\n",
@@ -2250,6 +2252,9 @@ start:
         libxl_domain_restore_params_init(&params);
 
         params.checkpointed_stream = dom_info->checkpointed_stream;
+        params.stream_version =
+            (hdr.mandatory_flags & SAVEFILE_MANDATORY_STREAMV2) ? 2 : 1;
+
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
                                           &params,
-- 
1.7.10.4


* [PATCH 29/29] tools/[lib]xl: Alter libxl_domain_suspend() to write a v2 stream
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (27 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 28/29] tools/xl: Restore v2 streams using new interface Andrew Cooper
@ 2014-09-10 17:11 ` Andrew Cooper
  2014-09-11 11:50 ` [PATCH v7 0/29] Migration Stream v2 Ian Campbell
  29 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-10 17:11 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell, Ross Lagerwall

From: Ross Lagerwall <ross.lagerwall@citrix.com>

Note that for now, the xl header and device config blob at the beginning
of the stream is still written out since we don't have any domain JSON
yet.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxl/libxl_dom.c      |  265 ++++++++++++++++++++++++++++++++++++------
 tools/libxl/libxl_internal.h |    5 +-
 tools/libxl/xl_cmdimpl.c     |    1 +
 3 files changed, 232 insertions(+), 39 deletions(-)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 4160695..1544378 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -19,6 +19,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_saverestore.h"
 
 #include <xc_dom.h>
 #include <xen/hvm/hvm_info_table.h>
@@ -1066,7 +1067,9 @@ int libxl__domain_suspend_device_model(libxl__gc *gc,
     uint32_t const domid = dss->domid;
     const char *const filename = dss->dm_savefile;
 
-    switch (libxl__device_model_version_running(gc, domid)) {
+    dss->dm_version = libxl__device_model_version_running(gc, domid);
+
+    switch (dss->dm_version) {
     case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL: {
         LOG(DEBUG, "Saving device model state to %s", filename);
         libxl__qemu_traditional_cmd(gc, domid, "save");
@@ -1410,10 +1413,9 @@ static inline char *physmap_path(libxl__gc *gc, uint32_t domid,
             domid, phys_offset, node);
 }
 
-int libxl__toolstack_save(uint32_t domid, uint8_t **buf,
-        uint32_t *len, void *dss_void)
+static int libxl__toolstack_save(libxl__domain_suspend_state *dss,
+        uint8_t **buf, uint32_t *len)
 {
-    libxl__domain_suspend_state *dss = dss_void;
     STATE_AO_GC(dss->ao);
     int i = 0;
     char *start_addr = NULL, *size = NULL, *phys_offset = NULL, *name = NULL;
@@ -1423,6 +1425,9 @@ int libxl__toolstack_save(uint32_t domid, uint8_t **buf,
     char **entries = NULL;
     struct libxl__physmap_info *pi;
 
+    /* Convenience aliases */
+    const uint32_t domid = dss->domid;
+
     entries = libxl__xs_directory(gc, 0, GCSPRINTF(
                 "/local/domain/0/device-model/%d/physmap", domid), &num);
     count = num;
@@ -1572,11 +1577,130 @@ static void remus_checkpoint_dm_saved(libxl__egc *egc,
 
 /*----- main code for suspending, in order of execution -----*/
 
+void libxl__save_write_header(libxl__egc *egc,
+                              libxl__domain_suspend_state *dss);
+
 void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
 {
     STATE_AO_GC(dss->ao);
-    int port;
+
+    libxl__save_write_header(egc, dss);
+}
+
+void libxl__save_write_end(libxl__egc *egc,
+                           libxl__domain_suspend_state *dss);
+
+static void domain_save_device_model_cb(libxl__egc *egc,
+                                        libxl__domain_suspend_state *dss,
+                                        int rc)
+{
+    STATE_AO_GC(dss->ao);
+
+    if (rc)
+        domain_suspend_done(egc, dss, rc);
+    else
+        libxl__save_write_end(egc, dss);
+}
+
+static void write_toolstack_done(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dc, *dss, dc);
+    STATE_AO_GC(dss->ao);
+
+    int rc = ERROR_FAIL;
+
+    /* Convenience aliases */
+    const libxl_domain_type type = dss->type;
+
+    if (onwrite || errnoval)
+        goto out;
+
+    if (type == LIBXL_DOMAIN_TYPE_HVM) {
+        rc = libxl__domain_suspend_device_model(gc, dss);
+        if (rc) goto out;
+
+        libxl__domain_save_device_model(egc, dss, domain_save_device_model_cb);
+        return;
+    }
+
+    libxl__save_write_end(egc, dss);
+
+    return;
+
+out:
+    domain_suspend_done(egc, dss, rc);
+}
+
+void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
+                                int rc, int retval, int errnoval)
+{
+    libxl__domain_suspend_state *dss = dss_void;
+    STATE_AO_GC(dss->ao);
+    struct restore_rec_hdr rechdr;
+    uint8_t *buf;
+    uint32_t len;
+    unsigned char pad[8] = {0};
+
+    if (rc)
+        goto out;
+
+    if (retval) {
+        LOGEV(ERROR, errnoval, "saving domain: %s",
+                         dss->guest_responded ?
+                         "domain responded to suspend request" :
+                         "domain did not respond to suspend request");
+        if ( !dss->guest_responded )
+            rc = ERROR_GUEST_TIMEDOUT;
+        else
+            rc = ERROR_FAIL;
+        goto out;
+    }
+
+    libxl__datacopier_state *dc = &dss->dc;
+    memset(dc, 0, sizeof(*dc));
+    dc->readwhat = "";
+    dc->copywhat = "toolstack data";
+    dc->writewhat = "save/migration stream";
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = dss->fd;
+    dc->maxsz = INT_MAX;
+    dc->maxread = INT_MAX;
+    dc->callback = write_toolstack_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) goto out;
+
+    rc = libxl__toolstack_save(dss, &buf, &len);
+    fprintf(stderr, "toolstack_save returned %d, len = %u\n", rc, len);
+    if (rc) goto out;
+
+    rechdr.type = REC_TYPE_XENSTORE_DATA;
+    rechdr.length = len;
+    libxl__datacopier_prefixdata(egc, dc, &rechdr, sizeof(rechdr));
+    libxl__datacopier_prefixdata(egc, dc, buf, len);
+    free(buf);
+
+    len = ROUNDUP(len, REC_ALIGN_ORDER) - len;
+    assert(len < 8);
+    if (len > 0)
+        libxl__datacopier_prefixdata(egc, dc, pad, len);
+
+    return;
+
+out:
+    domain_suspend_done(egc, dss, rc);
+}
+
+static void write_header_done(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dc, *dss, dc);
+    STATE_AO_GC(dss->ao);
+
     int rc = ERROR_FAIL;
+    int port;
 
     /* Convenience aliases */
     const uint32_t domid = dss->domid;
@@ -1587,6 +1711,9 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
     libxl__srm_save_autogen_callbacks *const callbacks =
         &dss->shs.callbacks.save.a;
 
+    if (onwrite || errnoval)
+        goto out;
+
     logdirty_init(&dss->logdirty);
     libxl__xswait_init(&dss->pvcontrol);
     libxl__ev_evtchn_init(&dss->guest_evtchn);
@@ -1643,50 +1770,97 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
         callbacks->suspend = libxl__domain_suspend_callback;
 
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
-    dss->shs.callbacks.save.toolstack_save = libxl__toolstack_save;
 
     libxl__xc_domain_save(egc, dss);
     return;
 
+out:
+    domain_suspend_done(egc, dss, rc);
+}
+
+void libxl__save_write_header(libxl__egc *egc,
+                              libxl__domain_suspend_state *dss)
+{
+    STATE_AO_GC(dss->ao);
+    struct restore_hdr hdr;
+    struct restore_rec_hdr rechdr;
+    int rc = ERROR_FAIL;
+
+    libxl__datacopier_state *dc = &dss->dc;
+    memset(dc, 0, sizeof(*dc));
+    dc->readwhat = "";
+    dc->copywhat = "suspend header";
+    dc->writewhat = "save/migration stream";
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = dss->fd;
+    dc->maxsz = INT_MAX;
+    dc->maxread = INT_MAX;
+    dc->callback = write_header_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) goto out;
+
+    hdr.ident = htobe64(RESTORE_STREAM_IDENT);
+    hdr.version = htobe32(RESTORE_STREAM_VERSION);
+    hdr.options = htobe32(0x0);
+    libxl__datacopier_prefixdata(egc, dc, &hdr, sizeof(hdr));
+
+    /* XXX need to write the domain config here. */
+
+    rechdr.type = REC_TYPE_LIBXC_CONTEXT;
+    rechdr.length = 0;
+    libxl__datacopier_prefixdata(egc, dc, &rechdr, sizeof(rechdr));
+
+    return;
+
  out:
     domain_suspend_done(egc, dss, rc);
 }
 
-void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
-                                int rc, int retval, int errnoval)
+static void write_end_writer_done(libxl__egc *egc,
+     libxl__datacopier_state *dc, int onwrite, int errnoval)
 {
-    libxl__domain_suspend_state *dss = dss_void;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dc, *dss, dc);
     STATE_AO_GC(dss->ao);
 
-    /* Convenience aliases */
-    const libxl_domain_type type = dss->type;
+    int rc = 0;
 
-    if (rc)
-        goto out;
+    if (onwrite || errnoval)
+        rc = ERROR_FAIL;
 
-    if (retval) {
-        LOGEV(ERROR, errnoval, "saving domain: %s",
-                         dss->guest_responded ?
-                         "domain responded to suspend request" :
-                         "domain did not respond to suspend request");
-        if ( !dss->guest_responded )
-            rc = ERROR_GUEST_TIMEDOUT;
-        else
-            rc = ERROR_FAIL;
-        goto out;
-    }
+    domain_suspend_done(egc, dss, rc);
+}
 
-    if (type == LIBXL_DOMAIN_TYPE_HVM) {
-        rc = libxl__domain_suspend_device_model(gc, dss);
-        if (rc) goto out;
+void libxl__save_write_end(libxl__egc *egc,
+                           libxl__domain_suspend_state *dss)
+{
+    STATE_AO_GC(dss->ao);
+    struct restore_rec_hdr rechdr;
+    int rc = ERROR_FAIL;
 
-        libxl__domain_save_device_model(egc, dss, domain_suspend_done);
-        return;
-    }
+    libxl__datacopier_state *dc = &dss->dc;
+    memset(dc, 0, sizeof(*dc));
+    dc->readwhat = "";
+    dc->copywhat = "suspend footer";
+    dc->writewhat = "save/migration stream";
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = dss->fd;
+    dc->maxsz = INT_MAX;
+    dc->maxread = INT_MAX;
+    dc->callback = write_end_writer_done;
 
-    rc = 0;
+    rechdr.type = REC_TYPE_END;
+    rechdr.length = 0;
 
-out:
+    rc = libxl__datacopier_start(dc);
+    if (rc) goto out;
+
+    libxl__datacopier_prefixdata(egc, dc, &rechdr, sizeof(rechdr));
+    return;
+
+ out:
     domain_suspend_done(egc, dss, rc);
 }
 
@@ -1698,6 +1872,8 @@ void libxl__domain_save_device_model(libxl__egc *egc,
                                      libxl__save_device_model_cb *callback)
 {
     STATE_AO_GC(dss->ao);
+    struct restore_rec_hdr rechdr;
+    struct restore_emulator_hdr emuhdr;
     struct stat st;
     uint32_t qemu_state_len;
     int rc;
@@ -1707,8 +1883,9 @@ void libxl__domain_save_device_model(libxl__egc *egc,
     /* Convenience aliases */
     const char *const filename = dss->dm_savefile;
     const int fd = dss->fd;
+    const int dm_version = dss->dm_version;
 
-    libxl__datacopier_state *dc = &dss->save_dm_datacopier;
+    libxl__datacopier_state *dc = &dss->dc;
     memset(dc, 0, sizeof(*dc));
     dc->readwhat = GCSPRINTF("qemu save file %s", filename);
     dc->ao = ao;
@@ -1739,15 +1916,31 @@ void libxl__domain_save_device_model(libxl__egc *egc,
 
     qemu_state_len = st.st_size;
     LOG(DEBUG, "%s is %d bytes", dc->readwhat, qemu_state_len);
+    fprintf(stderr, "device model is %u\n", qemu_state_len);
 
     rc = libxl__datacopier_start(dc);
     if (rc) goto out;
 
+    rechdr.type = REC_TYPE_EMULATOR_CONTEXT;
+    rechdr.length = sizeof(emuhdr) + qemu_state_len;
     libxl__datacopier_prefixdata(egc, dc,
-                                 QEMU_SIGNATURE, strlen(QEMU_SIGNATURE));
+                                 &rechdr, sizeof(rechdr));
 
+    switch (dm_version) {
+    case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL:
+        emuhdr.id = EMULATOR_QEMU_TRADITIONAL;
+        break;
+    case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN:
+        emuhdr.id = EMULATOR_QEMU_UPSTREAM;
+        break;
+    default:
+        emuhdr.id = EMULATOR_UNKNOWN;
+        break;
+    }
+    emuhdr.index = 0;
     libxl__datacopier_prefixdata(egc, dc,
-                                 &qemu_state_len, sizeof(qemu_state_len));
+                                 &emuhdr, sizeof(emuhdr));
+
     return;
 
  out:
@@ -1758,7 +1951,7 @@ static void save_device_model_datacopier_done(libxl__egc *egc,
      libxl__datacopier_state *dc, int onwrite, int errnoval)
 {
     libxl__domain_suspend_state *dss =
-        CONTAINER_OF(dc, *dss, save_dm_datacopier);
+        CONTAINER_OF(dc, *dss, dc);
     STATE_AO_GC(dss->ao);
 
     /* Convenience aliases */
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index c56a167..10ab664 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2556,7 +2556,8 @@ struct libxl__domain_suspend_state {
                                  struct libxl__domain_suspend_state*, int ok);
     /* private for libxl__domain_save_device_model */
     libxl__save_device_model_cb *save_dm_callback;
-    libxl__datacopier_state save_dm_datacopier;
+    libxl__datacopier_state dc;
+    int dm_version;
 };
 
 
@@ -2851,8 +2852,6 @@ void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc,
 
 _hidden void libxl__domain_suspend_common_switch_qemu_logdirty
                                (int domid, unsigned int enable, void *data);
-_hidden int libxl__toolstack_save(uint32_t domid, uint8_t **buf,
-        uint32_t *len, void *data);
 
 
 /* calls libxl__xc_domain_restore_done when done */
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index d17e333..3193352 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -3414,6 +3414,7 @@ static void save_domain_core_writeconfig(int fd, const char *source,
     memset(&hdr, 0, sizeof(hdr));
     memcpy(hdr.magic, savefileheader_magic, sizeof(hdr.magic));
     hdr.byteorder = SAVEFILE_BYTEORDER_VALUE;
+    hdr.mandatory_flags = SAVEFILE_MANDATORY_STREAMV2;
 
     optdata_begin= 0;
 
-- 
1.7.10.4

* Re: [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging
  2014-09-10 17:10 ` [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging Andrew Cooper
@ 2014-09-11 10:18   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:18 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> LOG() automatically adds a newline.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Ian Campbell <Ian.Campbell@citrix.com>

* Re: [PATCH 02/29] tools/[lib]xl: Correct use of init/dispose for libxl_domain_restore_params
  2014-09-10 17:10 ` [PATCH 02/29] tools/[lib]xl: Correct use of init/dispose for libxl_domain_restore_params Andrew Cooper
@ 2014-09-11 10:19   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Ian Campbell <Ian.Campbell@citrix.com>

* Re: [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact()
  2014-09-10 17:10 ` [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
@ 2014-09-11 10:19   ` Ian Campbell
  2014-09-11 10:57   ` Ian Campbell
  1 sibling, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> This implementation of writev_exact() will cope with an iovcnt greater than
> IOV_MAX because glibc will actually let this work anyway, and it is very
> useful not to have to worry about this in the caller of writev_exact().  The
> caller is still required to ensure that the sum of iov_len's doesn't overflow
> a ssize_t.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
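
For readers following along, a minimal sketch of the "write everything or
fail" pattern described in the commit message. This is not the code from the
patch: the name sketch_writev_exact is hypothetical, and unlike the patch
(which relies on glibc accepting a large iovcnt) this version splits the
array into IOV_MAX-sized chunks explicitly. The fallback value for IOV_MAX
is also an assumption, used only if the headers do not define it.

```c
#include <errno.h>
#include <limits.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef IOV_MAX
#define IOV_MAX 1024 /* assumption: fallback if <limits.h> does not expose it */
#endif

/*
 * Hypothetical sketch, not the libxc implementation.  Write every byte
 * described by iov[0..iovcnt) to fd.  Returns 0 on success, -1 on error
 * with errno set by writev().  The iov array is modified in place.
 */
static int sketch_writev_exact(int fd, struct iovec *iov, int iovcnt)
{
    int idx = 0;

    while ( idx < iovcnt )
    {
        int chunk = iovcnt - idx;
        ssize_t len;

        /* Never hand more than IOV_MAX entries to a single writev(). */
        if ( chunk > IOV_MAX )
            chunk = IOV_MAX;

        len = writev(fd, &iov[idx], chunk);
        if ( len < 0 )
        {
            if ( errno == EINTR )
                continue;
            return -1;
        }

        /* Skip entries that were written completely... */
        while ( idx < iovcnt && (size_t)len >= iov[idx].iov_len )
        {
            len -= iov[idx].iov_len;
            ++idx;
        }

        /* ...then trim the entry that was only partially written. */
        if ( idx < iovcnt && len > 0 )
        {
            iov[idx].iov_base = (char *)iov[idx].iov_base + len;
            iov[idx].iov_len -= len;
        }
    }

    return 0;
}
```

As in the commit message, the caller would still have to ensure that the sum
of the iov_len fields fits in a ssize_t; the sketch does not check this.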

* Re: [PATCH 04/29] libxc/bitops: Add or() to the available bitmap operations
  2014-09-10 17:10 ` [PATCH 04/29] libxc/bitops: Add or() to the available bitmap operations Andrew Cooper
@ 2014-09-11 10:21   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:21 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

There's a bit of a trap here for callers, who need to make sure both
bitmaps are the same size (or at least both large enough to hold
nr_bits), but I think we can live with that.

> CC: Ian Campbell <Ian.Campbell@citrix.com>
> CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
> ---
>  tools/libxc/xc_bitops.h |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/libxc/xc_bitops.h b/tools/libxc/xc_bitops.h
> index d8e0c16..dfce3b8 100644
> --- a/tools/libxc/xc_bitops.h
> +++ b/tools/libxc/xc_bitops.h
> @@ -60,4 +60,12 @@ static inline int test_and_set_bit(int nr, unsigned long *addr)
>      return oldbit;
>  }
>  
> +static inline void bitmap_or(unsigned long *dst, const unsigned long *other,
> +                             int nr_bits)
> +{
> +    int i, nr_longs = (bitmap_size(nr_bits) / sizeof(unsigned long));

Ah, this doesn't round down because bitmap_size already rounds up.
Good.
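
To make the rounding point concrete, here is a small standalone sketch. The
names (sketch_bitmap_size, sketch_bitmap_or, SKETCH_BITS_PER_LONG) are
hypothetical and this is not the xc_bitops.h code; it only assumes, as noted
above, that the size helper rounds up to a whole number of unsigned longs,
and that the loop ORs one long at a time.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for nr_bits, rounded up to a whole number of unsigned longs. */
static inline size_t sketch_bitmap_size(size_t nr_bits)
{
    size_t nr_longs =
        (nr_bits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

    return nr_longs * sizeof(unsigned long);
}

/*
 * OR 'other' into 'dst'.  Both buffers must be at least
 * sketch_bitmap_size(nr_bits) bytes long; nothing here can check that.
 */
static inline void sketch_bitmap_or(unsigned long *dst,
                                    const unsigned long *other,
                                    size_t nr_bits)
{
    /* Exact division: the size is already a multiple of sizeof(long). */
    size_t i, nr_longs = sketch_bitmap_size(nr_bits) / sizeof(unsigned long);

    for ( i = 0; i < nr_longs; ++i )
        dst[i] |= other[i];
}
```

Because the loop works on whole longs, it also ORs any bits in the last word
that lie beyond nr_bits, which is why both bitmaps need to be allocated with
the rounded-up size.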

Acked-by: Ian Campbell <ian.campbell@citrix.com>

* Re: [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure
  2014-09-10 17:10 ` [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure Andrew Cooper
@ 2014-09-11 10:32   ` Ian Campbell
  2014-09-11 14:03     ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:32 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> Not everything which needs reporting as progress comes with a range.

Please can you expand on your reasoning wrt the places where you are
removing a usage of start with a size. Especially since you aren't
changing any subsequent progress_step call to your new singleton
variant.

> Allow reporting "0 of 0" for a single progress statement.

Can we not arrange to suppress this entirely? Perhaps by turning total=0
into percent=-1 and having the lower level code omit that bit of the
message in that case? If doing that you should update xentoollog.h to
make this clear.
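
Roughly what I have in mind, as a standalone sketch. The helper names
(progress_percent, format_progress) and the message layout are hypothetical
and not part of libxc or xentoollog; only the overflow-avoiding arithmetic
is taken from the existing xtl_progress().

```c
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper: compute a percentage for a progress report,
 * returning -1 when there is no range to report against.
 */
static int progress_percent(unsigned long done, unsigned long total)
{
    if ( total == 0 )
        return -1; /* no range: the logger should omit the percentage */

    /* Same overflow-avoiding arithmetic as the existing xtl_progress(). */
    return (total < LONG_MAX / 100)
        ? (int)((done * 100) / total)
        : (int)(done / ((total + 99) / 100));
}

/*
 * Hypothetical lower-level formatter: drop the "done/total (percent)"
 * part entirely when percent is -1.  Returns snprintf()'s result.
 */
static int format_progress(char *buf, size_t len, const char *what,
                           unsigned long done, unsigned long total)
{
    int percent = progress_percent(done, total);

    if ( percent < 0 )
        return snprintf(buf, len, "%s", what);

    return snprintf(buf, len, "%s: %lu/%lu (%d%%)",
                    what, done, total, percent);
}
```

With this shape a single progress statement would print just its text, while
ranged reports keep their counts and percentage.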

It appears you are also doing more than just this, by renaming
progress_start to progress_set for some reason. Even if you want to
change the parameters it's not clear that the new name is in any way an
improvement.

Thirdly you appear to also be arranging to make it allowable to call
progress_step without having previously called progress_start/set. Why
is this needed?

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: Ian Campbell <Ian.Campbell@citrix.com>
> CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
> ---
>  tools/libxc/xc_domain_restore.c |    2 +-
>  tools/libxc/xc_domain_save.c    |    2 +-
>  tools/libxc/xc_private.c        |   22 ++++++++++++++--------
>  tools/libxc/xc_private.h        |    4 ++--
>  tools/libxc/xtl_core.c          |    9 +++++----
>  5 files changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
> index b9a56d5..b411126 100644
> --- a/tools/libxc/xc_domain_restore.c
> +++ b/tools/libxc/xc_domain_restore.c
> @@ -1610,7 +1610,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>          goto out;
>      }
>  
> -    xc_report_progress_start(xch, "Reloading memory pages", dinfo->p2m_size);
> +    xc_report_progress_set(xch, "Reloading memory pages");
>  
>      /*
>       * Now simply read each saved frame into its new machine frame.
> diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
> index 254fdb3..02544f8 100644
> --- a/tools/libxc/xc_domain_save.c
> +++ b/tools/libxc/xc_domain_save.c
> @@ -1127,7 +1127,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>                   "Saving memory: iter %d (last sent %u skipped %u)",
>                   iter, sent_this_iter, skip_this_iter);
>  
> -        xc_report_progress_start(xch, reportbuf, dinfo->p2m_size);
> +        xc_report_progress_set(xch, reportbuf);
>  
>          iter++;
>          sent_this_iter = 0;
> diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
> index 0941b06..45537af 100644
> --- a/tools/libxc/xc_private.c
> +++ b/tools/libxc/xc_private.c
> @@ -388,18 +388,24 @@ void xc_osdep_log(xc_interface *xch, xentoollog_level level, int code, const cha
>      va_end(args);
>  }
>  
> -void xc_report_progress_start(xc_interface *xch, const char *doing,
> -                              unsigned long total) {
> +const char *xc_report_progress_set(xc_interface *xch, const char *doing)
> +{
> +    const char *old = xch->currently_progress_reporting;
> +
>      xch->currently_progress_reporting = doing;
> -    xtl_progress(xch->error_handler, "xc", xch->currently_progress_reporting,
> -                 0, total);
> +    return old;
> +}
> +
> +void xc_report_progress_single(xc_interface *xch, const char *doing)
> +{
> +    xtl_progress(xch->error_handler, "xc", doing, 0, 0);
>  }
>  
>  void xc_report_progress_step(xc_interface *xch,
> -                             unsigned long done, unsigned long total) {
> -    assert(xch->currently_progress_reporting);
> -    xtl_progress(xch->error_handler, "xc", xch->currently_progress_reporting,
> -                 done, total);
> +                             unsigned long done, unsigned long total)
> +{
> +    xtl_progress(xch->error_handler, "xc",
> +                 xch->currently_progress_reporting ?: "???", done, total);
>  }
>  
>  int xc_get_pfn_type_batch(xc_interface *xch, uint32_t dom,
> diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
> index 97e4a56..22021e9 100644
> --- a/tools/libxc/xc_private.h
> +++ b/tools/libxc/xc_private.h
> @@ -119,8 +119,8 @@ void xc_report(xc_interface *xch, xentoollog_logger *lg, xentoollog_level,
>                 int code, const char *fmt, ...)
>       __attribute__((format(printf,5,6)));
>  
> -void xc_report_progress_start(xc_interface *xch, const char *doing,
> -                              unsigned long total);
> +const char *xc_report_progress_set(xc_interface *xch, const char *doing);
> +void xc_report_progress_single(xc_interface *xch, const char *doing);
>  void xc_report_progress_step(xc_interface *xch,
>                               unsigned long done, unsigned long total);
>  
> diff --git a/tools/libxc/xtl_core.c b/tools/libxc/xtl_core.c
> index 326b97e..73add92 100644
> --- a/tools/libxc/xtl_core.c
> +++ b/tools/libxc/xtl_core.c
> @@ -66,13 +66,14 @@ void xtl_log(struct xentoollog_logger *logger,
>  void xtl_progress(struct xentoollog_logger *logger,
>                    const char *context, const char *doing_what,
>                    unsigned long done, unsigned long total) {
> -    int percent;
> +    int percent = 0;
>  
>      if (!logger->progress) return;
>  
> -    percent = (total < LONG_MAX/100)
> -        ? (done * 100) / total
> -        : done / ((total + 99) / 100);
> +    if ( total )
> +        percent = (total < LONG_MAX/100)
> +            ? (done * 100) / total
> +            : done / ((total + 99) / 100);
>  
>      logger->progress(logger, context, doing_what, percent, done, total);
>  }

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-10 17:10 ` [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
@ 2014-09-11 10:34   ` Ian Campbell
  2014-09-11 10:37     ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> For testing purposes, the environment variable "XG_MIGRATION_V2" allows the
> two save/restore codepaths to coexist, and have a runtime switch.
> 
> It is intended that once this series is less RFC, the v2 framework will
> completely replace v1.

I think we are now at the point where this hack needs to be dropped from
the series.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  tools/libxc/Makefile              |    1 +
>  tools/libxc/saverestore/common.h  |   15 +++++++++++++++
>  tools/libxc/saverestore/restore.c |   23 +++++++++++++++++++++++
>  tools/libxc/saverestore/save.c    |   19 +++++++++++++++++++
>  tools/libxc/xc_domain_restore.c   |    8 ++++++++
>  tools/libxc/xc_domain_save.c      |    6 ++++++
>  tools/libxc/xenguest.h            |   13 +++++++++++++
>  7 files changed, 85 insertions(+)
>  create mode 100644 tools/libxc/saverestore/common.h
>  create mode 100644 tools/libxc/saverestore/restore.c
>  create mode 100644 tools/libxc/saverestore/save.c
> 
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> index 3b04027..564e805 100644
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -45,6 +45,7 @@ GUEST_SRCS-y :=
>  GUEST_SRCS-y += xg_private.c xc_suspend.c
>  ifeq ($(CONFIG_MIGRATE),y)
>  GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
> +GUEST_SRCS-y += $(wildcard saverestore/*.c)
>  GUEST_SRCS-y += xc_offline_page.c xc_compression.c
>  else
>  GUEST_SRCS-y += xc_nomigrate.c
> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
> new file mode 100644
> index 0000000..f1aff44
> --- /dev/null
> +++ b/tools/libxc/saverestore/common.h
> @@ -0,0 +1,15 @@
> +#ifndef __COMMON__H
> +#define __COMMON__H
> +
> +#include "../xg_private.h"
> +
> +#endif
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
> new file mode 100644
> index 0000000..6624baa
> --- /dev/null
> +++ b/tools/libxc/saverestore/restore.c
> @@ -0,0 +1,23 @@
> +#include "common.h"
> +
> +int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
> +                       unsigned int store_evtchn, unsigned long *store_mfn,
> +                       domid_t store_domid, unsigned int console_evtchn,
> +                       unsigned long *console_mfn, domid_t console_domid,
> +                       unsigned int hvm, unsigned int pae, int superpages,
> +                       int checkpointed_stream,
> +                       struct restore_callbacks *callbacks)
> +{
> +    IPRINTF("In experimental %s", __func__);
> +    return -1;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
> new file mode 100644
> index 0000000..f6ad734
> --- /dev/null
> +++ b/tools/libxc/saverestore/save.c
> @@ -0,0 +1,19 @@
> +#include "common.h"
> +
> +int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> +                    uint32_t max_factor, uint32_t flags,
> +                    struct save_callbacks* callbacks, int hvm)
> +{
> +    IPRINTF("In experimental %s", __func__);
> +    return -1;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
> index b411126..cec2dfc 100644
> --- a/tools/libxc/xc_domain_restore.c
> +++ b/tools/libxc/xc_domain_restore.c
> @@ -1490,6 +1490,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>      struct restore_ctx *ctx = &_ctx;
>      struct domain_info_context *dinfo = &ctx->dinfo;
>  
> +    if ( getenv("XG_MIGRATION_V2") )
> +    {
> +        return xc_domain_restore2(
> +            xch, io_fd, dom, store_evtchn, store_mfn,
> +            store_domid, console_evtchn, console_mfn, console_domid,
> +            hvm,  pae,  superpages, checkpointed_stream, callbacks);
> +    }
> +
>      DPRINTF("%s: starting restore of new domid %u", __func__, dom);
>  
>      pagebuf_init(&pagebuf);
> diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
> index 02544f8..a23ed68 100644
> --- a/tools/libxc/xc_domain_save.c
> +++ b/tools/libxc/xc_domain_save.c
> @@ -894,6 +894,12 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>  
>      int completed = 0;
>  
> +    if ( getenv("XG_MIGRATION_V2") )
> +    {
> +        return xc_domain_save2(xch, io_fd, dom, max_iters,
> +                               max_factor, flags, callbacks, hvm);
> +    }
> +
>      DPRINTF("%s: starting save of domid %u", __func__, dom);
>  
>      if ( hvm && !callbacks->switch_qemu_logdirty )
> diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
> index 40bbac8..55755cf 100644
> --- a/tools/libxc/xenguest.h
> +++ b/tools/libxc/xenguest.h
> @@ -88,6 +88,10 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>                     uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
>                     struct save_callbacks* callbacks, int hvm);
>  
> +/* Domain Save v2 */
> +int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> +                    uint32_t max_factor, uint32_t flags,
> +                    struct save_callbacks* callbacks, int hvm);
>  
>  /* callbacks provided by xc_domain_restore */
>  struct restore_callbacks {
> @@ -124,6 +128,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>                        unsigned int hvm, unsigned int pae, int superpages,
>                        int checkpointed_stream,
>                        struct restore_callbacks *callbacks);
> +
> +/* Domain Restore v2 */
> +int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
> +                       unsigned int store_evtchn, unsigned long *store_mfn,
> +                       domid_t store_domid, unsigned int console_evtchn,
> +                       unsigned long *console_mfn, domid_t console_domid,
> +                       unsigned int hvm, unsigned int pae, int superpages,
> +                       int checkpointed_stream,
> +                       struct restore_callbacks *callbacks);
>  /**
>   * xc_domain_restore writes a file to disk that contains the device
>   * model saved state.

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-11 10:34   ` Ian Campbell
@ 2014-09-11 10:37     ` Andrew Cooper
  2014-09-11 11:01       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 10:37 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 11/09/14 11:34, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>> For testing purposes, the environment variable "XG_MIGRATION_V2" allows the
>> two save/restore codepaths to coexist, and have a runtime switch.
>>
>> It is intended that once this series is less RFC, the v2 framework will
>> completely replace v1.
> I think we are now at the point where this hack needs to be dropped from
> the series.

One problem is remus.  My plan when dropping this patch was to drop all
of xc_domain_{save/restore}.c as well, but without remus migration-v2
support available, this will break existing set-ups.

One option might be to have legacy and v2 sitting properly side-by-side
in libxc for the transition period.

~Andrew

>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>  tools/libxc/Makefile              |    1 +
>>  tools/libxc/saverestore/common.h  |   15 +++++++++++++++
>>  tools/libxc/saverestore/restore.c |   23 +++++++++++++++++++++++
>>  tools/libxc/saverestore/save.c    |   19 +++++++++++++++++++
>>  tools/libxc/xc_domain_restore.c   |    8 ++++++++
>>  tools/libxc/xc_domain_save.c      |    6 ++++++
>>  tools/libxc/xenguest.h            |   13 +++++++++++++
>>  7 files changed, 85 insertions(+)
>>  create mode 100644 tools/libxc/saverestore/common.h
>>  create mode 100644 tools/libxc/saverestore/restore.c
>>  create mode 100644 tools/libxc/saverestore/save.c
>>
>> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
>> index 3b04027..564e805 100644
>> --- a/tools/libxc/Makefile
>> +++ b/tools/libxc/Makefile
>> @@ -45,6 +45,7 @@ GUEST_SRCS-y :=
>>  GUEST_SRCS-y += xg_private.c xc_suspend.c
>>  ifeq ($(CONFIG_MIGRATE),y)
>>  GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
>> +GUEST_SRCS-y += $(wildcard saverestore/*.c)
>>  GUEST_SRCS-y += xc_offline_page.c xc_compression.c
>>  else
>>  GUEST_SRCS-y += xc_nomigrate.c
>> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
>> new file mode 100644
>> index 0000000..f1aff44
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/common.h
>> @@ -0,0 +1,15 @@
>> +#ifndef __COMMON__H
>> +#define __COMMON__H
>> +
>> +#include "../xg_private.h"
>> +
>> +#endif
>> +/*
>> + * Local variables:
>> + * mode: C
>> + * c-file-style: "BSD"
>> + * c-basic-offset: 4
>> + * tab-width: 4
>> + * indent-tabs-mode: nil
>> + * End:
>> + */
>> diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
>> new file mode 100644
>> index 0000000..6624baa
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/restore.c
>> @@ -0,0 +1,23 @@
>> +#include "common.h"
>> +
>> +int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>> +                       unsigned int store_evtchn, unsigned long *store_mfn,
>> +                       domid_t store_domid, unsigned int console_evtchn,
>> +                       unsigned long *console_mfn, domid_t console_domid,
>> +                       unsigned int hvm, unsigned int pae, int superpages,
>> +                       int checkpointed_stream,
>> +                       struct restore_callbacks *callbacks)
>> +{
>> +    IPRINTF("In experimental %s", __func__);
>> +    return -1;
>> +}
>> +
>> +/*
>> + * Local variables:
>> + * mode: C
>> + * c-file-style: "BSD"
>> + * c-basic-offset: 4
>> + * tab-width: 4
>> + * indent-tabs-mode: nil
>> + * End:
>> + */
>> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
>> new file mode 100644
>> index 0000000..f6ad734
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/save.c
>> @@ -0,0 +1,19 @@
>> +#include "common.h"
>> +
>> +int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>> +                    uint32_t max_factor, uint32_t flags,
>> +                    struct save_callbacks* callbacks, int hvm)
>> +{
>> +    IPRINTF("In experimental %s", __func__);
>> +    return -1;
>> +}
>> +
>> +/*
>> + * Local variables:
>> + * mode: C
>> + * c-file-style: "BSD"
>> + * c-basic-offset: 4
>> + * tab-width: 4
>> + * indent-tabs-mode: nil
>> + * End:
>> + */
>> diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
>> index b411126..cec2dfc 100644
>> --- a/tools/libxc/xc_domain_restore.c
>> +++ b/tools/libxc/xc_domain_restore.c
>> @@ -1490,6 +1490,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>>      struct restore_ctx *ctx = &_ctx;
>>      struct domain_info_context *dinfo = &ctx->dinfo;
>>  
>> +    if ( getenv("XG_MIGRATION_V2") )
>> +    {
>> +        return xc_domain_restore2(
>> +            xch, io_fd, dom, store_evtchn, store_mfn,
>> +            store_domid, console_evtchn, console_mfn, console_domid,
>> +            hvm,  pae,  superpages, checkpointed_stream, callbacks);
>> +    }
>> +
>>      DPRINTF("%s: starting restore of new domid %u", __func__, dom);
>>  
>>      pagebuf_init(&pagebuf);
>> diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
>> index 02544f8..a23ed68 100644
>> --- a/tools/libxc/xc_domain_save.c
>> +++ b/tools/libxc/xc_domain_save.c
>> @@ -894,6 +894,12 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>>  
>>      int completed = 0;
>>  
>> +    if ( getenv("XG_MIGRATION_V2") )
>> +    {
>> +        return xc_domain_save2(xch, io_fd, dom, max_iters,
>> +                               max_factor, flags, callbacks, hvm);
>> +    }
>> +
>>      DPRINTF("%s: starting save of domid %u", __func__, dom);
>>  
>>      if ( hvm && !callbacks->switch_qemu_logdirty )
>> diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
>> index 40bbac8..55755cf 100644
>> --- a/tools/libxc/xenguest.h
>> +++ b/tools/libxc/xenguest.h
>> @@ -88,6 +88,10 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>>                     uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
>>                     struct save_callbacks* callbacks, int hvm);
>>  
>> +/* Domain Save v2 */
>> +int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>> +                    uint32_t max_factor, uint32_t flags,
>> +                    struct save_callbacks* callbacks, int hvm);
>>  
>>  /* callbacks provided by xc_domain_restore */
>>  struct restore_callbacks {
>> @@ -124,6 +128,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
>>                        unsigned int hvm, unsigned int pae, int superpages,
>>                        int checkpointed_stream,
>>                        struct restore_callbacks *callbacks);
>> +
>> +/* Domain Restore v2 */
>> +int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>> +                       unsigned int store_evtchn, unsigned long *store_mfn,
>> +                       domid_t store_domid, unsigned int console_evtchn,
>> +                       unsigned long *console_mfn, domid_t console_domid,
>> +                       unsigned int hvm, unsigned int pae, int superpages,
>> +                       int checkpointed_stream,
>> +                       struct restore_callbacks *callbacks);
>>  /**
>>   * xc_domain_restore writes a file to disk that contains the device
>>   * model saved state.
>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/29] docs: libxl migration stream specification
  2014-09-10 17:10 ` [PATCH 07/29] docs: libxl " Andrew Cooper
@ 2014-09-11 10:45   ` Ian Campbell
  2014-09-11 10:56     ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:

Please run a spell checker over this ("responsibile" was the first one I
spotted, "resonsible" was the second)

> +Purpose
> +-------

I think this (and the equivalent section in the libxc level doc) should
be moved to the commit log. It's useful right now as a motivator for the
change but in a year's time it will just be some random fluff in a doc
that everyone has to page past to get to the interesting bits.

> +This design addresses the above points, allowing for a completely
> +self-contained, extensible stream with each layer responsibile for its own
> +appropriate information.
> +
> +
> +Not Yet Included
> +----------------
> +
> +The following features are not yet fully specified and will be
> +included in a future draft.
> +
> +* Remus
> +
> +* ARM
> +
> +
> +Overview
> +========
> +
> +The image format consists of a _Header_, followed by 1 or more _Records_.
> +Each record consists of a type and length field, followed by any type-specific
> +data.
> +
> +\clearpage
> +
> +Header
> +======
> +
> +The header identifies the stream as a `libxl` stream, including the version of
> +this specification that it complies with.
> +
> +All fields in this header shall be in _big-endian_ byte order, regardless of
> +the setting of the endianness bit.
> +
> +     0     1     2     3     4     5     6     7 octet
> +    +-------------------------------------------------+
> +    | ident                                           |
> +    +-----------------------+-------------------------+
> +    | version               | options                 |
> +    +-----------------------+-------------------------+
> +
> +--------------------------------------------------------------------
> +Field       Description
> +----------- --------------------------------------------------------
> +ident       0x4c6962786c466d74 ("LibxlFmt" in ASCII).
> +
> +version     0x00000002.  The version of this specification.
> +
> +options     bit 0: Endianness.    0 = little-endian, 1 = big-endian.
> +
> +            bit 1: Legacy Format. If set, this stream was created by
> +                                  the legacy conversion tool.

Are such streams otherwise distinguishable from a stream which was
created directly? Should anything care about this?

I fear this is going to be used to paper over shortcomings in the
conversion tool somehow, but I suppose I'll see later in the series.
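
For reference while reviewing, the header layout quoted above could be parsed
roughly as follows. This is only a sketch of the quoted spec text, not code
from the series: the struct, macro and function names are made up for
illustration, and the 16 header octets are assumed to have already been read
into a buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Values taken from the quoted spec table; the names are hypothetical. */
#define STREAM_IDENT    UINT64_C(0x4c6962786c466d74) /* "LibxlFmt" in ASCII */
#define STREAM_VERSION  UINT32_C(2)
#define OPT_BIG_ENDIAN  (UINT32_C(1) << 0)
#define OPT_LEGACY      (UINT32_C(1) << 1)

struct stream_hdr {
    uint64_t ident;
    uint32_t version;
    uint32_t options;
};

/* Decode a big-endian 32-bit value from a byte buffer. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode a big-endian 64-bit value from a byte buffer. */
static uint64_t be64(const uint8_t *p)
{
    return ((uint64_t)be32(p) << 32) | be32(p + 4);
}

/*
 * Parse the 16 header octets.  All header fields are big-endian regardless
 * of the endianness option bit.  Returns 0 on success, -1 if the ident or
 * version does not match.
 */
int parse_stream_hdr(const uint8_t raw[16], struct stream_hdr *out)
{
    out->ident   = be64(raw);
    out->version = be32(raw + 8);
    out->options = be32(raw + 12);

    if (out->ident != STREAM_IDENT || out->version != STREAM_VERSION)
        return -1;
    return 0;
}
```

A reader would then test `hdr.options & OPT_LEGACY` to see whether the stream
came from the conversion tool, which is the bit being questioned here.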

> +LIBXC\_CONTEXT
> +--------------
> +
> +A libxc context record is a marker, indicating that the stream should be
> +handed to `xc_domain_restore()`.  `libxc` shall be resonsible for reading its
> +own image format from the stream.
> +
> +     0     1     2     3     4     5     6     7 octet
> +    +-------------------------------------------------+
> +
> +The libxc context record contains no fields; its body_length is 0[^1].
> +
> +
> +[^1]: The sending side cannot calculate ahead of time how much data `libxc`
> +might write into the stream, especially for live migration where the quantity
> +of data is partially proportional to the elapsed time.

I think this deserves to be in the main text and not a footnote
(assuming that's what ^1 is). I think it should probably be expanded to
explain how a toolstack can actually treat this, which I assume is to
assume that xc_domain_restore will consume exactly its own business and
then return.

Something somewhere also ought to say what libxc will have done on
error, which is presumably to have left the stream in some indeterminate
state and almost certainly not at the next libxl record boundary.

> +
> +XENSTORE\_DATA
> +-------------
> +
> +A record containing xenstore key/value pairs of data.

In what format?

> +     0     1     2     3     4     5     6     7 octet
> +    +-------------------------------------------------+
> +    | xenstore key/value pairs                        |
> +    ...
> +    +-------------------------------------------------+
> +

Ian.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/29] tools/libxc: C implementation of stream format
  2014-09-10 17:10 ` [PATCH 10/29] tools/libxc: C implementation of stream format Andrew Cooper
@ 2014-09-11 10:48   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> Provide the C structures matching the binary (wire) format of the new
> stream format.  All header/record fields are naturally aligned and
> explicit padding fields are used to ensure the correct layout (i.e.,
> there is no need for any non-standard structure packing pragma or
> attribute).
> 
> Provide some helper functions for converting types to string for
> diagnostic purposes.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

(David V also previously offered a reviewed-by, which isn't present
here)

Ian.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 11/29] tools/libxc: noarch common code
  2014-09-10 17:10 ` [PATCH 11/29] tools/libxc: noarch common code Andrew Cooper
@ 2014-09-11 10:52   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:52 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> +     * The page of data at the end of 'page' may be a read-only mapping of a
> +     * running guest; it must not be modified.  If no transformation is
> +     * required, the callee should leave '*pages' unouched.

untouched

> +     *
> +     * If a transformation is required, the callee should allocate themselves
> +     * a local page using malloc() and return it via '*page'.
> +     *
> +     * The caller shall free() '*page' in all cases.  In the case that the
> +     * callee enounceters an error, it should *NOT* free() the memory it

encounters


> [...]
> +    /**
> +     * Process an individual record from the stream.  The caller shall take
> +     * care of processing common records (e.g. END, PAGE_DATA).
> +     *
> +     * @return 0 for success, -1 for failure, or the sentinal value

sentinel

(might I recommend M-x ispell-region)

With the spelling fixed:
Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/29] docs: libxl migration stream specification
  2014-09-11 10:45   ` Ian Campbell
@ 2014-09-11 10:56     ` Andrew Cooper
  2014-09-11 11:03       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 10:56 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 11/09/14 11:45, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>
> Please run a spell checker over this ("responsibile" was the first one I
> spotted, "resonsible" was the second)
>
>> +Purpose
>> +-------
> I think this (and the equivalent section in the libxc level doc) should
> be moved to the commit log. It's useful right now as a motivator for the
> change but in a year's time it will just be some random fluff in a doc
> that everyone has to page past to get to the interesting bits.

Ok.

>
>> +This design addresses the above points, allowing for a completely
>> +self-contained, extensible stream with each layer responsibile for its own
>> +appropriate information.
>> +
>> +
>> +Not Yet Included
>> +----------------
>> +
>> +The following features are not yet fully specified and will be
>> +included in a future draft.
>> +
>> +* Remus
>> +
>> +* ARM
>> +
>> +
>> +Overview
>> +========
>> +
>> +The image format consists of a _Header_, followed by 1 or more _Records_.
>> +Each record consists of a type and length field, followed by any type-specific
>> +data.
>> +
>> +\clearpage
>> +
>> +Header
>> +======
>> +
>> +The header identifies the stream as a `libxl` stream, including the version of
>> +this specification that it complies with.
>> +
>> +All fields in this header shall be in _big-endian_ byte order, regardless of
>> +the setting of the endianness bit.
>> +
>> +     0     1     2     3     4     5     6     7 octet
>> +    +-------------------------------------------------+
>> +    | ident                                           |
>> +    +-----------------------+-------------------------+
>> +    | version               | options                 |
>> +    +-----------------------+-------------------------+
>> +
>> +--------------------------------------------------------------------
>> +Field       Description
>> +----------- --------------------------------------------------------
>> +ident       0x4c6962786c466d74 ("LibxlFmt" in ASCII).
>> +
>> +version     0x00000002.  The version of this specification.
>> +
>> +options     bit 0: Endianness.    0 = little-endian, 1 = big-endian.
>> +
>> +            bit 1: Legacy Format. If set, this stream was created by
>> +                                  the legacy conversion tool.
> Are such streams otherwise distinguishable from a stream which was
> created directly? Should anything care about this?
>
> I fear this is going to be used to paper over shortcomings in the
> conversion tool somehow, but I suppose I'll see later in the series.

My concern here was regarding the d_config.  A legacy converted stream
cannot possibly contain a domain_json blob, so must have a d_config
passed by the caller.  Admittedly, this did pre-date realising that
libxl currently allows the caller to blindly overwrite the config
anyway, and this needs to continue for compatibility reasons.

However, knowing that a stream has been converted is a key debugging
detail, even if this flag serves no other purpose from libxl's point of
view.

>
>> +LIBXC\_CONTEXT
>> +--------------
>> +
>> +A libxc context record is a marker, indicating that the stream should be
>> +handed to `xc_domain_restore()`.  `libxc` shall be resonsible for reading its
>> +own image format from the stream.
>> +
>> +     0     1     2     3     4     5     6     7 octet
>> +    +-------------------------------------------------+
>> +
>> +The libxc context record contains no fields; its body_length is 0[^1].
>> +
>> +
>> +[^1]: The sending side cannot calculate ahead of time how much data `libxc`
>> +might write into the stream, especially for live migration where the quantity
>> +of data is partially proportional to the elapsed time.
> I think this deserves to be in the main text and not a footnote
> (assuming that's what ^1 is).

^1 is indeed a footnote.

>  I think it should probably be expanded to
> explain how a toolstack can actually treat this, which I assume is to
> assume that xc_domain_restore will consume exactly its own business and
> then return.

Ok

>
> Something somewhere also ought to say what libxc will have done on
> error, which is presumably to have left the stream in some indeterminate
> state and almost certainly not at the next libxl record boundary.

Correct - I shall discuss this.

>
>> +
>> +XENSTORE\_DATA
>> +-------------
>> +
>> +A record containing xenstore key/value pairs of data.
> In what format?

Ah, yes.  "Whatever libxl currently does", although I guess I need to
expand on that.

>
>> +     0     1     2     3     4     5     6     7 octet
>> +    +-------------------------------------------------+
>> +    | xenstore key/value pairs                        |
>> +    ...
>> +    +-------------------------------------------------+
>> +
> Ian.
>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact()
  2014-09-10 17:10 ` [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
  2014-09-11 10:19   ` Ian Campbell
@ 2014-09-11 10:57   ` Ian Campbell
  2014-09-11 10:59     ` Andrew Cooper
  1 sibling, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 10:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
> index c50a7c9..97e4a56 100644
> --- a/tools/libxc/xc_private.h
> +++ b/tools/libxc/xc_private.h
> @@ -28,6 +28,7 @@
>  #include <sys/stat.h>
>  #include <stdlib.h>
>  #include <sys/ioctl.h>
> +#include <sys/uio.h>

Trying to apply this resulted in (lots of):
        compilation terminated.
        In file included from xc_flask.c:19:0:
        xc_private.h:31:21: fatal error: sys/uio.h: No such file or directory
        compilation terminated.
        
From the stubdom build. I suppose given the environment it would be
reasonably easy to stub out a version using a loop and regular write?
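
A minimal sketch of what such a stub might look like, assuming only write()
is available. The struct and function names below are hypothetical (a local
stand-in is used for struct iovec so that <sys/uio.h> is not needed), and
this is not the fix that was eventually applied:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical local stand-in for struct iovec, avoiding <sys/uio.h>. */
struct iovec_compat {
    const void *base;
    size_t len;
};

/*
 * Write every buffer in iov[] to fd in order, looping over plain write()
 * until each one has been fully written.  Returns 0 on success, -1 on error
 * with errno set.
 */
int writev_exact_fallback(int fd, const struct iovec_compat *iov, int iovcnt)
{
    for (int i = 0; i < iovcnt; i++) {
        const char *p = iov[i].base;
        size_t left = iov[i].len;

        while (left) {
            ssize_t n = write(fd, p, left);

            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, retry the same write */
                return -1;
            }
            if (n == 0) {
                errno = EIO;        /* no progress, treat as an error */
                return -1;
            }
            p += n;
            left -= (size_t)n;
        }
    }
    return 0;
}
```

Unlike a real writev(), this issues one or more write() calls per buffer, so
the vectors are not written atomically with respect to other writers.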

Ian.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact()
  2014-09-11 10:57   ` Ian Campbell
@ 2014-09-11 10:59     ` Andrew Cooper
  0 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 10:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 11/09/14 11:57, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>> diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
>> index c50a7c9..97e4a56 100644
>> --- a/tools/libxc/xc_private.h
>> +++ b/tools/libxc/xc_private.h
>> @@ -28,6 +28,7 @@
>>  #include <sys/stat.h>
>>  #include <stdlib.h>
>>  #include <sys/ioctl.h>
>> +#include <sys/uio.h>
> Trying to apply this resulted in (lots of):
>         compilation terminated.
>         In file included from xc_flask.c:19:0:
>         xc_private.h:31:21: fatal error: sys/uio.h: No such file or directory
>         compilation terminated.
>         
> From the stubdom build. I suppose given the environment it would be
> reasonably easy to stub out a version using a loop and regular write?
>
> Ian.
>
>

Ah - I had forgotten about stubdoms.  I will fix that up.

~Andrew

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-11 10:37     ` Andrew Cooper
@ 2014-09-11 11:01       ` Ian Campbell
  2014-09-11 11:04         ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 11:01 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
> On 11/09/14 11:34, Ian Campbell wrote:
> > On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >> For testing purposes, the environmental variable "XG_MIGRATION_V2" allows the
> >> two save/restore codepaths to coexist, and have a runtime switch.
> >>
> >> It is intended that once this series is less RFC, the v2 framework will
> >> completely replace v1.
> > I think we are now at the point where this hack needs to be dropped from
> > the series.
> 
> One problem is remus.  My plan when dropping this patch was to drop all
> of xc_domain_{save/restore}.c as well, but without remus migration-v2
> support available, this will break existing set-ups.

Hrm, how is that going wrt 4.5 freeze?

> One option might be to have legacy and v2 sitting properly side-by-side
> in libxc for the transition period.

How long do you mean? Until 4.6?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/29] docs: libxl migration stream specification
  2014-09-11 10:56     ` Andrew Cooper
@ 2014-09-11 11:03       ` Ian Campbell
  2014-09-11 11:10         ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 11:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 11:56 +0100, Andrew Cooper wrote:
> >> +            bit 1: Legacy Format. If set, this stream was created by
> >> +                                  the legacy conversion tool.
> > Are such streams otherwise distinguishable from a stream which was
> > created directly? Should anything care about this?
> >
> > I fear this is going to be used to paper over shortcomings in the
> > conversion tool somehow, but I suppose I'll see later in the series.
> 
> My concern here was regarding the d_config.  A legacy converted stream
> cannot possibly contain a domain_json blob, so must have a d_config
> passed by the caller.  Admittedly, this did pre-date realising that
> libxl currently allows the caller to blindly overwrite the config
> anyway, and this needs to continue for compatibility reasons.
> 
> However, knowing that a stream has been converted is a key debugging
> detail, even if this flag serves no other purpose from libxl's point of
> view.

Would it be worth saying so e.g. "This flag is for debugging purposes
only, toolstacks should not modify their behaviour based on this flag"?

Or could you do away with this bit and use version==1 to indicate this
(since version==1 doesn't actually get used by a real legacy stream)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-11 11:01       ` Ian Campbell
@ 2014-09-11 11:04         ` Andrew Cooper
  2014-09-11 11:10           ` Ian Campbell
  2014-09-14 10:23           ` Shriram Rajagopalan
  0 siblings, 2 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 11:04 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 11/09/14 12:01, Ian Campbell wrote:
> On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
>> On 11/09/14 11:34, Ian Campbell wrote:
>>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>>>> For testing purposes, the environmental variable "XG_MIGRATION_V2" allows the
>>>> two save/restore codepaths to coexist, and have a runtime switch.
>>>>
>>>> It is intended that once this series is less RFC, the v2 framework will
>>>> completely replace v1.
>>> I think we are now at the point where this hack needs to be dropped from
>>> the series.
>> One problem is remus.  My plan when dropping this patch was to drop all
>> of xc_domain_{save/restore}.c as well, but without remus migration-v2
>> support available, this will break existing set-ups.
> Hrm, how is that going wrt 4.5 freeze?

I haven’t heard or seen anything since v5 of this series (for which I did
some quick bugfixes and released v6).

I don't know, which probably means not good.

>
>> One option might be to have legacy and v2 sitting properly side-by-side
>> in libxc for the transition period.
> How long do you mean? Until 4.6?

Ideally, very early in the 4.6 dev period.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/29] docs: libxl migration stream specification
  2014-09-11 11:03       ` Ian Campbell
@ 2014-09-11 11:10         ` Andrew Cooper
  0 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 11:10 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 11/09/14 12:03, Ian Campbell wrote:
> On Thu, 2014-09-11 at 11:56 +0100, Andrew Cooper wrote:
>>>> +            bit 1: Legacy Format. If set, this stream was created by
>>>> +                                  the legacy conversion tool.
>>> Are such streams otherwise distinguishable from a stream which was
>>> created directly? Should anything care about this?
>>>
>>> I fear this is going to be used to paper over shortcomings in the
>>> conversion tool somehow, but I suppose I'll see later in the series.
>> My concern here was regarding the d_config.  A legacy converted stream
>> cannot possibly contain a domain_json blob, so must have a d_config
>> passed by the caller.  Admittedly, this did pre-date realising that
>> libxl currently allows the caller to blindly overwrite the config
>> anyway, and this needs to continue for compatibility reasons.
>>
>> However, knowing that a stream has been converted is a key debugging
>> detail, even if this flag serves no other purpose from libxl's point of
>> view.
> Would it be worth saying so e.g. "This flag is for debugging purposes
> only, toolstacks should not modify their behaviour based on this flag"?

If it turns out that way, then absolutely.

>
> Or could you do away with this bit and use version==1 to indicate this
> (since version==1 doesn't actually get used by a real legacy stream)
>
>

I don't like this, conceptually. The stream version is 2, as that states
that "this stream conforms to v2 of the spec", which describes the
framing etc.  (The problem needing to be solved is that there was no
older libxl stream where there should have been.)

~Andrew

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-11 11:04         ` Andrew Cooper
@ 2014-09-11 11:10           ` Ian Campbell
  2014-09-14 10:23           ` Shriram Rajagopalan
  1 sibling, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 11:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 12:04 +0100, Andrew Cooper wrote:
> On 11/09/14 12:01, Ian Campbell wrote:
> > On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
> >> On 11/09/14 11:34, Ian Campbell wrote:
> >>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >>>> For testing purposes, the environmental variable "XG_MIGRATION_V2" allows the
> >>>> two save/restore codepaths to coexist, and have a runtime switch.
> >>>>
> >>>> It is intended that once this series is less RFC, the v2 framework will
> >>>> completely replace v1.
> >>> I think we are now at the point where this hack needs to be dropped from
> >>> the series.
> >> One problem is remus.  My plan when dropping this patch was to drop all
> >> of xc_domain_{save/restore}.c as well, but without remus migration-v2
> >> support available, this will break existing set-ups.
> > Hrm, how is that going wrt 4.5 freeze?
> 
> I haven’t heard or seen anything since v5 of this series (for which I did
> some quick bugfixes and released v6).
> 
> I don't know, which probably means not good.

I think there have been remus postings since then, but I've been leaving
them to Ian so I don't know either.

> >
> >> One option might be to have legacy and v2 sitting properly side-by-side
> >> in libxc for the transition period.
> > How long do you mean? Until 4.6?
> 
> Ideally, very early in the 4.6 dev period.

Well, also for the lifetime of 4.5...

Ian.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 0/29] Migration Stream v2
  2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
                   ` (28 preceding siblings ...)
  2014-09-10 17:11 ` [PATCH 29/29] tools/[lib]xl: Alter libxl_domain_suspend() to write a v2 stream Andrew Cooper
@ 2014-09-11 11:50 ` Ian Campbell
  29 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 11:50 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Jackson, Tim Deegan, Xen-devel, Ross Lagerwall,
	David Vrabel, Jan Beulich

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:

> Patches 1 to 5 are misc bugfixes/improvements

I just pushed:
89f5ca3 libxc/bitops: Add or() to the available bitmap operations
220c8fe tools/[lib]xl: Correct use of init/dispose for libxl_domain_re
5f7766c tools/libxl: Fix stray blank line from debug logging

The other two had comments.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 20/29] tools/libxl: Update datacopier to support sending data only
  2014-09-10 17:10 ` [PATCH 20/29] tools/libxl: Update datacopier to support sending data only Andrew Cooper
@ 2014-09-11 11:56   ` Ian Campbell
  2014-09-11 12:00     ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 11:56 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Wen Congyang, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> datacopier is to read some data and write it out. If we
> have some data to send it over network, we cannot use
> datacopier. Update it to support this case.

I'm afraid I'm not sure what this changelog is saying. If we have data
to send over the network then why can't we use datacopier? How does
making this conditional on readfd's validity help that case?

> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/libxl/libxl_aoutils.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index b10d2e1..3e0c0ae 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -309,9 +309,11 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
>  
>      libxl__datacopier_init(dc);
>  
> -    rc = libxl__ev_fd_register(gc, &dc->toread, datacopier_readable,
> -                               dc->readfd, POLLIN);
> -    if (rc) goto out;
> +    if (dc->readfd >= 0) {
> +        rc = libxl__ev_fd_register(gc, &dc->toread, datacopier_readable,
> +                                   dc->readfd, POLLIN);
> +        if (rc) goto out;
> +    }
>  
>      rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
>                                 dc->writefd, POLLOUT);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 20/29] tools/libxl: Update datacopier to support sending data only
  2014-09-11 11:56   ` Ian Campbell
@ 2014-09-11 12:00     ` Andrew Cooper
  2014-09-11 12:39       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 12:00 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Wen Congyang, Xen-devel

On 11/09/14 12:56, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> datacopier is to read some data and write it out. If we
>> have some data to send it over network, we cannot use
>> datacopier. Update it to support this case.
> I'm afraid I'm not sure what this changelog is saying. If we have data
> to send over the network then why can't we use datacopier? How does
> making this conditional on readfd's validity help that case?

datacopier was originally strictly copying data from infd to outfd,
until EOF.

These patches around here in the series extend datacopier to include
"please write this local buffer to outfd", "please read $N bytes from
infd to this local buffer", and "please copy exactly $N bytes from infd
to outfd", all as asynchronous operations.

~Andrew

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier
  2014-09-10 17:10 ` [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier Andrew Cooper
@ 2014-09-11 12:01   ` Ian Campbell
  2014-09-11 12:17     ` Ross Lagerwall
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:01 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ross Lagerwall, Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> Previously, adding more than 1000 bytes of data would cause a segfault.

Segfault is a symptom, not a cause, why does it happen?

> Now, the maximum amount of data that can be added is limited by maxsz.
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  tools/libxl/libxl_aoutils.c |   14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index 3e0c0ae..caba637 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -160,6 +160,8 @@ void libxl__datacopier_prefixdata(libxl__egc *egc, libxl__datacopier_state *dc,
>  {
>      EGC_GC;
>      libxl__datacopier_buf *buf;
> +    const uint8_t *ptr;
> +
>      /*
>       * It is safe for this to be called immediately after _start, as
>       * is documented in the public comment.  _start's caller must have
> @@ -170,12 +172,14 @@ void libxl__datacopier_prefixdata(libxl__egc *egc, libxl__datacopier_state *dc,
>  
>      assert(len < dc->maxsz - dc->used);
>  
> -    buf = libxl__zalloc(NOGC, sizeof(*buf));
> -    buf->used = len;
> -    memcpy(buf->buf, data, len);
> +    for (ptr = data; len; len -= buf->used, ptr += buf->used) {
> +        buf = libxl__zalloc(NOGC, sizeof(*buf));
> +        buf->used = len < sizeof(buf->buf) ? len : sizeof(buf->buf);
> +        memcpy(buf->buf, ptr, buf->used);
>  
> -    dc->used += len;
> -    LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
> +        dc->used += buf->used;
> +        LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
> +    }
>  }
>  
>  static int datacopier_pollhup_handled(libxl__egc *egc,


* Re: [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier
  2014-09-10 17:11 ` [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier Andrew Cooper
@ 2014-09-11 12:02   ` Ian Campbell
  2014-09-11 12:23     ` Ross Lagerwall
  2014-09-12  8:36   ` Wen Congyang
  1 sibling, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ross Lagerwall, Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> Add a parameter, maxread, to limit the amount of data read from the
> source fd of a datacopier.

Why is this useful?

> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  tools/libxl/libxl_aoutils.c    |    9 +++++++--
>  tools/libxl/libxl_bootloader.c |    2 ++
>  tools/libxl/libxl_dom.c        |    1 +
>  tools/libxl/libxl_internal.h   |    1 +
>  4 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index caba637..6502325 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -145,7 +145,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>                  return;
>              }
>          }
> -    } else if (!libxl__ev_fd_isregistered(&dc->toread)) {
> +    } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
>          /* we have had eof */
>          datacopier_callback(egc, dc, 0, 0);
>          return;
> @@ -233,7 +233,8 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>          }
>          int r = read(ev->fd,
>                       buf->buf + buf->used,
> -                     sizeof(buf->buf) - buf->used);
> +                     (sizeof(buf->buf) - buf->used) < dc->maxread ?
> +                         (sizeof(buf->buf) - buf->used) : dc->maxread);
>          if (r < 0) {
>              if (errno == EINTR) continue;
>              if (errno == EWOULDBLOCK) break;
> @@ -258,7 +259,11 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>          }
>          buf->used += r;
>          dc->used += r;
> +        dc->maxread -= r;
>          assert(buf->used <= sizeof(buf->buf));
> +        assert(dc->maxread >= 0);
> +        if (dc->maxread == 0)
> +            break;
>      }
>      datacopier_check_state(egc, dc);
>  }
> diff --git a/tools/libxl/libxl_bootloader.c b/tools/libxl/libxl_bootloader.c
> index 79947d4..1503101 100644
> --- a/tools/libxl/libxl_bootloader.c
> +++ b/tools/libxl/libxl_bootloader.c
> @@ -516,6 +516,7 @@ static void bootloader_gotptys(libxl__egc *egc, libxl__openpty_state *op)
>  
>      bl->keystrokes.ao = ao;
>      bl->keystrokes.maxsz = BOOTLOADER_BUF_OUT;
> +    bl->keystrokes.maxread = INT_MAX;
>      bl->keystrokes.copywhat =
>          GCSPRINTF("bootloader input for domain %"PRIu32, bl->domid);
>      bl->keystrokes.callback =         bootloader_keystrokes_copyfail;
> @@ -527,6 +528,7 @@ static void bootloader_gotptys(libxl__egc *egc, libxl__openpty_state *op)
>  
>      bl->display.ao = ao;
>      bl->display.maxsz = BOOTLOADER_BUF_IN;
> +    bl->display.maxread = INT_MAX;
>      bl->display.copywhat =
>          GCSPRINTF("bootloader output for domain %"PRIu32, bl->domid);
>      bl->display.callback =         bootloader_display_copyfail;
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index 0dfdb08..2f74341 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -1717,6 +1717,7 @@ void libxl__domain_save_device_model(libxl__egc *egc,
>      dc->readfd = -1;
>      dc->writefd = fd;
>      dc->maxsz = INT_MAX;
> +    dc->maxread = INT_MAX;
>      dc->copywhat = GCSPRINTF("qemu save file for domain %"PRIu32, dss->domid);
>      dc->writewhat = "save/migration stream";
>      dc->callback = save_device_model_datacopier_done;
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 03e9978..d93a6ee 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2425,6 +2425,7 @@ struct libxl__datacopier_state {
>      libxl__ao *ao;
>      int readfd, writefd;
>      ssize_t maxsz;
> +    ssize_t maxread;
>      const char *copywhat, *readwhat, *writewhat; /* for error msgs */
>      FILE *log; /* gets a copy of everything */
>      libxl__datacopier_callback *callback;


* Re: [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer
  2014-09-10 17:11 ` [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer Andrew Cooper
@ 2014-09-11 12:03   ` Ian Campbell
  2014-09-11 12:26     ` Ross Lagerwall
  2014-09-12  8:49   ` Wen Congyang
  1 sibling, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ross Lagerwall, Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> Implement a read-only mode for libxl__datacopier. 

Read-only in the sense that it only reads, or in the sense that one may
only read from it?

What is the purpose of this mode?

Where does the data go?

>  The mode is invoked
> when readbuf is set and writefd is < 0.  On success, the callback passes
> the number of bytes read.
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  tools/libxl/libxl_aoutils.c  |   59 +++++++++++++++++++++++++-----------------
>  tools/libxl/libxl_internal.h |    4 ++-
>  2 files changed, 38 insertions(+), 25 deletions(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index 6502325..9183716 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -134,7 +134,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>      STATE_AO_GC(dc->ao);
>      int rc;
>      
> -    if (dc->used) {
> +    if (dc->used && !dc->readbuf) {
>          if (!libxl__ev_fd_isregistered(&dc->towrite)) {
>              rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
>                                         dc->writefd, POLLOUT);
> @@ -147,7 +147,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>          }
>      } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
>          /* we have had eof */
> -        datacopier_callback(egc, dc, 0, 0);
> +        datacopier_callback(egc, dc, 0, dc->readbuf ? dc->used : 0);
>          return;
>      } else {
>          /* nothing buffered, but still reading */
> @@ -215,26 +215,31 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>      }
>      assert(revents & POLLIN);
>      for (;;) {
> -        while (dc->used >= dc->maxsz) {
> -            libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
> -            dc->used -= rm->used;
> -            assert(dc->used >= 0);
> -            LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
> -            free(rm);
> -        }
> +        libxl__datacopier_buf *buf = NULL;
> +        int r;
> +
> +        if (dc->readbuf) {
> +            r = read(ev->fd, dc->readbuf + dc->used, dc->maxread);
> +        } else {
> +            while (dc->used >= dc->maxsz) {
> +                libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
> +                dc->used -= rm->used;
> +                assert(dc->used >= 0);
> +                LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
> +                free(rm);
> +            }
>  
> -        libxl__datacopier_buf *buf =
> -            LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
> -        if (!buf || buf->used >= sizeof(buf->buf)) {
> -            buf = malloc(sizeof(*buf));
> -            if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
> -            buf->used = 0;
> -            LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
> -        }
> -        int r = read(ev->fd,
> -                     buf->buf + buf->used,
> +            buf = LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
> +            if (!buf || buf->used >= sizeof(buf->buf)) {
> +                buf = malloc(sizeof(*buf));
> +                if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
> +                buf->used = 0;
> +                LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
> +            }
> +            r = read(ev->fd, buf->buf + buf->used,
>                       (sizeof(buf->buf) - buf->used) < dc->maxread ?
>                           (sizeof(buf->buf) - buf->used) : dc->maxread);
> +        }
>          if (r < 0) {
>              if (errno == EINTR) continue;
>              if (errno == EWOULDBLOCK) break;
> @@ -257,10 +262,12 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>                  return;
>              }
>          }
> -        buf->used += r;
> +        if (!dc->readbuf) {
> +            buf->used += r;
> +            assert(buf->used <= sizeof(buf->buf));
> +        }
>          dc->used += r;
>          dc->maxread -= r;
> -        assert(buf->used <= sizeof(buf->buf));
>          assert(dc->maxread >= 0);
>          if (dc->maxread == 0)
>              break;
> @@ -324,9 +331,13 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
>          if (rc) goto out;
>      }
>  
> -    rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
> -                               dc->writefd, POLLOUT);
> -    if (rc) goto out;
> +    if (dc->writefd >= 0) {
> +        rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
> +                                   dc->writefd, POLLOUT);
> +        if (rc) goto out;
> +    }
> +
> +    assert(dc->readfd >= 0 || dc->writefd >= 0);
>  
>      return 0;
>  
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index d93a6ee..056843a 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2404,7 +2404,8 @@ typedef struct libxl__datacopier_buf libxl__datacopier_buf;
>  
>  /* onwrite==1 means failure happened when writing, logged, errnoval is valid
>   * onwrite==0 means failure happened when reading
> - *     errnoval==0 means we got eof and all data was written
> + *     errnoval>=0 means we got eof and all data was written or number of bytes
> + *                 written when in read mode
>   *     errnoval!=0 means we had a read error, logged
>   * onwrite==-1 means some other internal failure, errnoval not valid, logged
>   * If we get POLLHUP, we call callback_pollhup(..., onwrite, -1);
> @@ -2433,6 +2434,7 @@ struct libxl__datacopier_state {
>      /* remaining fields are private to datacopier */
>      libxl__ev_fd toread, towrite;
>      ssize_t used;
> +    void *readbuf;
>      LIBXL_TAILQ_HEAD(libxl__datacopier_bufs, libxl__datacopier_buf) bufs;
>  };
>  


* Re: [PATCH 24/29] tools/libxl: Allow suppression of POLLHUP for datacopiers
  2014-09-10 17:11 ` [PATCH 24/29] tools/libxl: Allow suppression of POLLHUP for datacopiers Andrew Cooper
@ 2014-09-11 12:05   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:05 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> If the far end of a pipe has been closed, poll() will set POLLHUP.  When
> reading from a pipe, POLLIN|POLLHUP is a valid state, even when there is still
> data to be read.
> 
> Currently, datacopier will bail because of POLLHUP before discovering that
> there is valid data to be read.
> 
> Add an option to ignore POLLHUP for consumers who would prefer to read to EOF.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: Ian Campbell <Ian.Campbell@citrix.com>
> CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
> 
> ---
> 
> It might be easier/better to alter the POLLHUP handling, but I am struggling
> to work out what effect that would have on the bootloader pty handling.

I was just about to ask why you weren't doing this. I'll leave the
answer to Ian.
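
For reference, the POLLIN|POLLHUP case from the commit message can be
reproduced outside libxl.  The following is only a minimal synchronous
sketch, not datacopier code: drain_to_eof() is a hypothetical helper, and it
simply tolerates POLLHUP next to POLLIN and relies on read() returning 0 to
detect EOF.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libxl): read fd until EOF and return the
 * number of bytes seen, or -1 on error.  POLLHUP is accepted alongside
 * POLLIN, because a closed writer may still have left data in the pipe.
 */
ssize_t drain_to_eof(int fd)
{
    ssize_t total = 0;

    assert(fd >= 0);
    for (;;) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        char buf[256];
        ssize_t r;

        if (poll(&p, 1, -1) < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        /* Anything other than POLLIN/POLLHUP is treated as an error. */
        if (p.revents & ~(POLLIN | POLLHUP))
            return -1;

        r = read(fd, buf, sizeof(buf));
        if (r < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (r == 0)
            return total;   /* EOF: the writer closed and the pipe is empty */
        total += r;
    }
}
```

Bailing out on POLLHUP alone would lose whatever was written before the far
end closed; reading until read() returns 0 does not.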

> ---
>  tools/libxl/libxl_aoutils.c  |    2 +-
>  tools/libxl/libxl_internal.h |    1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index 9183716..2b39432 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -207,7 +207,7 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>      if (datacopier_pollhup_handled(egc, dc, revents, 0))
>          return;
>  
> -    if (revents & ~POLLIN) {
> +    if (revents & ~(POLLIN | (dc->suppress_pollhup ? POLLHUP : 0))) {
>          LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)"
>              " on %s during copy of %s", revents, dc->readwhat, dc->copywhat);
>          datacopier_callback(egc, dc, -1, 0);
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 056843a..537b523 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2431,6 +2431,7 @@ struct libxl__datacopier_state {
>      FILE *log; /* gets a copy of everything */
>      libxl__datacopier_callback *callback;
>      libxl__datacopier_callback *callback_pollhup;
> +    int suppress_pollhup;
>      /* remaining fields are private to datacopier */
>      libxl__ev_fd toread, towrite;
>      ssize_t used;


* Re: [PATCH 25/29] tools/libxl: Stream v2 format
  2014-09-10 17:11 ` [PATCH 25/29] tools/libxl: Stream v2 format Andrew Cooper
@ 2014-09-11 12:06   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ross Lagerwall, Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:

> +#define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))

I thought we ended up with a common version of this somewhere, but I
can't find it now.

> +
> +#define EMULATOR_UNKNOWN             0x00000000U
> +#define EMULATOR_QEMU_TRADITIONAL    0x00000001U
> +#define EMULATOR_QEMU_UPSTREAM       0x00000002U
> +
> +#define RESTORE_STREAM_IDENT         0x4c6962786c466d74UL

Does this compile without warnings for 32-bit?

Ian.


* Re: [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier
  2014-09-11 12:01   ` Ian Campbell
@ 2014-09-11 12:17     ` Ross Lagerwall
  2014-09-11 12:39       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Ross Lagerwall @ 2014-09-11 12:17 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On 09/11/2014 01:01 PM, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>>
>> Previously, adding more than 1000 bytes of data would cause a segfault.
>
> Segfault is a symptom, not a cause, why does it happen?

struct libxl__datacopier_buf contains a fixed size 1000 byte statically 
allocated buffer so adding > 1000 bytes of data would cause it to 
overrun the buffer and overwrite other memory.
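
As a standalone illustration of the fix (a sketch only, not the libxl code),
the idea is to split the input across as many fixed-size buffers as needed
instead of copying everything into one.  struct chunk, CHUNK_SIZE and
chunk_append() are hypothetical stand-ins for libxl__datacopier_buf and the
TAILQ handling in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 1000   /* assumed to mirror the fixed 1000 byte buffer */

/* Hypothetical stand-in for libxl__datacopier_buf. */
struct chunk {
    struct chunk *next;
    size_t used;
    char buf[CHUNK_SIZE];
};

/*
 * Append len bytes of data to the list at *head, using at most CHUNK_SIZE
 * bytes per chunk.  Returns the number of chunks added, or -1 if an
 * allocation fails.
 */
int chunk_append(struct chunk **head, const void *data, size_t len)
{
    const char *ptr = data;
    struct chunk **link = head;
    int added = 0;

    assert(head != NULL);
    while (*link)                 /* find the tail of the list */
        link = &(*link)->next;

    while (len) {
        struct chunk *c = calloc(1, sizeof(*c));
        if (!c) return -1;
        /* Never copy more than one buffer's worth at a time. */
        c->used = len < sizeof(c->buf) ? len : sizeof(c->buf);
        memcpy(c->buf, ptr, c->used);
        *link = c;
        link = &c->next;
        ptr += c->used;
        len -= c->used;
        added++;
    }
    return added;
}
```

With 2500 bytes of input this yields three chunks of 1000, 1000 and 500
bytes, where a single memcpy() into one chunk would have overrun it.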

Regards
-- 
Ross Lagerwall


* Re: [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier
  2014-09-11 12:02   ` Ian Campbell
@ 2014-09-11 12:23     ` Ross Lagerwall
  2014-09-11 12:40       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Ross Lagerwall @ 2014-09-11 12:23 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On 09/11/2014 01:02 PM, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>>
>> Add a parameter, maxread, to limit the amount of data read from the
>> source fd of a datacopier.
>
> Why is this useful?
>

It is useful for splicing _part_ of a stream from one fd to another 
rather than splicing the whole stream.  i.e. it is used in this series 
for writing out the emulator context part of the libxl stream to a qemu 
save file.
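
To make the "part of a stream" case concrete, here is a minimal synchronous
sketch.  It is not the asynchronous datacopier implementation;
copy_limited() is a hypothetical helper, and its maxread argument only plays
the same role as the new field: the copy stops after that many bytes and
leaves the rest of the input unread.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy at most maxread bytes from infd to outfd.
 * Returns the number of bytes copied (less than maxread if EOF is reached
 * first), or -1 on error.
 */
ssize_t copy_limited(int infd, int outfd, ssize_t maxread)
{
    char buf[1000];
    ssize_t copied = 0;

    assert(maxread >= 0);
    while (maxread > 0) {
        /* Never ask for more than the remaining limit. */
        size_t want = (size_t)maxread < sizeof(buf) ? (size_t)maxread
                                                    : sizeof(buf);
        ssize_t off = 0;
        ssize_t r = read(infd, buf, want);

        if (r < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (r == 0)
            break;                      /* EOF before the limit */

        while (off < r) {               /* handle short writes */
            ssize_t w = write(outfd, buf + off, (size_t)(r - off));
            if (w < 0) {
                if (errno == EINTR) continue;
                return -1;
            }
            off += w;
        }
        copied += r;
        maxread -= r;
    }
    return copied;
}
```

After the call, infd is positioned just past the copied region, so the
caller can carry on parsing the remainder of the stream.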

-- 
Ross Lagerwall


* Re: [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer
  2014-09-11 12:03   ` Ian Campbell
@ 2014-09-11 12:26     ` Ross Lagerwall
  2014-09-11 12:41       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Ross Lagerwall @ 2014-09-11 12:26 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On 09/11/2014 01:03 PM, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>>
>> Implement a read-only mode for libxl__datacopier.
>
> Read-only in the sense that it only reads, or in the sense that one may
> only read from it?

It only reads rather than reading and writing.

>
> What is the purpose of this mode?
>
> Where does the data go?
>

This mode allows data from an fd to be read into memory using libxl's 
async framework. The data goes into the user-allocated 
libxl__datacopier_state.readbuf.
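
A blocking sketch of the same idea, for illustration only: read_into() is a
hypothetical helper, whereas the real mode is driven by libxl's event loop.
The caller owns the buffer, and the return value is the number of bytes
actually read, which is what the callback reports on success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to maxread bytes from fd into the
 * caller-allocated readbuf.  Returns the number of bytes read (less than
 * maxread only if EOF is reached first), or -1 on error.
 */
ssize_t read_into(int fd, void *readbuf, size_t maxread)
{
    size_t used = 0;

    assert(readbuf != NULL);
    while (used < maxread) {
        /* Append after what has been read so far; never exceed maxread. */
        ssize_t r = read(fd, (char *)readbuf + used, maxread - used);

        if (r < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (r == 0)
            break;              /* EOF */
        used += (size_t)r;
    }
    return (ssize_t)used;
}
```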

Regards
-- 
Ross Lagerwall


* Re: [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams
  2014-09-10 17:11 ` [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams Andrew Cooper
@ 2014-09-11 12:35   ` Ian Campbell
  2014-09-11 13:01     ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ross Lagerwall, Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> TODO:
>  * Integrate with the json series

How much will that invalidate the patch wrt my reviewing it now?

> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  tools/libxl/libxl_create.c   |  310 ++++++++++++++++++++++++++++++++++++++++--

Would it be possible to move most/all of the restore code into a new
file? If it doesn't involve exposing too many currently static bits then
that might be a good thing to do.


* Re: [PATCH 27/29] [VERY RFC] tools/libxl: Support restoring legacy streams
  2014-09-10 17:11 ` [PATCH 27/29] [VERY RFC] tools/libxl: Support restoring legacy streams Andrew Cooper
@ 2014-09-11 12:36   ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> This WorksForMe in the success case, but the error handling is certainly lacking.
> 
> Specifically, the conversion script's output fd can't be closed until the v2
> read loop has exited (cleanly or otherwise), without risking a close()/open()
> race silently replacing the fd behind the loop's back.
> 
> However, it can't be closed when the read loop exits, as the conversion script
> child might still be alive, and would prefer terminating cleanly to failing
> with a bad FD.
> 
> Obviously, having one error handler block for the success/failure of the other
> side is a no-go, and would still involve preselecting which was expected to
> exit first.
> 
> Does anyone have any clever ideas of how to asynchronously collect the events
> "the conversion script has exited", "the save helper has exited" and "the v2
> read loop has finished" given the available infrastructure, to kick off a
> combined cleanup of all 3?
> 
> (I also need to fix the conversion script info/error logging, but that is a
> distinctly more minor problem.)

This is probably one for Ian when he gets back, but a state machine
which is cranked in response to the callbacks from the various
completion events might be one way to approach this.
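
A rough sketch of what such a state machine could look like, independent of
the libxl event infrastructure.  All names here (struct restore_join, the
EV_* bits, join_init(), join_event()) are hypothetical.  Each completion
callback reports its event, the first non-zero rc is remembered, and the
combined cleanup runs exactly once, when the last of the three events has
arrived, whatever the order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bits, one per thing that has to finish. */
enum {
    EV_CONVERTER_EXITED = 1u << 0,  /* the conversion script has exited */
    EV_HELPER_EXITED    = 1u << 1,  /* the save helper has exited */
    EV_READ_LOOP_DONE   = 1u << 2,  /* the v2 read loop has finished */
    EV_ALL              = (1u << 3) - 1
};

struct restore_join {
    unsigned pending;   /* events still outstanding */
    int rc;             /* first failure seen, or 0 */
    bool cleaned_up;    /* set once the combined cleanup has run */
};

void join_init(struct restore_join *j)
{
    j->pending = EV_ALL;
    j->rc = 0;
    j->cleaned_up = false;
}

/*
 * Called from each completion callback.  Returns true only for the call
 * that delivered the final outstanding event; fds would be closed there.
 */
bool join_event(struct restore_join *j, unsigned ev, int rc)
{
    assert(j->pending & ev);        /* each event is delivered once */
    j->pending &= ~ev;
    if (rc && !j->rc)
        j->rc = rc;                 /* keep the first error */
    if (j->pending)
        return false;               /* still waiting for someone */
    j->cleaned_up = true;           /* combined cleanup goes here */
    return true;
}
```

Because nothing is torn down until every participant has reported, no
handler has to block on another, and no exit order has to be assumed.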


* Re: [PATCH 20/29] tools/libxl: Update datacopier to support sending data only
  2014-09-11 12:00     ` Andrew Cooper
@ 2014-09-11 12:39       ` Ian Campbell
  2014-09-11 13:03         ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:39 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Wen Congyang, Xen-devel

On Thu, 2014-09-11 at 13:00 +0100, Andrew Cooper wrote:
> On 11/09/14 12:56, Ian Campbell wrote:
> > On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >> From: Wen Congyang <wency@cn.fujitsu.com>
> >>
> >> datacopier is to read some data and write it out. If we
> >> have some data to send it over network, we cannot use
> >> datacopier. Update it to support this case.
> > I'm afraid I'm not sure what this changelog is saying. If we have data
> > to send over the network then why can't we use datacopier? How does
> > making this conditional on readfd's validity help that case?
> 
> datacopier was originally strictly copying data from infd to outfd,
> until EOF.
> 
> These patches around here in the series extend datacopier to include
> "please write this local buffer to outfd", "please read $N bytes from
> infd to this local buffer", and "please copy exactly $N bytes from infd
> to outfd", all as asynchronous operations.

Thanks, that all needs to be clearly explained in the commit log.

I'm going to leave it to Ian to decide if this is an acceptable
repurposing of the datacopier infrastructure.

Ian.


* Re: [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier
  2014-09-11 12:17     ` Ross Lagerwall
@ 2014-09-11 12:39       ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:39 UTC (permalink / raw)
  To: Ross Lagerwall; +Cc: Andrew Cooper, Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 13:17 +0100, Ross Lagerwall wrote:
> On 09/11/2014 01:01 PM, Ian Campbell wrote:
> > On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> >>
> >> Previously, adding more than 1000 bytes of data would cause a segfault.
> >
> > Segfault is a symptom, not a cause, why does it happen?
> 
> struct libxl__datacopier_buf contains a fixed size 1000 byte statically 
> allocated buffer so adding > 1000 bytes of data would cause it to 
> overrun the buffer and overwrite other memory.

Yes, this should be the main point of the commit log though.

Ian.


* Re: [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier
  2014-09-11 12:23     ` Ross Lagerwall
@ 2014-09-11 12:40       ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:40 UTC (permalink / raw)
  To: Ross Lagerwall; +Cc: Andrew Cooper, Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 13:23 +0100, Ross Lagerwall wrote:
> On 09/11/2014 01:02 PM, Ian Campbell wrote:
> > On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> >> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> >>
> >> Add a parameter, maxread, to limit the amount of data read from the
> >> source fd of a datacopier.
> >
> > Why is this useful?
> >
> 
> It is useful for splicing _part_ of a stream from one fd to another 
> rather than splicing the whole stream.  i.e. it is used in this series 
> for writing out the emulator context part of the libxl stream to a qemu 
> save file.

Thanks, again this should be part of the commit message.


* Re: [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer
  2014-09-11 12:26     ` Ross Lagerwall
@ 2014-09-11 12:41       ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 12:41 UTC (permalink / raw)
  To: Ross Lagerwall; +Cc: Andrew Cooper, Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 13:26 +0100, Ross Lagerwall wrote:
> On 09/11/2014 01:03 PM, Ian Campbell wrote:
> > On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
> >> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> >>
> >> Implement a read-only mode for libxl__datacopier.
> >
> > Read-only in the sense that it only reads, or in the sense that one may
> > only read from it?
> 
> It only reads rather than reading and writing.
> 
> >
> > What is the purpose of this mode?
> >
> > Where does the data go?
> >
> 
> This mode allows data from an fd to be read into memory using libxl's 
> async framework. The data goes into the user-allocated 
> libxl__datacopier_state.readbuf.

OK. I think you can probably guess what I'm going to say.

In addition to the commit log, I'm sure that the doc comments in the
headers need some updating for all this new functionality (here
and the surrounding patches).

Ian.


* Re: [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams
  2014-09-11 12:35   ` Ian Campbell
@ 2014-09-11 13:01     ` Andrew Cooper
  0 siblings, 0 replies; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 13:01 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ross Lagerwall, Ian Jackson, Xen-devel

On 11/09/14 13:35, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:11 +0100, Andrew Cooper wrote:
>> TODO:
>>  * Integrate with the json series
> How much will that invalidate the patch wrt my reviewing it now?

Hopefully not much, but it does complicate the use of d_config.  Perhaps
we can start with d_config being unconditional, and the json series can
make it optional in the case a blob can be found in the stream.

>
>> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>  tools/libxl/libxl_create.c   |  310 ++++++++++++++++++++++++++++++++++++++++--
> Would it be possible to move most/all of the restore code into a new
> file? If it doesn't involve exposing too many currently static bits then
> that might be a good thing to do.
>
>
>

Most code should be self-contained.  libxl_domain_{save,restore}.c sound
like a good idea.

~Andrew


* Re: [PATCH 20/29] tools/libxl: Update datacopier to support sending data only
  2014-09-11 12:39       ` Ian Campbell
@ 2014-09-11 13:03         ` Andrew Cooper
  2014-09-11 13:04           ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 13:03 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Wen Congyang, Xen-devel

On 11/09/14 13:39, Ian Campbell wrote:
> On Thu, 2014-09-11 at 13:00 +0100, Andrew Cooper wrote:
>> On 11/09/14 12:56, Ian Campbell wrote:
>>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>>
>>>> datacopier is to read some data and write it out. If we
>>>> have some data to send it over network, we cannot use
>>>> datacopier. Update it to support this case.
>>> I'm afraid I'm not sure what this changelog is saying. If we have data
>>> to send over the network then why can't we use datacopier? How does
>>> making this conditional on readfd's validity help that case?
>> datacopier was originally strictly copying data from infd to outfd,
>> until EOF.
>>
>> These patches around here in the series extend datacopier to include
>> "please write this local buffer to outfd", "please read $N bytes from
>> infd to this local buffer", and "please copy exactly $N bytes from infd
>> to outfd", all as asynchronous operations.
> Thanks, that all needs to be clearly explained in the commit log.
>
> I'm going to leave it to Ian to decide if this is an acceptable
> repurposing of the datacopier infrastructure.
>
> Ian.
>

Ian explicitly requested modification of datacopier in preference to
adding similar infrastructure with a different name, when I talked to
him in person.

~Andrew


* Re: [PATCH 20/29] tools/libxl: Update datacopier to support sending data only
  2014-09-11 13:03         ` Andrew Cooper
@ 2014-09-11 13:04           ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 13:04 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Wen Congyang, Xen-devel

On Thu, 2014-09-11 at 14:03 +0100, Andrew Cooper wrote:
> On 11/09/14 13:39, Ian Campbell wrote:
> > On Thu, 2014-09-11 at 13:00 +0100, Andrew Cooper wrote:
> >> On 11/09/14 12:56, Ian Campbell wrote:
> >>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >>>> From: Wen Congyang <wency@cn.fujitsu.com>
> >>>>
> >>>> datacopier is to read some data and write it out. If we
> >>>> have some data to send it over network, we cannot use
> >>>> datacopier. Update it to support this case.
> >>> I'm afraid I'm not sure what this changelog is saying. If we have data
> >>> to send over the network then why can't we use datacopier? How does
> >>> making this conditional on readfd's validity help that case?
> >> datacopier was originally strictly copying data from infd to outfd,
> >> until EOF.
> >>
> >> These patches around here in the series extend datacopier to include
> >> "please write this local buffer to outfd", "please read $N bytes from
> >> infd to this local buffer", and "please copy exactly $N bytes from infd
> >> to outfd", all as asynchronous operations.
> > Thanks, that all needs to be clearly explained in the commit log.
> >
> > I'm going to leave it to Ian to decide if this is an acceptable
> > repurposing of the datacopier infrastructure.
> >
> > Ian.
> >
> 
> Ian explicitly requested modification of datacopier in preference to
> adding similar infrastructure with a different name, when I talked to
> him in person.

OK.

Ian.


* Re: [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure
  2014-09-11 10:32   ` Ian Campbell
@ 2014-09-11 14:03     ` Andrew Cooper
  2014-09-11 14:06       ` Ian Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-11 14:03 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 11/09/14 11:32, Ian Campbell wrote:
> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>> Not everything which needs reporting as progress comes with a range.
> Please can you expand on your reasoning wrt the places where you are
> removing a usage of start with a size.

I suppose more correctly, "I wish to introduce new uses of the progress
functionality, which involves reporting progress without a range".

>  Especially since you aren't
> changing any subsequent progress_step call to your new singleton
> variant.

xc_domain_{save,restore}() are the only generators of progress, and they
are/were being completely replaced, which would make
xc_domain_{save,restore}2() the only generators of progress.

>
>> Allow reporting "0 of 0" for a single progress statement.
> Can we not arrange to suppress this entirely? Perhaps by turning total=0
> into percent=-1 and having the lower level code omit that bit of the
> message in that case? If doing that you should update xentoollog.h to
> make this clear.

The text string is generated by the provider of the progress callback,
and I wanted a way which didn't alter this callback.

Changing to use percent=-1 for this would also work, but would produce less
meaningful messages from an existing implementation expecting the old
behaviour.

>
> It appears you are also doing more than just this as well, by changing
> start to set for some reason. Even if you want to change the parameters
> it's not clear that the new name is in any way an improvement.
>
> Thirdly you appear to also be arranging to make it allowable to call
> progress_step without having previously called progress_start/set. Why
> is this needed?

It logically separates setting the string describing the current step,
and providing the progress numbers.

This is IMO a failure in the previous design, and the distinction is
used by the v2 code so the common functions to shunt pages don't need to
be handed a string from their caller to identify the exact current step
(which has already been latched in xch anyway).
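
A minimal sketch of that separation, assuming hypothetical names
(progress_ctx, progress_set_step, progress_format) rather than the real
libxc/xentoollog interfaces. The step description is latched once, and the
numbers are supplied later by code that never sees the string. Treating
total == 0 as "no range" and omitting the numbers is only one possible
policy (the suppression suggested in the review), not what the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical context; stands in for state latched in a handle such as xch. */
struct progress_ctx {
    const char *doing;   /* description of the current step */
};

/* Latch the description of the step that is about to run. */
static void progress_set_step(struct progress_ctx *ctx, const char *doing)
{
    ctx->doing = doing;
}

/*
 * Format one progress line from the latched description plus numbers.
 * total == 0 is treated as "no range": only the description is emitted.
 * Returns whatever snprintf() returns.
 */
static int progress_format(const struct progress_ctx *ctx,
                           unsigned long done, unsigned long total,
                           char *out, size_t outlen)
{
    const char *what = ctx->doing ? ctx->doing : "(no step set)";

    if (total == 0)
        return snprintf(out, outlen, "%s", what);

    assert(done <= total);
    return snprintf(out, outlen, "%s: %lu/%lu (%lu%%)",
                    what, done, total, done * 100 / total);
}
```

A caller would invoke progress_set_step() once per phase, while the common
page-shunting code calls progress_format() with numbers only.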

~Andrew

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure
  2014-09-11 14:03     ` Andrew Cooper
@ 2014-09-11 14:06       ` Ian Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ian Campbell @ 2014-09-11 14:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Thu, 2014-09-11 at 15:03 +0100, Andrew Cooper wrote:
> On 11/09/14 11:32, Ian Campbell wrote:
> > On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >> Not everything which needs reporting as progress comes with a range.
> > Please can you expand on your reasoning wrt the places where you are
> > removing a usage of start with a size.
> 
> I suppose more correctly, "I wish to introduce new uses of the progress
> functionality, which involves reporting progress without a range".
> 
> >  Especially since you aren't
> > changing any subsequent progress_step call to your new singleton
> > variant.
> 
> xc_domain_{save,restore}() are the only generators of progress, and they
> are/were being completely replaced, which would make
> xc_domain_{save,restore}2() the only generators of progress.
> 
> >
> >> Allow reporting "0 of 0" for a single progress statement.
> > Can we not arrange to suppress this entirely? Perhaps by turning total=0
> > into percent=-1 and having the lower level code omit that bit of the
> > message in that case? If doing that you should update xentoollog.h to
> > make this clear.
> 
> The text string is generated by the provider of the progress callback,
> and I wanted a way which didn't alter this callback.
> 
> Changing to use percent=-1 for this would also work, but would produce less
> meaningful messages from an existing implementation expecting the old
> behaviour.
> 
> >
> > It appears you are also doing more than just this as well, by changing
> > start to set for some reason. Even if you want to change the parameters
> > it's not clear that the new name is in any way an improvement.
> >
> > Thirdly you appear to also be arranging to make it allowable to call
> > progress_step without having previously called progress_start/set. Why
> > is this needed?
> 
> It logically separates setting the string describing the current step,
> and providing the progress numbers.
> 
> This is IMO a failure in the previous design, and the distinction is
> used by the v2 code so the common functions to shunt pages don't need to
> be handed a string from their caller to identify the exact current step
> (which has already been latched in xch anyway).

This may well all be reasonable, in which case please explain it in the
commit log.

> 
> ~Andrew
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier
  2014-09-10 17:11 ` [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier Andrew Cooper
  2014-09-11 12:02   ` Ian Campbell
@ 2014-09-12  8:36   ` Wen Congyang
  2014-09-19  7:45     ` Ross Lagerwall
  1 sibling, 1 reply; 79+ messages in thread
From: Wen Congyang @ 2014-09-12  8:36 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Ross Lagerwall, Ian Jackson, Ian Campbell

On 09/11/2014 01:11 AM, Andrew Cooper wrote:
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> Add a parameter, maxread, to limit the amount of data read from the
> source fd of a datacopier.
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  tools/libxl/libxl_aoutils.c    |    9 +++++++--
>  tools/libxl/libxl_bootloader.c |    2 ++
>  tools/libxl/libxl_dom.c        |    1 +
>  tools/libxl/libxl_internal.h   |    1 +
>  4 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index caba637..6502325 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -145,7 +145,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>                  return;
>              }
>          }
> -    } else if (!libxl__ev_fd_isregistered(&dc->toread)) {
> +    } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
>          /* we have had eof */
>          datacopier_callback(egc, dc, 0, 0);
>          return;
> @@ -233,7 +233,8 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>          }
>          int r = read(ev->fd,
>                       buf->buf + buf->used,
> -                     sizeof(buf->buf) - buf->used);
> +                     (sizeof(buf->buf) - buf->used) < dc->maxread ?
> +                         (sizeof(buf->buf) - buf->used) : dc->maxread);
>          if (r < 0) {
>              if (errno == EINTR) continue;
>              if (errno == EWOULDBLOCK) break;
> @@ -258,7 +259,11 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>          }
>          buf->used += r;
>          dc->used += r;
> +        dc->maxread -= r;
>          assert(buf->used <= sizeof(buf->buf));
> +        assert(dc->maxread >= 0);
> +        if (dc->maxread == 0)
> +            break;

We can call libxl__ev_fd_deregister() here, so there is no need to touch
datacopier_check_state(). Otherwise, datacopier_readable() may be called again,
read() returns 0, and only then is libxl__ev_fd_deregister() called.
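
A minimal sketch of that idea outside libxl, assuming a hypothetical
bounded_reader struct whose "registered" flag stands in for the fd event
registration. Each read is clamped to the remaining budget, and the reader
stops listening as soon as the budget hits zero, so no extra readable
callback is needed just to observe a zero-length read.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the datacopier state; not the libxl struct. */
struct bounded_reader {
    int fd;            /* source descriptor */
    ssize_t maxread;   /* bytes we are still allowed to read */
    int registered;    /* 1 while we would still poll fd for POLLIN */
};

/* Never ask read() for more than the remaining budget. */
static size_t clamp_request(size_t space, ssize_t maxread)
{
    assert(maxread >= 0);
    return space < (size_t)maxread ? space : (size_t)maxread;
}

/*
 * Perform one read into buf. Returns bytes read, 0 if we are no longer
 * registered or hit EOF, and -1 on error. Deregisters on EOF or when the
 * budget is exhausted.
 */
static ssize_t bounded_read(struct bounded_reader *br, char *buf, size_t space)
{
    ssize_t r;

    if (!br->registered)
        return 0;

    r = read(br->fd, buf, clamp_request(space, br->maxread));
    if (r < 0)
        return -1;

    br->maxread -= r;
    if (r == 0 || br->maxread == 0)
        br->registered = 0;   /* analogous to libxl__ev_fd_deregister() */
    return r;
}
```

With this shape, the "we have had eof" test only needs to look at the
registration state, which is the point of the comment above.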

Thanks
Wen Congyang

>      }
>      datacopier_check_state(egc, dc);
>  }
> diff --git a/tools/libxl/libxl_bootloader.c b/tools/libxl/libxl_bootloader.c
> index 79947d4..1503101 100644
> --- a/tools/libxl/libxl_bootloader.c
> +++ b/tools/libxl/libxl_bootloader.c
> @@ -516,6 +516,7 @@ static void bootloader_gotptys(libxl__egc *egc, libxl__openpty_state *op)
>  
>      bl->keystrokes.ao = ao;
>      bl->keystrokes.maxsz = BOOTLOADER_BUF_OUT;
> +    bl->keystrokes.maxread = INT_MAX;
>      bl->keystrokes.copywhat =
>          GCSPRINTF("bootloader input for domain %"PRIu32, bl->domid);
>      bl->keystrokes.callback =         bootloader_keystrokes_copyfail;
> @@ -527,6 +528,7 @@ static void bootloader_gotptys(libxl__egc *egc, libxl__openpty_state *op)
>  
>      bl->display.ao = ao;
>      bl->display.maxsz = BOOTLOADER_BUF_IN;
> +    bl->display.maxread = INT_MAX;
>      bl->display.copywhat =
>          GCSPRINTF("bootloader output for domain %"PRIu32, bl->domid);
>      bl->display.callback =         bootloader_display_copyfail;
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index 0dfdb08..2f74341 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -1717,6 +1717,7 @@ void libxl__domain_save_device_model(libxl__egc *egc,
>      dc->readfd = -1;
>      dc->writefd = fd;
>      dc->maxsz = INT_MAX;
> +    dc->maxread = INT_MAX;
>      dc->copywhat = GCSPRINTF("qemu save file for domain %"PRIu32, dss->domid);
>      dc->writewhat = "save/migration stream";
>      dc->callback = save_device_model_datacopier_done;
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 03e9978..d93a6ee 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2425,6 +2425,7 @@ struct libxl__datacopier_state {
>      libxl__ao *ao;
>      int readfd, writefd;
>      ssize_t maxsz;
> +    ssize_t maxread;
>      const char *copywhat, *readwhat, *writewhat; /* for error msgs */
>      FILE *log; /* gets a copy of everything */
>      libxl__datacopier_callback *callback;
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer
  2014-09-10 17:11 ` [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer Andrew Cooper
  2014-09-11 12:03   ` Ian Campbell
@ 2014-09-12  8:49   ` Wen Congyang
  2014-09-19  7:48     ` Ross Lagerwall
  1 sibling, 1 reply; 79+ messages in thread
From: Wen Congyang @ 2014-09-12  8:49 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Ross Lagerwall, Ian Jackson, Ian Campbell

On 09/11/2014 01:11 AM, Andrew Cooper wrote:
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> Implement a read-only mode for libxl__datacopier.  The mode is invoked
> when readbuf is set and writefd is < 0.  On success, the callback passes
> the number of bytes read.
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  tools/libxl/libxl_aoutils.c  |   59 +++++++++++++++++++++++++-----------------
>  tools/libxl/libxl_internal.h |    4 ++-
>  2 files changed, 38 insertions(+), 25 deletions(-)
> 
> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
> index 6502325..9183716 100644
> --- a/tools/libxl/libxl_aoutils.c
> +++ b/tools/libxl/libxl_aoutils.c
> @@ -134,7 +134,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>      STATE_AO_GC(dc->ao);
>      int rc;
>      
> -    if (dc->used) {
> +    if (dc->used && !dc->readbuf) {
>          if (!libxl__ev_fd_isregistered(&dc->towrite)) {
>              rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
>                                         dc->writefd, POLLOUT);
> @@ -147,7 +147,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>          }
>      } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
>          /* we have had eof */
> -        datacopier_callback(egc, dc, 0, 0);
> +        datacopier_callback(egc, dc, 0, dc->readbuf ? dc->used : 0);
>          return;
>      } else {
>          /* nothing buffered, but still reading */
> @@ -215,26 +215,31 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>      }
>      assert(revents & POLLIN);
>      for (;;) {
> -        while (dc->used >= dc->maxsz) {
> -            libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
> -            dc->used -= rm->used;
> -            assert(dc->used >= 0);
> -            LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
> -            free(rm);
> -        }
> +        libxl__datacopier_buf *buf = NULL;
> +        int r;
> +
> +        if (dc->readbuf) {
> +            r = read(ev->fd, dc->readbuf + dc->used, dc->maxread);
> +        } else {
> +            while (dc->used >= dc->maxsz) {
> +                libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
> +                dc->used -= rm->used;
> +                assert(dc->used >= 0);
> +                LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
> +                free(rm);
> +            }
>  
> -        libxl__datacopier_buf *buf =
> -            LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
> -        if (!buf || buf->used >= sizeof(buf->buf)) {
> -            buf = malloc(sizeof(*buf));
> -            if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
> -            buf->used = 0;
> -            LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
> -        }
> -        int r = read(ev->fd,
> -                     buf->buf + buf->used,
> +            buf = LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
> +            if (!buf || buf->used >= sizeof(buf->buf)) {
> +                buf = malloc(sizeof(*buf));
> +                if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
> +                buf->used = 0;
> +                LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
> +            }
> +            r = read(ev->fd, buf->buf + buf->used,
>                       (sizeof(buf->buf) - buf->used) < dc->maxread ?
>                           (sizeof(buf->buf) - buf->used) : dc->maxread);
> +        }
>          if (r < 0) {
>              if (errno == EINTR) continue;
>              if (errno == EWOULDBLOCK) break;
> @@ -257,10 +262,12 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>                  return;
>              }
>          }
> -        buf->used += r;
> +        if (!dc->readbuf) {
> +            buf->used += r;
> +            assert(buf->used <= sizeof(buf->buf));
> +        }
>          dc->used += r;
>          dc->maxread -= r;
> -        assert(buf->used <= sizeof(buf->buf));
>          assert(dc->maxread >= 0);
>          if (dc->maxread == 0)
>              break;
> @@ -324,9 +331,13 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
>          if (rc) goto out;
>      }
>  
> -    rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
> -                               dc->writefd, POLLOUT);
> -    if (rc) goto out;
> +    if (dc->writefd >= 0) {
> +        rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
> +                                   dc->writefd, POLLOUT);
> +        if (rc) goto out;
> +    }
> +
> +    assert(dc->readfd >= 0 || dc->writefd >= 0);

If readfd and writefd are both valid, and readbuf is not NULL, what is the behavior?
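
One way to make the contract explicit would be to classify the combination
up front and reject the ambiguous one. This is only a sketch: dc_mode and
dc_classify are hypothetical names, the write-only mode is assumed from the
earlier description of the series, and the patch as posted does not perform
this check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical modes; the names are illustrative, not from libxl. */
enum dc_mode {
    DC_MODE_INVALID,
    DC_MODE_COPY,          /* readfd -> writefd */
    DC_MODE_READ_TO_BUF,   /* readfd -> readbuf, no writefd */
    DC_MODE_WRITE_ONLY     /* already-queued data -> writefd, no readfd */
};

/* Decide which mode a (readfd, writefd, readbuf) combination selects. */
static enum dc_mode dc_classify(int readfd, int writefd, const void *readbuf)
{
    if (readfd >= 0 && writefd >= 0)
        return readbuf ? DC_MODE_INVALID : DC_MODE_COPY;
    if (readfd >= 0)
        return readbuf ? DC_MODE_READ_TO_BUF : DC_MODE_INVALID;
    if (writefd >= 0)
        return readbuf ? DC_MODE_INVALID : DC_MODE_WRITE_ONLY;
    return DC_MODE_INVALID;
}
```

An assert(dc_classify(...) != DC_MODE_INVALID) at start time would then turn
the case asked about here into a programming error rather than undefined
behaviour.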

Thanks
Wen Congyang

>  
>      return 0;
>  
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index d93a6ee..056843a 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2404,7 +2404,8 @@ typedef struct libxl__datacopier_buf libxl__datacopier_buf;
>  
>  /* onwrite==1 means failure happened when writing, logged, errnoval is valid
>   * onwrite==0 means failure happened when reading
> - *     errnoval==0 means we got eof and all data was written
> + *     errnoval>=0 means we got eof and all data was written or number of bytes
> + *                 written when in read mode
>   *     errnoval!=0 means we had a read error, logged
>   * onwrite==-1 means some other internal failure, errnoval not valid, logged
>   * If we get POLLHUP, we call callback_pollhup(..., onwrite, -1);
> @@ -2433,6 +2434,7 @@ struct libxl__datacopier_state {
>      /* remaining fields are private to datacopier */
>      libxl__ev_fd toread, towrite;
>      ssize_t used;
> +    void *readbuf;
>      LIBXL_TAILQ_HEAD(libxl__datacopier_bufs, libxl__datacopier_buf) bufs;
>  };
>  
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-11 11:04         ` Andrew Cooper
  2014-09-11 11:10           ` Ian Campbell
@ 2014-09-14 10:23           ` Shriram Rajagopalan
  2014-09-15 15:09             ` Andrew Cooper
  1 sibling, 1 reply; 79+ messages in thread
From: Shriram Rajagopalan @ 2014-09-14 10:23 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2962 bytes --]

On Sep 11, 2014 4:08 AM, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
> On 11/09/14 12:01, Ian Campbell wrote:
> > On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
> >> On 11/09/14 11:34, Ian Campbell wrote:
> >>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >>>> For testing purposes, the environmental variable "XG_MIGRATION_V2"
> >>>> allows the two save/restore codepaths to coexist, and have a runtime
> >>>> switch.
> >>>>
> >>>> It is intended that once this series is less RFC, the v2 framework
> >>>> will completely replace v1.
> >>> I think we are now at the point where this hack needs to be dropped
> >>> from the series.
> >> One problem is remus.  My plan when dropping this patch was to drop all
> >> of xc_domain_{save/restore}.c as well, but without remus migration-v2
> >> support available, this will break existing set-ups.
> > Hrm, how is that going wrt 4.5 freeze?
>
> I haven’t heard seen anything since v5 of this series (for which I did
> some quick bugfixes and released v6).
>

FYI, that's not entirely true. Yang did post a set of RFC patches for remus
support in migration v2, based on your V6 series (back in July)
http://lists.xenproject.org/archives/html/xen-devel/2014-07/msg01163.html


It would actually be helpful if you could cc me on the patches relevant to
Remus, or if there is anything specific to Remus that needs to be done. There
are 100s of posts on Xen devel every day and it's hard to keep track of
everything posted to Xen devel.


And looking at your patch sets in

http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/saverestore2-v6.3

I see that there is no support for Remus currently. Nor can I differentiate
which parts of the code correspond to these "quick bug fixes" that you
mentioned above. From the discussion over the remus rfc patches, I only recall
a bug related to vcpu context caching. But I cannot delineate that specific
part from the patches in the repo. So, if these bug fixes you are referring to
are something else, please explain.


> I don't know, which probably means not good.
>

> >
> >> One option might be to have legacy and v2 sitting properly side-by-side
> >> in libxc for the transition period.
> > How long do you mean? Until 4.6?
>

fwiw, I don't plan to work on remus migration v2 support until the remus
netbuffer patches get in.
I have been at this for almost two release cycles. It's frustrating to
iterate on feedback for patch 4/11 of a series for two months and then get a
bunch of first-pass review for patch 6/10 at the eleventh hour before a
feature freeze, while the rest of the series has still not been reviewed at
all for the past 3 months.


> Ideally, very early in the 4.6 dev period.
>
> ~Andrew
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

[-- Attachment #1.2: Type: text/html, Size: 4022 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-14 10:23           ` Shriram Rajagopalan
@ 2014-09-15 15:09             ` Andrew Cooper
  2014-09-15 18:58               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-15 15:09 UTC (permalink / raw)
  To: rshriram; +Cc: FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 3944 bytes --]


On 14/09/2014 11:23, Shriram Rajagopalan wrote:
> On Sep 11, 2014 4:08 AM, "Andrew Cooper" <andrew.cooper3@citrix.com
> <mailto:andrew.cooper3@citrix.com>> wrote:
> >
> > On 11/09/14 12:01, Ian Campbell wrote:
> > > On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
> > >> On 11/09/14 11:34, Ian Campbell wrote:
> > >>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> > >>>> For testing purposes, the environmental variable "XG_MIGRATION_V2"
> > >>>> allows the two save/restore codepaths to coexist, and have a runtime
> > >>>> switch.
> > >>>>
> > >>>> It is intended that once this series is less RFC, the v2 framework
> > >>>> will completely replace v1.
> > >>> I think we are now at the point where this hack needs to be dropped
> > >>> from the series.
> > >> One problem is remus.  My plan when dropping this patch was to drop all
> > >> of xc_domain_{save/restore}.c as well, but without remus migration-v2
> > >> support available, this will break existing set-ups.
> > > Hrm, how is that going wrt 4.5 freeze?
> >
> > I haven’t heard seen anything since v5 of this series (for which I did
> > some quick bugfixes and released v6).
> >
>
> FYI, that's not entirely true. Yang did post a set of RFC patches for
> remus
> support in migration v2, based on your V6 series (back in July)
> http://lists.xenproject.org/archives/html/xen-devel/2014-07/msg01163.html

My apologies - it was v6 to v6.1

>
>
> It would actually be helpful if you could cc me on the patches
> relevant to Remus,
> or if there is anything specific to Remus that needs to be done. There
> are 100s of
> posts on Xen devel every day and it's hard to keep track of everything
> posted to
> Xen devel.
>
>
> And looking at your patch sets in
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/saverestore2-v6.3
>
> I see that there is no support for Remus currently. Nor can I
> differentiate which parts of the
> code correspond to these "quick bug fixes" that you mentioned above. From the
> discussion over the remus rfc
> patches, I only recall a bug related to vcpu context caching. But I
> cannot delineate that specific part from
> the patches in the repo. So, if these bug fixes you are referring to
> are something else, please explain.

The bugfixes were referring to the vcpu context caching, but far more
bits needed caching than the remus series fixed.  The fixes were
necessary even in the non-remus case, and there were also improvements to
the receive-side state machine to avoid VM corruption caused by an incorrect
send order.

I did not integrate the remus specific patches as there were outstanding
review concerns/comments.

>
>
> > I don't know, which probably means not good.
> >
>
> > >
> > >> One option might be to have legacy and v2 sitting properly
> side-by-side
> > >> in libxc for the transition period.
> > > How long do you mean? Until 4.6?
> >
>
> fwiw, I don't plan to work on remus migration v2 support until the
> remus netbuffer patches get in.
> I have been at this for almost two release cycles. It's frustrating to
> iterate on feedback for patch 4/11
> of a series for two months and then get a bunch of first-pass review
> for patch 6/10 at the eleventh hour
> before a feature freeze, while the rest of the series has still not
> been reviewed at all for the past 3 months.

I can appreciate your frustration on this point, and do not envy your
position.

The concern I have is that XenServer 6.5 is shipping with migrationv2 as
we absolutely need it, given the 32->64bit upgrade.  We were hoping to
get the new format committed in 4.5 to guarantee stability, but that is
looking increasingly unlikely to happen.  As a result, it will probably
have to go in early in 4.6, with extra care taken to ensure that no
incompatible changes are made as a result of further review.

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 6679 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-15 15:09             ` Andrew Cooper
@ 2014-09-15 18:58               ` Konrad Rzeszutek Wilk
  2014-09-16 11:44                 ` Andrew Cooper
  0 siblings, 1 reply; 79+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-15 18:58 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: rshriram, FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel

On Mon, Sep 15, 2014 at 04:09:51PM +0100, Andrew Cooper wrote:
> 
> On 14/09/2014 11:23, Shriram Rajagopalan wrote:
> >On Sep 11, 2014 4:08 AM, "Andrew Cooper" <andrew.cooper3@citrix.com
> ><mailto:andrew.cooper3@citrix.com>> wrote:
> >>
> >> On 11/09/14 12:01, Ian Campbell wrote:
> >> > On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
> >> >> On 11/09/14 11:34, Ian Campbell wrote:
> >> >>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >> >>>> For testing purposes, the environmental variable
> >"XG_MIGRATION_V2" allows the
> >> >>>> two save/restore codepaths to coexist, and have a runtime switch.
> >> >>>>
> >> >>>> It is indended that once this series is less RFC, the v2
> >framework will
> >> >>>> completely replace v1.
> >> >>> I think we are now at the point where this hack needs to be
> >dropped from
> >> >>> the series.
> >> >> One problem is remus.  My plan when dropping this patch was to

The other is 'tmem'. But 'tmem' has not yet been declared 'baked' so
not making it work from a release perspective is OK.

With the 'tmem' maintainer hat on, however, I would like it to work without
having to do anything :-) Which reminds me I need to follow up
on double-checking that the migration hasn't bitrotten!


> >drop all
> >> >> of xc_domain_{save/restore}.c as well, but without remus migration-v2
> >> >> support available, this will break existing set-ups.

And by 'set-ups' you mean Xen 4.5 using the v1 migration tools and then
out of tree patches on top of that. In other words, users of the libxc
"API" (which we do not guarantee between releases - it is an internal
API).

> >> > Hrm, how is that going wrt 4.5 freeze?
> >>
> >> I haven’t heard seen anything since v5 of this series (for which I did
> >> some quick bugfixes and released v6).
> >>
> >
> >FYI, thats not entirely true. Yang did post a set of RFC patches for
> >remus
> >support in migration v2, based on your V6 series (back in July)
> >http://lists.xenproject.org/archives/html/xen-devel/2014-07/msg01163.html
> 
> My apologies - it was v6 to v6.1
> 
> >
> >
> >It would actually be helpful if you could cc me on the patches
> >relevant to Remus,
> >or if there is anything specific to Remus that needs to be done. There
> >are 100s of
> >posts on Xen devel every day and its hard to keep track of everything
> >posted to
> >Xen devel.

I've found that setting up filters for the right keywords helps with that.
That is how I can subscribe to lkml without drinking from the
firehose.

> >
> >
> >And I looking at your patch sets in
> >http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/saverestore2-v6.3
> >
> >I see that there is no support for Remus currently. Nor can I
> >differentiate which parts of the
> >code fix to these "quick bug fixes" that you mentioned above. From the
> >discussion over the remus rfc
> >patches, I only recall a bug related to vcpu context caching. But I
> >cannot delineate that specific part from
> >the patches in the repo. So, if these bug fixes you are referring to
> >are something else, please explain.
> 
> The bugfixes were referring to the vcpu context caching, but far more
> bits needed caching than the remus series fixed.  The fixes were
> necessary even in the non-remus case and there were also improvements to
> receive side state machine to avoid vm corruption caused by an incorrect
> send order.
> 
> I did not integrate the remus specific patches as there were outstanding
> review concerns/comments.

<nods> My recollection as well.
> 
> >
> >
> >> I don't know, which probably means not good.
> >>
> >
> >> >
> >> >> One option might be to have legacy and v2 sitting properly
> >side-by-side
> >> >> in libxc for the transition period.
> >> > How long do you mean? Until 4.6?
> >>
> >
> >fwiw, I don't plan to work on remus migration v2 support until the
> >remus netbuffer patches get in.
> >I have been at this for almost two release cycles. Its frustrating to
> >iterate on feedbacks for patch 4/11
> >of a series for two months and then get a bunch of first-pass review
> >for patch 6/10 at the eleventh hour
> >before a feature freeze, while the rest of the series has still not
> >been reviewed at all for the past 3 months.

What is the dependency on "full remus" support? Is there a list of
all the different patchsets that need to be reviewed?

> 
> I can appreciate your frustration on this point, and do not envy your
> position.
> 
> The concern I have is that XenServer 6.5 is shipping with migrationv2 as
> we absolutely need it, given the 32->64bit upgrade.  We were hoping to
> get the new format committed in 4.5 to guarantee stability, but that is
> looking increasingly unlikely to happen.  As a result, it will probably
> have to go in early in 4.6, with extra care taken to ensure that no
> incompatible changes are made as a result of further review.

Could you tell me what are the benefits of having a v1 to v2 runtime
switch for developers/users besides the obvious (faster migration,
easier to understand code)?

For me it sounded that this would allow the community to also
test it and report bugs - which would be invaluable. And better
yet there is a env flag to swap between a baseline and new
code to ease the testing.

The risks seem quite contained - if something goes awry, folks can
use the v1 version - which should have the same amount of bugs
that it had in previous releases. And since v1 is on by default,
only dedicated users would turn v2 on.

From a maintenance perspective, it does add more code, but then once
feature freeze hits we do not pay attention to features anymore,
but rather to bug-fixes.

Hm, Ians - what is your take on it?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-15 18:58               ` Konrad Rzeszutek Wilk
@ 2014-09-16 11:44                 ` Andrew Cooper
  2014-09-16 19:54                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2014-09-16 11:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: rshriram, FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel


On 15/09/2014 19:58, Konrad Rzeszutek Wilk wrote:
> On Mon, Sep 15, 2014 at 04:09:51PM +0100, Andrew Cooper wrote:
>> On 14/09/2014 11:23, Shriram Rajagopalan wrote:
>>> On Sep 11, 2014 4:08 AM, "Andrew Cooper" <andrew.cooper3@citrix.com
>>> <mailto:andrew.cooper3@citrix.com>> wrote:
>>>> On 11/09/14 12:01, Ian Campbell wrote:
>>>>> On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
>>>>>> On 11/09/14 11:34, Ian Campbell wrote:
>>>>>>> On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
>>>>>>>> For testing purposes, the environmental variable
>>> "XG_MIGRATION_V2" allows the
>>>>>>>> two save/restore codepaths to coexist, and have a runtime switch.
>>>>>>>>
>>>>>>>> It is indended that once this series is less RFC, the v2
>>> framework will
>>>>>>>> completely replace v1.
>>>>>>> I think we are now at the point where this hack needs to be
>>> dropped from
>>>>>>> the series.
>>>>>> One problem is remus.  My plan when dropping this patch was to
> The other is 'tmem'. But 'tmem' has not yet been declared 'baked' so
> not making it work from a release perspective is OK.
>
> With the 'tmem' maintainer hat on, however, I would like it to work without
> having to do anything :-) Which reminds me I need to follow up
> on double-checking that the migration hasn't bitrotten!

While reverse engineering the existing protocol is not too difficult, I 
think the TMEM migration needs redesigning.  From memory, there is a 
huge quantity of metadata which is sent redundantly (tmem pool uuid with 
every frame).  It would also benefit massively from some batching to 
help reduce the quantity of hypercalls made (5 per frame iirc).

>
>
>>> drop all
>>>>>> of xc_domain_{save/restore}.c as well, but without remus migration-v2
>>>>>> support available, this will break existing set-ups.
> And by 'set-ups' you mean Xen 4.5 using the v1 migration tools and then
> out of tree patches on top of that. In other words, users of the libxc
> "API" (which we do not guarantee between releases - it is an internal
> API).
>
>>>>> Hrm, how is that going wrt 4.5 freeze?
>>>> I haven’t heard seen anything since v5 of this series (for which I did
>>>> some quick bugfixes and released v6).
>>>>
>>> FYI, thats not entirely true. Yang did post a set of RFC patches for
>>> remus
>>> support in migration v2, based on your V6 series (back in July)
>>> http://lists.xenproject.org/archives/html/xen-devel/2014-07/msg01163.html
>> My apologies - it was v6 to v6.1
>>
>>>
>>> It would actually be helpful if you could cc me on the patches
>>> relevant to Remus,
>>> or if there is anything specific to Remus that needs to be done. There
>>> are 100s of
>>> posts on Xen devel every day and its hard to keep track of everything
>>> posted to
>>> Xen devel.
> I've found that putting filters for the right keywords help in that.
> That is how I can subscribe to lkml without drinking the
> firehouse.
>
>>>
>>> And I looking at your patch sets in
>>> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/saverestore2-v6.3
>>>
>>> I see that there is no support for Remus currently. Nor can I
>>> differentiate which parts of the
>>> code fix to these "quick bug fixes" that you mentioned above. From the
>>> discussion over the remus rfc
>>> patches, I only recall a bug related to vcpu context caching. But I
>>> cannot delineate that specific part from
>>> the patches in the repo. So, if these bug fixes you are referring to
>>> are something else, please explain.
>> The bugfixes were referring to the vcpu context caching, but far more
>> bits needed caching than the remus series fixed.  The fixes were
>> necessary even in the non-remus case and there were also improvements to
>> receive side state machine to avoid vm corruption caused by an incorrect
>> send order.
>>
>> I did not integrate the remus specific patches as there were outstanding
>> review concerns/comments.
> <nods> My recollection as well.
>>>
>>>> I don't know, which probably means not good.
>>>>
>>>>>> One option might be to have legacy and v2 sitting properly
>>> side-by-side
>>>>>> in libxc for the transition period.
>>>>> How long do you mean? Until 4.6?
>>> fwiw, I don't plan to work on remus migration v2 support until the
>>> remus netbuffer patches get in.
>>> I have been at this for almost two release cycles. Its frustrating to
>>> iterate on feedbacks for patch 4/11
>>> of a series for two months and then get a bunch of first-pass review
>>> for patch 6/10 at the eleventh hour
>>> before a feature freeze, while the rest of the series has still not
>>> been reviewed at all for the past 3 months.
> What is the dependency on "full remus" support? Is there a list of
> all the different patchset that need to be reviewed?

As with TMEM, remus support needs redesigning, as it needs coordinated 
additions to both the libxc and libxl stream formats to support 
checkpoints without the current layer violations.

>
>> I can appreciate your frustration on this point, and do not envy your
>> position.
>>
>> The concern I have is that XenServer 6.5 is shipping with migrationv2 as
>> we absolutely need it, given the 32->64bit upgrade.  We were hoping to
>> get the new format committed in 4.5 to guarantee stability, but that is
>> looking increasingly unlikely to happen.  As a result, it will probably
>> have to go in early in 4.6, with extra care taken to ensure that no
>> incompatible changes are made as a result of further review.
> Could you tell me what are the benefits of having a v1 to v2 runtime
> switch for developers/users besides the obvious (faster migration,
> easier to understand code)?

Users should not notice a difference, other than it being faster.

From a developer point of view,

* It actually has some header information now
* It is independent of the bitness of the toolstack (which is the key
reason we needed to do it for XenServer's switch from 32 to 64bit dom0)
* The old format (what little there was of it) was basically inextensible
for PV guests (see the PV MSRs thread)
* It has allowed for dropping 2-level PV guest support, as well as other
32bit Xen bits.
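
As a rough illustration of the bitness point above, here is a minimal
sketch, assuming a hypothetical record header (struct demo_rec_hdr and the
demo_* helpers are made-up names, and the little-endian byte order is an
assumption for the example, not a statement about the real stream format).
The idea is only that fixed-width fields and explicit serialisation give the
same bytes from a 32bit and a 64bit toolstack, which a struct built from
"unsigned long" would not.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record header: names and layout are illustrative only. */
struct demo_rec_hdr {
    uint32_t type;    /* record type */
    uint32_t length;  /* length of the record body in bytes */
};

/* Fixed-width fields give the same size on 32bit and 64bit builds. */
_Static_assert(sizeof(struct demo_rec_hdr) == 8,
               "header must be 8 bytes regardless of toolstack bitness");

/* Serialise field by field (little-endian assumed for this sketch). */
static void demo_put_hdr(uint8_t out[8], const struct demo_rec_hdr *h)
{
    for (size_t i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(h->type   >> (8 * i));
        out[4 + i] = (uint8_t)(h->length >> (8 * i));
    }
}

/* Inverse of demo_put_hdr(). */
static void demo_get_hdr(struct demo_rec_hdr *h, const uint8_t in[8])
{
    h->type = 0;
    h->length = 0;
    for (size_t i = 0; i < 4; i++) {
        h->type   |= (uint32_t)in[i]     << (8 * i);
        h->length |= (uint32_t)in[4 + i] << (8 * i);
    }
}
```

Going through explicit put/get helpers rather than writing the struct
directly also keeps the on-wire bytes independent of compiler padding.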

>
> For me it sounded that this would allow the community to also
> test it and report bugs - which would be invaluable. And better
> yet there is a env flag to swap between a baseline and new
> code to ease the testing.

That was only supposed to be for development, and to be removed when
committed upstream.

~Andrew

>
> The risks seem quite contained - if something goes awry, folks can
> use the v1 version - which should have the same amount of bugs
> that it had in previous releases. And since it is on by default - so
> only dedicated users would turn v2 on.
>
>  From an maintaince perspective, it does add more code but then once
> feature freeze hits we do not pay attention to features anymore,
> but rather to bug-fixes.
>
> Hm, Ian's - what are you folks take on it?




* Re: [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework
  2014-09-16 11:44                 ` Andrew Cooper
@ 2014-09-16 19:54                   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 79+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-16 19:54 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: rshriram, FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel

On Tue, Sep 16, 2014 at 12:44:45PM +0100, Andrew Cooper wrote:
> 
> On 15/09/2014 19:58, Konrad Rzeszutek Wilk wrote:
> >On Mon, Sep 15, 2014 at 04:09:51PM +0100, Andrew Cooper wrote:
> >>On 14/09/2014 11:23, Shriram Rajagopalan wrote:
> >>>On Sep 11, 2014 4:08 AM, "Andrew Cooper" <andrew.cooper3@citrix.com
> >>><mailto:andrew.cooper3@citrix.com>> wrote:
> >>>>On 11/09/14 12:01, Ian Campbell wrote:
> >>>>>On Thu, 2014-09-11 at 11:37 +0100, Andrew Cooper wrote:
> >>>>>>On 11/09/14 11:34, Ian Campbell wrote:
> >>>>>>>On Wed, 2014-09-10 at 18:10 +0100, Andrew Cooper wrote:
> >>>>>>>>For testing purposes, the environmental variable
> >>>"XG_MIGRATION_V2" allows the
> >>>>>>>>two save/restore codepaths to coexist, and have a runtime switch.
> >>>>>>>>
> >>>>>>>>It is indended that once this series is less RFC, the v2
> >>>framework will
> >>>>>>>>completely replace v1.
> >>>>>>>I think we are now at the point where this hack needs to be
> >>>dropped from
> >>>>>>>the series.
> >>>>>>One problem is remus.  My plan when dropping this patch was to
> >The other is 'tmem'. But 'tmem' has not yet been declared 'baked' so
> >not making it work from a release perspective is OK.
> >
> >With the 'tmem' maintainer hat on, however I would like to it work without
> >having to do anything :-) Which reminds me I need to follow up
> >on double-checking the migation hasn't bitrotten!
> 
> While reverse engineering the existing protocol is not too difficult, I
> think the TMEM migration needs redesigning.  From memory, there is a huge
> quantity of metadata which is sent redundantly (tmem pool uuid with every
> frame).  It would also benefit massively from some batching to help reduce
> the quantity of hypercalls made (5 per frame iirc).

Ugh.
.. snip..
> >>>before a feature freeze, while the rest of the series has still not
> >>>been reviewed at all for the past 3 months.
> >What is the dependency on "full remus" support? Is there a list of
> >all the different patchset that need to be reviewed?
> 
> As with TMEM, remus support needs redesigning, as it needs coordinated
> additions to both the libxc and libxl stream formats to support checkpoints
> without the current layer violations.
> 
> >
> >>I can appreciate your frustration on this point, and do not envy your
> >>position.
> >>
> >>The concern I have is that XenServer 6.5 is shipping with migrationv2 as
> >>we absolutely need it, given the 32->64bit upgrade.  We were hoping to
> >>get the new format committed in 4.5 to guarantee stability, but that is
> >>looking increasingly unlikely to happen.  As a result, it will probably
> >>have to go in early in 4.6, with extra care taken to ensure that no
> >>incompatible changes are made as a result of further review.
> >Could you tell me what are the benefits of having a v1 to v2 runtime
> >switch for developers/users besides the obvious (faster migration,
> >easier to understand code)?
> 
> Users should not notice a difference, other than it being faster.
> 
> From a developer point of view,
> 
> * It actually has some header information now
> * It is independent of the bitness of the toolstack (which is the key reason
> we needed to do it for XenServers switch from 32 to 64bit dom0)
> * The old format (little that it was) was basically inextensible for PV
> guests (See the PV MSRs thread)
> * It has allowed for dropping 2-level PV guest support, as well as other
> 32bit Xen bits.

The disadvantages are that it:
 - Breaks tmem migration.
 - Breaks outside users of libxc (granted we don't specify an API for
   that, but I am not a huge fan of putting up barriers).

> 
> >
> >For me it sounded that this would allow the community to also
> >test it and report bugs - which would be invaluable. And better
> >yet there is a env flag to swap between a baseline and new
> >code to ease the testing.
> 
> That was only supposed to be development, and removed when committed
> upstream.
> 
> ~Andrew
> 
> >
> >The risks seem quite contained - if something goes awry, folks can
> >use the v1 version - which should have the same amount of bugs
> >that it had in previous releases. And since it is on by default - so
> >only dedicated users would turn v2 on.
> >
> > From an maintaince perspective, it does add more code but then once
> >feature freeze hits we do not pay attention to features anymore,
> >but rather to bug-fixes.
> >
> >Hm, Ian's - what are you folks take on it?
> 


* Re: [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier
  2014-09-12  8:36   ` Wen Congyang
@ 2014-09-19  7:45     ` Ross Lagerwall
  0 siblings, 0 replies; 79+ messages in thread
From: Ross Lagerwall @ 2014-09-19  7:45 UTC (permalink / raw)
  To: Wen Congyang, Andrew Cooper, Xen-devel; +Cc: Ian Jackson, Ian Campbell

On 09/12/2014 09:36 AM, Wen Congyang wrote:
> On 09/11/2014 01:11 AM, Andrew Cooper wrote:
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>>
>> Add a parameter, maxread, to limit the amount of data read from the
>> source fd of a datacopier.
>>
>> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
>> ---
>>   tools/libxl/libxl_aoutils.c    |    9 +++++++--
>>   tools/libxl/libxl_bootloader.c |    2 ++
>>   tools/libxl/libxl_dom.c        |    1 +
>>   tools/libxl/libxl_internal.h   |    1 +
>>   4 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
>> index caba637..6502325 100644
>> --- a/tools/libxl/libxl_aoutils.c
>> +++ b/tools/libxl/libxl_aoutils.c
>> @@ -145,7 +145,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>>                   return;
>>               }
>>           }
>> -    } else if (!libxl__ev_fd_isregistered(&dc->toread)) {
>> +    } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
>>           /* we have had eof */
>>           datacopier_callback(egc, dc, 0, 0);
>>           return;
>> @@ -233,7 +233,8 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>>           }
>>           int r = read(ev->fd,
>>                        buf->buf + buf->used,
>> -                     sizeof(buf->buf) - buf->used);
>> +                     (sizeof(buf->buf) - buf->used) < dc->maxread ?
>> +                         (sizeof(buf->buf) - buf->used) : dc->maxread);
>>           if (r < 0) {
>>               if (errno == EINTR) continue;
>>               if (errno == EWOULDBLOCK) break;
>> @@ -258,7 +259,11 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>>           }
>>           buf->used += r;
>>           dc->used += r;
>> +        dc->maxread -= r;
>>           assert(buf->used <= sizeof(buf->buf));
>> +        assert(dc->maxread >= 0);
>> +        if (dc->maxread == 0)
>> +            break;
>
> We can call libxl__ev_fd_deregister() here, and no need to touch datacopier_check_state().
> Otherwise, datacopier_readable() may be called again, and read() returns 0. And then
> libxl__ev_fd_deregister() is called.
>

Well, datacopier_check_state() is called immediately after the break, and
that would deregister the fd, so datacopier_readable() would not be
called again. What you suggest is definitely better though.
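
For reference, the "stop once the limit is spent" behaviour can be sketched
outside of libxl as below. This is only a minimal sketch under stated
assumptions: read_limited() is a hypothetical helper, it uses a plain
blocking fd rather than the libxl event loop, and it simply returns instead
of deregistering anything.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of libxl: read at most `budget` bytes from
 * fd into buf (capacity cap).  No further read() is issued once the budget
 * is spent.  Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_limited(int fd, char *buf, size_t cap, size_t budget)
{
    size_t used = 0;

    assert(buf != NULL);

    while (budget > 0 && used < cap) {
        /* Same min() as in the patch: space left vs. remaining budget. */
        size_t want = (cap - used) < budget ? (cap - used) : budget;
        ssize_t r = read(fd, buf + used, want);

        if (r < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (r == 0) break;        /* genuine EOF before the limit */

        used += (size_t)r;
        budget -= (size_t)r;      /* cannot underflow: r <= want <= budget */
    }
    return (ssize_t)used;
}
```

Because the loop condition checks the budget before every read(), hitting
the limit and hitting EOF are kept distinct, and there is no extra read()
that returns 0 after the limit has been reached.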

Thanks
--
Ross Lagerwall


* Re: [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer
  2014-09-12  8:49   ` Wen Congyang
@ 2014-09-19  7:48     ` Ross Lagerwall
  0 siblings, 0 replies; 79+ messages in thread
From: Ross Lagerwall @ 2014-09-19  7:48 UTC (permalink / raw)
  To: Wen Congyang, Andrew Cooper, Xen-devel; +Cc: Ian Jackson, Ian Campbell

On 09/12/2014 09:49 AM, Wen Congyang wrote:
> On 09/11/2014 01:11 AM, Andrew Cooper wrote:
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>>
>> Implement a read-only mode for libxl__datacopier.  The mode is invoked
>> when readbuf is set and writefd is < 0.  On success, the callback passes
>> the number of bytes read.
>>
>> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
>> ---
>>   tools/libxl/libxl_aoutils.c  |   59 +++++++++++++++++++++++++-----------------
>>   tools/libxl/libxl_internal.h |    4 ++-
>>   2 files changed, 38 insertions(+), 25 deletions(-)
>>
>> diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
>> index 6502325..9183716 100644
>> --- a/tools/libxl/libxl_aoutils.c
>> +++ b/tools/libxl/libxl_aoutils.c
>> @@ -134,7 +134,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>>       STATE_AO_GC(dc->ao);
>>       int rc;
>>
>> -    if (dc->used) {
>> +    if (dc->used && !dc->readbuf) {
>>           if (!libxl__ev_fd_isregistered(&dc->towrite)) {
>>               rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
>>                                          dc->writefd, POLLOUT);
>> @@ -147,7 +147,7 @@ static void datacopier_check_state(libxl__egc *egc, libxl__datacopier_state *dc)
>>           }
>>       } else if (!libxl__ev_fd_isregistered(&dc->toread) || dc->maxread == 0) {
>>           /* we have had eof */
>> -        datacopier_callback(egc, dc, 0, 0);
>> +        datacopier_callback(egc, dc, 0, dc->readbuf ? dc->used : 0);
>>           return;
>>       } else {
>>           /* nothing buffered, but still reading */
>> @@ -215,26 +215,31 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>>       }
>>       assert(revents & POLLIN);
>>       for (;;) {
>> -        while (dc->used >= dc->maxsz) {
>> -            libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
>> -            dc->used -= rm->used;
>> -            assert(dc->used >= 0);
>> -            LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
>> -            free(rm);
>> -        }
>> +        libxl__datacopier_buf *buf = NULL;
>> +        int r;
>> +
>> +        if (dc->readbuf) {
>> +            r = read(ev->fd, dc->readbuf + dc->used, dc->maxread);
>> +        } else {
>> +            while (dc->used >= dc->maxsz) {
>> +                libxl__datacopier_buf *rm = LIBXL_TAILQ_FIRST(&dc->bufs);
>> +                dc->used -= rm->used;
>> +                assert(dc->used >= 0);
>> +                LIBXL_TAILQ_REMOVE(&dc->bufs, rm, entry);
>> +                free(rm);
>> +            }
>>
>> -        libxl__datacopier_buf *buf =
>> -            LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
>> -        if (!buf || buf->used >= sizeof(buf->buf)) {
>> -            buf = malloc(sizeof(*buf));
>> -            if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
>> -            buf->used = 0;
>> -            LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
>> -        }
>> -        int r = read(ev->fd,
>> -                     buf->buf + buf->used,
>> +            buf = LIBXL_TAILQ_LAST(&dc->bufs, libxl__datacopier_bufs);
>> +            if (!buf || buf->used >= sizeof(buf->buf)) {
>> +                buf = malloc(sizeof(*buf));
>> +                if (!buf) libxl__alloc_failed(CTX, __func__, 1, sizeof(*buf));
>> +                buf->used = 0;
>> +                LIBXL_TAILQ_INSERT_TAIL(&dc->bufs, buf, entry);
>> +            }
>> +            r = read(ev->fd, buf->buf + buf->used,
>>                        (sizeof(buf->buf) - buf->used) < dc->maxread ?
>>                            (sizeof(buf->buf) - buf->used) : dc->maxread);
>> +        }
>>           if (r < 0) {
>>               if (errno == EINTR) continue;
>>               if (errno == EWOULDBLOCK) break;
>> @@ -257,10 +262,12 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev,
>>                   return;
>>               }
>>           }
>> -        buf->used += r;
>> +        if (!dc->readbuf) {
>> +            buf->used += r;
>> +            assert(buf->used <= sizeof(buf->buf));
>> +        }
>>           dc->used += r;
>>           dc->maxread -= r;
>> -        assert(buf->used <= sizeof(buf->buf));
>>           assert(dc->maxread >= 0);
>>           if (dc->maxread == 0)
>>               break;
>> @@ -324,9 +331,13 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
>>           if (rc) goto out;
>>       }
>>
>> -    rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
>> -                               dc->writefd, POLLOUT);
>> -    if (rc) goto out;
>> +    if (dc->writefd >= 0) {
>> +        rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
>> +                                   dc->writefd, POLLOUT);
>> +        if (rc) goto out;
>> +    }
>> +
>> +    assert(dc->readfd >= 0 || dc->writefd >= 0);
>
> If readfd and writefd are valid, and readbuf is not NULL, what it the behavior?
>

I'd say that's an invalid use of the API. There should probably be an 
assert to check this.
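
Something along these lines, as a minimal sketch only: struct dc_cfg is a
hypothetical stand-in for the relevant fields of libxl__datacopier_state,
and dc_cfg_valid() is a made-up name. It encodes the rule from the commit
message (readbuf set means read-only mode, so writefd must be < 0) on top of
the existing "at least one fd" check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant datacopier fields. */
struct dc_cfg {
    int readfd;      /* < 0 means "no source fd" */
    int writefd;     /* < 0 means "no destination fd" */
    void *readbuf;   /* non-NULL selects reading into a buffer */
};

/* True if the fields describe exactly one sensible mode of operation. */
static bool dc_cfg_valid(const struct dc_cfg *c)
{
    /* At least one side must be a real fd. */
    if (c->readfd < 0 && c->writefd < 0)
        return false;

    /* Reading into a buffer needs a source fd and must not also write. */
    if (c->readbuf && (c->readfd < 0 || c->writefd >= 0))
        return false;

    return true;
}
```

A caller could then do assert(dc_cfg_valid(&cfg)) before starting the copy,
which would catch the readfd + writefd + readbuf combination asked about.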

Regards
-- 
Ross Lagerwall


Thread overview: 79+ messages
2014-09-10 17:10 [PATCH v7 0/29] Migration Stream v2 Andrew Cooper
2014-09-10 17:10 ` [PATCH 01/29] tools/libxl: Fix stray blank line from debug logging Andrew Cooper
2014-09-11 10:18   ` Ian Campbell
2014-09-10 17:10 ` [PATCH 02/29] tools/[lib]xl: Correct use of init/dispose for libxl_domain_restore_params Andrew Cooper
2014-09-11 10:19   ` Ian Campbell
2014-09-10 17:10 ` [PATCH 03/29] tools/libxc: Implement writev_exact() in the same style as write_exact() Andrew Cooper
2014-09-11 10:19   ` Ian Campbell
2014-09-11 10:57   ` Ian Campbell
2014-09-11 10:59     ` Andrew Cooper
2014-09-10 17:10 ` [PATCH 04/29] libxc/bitops: Add or() to the available bitmap operations Andrew Cooper
2014-09-11 10:21   ` Ian Campbell
2014-09-10 17:10 ` [PATCH 05/29] libxc/progress: Repurpose the current progress reporting infrastructure Andrew Cooper
2014-09-11 10:32   ` Ian Campbell
2014-09-11 14:03     ` Andrew Cooper
2014-09-11 14:06       ` Ian Campbell
2014-09-10 17:10 ` [PATCH 06/29] docs: libxc migration stream specification Andrew Cooper
2014-09-10 17:10 ` [PATCH 07/29] docs: libxl " Andrew Cooper
2014-09-11 10:45   ` Ian Campbell
2014-09-11 10:56     ` Andrew Cooper
2014-09-11 11:03       ` Ian Campbell
2014-09-11 11:10         ` Andrew Cooper
2014-09-10 17:10 ` [PATCH 08/29] tools/python: Infrastructure relating to migration v2 streams Andrew Cooper
2014-09-10 17:10 ` [PATCH 09/29] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
2014-09-11 10:34   ` Ian Campbell
2014-09-11 10:37     ` Andrew Cooper
2014-09-11 11:01       ` Ian Campbell
2014-09-11 11:04         ` Andrew Cooper
2014-09-11 11:10           ` Ian Campbell
2014-09-14 10:23           ` Shriram Rajagopalan
2014-09-15 15:09             ` Andrew Cooper
2014-09-15 18:58               ` Konrad Rzeszutek Wilk
2014-09-16 11:44                 ` Andrew Cooper
2014-09-16 19:54                   ` Konrad Rzeszutek Wilk
2014-09-10 17:10 ` [PATCH 10/29] tools/libxc: C implementation of stream format Andrew Cooper
2014-09-11 10:48   ` Ian Campbell
2014-09-10 17:10 ` [PATCH 11/29] tools/libxc: noarch common code Andrew Cooper
2014-09-11 10:52   ` Ian Campbell
2014-09-10 17:10 ` [PATCH 12/29] tools/libxc: x86 " Andrew Cooper
2014-09-10 17:10 ` [PATCH 13/29] tools/libxc: x86 PV " Andrew Cooper
2014-09-10 17:10 ` [PATCH 14/29] tools/libxc: x86 PV save code Andrew Cooper
2014-09-10 17:10 ` [PATCH 15/29] tools/libxc: x86 PV restore code Andrew Cooper
2014-09-10 17:10 ` [PATCH 16/29] tools/libxc: x86 HVM save code Andrew Cooper
2014-09-10 17:10 ` [PATCH 17/29] tools/libxc: x86 HVM restore code Andrew Cooper
2014-09-10 17:10 ` [PATCH 18/29] tools/libxc: noarch save code Andrew Cooper
2014-09-10 17:10 ` [PATCH 19/29] tools/libxc: noarch restore code Andrew Cooper
2014-09-10 17:10 ` [PATCH 20/29] tools/libxl: Update datacopier to support sending data only Andrew Cooper
2014-09-11 11:56   ` Ian Campbell
2014-09-11 12:00     ` Andrew Cooper
2014-09-11 12:39       ` Ian Campbell
2014-09-11 13:03         ` Andrew Cooper
2014-09-11 13:04           ` Ian Campbell
2014-09-10 17:10 ` [PATCH 21/29] tools/libxl: Allow adding larger amounts of prefixdata to datacopier Andrew Cooper
2014-09-11 12:01   ` Ian Campbell
2014-09-11 12:17     ` Ross Lagerwall
2014-09-11 12:39       ` Ian Campbell
2014-09-10 17:11 ` [PATCH 22/29] tools/libxl: Allow limiting amount copied by datacopier Andrew Cooper
2014-09-11 12:02   ` Ian Campbell
2014-09-11 12:23     ` Ross Lagerwall
2014-09-11 12:40       ` Ian Campbell
2014-09-12  8:36   ` Wen Congyang
2014-09-19  7:45     ` Ross Lagerwall
2014-09-10 17:11 ` [PATCH 23/29] tools/libxl: Extend datacopier to support reading into a buffer Andrew Cooper
2014-09-11 12:03   ` Ian Campbell
2014-09-11 12:26     ` Ross Lagerwall
2014-09-11 12:41       ` Ian Campbell
2014-09-12  8:49   ` Wen Congyang
2014-09-19  7:48     ` Ross Lagerwall
2014-09-10 17:11 ` [PATCH 24/29] tools/libxl: Allow suppression of POLLHUP for datacopiers Andrew Cooper
2014-09-11 12:05   ` Ian Campbell
2014-09-10 17:11 ` [PATCH 25/29] tools/libxl: Stream v2 format Andrew Cooper
2014-09-11 12:06   ` Ian Campbell
2014-09-10 17:11 ` [PATCH 26/29] tools/libxl: Implement libxl__domain_restore() for v2 streams Andrew Cooper
2014-09-11 12:35   ` Ian Campbell
2014-09-11 13:01     ` Andrew Cooper
2014-09-10 17:11 ` [PATCH 27/29] [VERY RFC] tools/libxl: Support restoring legacy streams Andrew Cooper
2014-09-11 12:36   ` Ian Campbell
2014-09-10 17:11 ` [PATCH 28/29] tools/xl: Restore v2 streams using new interface Andrew Cooper
2014-09-10 17:11 ` [PATCH 29/29] tools/[lib]xl: Alter libxl_domain_suspend() to write a v2 stream Andrew Cooper
2014-09-11 11:50 ` [PATCH v7 0/29] Migration Stream v2 Ian Campbell
