All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/9] Migration Stream v2
@ 2014-04-30 18:36 Andrew Cooper
  2014-04-30 18:36 ` [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW() Andrew Cooper
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Ian Jackson,
	Tim Deegan, Frediano Ziglio, David Vrabel, Jan Beulich

Hello,

Presented here for review is v4 of the Migration Stream v2 work.  David,
Frediano and myself have worked together on the implementation, and we have
now managed to get PV and HVM migration working using the new format (even
when transparently inserted underneath Xapi).

This series is based on staging, as 9f2f1298d0211962 is a prerequisite
hypercall for a bugfix.

Draft F of the design document is available from
  xenbits.xen.org/people/andrewcoop/domain-save-format-F.pdf

The series is available from the branch 'saverestore2-v4' of
  git://xenbits.xen.org/people/andrewcoop/xen.git
  http://xenbits.xen.org/git-http/people/andrewcoop/xen.git


The code has been a clean rewrite, using the v1 code as a reference but
avoiding obsolete areas (e.g. how to modify the pagetables of a 32 non-pae
guest on 32bit pae Xen).

There are certainly some areas still in need improvement.  There are plans to
combine the save logic for PV and HVM guests by adding a few more hooks to
save_restore_ops.  While reviewing the series for posting, I have noticed some
accidental duplication from attempting to merge PV and HVM together
(ctx->save.p2m_size and ctx->x86_pv.max_pfn) which can be consolidated.

Moving forward will be work to do with converting a legacy image to a v2
image.

Questions/issues needing discussing are:

1) What level should the conversion happen.  The v2 format is capable of
   detecting a legacy stream, but libxc is arguably too low level for the
   conversion decision to happen.

2) What to do about the layer violations which is the toolstack record and
   device model record.  Libxl for HVM guests is the only known user of the
   'toolstack' chunk, which contains some xenstore key/value pairs.  This data
   should not be a blob in the middle of the libxc layer, and should be
   promoted to a first class member of a libxl migration stream.

Anyway - the code is presented here for initial comment/query/critism.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW()
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 11:45   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 2/9] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: David Vrabel

From: David Vrabel <david.vrabel@citrix.com>

DECLARE_HYPERCALL_BUFFER_SHADOW() is like DECLARE_HYPERCALL_BUFFER()
except it is backed by an already allocated hypercall buffer.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/xenctrl.h |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 02129f7..15d2f4f 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -266,6 +266,24 @@ typedef struct xc_hypercall_buffer xc_hypercall_buffer_t;
     }
 
 /*
+ * Like DECLARE_HYPERCALL_BUFFER() but using an already allocated
+ * hypercall buffer, _hbuf.
+ *
+ * Useful when a hypercall buffer is passed to a function and access
+ * via the user pointer is required.
+ *
+ * See DECLARE_HYPERCALL_BUFFER_ARGUMENT() if the user pointer is not
+ * required.
+ */
+#define DECLARE_HYPERCALL_BUFFER_SHADOW(_type, _name, _hbuf)   \
+    _type *_name = _hbuf->hbuf;                                \
+    xc_hypercall_buffer_t XC__HYPERCALL_BUFFER_NAME(_name) = { \
+        .hbuf = (void *)-1,                                    \
+        .param_shadow = _hbuf,                                 \
+        HYPERCALL_BUFFER_INIT_NO_BOUNCE                        \
+    }
+
+/*
  * Declare the necessary data structure to allow a hypercall buffer
  * passed as an argument to a function to be used in the normal way.
  */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 2/9] [HACK] tools/libxc: save/restore v2 framework
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
  2014-04-30 18:36 ` [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW() Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-04-30 18:36 ` [PATCH v4 3/9] tools/libxc: Stream specification and some common code Andrew Cooper
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

For testing purposes, the environmental variable "XG_MIGRATION_V2" allows the
two save/restore codepaths to coexist, and have a runtime switch.

It is indended that once this series is less RFC, the v2 framework will
completely replace v1.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/Makefile              |    1 +
 tools/libxc/saverestore/common.h  |   15 +++++++++++++++
 tools/libxc/saverestore/restore.c |   23 +++++++++++++++++++++++
 tools/libxc/saverestore/save.c    |   20 ++++++++++++++++++++
 tools/libxc/xc_domain_restore.c   |    8 ++++++++
 tools/libxc/xc_domain_save.c      |    7 +++++++
 tools/libxc/xenguest.h            |   14 ++++++++++++++
 7 files changed, 88 insertions(+)
 create mode 100644 tools/libxc/saverestore/common.h
 create mode 100644 tools/libxc/saverestore/restore.c
 create mode 100644 tools/libxc/saverestore/save.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index a74b19e..52213dc 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -44,6 +44,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c xc_suspend.c
 ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += $(wildcard saverestore/*.c)
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
 else
 GUEST_SRCS-y += xc_nomigrate.c
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
new file mode 100644
index 0000000..f1aff44
--- /dev/null
+++ b/tools/libxc/saverestore/common.h
@@ -0,0 +1,15 @@
+#ifndef __COMMON__H
+#define __COMMON__H
+
+#include "../xg_private.h"
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
new file mode 100644
index 0000000..6624baa
--- /dev/null
+++ b/tools/libxc/saverestore/restore.c
@@ -0,0 +1,23 @@
+#include "common.h"
+
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
new file mode 100644
index 0000000..c013e62
--- /dev/null
+++ b/tools/libxc/saverestore/save.c
@@ -0,0 +1,20 @@
+#include "common.h"
+
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm,
+                    unsigned long vm_generationid_addr)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index bcb0ae0..faa458a 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1468,6 +1468,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     struct restore_ctx *ctx = &_ctx;
     struct domain_info_context *dinfo = &ctx->dinfo;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_restore2(
+            xch, io_fd, dom, store_evtchn, store_mfn,
+            store_domid, console_evtchn, console_mfn, console_domid,
+            hvm,  pae,  superpages, checkpointed_stream, callbacks);
+    }
+
     DPRINTF("%s: starting restore of new domid %u", __func__, dom);
 
     pagebuf_init(&pagebuf);
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 71f9b59..c94e3e6 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -895,6 +895,13 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     int completed = 0;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_save2(xch, io_fd, dom, max_iters,
+                               max_factor, flags, callbacks, hvm,
+                               vm_generationid_addr);
+    }
+
     DPRINTF("%s: starting save of domid %u", __func__, dom);
 
     if ( hvm && !callbacks->switch_qemu_logdirty )
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 1f216cd..b80df82 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -89,6 +89,11 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                    struct save_callbacks* callbacks, int hvm,
                    unsigned long vm_generationid_addr);
 
+/* Domain Save v2 */
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm,
+                    unsigned long vm_generationid_addr);
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
@@ -128,6 +133,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       int no_incr_generationid, int checkpointed_stream,
                       unsigned long *vm_generationid_addr,
                       struct restore_callbacks *callbacks);
+
+/* Domain Restore v2 */
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks);
 /**
  * xc_domain_restore writes a file to disk that contains the device
  * model saved state.
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 3/9] tools/libxc: Stream specification and some common code
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
  2014-04-30 18:36 ` [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW() Andrew Cooper
  2014-04-30 18:36 ` [PATCH v4 2/9] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 11:57   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.c        |   63 ++++++++++++
 tools/libxc/saverestore/common.h        |    8 ++
 tools/libxc/saverestore/stream_format.h |  158 +++++++++++++++++++++++++++++++
 3 files changed, 229 insertions(+)
 create mode 100644 tools/libxc/saverestore/common.c
 create mode 100644 tools/libxc/saverestore/stream_format.h

diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
new file mode 100644
index 0000000..de2e727
--- /dev/null
+++ b/tools/libxc/saverestore/common.c
@@ -0,0 +1,63 @@
+#include "common.h"
+
+static const char *dhdr_types[] =
+{
+    [DHDR_TYPE_x86_pv]  = "x86 PV",
+    [DHDR_TYPE_x86_hvm] = "x86 HVM",
+    [DHDR_TYPE_x86_pvh] = "x86 PVH",
+    [DHDR_TYPE_arm]     = "ARM",
+};
+
+const char *dhdr_type_to_str(uint32_t type)
+{
+    if ( type < ARRAY_SIZE(dhdr_types) && dhdr_types[type] )
+        return dhdr_types[type];
+
+    return "Reserved";
+}
+
+static const char *mandatory_rec_types[] =
+{
+    [REC_TYPE_end]                  = "End",
+    [REC_TYPE_page_data]            = "Page data",
+    [REC_TYPE_x86_pv_info]          = "x86 PV info",
+    [REC_TYPE_x86_pv_p2m_frames]    = "x86 PV P2M frames",
+    [REC_TYPE_x86_pv_vcpu_basic]    = "x86 PV vcpu basic",
+    [REC_TYPE_x86_pv_vcpu_extended] = "x86 PV vcpu extended",
+    [REC_TYPE_x86_pv_vcpu_xsave]    = "x86 PV vcpu xsave",
+    [REC_TYPE_shared_info]          = "Shared info",
+    [REC_TYPE_tsc_info]             = "TSC info",
+    [REC_TYPE_hvm_context]          = "HVM context",
+    [REC_TYPE_hvm_params]           = "HVM params",
+    [REC_TYPE_toolstack]            = "Toolstack",
+};
+
+/*
+static const char *optional_rec_types[] =
+{
+     None yet...
+};
+*/
+
+const char *rec_type_to_str(uint32_t type)
+{
+    if ( type & REC_TYPE_optional )
+        return "Reserved";
+
+    if ( ((type & REC_TYPE_optional) == 0 ) &&
+         (type < ARRAY_SIZE(mandatory_rec_types)) &&
+         (mandatory_rec_types[type]) )
+        return mandatory_rec_types[type];
+
+    return "Reserved";
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index f1aff44..fff0a39 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -3,6 +3,14 @@
 
 #include "../xg_private.h"
 
+#include "stream_format.h"
+
+// TODO: Find a better place to put this...
+#define ARRAY_SIZE(a) (sizeof(a) / sizeof(a[0]))
+
+const char *dhdr_type_to_str(uint32_t type);
+const char *rec_type_to_str(uint32_t type);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
new file mode 100644
index 0000000..efcca60
--- /dev/null
+++ b/tools/libxc/saverestore/stream_format.h
@@ -0,0 +1,158 @@
+#ifndef __STREAM_FORMAT__H
+#define __STREAM_FORMAT__H
+
+#include <inttypes.h>
+
+/* TODO - find somewhere more appropriate. rec_tsc_info needs it for 64bit */
+#ifndef __packed
+#define __packed __attribute__((packed))
+#endif
+
+/*
+ * Image Header
+ */
+struct __packed ihdr
+{
+    uint64_t marker;
+    uint32_t id;
+    uint32_t version;
+    uint16_t options;
+    uint16_t _res1;
+    uint32_t _res2;
+};
+
+#define IHDR_MARKER  0xffffffffffffffffULL
+#define IHDR_ID      0x58454E46U
+#define IHDR_VERSION 1
+
+#define _IHDR_OPT_ENDIAN 0
+#define IHDR_OPT_LITTLE_ENDIAN (0 << _IHDR_OPT_ENDIAN)
+#define IHDR_OPT_BIG_ENDIAN    (1 << _IHDR_OPT_ENDIAN)
+
+/*
+ * Domain Header
+ */
+struct __packed dhdr
+{
+    uint32_t type;
+    uint16_t page_shift;
+    uint16_t _res1;
+    uint32_t xen_major;
+    uint32_t xen_minor;
+};
+
+#define DHDR_TYPE_x86_pv  0x00000001U
+#define DHDR_TYPE_x86_hvm 0x00000002U
+#define DHDR_TYPE_x86_pvh 0x00000003U
+#define DHDR_TYPE_arm     0x00000004U
+
+/*
+ * Record Header
+ */
+struct __packed rhdr
+{
+    uint32_t type;
+    uint32_t length;
+};
+
+/* Somewhat arbitrary - 8MB */
+#define REC_LENGTH_MAX                (8U << 20)
+
+#define REC_TYPE_end                  0x00000000U
+#define REC_TYPE_page_data            0x00000001U
+#define REC_TYPE_x86_pv_info          0x00000002U
+#define REC_TYPE_x86_pv_p2m_frames    0x00000003U
+#define REC_TYPE_x86_pv_vcpu_basic    0x00000004U
+#define REC_TYPE_x86_pv_vcpu_extended 0x00000005U
+#define REC_TYPE_x86_pv_vcpu_xsave    0x00000006U
+#define REC_TYPE_shared_info          0x00000007U
+#define REC_TYPE_tsc_info             0x00000008U
+#define REC_TYPE_hvm_context          0x00000009U
+#define REC_TYPE_hvm_params           0x0000000aU
+#define REC_TYPE_toolstack            0x0000000bU
+
+#define REC_TYPE_optional             0x80000000U
+
+/* PAGE_DATA */
+struct __packed rec_page_data_header
+{
+    uint32_t count;
+    uint32_t _res1;
+    uint64_t pfn[0];
+};
+
+#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
+#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
+
+/* X86_PV_INFO */
+struct __packed rec_x86_pv_info
+{
+    uint8_t guest_width;
+    uint8_t pt_levels;
+    uint8_t options;
+};
+
+/* X86_PV_P2M_FRAMES */
+struct __packed rec_x86_pv_p2m_frames
+{
+    uint32_t start_pfn;
+    uint32_t end_pfn;
+    uint64_t p2m_pfns[0];
+};
+
+/* VCPU_CONTEXT_{basic,extended} */
+struct __packed rec_x86_pv_vcpu
+{
+    uint32_t vcpu_id;
+    uint32_t _res1;
+    uint8_t context[0];
+};
+
+/* VCPU_CONTEXT_XSAVE */
+struct __packed rec_x86_pv_vcpu_xsave
+{
+    uint32_t vcpu_id;
+    uint32_t _res1;
+    uint64_t xfeature_mask;
+    uint8_t  context[0];
+};
+
+/* TSC_INFO */
+struct __packed rec_tsc_info
+{
+    uint32_t mode;
+    uint32_t khz;
+    uint64_t nsec;
+    uint32_t incarnation;
+};
+
+/* HVM_CONTEXT */
+struct __packed rec_hvm_context
+{
+    uint8_t context[0];
+};
+
+/* HVM_PARAMS */
+struct __packed rec_hvm_params_entry
+{
+    uint64_t index;
+    uint64_t value;
+};
+
+struct __packed rec_hvm_params
+{
+    uint32_t count;
+    uint32_t _res1;
+    struct rec_hvm_params_entry param[0];
+};
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (2 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 3/9] tools/libxc: Stream specification and some common code Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 12:03   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 5/9] tools/libxc: common code Andrew Cooper
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/scripts/generate.py        |   59 +++
 .../libxc/saverestore/scripts/inspect-legacy32.py  |  167 +++++++++
 tools/libxc/saverestore/scripts/streamspec.py      |  106 ++++++
 tools/libxc/saverestore/scripts/verify.py          |  385 ++++++++++++++++++++
 4 files changed, 717 insertions(+)
 create mode 100755 tools/libxc/saverestore/scripts/generate.py
 create mode 100755 tools/libxc/saverestore/scripts/inspect-legacy32.py
 create mode 100644 tools/libxc/saverestore/scripts/streamspec.py
 create mode 100755 tools/libxc/saverestore/scripts/verify.py

diff --git a/tools/libxc/saverestore/scripts/generate.py b/tools/libxc/saverestore/scripts/generate.py
new file mode 100755
index 0000000..3b01e65
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/generate.py
@@ -0,0 +1,59 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+from streamspec import *
+import struct, sys
+
+ihdr = struct.pack(IHDR_FORMAT,
+                   0xffffffffffffffff, # Marker
+                   IHDR_IDENT,         # "XENF" in ASCII
+                   1,                  # Version
+                   IHDR_OPT_LE,        # Options
+                   0, 0                # Reserved
+                   )
+
+def emit_record(type, data):
+    length = len(data)
+
+    r = struct.pack(RH_FORMAT, type, length)
+    r += data
+
+    padding_len = (8 - (length & 7)) & 7
+    r += '\x00' * padding_len
+
+    sys.stdout.write(r)
+
+
+def emit_pv():
+
+    dhdr = struct.pack(DHDR_FORMAT,
+                       DHDR_TYPE_x86_pv,  # Type
+                       12, # Page size
+                       0,  # Reserved
+                       4,  # Xen major
+                       5   # Xen minor
+                       )
+
+    sys.stdout.write(ihdr)
+    sys.stdout.write(dhdr)
+
+    x86_pv_info = struct.pack(X86_PV_INFO_FORMAT,
+                              8, # Guest width
+                              4, # Guest levels
+                              0  # Options
+                              )
+
+    emit_record(REC_TYPE_x86_pv_info, x86_pv_info)
+    emit_record(REC_TYPE_end, "")
+    return 0
+
+
+def main(argv = sys.argv):
+
+    if len(argv) == 0 or argv[0] == "pv":
+        return emit_pv()
+    else:
+        return 1
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
diff --git a/tools/libxc/saverestore/scripts/inspect-legacy32.py b/tools/libxc/saverestore/scripts/inspect-legacy32.py
new file mode 100755
index 0000000..e20a8f2
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/inspect-legacy32.py
@@ -0,0 +1,167 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import sys
+import struct
+
+fin = None
+guest_width = 0
+guest_levels = 0
+
+class StreamError(StandardError):
+    pass
+
+def rdexact(n):
+    _ = fin.read(n)
+    if len(_) != n:
+        raise IOError("Stream truncated")
+    #print"> read 0x%x bytes" % (n, )
+    return _
+
+def unpack_exact(fmt):
+    l = struct.calcsize(fmt)
+    return struct.unpack(fmt, rdexact(l))
+
+def read_extended_info():
+    global guest_width, guest_level
+
+    sig, rem_length = struct.unpack("II", rdexact(8))
+
+    if sig != 0xffffffff:
+        raise StreamError("Bad extended info signature 0x%08x" % (sig,))
+    else:
+        print "Endended Info: length 0x%x" % (rem_length,)
+
+    so_far = 0
+    while so_far < rem_length:
+
+        blkid, datasz = struct.unpack("4sI", rdexact(8))
+        so_far += 8
+
+        print "  Record type: %s, size 0x%x" % (blkid, datasz)
+
+        # Eww, but this is how it is done :(
+        if blkid == "vcpu":
+            if datasz == 0x1430:
+                guest_width = 8
+                guest_levels = 4
+                print "    64bit domain, 4 levels"
+            else:
+                raise StreamError("Unable to determine guest width/level")
+
+        rdexact(datasz)
+        so_far += datasz
+
+    if so_far != rem_length:
+        raise StreamError("Overshot total Extended Info size. Consumed 0x%x bytes" % (so_far,))
+
+def read_chunks():
+
+    while True:
+
+        chunk_type, = struct.unpack("i", rdexact(4))
+        print "Chunk: type 0x%x" % (chunk_type,)
+
+        if chunk_type == 0:
+            print "  End"
+            return
+
+        elif chunk_type > 0:
+            print "  Page Batch"
+            pfn_array = rdexact(chunk_type * 4)
+            page_data = rdexact(chunk_type * 4096)
+
+        elif chunk_type == -2:
+            max_id, = unpack_exact("i")
+            bitmap = rdexact(((max_id/64) + 1) * 8)
+            print "  Vcpu info: max_id %d" % (max_id, )
+
+        elif chunk_type == -7:
+            mode, nsec, khz, incarn = unpack_exact("IQII")
+            print "  TSC_INFO: mode %s, %d ns, %d khz, %d incarn" % ( mode, nsec, khz, incarn)
+
+        elif chunk_type == -9:
+            print "  Last Checkpoint"
+
+        elif chunk_type == -12:
+            sz, = unpack_exact("I")
+            data = rdexact(sz)
+            print "  Compressed Data: sz 0x%x" % (sz, )
+
+        elif chunk_type == -18:
+            sz, = unpack_exact("I")
+            data = rdexact(sz)
+            print "  Toolstack Data: sz 0x%x" % (sz, )
+
+        else:
+            raise StreamError("Unrecognised chunk")
+
+def main(argv = sys.argv):
+    global fin
+
+    if len(argv) == 2:
+        fin = open(argv[1], "rb")
+    else:
+        fin = sys.stdin
+
+    try:
+        # Skip Xl header
+        if "Xen saved domain, xl format\n \0 \r" != rdexact(32):
+            raise StreamError("No xl header")
+
+        _, _, _, optlen = struct.unpack("=IIII", rdexact(16))
+        rdexact(optlen)
+        print "xl header skipped"
+
+        # P2M size
+        p2m_size, = struct.unpack("I", rdexact(4))
+        print "P2M Size: 0x%x" % (p2m_size,)
+
+        # Extended info
+        read_extended_info()
+
+        # P2M list
+
+        fpp = 4096/guest_width
+        p2m_len = (p2m_size + fpp - 1) / fpp
+
+        print "Reading p2m frames.  fpp: %d, p2m_len: %d" % (fpp, p2m_len)
+
+        p2m_frames = rdexact(p2m_len * 4)
+        if p2m_len < 20:
+            print list(struct.unpack("I" * p2m_len, p2m_frames))
+
+        read_chunks()
+
+        unmapped_pfn_count, = unpack_exact("I")
+        unmapped_pfn_list = rdexact(unmapped_pfn_count * 4)
+        print "Unmapped PFN count: 0x%x" % (unmapped_pfn_count, )
+
+        # VCPU Context fudge
+        _ = rdexact(0x1430)
+        _ = rdexact(128)
+        xfeature_mask, xsize = unpack_exact("QQ")
+        _ = rdexact(xsize)
+        print "Got VCPU information"
+
+        shared_info = rdexact(4096)
+        print "Got shinfo"
+
+        if fin.read(1) != "":
+            raise StreamError("Junk found on the end of the stream")
+
+    except (IOError, StreamError, ) as e:
+        print "Error: ", e
+        return 1
+
+    except RuntimeError as e:
+        print "Script error", e
+        print "Please fix me"
+        return 2
+
+    print "Done"
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv))
diff --git a/tools/libxc/saverestore/scripts/streamspec.py b/tools/libxc/saverestore/scripts/streamspec.py
new file mode 100644
index 0000000..ebb0bc6
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/streamspec.py
@@ -0,0 +1,106 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# Image Header
+IHDR_FORMAT = "!QIIHHI"
+
+IHDR_IDENT = 0x58454E46 # "XENF" in ASCII
+
+IHDR_OPT_ENDIAN_ = 0
+IHDR_OPT_LE = (0 << IHDR_OPT_ENDIAN_)
+IHDR_OPT_BE = (1 << IHDR_OPT_ENDIAN_)
+
+IHDR_OPT_RESZ_MASK = 0xfffe
+
+# Domain Header
+DHDR_FORMAT = "=IHHII"
+
+DHDR_TYPE_x86_pv  = 0x00000001
+DHDR_TYPE_x86_hvm = 0x00000002
+DHDR_TYPE_x86_pvh = 0x00000003
+DHDR_TYPE_arm     = 0x00000004
+
+dhdr_type_to_str = {
+    DHDR_TYPE_x86_pv  : "x86 PV",
+    DHDR_TYPE_x86_hvm : "x86 HVM",
+    DHDR_TYPE_x86_pvh : "x86 PVH",
+    DHDR_TYPE_arm     : "ARM",
+}
+
+RH_FORMAT = "=II"
+
+REC_TYPE_end                  = 0x00000000
+REC_TYPE_page_data            = 0x00000001
+REC_TYPE_x86_pv_info          = 0x00000002
+REC_TYPE_x86_pv_p2m_frames    = 0x00000003
+REC_TYPE_x86_pv_vcpu_basic    = 0x00000004
+REC_TYPE_x86_pv_vcpu_extended = 0x00000005
+REC_TYPE_x86_pv_vcpu_xsave    = 0x00000006
+REC_TYPE_shared_info          = 0x00000007
+REC_TYPE_tsc_info             = 0x00000008
+REC_TYPE_hvm_context          = 0x00000009
+REC_TYPE_hvm_params           = 0x0000000a
+REC_TYPE_toolstack            = 0x0000000b
+
+rec_type_to_str = {
+    REC_TYPE_end                  : "End",
+    REC_TYPE_page_data            : "Page data",
+    REC_TYPE_x86_pv_info          : "x86 PV info",
+    REC_TYPE_x86_pv_p2m_frames    : "x86 PV P2M frames",
+    REC_TYPE_x86_pv_vcpu_basic    : "x86 PV vcpu basic",
+    REC_TYPE_x86_pv_vcpu_extended : "x86 PV vcpu extended",
+    REC_TYPE_x86_pv_vcpu_xsave    : "x86 PV vcpu xsave",
+    REC_TYPE_shared_info          : "Shared info",
+    REC_TYPE_tsc_info             : "TSC info",
+    REC_TYPE_hvm_context          : "HVM context",
+    REC_TYPE_hvm_params           : "HVM params",
+    REC_TYPE_toolstack            : "Toolstack",
+}
+
+# page_data
+PAGE_DATA_FORMAT             = "=II"
+PAGE_DATA_PFN_MASK           = (1L << 52) - 1
+PAGE_DATA_PFN_RESZ_MASK      = ((1L << 60) - 1) & ~((1L << 52) - 1)
+
+# flags from xen/public/domctl.h: XEN_DOMCTL_PFINFO_* shifted by 32 bits
+PAGE_DATA_TYPE_SHIFT         = 60
+PAGE_DATA_TYPE_LTABTYPE_MASK = (0x7L << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LTAB_MASK     = (0xfL << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LPINTAB       = (0x8L << PAGE_DATA_TYPE_SHIFT) # Pinned pagetable
+
+PAGE_DATA_TYPE_NOTAB         = (0x0L << PAGE_DATA_TYPE_SHIFT) # Regular page
+PAGE_DATA_TYPE_L1TAB         = (0x1L << PAGE_DATA_TYPE_SHIFT) # L1 pagetable
+PAGE_DATA_TYPE_L2TAB         = (0x2L << PAGE_DATA_TYPE_SHIFT) # L2 pagetable
+PAGE_DATA_TYPE_L3TAB         = (0x3L << PAGE_DATA_TYPE_SHIFT) # L3 pagetable
+PAGE_DATA_TYPE_L4TAB         = (0x4L << PAGE_DATA_TYPE_SHIFT) # L4 pagetable
+PAGE_DATA_TYPE_BROKEN        = (0xdL << PAGE_DATA_TYPE_SHIFT) # Broken
+PAGE_DATA_TYPE_XALLOC        = (0xeL << PAGE_DATA_TYPE_SHIFT) # Allocate-only
+PAGE_DATA_TYPE_XTAB          = (0xfL << PAGE_DATA_TYPE_SHIFT) # Invalid
+
+# x86_pv_info
+X86_PV_INFO_FORMAT        = "=BBB"
+
+X86_PV_INFO_OPT_VMASST_   = 0
+X86_PV_INFO_OPT_VMASST    = (1 << X86_PV_INFO_OPT_VMASST_)
+
+X86_PV_INFO_OPT_RESZ_MASK = 0xfe
+
+# x86_pv_vcpu_{basic,extended}
+X86_PV_VCPU_FORMAT        = "=II"
+
+# x86_pv_vcpu_xsave
+X86_PV_VCPU_XSAVE_FORMAT  = "=IIQ"
+
+# tsc_info
+TSC_INFO_FORMAT           = "=IIQI"
+
+# hvm_params
+HVM_PARAMS_FORMAT         = "=II"
+HVM_PARAMS_ENTRY_FORMAT   = "=QQ"
+
+#
+# libxl format
+#
+
+LIBXL_QEMU_SIGNATURE = "DeviceModelRecord0002"
+LIBXL_QEMU_RECORD_HDR = "=%dsI" % (len(LIBXL_QEMU_SIGNATURE), )
diff --git a/tools/libxc/saverestore/scripts/verify.py b/tools/libxc/saverestore/scripts/verify.py
new file mode 100755
index 0000000..fdb8a87
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/verify.py
@@ -0,0 +1,385 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import sys
+import struct
+
+from streamspec import *
+
+class StreamError(StandardError):
+    pass
+
+class RecordError(StandardError):
+    pass
+
+def skip_xl_header(stream):
+
+    magic = stream.read(8)
+    if magic != "mat\n \0 \r":
+        return False
+
+    header = stream.read(16)
+    if len(header) != 16:
+        return False
+
+    _, _, _, optlen = struct.unpack("=IIII", header)
+
+    optdata = stream.read(optlen)
+    if len(optdata) != optlen:
+        return False
+
+    return True
+
+
+def verify_ihdr(stream):
+    """ Verify an image header """
+
+    datasz = struct.calcsize(IHDR_FORMAT)
+    data = stream.read(datasz)
+
+    # xl header record?
+    if data == "Xen saved domain, xl for":
+        if skip_xl_header(stream):
+            data = stream.read(datasz)
+        else:
+            raise StreamError("Invalid looking xl header on the stream")
+
+    if len(data) != datasz:
+        raise IOError("Truncated stream")
+
+    marker, id, version, options, res1, res2 = struct.unpack(IHDR_FORMAT, data)
+
+    if marker != 0xffffffffffffffff:
+        raise StreamError("Bad image marker: Expected 0xffffffffffffffff, "
+                          "got 0x%x" % (marker, ))
+
+    if id != 0x58454e46:
+        raise StreamError("Bad image id: Expected 0x0x58454e46, got 0x%x"
+                          % (id, ))
+
+    if version != 1:
+        raise StreamError("Unknown image version: Expected 1, got %d"
+                          % (version, ))
+
+    if options & IHDR_OPT_RESZ_MASK:
+        raise StreamError("Reserved bits set in image options field: 0x%x"
+                          % (options & IHDR_OPT_RESZ_MASK))
+
+    if res1 != 0 or res2 != 0:
+        raise StreamError("Reserved bits set in image header: 0x%04x:0x%08x"
+                          % (res1, res2))
+
+    if ( sys.byteorder == "little" and
+         (options & IHDR_OPT_ENDIAN_) != IHDR_OPT_LE ):
+        raise StreamError("Stream is not native endianess - unable to validate")
+
+    print "Valid Image Header:",
+    if options & IHDR_OPT_BE:
+        print "big endian"
+    else:
+        print "little endian"
+
+def verify_dhdr(stream):
+    """ Verify a domain header """
+
+    datasz = struct.calcsize(DHDR_FORMAT)
+    data = stream.read(datasz)
+
+    if len(data) != datasz:
+        raise IOError("Truncated stream")
+
+    type, page_shift, res1, major, minor = struct.unpack(DHDR_FORMAT, data)
+
+    if type not in dhdr_type_to_str:
+        raise StreamError("Unrecognised domain type 0x%x" % (type, ))
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in domain header 0x%04x"
+                          % (res1, ))
+
+    if page_shift != 12:
+        raise StreamError("Page shift expected to be 12.  Got %d"
+                          % (page_shift, ))
+
+    print "Valid Domain Header: %s from Xen %d.%d (page sz %d)" \
+        % (dhdr_type_to_str[type], major, minor, 2**page_shift)
+
+
+def verify_record_end(content):
+
+    if len(content) != 0:
+        raise RecordError("End record with non-zero length")
+
+def verify_page_data(content):
+    minsz = struct.calcsize(PAGE_DATA_FORMAT)
+
+    if len(content) <= minsz:
+        raise RecordError("PAGE_DATA record must be at least %d bytes long"
+                          % (minsz, ))
+
+    count, res1 = struct.unpack_from(PAGE_DATA_FORMAT, content)
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in PAGE_DATA record 0x%04x"
+                          % (res1, ))
+
+    pfnsz = count * 8
+    if (len(content) - minsz) < pfnsz:
+        raise RecordError("PAGE_DATA record must contain a pfn record for "
+                          "each count")
+
+    pfns = list(struct.unpack_from("=%dQ" % (count,), content, minsz))
+
+    nr_pages = 0
+    for idx, pfn in enumerate(pfns):
+
+        if pfn & PAGE_DATA_PFN_RESZ_MASK:
+            raise RecordError("Reserved bits set in pfn[%d]: 0x%016x",
+                              idx, pfn & PAGE_DATA_PFN_RESZ_MASK)
+
+        if pfn >> PAGE_DATA_TYPE_SHIFT in (5, 6, 7, 8):
+            raise RecordError("Invalid type value in pfn[%d]: 0x%016x",
+                              idx, pfn & PAGE_DATA_TYPE_LTAB_MASK)
+
+        # We expect page data for each normal page or pagetable
+        if PAGE_DATA_TYPE_NOTAB <= (pfn & PAGE_DATA_TYPE_LTABTYPE_MASK) <= PAGE_DATA_TYPE_L4TAB:
+            nr_pages += 1
+
+    pagesz = nr_pages * 4096
+    if len(content) != minsz + pfnsz + pagesz:
+        raise RecordError("Expected %u + %u + %u, got %u" % (minsz, pfnsz, pagesz, len(content)))
+
+
+def verify_record_x86_pv_vcpu_generic(content, name):
+    # Generic for both REC_TYPE_x86_pv_vcpu_{basic,extended}
+    minsz = struct.calcsize(X86_PV_VCPU_FORMAT)
+
+    if len(content) <= minsz:
+        raise RecordError("X86_PV_VCPU_%s record length must be at least %d"
+                          " bytes long" % (name, minsz))
+
+    vcpuid, res1 = struct.unpack_from(X86_PV_VCPU_FORMAT, content)
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in x86_pv_vcpu_%s record 0x%04x"
+                          % (name, res1))
+
+    print "  vcpu%d %s context, %d bytes" % (vcpuid, name, len(content) - minsz)
+
+def verify_record_x86_pv_vcpu_xsave(content):
+    minsz = struct.calcsize(X86_PV_VCPU_XSAVE_FORMAT)
+
+    if len(content) <= minsz:
+        raise RecordError("X86_PV_VCPU_XSAVE record length must be at least %d"
+                          " bytes long" % (minsz, ))
+
+    vcpuid, res1, xmask = struct.unpack_from(X86_PV_VCPU_XSAVE_FORMAT,
+                                             content)
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in X86_PV_VCPU_XSAVE record "
+                          "0x%04x" % (res1, ))
+
+    print "  vcpu%d xsave context, mask 0x%x" % (vcpuid, xmask)
+
+
+def verify_x86_pv_info(content):
+
+    if len(content) != 3:
+        raise RecordError("x86_pf_info: expected length of 3, got %d"
+                          % (len(content), ))
+
+    width, levels, options = struct.unpack(X86_PV_INFO_FORMAT, content)
+
+    if width not in (4, 8):
+        raise RecordError("Expected width of 4 or 8, got %d" % (width, ))
+
+    if levels not in (3, 4):
+        raise RecordError("Expected levels of 3 or 4, got %d" % (levels, ))
+
+    if (options & X86_PV_INFO_OPT_RESZ_MASK) != 0:
+        raise StreamError("Reserved bits set in X86_PV_INFO options: 0x%02x"
+                          % (options & X86_PV_INFO_OPT_RESZ_MASK, ))
+
+    bitness = {4:32, 8:64}[width]
+
+    print "  %sbit guest, %d levels of pagetables" % (bitness, levels)
+
+def verify_x86_pv_p2m_frames(content):
+
+    if len(content) % 8 != 0:
+        raise RecordError("Length expected to be a multiple of 8, not %d"
+                          % (len(content), ))
+
+    start, end = struct.unpack_from("=II", content)
+
+    print "  Start pfn 0x%x, End 0x%x" % (start, end)
+
+def verify_record_shared_info(content):
+
+    if len(content) != 4096:
+        raise RecordError("Length expected to be 4906 bytes, not %d"
+                          % (len(content), ))
+
+def verify_record_tsc_info(content):
+
+    sz = struct.calcsize(TSC_INFO_FORMAT)
+
+    if len(content) != sz:
+        raise RecordError("Length should be %u bytes" % (sz, ))
+
+    mode, khz, nsec, incarn = struct.unpack(TSC_INFO_FORMAT, content)
+    print ("  Mode %u, %u kHz, %u ns, incarnation %d"
+           % (mode, khz, nsec, incarn))
+
+def verify_record_hvm_context(content):
+
+    if len(content) == 0:
+        raise RecordError("Zero length HVM context")
+
+def verify_record_hvm_params(content):
+
+    sz = struct.calcsize(HVM_PARAMS_FORMAT)
+
+    if len(content) < sz:
+        raise RecordError("Length should be at least %u bytes" % (sz, ))
+
+    count, rsvd = struct.unpack(HVM_PARAMS_FORMAT, content[:sz])
+
+    if rsvd != 0:
+        raise RecordError("Reserved field not zero (0x%04x)" % (rsvd, ))
+
+    sz += count * struct.calcsize(HVM_PARAMS_ENTRY_FORMAT)
+
+    if len(content) != sz:
+        raise RecordError("Length should be %u bytes" % (sz, ))
+
+def verify_toolstack(content):
+    # Opaque blob -- nothing to verify.
+    pass
+
+record_verifiers = {
+    REC_TYPE_end : verify_record_end,
+    REC_TYPE_page_data : verify_page_data,
+
+    REC_TYPE_x86_pv_info: verify_x86_pv_info,
+    REC_TYPE_x86_pv_p2m_frames: verify_x86_pv_p2m_frames,
+
+    REC_TYPE_x86_pv_vcpu_basic :
+        lambda x: verify_record_x86_pv_vcpu_generic(x, "basic"),
+    REC_TYPE_x86_pv_vcpu_extended :
+        lambda x: verify_record_x86_pv_vcpu_generic(x, "extended"),
+    REC_TYPE_x86_pv_vcpu_xsave : verify_record_x86_pv_vcpu_xsave,
+
+    REC_TYPE_shared_info: verify_record_shared_info,
+    REC_TYPE_tsc_info: verify_record_tsc_info,
+
+    REC_TYPE_hvm_context: verify_record_hvm_context,
+    REC_TYPE_hvm_params: verify_record_hvm_params,
+    REC_TYPE_toolstack: verify_toolstack,
+}
+
+_squahsed_data_records = 0
+def verify_record(stream):
+    """ Verify a record """
+    global _squahsed_data_records
+
+    datasz = struct.calcsize(RH_FORMAT)
+    data = stream.read(datasz)
+
+    if len(data) != datasz:
+        raise IOError("Truncated stream")
+
+    type, length = struct.unpack(RH_FORMAT, data)
+
+    if type not in rec_type_to_str:
+        raise StreamError("Unrecognised record type %x" % (type, ))
+
+    contentsz = (length + 7) & ~7
+    content = stream.read(contentsz)
+
+    if len(content) != contentsz:
+        raise IOError("Truncated stream")
+
+    padding = content[length:]
+    if padding != "\x00" * len(padding):
+        raise StreamError("Padding containging non0 bytes found")
+
+    if type != REC_TYPE_page_data:
+
+        if _squahsed_data_records > 0:
+            print ("Squashed %d valid Page Data records together"
+                   % (_squahsed_data_records, ))
+            _squahsed_data_records = 0
+
+        print ("Valid Record Header: %s, length %d"
+               % (rec_type_to_str[type], length))
+
+    else:
+        _squahsed_data_records += 1
+
+    if type not in record_verifiers:
+        raise RuntimeError("No verification function")
+    else:
+        record_verifiers[type](content[:length])
+
+    return type
+
+def verify_qemu_record(fin):
+
+    sz = struct.calcsize(LIBXL_QEMU_RECORD_HDR)
+
+    hdr = fin.read(sz)
+
+    if len(hdr) == 0:
+        return
+
+    if len(hdr) < sz:
+        raise StreamError("Junk found on the end of the stream")
+
+    (sig, length) = struct.unpack(LIBXL_QEMU_RECORD_HDR, hdr)
+
+    if sig != LIBXL_QEMU_SIGNATURE or length == 0:
+        raise StreamError("Junk found on the end of the stream")
+
+    qemu_record = fin.read(length)
+
+    if len(qemu_record) != length:
+        raise StreamError("Truncated qemu save record")
+
+    print("Libxl qemu save record, length %u" % (length, ))
+
+def main(argv = sys.argv):
+
+    if len(argv) == 2:
+        fin = open(argv[1], "rb")
+    else:
+        fin = sys.stdin
+
+    try:
+        verify_ihdr(fin)
+        verify_dhdr(fin)
+
+        while verify_record(fin) != REC_TYPE_end:
+            pass
+
+        verify_qemu_record(fin)
+
+        if fin.read(1) != "":
+            raise StreamError("Junk found on the end of the stream")
+
+    except (IOError, StreamError, RecordError) as e:
+        print "Error: ", e
+        return 1
+
+    except RuntimeError as e:
+        print "Script error", e
+        print "Please fix me"
+        return 2
+
+    print "Done"
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv))
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 5/9] tools/libxc: common code
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (3 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 13:03   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 6/9] tools/libxc: x86 pv save implementation Andrew Cooper
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.c         |   87 ++++++
 tools/libxc/saverestore/common.h         |  172 ++++++++++++
 tools/libxc/saverestore/common_x86.c     |   54 ++++
 tools/libxc/saverestore/common_x86.h     |   21 ++
 tools/libxc/saverestore/common_x86_hvm.c |   53 ++++
 tools/libxc/saverestore/common_x86_pv.c  |  431 ++++++++++++++++++++++++++++++
 tools/libxc/saverestore/common_x86_pv.h  |  104 +++++++
 tools/libxc/saverestore/restore.c        |  288 ++++++++++++++++++++
 tools/libxc/saverestore/save.c           |   42 +++
 9 files changed, 1252 insertions(+)
 create mode 100644 tools/libxc/saverestore/common_x86.c
 create mode 100644 tools/libxc/saverestore/common_x86.h
 create mode 100644 tools/libxc/saverestore/common_x86_hvm.c
 create mode 100644 tools/libxc/saverestore/common_x86_pv.c
 create mode 100644 tools/libxc/saverestore/common_x86_pv.h

diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
index de2e727..b159c4c 100644
--- a/tools/libxc/saverestore/common.c
+++ b/tools/libxc/saverestore/common.c
@@ -1,3 +1,5 @@
+#include <assert.h>
+
 #include "common.h"
 
 static const char *dhdr_types[] =
@@ -52,6 +54,91 @@ const char *rec_type_to_str(uint32_t type)
     return "Reserved";
 }
 
+int write_split_record(struct context *ctx, struct record *rec,
+                       void *buf, size_t sz)
+{
+    static const char zeroes[7] = { 0 };
+    xc_interface *xch = ctx->xch;
+    uint32_t combined_length = rec->length + sz;
+    size_t record_length = (combined_length + 7) & ~7UL;
+
+    if ( record_length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
+              " exceeds max (0x%"PRIx32")", rec->type,
+              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    if ( rec->length )
+        assert(rec->data);
+    if ( sz )
+        assert(buf);
+
+    if ( write_exact(ctx->fd, &rec->type, sizeof rec->type) ||
+         write_exact(ctx->fd, &combined_length, sizeof rec->length) ||
+         (rec->length && write_exact(ctx->fd, rec->data, rec->length)) ||
+         (sz && write_exact(ctx->fd, buf, sz)) ||
+         write_exact(ctx->fd, zeroes, record_length - combined_length) )
+    {
+        PERROR("Unable to write record to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
+int read_record(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rhdr rhdr;
+    size_t datasz;
+
+    if ( read_exact(ctx->fd, &rhdr, sizeof rhdr) )
+    {
+        PERROR("Failed to read Record Header from stream");
+        return -1;
+    }
+    else if ( rhdr.length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
+              " exceeds max (0x%"PRIx32")",
+              rhdr.type, rec_type_to_str(rhdr.type),
+              rhdr.length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    datasz = (rhdr.length + 7) & ~7U;
+
+    if ( datasz )
+    {
+        rec->data = malloc(datasz);
+
+        if ( !rec->data )
+        {
+            ERROR("Unable to allocate %zu bytes for record data (0x%08"PRIx32", %s)",
+                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+
+        if ( read_exact(ctx->fd, rec->data, datasz) )
+        {
+            free(rec->data);
+            rec->data = NULL;
+            PERROR("Failed to read %zu bytes of data for record (0x%08"PRIx32", %s)",
+                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+    }
+    else
+        rec->data = NULL;
+
+    rec->type   = rhdr.type;
+    rec->length = rhdr.length;
+
+    return 0;
+};
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index fff0a39..a35eda7 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -1,7 +1,20 @@
 #ifndef __COMMON__H
 #define __COMMON__H
 
+#include <stdbool.h>
+
+// Hack out junk from the namespace
+#define mfn_to_pfn __UNUSED_mfn_to_pfn
+#define pfn_to_mfn __UNUSED_pfn_to_mfn
+
 #include "../xg_private.h"
+#include "../xg_save_restore.h"
+#include "../xc_dom.h"
+#include "../xc_bitops.h"
+
+#undef mfn_to_pfn
+#undef pfn_to_mfn
+
 
 #include "stream_format.h"
 
@@ -11,6 +24,165 @@
 const char *dhdr_type_to_str(uint32_t type);
 const char *rec_type_to_str(uint32_t type);
 
+struct context;
+
+struct save_restore_ops
+{
+    bool (*pfn_is_valid)(struct context *ctx, xen_pfn_t pfn);
+    xen_pfn_t (*pfn_to_gfn)(struct context *ctx, xen_pfn_t pfn);
+    void (*set_gfn)(struct context *ctx, xen_pfn_t pfn, xen_pfn_t gfn);
+    void (*set_page_type)(struct context *ctx, xen_pfn_t pfn, xen_pfn_t type);
+    int (*normalise_page)(struct context *ctx, xen_pfn_t type, void **page);
+    int (*localise_page)(struct context *ctx, uint32_t type, void *page);
+};
+
+struct context
+{
+    xc_interface *xch;
+    uint32_t domid;
+    int fd;
+
+    xc_dominfo_t dominfo;
+
+    struct save_restore_ops ops;
+
+    union
+    {
+        struct
+        {
+            /* From Image Header */
+            uint32_t format_version;
+
+            /* From Domain Header */
+            uint32_t guest_type;
+            uint32_t guest_page_size;
+
+            unsigned long xenstore_mfn, console_mfn;
+            unsigned int xenstore_evtchn, console_evtchn;
+            domid_t xenstore_domid, console_domid;
+
+            struct restore_callbacks *callbacks;
+
+            /* Bitmap of currently populated PFNs during restore. */
+            unsigned long *populated_pfns;
+            unsigned int max_populated_pfn;
+        } restore;
+
+        struct
+        {
+            unsigned long p2m_size;
+
+            struct save_callbacks *callbacks;
+        } save;
+    };
+
+    xen_pfn_t *batch_pfns;
+    unsigned nr_batch_pfns;
+    unsigned long *deferred_pages;
+
+    union
+    {
+        struct
+        {
+            /* 4 or 8; 32 or 64 bit domain */
+            unsigned int width;
+            /* 3 or 4 pagetable levels */
+            unsigned int levels;
+
+
+            /* Maximum Xen frame */
+            unsigned long max_mfn;
+            /* Read-only machine to phys map */
+            xen_pfn_t *m2p;
+            /* first mfn of the compat m2p (Only needed for 32bit PV guests) */
+            xen_pfn_t compat_m2p_mfn0;
+            /* Number of m2p frames mapped */
+            unsigned long nr_m2p_frames;
+
+
+            /* Maximum guest frame */
+            unsigned long max_pfn;
+            /* Frames per page in guest p2m */
+            unsigned int fpp;
+
+            /* Number of frames making up the p2m */
+            unsigned int p2m_frames;
+            /* Guest's phys to machine map.  Mapped read-only (save) or
+             * allocated locally (restore).  Uses guest unsigned longs. */
+            void *p2m;
+            /* The guest pfns containing the p2m leaves */
+            xen_pfn_t *p2m_pfns;
+            /* Types for each page */
+            uint32_t *pfn_types;
+
+            /* Read-only mapping of guests shared info page */
+            shared_info_any_t *shinfo;
+        } x86_pv;
+    };
+};
+
+/*
+ * Write the image and domain headers to the stream.
+ * (to eventually make static in save.c)
+ */
+int write_headers(struct context *ctx, uint16_t guest_type);
+
+extern struct save_restore_ops save_restore_ops_x86_pv;
+extern struct save_restore_ops save_restore_ops_x86_hvm;
+
+struct record
+{
+    uint32_t type;
+    uint32_t length;
+    void *data;
+};
+
+/*
+ * Writes a split record to the stream, applying correct padding where
+ * appropriate.  It is common when sending records containing blobs from Xen
+ * that the header and blob data are separate.  This function accepts a second
+ * buffer and length, and will merge it with the main record when sending.
+ *
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non0 on failure.
+ */
+int write_split_record(struct context *ctx, struct record *rec, void *buf, size_t sz);
+
+/*
+ * Writes a record to the stream, applying correct padding where appropriate.
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non0 on failure.
+ */
+static inline int write_record(struct context *ctx, struct record *rec)
+{
+    return write_split_record(ctx, rec, NULL, 0);
+}
+
+/*
+ * Reads a record from the stream, and fills in the record structure.
+ *
+ * Returns 0 on success and non-0 on failure.
+ *
+ * On success, the records type and size shall be valid.
+ * - If size is 0, data shall be NULL.
+ * - If size is non-0, data shall be a buffer allocated by malloc() which must
+ *   be passed to free() by the caller.
+ *
+ * On failure, the contents of the record structure are undefined.
+ */
+int read_record(struct context *ctx, struct record *rec);
+
+int write_page_data_and_pause(struct context *ctx);
+
+int handle_page_data(struct context *ctx, struct record *rec);
+
+int populate_pfns(struct context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/saverestore/common_x86.c b/tools/libxc/saverestore/common_x86.c
new file mode 100644
index 0000000..0a3d555
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86.c
@@ -0,0 +1,54 @@
+#include "common_x86.h"
+
+int write_tsc_info(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_tsc_info tsc = { 0 };
+    struct record rec =
+    {
+        .type = REC_TYPE_tsc_info,
+        .length = sizeof tsc,
+        .data = &tsc
+    };
+
+    if ( xc_domain_get_tsc_info(xch, ctx->domid, &tsc.mode,
+                                &tsc.nsec, &tsc.khz, &tsc.incarnation) < 0 )
+    {
+        PERROR("Unable to obtain TSC information");
+        return -1;
+    }
+
+    return write_record(ctx, &rec);
+}
+
+int handle_tsc_info(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_tsc_info *tsc = rec->data;
+
+    if ( rec->length != sizeof *tsc )
+    {
+        ERROR("TSC_INFO record wrong size: length %"PRIu32", expected %zu",
+              rec->length, sizeof *tsc);
+        return -1;
+    }
+
+    if ( xc_domain_set_tsc_info(xch, ctx->domid, tsc->mode,
+                                tsc->nsec, tsc->khz, tsc->incarnation) )
+    {
+        PERROR("Unable to set TSC information");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86.h b/tools/libxc/saverestore/common_x86.h
new file mode 100644
index 0000000..429532a
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86.h
@@ -0,0 +1,21 @@
+#ifndef __COMMON_X86__H
+#define __COMMON_X86__H
+
+#include "common.h"
+
+/* Obtains and writes domain TSC information to the stream */
+int write_tsc_info(struct context *ctx);
+
+/* Parses domain TSC information from the stream */
+int handle_tsc_info(struct context *ctx, struct record *rec);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86_hvm.c b/tools/libxc/saverestore/common_x86_hvm.c
new file mode 100644
index 0000000..0b9aac2
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_hvm.c
@@ -0,0 +1,53 @@
+#include "common.h"
+
+static bool x86_hvm_pfn_is_valid(struct context *ctx, xen_pfn_t pfn)
+{
+    return true;
+}
+
+static xen_pfn_t x86_hvm_pfn_to_gfn(struct context *ctx, xen_pfn_t pfn)
+{
+    return pfn;
+}
+
+static void x86_hvm_set_gfn(struct context *ctx, xen_pfn_t pfn,
+                            xen_pfn_t gfn)
+{
+    /* no op */
+}
+
+static void x86_hvm_set_page_type(struct context *ctx, xen_pfn_t pfn, xen_pfn_t type)
+{
+    /* no-op */
+}
+
+static int x86_hvm_normalise_page(struct context *ctx, xen_pfn_t type, void **page)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_localise_page(struct context *ctx, uint32_t type, void *page)
+{
+    /* no-op */
+    return 0;
+}
+
+struct save_restore_ops save_restore_ops_x86_hvm = {
+    .pfn_is_valid   = x86_hvm_pfn_is_valid,
+    .pfn_to_gfn     = x86_hvm_pfn_to_gfn,
+    .set_gfn        = x86_hvm_set_gfn,
+    .set_page_type  = x86_hvm_set_page_type,
+    .normalise_page = x86_hvm_normalise_page,
+    .localise_page  = x86_hvm_localise_page
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86_pv.c b/tools/libxc/saverestore/common_x86_pv.c
new file mode 100644
index 0000000..35bce27
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_pv.c
@@ -0,0 +1,431 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
+{
+    assert(mfn <= ctx->x86_pv.max_mfn);
+    return ctx->x86_pv.m2p[mfn];
+}
+
+static bool x86_pv_pfn_is_valid(struct context *ctx, xen_pfn_t pfn)
+{
+    return pfn <= ctx->x86_pv.max_pfn;
+}
+
+static xen_pfn_t x86_pv_pfn_to_gfn(struct context *ctx, xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    if ( ctx->x86_pv.width == sizeof (uint64_t) )
+        /* 64 bit guest.  Need to truncate their pfns for 32 bit toolstacks */
+        return ((uint64_t *)ctx->x86_pv.p2m)[pfn];
+    else
+    {
+        /* 32 bit guest.  Need to expand INVALID_MFN fot 64 bit toolstacks */
+        uint32_t mfn = ((uint32_t *)ctx->x86_pv.p2m)[pfn];
+
+        return mfn == ~0U ? INVALID_MFN : mfn;
+    }
+}
+
+static void x86_pv_set_page_type(struct context *ctx, xen_pfn_t pfn,
+                                 unsigned long type)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    ctx->x86_pv.pfn_types[pfn] = type;
+}
+
+static void x86_pv_set_gfn(struct context *ctx, xen_pfn_t pfn,
+                           xen_pfn_t mfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    if ( ctx->x86_pv.width == sizeof (uint64_t) )
+        /* 64 bit guest.  Need to expand INVALID_MFN for 32 bit toolstacks */
+        ((uint64_t *)ctx->x86_pv.p2m)[pfn] = mfn == INVALID_MFN ? ~0ULL : mfn;
+    else
+        /* 32 bit guest.  Can safely truncate INVALID_MFN fot 64 bit toolstacks */
+        ((uint32_t *)ctx->x86_pv.p2m)[pfn] = mfn;
+}
+
+static int normalise_pagetable(struct context *ctx, const uint64_t *src,
+                               uint64_t *dst, unsigned long type)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t pte;
+    unsigned i, xen_first = -1, xen_last = -1; /* Indicies of Xen mappings */
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( ctx->x86_pv.levels == 4 )
+    {
+        /* 64bit guests only have Xen mappings in their L4 tables */
+        if ( type == XEN_DOMCTL_PFINFO_L4TAB )
+        {
+            xen_first = 256;
+            xen_last = 271;
+        }
+    }
+    else
+    {
+        switch ( type )
+        {
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            ERROR("??? Found L4 table for 32bit guest");
+            errno = EINVAL;
+            return -1;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            /* 32bit guests can only use the first 4 entries of their L3 tables.
+             * All other are potentially used by Xen. */
+            xen_first = 4;
+            xen_last = 512;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
+             * are normal but only a few will have Xen mappings.
+             *
+             * 428 = (HYPERVISOR_VIRT_START_PAE >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff
+             *
+             * ...which is conveniently unavailable to us in a 64bit build.
+             */
+            if ( pte_to_frame(ctx, src[428]) == ctx->x86_pv.compat_m2p_mfn0 )
+            {
+                xen_first = 428;
+                xen_last = 512;
+            }
+            break;
+        }
+    }
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        xen_pfn_t mfn, pfn;
+
+        pte = src[i];
+
+        /* Remove Xen mappings: Xen will reconstruct on the other side */
+        if ( i >= xen_first && i <= xen_last )
+            pte = 0;
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            mfn = pte_to_frame(ctx, pte);
+
+            if ( pte & _PAGE_PSE )
+            {
+                ERROR("Cannot migrate superpage (L%lu[%u]: 0x%016"PRIx64")",
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                errno = E2BIG;
+                return -1;
+            }
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                /* This is expected during the live part of migration given
+                 * split pagetable updates, active grant mappings etc.  The
+                 * pagetable will need to be resent after pausing.  It is
+                 * however fatal if we have already paused the domain. */
+                if ( !ctx->dominfo.paused )
+                    errno = EAGAIN;
+                else
+                {
+                    ERROR("Bad MFN for L%lu[%u]",
+                          type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i);
+                    pseudophysmap_walk(ctx, mfn);
+                    errno = ERANGE;
+                }
+                return -1;
+            }
+            else
+                pfn = mfn_to_pfn(ctx, mfn);
+
+            update_pte(ctx, &pte, pfn);
+        }
+
+        dst[i] = pte;
+    }
+
+    return 0;
+}
+
+static int x86_pv_normalise_page(struct context *ctx, xen_pfn_t type,
+                                 void **page)
+{
+    xc_interface *xch = ctx->xch;
+    void *local_page;
+    int rc;
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    local_page = malloc(PAGE_SIZE);
+    if ( !local_page )
+    {
+        ERROR("Unable to allocate scratch page");
+        rc = -1;
+        goto out;
+    }
+
+    rc = normalise_pagetable(ctx, *page, local_page, type);
+    *page = local_page;
+
+  out:
+    return rc;
+}
+
+static int x86_pv_localise_page(struct context *ctx, uint32_t type, void *page)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t *table = page;
+    uint64_t pte;
+    unsigned i;
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    /* Only page tables need localisation. */
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        pte = table[i];
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            xen_pfn_t mfn, pfn;
+
+            pfn = pte_to_frame(ctx, pte);
+            mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+
+            if ( mfn == INVALID_MFN )
+            {
+                if ( populate_pfns(ctx, 1, &pfn, &type) )
+                    return -1;
+
+                mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+            }
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                ERROR("Bad MFN for L%lu[%u]",
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i);
+                pseudophysmap_walk(ctx, mfn);
+                errno = ERANGE;
+                return -1;
+            }
+
+            update_pte(ctx, &pte, mfn);
+
+            table[i] = pte;
+        }
+    }
+
+    return 0;
+}
+
+struct save_restore_ops save_restore_ops_x86_pv = {
+    .pfn_is_valid   = x86_pv_pfn_is_valid,
+    .pfn_to_gfn     = x86_pv_pfn_to_gfn,
+    .set_page_type  = x86_pv_set_page_type,
+    .set_gfn        = x86_pv_set_gfn,
+    .normalise_page = x86_pv_normalise_page,
+    .localise_page  = x86_pv_localise_page,
+};
+
+bool mfn_in_pseudophysmap(struct context *ctx, xen_pfn_t mfn)
+{
+    return ( (mfn <= ctx->x86_pv.max_mfn) &&
+             (mfn_to_pfn(ctx, mfn) <= ctx->x86_pv.max_pfn) &&
+             (ctx->ops.pfn_to_gfn(ctx, mfn_to_pfn(ctx, mfn) == mfn)) );
+}
+
+void pseudophysmap_walk(struct context *ctx, xen_pfn_t mfn)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn = ~0UL;
+
+    ERROR("mfn %#lx, max %#lx", mfn, ctx->x86_pv.max_mfn);
+
+    if ( (mfn != ~0UL) && (mfn <= ctx->x86_pv.max_mfn) )
+    {
+        pfn = ctx->x86_pv.m2p[mfn];
+        ERROR("  m2p[%#lx] = %#lx, max_pfn %#lx",
+              mfn, pfn, ctx->x86_pv.max_pfn);
+    }
+
+    if ( (pfn != ~0UL) && (pfn <= ctx->x86_pv.max_pfn) )
+        ERROR("  p2m[%#lx] = %#lx",
+              pfn, ctx->ops.pfn_to_gfn(ctx, pfn));
+}
+
+xen_pfn_t cr3_to_mfn(struct context *ctx, uint64_t cr3)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return cr3 >> 12;
+    else
+        return (((uint32_t)cr3 >> 12) | ((uint32_t)cr3 << 20));
+}
+
+uint64_t mfn_to_cr3(struct context *ctx, xen_pfn_t mfn)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return ((uint64_t)mfn) << 12;
+    else
+        return (((uint32_t)mfn << 12) | ((uint32_t)mfn >> 20));
+}
+
+int x86_pv_domain_info(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned int guest_width, guest_levels, fpp;
+    int max_pfn;
+
+    /* Get the domain width */
+    if ( xc_domain_get_guest_width(xch, ctx->domid, &guest_width) )
+    {
+        PERROR("Unable to determine dom%d's width", ctx->domid);
+        return -1;
+    }
+    else if ( guest_width == 4 )
+        guest_levels = 3;
+    else if ( guest_width == 8 )
+        guest_levels = 4;
+    else
+    {
+        ERROR("Invalid guest width %d.  Expected 32 or 64", guest_width);
+        return -1;
+    }
+    ctx->x86_pv.width = guest_width;
+    ctx->x86_pv.levels = guest_levels;
+    ctx->x86_pv.fpp = fpp = PAGE_SIZE / ctx->x86_pv.width;
+
+    DPRINTF("%d bits, %d levels", guest_width * 8, guest_levels);
+
+    /* Get the domains maximum pfn */
+    max_pfn = xc_domain_maximum_gpfn(xch, ctx->domid);
+    if ( max_pfn < 0 )
+    {
+        PERROR("Unable to obtain guests max pfn");
+        return -1;
+    }
+    else if ( max_pfn >= ~XEN_DOMCTL_PFINFO_LTAB_MASK )
+    {
+        errno = E2BIG;
+        PERROR("Cannot save a guest this large %#x");
+        return -1;
+    }
+    else if ( max_pfn > 0 )
+    {
+        ctx->x86_pv.max_pfn = max_pfn;
+        ctx->x86_pv.p2m_frames = (ctx->x86_pv.max_pfn + fpp) / fpp;
+
+        DPRINTF("max_pfn %#x, p2m_frames %d", max_pfn, ctx->x86_pv.p2m_frames);
+    }
+
+    return 0;
+}
+
+int x86_pv_map_m2p(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    long max_page = xc_maximum_ram_page(xch);
+    unsigned long m2p_chunks, m2p_size;
+    privcmd_mmap_entry_t *entries = NULL;
+    xen_pfn_t *extents_start = NULL;
+    int rc = -1, i;
+
+    if ( max_page < 0 )
+    {
+        PERROR("Failed to get maximum ram page");
+        goto err;
+    }
+
+    ctx->x86_pv.max_mfn = max_page;
+    m2p_size   = M2P_SIZE(ctx->x86_pv.max_mfn);
+    m2p_chunks = M2P_CHUNKS(ctx->x86_pv.max_mfn);
+
+    extents_start = malloc(m2p_chunks * sizeof(xen_pfn_t));
+    if ( !extents_start )
+    {
+        ERROR("Unable to allocate %zu bytes for m2p mfns",
+              m2p_chunks * sizeof(xen_pfn_t));
+        goto err;
+    }
+
+    if ( xc_machphys_mfn_list(xch, m2p_chunks, extents_start) )
+    {
+        PERROR("Failed to get m2p mfn list");
+        goto err;
+    }
+
+    entries = malloc(m2p_chunks * sizeof(privcmd_mmap_entry_t));
+    if ( !entries )
+    {
+        ERROR("Unable to allocate %zu bytes for m2p mapping mfns",
+              m2p_chunks * sizeof(privcmd_mmap_entry_t));
+        goto err;
+    }
+
+    for ( i = 0; i < m2p_chunks; ++i )
+        entries[i].mfn = extents_start[i];
+
+    ctx->x86_pv.m2p = xc_map_foreign_ranges(
+        xch, DOMID_XEN, m2p_size, PROT_READ,
+        M2P_CHUNK_SIZE, entries, m2p_chunks);
+
+    if ( !ctx->x86_pv.m2p )
+    {
+        PERROR("Failed to mmap m2p ranges");
+        goto err;
+    }
+
+    ctx->x86_pv.nr_m2p_frames = (M2P_CHUNK_SIZE >> PAGE_SHIFT) * m2p_chunks;
+
+#ifdef __i386__
+    /* 32 bit toolstacks automatically get the compat m2p */
+    ctx->x86_pv.compat_m2p_mfn0 = entries[0].mfn;
+#else
+    /* 64 bit toolstacks need to ask Xen specially for it */
+    {
+        struct xen_machphys_mfn_list xmml = {
+            .max_extents = 1,
+            .extent_start = { &ctx->x86_pv.compat_m2p_mfn0 }
+        };
+
+        rc = do_memory_op(xch, XENMEM_machphys_compat_mfn_list,
+                          &xmml, sizeof xmml);
+        if ( rc || xmml.nr_extents != 1 )
+        {
+            PERROR("Failed to get compat mfn list from Xen");
+            rc = -1;
+            goto err;
+        }
+    }
+#endif
+
+    /* All Done */
+    rc = 0;
+    DPRINTF("max_mfn %#lx", ctx->x86_pv.max_mfn);
+
+err:
+    free(entries);
+    free(extents_start);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86_pv.h b/tools/libxc/saverestore/common_x86_pv.h
new file mode 100644
index 0000000..c7315b6
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_pv.h
@@ -0,0 +1,104 @@
+#ifndef __COMMON_X86_PV_H
+#define __COMMON_X86_PV_H
+
+#include "common_x86.h"
+
+/*
+ * Convert an mfn to a pfn, given Xens m2p table.
+ *
+ * Caller must ensure that the requested mfn is in range.
+ */
+xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Convert a pfn to an mfn, given the guests p2m table.
+ *
+ * Caller must ensure that the requested pfn is in range.
+ */
+xen_pfn_t pfn_to_mfn(struct context *ctx, xen_pfn_t pfn);
+
+/*
+ * Set a mapping in the p2m table.
+ *
+ * Caller must ensure that the requested pfn is in range.
+ */
+void set_p2m(struct context *ctx, xen_pfn_t pfn, xen_pfn_t mfn);
+
+/*
+ * Query whether a particular mfn is valid in the physmap of a guest.
+ */
+bool mfn_in_pseudophysmap(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Debug a particular mfn by walking the p2m and m2p.
+ */
+void pseudophysmap_walk(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Convert a PV cr3 field to an mfn.
+ */
+xen_pfn_t cr3_to_mfn(struct context *ctx, uint64_t cr3);
+
+/*
+ * Convert an mfn to a PV cr3 field.
+ */
+uint64_t mfn_to_cr3(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Extract an MFN from a Pagetable Entry.
+ */
+static inline xen_pfn_t pte_to_frame(struct context *ctx, uint64_t pte)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return (pte >> PAGE_SHIFT) & ((1ULL << (52 - PAGE_SHIFT)) - 1);
+    else
+        return (pte >> PAGE_SHIFT) & ((1ULL << (44 - PAGE_SHIFT)) - 1);
+}
+
+static inline void update_pte(struct context *ctx, uint64_t *pte, xen_pfn_t pfn)
+{
+    if ( ctx->x86_pv.width == 8 )
+        *pte &= ~(((1ULL << (52 - PAGE_SHIFT)) - 1) << PAGE_SHIFT);
+    else
+        *pte &= ~(((1ULL << (44 - PAGE_SHIFT)) - 1) << PAGE_SHIFT);
+
+    *pte |= (uint64_t)pfn << PAGE_SHIFT;
+}
+
+/*
+ * Get current domain information.
+ *
+ * Fills ctx->x86_pv
+ * - .width
+ * - .levels
+ * - .fpp
+ * - .p2m_frames
+ *
+ * Used by the save side to create the X86_PV_INFO record, and by the restore
+ * side to verify the incoming stream.
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_domain_info(struct context *ctx);
+
+/*
+ * Maps the Xen M2P.
+ *
+ * Fills ctx->x86_pv.
+ * - .max_mfn
+ * - .m2p
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_map_m2p(struct context *ctx);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
index 6624baa..5834d38 100644
--- a/tools/libxc/saverestore/restore.c
+++ b/tools/libxc/saverestore/restore.c
@@ -12,6 +12,294 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
     return -1;
 }
 
+static bool pfn_is_populated(struct context *ctx, xen_pfn_t pfn)
+{
+    if ( !ctx->restore.populated_pfns || pfn > ctx->restore.max_populated_pfn )
+        return false;
+    return test_bit(pfn, ctx->restore.populated_pfns);
+}
+
+static int pfn_set_populated(struct context *ctx, xen_pfn_t pfn)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( !ctx->restore.populated_pfns || pfn > ctx->restore.max_populated_pfn )
+    {
+        unsigned long new_max_pfn = ((pfn + 1024) & ~1023) - 1;
+        size_t old_sz, new_sz;
+        unsigned long *p;
+
+        old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
+        new_sz = bitmap_size(new_max_pfn + 1);
+
+        p  = realloc(ctx->restore.populated_pfns, new_sz);
+        if ( !p )
+        {
+            PERROR("Failed to realloc populated bitmap");
+            return -1;
+        }
+
+        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+
+        ctx->restore.populated_pfns    = p;
+        ctx->restore.max_populated_pfn = new_max_pfn;
+    }
+
+    set_bit(pfn, ctx->restore.populated_pfns);
+
+    return 0;
+}
+
+int populate_pfns(struct context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof *mfns),
+        *pfns = malloc(count * sizeof *pfns);
+    unsigned i, nr_pfns = 0;
+    int rc = -1;
+
+    if ( !mfns || !pfns )
+    {
+        ERROR("Failed to allocate %zu bytes for populating the physmap",
+              2 * count * sizeof *mfns);
+        goto err;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        if ( types[i] != XEN_DOMCTL_PFINFO_XTAB &&
+             types[i] != XEN_DOMCTL_PFINFO_BROKEN &&
+             !pfn_is_populated(ctx, original_pfns[i]) )
+        {
+            pfns[nr_pfns] = mfns[nr_pfns] = original_pfns[i];
+            ++nr_pfns;
+        }
+    }
+
+    if ( nr_pfns )
+    {
+        rc = xc_domain_populate_physmap_exact(xch, ctx->domid, nr_pfns, 0, 0, mfns);
+        if ( rc )
+        {
+            PERROR("Failed to populate physmap");
+            goto err;
+        }
+
+        for ( i = 0; i < nr_pfns; ++i )
+        {
+            rc = pfn_set_populated(ctx, pfns[i]);
+            if ( rc )
+                goto err;
+            ctx->ops.set_gfn(ctx, pfns[i], mfns[i]);
+        }
+    }
+
+    rc = 0;
+
+ err:
+    free(pfns);
+    free(mfns);
+
+    return rc;
+}
+
+static int process_page_data(struct context *ctx, unsigned count,
+                             xen_pfn_t *pfns, uint32_t *types, void *page_data)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof *mfns);
+    int *map_errs = malloc(count * sizeof *map_errs);
+    int rc = -1;
+    void *mapping = NULL, *guest_page = NULL;
+    unsigned i,    /* i indexes the pfns from the record */
+        j,         /* j indexes the subset of pfns we decide to map */
+        nr_pages;
+
+    if ( !mfns || !map_errs )
+    {
+        ERROR("Failed to allocate %zu bytes to process page data",
+              count * (sizeof *mfns + sizeof *map_errs));
+        goto err;
+    }
+
+    rc = populate_pfns(ctx, count, pfns, types);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", count);
+        goto err;
+    }
+    rc = -1;
+
+    for ( i = 0, nr_pages = 0; i < count; ++i )
+    {
+        ctx->ops.set_page_type(ctx, pfns[i], types[i]);
+
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_NOTAB:
+
+        case XEN_DOMCTL_PFINFO_L1TAB:
+        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+            mfns[nr_pages++] = ctx->ops.pfn_to_gfn(ctx, pfns[i]);
+            break;
+        }
+
+    }
+
+    if ( nr_pages > 0 )
+    {
+        mapping = guest_page = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ | PROT_WRITE,
+            mfns, map_errs, nr_pages);
+        if ( !mapping )
+        {
+            PERROR("Unable to map %u mfns for %u pages of data",
+                   nr_pages, count);
+            goto err;
+        }
+    }
+
+    for ( i = 0, j = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_XTAB:
+        case XEN_DOMCTL_PFINFO_BROKEN:
+            /* Nothing at all to do */
+        case XEN_DOMCTL_PFINFO_XALLOC:
+            /* Nothing futher to do */
+            continue;
+        }
+
+        if ( map_errs[j] )
+        {
+            ERROR("Mapping pfn %lx (mfn %lx, type %#"PRIx32")failed with %d",
+                  pfns[i], mfns[j], types[i], map_errs[j]);
+            goto err;
+        }
+
+        memcpy(guest_page, page_data, PAGE_SIZE);
+
+        /* Undo page normalisation done by the saver. */
+        rc = ctx->ops.localise_page(ctx, types[i], guest_page);
+        if ( rc )
+        {
+            DPRINTF("Failed to localise");
+            goto err;
+        }
+
+        ++j;
+        guest_page += PAGE_SIZE;
+        page_data += PAGE_SIZE;
+    }
+
+    rc = 0;
+
+ err:
+    if ( mapping )
+        munmap(mapping, nr_pages * PAGE_SIZE);
+
+    free(map_errs);
+    free(mfns);
+
+    return rc;
+}
+
+int handle_page_data(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_page_data_header *pages = rec->data;
+    unsigned i, pages_of_data = 0;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL, pfn;
+    uint32_t *types = NULL, type;
+
+    static unsigned pg_count;
+    pg_count++;
+
+    if ( rec->length < sizeof *pages )
+    {
+        ERROR("PAGE_DATA record trucated: length %"PRIu32", min %zu",
+              rec->length, sizeof *pages);
+        goto err;
+    }
+    else if ( pages->count < 1 )
+    {
+        ERROR("Expected at least 1 pfn in PAGE_DATA record");
+        goto err;
+    }
+    else if ( rec->length < sizeof *pages + (pages->count * sizeof (uint64_t)) )
+    {
+        ERROR("PAGE_DATA record (length %"PRIu32") too short to contain %"
+              PRIu32" pfns worth of information", rec->length, pages->count);
+        goto err;
+    }
+
+    pfns = malloc(pages->count * sizeof *pfns);
+    types = malloc(pages->count * sizeof *types);
+    if ( !pfns || !types )
+    {
+        ERROR("Unable to allocate enough memory for %"PRIu32" pfns",
+              pages->count);
+        goto err;
+    }
+
+    for ( i = 0; i < pages->count; ++i )
+    {
+        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
+        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
+        {
+            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
+            goto err;
+        }
+
+        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
+        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
+             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
+        {
+            ERROR("Invalid type %#lx for pfn %#lx (index %u)", type, pfn, i);
+            goto err;
+        }
+        else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
+            /* NOTAB and all L1 thru L4 tables (including pinned) should have
+             * a page worth of data in the record. */
+            pages_of_data++;
+
+        pfns[i] = pfn;
+        types[i] = type;
+    }
+
+    if ( rec->length != (sizeof *pages +
+                         (sizeof (uint64_t) * pages->count) +
+                         (PAGE_SIZE * pages_of_data)) )
+    {
+        ERROR("PAGE_DATA record wrong size: length %"PRIu32", expected "
+              "%zu + %zu + %zu", sizeof *pages,
+              (sizeof (uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
+        goto err;
+    }
+
+    rc = process_page_data(ctx, pages->count, pfns, types,
+                           &pages->pfn[pages->count]);
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index c013e62..e842e6c 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -1,5 +1,47 @@
+#include <arpa/inet.h>
+
 #include "common.h"
 
+int write_headers(struct context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
+    struct ihdr ihdr =
+        {
+            .marker  = IHDR_MARKER,
+            .id      = htonl(IHDR_ID),
+            .version = htonl(IHDR_VERSION),
+            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
+        };
+    struct dhdr dhdr =
+        {
+            .type       = guest_type,
+            .page_shift = 12,
+            .xen_major  = (xen_version >> 16) & 0xffff,
+            .xen_minor  = (xen_version)       & 0xffff,
+        };
+
+    if ( xen_version < 0 )
+    {
+        PERROR("Unable to obtain Xen Version");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &ihdr, sizeof ihdr) )
+    {
+        PERROR("Unable to write Image Header to stream");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &dhdr, sizeof dhdr) )
+    {
+        PERROR("Unable to write Domain Header to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
 int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
                     uint32_t max_factor, uint32_t flags,
                     struct save_callbacks* callbacks, int hvm,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 6/9] tools/libxc: x86 pv save implementation
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (4 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 5/9] tools/libxc: common code Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 13:58   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 7/9] tools/libxc: x86 pv restore implementation Andrew Cooper
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h      |   20 ++
 tools/libxc/saverestore/save.c        |  453 +++++++++++++++++++++++++-
 tools/libxc/saverestore/save_x86_pv.c |  568 +++++++++++++++++++++++++++++++++
 3 files changed, 1040 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxc/saverestore/save_x86_pv.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index a35eda7..116eb13 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -12,6 +12,8 @@
 #include "../xc_dom.h"
 #include "../xc_bitops.h"
 
+#undef GET_FIELD
+#undef SET_FIELD
 #undef mfn_to_pfn
 #undef pfn_to_mfn
 
@@ -121,6 +123,9 @@ struct context
     };
 };
 
+/* Saves an x86 PV domain. */
+int save_x86_pv(struct context *ctx);
+
 /*
  * Write the image and domain headers to the stream.
  * (to eventually make static in save.c)
@@ -137,6 +142,21 @@ struct record
     void *data;
 };
 
+/* Gets a field from an *_any union */
+#define GET_FIELD(_c, _p, _f)                   \
+    ({ (_c)->x86_pv.width == 8 ?                \
+            (_p)->x64._f:                       \
+            (_p)->x32._f;                       \
+    })                                          \
+
+/* Gets a field from an *_any union */
+#define SET_FIELD(_c, _p, _f, _v)               \
+    ({ if ( (_c)->x86_pv.width == 8 )           \
+            (_p)->x64._f = (_v);                \
+        else                                    \
+            (_p)->x32._f = (_v);                \
+    })
+
 /*
  * Writes a split record to the stream, applying correct padding where
  * appropriate.  It is common when sending records containing blobs from Xen
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index e842e6c..a19c217 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -1,3 +1,4 @@
+#include <assert.h>
 #include <arpa/inet.h>
 
 #include "common.h"
@@ -42,13 +43,463 @@ int write_headers(struct context *ctx, uint16_t guest_type)
     return 0;
 }
 
+static int write_batch(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = NULL, *types = NULL;
+    void *guest_mapping = NULL;
+    void **guest_data = NULL;
+    void **local_pages = NULL;
+    int *errors = NULL, rc = -1;
+    unsigned i, p, nr_pages = 0;
+    unsigned nr_pfns = ctx->nr_batch_pfns;
+    void *page, *orig_page;
+
+    struct rec_page_data_header hdr;
+    uint32_t rec_type = REC_TYPE_page_data, rec_size, rec_count, rec_res = 0;
+    uint64_t *rec_pfns = NULL;
+    size_t s;
+
+    assert(nr_pfns != 0);
+
+    /* MFNs of the batch pfns */
+    mfns = malloc(nr_pfns * sizeof *mfns);
+    /* Types of the batch pfns */
+    types = malloc(nr_pfns * sizeof *types);
+    /* Errors from attempting to map the mfns */
+    errors = malloc(nr_pfns * sizeof *errors);
+    /* Pointers to page data to send.  Might be from mapped mfns or local allocations */
+    guest_data = calloc(nr_pfns, sizeof *guest_data);
+    /* Pointers to locally allocated pages.  Probably not all used, but need freeing */
+    local_pages = calloc(nr_pfns, sizeof *local_pages);
+
+    if ( !mfns || !types || !errors || !guest_data || !local_pages )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->batch_pfns[i]);
+
+        /* Likely a ballooned page */
+        if ( mfns[i] == INVALID_MFN )
+            set_bit(ctx->batch_pfns[i], ctx->deferred_pages);
+    }
+
+    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
+    if ( rc )
+    {
+        PERROR("Failed to get types for pfn batch");
+        goto err;
+    }
+    rc = -1;
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        mfns[nr_pages++] = mfns[i];
+    }
+
+    if ( nr_pages > 0 )
+    {
+        guest_mapping = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
+        if ( !guest_mapping )
+        {
+            PERROR("Failed to map guest pages");
+            goto err;
+        }
+    }
+
+    for ( i = 0, p = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        if ( errors[p] )
+        {
+            ERROR("Mapping of pfn %#lx (mfn %#lx) failed %d",
+                  ctx->batch_pfns[i], mfns[p], errors[p]);
+            goto err;
+        }
+
+        orig_page = page = guest_mapping + (p * PAGE_SIZE);
+        rc = ctx->ops.normalise_page(ctx, types[i], &page);
+        if ( rc )
+        {
+            if ( rc == -1 && errno == EAGAIN )
+            {
+                set_bit(ctx->batch_pfns[i], ctx->deferred_pages);
+                types[i] = XEN_DOMCTL_PFINFO_XTAB;
+                --nr_pages;
+            }
+            else
+                goto err;
+        }
+        else
+            guest_data[i] = page;
+
+        if ( page != orig_page )
+            local_pages[i] = page;
+        rc = -1;
+
+        ++p;
+    }
+
+    hdr.count = nr_pfns;
+    s = nr_pfns * sizeof *rec_pfns;
+
+
+    rec_pfns = malloc(s);
+    if ( !rec_pfns )
+    {
+        ERROR("Unable to allocate memory for page data pfn list");
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->batch_pfns[i];
+
+    /*        header +          pfns data           +        page data */
+    rec_size = 4 + 4 + (s) + (nr_pages * PAGE_SIZE);
+    rec_count = nr_pfns;
+
+    if ( write_exact(ctx->fd, &rec_type, sizeof(uint32_t)) ||
+         write_exact(ctx->fd, &rec_size, sizeof(uint32_t)) ||
+         write_exact(ctx->fd, &rec_count, sizeof(uint32_t)) ||
+         write_exact(ctx->fd, &rec_res, sizeof(uint32_t)) )
+    {
+        PERROR("Failed to write page_type header to stream");
+        goto err;
+    }
+
+    if ( write_exact(ctx->fd, rec_pfns, s) )
+    {
+        PERROR("Failed to write page_type header to stream");
+        goto err;
+    }
+
+
+    for ( i = 0; i < nr_pfns; ++i )
+        if ( guest_data[i] )
+        {
+            if ( write_exact(ctx->fd, guest_data[i], PAGE_SIZE) )
+            {
+                PERROR("Failed to write page into stream");
+                goto err;
+            }
+
+            --nr_pages;
+        }
+
+    /* Sanity check */
+    assert(nr_pages == 0);
+    ctx->nr_batch_pfns = 0;
+    rc = 0;
+
+ err:
+    free(rec_pfns);
+    if ( guest_mapping )
+        munmap(guest_mapping, nr_pages * PAGE_SIZE);
+    for ( i = 0; local_pages && i < nr_pfns; ++i )
+            free(local_pages[i]);
+    free(local_pages);
+    free(guest_data);
+    free(errors);
+    free(types);
+    free(mfns);
+
+    return rc;
+}
+
+static int flush_batch(struct context *ctx)
+{
+    int rc = 0;
+
+    if ( ctx->nr_batch_pfns == 0 )
+        return rc;
+
+    rc = write_batch(ctx);
+
+    if ( !rc )
+    {
+        /* Valgrind sanity check */
+        free(ctx->batch_pfns);
+        ctx->batch_pfns = malloc(MAX_BATCH_SIZE * sizeof *ctx->batch_pfns);
+        rc = !ctx->batch_pfns;
+    }
+
+    return rc;
+}
+
+static int add_to_batch(struct context *ctx, xen_pfn_t pfn)
+{
+    int rc = 0;
+
+    if ( ctx->nr_batch_pfns == MAX_BATCH_SIZE )
+        rc = flush_batch(ctx);
+
+    if ( rc == 0 )
+        ctx->batch_pfns[ctx->nr_batch_pfns++] = pfn;
+
+    return rc;
+}
+
+static int write_page_data_live(struct context *ctx,
+                                xc_hypercall_buffer_t *to_send_hbuf,
+                                xc_shadow_op_stats_t *shadow_stats)
+{
+    xc_interface *xch = ctx->xch;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, to_send, to_send_hbuf);
+    unsigned pages_written;
+    unsigned x, max_iter = 5, dirty_threshold = 50;
+    xen_pfn_t p;
+    int rc = -1;
+
+    if ( xc_shadow_control(xch, ctx->domid,
+                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                           NULL, 0, NULL, 0, NULL) < 0 )
+    {
+        PERROR("Failed to enable logdirty");
+        goto out;
+    }
+
+    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
+    {
+        DPRINTF("Iteration %u", x);
+
+        for ( p = 0 ; p < ctx->save.p2m_size; ++p )
+        {
+            if ( test_bit(p, to_send) )
+            {
+                rc = add_to_batch(ctx, p);
+                if ( rc )
+                {
+                    ERROR("Fatal write error :s");
+                    goto out;
+                }
+
+                ++pages_written;
+            }
+        }
+
+        rc = flush_batch(ctx);
+        if ( rc )
+        {
+            ERROR("Fatal write error :s");
+            goto out;
+        }
+        rc = -1;
+
+        if ( xc_shadow_control(
+                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                 HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+                 NULL, 0, shadow_stats) != ctx->save.p2m_size )
+        {
+            PERROR("Failed to retrieve logdirty bitmap");
+            goto out;
+        }
+        else
+        {
+            DPRINTF("  Wrote %u pages; stats: faults %"PRIu32", dirty %"PRIu32,
+                    pages_written, shadow_stats->fault_count,
+                    shadow_stats->dirty_count);
+        }
+
+        if ( shadow_stats->dirty_count < dirty_threshold )
+            break;
+
+        pages_written = 0;
+    }
+    rc = 0;
+
+out:
+    return rc;
+}
+
+
+static int pause_domain(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->dominfo.paused )
+    {
+        rc = (ctx->save.callbacks->suspend(ctx->save.callbacks->data) != 1);
+        if ( rc )
+        {
+            ERROR("Failed to suspend domain");
+            return rc;
+        }
+    }
+
+    IPRINTF("Domain now paused");
+
+    return 0;
+}
+
+static int write_page_data_paused(struct context *ctx,
+                                  xc_hypercall_buffer_t *to_send_hbuf,
+                                  xc_shadow_op_stats_t *shadow_stats)
+{
+    xc_interface *xch = ctx->xch;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, to_send, to_send_hbuf);
+    xen_pfn_t p;
+    unsigned int pages_written;
+    int rc = -1;
+
+    if ( xc_shadow_control(
+             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+             NULL, 0, shadow_stats) != ctx->save.p2m_size )
+    {
+        PERROR("Failed to retrieve logdirty bitmap");
+        goto err;
+    }
+
+    /*
+     * Domain must be paused from this point onwards.
+     */
+
+    for ( p = 0, pages_written = 0 ; p < ctx->save.p2m_size; ++p )
+    {
+        if ( test_bit(p, to_send) || test_bit(p, ctx->deferred_pages) )
+        {
+            if ( add_to_batch(ctx, p) )
+            {
+                PERROR("Fatal error for pfn %lx", p);
+                goto err;
+            }
+            ++pages_written;
+        }
+    }
+    DPRINTF("  Wrote %u pages", pages_written);
+
+    rc = flush_batch(ctx);
+    if ( rc )
+    {
+        ERROR("Fatal write error :s");
+        goto err;
+    }
+
+  err:
+    return rc;
+}
+
+int write_page_data_and_pause(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
+    xc_shadow_op_stats_t shadow_stats;
+    int rc;
+
+    ctx->batch_pfns = malloc(MAX_BATCH_SIZE * sizeof *ctx->batch_pfns);
+    if ( !ctx->batch_pfns )
+    {
+        ERROR("Unable to allocate memory for page batch list");
+        rc = -1;
+        goto out;
+    }
+
+    to_send = xc_hypercall_buffer_alloc_pages(
+        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
+    ctx->deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
+
+    if ( !to_send || !ctx->deferred_pages )
+    {
+        ERROR("Unable to allocate memory for to_{send,fix} bitmaps");
+        rc = -1;
+        goto out;
+    }
+
+    memset(to_send, 0xff, bitmap_size(ctx->save.p2m_size));
+
+    rc = write_page_data_live(ctx, HYPERCALL_BUFFER(to_send), &shadow_stats);
+    if ( rc )
+        goto out;
+
+    rc = pause_domain(ctx);
+    if ( rc )
+        goto out;
+
+    rc = write_page_data_paused(ctx, HYPERCALL_BUFFER(to_send), &shadow_stats);
+    if ( rc )
+        goto out;
+
+    rc = 0;
+
+  out:
+    xc_hypercall_buffer_free_pages(xch, to_send,
+                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
+    free(ctx->deferred_pages);
+    free(ctx->batch_pfns);
+    return rc;
+}
+
 int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
                     uint32_t max_factor, uint32_t flags,
                     struct save_callbacks* callbacks, int hvm,
                     unsigned long vm_generationid_addr)
 {
+    struct context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* Older GCC cant initialise anonymous unions */
+    ctx.save.callbacks = callbacks;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %d does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+    IPRINTF("Saving domain %d", dom);
+
+    ctx.save.p2m_size = xc_domain_maximum_gpfn(xch, dom) + 1;
+    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
+    {
+        errno = E2BIG;
+        ERROR("Cannot save this big a guest");
+        return -1;
+    }
+
+    if ( ctx.dominfo.hvm )
+    {
+        ERROR("HVM Save not supported yet");
+        return -1;
+    }
+    else
+    {
+        ctx.ops = save_restore_ops_x86_pv;
+        return save_x86_pv(&ctx);
+    }
 }
 
 /*
diff --git a/tools/libxc/saverestore/save_x86_pv.c b/tools/libxc/saverestore/save_x86_pv.c
new file mode 100644
index 0000000..c82f7f0
--- /dev/null
+++ b/tools/libxc/saverestore/save_x86_pv.c
@@ -0,0 +1,568 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+static int map_shinfo(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    ctx->x86_pv.shinfo = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ, ctx->dominfo.shared_info_frame);
+    if ( !ctx->x86_pv.shinfo )
+    {
+        PERROR("Failed to map shared info frame at pfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        return -1;
+    }
+
+    return 0;
+}
+
+static void copy_pfns_from_guest(struct context *ctx, xen_pfn_t *dst,
+                                 void *src, size_t count)
+{
+    size_t x;
+
+    if ( ctx->x86_pv.width == sizeof(unsigned long) )
+        memcpy(dst, src, count * sizeof *dst);
+    else
+    {
+        for ( x = 0; x < count; ++x )
+        {
+#ifdef __x86_64__
+            /* 64bit toolstack, 32bit guest.  Expand any INVALID_MFN. */
+            uint32_t s = ((uint32_t *)src)[x];
+
+            dst[x] = s == ~0U ? INVALID_MFN : s;
+#else
+            /* 32bit toolstack, 64bit guest.  Truncate their pointers */
+            dst[x] = ((uint64_t *)src)[x];
+#endif
+        }
+    }
+}
+
+static int map_p2m(struct context *ctx)
+{
+    /* Terminology:
+     *
+     * fll   - frame list list, top level p2m, list of fl mfns
+     * fl    - frame list, mid level p2m, list of leaf mfns
+     * local - own allocated buffers, adjusted for bitness
+     * guest - mappings into the domain
+     */
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    unsigned tries = 100, x, fpp, fll_entries, fl_entries;
+    xen_pfn_t fll_mfn;
+
+    xen_pfn_t *local_fll = NULL;
+    void *guest_fll = NULL;
+    size_t local_fll_size;
+
+    xen_pfn_t *local_fl = NULL;
+    void *guest_fl = NULL;
+    size_t local_fl_size;
+
+    fpp = ctx->x86_pv.fpp = PAGE_SIZE / ctx->x86_pv.width;
+    fll_entries = (ctx->x86_pv.max_pfn / (fpp * fpp)) + 1;
+    fl_entries  = (ctx->x86_pv.max_pfn / fpp) + 1;
+
+    fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list);
+    if ( !fll_mfn )
+        IPRINTF("Waiting for domain to set up its p2m frame list list");
+
+    while ( tries-- && !fll_mfn )
+    {
+        usleep(10000);
+        fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo,
+                            arch.pfn_to_mfn_frame_list_list);
+    }
+
+    if ( !fll_mfn )
+    {
+        ERROR("Timed out waiting for p2m frame list list to be updated");
+        goto err;
+    }
+
+    /* Map the guest top p2m */
+    guest_fll = xc_map_foreign_range(xch, ctx->domid, PAGE_SIZE,
+                                     PROT_READ, fll_mfn);
+    if ( !guest_fll )
+    {
+        PERROR("Failed to map p2m frame list list at %#lx", fll_mfn);
+        goto err;
+    }
+
+    local_fll_size = fll_entries * sizeof *local_fll;
+    local_fll = malloc(local_fll_size);
+    if ( !local_fll )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list list",
+              local_fll_size);
+        goto err;
+    }
+
+    copy_pfns_from_guest(ctx, local_fll, guest_fll, fll_entries);
+
+    /* Map the guest mid p2m frames */
+    guest_fl = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                    local_fll, fll_entries);
+    if ( !guest_fl )
+    {
+        PERROR("Failed to map p2m frame list");
+        goto err;
+    }
+
+    local_fl_size = fl_entries * sizeof *local_fl;
+    local_fl = malloc(local_fl_size);
+    if ( !local_fl )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list",
+              local_fl_size);
+        goto err;
+    }
+
+    copy_pfns_from_guest(ctx, local_fl, guest_fl, fl_entries);
+
+    /* Map the p2m leaves themselves */
+    ctx->x86_pv.p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                           local_fl, fl_entries);
+    if ( !ctx->x86_pv.p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    ctx->x86_pv.p2m_frames = fl_entries;
+    ctx->x86_pv.p2m_pfns = malloc(local_fl_size);
+    if ( !ctx->x86_pv.p2m_pfns )
+    {
+        ERROR("Cannot allocate %zu bytes for p2m pfns list",
+              local_fl_size);
+        goto err;
+    }
+
+    /* Convert leaf frames from mfns to pfns */
+    for ( x = 0; x < fl_entries; ++x )
+        if ( !mfn_in_pseudophysmap(ctx, local_fl[x]) )
+        {
+            ERROR("Bad MFN in p2m_frame_list[%u]", x);
+            pseudophysmap_walk(ctx, local_fl[x]);
+            errno = ERANGE;
+            goto err;
+        }
+        else
+            ctx->x86_pv.p2m_pfns[x] = mfn_to_pfn(ctx, local_fl[x]);
+
+    rc = 0;
+err:
+
+    free(local_fl);
+    if ( guest_fl )
+        munmap(guest_fl, fll_entries * PAGE_SIZE);
+
+    free(local_fll);
+    if ( guest_fll )
+        munmap(guest_fll, PAGE_SIZE);
+
+    return rc;
+}
+
+static int write_one_vcpu_basic(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn;
+    unsigned i;
+    int rc = -1;
+    vcpu_guest_context_any_t vcpu;
+    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
+    struct record rec =
+    {
+        .type = REC_TYPE_x86_pv_vcpu_basic,
+        .length = sizeof vhdr,
+        .data = &vhdr,
+    };
+
+    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
+    {
+        PERROR("Failed to get vcpu%u context", id);
+        goto err;
+    }
+
+    /* Vcpu 0 is special: Convert the suspend record to a PFN */
+    if ( id == 0 )
+    {
+        mfn = GET_FIELD(ctx, &vcpu, user_regs.edx);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad MFN for suspend record");
+            pseudophysmap_walk(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(ctx, &vcpu, user_regs.edx, mfn_to_pfn(ctx, mfn));
+    }
+
+    /* Convert GDT frames to PFNs */
+    for ( i = 0; (i * 512) < GET_FIELD(ctx, &vcpu, gdt_ents); ++i )
+    {
+        mfn = GET_FIELD(ctx, &vcpu, gdt_frames[i]);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad MFN for frame %u of vcpu%u's GDT", i, id);
+            pseudophysmap_walk(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(ctx, &vcpu, gdt_frames[i], mfn_to_pfn(ctx, mfn));
+    }
+
+    /* Convert CR3 to a PFN */
+    mfn = cr3_to_mfn(ctx, GET_FIELD(ctx, &vcpu, ctrlreg[3]));
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Bad MFN for vcpu%u's cr3", id);
+        pseudophysmap_walk(ctx, mfn);
+        errno = ERANGE;
+        goto err;
+    }
+    pfn = mfn_to_pfn(ctx, mfn);
+    SET_FIELD(ctx, &vcpu, ctrlreg[3], mfn_to_cr3(ctx, pfn));
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to PFN */
+    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
+    {
+        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad MFN for vcpu%u's cr1", id);
+            pseudophysmap_walk(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+
+        pfn = mfn_to_pfn(ctx, mfn);
+        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
+    }
+
+    if ( ctx->x86_pv.width == 8 )
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x64);
+    else
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x32);
+
+    if ( rc )
+        goto err;
+
+    DPRINTF("Writing vcpu%u basic context", id);
+    rc = 0;
+ err:
+
+    return rc;
+}
+
+static int write_one_vcpu_extended(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
+    struct record rec =
+    {
+        .type = REC_TYPE_x86_pv_vcpu_extended,
+        .length = sizeof vhdr,
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
+        .domain = ctx->domid,
+        .u.ext_vcpucontext.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u extended context", id);
+        return -1;
+    }
+
+    rc = write_split_record(ctx, &rec, &domctl.u.ext_vcpucontext,
+                            domctl.u.ext_vcpucontext.size);
+    if ( rc )
+        return rc;
+
+    DPRINTF("Writing vcpu%u extended context", id);
+
+    return 0;
+}
+
+static int write_one_vcpu_xsave(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct rec_x86_pv_vcpu_xsave vhdr = { .vcpu_id = id };
+    struct record rec =
+    {
+        .type = REC_TYPE_x86_pv_vcpu_xsave,
+        .length = sizeof vhdr,
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_getvcpuextstate,
+        .domain = ctx->domid,
+        .u.vcpuextstate.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's xsave context", id);
+        goto err;
+    }
+
+    if ( !domctl.u.vcpuextstate.xfeature_mask )
+    {
+        DPRINTF("vcpu%u has no xsave context - skipping", id);
+        goto out;
+    }
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, domctl.u.vcpuextstate.size);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %"PRIx64" bytes for vcpu%u's xsave context",
+              domctl.u.vcpuextstate.size, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%u's xsave context", id);
+        goto err;
+    }
+
+    vhdr.xfeature_mask = domctl.u.vcpuextstate.xfeature_mask;
+
+    rc = write_split_record(ctx, &rec, buffer, domctl.u.vcpuextstate.size);
+    if ( rc )
+        goto err;
+
+    DPRINTF("Writing vcpu%u xsave context", id);
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+static int write_all_vcpu_information(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xc_vcpuinfo_t vinfo;
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
+    {
+        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
+        if ( rc )
+        {
+            PERROR("Failed to get vcpu%u information", i);
+            return rc;
+        }
+
+        if ( !vinfo.online )
+        {
+            DPRINTF("vcpu%u offline - skipping", i);
+            continue;
+        }
+
+        rc = write_one_vcpu_basic(ctx, i) ?:
+            write_one_vcpu_extended(ctx, i) ?:
+            write_one_vcpu_xsave(ctx, i);
+        if ( rc )
+            return rc;
+    };
+
+    return 0;
+}
+
+static int write_x86_pv_info(struct context *ctx)
+{
+    struct rec_x86_pv_info info =
+        {
+            .guest_width = ctx->x86_pv.width,
+            .pt_levels = ctx->x86_pv.levels,
+        };
+    struct record rec =
+        {
+            .type = REC_TYPE_x86_pv_info,
+            .length = sizeof info,
+            .data = &info
+        };
+
+    return write_record(ctx, &rec);
+}
+
+static int write_x86_pv_p2m_frames(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc; unsigned i;
+    size_t datasz = ctx->x86_pv.p2m_frames * sizeof(uint64_t);
+    uint64_t *data = NULL;
+    struct rec_x86_pv_p2m_frames hdr =
+        {
+            .start_pfn = 0,
+            .end_pfn = ctx->x86_pv.max_pfn,
+        };
+    struct record rec =
+        {
+            .type = REC_TYPE_x86_pv_p2m_frames,
+            .length = sizeof hdr,
+            .data = &hdr,
+        };
+
+    /* No need to translate if sizeof(uint64_t) == sizeof(xen_pfn_t) */
+    if ( sizeof(uint64_t) != sizeof(*ctx->x86_pv.p2m_pfns) )
+    {
+        if ( !(data = malloc(datasz)) )
+        {
+            ERROR("Cannot allocate %zu bytes for X86_PV_P2M_FRAMES data", datasz);
+            return -1;
+        }
+
+        for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+            data[i] = ctx->x86_pv.p2m_pfns[i];
+    }
+    else
+        data = (uint64_t *)ctx->x86_pv.p2m_pfns;
+
+    rc = write_split_record(ctx, &rec, data, datasz);
+
+    if ( data != (uint64_t *)ctx->x86_pv.p2m_pfns )
+        free(data);
+
+    return rc;
+}
+
+static int write_shared_info(struct context *ctx)
+{
+    struct record rec =
+    {
+        .type = REC_TYPE_shared_info,
+        .length = PAGE_SIZE,
+        .data = ctx->x86_pv.shinfo,
+    };
+
+    return write_record(ctx, &rec);
+}
+
+int save_x86_pv(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+    struct record end = { REC_TYPE_end, 0, NULL };
+
+    IPRINTF("In experimental %s", __func__);
+
+    /* Write Image and Domain headers to the stream */
+    rc = write_headers(ctx, DHDR_TYPE_x86_pv);
+    if ( rc )
+        goto err;
+
+    /* Query some properties, and stash in the save context */
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        goto err;
+
+    /* Write an X86_PV_INFO record into the stream */
+    rc = write_x86_pv_info(ctx);
+    if ( rc )
+        goto err;
+
+    /* Map various structures */
+    rc = x86_pv_map_m2p(ctx) ?: map_shinfo(ctx) ?: map_p2m(ctx);
+    if ( rc )
+        goto err;
+
+    /* Write a full X86_PV_P2M_FRAMES record into the stream */
+    rc = write_x86_pv_p2m_frames(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_page_data_and_pause(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_shared_info(ctx);
+    if ( rc )
+        goto err;
+
+    /* Refresh domain information now it has paused */
+    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
+         (ctx->dominfo.domid != ctx->domid) )
+    {
+        PERROR("Unable to refresh domain information");
+        rc = -1;
+        goto err;
+    }
+    else if ( (!ctx->dominfo.shutdown ||
+               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
+              !ctx->dominfo.paused )
+    {
+        ERROR("Domain has not been suspended");
+        rc = -1;
+        goto err;
+    }
+
+    /* Write all the vcpu information */
+    rc = write_all_vcpu_information(ctx);
+    if ( rc )
+        goto err;
+
+    /* Write an END record */
+    rc = write_record(ctx, &end);
+    if ( rc )
+        goto err;
+
+    /* all done */
+    assert(!rc);
+    goto cleanup;
+
+ err:
+    assert(rc);
+ cleanup:
+
+    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                      NULL, 0, NULL, 0, NULL);
+
+    free(ctx->x86_pv.p2m_pfns);
+
+    if ( ctx->x86_pv.p2m )
+        munmap(ctx->x86_pv.p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    if ( ctx->x86_pv.shinfo )
+        munmap(ctx->x86_pv.shinfo, PAGE_SIZE);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 7/9] tools/libxc: x86 pv restore implementation
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (5 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 6/9] tools/libxc: x86 pv save implementation Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 14:10   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 8/9] tools/libxc: x86 hvm save implementation Andrew Cooper
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h         |   20 +
 tools/libxc/saverestore/restore.c        |  115 ++++-
 tools/libxc/saverestore/restore_x86_pv.c |  805 ++++++++++++++++++++++++++++++
 3 files changed, 939 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxc/saverestore/restore_x86_pv.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 116eb13..259cb1f 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -14,6 +14,8 @@
 
 #undef GET_FIELD
 #undef SET_FIELD
+#undef MEMCPY_FIELD
+#undef MEMSET_ARRAY_FIELD
 #undef mfn_to_pfn
 #undef pfn_to_mfn
 
@@ -125,6 +127,8 @@ struct context
 
 /* Saves an x86 PV domain. */
 int save_x86_pv(struct context *ctx);
+/* Restores an x86 PV domain. */
+int restore_x86_pv(struct context *ctx);
 
 /*
  * Write the image and domain headers to the stream.
@@ -157,6 +161,22 @@ struct record
             (_p)->x32._f = (_v);                \
     })
 
+/* memcpy field _f from _s to _d, of an *_any union */
+#define MEMCPY_FIELD(_c, _d, _s, _f)                                    \
+    ({ if ( (_c)->x86_pv.width == 8 )                                   \
+            memcpy(&(_d)->x64._f, &(_s)->x64._f, sizeof((_d)->x64._f)); \
+        else                                                            \
+            memcpy(&(_d)->x32._f, &(_s)->x32._f, sizeof((_d)->x32._f)); \
+    })
+
+/* memset array field _f with value _v, from an *_any union */
+#define MEMSET_ARRAY_FIELD(_c, _d, _f, _v)                              \
+    ({ if ( (_c)->x86_pv.width == 8 )                                   \
+           memset(&(_d)->x64._f[0], (_v), sizeof((_d)->x64._f));        \
+       else                                                             \
+           memset(&(_d)->x32._f[0], (_v), sizeof((_d)->x32._f));        \
+    })
+
 /*
  * Writes a split record to the stream, applying correct padding where
  * appropriate.  It is common when sending records containing blobs from Xen
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
index 5834d38..56eeb83 100644
--- a/tools/libxc/saverestore/restore.c
+++ b/tools/libxc/saverestore/restore.c
@@ -1,5 +1,62 @@
+#include <arpa/inet.h>
+
 #include "common.h"
 
+static int read_headers(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct ihdr ihdr;
+    struct dhdr dhdr;
+
+    if ( read_exact(ctx->fd, &ihdr, sizeof ihdr) )
+    {
+        PERROR("Failed to read Image Header from stream");
+        return -1;
+    }
+
+    ihdr.id      = ntohl(ihdr.id);
+    ihdr.version = ntohl(ihdr.version);
+    ihdr.options = ntohs(ihdr.options);
+
+    if ( ihdr.marker != IHDR_MARKER )
+    {
+        ERROR("Invalid marker: Got 0x%016"PRIx64, ihdr.marker);
+        return -1;
+    }
+    else if ( ihdr.id != IHDR_ID )
+    {
+        ERROR("Invalid ID: Expected 0x%08"PRIx32", Got 0x%08"PRIx32,
+              IHDR_ID, ihdr.id);
+        return -1;
+    }
+    else if ( ihdr.version != IHDR_VERSION )
+    {
+        ERROR("Invalid Version: Expected %d, Got %d", ihdr.version, IHDR_VERSION);
+        return -1;
+    }
+    else if ( ihdr.options & IHDR_OPT_BIG_ENDIAN )
+    {
+        ERROR("Unable to handle big endian streams");
+        return -1;
+    }
+
+    ctx->restore.format_version = ihdr.version;
+
+    if ( read_exact(ctx->fd, &dhdr, sizeof dhdr) )
+    {
+        PERROR("Failed to read Domain Header from stream");
+        return -1;
+    }
+
+    ctx->restore.guest_type = dhdr.type;
+    ctx->restore.guest_page_size = (1U << dhdr.page_shift);
+
+    IPRINTF("Found %s domain from Xen %d.%d",
+            dhdr_type_to_str(dhdr.type), dhdr.xen_major, dhdr.xen_minor);
+    return 0;
+}
+
+
 int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        unsigned int store_evtchn, unsigned long *store_mfn,
                        domid_t store_domid, unsigned int console_evtchn,
@@ -8,8 +65,64 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        int checkpointed_stream,
                        struct restore_callbacks *callbacks)
 {
+    struct context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    ctx.restore.console_evtchn = console_evtchn;
+    ctx.restore.console_domid = console_domid;
+    ctx.restore.xenstore_evtchn = store_evtchn;
+    ctx.restore.xenstore_domid = store_domid;
+    ctx.restore.callbacks = callbacks;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %d does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+    IPRINTF("Restoring domain %d", dom);
+
+    if ( read_headers(&ctx) )
+        return -1;
+
+    if ( ctx.dominfo.hvm )
+    {
+        ERROR("HVM Restore not supported yet");
+        return -1;
+    }
+    else
+    {
+        ctx.ops = save_restore_ops_x86_pv;
+        if ( restore_x86_pv(&ctx) )
+            return -1;
+    }
+
+    DPRINTF("XenStore: mfn %#lx, dom %d, evt %u",
+            ctx.restore.xenstore_mfn,
+            ctx.restore.xenstore_domid,
+            ctx.restore.xenstore_evtchn);
+
+    DPRINTF("Console: mfn %#lx, dom %d, evt %u",
+            ctx.restore.console_mfn,
+            ctx.restore.console_domid,
+            ctx.restore.console_evtchn);
+
+    *console_mfn = ctx.restore.console_mfn;
+    *store_mfn = ctx.restore.xenstore_mfn;
+
+    return 0;
 }
 
 static bool pfn_is_populated(struct context *ctx, xen_pfn_t pfn)
diff --git a/tools/libxc/saverestore/restore_x86_pv.c b/tools/libxc/saverestore/restore_x86_pv.c
new file mode 100644
index 0000000..8530750
--- /dev/null
+++ b/tools/libxc/saverestore/restore_x86_pv.c
@@ -0,0 +1,805 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
+#include "common_x86_pv.h"
+
+static int expand_p2m(struct context *ctx, unsigned long max_pfn)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long old_max = ctx->x86_pv.max_pfn, i;
+    unsigned long end_frame = (max_pfn + ctx->x86_pv.fpp) / ctx->x86_pv.fpp;
+    unsigned long old_end_frame = (old_max + ctx->x86_pv.fpp) / ctx->x86_pv.fpp;
+    xen_pfn_t *p2m = NULL, *p2m_pfns = NULL;
+    uint32_t *pfn_types = NULL;
+    size_t p2msz, p2m_pfnsz, pfn_typesz;
+
+    /* We expect expand_p2m to be called exactly once, expanding from 0 the
+     * domains max, but assert some sanity */
+    assert(max_pfn > old_max);
+
+    p2msz = (max_pfn + 1) * ctx->x86_pv.width;
+    p2m = realloc(ctx->x86_pv.p2m, p2msz);
+    if ( !p2m )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m", p2msz);
+        return -1;
+    }
+    ctx->x86_pv.p2m = p2m;
+
+    pfn_typesz = (max_pfn + 1) * sizeof *pfn_types;
+    pfn_types = realloc(ctx->x86_pv.pfn_types, pfn_typesz);
+    if ( !pfn_types )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for pfn_types", pfn_typesz);
+        return -1;
+    }
+    ctx->x86_pv.pfn_types = pfn_types;
+
+    p2m_pfnsz = (end_frame + 1) * sizeof *p2m_pfns;
+    p2m_pfns = realloc(ctx->x86_pv.p2m_pfns, p2m_pfnsz);
+    if ( !p2m_pfns )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m frame list", p2m_pfnsz);
+        return -1;
+    }
+    ctx->x86_pv.p2m_frames = end_frame;
+    ctx->x86_pv.p2m_pfns = p2m_pfns;
+
+    ctx->x86_pv.max_pfn = max_pfn;
+    for ( i = (old_max ? old_max + 1 : 0); i <= max_pfn; ++i )
+    {
+        ctx->ops.set_gfn(ctx, i, INVALID_MFN);
+        ctx->ops.set_page_type(ctx, i, 0);
+    }
+
+    for ( i = (old_end_frame ? old_end_frame + 1 : 0); i <= end_frame; ++i )
+        ctx->x86_pv.p2m_pfns[i] = INVALID_MFN;
+
+    DPRINTF("Expanded p2m from %#lx to %#lx", old_max, max_pfn);
+    return 0;
+}
+
+static int pin_pagetables(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long i;
+    struct mmuext_op pin;
+
+    DPRINTF("Pinning pagetables");
+
+    for ( i = 0; i <= ctx->x86_pv.max_pfn; ++i )
+    {
+        if ( (ctx->x86_pv.pfn_types[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( ctx->x86_pv.pfn_types[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            pin.cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin.cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin.cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin.cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+        default:
+            continue;
+        }
+
+        pin.arg1.mfn = ctx->ops.pfn_to_gfn(ctx, i);
+
+        if ( xc_mmuext_op(xch, &pin, 1, ctx->domid) != 0 )
+        {
+            PERROR("Failed to pin page table for pfn %#lx", i);
+            return -1;
+        }
+
+    }
+
+    return 0;
+}
+
+static int process_start_info(struct context *ctx, vcpu_guest_context_any_t *vcpu)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn, mfn;
+    start_info_any_t *guest_start_info = NULL;
+    int rc = -1;
+
+    pfn = GET_FIELD(ctx, vcpu, user_regs.edx);
+
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Start Info pfn %#lx out of range", pfn);
+        goto err;
+    }
+    else if ( ctx->x86_pv.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+    {
+        ERROR("Start Info pfn %#lx has bad type %lu", pfn,
+              ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Start Info has bad MFN");
+        pseudophysmap_walk(ctx, mfn);
+        goto err;
+    }
+
+    guest_start_info = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn);
+    if ( !guest_start_info )
+    {
+        PERROR("Failed to map Start Info at mfn %#lx", mfn);
+        goto err;
+    }
+
+    /* Deal with xenstore stuff */
+    pfn = GET_FIELD(ctx, guest_start_info, store_mfn);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("XenStore pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("XenStore pfn has bad MFN");
+        pseudophysmap_walk(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.xenstore_mfn = mfn;
+    SET_FIELD(ctx, guest_start_info, store_mfn, mfn);
+    SET_FIELD(ctx, guest_start_info, store_evtchn, ctx->restore.xenstore_evtchn);
+
+
+    /* Deal with console stuff */
+    pfn = GET_FIELD(ctx, guest_start_info, console.domU.mfn);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Console pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Console pfn has bad MFN");
+        pseudophysmap_walk(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.console_mfn = mfn;
+    SET_FIELD(ctx, guest_start_info, console.domU.mfn, mfn);
+    SET_FIELD(ctx, guest_start_info, console.domU.evtchn, ctx->restore.console_evtchn);
+
+    /* Set other information */
+    SET_FIELD(ctx, guest_start_info, nr_pages, ctx->x86_pv.max_pfn + 1);
+    SET_FIELD(ctx, guest_start_info, shared_info,
+              ctx->dominfo.shared_info_frame << PAGE_SHIFT);
+    SET_FIELD(ctx, guest_start_info, flags, 0);
+
+    SET_FIELD(ctx, vcpu, user_regs.edx, mfn);
+    rc = 0;
+
+err:
+    if ( guest_start_info )
+        munmap(guest_start_info, PAGE_SIZE);
+
+    return rc;
+}
+
+static int update_guest_p2m(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn, *guest_p2m = NULL;
+    unsigned i;
+    int rc = -1;
+
+    for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+    {
+        pfn = ctx->x86_pv.p2m_pfns[i];
+
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] out of range",
+                  pfn, i);
+            goto err;
+        }
+        else if ( ctx->x86_pv.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] has bad type %lu", pfn, i,
+                  ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("p2m_frame_list[%u] has bad MFN", i);
+            pseudophysmap_walk(ctx, mfn);
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[i] = mfn;
+    }
+
+    guest_p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_WRITE,
+                                     ctx->x86_pv.p2m_pfns,
+                                     ctx->x86_pv.p2m_frames );
+    if ( !guest_p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    memcpy(guest_p2m, ctx->x86_pv.p2m,
+           (ctx->x86_pv.max_pfn + 1) * ctx->x86_pv.width);
+    rc = 0;
+ err:
+    if ( guest_p2m )
+        munmap(guest_p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+static int handle_end(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+
+    DPRINTF("End record");
+    return 0;
+}
+
+static int handle_x86_pv_info(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_info *info = rec->data;
+
+    if ( rec->length < sizeof *info )
+    {
+        ERROR("X86_PV_INFO record trucated: length %"PRIu32", expected %zu",
+              rec->length, sizeof *info);
+        return -1;
+    }
+    else if ( info->guest_width != 4 &&
+              info->guest_width != 8 )
+    {
+        ERROR("Unexpected guest width %"PRIu32", Expected 4 or 8",
+              info->guest_width);
+        return -1;
+    }
+    else if ( info->guest_width != ctx->x86_pv.width )
+    {
+        int rc;
+        struct xen_domctl domctl;
+
+        /* try to set address size, domain is always created 64 bit */
+        memset(&domctl, 0, sizeof(domctl));
+        domctl.domain = ctx->domid;
+        domctl.cmd    = XEN_DOMCTL_set_address_size;
+        domctl.u.address_size.size = info->guest_width * 8;
+        rc = do_domctl(xch, &domctl);
+        if ( rc != 0 )
+        {
+            ERROR("Width of guest in stream (%"PRIu32
+                  " bits) differs with existing domain (%"PRIu32" bits)",
+                  info->guest_width * 8, ctx->x86_pv.width * 8);
+            return -1;
+        }
+
+        /* domain informations changed, better to refresh */
+        rc = x86_pv_domain_info(ctx);
+        if ( rc != 0 )
+        {
+            ERROR("Unable to refresh guest informations");
+            return -1;
+        }
+    }
+    else if ( info->pt_levels != 3 &&
+              info->pt_levels != 4 )
+    {
+        ERROR("Unexpected guest levels %"PRIu32", Expected 3 or 4",
+              info->pt_levels);
+        return -1;
+    }
+    else if ( info->pt_levels != ctx->x86_pv.levels )
+    {
+        ERROR("Levels of guest in stream (%"PRIu32
+              ") differs with existing domain (%"PRIu32")",
+              info->pt_levels, ctx->x86_pv.levels);
+        return -1;
+    }
+
+    DPRINTF("X86_PV_INFO record: %d bits, %d levels",
+            ctx->x86_pv.width * 8, ctx->x86_pv.levels);
+    return 0;
+}
+
+static int handle_x86_pv_p2m_frames(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_p2m_frames *data = rec->data;
+    unsigned start, end, x;
+    int rc;
+
+    if ( rec->length < sizeof *data )
+    {
+        ERROR("X86_PV_P2M_FRAMES record trucated: length %"PRIu32", min %zu",
+              rec->length, sizeof *data + sizeof(uint64_t));
+        return -1;
+    }
+    else if ( data->start_pfn > data->end_pfn )
+    {
+        ERROR("End pfn in stream (%#"PRIx32") exceeds Start (%#"PRIx32")",
+              data->end_pfn, data->start_pfn);
+        return -1;
+    }
+
+    start =  data->start_pfn / ctx->x86_pv.fpp;
+    end = data->end_pfn / ctx->x86_pv.fpp + 1;
+
+    if ( rec->length != sizeof *data + ((end - start) * sizeof (uint64_t)) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record wrong size: start_pfn %#"PRIx32
+              ", end_pfn %#"PRIx32", length %"PRIu32
+              ", expected %zu + (%u - %u) * %zu",
+              data->start_pfn, data->end_pfn, rec->length,
+              sizeof *data, end, start, sizeof(uint64_t));
+        return -1;
+    }
+
+    if ( data->end_pfn > ctx->x86_pv.max_pfn )
+    {
+        rc = expand_p2m(ctx, data->end_pfn);
+        if ( rc )
+            return rc;
+    }
+
+    for ( x = 0; x < (end - start); ++x )
+        ctx->x86_pv.p2m_pfns[start + x] = data->p2m_pfns[x];
+
+    DPRINTF("X86_PV_P2M_FRAMES record: GFNs %#"PRIx32"->%#"PRIx32,
+            data->start_pfn, data->end_pfn);
+    return 0;
+}
+
+static int handle_x86_pv_vcpu_basic(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu *vhdr = rec->data;
+    vcpu_guest_context_any_t vcpu;
+    size_t vcpusz = ctx->x86_pv.width == 8 ? sizeof vcpu.x64 : sizeof vcpu.x32;
+    xen_pfn_t pfn, mfn;
+    unsigned long tmp;
+    unsigned i;
+    int rc = -1;
+
+    if ( rec->length <= sizeof *vhdr )
+    {
+        ERROR("X86_PV_VCPU_BASIC record trucated: length %"PRIu32", min %zu",
+              rec->length, sizeof *vhdr + 1);
+        goto err;
+    }
+    else if ( rec->length != sizeof *vhdr + vcpusz )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record wrong size: length %"PRIu32
+              ", expected %zu", rec->length, sizeof *vhdr + vcpusz);
+        goto err;
+    }
+    else if ( vhdr->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_BASIC record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vhdr->vcpu_id, ctx->dominfo.max_vcpu_id);
+        goto err;
+    }
+
+    memcpy(&vcpu, &vhdr->context, vcpusz);
+
+    SET_FIELD(ctx, &vcpu, flags, GET_FIELD(ctx, &vcpu, flags) | VGCF_online);
+
+    /* Vcpu 0 is special: Convert the suspend record to an MFN */
+    if ( vhdr->vcpu_id == 0 )
+    {
+        rc = process_start_info(ctx, &vcpu);
+        if ( rc )
+            return rc;
+        rc = -1;
+    }
+
+    tmp = GET_FIELD(ctx, &vcpu, gdt_ents);
+    if ( tmp > 8192 )
+    {
+        ERROR("GDT entry count (%lu) out of range", tmp);
+        errno = ERANGE;
+        goto err;
+    }
+
+    /* Convert GDT frames to MFNs */
+    for ( i = 0; (i * 512) < tmp; ++i )
+    {
+        pfn = GET_FIELD(ctx, &vcpu, gdt_frames[i]);
+        if ( pfn >= ctx->x86_pv.max_pfn )
+        {
+            ERROR("GDT frame %u (pfn %#lx) out of range", i, pfn);
+            goto err;
+        }
+        else if ( ctx->x86_pv.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+        {
+            ERROR("GDT frame %u (pfn %#lx) has bad type %lu", i, pfn,
+                  ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("GDT frame %u has bad MFN", i);
+            pseudophysmap_walk(ctx, mfn);
+            goto err;
+        }
+
+        SET_FIELD(ctx, &vcpu, gdt_frames[i], mfn);
+    }
+
+    /* Convert CR3 to an MFN */
+    pfn = cr3_to_mfn(ctx, GET_FIELD(ctx, &vcpu, ctrlreg[3]));
+    if ( pfn >= ctx->x86_pv.max_pfn )
+    {
+        ERROR("cr3 (pfn %#lx) out of range", pfn);
+        goto err;
+    }
+    else if ( (ctx->x86_pv.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) !=
+              (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+    {
+        ERROR("cr3 (pfn %#lx) has bad type %lu, expected %lu", pfn,
+              ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT,
+              ctx->x86_pv.levels);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("cr3 has bad MFN");
+        pseudophysmap_walk(ctx, mfn);
+        goto err;
+    }
+
+    SET_FIELD(ctx, &vcpu, ctrlreg[3], mfn_to_cr3(ctx, mfn));
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to MFN */
+    if ( ctx->x86_pv.levels == 4 && (vcpu.x64.ctrlreg[1] & 1) )
+    {
+        pfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+
+        if ( pfn >= ctx->x86_pv.max_pfn )
+        {
+            ERROR("cr1 (pfn %#lx) out of range", pfn);
+            goto err;
+        }
+        else if ( (ctx->x86_pv.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
+                  (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+        {
+            ERROR("cr1 (pfn %#lx) has bad type %lu, expected %lu", pfn,
+                  ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT,
+                  ctx->x86_pv.levels);
+            goto err;
+        }
+
+        mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("cr1 has bad MFN");
+            pseudophysmap_walk(ctx, mfn);
+            goto err;
+        }
+
+        vcpu.x64.ctrlreg[1] = (uint64_t)mfn << PAGE_SHIFT;
+    }
+
+    if ( xc_vcpu_setcontext(xch, ctx->domid, vhdr->vcpu_id, &vcpu) )
+    {
+        PERROR("Failed to set vcpu%"PRIu32"'s basic info", vhdr->vcpu_id);
+        goto err;
+    }
+
+    rc = 0;
+    DPRINTF("vcpu%d X86_PV_VCPU_BASIC record", vhdr->vcpu_id);
+ err:
+    return rc;
+}
+
+static int handle_x86_pv_vcpu_extended(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu *vcpu = rec->data;
+    DECLARE_DOMCTL;
+
+    if ( rec->length <= sizeof *vcpu )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record trucated: length %"PRIu32", min %zu",
+              rec->length, sizeof *vcpu + 1);
+        return -1;
+    }
+    else if ( rec->length > sizeof *vcpu + 128 )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record too long: length %"PRIu32", max %zu",
+              rec->length, sizeof *vcpu + 128);
+        return -1;
+    }
+    else if ( vcpu->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vcpu->vcpu_id, ctx->dominfo.max_vcpu_id);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_set_ext_vcpucontext;
+    domctl.domain = ctx->domid;
+    memcpy(&domctl.u.ext_vcpucontext, &vcpu->context, rec->length - sizeof *vcpu);
+
+    if ( xc_domctl(xch, &domctl) != 0 )
+    {
+        PERROR("Failed to set vcpu%"PRIu32"'s extended info", vcpu->vcpu_id);
+        return -1;
+    }
+
+    DPRINTF("vcpu%d X86_PV_VCPU_EXTENDED record", vcpu->vcpu_id);
+    return 0;
+}
+
+static int handle_x86_pv_vcpu_xsave(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu_xsave *vcpu = rec->data;
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    size_t buffersz;
+
+    if ( rec->length <= sizeof *vcpu )
+    {
+        ERROR("X86_PV_VCPU_XSAVE record trucated: length %"PRIu32", min %zu",
+              rec->length, sizeof *vcpu + 1);
+        return -1;
+    }
+    else if ( vcpu->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vcpu->vcpu_id, ctx->dominfo.max_vcpu_id);
+        return -1;
+    }
+
+    buffersz = rec->length - sizeof *vcpu;
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, buffersz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %"PRIu64" bytes for xsave hypercall buffer",
+              buffersz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setvcpuextstate;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpuextstate.vcpu = vcpu->vcpu_id;
+    domctl.u.vcpuextstate.xfeature_mask = vcpu->xfeature_mask;
+    domctl.u.vcpuextstate.size = buffersz;
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+
+    rc = xc_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    if ( rc )
+    {
+        PERROR("Failed to set vcpu%"PRIu32"'s xsave info", vcpu->vcpu_id);
+        return rc;
+    }
+    else
+    {
+        DPRINTF("vcpu%d X86_PV_VCPU_XSAVE record", vcpu->vcpu_id);
+        return 0;
+    }
+}
+
+static int handle_shared_info(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned i;
+    int rc = -1;
+    shared_info_any_t *guest_shared_info = NULL;
+    shared_info_any_t *stream_shared_info = rec->data;
+
+    if ( rec->length != PAGE_SIZE )
+    {
+        ERROR("X86_PV_SHARED_INFO record wrong size: length %"PRIu32
+              ", expected %u", rec->length, PAGE_SIZE);
+        goto err;
+    }
+
+    guest_shared_info = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE,
+        ctx->dominfo.shared_info_frame);
+    if ( !guest_shared_info )
+    {
+        PERROR("Failed to map Shared Info at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        goto err;
+    }
+
+    MEMCPY_FIELD(ctx, guest_shared_info, stream_shared_info, vcpu_info);
+    MEMCPY_FIELD(ctx, guest_shared_info, stream_shared_info, arch);
+
+    SET_FIELD(ctx, guest_shared_info, arch.pfn_to_mfn_frame_list_list, 0);
+
+    MEMSET_ARRAY_FIELD(ctx, guest_shared_info, evtchn_pending, 0);
+    for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
+        SET_FIELD(ctx, guest_shared_info, vcpu_info[i].evtchn_pending_sel, 0);
+
+    MEMSET_ARRAY_FIELD(ctx, guest_shared_info, evtchn_mask, 0xff);
+
+    rc = 0;
+ err:
+
+    if ( guest_shared_info )
+        munmap(guest_shared_info, PAGE_SIZE);
+
+    return rc;
+}
+
+int restore_x86_pv(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct record rec;
+    int rc;
+
+    IPRINTF("In experimental %s", __func__);
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_x86_pv )
+    {
+        ERROR("Unable to restore %s domain into an x86_pv domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != 4096 )
+    {
+        ERROR("Invalid page size %d for x86_pv domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        goto err;
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        goto err;
+
+    do
+    {
+        rc = read_record(ctx, &rec);
+        if ( rc )
+            goto err;
+
+        switch ( rec.type )
+        {
+        case REC_TYPE_end:
+            rc = handle_end(ctx, &rec);
+            break;
+
+        case REC_TYPE_page_data:
+            rc = handle_page_data(ctx, &rec);
+            break;
+
+        case REC_TYPE_x86_pv_info:
+            rc = handle_x86_pv_info(ctx, &rec);
+            break;
+
+        case REC_TYPE_x86_pv_p2m_frames:
+            rc = handle_x86_pv_p2m_frames(ctx, &rec);
+            break;
+
+        case REC_TYPE_x86_pv_vcpu_basic:
+            rc = handle_x86_pv_vcpu_basic(ctx, &rec);
+            break;
+
+        case REC_TYPE_x86_pv_vcpu_extended:
+            rc = handle_x86_pv_vcpu_extended(ctx, &rec);
+            break;
+
+        case REC_TYPE_x86_pv_vcpu_xsave:
+            rc = handle_x86_pv_vcpu_xsave(ctx, &rec);
+            break;
+
+        case REC_TYPE_shared_info:
+            rc = handle_shared_info(ctx, &rec);
+            break;
+
+        case REC_TYPE_tsc_info:
+            rc = handle_tsc_info(ctx, &rec);
+            break;
+
+        default:
+            if ( rec.type & REC_TYPE_optional )
+            {
+                IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
+                        rec.type, rec_type_to_str(rec.type));
+                rc = 0;
+                break;
+            }
+
+            ERROR("Invalid record type (0x%"PRIx32", %s) for x86_pv domains",
+                  rec.type, rec_type_to_str(rec.type));
+            rc = -1;
+            break;
+        }
+
+        free(rec.data);
+        if ( rc )
+            goto err;
+
+    } while ( rec.type != REC_TYPE_end );
+
+    IPRINTF("Finished reading records");
+
+    rc = pin_pagetables(ctx);
+    if ( rc )
+        goto err;
+
+    rc = update_guest_p2m(ctx);
+    if ( rc )
+        goto err;
+
+    rc = xc_dom_gnttab_seed(xch, ctx->domid,
+                            ctx->restore.console_mfn,
+                            ctx->restore.xenstore_mfn,
+                            ctx->restore.console_domid,
+                            ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        goto err;
+    }
+
+    /* all done */
+    IPRINTF("All Done");
+    assert(!rc);
+    goto cleanup;
+
+ err:
+    assert(rc);
+ cleanup:
+
+    free(ctx->x86_pv.p2m);
+    free(ctx->x86_pv.p2m_pfns);
+    free(ctx->x86_pv.pfn_types);
+    free(ctx->restore.populated_pfns);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 8/9] tools/libxc: x86 hvm save implementation
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (6 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 7/9] tools/libxc: x86 pv restore implementation Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 14:14   ` Ian Campbell
  2014-04-30 18:36 ` [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation Andrew Cooper
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, David Vrabel

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/saverestore/common.h       |    2 +
 tools/libxc/saverestore/save.c         |    4 +-
 tools/libxc/saverestore/save_x86_hvm.c |  220 ++++++++++++++++++++++++++++++++
 3 files changed, 224 insertions(+), 2 deletions(-)
 create mode 100644 tools/libxc/saverestore/save_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 259cb1f..e26b879 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -127,6 +127,8 @@ struct context
 
 /* Saves an x86 PV domain. */
 int save_x86_pv(struct context *ctx);
+/* Saves an x86 HVM domain. */
+int save_x86_hvm(struct context *ctx);
 /* Restores an x86 PV domain. */
 int restore_x86_pv(struct context *ctx);
 
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index a19c217..4bff29b 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -492,8 +492,8 @@ int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_ite
 
     if ( ctx.dominfo.hvm )
     {
-        ERROR("HVM Save not supported yet");
-        return -1;
+        ctx.ops = save_restore_ops_x86_hvm;
+        return save_x86_hvm(&ctx);
     }
     else
     {
diff --git a/tools/libxc/saverestore/save_x86_hvm.c b/tools/libxc/saverestore/save_x86_hvm.c
new file mode 100644
index 0000000..57c13bc
--- /dev/null
+++ b/tools/libxc/saverestore/save_x86_hvm.c
@@ -0,0 +1,220 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
+#include <xen/hvm/params.h>
+
+#include "common_x86_pv.h"
+
+static int write_hvm_context(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long hvm_buf_size;
+    struct record hvm_rec = { .type = REC_TYPE_hvm_context, };
+    int rc;
+
+    hvm_buf_size = xc_domain_hvm_getcontext(xch, ctx->domid, 0, 0);
+    if ( hvm_buf_size == -1 )
+    {
+        PERROR("Couldn't get HVM context size from Xen");
+        rc = -1;
+        goto out;
+    }
+    hvm_rec.data = malloc(hvm_buf_size);
+    if ( !hvm_rec.data )
+    {
+        PERROR("Couldn't allocate memory");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_rec.length = xc_domain_hvm_getcontext(xch, ctx->domid,
+                                              hvm_rec.data, hvm_buf_size);
+    if ( hvm_rec.length < 0 )
+    {
+        PERROR("HVM:Could not get hvm buffer");
+        rc = -1;
+        goto out;
+    }
+
+    rc = write_record(ctx, &hvm_rec);
+    if ( rc < 0 )
+    {
+        PERROR("error write HVM_CONTEXT record");
+        goto out;
+    }
+
+ out:
+    free(hvm_rec.data);
+    return rc;
+}
+
+static int write_hvm_params(struct context *ctx)
+{
+    static const unsigned int params[] = {
+        HVM_PARAM_STORE_PFN,
+        HVM_PARAM_IOREQ_PFN,
+        HVM_PARAM_BUFIOREQ_PFN,
+        HVM_PARAM_PAGING_RING_PFN,
+        HVM_PARAM_ACCESS_RING_PFN,
+        HVM_PARAM_SHARING_RING_PFN,
+        HVM_PARAM_VM86_TSS,
+        HVM_PARAM_CONSOLE_PFN,
+        HVM_PARAM_ACPI_IOPORTS_LOCATION,
+        HVM_PARAM_VIRIDIAN,
+        HVM_PARAM_IDENT_PT,
+        HVM_PARAM_PAE_ENABLED,
+    };
+    static const unsigned int num_params = ARRAY_SIZE(params);
+    xc_interface *xch = ctx->xch;
+    struct rec_hvm_params_entry *entries;
+    struct rec_hvm_params hdr = {
+        .count = 0,
+    };
+    struct record rec = {
+        .type   = REC_TYPE_hvm_params,
+        .length = sizeof(hdr),
+        .data   = &hdr,
+    };
+    unsigned int i;
+    int rc;
+
+    entries = malloc(num_params * sizeof(*entries));
+    if ( !entries )
+    {
+        PERROR("HVM params record");
+        rc = -1;
+        goto out;
+    }
+
+    for ( i = 0; i < num_params; i++ )
+    {
+        uint32_t index = params[i];
+        uint64_t value;
+
+        xc_get_hvm_param(xch, ctx->domid, index, (unsigned long *)&value);
+        if ( value != 0 )
+        {
+            entries[hdr.count].index = index;
+            entries[hdr.count].value = value;
+            hdr.count++;
+        }
+    }
+
+    rc = write_split_record(ctx, &rec, entries, hdr.count * sizeof(*entries));
+    if ( rc < 0 )
+    {
+        PERROR("error write HVM_PARAMS record");
+        goto out;
+    }
+
+out:
+    free(entries);
+    return rc;
+}
+
+static int write_toolstack(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct record rec = {
+        .type = REC_TYPE_toolstack,
+        .length = 0,
+    };
+    uint8_t *buf;
+    uint32_t len;
+    int rc;
+
+    if ( !ctx->save.callbacks || !ctx->save.callbacks->toolstack_save )
+        return 0;
+
+    if ( ctx->save.callbacks->toolstack_save(ctx->domid, &buf, &len, ctx->save.callbacks->data) < 0 )
+    {
+        PERROR("Error calling toolstack_save");
+        return -1;
+    }
+
+    rc = write_split_record(ctx, &rec, buf, len);
+    if ( rc < 0 )
+        PERROR("Error writing TOOLSTACK record");
+    free(buf);
+    return rc;
+}
+
+int save_x86_hvm(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+    struct record end = { REC_TYPE_end, 0, NULL };
+
+    IPRINTF("In experimental %s", __func__);
+
+    /* Write Image and Domain headers to the stream */
+    rc = write_headers(ctx, DHDR_TYPE_x86_hvm);
+    if ( rc )
+        goto err;
+
+    rc = write_page_data_and_pause(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        goto err;
+
+    /* Refresh domain information now it has paused */
+    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
+         (ctx->dominfo.domid != ctx->domid) )
+    {
+        PERROR("Unable to refresh domain information");
+        rc = -1;
+        goto err;
+    }
+    else if ( (!ctx->dominfo.shutdown ||
+               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
+              !ctx->dominfo.paused )
+    {
+        ERROR("Domain has not been suspended");
+        rc = -1;
+        goto err;
+    }
+
+    rc = write_toolstack(ctx);
+    if ( rc )
+        goto err;
+
+    /* Write the HVM_CONTEXT record. */
+    rc = write_hvm_context(ctx);
+    if ( rc )
+        goto err;
+
+    /* Write HVM_PARAMS record contains applicable HVM params. */
+    rc = write_hvm_params(ctx);
+    if ( rc )
+        goto err;
+
+    /* Write an END record */
+    rc = write_record(ctx, &end);
+    if ( rc )
+        goto err;
+
+    /* all done */
+    assert(!rc);
+    goto cleanup;
+
+ err:
+    assert(rc);
+ cleanup:
+
+    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                      NULL, 0, NULL, 0, NULL);
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (7 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 8/9] tools/libxc: x86 hvm save implementation Andrew Cooper
@ 2014-04-30 18:36 ` Andrew Cooper
  2014-05-07 14:15   ` Ian Campbell
  2014-05-01 16:41 ` [PATCH v4 0/9] Migration Stream v2 Konrad Rzeszutek Wilk
  2014-05-07 14:23 ` Ian Campbell
  10 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, David Vrabel

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/saverestore/common.h          |    2 +
 tools/libxc/saverestore/restore.c         |    5 +-
 tools/libxc/saverestore/restore_x86_hvm.c |  289 +++++++++++++++++++++++++++++
 3 files changed, 294 insertions(+), 2 deletions(-)
 create mode 100644 tools/libxc/saverestore/restore_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index e26b879..980ed55 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -131,6 +131,8 @@ int save_x86_pv(struct context *ctx);
 int save_x86_hvm(struct context *ctx);
 /* Restores an x86 PV domain. */
 int restore_x86_pv(struct context *ctx);
+/* Restores an x86 HVM domain. */
+int restore_x86_hvm(struct context *ctx);
 
 /*
  * Write the image and domain headers to the stream.
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
index 56eeb83..6feeb6f 100644
--- a/tools/libxc/saverestore/restore.c
+++ b/tools/libxc/saverestore/restore.c
@@ -99,8 +99,9 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
 
     if ( ctx.dominfo.hvm )
     {
-        ERROR("HVM Restore not supported yet");
-        return -1;
+        ctx.ops = save_restore_ops_x86_hvm;
+        if ( restore_x86_hvm(&ctx) )
+            return -1;
     }
     else
     {
diff --git a/tools/libxc/saverestore/restore_x86_hvm.c b/tools/libxc/saverestore/restore_x86_hvm.c
new file mode 100644
index 0000000..d8cd765
--- /dev/null
+++ b/tools/libxc/saverestore/restore_x86_hvm.c
@@ -0,0 +1,289 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+static int handle_end(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+
+    DPRINTF("End record");
+    return 0;
+}
+
+static int handle_toolstack(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore )
+        return 0;
+
+    rc = ctx->restore.callbacks->toolstack_restore(ctx->domid, rec->data, rec->length,
+                                                   ctx->restore.callbacks->data);
+    if ( rc < 0 )
+        PERROR("restoring toolstack");
+    return rc;
+}
+
+static int handle_hvm_context(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = xc_domain_hvm_setcontext(xch, ctx->domid, rec->data, rec->length);
+    if ( rc < 0 )
+        PERROR("Unable to restore HVM context");
+    return rc;
+}
+
+static int handle_hvm_params(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_hvm_params *hdr = rec->data;
+    struct rec_hvm_params_entry *entry = hdr->param;
+    unsigned int i;
+    int rc;
+
+    if ( rec->length < sizeof(*hdr)
+         || rec->length < sizeof(*hdr) + hdr->count * sizeof(*entry) )
+    {
+        ERROR("hvm_params record is too short");
+        return -1;
+    }
+
+    for ( i = 0; i < hdr->count; i++, entry++ )
+    {
+        switch ( entry->index )
+        {
+        case HVM_PARAM_CONSOLE_PFN:
+            ctx->restore.console_mfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_STORE_PFN:
+            ctx->restore.xenstore_mfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_IOREQ_PFN:
+        case HVM_PARAM_BUFIOREQ_PFN:
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        }
+
+        rc = xc_set_hvm_param(xch, ctx->domid, entry->index, entry->value);
+        if ( rc < 0 )
+        {
+            PERROR("set HVM param %"PRId64" = 0x%016"PRIx64,
+                   entry->index, entry->value);
+            return rc;
+        }
+    }
+    return 0;
+}
+
+static int set_extra_hvm_params(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = xc_set_hvm_param(xch, ctx->domid, HVM_PARAM_STORE_EVTCHN,
+                          ctx->restore.xenstore_evtchn);
+    if ( rc < 0 )
+    {
+        PERROR("set HVM_PARAM_STORE_EVTCHN");
+        goto out;
+    }
+
+out:
+    return rc;
+}
+
+static int dump_qemu(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    char qemusig[21];
+    uint32_t qlen;
+    void *qbuf = NULL;
+    int saved_errno;
+    char path[256];
+    FILE *fp;
+
+    if ( read_exact(ctx->fd, qemusig, sizeof(qemusig)) )
+    {
+        PERROR("Error reading QEMU signature");
+        goto err;
+    }
+    if ( memcmp(qemusig, "DeviceModelRecord0002", sizeof(qemusig)) )
+    {
+        qemusig[20] = '\0';
+        ERROR("Invalid device model state signature %s", qemusig);
+        goto err;
+    }
+
+    if ( read_exact(ctx->fd, &qlen, sizeof(qlen)) )
+    {
+        PERROR("Error reading QEMU record length");
+        goto err;
+    }
+
+    qbuf = malloc(qlen);
+    if ( !qbuf )
+    {
+        PERROR("no memory for device model state");
+        goto err;
+    }
+
+    if ( read_exact(ctx->fd, qbuf, qlen) )
+    {
+        PERROR("Error reading device model state");
+        goto err;
+    }
+
+    sprintf(path, XC_DEVICE_MODEL_RESTORE_FILE".%u", ctx->domid);
+    fp = fopen(path, "wb");
+    if ( !fp )
+        return -1;
+
+    DPRINTF("Writing %d bytes of QEMU data\n", qlen);
+    if ( fwrite(qbuf, 1, qlen, fp) != qlen )
+    {
+        saved_errno = errno;
+        fclose(fp);
+        errno = saved_errno;
+        goto err;
+    }
+
+    fclose(fp);
+
+    return 0;
+
+err:
+    free(qbuf);
+    return -1;
+}
+
+int restore_x86_hvm(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct record rec;
+    int rc;
+
+    IPRINTF("In experimental %s", __func__);
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_x86_hvm )
+    {
+        ERROR("Unable to restore %s domain into an x86_hvm domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != 4096 )
+    {
+        ERROR("Invalid page size %d for x86_hvm domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    do
+    {
+        rc = read_record(ctx, &rec);
+        if ( rc )
+            goto err;
+
+        switch ( rec.type )
+        {
+        case REC_TYPE_end:
+            rc = handle_end(ctx, &rec);
+            break;
+
+        case REC_TYPE_page_data:
+            rc = handle_page_data(ctx, &rec);
+            break;
+
+        case REC_TYPE_tsc_info:
+            rc = handle_tsc_info(ctx, &rec);
+            break;
+
+        case REC_TYPE_hvm_context:
+            rc = handle_hvm_context(ctx, &rec);
+            break;
+
+        case REC_TYPE_hvm_params:
+            rc = handle_hvm_params(ctx, &rec);
+            break;
+
+        case REC_TYPE_toolstack:
+            rc = handle_toolstack(ctx, &rec);
+            break;
+
+        default:
+            if ( rec.type & REC_TYPE_optional )
+            {
+                IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
+                        rec.type, rec_type_to_str(rec.type));
+                rc = 0;
+                break;
+            }
+
+            ERROR("Invalid record type (0x%"PRIx32", %s) for x86_pv domains",
+                  rec.type, rec_type_to_str(rec.type));
+            rc = -1;
+            break;
+        }
+
+        free(rec.data);
+        if ( rc )
+            goto err;
+
+    } while ( rec.type != REC_TYPE_end );
+
+    IPRINTF("Finished reading records");
+
+    rc = set_extra_hvm_params(ctx);
+    if ( rc )
+        goto err;
+
+    rc = xc_dom_gnttab_hvm_seed(xch, ctx->domid,
+                                ctx->restore.console_mfn,
+                                ctx->restore.xenstore_mfn,
+                                ctx->restore.console_domid,
+                                ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        goto err;
+    }
+
+    /*
+     * FIXME: reading the device model state from the stream should be
+     * done by libxl.
+     */
+    dump_qemu(ctx);
+
+    /* all done */
+    IPRINTF("All Done");
+    assert(!rc);
+    goto cleanup;
+
+ err:
+    assert(rc);
+ cleanup:
+
+    free(ctx->x86_pv.p2m);
+    free(ctx->x86_pv.p2m_pfns);
+    free(ctx->x86_pv.pfn_types);
+    free(ctx->restore.populated_pfns);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 0/9] Migration Stream v2
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (8 preceding siblings ...)
  2014-04-30 18:36 ` [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation Andrew Cooper
@ 2014-05-01 16:41 ` Konrad Rzeszutek Wilk
  2014-05-01 17:32   ` Andrew Cooper
  2014-05-07 14:23 ` Ian Campbell
  10 siblings, 1 reply; 39+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-05-01 16:41 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel,
	Frediano Ziglio, David Vrabel, Jan Beulich

On Wed, Apr 30, 2014 at 07:36:43PM +0100, Andrew Cooper wrote:
> Hello,
> 
> Presented here for review is v4 of the Migration Stream v2 work.  David,
> Frediano and myself have worked together on the implementation, and we have
> now managed to get PV and HVM migration working using the new format (even
> when transparently inserted underneath Xapi).
> 
> This series is based on staging, as 9f2f1298d0211962 is a prerequisite
> hypercall for a bugfix.
> 
> Draft F of the design document is available from
>   xenbits.xen.org/people/andrewcoop/domain-save-format-F.pdf

pg 8 says:

body Record body of length body_length octects.

It is a bit hard to parse. Perhaps:
Data stream (body) of size defined in 'body_length' in multiples
of octects.
?

pg. 11
Mentions 'XEN_DOMCTL_PFINFO_*' type. Should you at least point
the reader to where those are defined in the code? Or in the spec?
[edit: next page has it. You could mention that (following page defines
them)


'page_data: "page_size octets of uncompressed page contents for each
page set as present in the pfn array." I think you might want to
replace: "..for each page set" as "corresponding to each element
in the pfn array."


Also oddly enough the picture shows 'page_data[N-1]' but for the
pfn array it has 'pfn[C-1]'. I think you want s/N/C.
[edit: And the next page explains why N != C].

Perhaps for the 'page_data' definition you should mention
that N != C as some of the pfn array entries skip the page_data
array of data?


pg.17
xfeature_mask: The xfeature_mask rom the hypercall body.

I would replace body with structure?


pg.19
TSC_MODE_* constant. You should probably list them. Or at least
say what they are.

Incarnation? Wiki says it means "embodied in flesh" Umm.


pg. 20

blob? How about payload? Data stream?


pg. 21

Would it make sense to point to the list of them at least?
Say: "Can be found in XYZ file."

pg. 22

It says temporary - so... should it be ripped out of the toolstack?


pg. 23

An x86 PV guest image will have in this order: ->
An x86 PV guest image will have this order of records:


pg. 23
x86 HVM guest:
 You should probably say "An x86 HVM guest will have.."

but more interesingly - there is an "DEVICE_MODEL_STATE" which
is not defined in the spec?



> 
> The series is available from the branch 'saverestore2-v4' of
>   git://xenbits.xen.org/people/andrewcoop/xen.git
>   http://xenbits.xen.org/git-http/people/andrewcoop/xen.git
> 
> 
> The code has been a clean rewrite, using the v1 code as a reference but
> avoiding obsolete areas (e.g. how to modify the pagetables of a 32 non-pae
> guest on 32bit pae Xen).
> 
> There are certainly some areas still in need improvement.  There are plans to
> combine the save logic for PV and HVM guests by adding a few more hooks to
> save_restore_ops.  While reviewing the series for posting, I have noticed some
> accidental duplication from attempting to merge PV and HVM together
> (ctx->save.p2m_size and ctx->x86_pv.max_pfn) which can be consolidated.
> 
> Moving forward will be work to do with converting a legacy image to a v2
> image.
> 
> Questions/issues needing discussing are:
> 
> 1) What level should the conversion happen.  The v2 format is capable of
>    detecting a legacy stream, but libxc is arguably too low level for the
>    conversion decision to happen.
> 
> 2) What to do about the layer violations which is the toolstack record and
>    device model record.  Libxl for HVM guests is the only known user of the
>    'toolstack' chunk, which contains some xenstore key/value pairs.  This data
>    should not be a blob in the middle of the libxc layer, and should be
>    promoted to a first class member of a libxl migration stream.
> 
> Anyway - the code is presented here for initial comment/query/critism.
> 
> ~Andrew
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 0/9] Migration Stream v2
  2014-05-01 16:41 ` [PATCH v4 0/9] Migration Stream v2 Konrad Rzeszutek Wilk
@ 2014-05-01 17:32   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-05-01 17:32 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel,
	Frediano Ziglio, David Vrabel, Jan Beulich

On 01/05/14 17:41, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 30, 2014 at 07:36:43PM +0100, Andrew Cooper wrote:
>> Hello,
>>
>> Presented here for review is v4 of the Migration Stream v2 work.  David,
>> Frediano and myself have worked together on the implementation, and we have
>> now managed to get PV and HVM migration working using the new format (even
>> when transparently inserted underneath Xapi).
>>
>> This series is based on staging, as 9f2f1298d0211962 is a prerequisite
>> hypercall for a bugfix.
>>
>> Draft F of the design document is available from
>>   xenbits.xen.org/people/andrewcoop/domain-save-format-F.pdf
> pg 8 says:
>
> body Record body of length body_length octects.
>
> It is a bit hard to parse. Perhaps:
> Data stream (body) of size defined in 'body_length' in multiples
> of octects.
> ?

Hmm - that is not very well written.  I shall clarify it somewhat,
although the length part is clear from the 'body_length' description.

>
> pg. 11
> Mentions 'XEN_DOMCTL_PFINFO_*' type. Should you at least point
> the reader to where those are defined in the code? Or in the spec?
> [edit: next page has it. You could mention that (following page defines
> them)

This is because of the way in which markdown split the page which is
hard to affect.  I shall clarify the first reference to mention
public/domctl.h

>
>
> 'page_data: "page_size octets of uncompressed page contents for each
> page set as present in the pfn array." I think you might want to
> replace: "..for each page set" as "corresponding to each element
> in the pfn array."
>
>
> Also oddly enough the picture shows 'page_data[N-1]' but for the
> pfn array it has 'pfn[C-1]'. I think you want s/N/C.
> [edit: And the next page explains why N != C].
>
> Perhaps for the 'page_data' definition you should mention
> that N != C as some of the pfn array entries skip the page_data
> array of data?

I will clarify that N is strictly <= C and can validly be 0.

>
>
> pg.17
> xfeature_mask: The xfeature_mask rom the hypercall body.
>
> I would replace body with structure?

This is rather more difficult.

As it turns out, xfeature_mask is not needed for migration.

It is the entire set of available xsave features on the saving cpu,
which is not relevant for the receiving cpu to validate against.  The
first and second uint64_t's of the vcpu_ctx are the xcr0 and xcr0_accum
which are the two masks relevant to validating the feature set and size
of context against the incoming data.

At the moment, the sole use for xfeature_mask is for the receiving cpu
to validate xcr0_accum against, but only after xcr0_accum has already
been validated, which makes the xfeature_mask check strictly redundant.

I have half a mind to submit a patch making Xen completely ignore it on
the "set" side, at which point it can be dropped from the migration
stream.  It does however have a use in "get" side for libxc deciding how
to set up the domain cpuid mask.

>
>
> pg.19
> TSC_MODE_* constant. You should probably list them. Or at least
> say what they are.
>
> Incarnation? Wiki says it means "embodied in flesh" Umm.

As a word, that is a valid meaning.

As for what it is doing in the code, I have no clue.  It is something to
do with how many times the toolstack has issued a set_tsc_info()
hypercall, but why Xen cant count this itself is beyond me, as is why
this even matters in the first place.

>
>
> pg. 20
>
> blob? How about payload? Data stream?

The public headers refer to it as a blob, and blob as in "binary blob"
is a fairly common term.

>
>
> pg. 21
>
> Would it make sense to point to the list of them at least?
> Say: "Can be found in XYZ file."

In my copious free time, I hope to remove some of them.  A large number
of them can be more appropriately done elsehow, such as domain creation
flags.

>
> pg. 22
>
> It says temporary - so... should it be ripped out of the toolstack?

Absolutely - it is gross layer violation.

As far as I am aware, the only time this record is actually used is for
libxl sending xenstore key/value information.  As xl/libxl already has
its own head & tail for the libxc migration stream, it can put xenstore
keys somewhere else.

It is currently included so "pseudo-v2" can transparently replace
xc_domain_save/restore() during development without also making invasive
changes to the users.

>
>
> pg. 23
>
> An x86 PV guest image will have in this order: ->
> An x86 PV guest image will have this order of records:

Ok

>
>
> pg. 23
> x86 HVM guest:
>  You should probably say "An x86 HVM guest will have.."
>
> but more interesingly - there is an "DEVICE_MODEL_STATE" which
> is not defined in the spec?
>

Another layer violation, although this slipup comes from my first
attempt to thing what HVM might need, and David's subsequent edit which
moved it back to reflecting the working code.

The device model state is also something which needs fixing.  Currently
on the sending side, libxl is responsible for writing it into the
stream, but on the receiving side, libxc is responsible for writing it
to a magic file in /var/run/xend/...

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW()
  2014-04-30 18:36 ` [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW() Andrew Cooper
@ 2014-05-07 11:45   ` Ian Campbell
  2014-05-07 12:00     ` David Vrabel
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 11:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> From: David Vrabel <david.vrabel@citrix.com>
> 
> DECLARE_HYPERCALL_BUFFER_SHADOW() is like DECLARE_HYPERCALL_BUFFER()
> except it is backed by an already allocated hypercall buffer.

I suppose enhancing DECLARE_HYPERCALL_BUFFER_ARGUMENT to have this
property has issues with unused variables?

HYPERCALL_BUFFER_AS_PTR() would have been an alternative implementation
(similar to AS_ARG), I suppose there is no particular reason to prefer
one over the other?

> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  tools/libxc/xenctrl.h |   18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index 02129f7..15d2f4f 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -266,6 +266,24 @@ typedef struct xc_hypercall_buffer xc_hypercall_buffer_t;
>      }
>  
>  /*
> + * Like DECLARE_HYPERCALL_BUFFER() but using an already allocated
> + * hypercall buffer, _hbuf.
> + *
> + * Useful when a hypercall buffer is passed to a function and access
> + * via the user pointer is required.
> + *
> + * See DECLARE_HYPERCALL_BUFFER_ARGUMENT() if the user pointer is not
> + * required.
> + */
> +#define DECLARE_HYPERCALL_BUFFER_SHADOW(_type, _name, _hbuf)   \
> +    _type *_name = _hbuf->hbuf;                                \
> +    xc_hypercall_buffer_t XC__HYPERCALL_BUFFER_NAME(_name) = { \
> +        .hbuf = (void *)-1,                                    \
> +        .param_shadow = _hbuf,                                 \
> +        HYPERCALL_BUFFER_INIT_NO_BOUNCE                        \
> +    }
> +
> +/*
>   * Declare the necessary data structure to allow a hypercall buffer
>   * passed as an argument to a function to be used in the normal way.
>   */

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/9] tools/libxc: Stream specification and some common code
  2014-04-30 18:36 ` [PATCH v4 3/9] tools/libxc: Stream specification and some common code Andrew Cooper
@ 2014-05-07 11:57   ` Ian Campbell
  2014-05-07 12:06     ` David Vrabel
  2014-05-07 12:14     ` Andrew Cooper
  0 siblings, 2 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 11:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
> new file mode 100644
> index 0000000..efcca60
> --- /dev/null
> +++ b/tools/libxc/saverestore/stream_format.h

Reference to the spec (which doesn't appear to be in this series
anywhere BTW)

> @@ -0,0 +1,158 @@
> +#ifndef __STREAM_FORMAT__H
> +#define __STREAM_FORMAT__H
> +
> +#include <inttypes.h>
> +
> +/* TODO - find somewhere more appropriate. rec_tsc_info needs it for 64bit */

You seem to use it on all of them though. Can we not use explicit
padding, as you seem to do for many of the other structs?

> +#ifndef __packed
> +#define __packed __attribute__((packed))
> +#endif
> +
> +/*
> + * Image Header
> + */
> +struct __packed ihdr

So long as this remains confined to libxc/saverestore/*.[ch] we can
avoid namespacing/prefixing. I suppose that is a safe assumption.

> +{
> +    uint64_t marker;
> +    uint32_t id;
> +    uint32_t version;
> +    uint16_t options;
> +    uint16_t _res1;
> +    uint32_t _res2;
> +};
> +
> +#define IHDR_MARKER  0xffffffffffffffffULL
> +#define IHDR_ID      0x58454E46U
> +#define IHDR_VERSION 1
> +
> +#define _IHDR_OPT_ENDIAN 0
> +#define IHDR_OPT_LITTLE_ENDIAN (0 << _IHDR_OPT_ENDIAN)
> +#define IHDR_OPT_BIG_ENDIAN    (1 << _IHDR_OPT_ENDIAN)
> +
> +/*
> + * Domain Header
> + */
> +struct __packed dhdr
> +{
> +    uint32_t type;
> +    uint16_t page_shift;
> +    uint16_t _res1;
> +    uint32_t xen_major;
> +    uint32_t xen_minor;
> +};
> +
> +#define DHDR_TYPE_x86_pv  0x00000001U
> +#define DHDR_TYPE_x86_hvm 0x00000002U
> +#define DHDR_TYPE_x86_pvh 0x00000003U
> +#define DHDR_TYPE_arm     0x00000004U

That capitalisation here (and elsewhere) seems rather unconventional. 

BTW IHDR_OPT is also inconsistent with the pattern you've established.
TBH I'd prefer more conventional upper case #defines. I suppose you want
to avoid enums for the ABI-confusing properties they have.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW()
  2014-05-07 11:45   ` Ian Campbell
@ 2014-05-07 12:00     ` David Vrabel
  2014-05-07 12:06       ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: David Vrabel @ 2014-05-07 12:00 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper; +Cc: Xen-devel

On 07/05/14 12:45, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> DECLARE_HYPERCALL_BUFFER_SHADOW() is like DECLARE_HYPERCALL_BUFFER()
>> except it is backed by an already allocated hypercall buffer.
> 
> I suppose enhancing DECLARE_HYPERCALL_BUFFER_ARGUMENT to have this
> property has issues with unused variables?

Yes, I think so. Although I didn't actually try it... Perhaps I should.

> HYPERCALL_BUFFER_AS_PTR() would have been an alternative implementation
> (similar to AS_ARG), I suppose there is no particular reason to prefer
> one over the other?

Usage of the buffer wouldn't be consistent with a regular
DECLARE_HYPERCALL_BUFFER().  It would lead to slightly more confusing
code like:

DECLARE_HYPERCALL_BUFFER(hbuf);
unsigned long *to_send = HYPERCALL_BUFFER_AS_PTR(hbuf);
...
    if ( test_bit(p, to_send) )
...
    rc = xc_shadow_control(xch, ctx->domid,
                      XEN_DOMCTL_SHADOW_OP_CLEAN,
                      hbuf,
                      ctx->save.p2m_size
                      NULL, 0, shadow_stats);

And it's not as obvious that hbuf and to_send are really the same buffer
without going back to the declarations.

David

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams
  2014-04-30 18:36 ` [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
@ 2014-05-07 12:03   ` Ian Campbell
  2014-05-07 12:08     ` David Vrabel
  2014-05-07 12:27     ` Andrew Cooper
  0 siblings, 2 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 12:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:

I suppose these are for debugging/inspecting rather than to be used in
practice? In which case I don't think there's any need for me to review
especially.

My only thought is that the need to repeat all of the data types is a
bit unfortunate and risks bugs and/or deviation. I can well imagine you
don't want to invent up an IDL or anything, is there anything out there
which might suffice? SWIG seems like overkill.
http://code.google.com/p/ctypesgen/ perhaps?

("these debug tools are not important enough to stress over" is a valid
rebuttal here IMHO).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW()
  2014-05-07 12:00     ` David Vrabel
@ 2014-05-07 12:06       ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 12:06 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Xen-devel

On Wed, 2014-05-07 at 13:00 +0100, David Vrabel wrote:
> On 07/05/14 12:45, Ian Campbell wrote:
> > On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> DECLARE_HYPERCALL_BUFFER_SHADOW() is like DECLARE_HYPERCALL_BUFFER()
> >> except it is backed by an already allocated hypercall buffer.
> > 
> > I suppose enhancing DECLARE_HYPERCALL_BUFFER_ARGUMENT to have this
> > property has issues with unused variables?
> 
> Yes, I think so. Although I didn't actually try it... Perhaps I should.

Would be good.

> > HYPERCALL_BUFFER_AS_PTR() would have been an alternative implementation
> > (similar to AS_ARG), I suppose there is no particular reason to prefer
> > one over the other?
> 
> Usage of the buffer wouldn't be consistent with a regular
> DECLARE_HYPERCALL_BUFFER().  It would lead to slightly more confusing
> code like:
> 
> DECLARE_HYPERCALL_BUFFER(hbuf);
> unsigned long *to_send = HYPERCALL_BUFFER_AS_PTR(hbuf);
> ...
>     if ( test_bit(p, to_send) )
> ...
>     rc = xc_shadow_control(xch, ctx->domid,
>                       XEN_DOMCTL_SHADOW_OP_CLEAN,
>                       hbuf,
>                       ctx->save.p2m_size
>                       NULL, 0, shadow_stats);
> 
> And it's not as obvious that hbuf and to_send are really the same buffer
> without going back to the declarations.

OK. Acked-by: Ian Campbell <ian.campbell@citrix.com>.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/9] tools/libxc: Stream specification and some common code
  2014-05-07 11:57   ` Ian Campbell
@ 2014-05-07 12:06     ` David Vrabel
  2014-05-07 12:14     ` Andrew Cooper
  1 sibling, 0 replies; 39+ messages in thread
From: David Vrabel @ 2014-05-07 12:06 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper; +Cc: Xen-devel

On 07/05/14 12:57, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>> diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
>> new file mode 100644
>> index 0000000..efcca60
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/stream_format.h
> 
> Reference to the spec (which doesn't appear to be in this series
> anywhere BTW)
> 
>> @@ -0,0 +1,158 @@
>> +#ifndef __STREAM_FORMAT__H
>> +#define __STREAM_FORMAT__H
>> +
>> +#include <inttypes.h>
>> +
>> +/* TODO - find somewhere more appropriate. rec_tsc_info needs it for 64bit */
> 
> You seem to use it on all of them though. Can we not use explicit
> padding, as you seem to do for many of the other structs?

Yes, rec_tsc_info needs a trailing uint32_t _res1 field to match the spec.

There should be no need to __packed on any of the structures.

David

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams
  2014-05-07 12:03   ` Ian Campbell
@ 2014-05-07 12:08     ` David Vrabel
  2014-05-07 12:27     ` Andrew Cooper
  1 sibling, 0 replies; 39+ messages in thread
From: David Vrabel @ 2014-05-07 12:08 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper; +Cc: Frediano Ziglio, Xen-devel

On 07/05/14 13:03, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> 
> I suppose these are for debugging/inspecting rather than to be used in
> practice? In which case I don't think there's any need for me to review
> especially.

It's primarily for validating that saved images comply with the spec and
thus it does need to be easy to keep maintained.

> My only thought is that the need to repeat all of the data types is a
> bit unfortunate and risks bugs and/or deviation. I can well imagine you
> don't want to invent up an IDL or anything, is there anything out there
> which might suffice? SWIG seems like overkill.
> http://code.google.com/p/ctypesgen/ perhaps?

I think this is a valid concern.

David

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/9] tools/libxc: Stream specification and some common code
  2014-05-07 11:57   ` Ian Campbell
  2014-05-07 12:06     ` David Vrabel
@ 2014-05-07 12:14     ` Andrew Cooper
  2014-05-07 13:07       ` Ian Campbell
  1 sibling, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 12:14 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Vrabel, Xen-devel

On 07/05/14 12:57, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>> diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
>> new file mode 100644
>> index 0000000..efcca60
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/stream_format.h
> Reference to the spec (which doesn't appear to be in this series
> anywhere BTW)

The spec pdf is referenced in the header.  I suppose it should move into
docs/.

>
>> @@ -0,0 +1,158 @@
>> +#ifndef __STREAM_FORMAT__H
>> +#define __STREAM_FORMAT__H
>> +
>> +#include <inttypes.h>
>> +
>> +/* TODO - find somewhere more appropriate. rec_tsc_info needs it for 64bit */
> You seem to use it on all of them though. Can we not use explicit
> padding, as you seem to do for many of the other structs?

The issue with rec_tsc_info is its trailing alignment.  i.e.
sizeof(rec_tsc_info) is different between 32 and 64 bit builds, yet all
fields are at the same offset from the beginning.  Putting explicit
trailing padding would invalidate the current code writing
sizeof(rec_tsc_info) bytes into the stream.

Inside Xen when I was sorting out the __attribute__((packed))
consistency, we did agree on __packed for binary interfaces even when
not strictly necessary.

>
>> +#ifndef __packed
>> +#define __packed __attribute__((packed))
>> +#endif
>> +
>> +/*
>> + * Image Header
>> + */
>> +struct __packed ihdr
> So long as this remains confined to libxc/saverestore/*.[ch] we can
> avoid namespacing/prefixing. I suppose that is a safe assumption.

Hmm yes.  (This I suppose is still "v1 code hacking to make something work")

Probably do want some namespacing even if it does only live inside this
code, although an obvious prefix doesn't jump to mind.

>
>> +{
>> +    uint64_t marker;
>> +    uint32_t id;
>> +    uint32_t version;
>> +    uint16_t options;
>> +    uint16_t _res1;
>> +    uint32_t _res2;
>> +};
>> +
>> +#define IHDR_MARKER  0xffffffffffffffffULL
>> +#define IHDR_ID      0x58454E46U
>> +#define IHDR_VERSION 1
>> +
>> +#define _IHDR_OPT_ENDIAN 0
>> +#define IHDR_OPT_LITTLE_ENDIAN (0 << _IHDR_OPT_ENDIAN)
>> +#define IHDR_OPT_BIG_ENDIAN    (1 << _IHDR_OPT_ENDIAN)
>> +
>> +/*
>> + * Domain Header
>> + */
>> +struct __packed dhdr
>> +{
>> +    uint32_t type;
>> +    uint16_t page_shift;
>> +    uint16_t _res1;
>> +    uint32_t xen_major;
>> +    uint32_t xen_minor;
>> +};
>> +
>> +#define DHDR_TYPE_x86_pv  0x00000001U
>> +#define DHDR_TYPE_x86_hvm 0x00000002U
>> +#define DHDR_TYPE_x86_pvh 0x00000003U
>> +#define DHDR_TYPE_arm     0x00000004U
> That capitalisation here (and elsewhere) seems rather unconventional. 
>
> BTW IHDR_OPT is also inconsistent with the pattern you've established.
> TBH I'd prefer more conventional upper case #defines. I suppose you want
> to avoid enums for the ABI-confusing properties they have.
>
>

This particular bit is to match the spec, but I see what you mean.  I
shall make it consistently upper case.

I hadn't actually considered an enum.  It might be ok as each type field
is passed as a uint32_t, although changing it would be faff for no
appreciable gain.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams
  2014-05-07 12:03   ` Ian Campbell
  2014-05-07 12:08     ` David Vrabel
@ 2014-05-07 12:27     ` Andrew Cooper
  1 sibling, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 12:27 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 07/05/14 13:03, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>
> I suppose these are for debugging/inspecting rather than to be used in
> practice? In which case I don't think there's any need for me to review
> especially.
>
> My only thought is that the need to repeat all of the data types is a
> bit unfortunate and risks bugs and/or deviation. I can well imagine you
> don't want to invent up an IDL or anything, is there anything out there
> which might suffice? SWIG seems like overkill.
> http://code.google.com/p/ctypesgen/ perhaps?
>
> ("these debug tools are not important enough to stress over" is a valid
> rebuttal here IMHO).
>
>

Since submitting this series, I have discarded "inspect-legacy32.py". 
"generate.py" is also debugging which can safely disappear when this
series becomes non-RFC

I expect validate.py to remain even if only as a developer script.

The new "legacy.py" (name subject to sensible suggestions) is the
converstion tool from legacy migration streams.  Given two fds, it picks
apart the legacy format and writes a v2 stream.

Both legacy.py and validate.py depend on streamspec.py as the stream
specification.  The risk of divergence from stream_format.h is a
concern, but streamspec.py also includes type->"str" dictionaries to
match the lookup functions in common.c.  On the other hand, I cant see
any way to generate one from the other without radically changing the
use of python struct packing strings, which is the RightWay to do the
python side of things.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/9] tools/libxc: common code
  2014-04-30 18:36 ` [PATCH v4 5/9] tools/libxc: common code Andrew Cooper
@ 2014-05-07 13:03   ` Ian Campbell
  2014-05-07 14:38     ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 13:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  tools/libxc/saverestore/common.c         |   87 ++++++
>  tools/libxc/saverestore/common.h         |  172 ++++++++++++
>  tools/libxc/saverestore/common_x86.c     |   54 ++++
>  tools/libxc/saverestore/common_x86.h     |   21 ++
>  tools/libxc/saverestore/common_x86_hvm.c |   53 ++++
>  tools/libxc/saverestore/common_x86_pv.c  |  431 ++++++++++++++++++++++++++++++
>  tools/libxc/saverestore/common_x86_pv.h  |  104 +++++++
>  tools/libxc/saverestore/restore.c        |  288 ++++++++++++++++++++
>  tools/libxc/saverestore/save.c           |   42 +++
>  9 files changed, 1252 insertions(+)
>  create mode 100644 tools/libxc/saverestore/common_x86.c
>  create mode 100644 tools/libxc/saverestore/common_x86.h
>  create mode 100644 tools/libxc/saverestore/common_x86_hvm.c
>  create mode 100644 tools/libxc/saverestore/common_x86_pv.c
>  create mode 100644 tools/libxc/saverestore/common_x86_pv.h
> 
> diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
> index de2e727..b159c4c 100644
> --- a/tools/libxc/saverestore/common.c
> +++ b/tools/libxc/saverestore/common.c
> @@ -1,3 +1,5 @@
> +#include <assert.h>
> +
>  #include "common.h"
>  
>  static const char *dhdr_types[] =
> @@ -52,6 +54,91 @@ const char *rec_type_to_str(uint32_t type)
>      return "Reserved";
>  }
>  
> +int write_split_record(struct context *ctx, struct record *rec,
> +                       void *buf, size_t sz)
> +{
> +    static const char zeroes[7] = { 0 };
> +    xc_interface *xch = ctx->xch;
> +    uint32_t combined_length = rec->length + sz;
> +    size_t record_length = (combined_length + 7) & ~7UL;

Isn't this ROUNDUP(combined_length, 3)? (Where 3 and/or 7 perhaps ought
to be given names)

> +
> +    if ( record_length > REC_LENGTH_MAX )
> +    {
> +        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
> +              " exceeds max (0x%"PRIx32")", rec->type,
> +              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
> +        return -1;
> +    }
> +
> +    if ( rec->length )
> +        assert(rec->data);
> +    if ( sz )
> +        assert(buf);
> +
> +    if ( write_exact(ctx->fd, &rec->type, sizeof rec->type) ||

libxc style appears to be "sizeof (rec->type)" (the space seems odd to
me but is apparently The Style, I think the brackets are a must).

> +         write_exact(ctx->fd, &combined_length, sizeof rec->length) ||
> +         (rec->length && write_exact(ctx->fd, rec->data, rec->length)) ||
> +         (sz && write_exact(ctx->fd, buf, sz)) ||
> +         write_exact(ctx->fd, zeroes, record_length - combined_length) )
> +    {
> +        PERROR("Unable to write record to stream");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +int read_record(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct rhdr rhdr;
> +    size_t datasz;
> +
> +    if ( read_exact(ctx->fd, &rhdr, sizeof rhdr) )
> +    {
> +        PERROR("Failed to read Record Header from stream");
> +        return -1;
> +    }
> +    else if ( rhdr.length > REC_LENGTH_MAX )
> +    {
> +        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
> +              " exceeds max (0x%"PRIx32")",
> +              rhdr.type, rec_type_to_str(rhdr.type),
> +              rhdr.length, REC_LENGTH_MAX);
> +        return -1;
> +    }
> +
> +    datasz = (rhdr.length + 7) & ~7U;

Another ROUNDUP and #define XXX 3 or 7 opportunity. (I'm not going to
mention any more of these I find).

Is it not a bug in the input for datasz to not be aligned? I thought you
were padding everything.

> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
> index fff0a39..a35eda7 100644
> --- a/tools/libxc/saverestore/common.h
> +++ b/tools/libxc/saverestore/common.h
> @@ -1,7 +1,20 @@
>  #ifndef __COMMON__H
>  #define __COMMON__H
>  
> +#include <stdbool.h>
> +
> +// Hack out junk from the namespace
 ... namespace, because/in order to...

(to prevent accidents I suppose?)

> +#define mfn_to_pfn __UNUSED_mfn_to_pfn
> +#define pfn_to_mfn __UNUSED_pfn_to_mfn
> +
>  #include "../xg_private.h"
> +#include "../xg_save_restore.h"
> +#include "../xc_dom.h"
> +#include "../xc_bitops.h"
> +
> +#undef mfn_to_pfn
> +#undef pfn_to_mfn
> +
>  
> [...]

> +struct context
> +{
> +    xc_interface *xch;
> +    uint32_t domid;
> +    int fd;
> +
> +    xc_dominfo_t dominfo;
> +
> +    struct save_restore_ops ops;
> +
> +    union
> +    {
> +        struct
> +        {
> +            /* From Image Header */
> +            uint32_t format_version;
> +
> +            /* From Domain Header */
> +            uint32_t guest_type;
> +            uint32_t guest_page_size;
> +

I suppose the following block is not from the Domain Header. /* Populate
by builder */ or something?

[...]

> +    union
> +    {
> +        struct
> +        {
> +            /* 4 or 8; 32 or 64 bit domain */
> +            unsigned int width;
> +            /* 3 or 4 pagetable levels */
> +            unsigned int levels;
> +
> +

Deliberate double spacing? (You don't do it consistently)
[...]

> +};
> +
> +/*
> + * Write the image and domain headers to the stream.
> + * (to eventually make static in save.c)

Is this because of the 2/9 HACK to maintain both? So far it seems to me
that the next iteration could drop that and go for the big switcheroo,
but if not could we agree some tag for things which are specific to that
in order to distinguish from longer term todos or hacky bits? (So I know
which to ignore and which to consider)

> + */
> +int write_headers(struct context *ctx, uint16_t guest_type);
> +
> +extern struct save_restore_ops save_restore_ops_x86_pv;
> +extern struct save_restore_ops save_restore_ops_x86_hvm;

>From the union in the context struct  I'm only expecting x86_pv in this
patch, so perhaps x86_hvm was intended to arrive later?

(a non-zero length commit message might have confirmed what was intended
to be supplied here)

> +/*
> + * Writes a split record to the stream, applying correct padding where
> + * appropriate.  It is common when sending records containing blobs from Xen
> + * that the header and blob data are separate.  This function accepts a second
> + * buffer and length, and will merge it with the main record when sending.
> + *
> + * Records with a non-zero length must provide a valid data field; records
> + * with a 0 length shall have their data field ignored.
> + *
> + * Returns 0 on success and non0 on failure.

"non-0". Do the non-zero values have any particular meanings? (errno
codes, other error codes?)

> + */
> +int write_split_record(struct context *ctx, struct record *rec, void *buf, size_t sz);
> +
> +/*
> + * Writes a record to the stream, applying correct padding where appropriate.
> + * Records with a non-zero length must provide a valid data field; records
> + * with a 0 length shall have their data field ignored.
> + *
> + * Returns 0 on success and non0 on failure.
> + */
> +static inline int write_record(struct context *ctx, struct record *rec)
> +{
> +    return write_split_record(ctx, rec, NULL, 0);
> +}
> +
> +/*
> + * Reads a record from the stream, and fills in the record structure.
> + *
> + * Returns 0 on success and non-0 on failure.
> + *
> + * On success, the records type and size shall be valid.
> + * - If size is 0, data shall be NULL.
> + * - If size is non-0, data shall be a buffer allocated by malloc() which must
> + *   be passed to free() by the caller.
> + *
> + * On failure, the contents of the record structure are undefined.
> + */
> +int read_record(struct context *ctx, struct record *rec);
> +
> +int write_page_data_and_pause(struct context *ctx);

I suppose the page_data referred to here is accumulated in ctx?

> +int handle_page_data(struct context *ctx, struct record *rec);

Handle it how? Is this the restore side counterpart to
write_page_data_...?

> +int populate_pfns(struct context *ctx, unsigned count,
> +                  const xen_pfn_t *original_pfns, const uint32_t *types);

Populate from what?

(You spoiled me with doc comments on the first few, so I'm being picky
here ;-))

> +
>  #endif
>  /*
>   * Local variables:
> diff --git a/tools/libxc/saverestore/common_x86_hvm.c b/tools/libxc/saverestore/common_x86_hvm.c
> new file mode 100644
> index 0000000..0b9aac2
> --- /dev/null
> +++ b/tools/libxc/saverestore/common_x86_hvm.c

How delightfully uninteresting.

> diff --git a/tools/libxc/saverestore/common_x86_pv.c b/tools/libxc/saverestore/common_x86_pv.c
> new file mode 100644
> index 0000000..35bce27
> --- /dev/null
> +++ b/tools/libxc/saverestore/common_x86_pv.c

How terrifying.

> @@ -0,0 +1,431 @@
> +#include <assert.h>
> +
> +#include "common_x86_pv.h"
> +
> +xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
> +{
> +    assert(mfn <= ctx->x86_pv.max_mfn);

Just to confirm that anywhere there is an assert like this then if it is
based on potentially user controller input then it has already been
validated elsewhere first? (with the abstraction I couldn't spot where
that was)

> +    return ctx->x86_pv.m2p[mfn];
> +}
> +
> +static bool x86_pv_pfn_is_valid(struct context *ctx, xen_pfn_t pfn)
> +{
> +    return pfn <= ctx->x86_pv.max_pfn;
> +}
> +
> +static xen_pfn_t x86_pv_pfn_to_gfn(struct context *ctx, xen_pfn_t pfn)
> +{
> +    assert(pfn <= ctx->x86_pv.max_pfn);
> +
> +    if ( ctx->x86_pv.width == sizeof (uint64_t) )
> +        /* 64 bit guest.  Need to truncate their pfns for 32 bit toolstacks */
> +        return ((uint64_t *)ctx->x86_pv.p2m)[pfn];
> +    else
> +    {
> +        /* 32 bit guest.  Need to expand INVALID_MFN fot 64 bit toolstacks */

Typo: "fot"

> +        uint32_t mfn = ((uint32_t *)ctx->x86_pv.p2m)[pfn];
> +
> +        return mfn == ~0U ? INVALID_MFN : mfn;
> +    }
> +}
> +
> +static void x86_pv_set_page_type(struct context *ctx, xen_pfn_t pfn,
> +                                 unsigned long type)
> +{
> +    assert(pfn <= ctx->x86_pv.max_pfn);
> +
> +    ctx->x86_pv.pfn_types[pfn] = type;
> +}
> +
> +static void x86_pv_set_gfn(struct context *ctx, xen_pfn_t pfn,
> +                           xen_pfn_t mfn)
> +{
> +    assert(pfn <= ctx->x86_pv.max_pfn);
> +
> +    if ( ctx->x86_pv.width == sizeof (uint64_t) )
> +        /* 64 bit guest.  Need to expand INVALID_MFN for 32 bit toolstacks */
> +        ((uint64_t *)ctx->x86_pv.p2m)[pfn] = mfn == INVALID_MFN ? ~0ULL : mfn;
> +    else
> +        /* 32 bit guest.  Can safely truncate INVALID_MFN fot 64 bit toolstacks */

Another fot.

> +        ((uint32_t *)ctx->x86_pv.p2m)[pfn] = mfn;
> +}
> +
> +static int normalise_pagetable(struct context *ctx, const uint64_t *src,
> +                               uint64_t *dst, unsigned long type)
> +{
> +    xc_interface *xch = ctx->xch;
> +    uint64_t pte;
> +    unsigned i, xen_first = -1, xen_last = -1; /* Indicies of Xen mappings */
> +
> +    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
> +
> +    if ( ctx->x86_pv.levels == 4 )
> +    {
> +        /* 64bit guests only have Xen mappings in their L4 tables */

You (correctly) infer from the pt levels that this must be a 64-bit
guest. Assert?

> +        if ( type == XEN_DOMCTL_PFINFO_L4TAB )
> +        {
> +            xen_first = 256;
> +            xen_last = 271;

Can we #define all these numbers please?

> +        }
> +    }
> +    else
> +    {

Must be 32 bit? Assert?

> +        switch ( type )
> +        {
> +        case XEN_DOMCTL_PFINFO_L4TAB:
> +            ERROR("??? Found L4 table for 32bit guest");
> +            errno = EINVAL;
> +            return -1;
> +
> +        case XEN_DOMCTL_PFINFO_L3TAB:
> +            /* 32bit guests can only use the first 4 entries of their L3 tables.
> +             * All other are potentially used by Xen. */
> +            xen_first = 4;
> +            xen_last = 512;
> +            break;
> +
> +        case XEN_DOMCTL_PFINFO_L2TAB:
> +            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
> +             * are normal but only a few will have Xen mappings.

Internally Xen has PGT_pae_xen_l2, I wonder if it could be coaxed into
exposing that here too?

> +             * 428 = (HYPERVISOR_VIRT_START_PAE >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff
> +             *
> +             * ...which is conveniently unavailable to us in a 64bit build.

Can we export it somehow? (Perhaps with COMPAT in the name)

> ... normalize_page
> +    local_page = malloc(PAGE_SIZE);

How bad is the overhead of (what I pressume are) all these allocs and
frees? (I've not come across the caller in this patch, maybe it will
become clear later).

Would it be worth keeping an array around shadowing the "live" batch of
pages?

> +static int x86_pv_localise_page(struct context *ctx, uint32_t type, void *page)

"localise" means the inverse of "normalise"?


> +void pseudophysmap_walk(struct context *ctx, xen_pfn_t mfn)
> +{
> +    xc_interface *xch = ctx->xch;
> +    xen_pfn_t pfn = ~0UL;
> +
> +    ERROR("mfn %#lx, max %#lx", mfn, ctx->x86_pv.max_mfn);

It took me a while to figure out why these were all errors. Can we call
the function dump_bad_pseudophysmap_walk or some such?

> +xen_pfn_t cr3_to_mfn(struct context *ctx, uint64_t cr3)
> +{
> +    if ( ctx->x86_pv.width == 8 )
> +        return cr3 >> 12;
> +    else
> +        return (((uint32_t)cr3 >> 12) | ((uint32_t)cr3 << 20));

A comment pointing people to the phrase "extended-cr3" as grep fodder
would be appreciated.

> +}
> +
> +uint64_t mfn_to_cr3(struct context *ctx, xen_pfn_t mfn)
> +{
> +    if ( ctx->x86_pv.width == 8 )
> +        return ((uint64_t)mfn) << 12;
> +    else
> +        return (((uint32_t)mfn << 12) | ((uint32_t)mfn >> 20));
> +}
> +
[...]
> +    else if ( guest_width == 4 )

Would prefer to drop the else and insert a blank line to separate the
error handling of the xc_domain_.. from the interpretation of its
result.

> +        guest_levels = 3;
> +    else if ( guest_width == 8 )
> +        guest_levels = 4;
> +    else
> +    {
> +        ERROR("Invalid guest width %d.  Expected 32 or 64", guest_width);

Actually you expected 4 or 8. I'm not sure if this could be strengthened
into an assert.

> +        return -1;
> +    }
> +    ctx->x86_pv.width = guest_width;
> +    ctx->x86_pv.levels = guest_levels;
> +    ctx->x86_pv.fpp = fpp = PAGE_SIZE / ctx->x86_pv.width;
> +
> +    DPRINTF("%d bits, %d levels", guest_width * 8, guest_levels);
> +
> +    /* Get the domains maximum pfn */

Apostrophes are out of fashion at the moment I guess ;-).

> +    max_pfn = xc_domain_maximum_gpfn(xch, ctx->domid);
> +    if ( max_pfn < 0 )
> +    {
> +        PERROR("Unable to obtain guests max pfn");
> +        return -1;
> +    }
> +    else if ( max_pfn >= ~XEN_DOMCTL_PFINFO_LTAB_MASK )
> +    {
> +        errno = E2BIG;
> +        PERROR("Cannot save a guest this large %#x");

Missing the parameter for %#x.

> [...]

> +/*
> + * Extract an MFN from a Pagetable Entry.
> + */
> +static inline xen_pfn_t pte_to_frame(struct context *ctx, uint64_t pte)
> +{
> +    if ( ctx->x86_pv.width == 8 )
> +        return (pte >> PAGE_SHIFT) & ((1ULL << (52 - PAGE_SHIFT)) - 1);
> +    else
> +        return (pte >> PAGE_SHIFT) & ((1ULL << (44 - PAGE_SHIFT)) - 1);

Are bits 45..52 non zero in this case?

Can we #define a couple of MASK's please.

> +static int pfn_set_populated(struct context *ctx, xen_pfn_t pfn)
> +{
> +    xc_interface *xch = ctx->xch;
> +
> +    if ( !ctx->restore.populated_pfns || pfn > ctx->restore.max_populated_pfn )
> +    {
> +        unsigned long new_max_pfn = ((pfn + 1024) & ~1023) - 1;
> +        size_t old_sz, new_sz;
> +        unsigned long *p;
> +
> +        old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
> +        new_sz = bitmap_size(new_max_pfn + 1);
> +
> +        p  = realloc(ctx->restore.populated_pfns, new_sz);

In practice do you not see an initial run of increasing pfns and end up
constantly adding one more? Consider doubling or adding e.g. 128M worth
of space when growing?
> +
> +    if ( nr_pages > 0 )
> +    {
> +        mapping = guest_page = xc_map_foreign_bulk(
> +            xch, ctx->domid, PROT_READ | PROT_WRITE,
> +            mfns, map_errs, nr_pages);
> +        if ( !mapping )
> +        {
> +            PERROR("Unable to map %u mfns for %u pages of data",
> +                   nr_pages, count);
> +            goto err;
> +        }
> +    }
> +
> +    for ( i = 0, j = 0; i < count; ++i )
> +    {
> +        switch ( types[i] )
> +        {
> +        case XEN_DOMCTL_PFINFO_XTAB:
> +        case XEN_DOMCTL_PFINFO_BROKEN:
> +            /* Nothing at all to do */

Is this a deliberate fall-thru? Please comment if so.

> +        case XEN_DOMCTL_PFINFO_XALLOC:
> +            /* Nothing futher to do */

"further"

> +int handle_page_data(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct rec_page_data_header *pages = rec->data;
> +    unsigned i, pages_of_data = 0;
> +    int rc = -1;
> +
> +    xen_pfn_t *pfns = NULL, pfn;
> +    uint32_t *types = NULL, type;
> +
> +    static unsigned pg_count;
> +    pg_count++;
> +
> +    if ( rec->length < sizeof *pages )
> +    {
> +        ERROR("PAGE_DATA record trucated: length %"PRIu32", min %zu",
> +              rec->length, sizeof *pages);
> +        goto err;
> +    }
> +    else if ( pages->count < 1 )

We mostly omit the else where the previous if returns or gotos etc.

> +    for ( i = 0; i < pages->count; ++i )
> +    {
> +        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
> +        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
> +        {
> +            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
> +            goto err;
> +        }
> +
> +        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
> +        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
> +             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )

What are 5 and 8?

This might all be clearer with the use of a #define or two, and perhaps
using a switch over the expected valid types and explicitly invalid
types.


> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
> index c013e62..e842e6c 100644
> --- a/tools/libxc/saverestore/save.c
> +++ b/tools/libxc/saverestore/save.c
> @@ -1,5 +1,47 @@
> +#include <arpa/inet.h>
> +
>  #include "common.h"
>  
> +int write_headers(struct context *ctx, uint16_t guest_type)
> +{
> +    xc_interface *xch = ctx->xch;
> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
> +    struct ihdr ihdr =
> +        {

I think coding style puts this on the previous line and the body should
be 4 spaces further in.

> +            .marker  = IHDR_MARKER,
> +            .id      = htonl(IHDR_ID),
> +            .version = htonl(IHDR_VERSION),
> +            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
> +        };
> +    struct dhdr dhdr =
> +        {
> +            .type       = guest_type,
> +            .page_shift = 12,

We have a define for this.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/9] tools/libxc: Stream specification and some common code
  2014-05-07 12:14     ` Andrew Cooper
@ 2014-05-07 13:07       ` Ian Campbell
  2014-05-07 13:20         ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 13:07 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: David Vrabel, Xen-devel

On Wed, 2014-05-07 at 13:14 +0100, Andrew Cooper wrote:
> On 07/05/14 12:57, Ian Campbell wrote:
> > On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> >> diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
> >> new file mode 100644
> >> index 0000000..efcca60
> >> --- /dev/null
> >> +++ b/tools/libxc/saverestore/stream_format.h
> > Reference to the spec (which doesn't appear to be in this series
> > anywhere BTW)
> 
> The spec pdf is referenced in the header.  I suppose it should move into
> docs/.

Please.

> 
> >
> >> @@ -0,0 +1,158 @@
> >> +#ifndef __STREAM_FORMAT__H
> >> +#define __STREAM_FORMAT__H
> >> +
> >> +#include <inttypes.h>
> >> +
> >> +/* TODO - find somewhere more appropriate. rec_tsc_info needs it for 64bit */
> > You seem to use it on all of them though. Can we not use explicit
> > padding, as you seem to do for many of the other structs?
> 
> The issue with rec_tsc_info is its trailing alignment.  i.e.
> sizeof(rec_tsc_info) is different between 32 and 64 bit builds, yet all
> fields are at the same offset from the beginning.  Putting explicit
> trailing padding would invalidate the current code writing
> sizeof(rec_tsc_info) bytes into the stream.

Putting it inside the struct, like you do elsewhere, would not have this
problem I think.

> This particular bit is to match the spec, but I see what you mean.  I
> shall make it consistently upper case.
> 
> I hadn't actually considered an enum.  It might be ok as each type field
> is passed as a uint32_t, although changing it would be faff for no
> appreciable gain.

Yes, if you can't change the variables which store it there isn't much
gain. If you can change the variables then it lets you use switch
statements to make sure you handle all cases, but that probably doesn't
justify the churn.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/9] tools/libxc: Stream specification and some common code
  2014-05-07 13:07       ` Ian Campbell
@ 2014-05-07 13:20         ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 13:20 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Vrabel, Xen-devel

On 07/05/14 14:07, Ian Campbell wrote:
> On Wed, 2014-05-07 at 13:14 +0100, Andrew Cooper wrote:
>>>> @@ -0,0 +1,158 @@
>>>> +#ifndef __STREAM_FORMAT__H
>>>> +#define __STREAM_FORMAT__H
>>>> +
>>>> +#include <inttypes.h>
>>>> +
>>>> +/* TODO - find somewhere more appropriate. rec_tsc_info needs it for 64bit */
>>> You seem to use it on all of them though. Can we not use explicit
>>> padding, as you seem to do for many of the other structs?
>> The issue with rec_tsc_info is its trailing alignment.  i.e.
>> sizeof(rec_tsc_info) is different between 32 and 64 bit builds, yet all
>> fields are at the same offset from the beginning.  Putting explicit
>> trailing padding would invalidate the current code writing
>> sizeof(rec_tsc_info) bytes into the stream.
> Putting it inside the struct, like you do elsewhere, would not have this
> problem I think.

But it changes the value in the record length header even if it doesn't
change the content of the record in the stream.  The issue here is
between implicit trailing 0s for alignment purposes (which are not
counted in the record length), and an explicit trailing _resvd field
(which is counted in the record length).

Talking a literal interpretation of the spec, there is a _resvd field,
meaning that the record length should indeed be 4 bytes longer than the
current implementation.

If we are dropping the __packed attributes, I would want some
XC_BUILD_BUG_ON()s to prevent any further surprises (guess how I
discovered this surprise to start with ;p), but where to stick those is
not obvious.

Might it be acceptable to have a single .c file containing stuff like
this which is compiled but not linked?

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/9] tools/libxc: x86 pv save implementation
  2014-04-30 18:36 ` [PATCH v4 6/9] tools/libxc: x86 pv save implementation Andrew Cooper
@ 2014-05-07 13:58   ` Ian Campbell
  2014-05-07 15:20     ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 13:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:

The previous "common" patch seemed to have a fair bit of x86 pv save
code in it, and there seems to be a fair bit of common code here. I
suppose the split was pretty artificial to make the patches manageable?

> @@ -121,6 +123,9 @@ struct context
>      };
>  };
>  
> +/* Saves an x86 PV domain. */

I should hope so!

> +int save_x86_pv(struct context *ctx);
> +
>  /*
>   * Write the image and domain headers to the stream.
>   * (to eventually make static in save.c)
> @@ -137,6 +142,21 @@ struct record
>      void *data;
>  };
>  
> +/* Gets a field from an *_any union */

Unless the #undefs which were above before I trimmed them relate to
patch #2/9 they should go right before the redefine IMHO.

> +#define GET_FIELD(_c, _p, _f)                   \
> +    ({ (_c)->x86_pv.width == 8 ?                \
> +            (_p)->x64._f:                       \
> +            (_p)->x32._f;                       \
> +    })                                          \
> +
> +/* Gets a field from an *_any union */
> +#define SET_FIELD(_c, _p, _f, _v)               \
> +    ({ if ( (_c)->x86_pv.width == 8 )           \
> +            (_p)->x64._f = (_v);                \
> +        else                                    \
> +            (_p)->x32._f = (_v);                \
> +    })
> +
>  /*
>   * Writes a split record to the stream, applying correct padding where
>   * appropriate.  It is common when sending records containing blobs from Xen

> +    /* MFNs of the batch pfns */
> +    mfns = malloc(nr_pfns * sizeof *mfns);
> +    /* Types of the batch pfns */
> +    types = malloc(nr_pfns * sizeof *types);
> +    /* Errors from attempting to map the mfns */
> +    errors = malloc(nr_pfns * sizeof *errors);
> +    /* Pointers to page data to send.  Might be from mapped mfns or local allocations */
> +    guest_data = calloc(nr_pfns, sizeof *guest_data);
> +    /* Pointers to locally allocated pages.  Probably not all used, but need freeing */
> +    local_pages = calloc(nr_pfns, sizeof *local_pages);

Wouldn't it be better to preallocate (most of) these for the entire
save/restore than to do it for each batch?

> +    for ( i = 0; i < nr_pfns; ++i )
> +    {
> +        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->batch_pfns[i]);
> +
> +        /* Likely a ballooned page */
> +        if ( mfns[i] == INVALID_MFN )
> +            set_bit(ctx->batch_pfns[i], ctx->deferred_pages);

deferred_pages doesn't need to grow dynamically like that bit map in a
previous patch, does it?

> +    hdr.count = nr_pfns;
> +    s = nr_pfns * sizeof *rec_pfns;

Brackets around sizeof would really help here.

> +
> +
> +    rec_pfns = malloc(s);

Given the calculation of s is this a candidate for calloc?

> +    if ( !rec_pfns )
> +    {
> +        ERROR("Unable to allocate memory for page data pfn list");
> +        goto err;
> +    }
> +
> +    for ( i = 0; i < nr_pfns; ++i )
> +        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->batch_pfns[i];

Could this not have a more structured type? (I noticed this in the
decode bit of the previous patch too).

> +
> +    /*        header +          pfns data           +        page data */
> +    rec_size = 4 + 4 + (s) + (nr_pages * PAGE_SIZE);

Can these 4's not be sizeof something? Also the comment looks like it
wants to align with something, but is failing.

> +    rec_count = nr_pfns;
> +
> +    if ( write_exact(ctx->fd, &rec_type, sizeof(uint32_t)) ||
> +         write_exact(ctx->fd, &rec_size, sizeof(uint32_t)) ||

Are these not a struct rhdr?

> +         write_exact(ctx->fd, &rec_count, sizeof(uint32_t)) ||
> +         write_exact(ctx->fd, &rec_res, sizeof(uint32_t)) )
> +    {
> +        PERROR("Failed to write page_type header to stream");
> +        goto err;
> +    }
> +
> +    if ( write_exact(ctx->fd, rec_pfns, s) )
> +    {
> +        PERROR("Failed to write page_type header to stream");
> +        goto err;
> +    }
> +
> +
> +    for ( i = 0; i < nr_pfns; ++i )

Extra braces are useful for clarity in these siturations.


> +static int flush_batch(struct context *ctx)
> +{
> +    int rc = 0;
> +
> +    if ( ctx->nr_batch_pfns == 0 )
> +        return rc;
> +
> +    rc = write_batch(ctx);
> +
> +    if ( !rc )
> +    {
> +        /* Valgrind sanity check */

What now?

> +        free(ctx->batch_pfns);
> +        ctx->batch_pfns = malloc(MAX_BATCH_SIZE * sizeof *ctx->batch_pfns);
> +        rc = !ctx->batch_pfns;
> +    }
> +
> +
> +    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
> +    {
> +        DPRINTF("Iteration %u", x);

Something in this series should be using xc_report_progress_* (I'm not
sure that place is here, but I didn't see it elsewhere yet).

>  int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>                      uint32_t max_factor, uint32_t flags,
>                      struct save_callbacks* callbacks, int hvm,
>                      unsigned long vm_generationid_addr)
>  {
> +    struct context ctx =
> +        {

Coding style.

> +            .xch = xch,
> +            .fd = io_fd,
> +        };
> +
> +    /* Older GCC cant initialise anonymous unions */

How old?


> +static int map_p2m(struct context *ctx)
> +{
> [...]

> +    fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list);
> +    if ( !fll_mfn )
> +        IPRINTF("Waiting for domain to set up its p2m frame list list");
> +
> +    while ( tries-- && !fll_mfn )
> +    {
> +        usleep(10000);
> +        fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo,
> +                            arch.pfn_to_mfn_frame_list_list);
> +    }
> +
> +    if ( !fll_mfn )
> +    {
> +        ERROR("Timed out waiting for p2m frame list list to be updated");
> +        goto err;
> +    }

This (preexisting) ugliness is there to "support" running migrate
immediately after starting a guest, when it hasn't got its act together
yet?

I wonder if we can get away with refusing to migrate under those
circumstances.

> +static int write_one_vcpu_basic(struct context *ctx, uint32_t id)
> +{
> +    xc_interface *xch = ctx->xch;
> +    xen_pfn_t mfn, pfn;
> +    unsigned i;
> +    int rc = -1;
> +    vcpu_guest_context_any_t vcpu;
> +    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
> +    struct record rec =
> +    {
> +        .type = REC_TYPE_x86_pv_vcpu_basic,
> +        .length = sizeof vhdr,
> +        .data = &vhdr,
> +    };
> +
> +    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
> +    {
> +        PERROR("Failed to get vcpu%u context", id);
> +        goto err;
> +    }
> +
> +    /* Vcpu 0 is special: Convert the suspend record to a PFN */

The "suspend record" is the guest's start_info_t I think? Would be
clearer to say that I think (and perhaps allude to it being the magic
3rd parameter to the hypercall.)

> +    /* 64bit guests: Convert CR1 (guest pagetables) to PFN */
> +    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
> +    {
> +        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
> +        if ( !mfn_in_pseudophysmap(ctx, mfn) )
> +        {
> +            ERROR("Bad MFN for vcpu%u's cr1", id);
> +            pseudophysmap_walk(ctx, mfn);
> +            errno = ERANGE;
> +            goto err;
> +        }
> +
> +        pfn = mfn_to_pfn(ctx, mfn);
> +        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
> +    }
> +
> +    if ( ctx->x86_pv.width == 8 )

The use of x86_pv.levels vs x86_pv.width to determine the type of guest
is rather inconsistent. .width would seem to be the right one to use
almost all the time I think.

> +        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x64);
> +    else
> +        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x32);
> +
> +    if ( rc )
> +        goto err;
> +
> +    DPRINTF("Writing vcpu%u basic context", id);
> +    rc = 0;
> + err:
> +
> +    return rc;
> +}
> +
> +static int write_one_vcpu_extended(struct context *ctx, uint32_t id)
> +{
> +    xc_interface *xch = ctx->xch;
> +    int rc;
> +    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
> +    struct record rec =
> +    {
> +        .type = REC_TYPE_x86_pv_vcpu_extended,
> +        .length = sizeof vhdr,
> +        .data = &vhdr,
> +    };

Does this pattern repeat a lot? DECLARE_REC(x86_pv_vcpu_extended, vhdr)
perhaps?

> +    struct xen_domctl domctl =
> +    {
> +        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
> +        .domain = ctx->domid,
> +        .u.ext_vcpucontext.vcpu = id,
> +    };

Wants to use DECLARE_DOMCTL I think (at least everywhere else in libxc
seesms too).

> +static int write_all_vcpu_information(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    xc_vcpuinfo_t vinfo;
> +    unsigned int i;
> +    int rc;
> +
> +    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
> +    {
> +        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
> +        if ( rc )
> +        {
> +            PERROR("Failed to get vcpu%u information", i);
> +            return rc;
> +        }
> +
> +        if ( !vinfo.online )
> +        {
> +            DPRINTF("vcpu%u offline - skipping", i);
> +            continue;
> +        }
> +
> +        rc = write_one_vcpu_basic(ctx, i) ?:
> +            write_one_vcpu_extended(ctx, i) ?:
> +            write_one_vcpu_xsave(ctx, i);

I'd prefer the more verbose one at a time version.

> +    /* Map various structures */
> +    rc = x86_pv_map_m2p(ctx) ?: map_shinfo(ctx) ?: map_p2m(ctx);

One at a time please.

> +    if ( rc )
> +        goto err;
> +

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 7/9] tools/libxc: x86 pv restore implementation
  2014-04-30 18:36 ` [PATCH v4 7/9] tools/libxc: x86 pv restore implementation Andrew Cooper
@ 2014-05-07 14:10   ` Ian Campbell
  2014-05-07 15:37     ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 14:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> +    ihdr.id      = ntohl(ihdr.id);
> +    ihdr.version = ntohl(ihdr.version);
> +    ihdr.options = ntohs(ihdr.options);
> +
> +    if ( ihdr.marker != IHDR_MARKER )
> +    {
> +        ERROR("Invalid marker: Got 0x%016"PRIx64, ihdr.marker);
> +        return -1;
> +    }
> +    else if ( ihdr.id != IHDR_ID )

Can we do away with the unnecessary elses please.


> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
> +    {
> +        PERROR("Failed to get domain info");
> +        return -1;
> +    }
> +
> +    if ( ctx.dominfo.domid != dom )
> +    {
> +        ERROR("Domain %d does not exist", dom);
> +        return -1;
> +    }
> +
> +    ctx.domid = dom;
> +    IPRINTF("Restoring domain %d", dom);
> +
> +    if ( read_headers(&ctx) )
> +        return -1;
> +
> +    if ( ctx.dominfo.hvm )
> +    {
> +        ERROR("HVM Restore not supported yet");
> +        return -1;
> +    }
> +    else

Unneeded.

> +
> +static int handle_x86_pv_vcpu_basic(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct rec_x86_pv_vcpu *vhdr = rec->data;
> +    vcpu_guest_context_any_t vcpu;
> +    size_t vcpusz = ctx->x86_pv.width == 8 ? sizeof vcpu.x64 : sizeof vcpu.x32;
> +    xen_pfn_t pfn, mfn;
> +    unsigned long tmp;
> +    unsigned i;
> +    int rc = -1;
> +
> +    if ( rec->length <= sizeof *vhdr )
> +    {
> +        ERROR("X86_PV_VCPU_BASIC record trucated: length %"PRIu32", min %zu",

"truncated"

> +        else if ( (ctx->x86_pv.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
> +                  (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )

I'd think some temporaries would help with clarity here.

> +static int handle_x86_pv_vcpu_extended(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct rec_x86_pv_vcpu *vcpu = rec->data;
> +    DECLARE_DOMCTL;
> +
> +    if ( rec->length <= sizeof *vcpu )
> +    {
> +        ERROR("X86_PV_VCPU_EXTENDED record trucated: length %"PRIu32", min %zu",
> +              rec->length, sizeof *vcpu + 1);

I've just realised that there was a X86_PV_VCPU_EXTENDED in
handle_x86_pv_vcpu_basic which was wrong, but I've trimmed it already so
I'll comment here.


> +        return -1;
> +    }
> +    else if ( rec->length > sizeof *vcpu + 128 )

There's an awful lot of magic numbers throughout this series. Can we try
and use meaningful symbolic names for things please.

> +int restore_x86_pv(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct record rec;
> +    int rc;
> +
> +    IPRINTF("In experimental %s", __func__);
> +
> +    if ( ctx->restore.guest_type != DHDR_TYPE_x86_pv )
> +    {
> +        ERROR("Unable to restore %s domain into an x86_pv domain",
> +              dhdr_type_to_str(ctx->restore.guest_type));
> +        return -1;
> +    }
> +    else if ( ctx->restore.guest_page_size != 4096 )

#define's exist for this.

> +    /* all done */
> +    IPRINTF("All Done");
> +    assert(!rc);
> +    goto cleanup;
> +
> + err:
> +    assert(rc);

FWIW I think you could drop this and let the success case fall through
without the hop and a skip..

> + cleanup:
> +
> +    free(ctx->x86_pv.p2m);
> +    free(ctx->x86_pv.p2m_pfns);
> +    free(ctx->x86_pv.pfn_types);
> +    free(ctx->restore.populated_pfns);
> +
> +    if ( ctx->x86_pv.m2p )
> +        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
> +
> +    return rc;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 8/9] tools/libxc: x86 hvm save implementation
  2014-04-30 18:36 ` [PATCH v4 8/9] tools/libxc: x86 hvm save implementation Andrew Cooper
@ 2014-05-07 14:14   ` Ian Campbell
  2014-05-07 15:39     ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 14:14 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> +static int write_hvm_params(struct context *ctx)
> +{
> +    static const unsigned int params[] = {
> +        HVM_PARAM_STORE_PFN,
> +        HVM_PARAM_IOREQ_PFN,
> +        HVM_PARAM_BUFIOREQ_PFN,
> +        HVM_PARAM_PAGING_RING_PFN,
> +        HVM_PARAM_ACCESS_RING_PFN,
> +        HVM_PARAM_SHARING_RING_PFN,
> +        HVM_PARAM_VM86_TSS,
> +        HVM_PARAM_CONSOLE_PFN,
> +        HVM_PARAM_ACPI_IOPORTS_LOCATION,
> +        HVM_PARAM_VIRIDIAN,
> +        HVM_PARAM_IDENT_PT,
> +        HVM_PARAM_PAE_ENABLED,
> +    };
> +    static const unsigned int num_params = ARRAY_SIZE(params);

Blank line between statics and regular please.

> +    xc_interface *xch = ctx->xch;
> +    struct rec_hvm_params_entry *entries;
> +    struct rec_hvm_params hdr = {
> +        .count = 0,
> +    };
> +    struct record rec = {
> +        .type   = REC_TYPE_hvm_params,
> +        .length = sizeof(hdr),
> +        .data   = &hdr,
> +    };
> +    unsigned int i;
> +    int rc;
> +
> +    entries = malloc(num_params * sizeof(*entries));

It's small enough that I think you could get away with
	struct rec_hvm_params_entry entries[num_params];


> +    if ( !entries )
> +    {
> +        PERROR("HVM params record");

"Allocating..." (or goes away)

> +    /* Write an END record */
> +    rc = write_record(ctx, &end);

Didn't I see a helper for this earlier on? Could it not be in common
code in any case?

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation
  2014-04-30 18:36 ` [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation Andrew Cooper
@ 2014-05-07 14:15   ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 14:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: David Vrabel, Xen-devel

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Just the generic comments I've been making on this one.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 0/9] Migration Stream v2
  2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
                   ` (9 preceding siblings ...)
  2014-05-01 16:41 ` [PATCH v4 0/9] Migration Stream v2 Konrad Rzeszutek Wilk
@ 2014-05-07 14:23 ` Ian Campbell
  10 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 14:23 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Tim Deegan, Ian Jackson, Xen-devel, Frediano Ziglio,
	David Vrabel, Jan Beulich

On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> There are certainly some areas still in need improvement.  There are plans to
> combine the save logic for PV and HVM guests by adding a few more hooks to
> save_restore_ops.  While reviewing the series for posting, I have noticed some
> accidental duplication from attempting to merge PV and HVM together
> (ctx->save.p2m_size and ctx->x86_pv.max_pfn) which can be consolidated.

Overall it looks good, well done to all those involved.

A few common issues I spotted which I didn't repeat every on every
instance:
      * Magic numbers (some with prexisting suitable defines) should
        msotly be #defined, or 
      * Chaining of unrelated things into if/else/else (especially where
        mixing input validation and actual functionality as I think I
        saw at least once)
      * { for a struct initialiser on the same line as the variable.
      * Apostrophes in comments are so last year

I think that was it (but if you notice me commenting on something
repeatedly but not consistently there's a good chance I meant to
remember it for this list).

> Moving forward will be work to do with converting a legacy image to a v2
> image.
> 
> Questions/issues needing discussing are:
> 
> 1) What level should the conversion happen.  The v2 format is capable of
>    detecting a legacy stream, but libxc is arguably too low level for the
>    conversion decision to happen.

The detection necessarily has to read the first N bytes, doesn't it?
Which pretty much means the decision has to be made in libxc I think?

> 2) What to do about the layer violations which is the toolstack record and
>    device model record.  Libxl for HVM guests is the only known user of the
>    'toolstack' chunk, which contains some xenstore key/value pairs.  This data
>    should not be a blob in the middle of the libxc layer, and should be
>    promoted to a first class member of a libxl migration stream.

Stefano added this, I can't remember what it is or why it ended up in
libxc, it could well just be "libxc_domain_save is unhackable", or maybe
there was a reason.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/9] tools/libxc: common code
  2014-05-07 13:03   ` Ian Campbell
@ 2014-05-07 14:38     ` Andrew Cooper
  2014-05-07 16:08       ` Ian Campbell
  2014-05-08  8:29       ` David Vrabel
  0 siblings, 2 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 14:38 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 07/05/14 14:03, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  tools/libxc/saverestore/common.c         |   87 ++++++
>>  tools/libxc/saverestore/common.h         |  172 ++++++++++++
>>  tools/libxc/saverestore/common_x86.c     |   54 ++++
>>  tools/libxc/saverestore/common_x86.h     |   21 ++
>>  tools/libxc/saverestore/common_x86_hvm.c |   53 ++++
>>  tools/libxc/saverestore/common_x86_pv.c  |  431 ++++++++++++++++++++++++++++++
>>  tools/libxc/saverestore/common_x86_pv.h  |  104 +++++++
>>  tools/libxc/saverestore/restore.c        |  288 ++++++++++++++++++++
>>  tools/libxc/saverestore/save.c           |   42 +++
>>  9 files changed, 1252 insertions(+)
>>  create mode 100644 tools/libxc/saverestore/common_x86.c
>>  create mode 100644 tools/libxc/saverestore/common_x86.h
>>  create mode 100644 tools/libxc/saverestore/common_x86_hvm.c
>>  create mode 100644 tools/libxc/saverestore/common_x86_pv.c
>>  create mode 100644 tools/libxc/saverestore/common_x86_pv.h
>>
>> diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
>> index de2e727..b159c4c 100644
>> --- a/tools/libxc/saverestore/common.c
>> +++ b/tools/libxc/saverestore/common.c
>> @@ -1,3 +1,5 @@
>> +#include <assert.h>
>> +
>>  #include "common.h"
>>  
>>  static const char *dhdr_types[] =
>> @@ -52,6 +54,91 @@ const char *rec_type_to_str(uint32_t type)
>>      return "Reserved";
>>  }
>>  
>> +int write_split_record(struct context *ctx, struct record *rec,
>> +                       void *buf, size_t sz)
>> +{
>> +    static const char zeroes[7] = { 0 };
>> +    xc_interface *xch = ctx->xch;
>> +    uint32_t combined_length = rec->length + sz;
>> +    size_t record_length = (combined_length + 7) & ~7UL;
> Isn't this ROUNDUP(combined_length, 3)? (Where 3 and/or 7 perhaps ought
> to be given names)

It is.  I am not certain that 3/7 need specific named, given the
specification mandating that all records have trailing 0s to align the
subsequent record on an 8 byte boundary.

>
>> +
>> +    if ( record_length > REC_LENGTH_MAX )
>> +    {
>> +        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
>> +              " exceeds max (0x%"PRIx32")", rec->type,
>> +              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
>> +        return -1;
>> +    }
>> +
>> +    if ( rec->length )
>> +        assert(rec->data);
>> +    if ( sz )
>> +        assert(buf);
>> +
>> +    if ( write_exact(ctx->fd, &rec->type, sizeof rec->type) ||
> libxc style appears to be "sizeof (rec->type)" (the space seems odd to
> me but is apparently The Style, I think the brackets are a must).

libxc style also states no extraneous brackets.  sizeof is particularly
problematic as it requires brackets when used with a type, and I keep on
reverting back to my personal preference.  I shall try to remain more
consistent.

>
>> +         write_exact(ctx->fd, &combined_length, sizeof rec->length) ||
>> +         (rec->length && write_exact(ctx->fd, rec->data, rec->length)) ||
>> +         (sz && write_exact(ctx->fd, buf, sz)) ||
>> +         write_exact(ctx->fd, zeroes, record_length - combined_length) )
>> +    {
>> +        PERROR("Unable to write record to stream");
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int read_record(struct context *ctx, struct record *rec)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    struct rhdr rhdr;
>> +    size_t datasz;
>> +
>> +    if ( read_exact(ctx->fd, &rhdr, sizeof rhdr) )
>> +    {
>> +        PERROR("Failed to read Record Header from stream");
>> +        return -1;
>> +    }
>> +    else if ( rhdr.length > REC_LENGTH_MAX )
>> +    {
>> +        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
>> +              " exceeds max (0x%"PRIx32")",
>> +              rhdr.type, rec_type_to_str(rhdr.type),
>> +              rhdr.length, REC_LENGTH_MAX);
>> +        return -1;
>> +    }
>> +
>> +    datasz = (rhdr.length + 7) & ~7U;
> Another ROUNDUP and #define XXX 3 or 7 opportunity. (I'm not going to
> mention any more of these I find).
>
> Is it not a bug in the input for datasz to not be aligned? I thought you
> were padding everything.

It is certainly not a bug.  For the blobs which can have an arbitrary
length, the record length field reflects the exact length.  Trailing 0s
are then inserted into the stream to align the start of the next record
on an 8 byte boundary.

Here, the simple approach is to just read those trailing bytes into the
buffer containing the data, so the malloc()d memory handed back from
this function might be slightly longer than the length field indicates.

>
>> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
>> index fff0a39..a35eda7 100644
>> --- a/tools/libxc/saverestore/common.h
>> +++ b/tools/libxc/saverestore/common.h
>> @@ -1,7 +1,20 @@
>>  #ifndef __COMMON__H
>>  #define __COMMON__H
>>  
>> +#include <stdbool.h>
>> +
>> +// Hack out junk from the namespace
>  ... namespace, because/in order to...
>
> (to prevent accidents I suppose?)

because mfn_to_pfn and pfn_to_mfn are gross abortions of macros which
look like regular functions but take implicit scoped state.

This list of junk is increased in subsequent patches.

>
>> +#define mfn_to_pfn __UNUSED_mfn_to_pfn
>> +#define pfn_to_mfn __UNUSED_pfn_to_mfn
>> +
>>  #include "../xg_private.h"
>> +#include "../xg_save_restore.h"
>> +#include "../xc_dom.h"
>> +#include "../xc_bitops.h"
>> +
>> +#undef mfn_to_pfn
>> +#undef pfn_to_mfn
>> +
>>  
>> [...]
>> +struct context
>> +{
>> +    xc_interface *xch;
>> +    uint32_t domid;
>> +    int fd;
>> +
>> +    xc_dominfo_t dominfo;
>> +
>> +    struct save_restore_ops ops;
>> +
>> +    union
>> +    {
>> +        struct
>> +        {
>> +            /* From Image Header */
>> +            uint32_t format_version;
>> +
>> +            /* From Domain Header */
>> +            uint32_t guest_type;
>> +            uint32_t guest_page_size;
>> +
> I suppose the following block is not from the Domain Header. /* Populate
> by builder */ or something?

Yes - I do have a task to insert more comments into the code.

>
> [...]
>
>> +    union
>> +    {
>> +        struct
>> +        {
>> +            /* 4 or 8; 32 or 64 bit domain */
>> +            unsigned int width;
>> +            /* 3 or 4 pagetable levels */
>> +            unsigned int levels;
>> +
>> +
> Deliberate double spacing? (You don't do it consistently)
> [...]

I have been collapsing patches from 3 separate people ;)  I will do a
consistency check before this becomes non-rfc.

>
>> +};
>> +
>> +/*
>> + * Write the image and domain headers to the stream.
>> + * (to eventually make static in save.c)
> Is this because of the 2/9 HACK to maintain both? So far it seems to me
> that the next iteration could drop that and go for the big switcheroo,
> but if not could we agree some tag for things which are specific to that
> in order to distinguish from longer term todos or hacky bits? (So I know
> which to ignore and which to consider)

This is to support individual save_x86_pv.c and save_x86_hvm.c files
before the next iteration which will make some more vm/arch hooks and
one single save algorithm.

>
>> + */
>> +int write_headers(struct context *ctx, uint16_t guest_type);
>> +
>> +extern struct save_restore_ops save_restore_ops_x86_pv;
>> +extern struct save_restore_ops save_restore_ops_x86_hvm;
> From the union in the context struct  I'm only expecting x86_pv in this
> patch, so perhaps x86_hvm was intended to arrive later?
>
> (a non-zero length commit message might have confirmed what was intended
> to be supplied here)

Nope - I had an error when folding patches together.

>
>> +/*
>> + * Writes a split record to the stream, applying correct padding where
>> + * appropriate.  It is common when sending records containing blobs from Xen
>> + * that the header and blob data are separate.  This function accepts a second
>> + * buffer and length, and will merge it with the main record when sending.
>> + *
>> + * Records with a non-zero length must provide a valid data field; records
>> + * with a 0 length shall have their data field ignored.
>> + *
>> + * Returns 0 on success and non0 on failure.
> "non-0". Do the non-zero values have any particular meanings? (errno
> codes, other error codes?)

No - see also the other inconsistencies in libxc.

The current state of all the new functions are "0 on success, -1 on
error and errno is unlikely be related".  This is no worse than the vast
majority of libxc currently, and stating "non-0" on error allows for a
positive "libxc error" range as discussed down the pub.

I am not shaving that yak right now, but have tried to leave the code in
a state where inserting consistent error handing should be less hassle
than it otherwise could be.

In all error cases, the log will (should) have accurate information,
including errno when relevent.

>
>> + */
>> +int write_split_record(struct context *ctx, struct record *rec, void *buf, size_t sz);
>> +
>> +/*
>> + * Writes a record to the stream, applying correct padding where appropriate.
>> + * Records with a non-zero length must provide a valid data field; records
>> + * with a 0 length shall have their data field ignored.
>> + *
>> + * Returns 0 on success and non0 on failure.
>> + */
>> +static inline int write_record(struct context *ctx, struct record *rec)
>> +{
>> +    return write_split_record(ctx, rec, NULL, 0);
>> +}
>> +
>> +/*
>> + * Reads a record from the stream, and fills in the record structure.
>> + *
>> + * Returns 0 on success and non-0 on failure.
>> + *
>> + * On success, the records type and size shall be valid.
>> + * - If size is 0, data shall be NULL.
>> + * - If size is non-0, data shall be a buffer allocated by malloc() which must
>> + *   be passed to free() by the caller.
>> + *
>> + * On failure, the contents of the record structure are undefined.
>> + */
>> +int read_record(struct context *ctx, struct record *rec);
>> +
>> +int write_page_data_and_pause(struct context *ctx);
> I suppose the page_data referred to here is accumulated in ctx?

This was some of Davids code which I now realise breaks the unstated
naming scheme I was working to.

>
>> +int handle_page_data(struct context *ctx, struct record *rec);
> Handle it how? Is this the restore side counterpart to
> write_page_data_...?

Correct.  handle_$RECORD_TYPE_NAME() (is supposed to) match
write_$RECORD_TYPE_NAME()

>
>> +int populate_pfns(struct context *ctx, unsigned count,
>> +                  const xen_pfn_t *original_pfns, const uint32_t *types);
> Populate from what?
>
> (You spoiled me with doc comments on the first few, so I'm being picky
> here ;-))

:) You can probably work out the rough order of work based in the
increasing scarcity of comments.  I shall fix this for the next
version.  I was debating sorting it out before posting, but decided that
posting something for review was better than waiting another few days
(as I was already late on my guess for the next posting).

>
>> +
>>  #endif
>>  /*
>>   * Local variables:
>> diff --git a/tools/libxc/saverestore/common_x86_hvm.c b/tools/libxc/saverestore/common_x86_hvm.c
>> new file mode 100644
>> index 0000000..0b9aac2
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/common_x86_hvm.c
> How delightfully uninteresting.

How curiously identical to arm :)  (Not the only file this statement
applies to...)

>
>> diff --git a/tools/libxc/saverestore/common_x86_pv.c b/tools/libxc/saverestore/common_x86_pv.c
>> new file mode 100644
>> index 0000000..35bce27
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/common_x86_pv.c
> How terrifying.

and yet, more simple than the current legacy counterparts.

I have stripped out all legacy bits, like support for 2 level PV guests,
and some vestigial code from the data with a 1G max VM size.  (which
would result in particularly fun results when attempting to migrate a PV
guest on a host using the 44th bit in mfns)

I *think* my sanity is still intacked.

>
>> @@ -0,0 +1,431 @@
>> +#include <assert.h>
>> +
>> +#include "common_x86_pv.h"
>> +
>> +xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
>> +{
>> +    assert(mfn <= ctx->x86_pv.max_mfn);
> Just to confirm that anywhere there is an assert like this then if it is
> based on potentially user controller input then it has already been
> validated elsewhere first? (with the abstraction I couldn't spot where
> that was)

It is validated in every case I am aware of, usually by
"mfn_in_pseudophysmap()"

>
>> +    return ctx->x86_pv.m2p[mfn];
>> +}
>> +
>> +static bool x86_pv_pfn_is_valid(struct context *ctx, xen_pfn_t pfn)
>> +{
>> +    return pfn <= ctx->x86_pv.max_pfn;
>> +}
>> +
>> +static xen_pfn_t x86_pv_pfn_to_gfn(struct context *ctx, xen_pfn_t pfn)
>> +{
>> +    assert(pfn <= ctx->x86_pv.max_pfn);
>> +
>> +    if ( ctx->x86_pv.width == sizeof (uint64_t) )
>> +        /* 64 bit guest.  Need to truncate their pfns for 32 bit toolstacks */
>> +        return ((uint64_t *)ctx->x86_pv.p2m)[pfn];
>> +    else
>> +    {
>> +        /* 32 bit guest.  Need to expand INVALID_MFN fot 64 bit toolstacks */
> Typo: "fot"

Oops

>
>> +        uint32_t mfn = ((uint32_t *)ctx->x86_pv.p2m)[pfn];
>> +
>> +        return mfn == ~0U ? INVALID_MFN : mfn;
>> +    }
>> +}
>> +
>> +static void x86_pv_set_page_type(struct context *ctx, xen_pfn_t pfn,
>> +                                 unsigned long type)
>> +{
>> +    assert(pfn <= ctx->x86_pv.max_pfn);
>> +
>> +    ctx->x86_pv.pfn_types[pfn] = type;
>> +}
>> +
>> +static void x86_pv_set_gfn(struct context *ctx, xen_pfn_t pfn,
>> +                           xen_pfn_t mfn)
>> +{
>> +    assert(pfn <= ctx->x86_pv.max_pfn);
>> +
>> +    if ( ctx->x86_pv.width == sizeof (uint64_t) )
>> +        /* 64 bit guest.  Need to expand INVALID_MFN for 32 bit toolstacks */
>> +        ((uint64_t *)ctx->x86_pv.p2m)[pfn] = mfn == INVALID_MFN ? ~0ULL : mfn;
>> +    else
>> +        /* 32 bit guest.  Can safely truncate INVALID_MFN fot 64 bit toolstacks */
> Another fot.

Probably M-w C-y from above.

>
>> +        ((uint32_t *)ctx->x86_pv.p2m)[pfn] = mfn;
>> +}
>> +
>> +static int normalise_pagetable(struct context *ctx, const uint64_t *src,
>> +                               uint64_t *dst, unsigned long type)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    uint64_t pte;
>> +    unsigned i, xen_first = -1, xen_last = -1; /* Indicies of Xen mappings */
>> +
>> +    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
>> +
>> +    if ( ctx->x86_pv.levels == 4 )
>> +    {
>> +        /* 64bit guests only have Xen mappings in their L4 tables */
> You (correctly) infer from the pt levels that this must be a 64-bit
> guest. Assert?

Not really worth asserting.

Xen doesn't expose levels directly.  It is calculated from the domain
width and (in the past, domain abi string to differentiate 2 and 3 level
guests) at the start of migration.

x86_pv_domain_info() assures that ctx->x86_pv.{width,levels} are consistent.

>
>> +        if ( type == XEN_DOMCTL_PFINFO_L4TAB )
>> +        {
>> +            xen_first = 256;
>> +            xen_last = 271;
> Can we #define all these numbers please?

There are a lot of numbers Xen should expose on the public API/ABI which
it currently doesn't.  I shall see about collecting them together.

>
>> +        }
>> +    }
>> +    else
>> +    {
> Must be 32 bit? Assert?
>
>> +        switch ( type )
>> +        {
>> +        case XEN_DOMCTL_PFINFO_L4TAB:
>> +            ERROR("??? Found L4 table for 32bit guest");
>> +            errno = EINVAL;
>> +            return -1;
>> +
>> +        case XEN_DOMCTL_PFINFO_L3TAB:
>> +            /* 32bit guests can only use the first 4 entries of their L3 tables.
>> +             * All other are potentially used by Xen. */
>> +            xen_first = 4;
>> +            xen_last = 512;
>> +            break;
>> +
>> +        case XEN_DOMCTL_PFINFO_L2TAB:
>> +            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
>> +             * are normal but only a few will have Xen mappings.
> Internally Xen has PGT_pae_xen_l2, I wonder if it could be coaxed into
> exposing that here too?

That would require exposing xen struct page_info's for specific mfns. 
This way seems less hacky.

>
>> +             * 428 = (HYPERVISOR_VIRT_START_PAE >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff
>> +             *
>> +             * ...which is conveniently unavailable to us in a 64bit build.
> Can we export it somehow? (Perhaps with COMPAT in the name)
>
>> ... normalize_page
>> +    local_page = malloc(PAGE_SIZE);
> How bad is the overhead of (what I pressume are) all these allocs and
> frees? (I've not come across the caller in this patch, maybe it will
> become clear later).
>
> Would it be worth keeping an array around shadowing the "live" batch of
> pages?

The allocation and freeing of pages has allowed valgrind to be
fantastically useful at pointing out when my code was doing something silly.

I would expect the overhead is negligible (the slow parts of migration
are the blocking fd reads/writes, mapping hypercalls and memcpy()s), and
the aid to debugging is far more important.

>
>> +static int x86_pv_localise_page(struct context *ctx, uint32_t type, void *page)
> "localise" means the inverse of "normalise"?

Yes.  I specifically wanted to avoid the term "canonicalise" from the
old code in reference to pages and pagetables, as a canonical 64bit
address is an architectural term.

>
>
>> +void pseudophysmap_walk(struct context *ctx, xen_pfn_t mfn)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xen_pfn_t pfn = ~0UL;
>> +
>> +    ERROR("mfn %#lx, max %#lx", mfn, ctx->x86_pv.max_mfn);
> It took me a while to figure out why these were all errors. Can we call
> the function dump_bad_pseudophysmap_walk or some such?

Yes - that makes more sense.

>
>> +xen_pfn_t cr3_to_mfn(struct context *ctx, uint64_t cr3)
>> +{
>> +    if ( ctx->x86_pv.width == 8 )
>> +        return cr3 >> 12;
>> +    else
>> +        return (((uint32_t)cr3 >> 12) | ((uint32_t)cr3 << 20));
> A comment pointing people to the phrase "extended-cr3" as grep fodder
> would be appreciated.

Ok.

>
>> +}
>> +
>> +uint64_t mfn_to_cr3(struct context *ctx, xen_pfn_t mfn)
>> +{
>> +    if ( ctx->x86_pv.width == 8 )
>> +        return ((uint64_t)mfn) << 12;
>> +    else
>> +        return (((uint32_t)mfn << 12) | ((uint32_t)mfn >> 20));
>> +}
>> +
> [...]
>> +    else if ( guest_width == 4 )
> Would prefer to drop the else and insert a blank line to separate the
> error handling of the xc_domain_.. from the interpretation of its
> result.

Hmm - this has come about because of changing the "goto err;" vs "return
-1" error handing.  I shall try to make it more consistent.

>
>> +        guest_levels = 3;
>> +    else if ( guest_width == 8 )
>> +        guest_levels = 4;
>> +    else
>> +    {
>> +        ERROR("Invalid guest width %d.  Expected 32 or 64", guest_width);
> Actually you expected 4 or 8. I'm not sure if this could be strengthened
> into an assert.

Hmm yes.  In this case we can probably assert if Xen hands us back junk.

>
>> +        return -1;
>> +    }
>> +    ctx->x86_pv.width = guest_width;
>> +    ctx->x86_pv.levels = guest_levels;
>> +    ctx->x86_pv.fpp = fpp = PAGE_SIZE / ctx->x86_pv.width;
>> +
>> +    DPRINTF("%d bits, %d levels", guest_width * 8, guest_levels);
>> +
>> +    /* Get the domains maximum pfn */
> Apostrophes are out of fashion at the moment I guess ;-).

Perhaps :s

>
>> +    max_pfn = xc_domain_maximum_gpfn(xch, ctx->domid);
>> +    if ( max_pfn < 0 )
>> +    {
>> +        PERROR("Unable to obtain guests max pfn");
>> +        return -1;
>> +    }
>> +    else if ( max_pfn >= ~XEN_DOMCTL_PFINFO_LTAB_MASK )
>> +    {
>> +        errno = E2BIG;
>> +        PERROR("Cannot save a guest this large %#x");
> Missing the parameter for %#x.

Hmm - xc_osdep_log() should probably have an __attribute__((format ...
)) statement to go with it, although I could have sworn that the
compiler would complain at me if I messed up the formatting string with
ERROR() macro.

>
>> [...]
>> +/*
>> + * Extract an MFN from a Pagetable Entry.
>> + */
>> +static inline xen_pfn_t pte_to_frame(struct context *ctx, uint64_t pte)
>> +{
>> +    if ( ctx->x86_pv.width == 8 )
>> +        return (pte >> PAGE_SHIFT) & ((1ULL << (52 - PAGE_SHIFT)) - 1);
>> +    else
>> +        return (pte >> PAGE_SHIFT) & ((1ULL << (44 - PAGE_SHIFT)) - 1);
> Are bits 45..52 non zero in this case?
>
> Can we #define a couple of MASK's please.

Another Xen ABI which should be exported.

>
>> +static int pfn_set_populated(struct context *ctx, xen_pfn_t pfn)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +
>> +    if ( !ctx->restore.populated_pfns || pfn > ctx->restore.max_populated_pfn )
>> +    {
>> +        unsigned long new_max_pfn = ((pfn + 1024) & ~1023) - 1;
>> +        size_t old_sz, new_sz;
>> +        unsigned long *p;
>> +
>> +        old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
>> +        new_sz = bitmap_size(new_max_pfn + 1);
>> +
>> +        p  = realloc(ctx->restore.populated_pfns, new_sz);
> In practice do you not see an initial run of increasing pfns and end up
> constantly adding one more? Consider doubling or adding e.g. 128M worth
> of space when growing?

In practice, there will be one bump from 0 to the correct size.

However, it is deliberately implemented like this so when someone
decides to fix ballooning/memhotplug vs migration, there is an already
working method of indicating "the p2m has gotten bigger since I started".

>> +
>> +    if ( nr_pages > 0 )
>> +    {
>> +        mapping = guest_page = xc_map_foreign_bulk(
>> +            xch, ctx->domid, PROT_READ | PROT_WRITE,
>> +            mfns, map_errs, nr_pages);
>> +        if ( !mapping )
>> +        {
>> +            PERROR("Unable to map %u mfns for %u pages of data",
>> +                   nr_pages, count);
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    for ( i = 0, j = 0; i < count; ++i )
>> +    {
>> +        switch ( types[i] )
>> +        {
>> +        case XEN_DOMCTL_PFINFO_XTAB:
>> +        case XEN_DOMCTL_PFINFO_BROKEN:
>> +            /* Nothing at all to do */
> Is this a deliberate fall-thru? Please comment if so.
>
>> +        case XEN_DOMCTL_PFINFO_XALLOC:
>> +            /* Nothing futher to do */
> "further"

They are all fall-though, indicated by the continue on the next line,
and lack of default.  I suppose I am using switch() for a fancy if.

>
>> +int handle_page_data(struct context *ctx, struct record *rec)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    struct rec_page_data_header *pages = rec->data;
>> +    unsigned i, pages_of_data = 0;
>> +    int rc = -1;
>> +
>> +    xen_pfn_t *pfns = NULL, pfn;
>> +    uint32_t *types = NULL, type;
>> +
>> +    static unsigned pg_count;
>> +    pg_count++;
>> +
>> +    if ( rec->length < sizeof *pages )
>> +    {
>> +        ERROR("PAGE_DATA record trucated: length %"PRIu32", min %zu",
>> +              rec->length, sizeof *pages);
>> +        goto err;
>> +    }
>> +    else if ( pages->count < 1 )
> We mostly omit the else where the previous if returns or gotos etc.

This comes from merging 3 peoples pages and not quite getting the
consistency correct.

>
>> +    for ( i = 0; i < pages->count; ++i )
>> +    {
>> +        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
>> +        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
>> +        {
>> +            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
>> +            goto err;
>> +        }
>> +
>> +        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
>> +        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
>> +             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
> What are 5 and 8?
>
> This might all be clearer with the use of a #define or two, and perhaps
> using a switch over the expected valid types and explicitly invalid
> types.

5 to 8 are the "holes" in the otherwise complete PFINFO_TYPE numerspace.

My views are that the XEN_DOMCTL_PFINFO_* macros are unconditionally
horrible to use, whatever you are trying to do, which is why the use of
them is gauged for local clarity rather than global consistency.

I am not sure a switch statement would help here.

>
>
>> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
>> index c013e62..e842e6c 100644
>> --- a/tools/libxc/saverestore/save.c
>> +++ b/tools/libxc/saverestore/save.c
>> @@ -1,5 +1,47 @@
>> +#include <arpa/inet.h>
>> +
>>  #include "common.h"
>>  
>> +int write_headers(struct context *ctx, uint16_t guest_type)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
>> +    struct ihdr ihdr =
>> +        {
> I think coding style puts this on the previous line and the body should
> be 4 spaces further in.

Not according to the emacs mode at the bottom, which would appear to
create a contradiction in CODING_SYTLE.

>
>> +            .marker  = IHDR_MARKER,
>> +            .id      = htonl(IHDR_ID),
>> +            .version = htonl(IHDR_VERSION),
>> +            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
>> +        };
>> +    struct dhdr dhdr =
>> +        {
>> +            .type       = guest_type,
>> +            .page_shift = 12,
> We have a define for this.
>
> Ian.
>

So we do.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/9] tools/libxc: x86 pv save implementation
  2014-05-07 13:58   ` Ian Campbell
@ 2014-05-07 15:20     ` Andrew Cooper
  2014-05-07 16:15       ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 15:20 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 07/05/14 14:58, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>
> The previous "common" patch seemed to have a fair bit of x86 pv save
> code in it, and there seems to be a fair bit of common code here. I
> suppose the split was pretty artificial to make the patches manageable?
>
>> @@ -121,6 +123,9 @@ struct context
>>      };
>>  };
>>  
>> +/* Saves an x86 PV domain. */
> I should hope so!
>
>> +int save_x86_pv(struct context *ctx);
>> +
>>  /*
>>   * Write the image and domain headers to the stream.
>>   * (to eventually make static in save.c)
>> @@ -137,6 +142,21 @@ struct record
>>      void *data;
>>  };
>>  
>> +/* Gets a field from an *_any union */
> Unless the #undefs which were above before I trimmed them relate to
> patch #2/9 they should go right before the redefine IMHO.

At the moment the series is specifically designed to allow the legacy
and v2 code to live together for testing purposes.

When this series leaves RFC, the old macros will be killed with fire. 
(See also gross abortions with scoped state)

>
>> +#define GET_FIELD(_c, _p, _f)                   \
>> +    ({ (_c)->x86_pv.width == 8 ?                \
>> +            (_p)->x64._f:                       \
>> +            (_p)->x32._f;                       \
>> +    })                                          \
>> +
>> +/* Gets a field from an *_any union */
>> +#define SET_FIELD(_c, _p, _f, _v)               \
>> +    ({ if ( (_c)->x86_pv.width == 8 )           \
>> +            (_p)->x64._f = (_v);                \
>> +        else                                    \
>> +            (_p)->x32._f = (_v);                \
>> +    })
>> +
>>  /*
>>   * Writes a split record to the stream, applying correct padding where
>>   * appropriate.  It is common when sending records containing blobs from Xen
>> +    /* MFNs of the batch pfns */
>> +    mfns = malloc(nr_pfns * sizeof *mfns);
>> +    /* Types of the batch pfns */
>> +    types = malloc(nr_pfns * sizeof *types);
>> +    /* Errors from attempting to map the mfns */
>> +    errors = malloc(nr_pfns * sizeof *errors);
>> +    /* Pointers to page data to send.  Might be from mapped mfns or local allocations */
>> +    guest_data = calloc(nr_pfns, sizeof *guest_data);
>> +    /* Pointers to locally allocated pages.  Probably not all used, but need freeing */
>> +    local_pages = calloc(nr_pfns, sizeof *local_pages);
> Wouldn't it be better to preallocate (most of) these for the entire
> save/restore than to do it for each batch?

Valgrind vs (suspected)negligible overhead, particularly in this
function which is probably the single most complex of the series.

>
>> +    for ( i = 0; i < nr_pfns; ++i )
>> +    {
>> +        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->batch_pfns[i]);
>> +
>> +        /* Likely a ballooned page */
>> +        if ( mfns[i] == INVALID_MFN )
>> +            set_bit(ctx->batch_pfns[i], ctx->deferred_pages);
> deferred_pages doesn't need to grow dynamically like that bit map in a
> previous patch, does it?

On second thoughts, it probably does.

>
>> +    hdr.count = nr_pfns;
>> +    s = nr_pfns * sizeof *rec_pfns;
> Brackets around sizeof would really help here.
>
>> +
>> +
>> +    rec_pfns = malloc(s);
> Given the calculation of s is this a candidate for calloc?

s was actually some debugging which managed to stay in.  I shall clean
it up a little.

>
>> +    if ( !rec_pfns )
>> +    {
>> +        ERROR("Unable to allocate memory for page data pfn list");
>> +        goto err;
>> +    }
>> +
>> +    for ( i = 0; i < nr_pfns; ++i )
>> +        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->batch_pfns[i];
> Could this not have a more structured type? (I noticed this in the
> decode bit of the previous patch too).

Not without resorting to bitfields, which I would prefer to avoid for
data ending up being written into a stream.

>
>> +
>> +    /*        header +          pfns data           +        page data */
>> +    rec_size = 4 + 4 + (s) + (nr_pages * PAGE_SIZE);
> Can these 4's not be sizeof something? Also the comment looks like it
> wants to align with something, but is failing.

Again, leaked debugging code.

>
>> +    rec_count = nr_pfns;
>> +
>> +    if ( write_exact(ctx->fd, &rec_type, sizeof(uint32_t)) ||
>> +         write_exact(ctx->fd, &rec_size, sizeof(uint32_t)) ||
> Are these not a struct rhdr?

Yes,

I intended to implement "write_record_raw()" or so which trusted the
length calculation given, although I appear to have lost the tuits.  I
should find them again.

>
>> +         write_exact(ctx->fd, &rec_count, sizeof(uint32_t)) ||
>> +         write_exact(ctx->fd, &rec_res, sizeof(uint32_t)) )
>> +    {
>> +        PERROR("Failed to write page_type header to stream");
>> +        goto err;
>> +    }
>> +
>> +    if ( write_exact(ctx->fd, rec_pfns, s) )
>> +    {
>> +        PERROR("Failed to write page_type header to stream");
>> +        goto err;
>> +    }
>> +
>> +
>> +    for ( i = 0; i < nr_pfns; ++i )
> Extra braces are useful for clarity in these siturations.
>
>
>> +static int flush_batch(struct context *ctx)
>> +{
>> +    int rc = 0;
>> +
>> +    if ( ctx->nr_batch_pfns == 0 )
>> +        return rc;
>> +
>> +    rc = write_batch(ctx);
>> +
>> +    if ( !rc )
>> +    {
>> +        /* Valgrind sanity check */
> What now?

This results in the array being marked as unused again, which was useful
when I found two pieces of code playing with the index and write_batch()
finding some plausible-but-wrong pfns to send.

>
>> +        free(ctx->batch_pfns);
>> +        ctx->batch_pfns = malloc(MAX_BATCH_SIZE * sizeof *ctx->batch_pfns);
>> +        rc = !ctx->batch_pfns;
>> +    }
>> +
>> +
>> +    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
>> +    {
>> +        DPRINTF("Iteration %u", x);
> Something in this series should be using xc_report_progress_* (I'm not
> sure that place is here, but I didn't see it elsewhere yet).

Ah yes - I think other areas want to do the same.

>
>>  int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>>                      uint32_t max_factor, uint32_t flags,
>>                      struct save_callbacks* callbacks, int hvm,
>>                      unsigned long vm_generationid_addr)
>>  {
>> +    struct context ctx =
>> +        {
> Coding style.
>
>> +            .xch = xch,
>> +            .fd = io_fd,
>> +        };
>> +
>> +    /* Older GCC cant initialise anonymous unions */
> How old?

4.7 in debian is fine.  4.4 in CentOS 6 isn't

>
>
>> +static int map_p2m(struct context *ctx)
>> +{
>> [...]
>> +    fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list);
>> +    if ( !fll_mfn )
>> +        IPRINTF("Waiting for domain to set up its p2m frame list list");
>> +
>> +    while ( tries-- && !fll_mfn )
>> +    {
>> +        usleep(10000);
>> +        fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo,
>> +                            arch.pfn_to_mfn_frame_list_list);
>> +    }
>> +
>> +    if ( !fll_mfn )
>> +    {
>> +        ERROR("Timed out waiting for p2m frame list list to be updated");
>> +        goto err;
>> +    }
> This (preexisting) ugliness is there to "support" running migrate
> immediately after starting a guest, when it hasn't got its act together
> yet?

Yes - as the guest is in control of its p2m, it has to set itself up
sufficiently first before the toolstack can migrate it.

>
> I wonder if we can get away with refusing to migrate under those
> circumstances.

Probably, although I thought I tried removing this originally then found
a case where I still needed it.  It would be nice to remove it if possible.

>
>> +static int write_one_vcpu_basic(struct context *ctx, uint32_t id)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xen_pfn_t mfn, pfn;
>> +    unsigned i;
>> +    int rc = -1;
>> +    vcpu_guest_context_any_t vcpu;
>> +    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
>> +    struct record rec =
>> +    {
>> +        .type = REC_TYPE_x86_pv_vcpu_basic,
>> +        .length = sizeof vhdr,
>> +        .data = &vhdr,
>> +    };
>> +
>> +    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
>> +    {
>> +        PERROR("Failed to get vcpu%u context", id);
>> +        goto err;
>> +    }
>> +
>> +    /* Vcpu 0 is special: Convert the suspend record to a PFN */
> The "suspend record" is the guest's start_info_t I think? Would be
> clearer to say that I think (and perhaps allude to it being the magic
> 3rd parameter to the hypercall.)

It is inconsistently referred to in the toolstack and Xen public
headers.  I think I inherited the name "suspend record" from the legacy
code.

>
>> +    /* 64bit guests: Convert CR1 (guest pagetables) to PFN */
>> +    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
>> +    {
>> +        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
>> +        if ( !mfn_in_pseudophysmap(ctx, mfn) )
>> +        {
>> +            ERROR("Bad MFN for vcpu%u's cr1", id);
>> +            pseudophysmap_walk(ctx, mfn);
>> +            errno = ERANGE;
>> +            goto err;
>> +        }
>> +
>> +        pfn = mfn_to_pfn(ctx, mfn);
>> +        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
>> +    }
>> +
>> +    if ( ctx->x86_pv.width == 8 )
> The use of x86_pv.levels vs x86_pv.width to determine the type of guest
> is rather inconsistent. .width would seem to be the right one to use
> almost all the time I think.

I (believe I) have consistently used levels when dealing with pagetable
stuff, and width when dealing with bitness issues.

I suppose the difference was more pronounced in the legacy code with
32bit 2 and 3 level guests, but I still think this split is the better
way of distinguishing.

>
>> +        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x64);
>> +    else
>> +        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x32);
>> +
>> +    if ( rc )
>> +        goto err;
>> +
>> +    DPRINTF("Writing vcpu%u basic context", id);
>> +    rc = 0;
>> + err:
>> +
>> +    return rc;
>> +}
>> +
>> +static int write_one_vcpu_extended(struct context *ctx, uint32_t id)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    int rc;
>> +    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
>> +    struct record rec =
>> +    {
>> +        .type = REC_TYPE_x86_pv_vcpu_extended,
>> +        .length = sizeof vhdr,
>> +        .data = &vhdr,
>> +    };
> Does this pattern repeat a lot? DECLARE_REC(x86_pv_vcpu_extended, vhdr)
> perhaps?

It repeats a little but not too often.  However I frankly dislike macros
like that which do opaque initialisation of trivial structures.

>
>> +    struct xen_domctl domctl =
>> +    {
>> +        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
>> +        .domain = ctx->domid,
>> +        .u.ext_vcpucontext.vcpu = id,
>> +    };
> Wants to use DECLARE_DOMCTL I think (at least everywhere else in libxc
> seesms too).

Which has no difference now I dropped -DVALGRIND from libxc, but forgoes
the clarity and optimisation available from structure initialisation.

As previously indicated, I have an issue with macros which use or create
unidentified symbols in scope, because they are unconditionally less
clear than the plain alternative.

>
>> +static int write_all_vcpu_information(struct context *ctx)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xc_vcpuinfo_t vinfo;
>> +    unsigned int i;
>> +    int rc;
>> +
>> +    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
>> +    {
>> +        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
>> +        if ( rc )
>> +        {
>> +            PERROR("Failed to get vcpu%u information", i);
>> +            return rc;
>> +        }
>> +
>> +        if ( !vinfo.online )
>> +        {
>> +            DPRINTF("vcpu%u offline - skipping", i);
>> +            continue;
>> +        }
>> +
>> +        rc = write_one_vcpu_basic(ctx, i) ?:
>> +            write_one_vcpu_extended(ctx, i) ?:
>> +            write_one_vcpu_xsave(ctx, i);
> I'd prefer the more verbose one at a time version.
>
>> +    /* Map various structures */
>> +    rc = x86_pv_map_m2p(ctx) ?: map_shinfo(ctx) ?: map_p2m(ctx);
> One at a time please.

Ok.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 7/9] tools/libxc: x86 pv restore implementation
  2014-05-07 14:10   ` Ian Campbell
@ 2014-05-07 15:37     ` Andrew Cooper
  2014-05-07 16:17       ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 15:37 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 07/05/14 15:10, Ian Campbell wrote:
>
>> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
>> +    {
>> +        PERROR("Failed to get domain info");
>> +        return -1;
>> +    }
>> +
>> +    if ( ctx.dominfo.domid != dom )
>> +    {
>> +        ERROR("Domain %d does not exist", dom);
>> +        return -1;
>> +    }
>> +
>> +    ctx.domid = dom;
>> +    IPRINTF("Restoring domain %d", dom);
>> +
>> +    if ( read_headers(&ctx) )
>> +        return -1;
>> +
>> +    if ( ctx.dominfo.hvm )
>> +    {
>> +        ERROR("HVM Restore not supported yet");
>> +        return -1;
>> +    }
>> +    else
> Unneeded.

Perhaps, but it makes more readable patches as hvm restore is added later.

Furthermore, all of this will need tweaking when this series stops
trying to coexist alongside the legacy code.

>
>> +
>> +static int handle_x86_pv_vcpu_basic(struct context *ctx, struct record *rec)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    struct rec_x86_pv_vcpu *vhdr = rec->data;
>> +    vcpu_guest_context_any_t vcpu;
>> +    size_t vcpusz = ctx->x86_pv.width == 8 ? sizeof vcpu.x64 : sizeof vcpu.x32;
>> +    xen_pfn_t pfn, mfn;
>> +    unsigned long tmp;
>> +    unsigned i;
>> +    int rc = -1;
>> +
>> +    if ( rec->length <= sizeof *vhdr )
>> +    {
>> +        ERROR("X86_PV_VCPU_BASIC record trucated: length %"PRIu32", min %zu",
> "truncated"
>
>> +        else if ( (ctx->x86_pv.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
>> +                  (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
> I'd think some temporaries would help with clarity here.

This was code largely cribbed from the legacy code, as it is gnarly and
brittle and I didn't fancy subtly breaking something in the middle.  I
will see what I can do to clean it up.

>
>> +static int handle_x86_pv_vcpu_extended(struct context *ctx, struct record *rec)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    struct rec_x86_pv_vcpu *vcpu = rec->data;
>> +    DECLARE_DOMCTL;
>> +
>> +    if ( rec->length <= sizeof *vcpu )
>> +    {
>> +        ERROR("X86_PV_VCPU_EXTENDED record trucated: length %"PRIu32", min %zu",
>> +              rec->length, sizeof *vcpu + 1);
> I've just realised that there was a X86_PV_VCPU_EXTENDED in
> handle_x86_pv_vcpu_basic which was wrong, but I've trimmed it already so
> I'll comment here.

Noted.

>
>
>> +        return -1;
>> +    }
>> +    else if ( rec->length > sizeof *vcpu + 128 )
> There's an awful lot of magic numbers throughout this series. Can we try
> and use meaningful symbolic names for things please.

This one, no.  It is magic even in the legacy stream, and is sizeof
domctl.u on the sending toolstack, which thinking forward is not
necessarily sizeof domctl.u on the receiving toolstack.

>
>> +int restore_x86_pv(struct context *ctx)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    struct record rec;
>> +    int rc;
>> +
>> +    IPRINTF("In experimental %s", __func__);
>> +
>> +    if ( ctx->restore.guest_type != DHDR_TYPE_x86_pv )
>> +    {
>> +        ERROR("Unable to restore %s domain into an x86_pv domain",
>> +              dhdr_type_to_str(ctx->restore.guest_type));
>> +        return -1;
>> +    }
>> +    else if ( ctx->restore.guest_page_size != 4096 )
> #define's exist for this.
>
>> +    /* all done */
>> +    IPRINTF("All Done");
>> +    assert(!rc);
>> +    goto cleanup;
>> +
>> + err:
>> +    assert(rc);
> FWIW I think you could drop this and let the success case fall through
> without the hop and a skip..

I would prefer not to.  This assert has already caught 1 error of mine,
and I seem to remember also caught a bug in an early version of c/s
dda0b77d2caa

~Andrew

>
>> + cleanup:
>> +
>> +    free(ctx->x86_pv.p2m);
>> +    free(ctx->x86_pv.p2m_pfns);
>> +    free(ctx->x86_pv.pfn_types);
>> +    free(ctx->restore.populated_pfns);
>> +
>> +    if ( ctx->x86_pv.m2p )
>> +        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
>> +
>> +    return rc;
>> +}
>> +
>> +/*
>> + * Local variables:
>> + * mode: C
>> + * c-file-style: "BSD"
>> + * c-basic-offset: 4
>> + * tab-width: 4
>> + * indent-tabs-mode: nil
>> + * End:
>> + */
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 8/9] tools/libxc: x86 hvm save implementation
  2014-05-07 14:14   ` Ian Campbell
@ 2014-05-07 15:39     ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 15:39 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Vrabel, Xen-devel

On 07/05/14 15:14, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>> +static int write_hvm_params(struct context *ctx)
>> +{
>> +    static const unsigned int params[] = {
>> +        HVM_PARAM_STORE_PFN,
>> +        HVM_PARAM_IOREQ_PFN,
>> +        HVM_PARAM_BUFIOREQ_PFN,
>> +        HVM_PARAM_PAGING_RING_PFN,
>> +        HVM_PARAM_ACCESS_RING_PFN,
>> +        HVM_PARAM_SHARING_RING_PFN,
>> +        HVM_PARAM_VM86_TSS,
>> +        HVM_PARAM_CONSOLE_PFN,
>> +        HVM_PARAM_ACPI_IOPORTS_LOCATION,
>> +        HVM_PARAM_VIRIDIAN,
>> +        HVM_PARAM_IDENT_PT,
>> +        HVM_PARAM_PAE_ENABLED,
>> +    };
>> +    static const unsigned int num_params = ARRAY_SIZE(params);
> Blank line between statics and regular please.
>
>> +    xc_interface *xch = ctx->xch;
>> +    struct rec_hvm_params_entry *entries;
>> +    struct rec_hvm_params hdr = {
>> +        .count = 0,
>> +    };
>> +    struct record rec = {
>> +        .type   = REC_TYPE_hvm_params,
>> +        .length = sizeof(hdr),
>> +        .data   = &hdr,
>> +    };
>> +    unsigned int i;
>> +    int rc;
>> +
>> +    entries = malloc(num_params * sizeof(*entries));
> It's small enough that I think you could get away with
> 	struct rec_hvm_params_entry entries[num_params];

I would agree.  (Not my code originally).  I shall fix this up without
malloc.

>
>
>> +    if ( !entries )
>> +    {
>> +        PERROR("HVM params record");
> "Allocating..." (or goes away)
>
>> +    /* Write an END record */
>> +    rc = write_record(ctx, &end);
> Didn't I see a helper for this earlier on? Could it not be in common
> code in any case?
>
> Ian.
>

Probably should be.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/9] tools/libxc: common code
  2014-05-07 14:38     ` Andrew Cooper
@ 2014-05-07 16:08       ` Ian Campbell
  2014-05-07 16:30         ` Andrew Cooper
  2014-05-08  8:29       ` David Vrabel
  1 sibling, 1 reply; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 16:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-05-07 at 15:38 +0100, Andrew Cooper wrote:
> On 07/05/14 14:03, Ian Campbell wrote:
> > On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> ---
> >>  tools/libxc/saverestore/common.c         |   87 ++++++
> >>  tools/libxc/saverestore/common.h         |  172 ++++++++++++
> >>  tools/libxc/saverestore/common_x86.c     |   54 ++++
> >>  tools/libxc/saverestore/common_x86.h     |   21 ++
> >>  tools/libxc/saverestore/common_x86_hvm.c |   53 ++++
> >>  tools/libxc/saverestore/common_x86_pv.c  |  431 ++++++++++++++++++++++++++++++
> >>  tools/libxc/saverestore/common_x86_pv.h  |  104 +++++++
> >>  tools/libxc/saverestore/restore.c        |  288 ++++++++++++++++++++
> >>  tools/libxc/saverestore/save.c           |   42 +++
> >>  9 files changed, 1252 insertions(+)
> >>  create mode 100644 tools/libxc/saverestore/common_x86.c
> >>  create mode 100644 tools/libxc/saverestore/common_x86.h
> >>  create mode 100644 tools/libxc/saverestore/common_x86_hvm.c
> >>  create mode 100644 tools/libxc/saverestore/common_x86_pv.c
> >>  create mode 100644 tools/libxc/saverestore/common_x86_pv.h
> >>
> >> diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
> >> index de2e727..b159c4c 100644
> >> --- a/tools/libxc/saverestore/common.c
> >> +++ b/tools/libxc/saverestore/common.c
> >> @@ -1,3 +1,5 @@
> >> +#include <assert.h>
> >> +
> >>  #include "common.h"
> >>  
> >>  static const char *dhdr_types[] =
> >> @@ -52,6 +54,91 @@ const char *rec_type_to_str(uint32_t type)
> >>      return "Reserved";
> >>  }
> >>  
> >> +int write_split_record(struct context *ctx, struct record *rec,
> >> +                       void *buf, size_t sz)
> >> +{
> >> +    static const char zeroes[7] = { 0 };
> >> +    xc_interface *xch = ctx->xch;
> >> +    uint32_t combined_length = rec->length + sz;
> >> +    size_t record_length = (combined_length + 7) & ~7UL;
> > Isn't this ROUNDUP(combined_length, 3)? (Where 3 and/or 7 perhaps ought
> > to be given names)
> 
> It is.  I am not certain that 3/7 need specific named, given the
> specification mandating that all records have trailing 0s to align the
> subsequent record on an 8 byte boundary.

If the specification mandates it then that is an argument *for* a
SAVE_RECORD_ALIGNMENT type #define.

> >> +    if ( write_exact(ctx->fd, &rec->type, sizeof rec->type) ||
> > libxc style appears to be "sizeof (rec->type)" (the space seems odd to
> > me but is apparently The Style, I think the brackets are a must).
> 
> libxc style also states no extraneous brackets.

These ones aren't extraneous ;-)

Where does it say that anyway?

> >> +    datasz = (rhdr.length + 7) & ~7U;
> > Another ROUNDUP and #define XXX 3 or 7 opportunity. (I'm not going to
> > mention any more of these I find).
> >
> > Is it not a bug in the input for datasz to not be aligned? I thought you
> > were padding everything.
> 
> It is certainly not a bug.  For the blobs which can have an arbitrary
> length, the record length field reflects the exact length.  Trailing 0s
> are then inserted into the stream to align the start of the next record
> on an 8 byte boundary.

Make sense. In which case can you add a comment /* Include padding not
covered by rhdr.length */ or some such.

> >> +// Hack out junk from the namespace
> >  ... namespace, because/in order to...
> >
> > (to prevent accidents I suppose?)
> 
> because mfn_to_pfn and pfn_to_mfn are gross abortions of macros which
> look like regular functions but take implicit scoped state.

So don't use them?


> I have been collapsing patches from 3 separate people ;)  I will do a
> consistency check before this becomes non-rfc.

nb: these patches are missing the RFC tag if that's what they are.

> >
> >> +/*
> >> + * Writes a split record to the stream, applying correct padding where
> >> + * appropriate.  It is common when sending records containing blobs from Xen
> >> + * that the header and blob data are separate.  This function accepts a second
> >> + * buffer and length, and will merge it with the main record when sending.
> >> + *
> >> + * Records with a non-zero length must provide a valid data field; records
> >> + * with a 0 length shall have their data field ignored.
> >> + *
> >> + * Returns 0 on success and non0 on failure.
> > "non-0". Do the non-zero values have any particular meanings? (errno
> > codes, other error codes?)
> 
> No - see also the other inconsistencies in libxc.
> 
> The current state of all the new functions are "0 on success, -1 on
> error and errno is unlikely be related".  This is no worse than the vast
> majority of libxc currently, and stating "non-0" on error allows for a
> positive "libxc error" range as discussed down the pub.
> 
> I am not shaving that yak right now, but have tried to leave the code in
> a state where inserting consistent error handing should be less hassle
> than it otherwise could be.

OK.

> In all error cases, the log will (should) have accurate information,
> including errno when relevent.

Mentioning this in the comment is nice too, since it prevents callers
thinking they must log as well.

> >> @@ -0,0 +1,431 @@
> >> +#include <assert.h>
> >> +
> >> +#include "common_x86_pv.h"
> >> +
> >> +xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
> >> +{
> >> +    assert(mfn <= ctx->x86_pv.max_mfn);
> > Just to confirm that anywhere there is an assert like this then if it is
> > based on potentially user controller input then it has already been
> > validated elsewhere first? (with the abstraction I couldn't spot where
> > that was)
> 
> It is validated in every case I am aware of, usually by
> "mfn_in_pseudophysmap()"

Good, because it would be a security issue if not...

> >> +        switch ( type )
> >> +        {
> >> +        case XEN_DOMCTL_PFINFO_L4TAB:
> >> +            ERROR("??? Found L4 table for 32bit guest");
> >> +            errno = EINVAL;
> >> +            return -1;
> >> +
> >> +        case XEN_DOMCTL_PFINFO_L3TAB:
> >> +            /* 32bit guests can only use the first 4 entries of their L3 tables.
> >> +             * All other are potentially used by Xen. */
> >> +            xen_first = 4;
> >> +            xen_last = 512;
> >> +            break;
> >> +
> >> +        case XEN_DOMCTL_PFINFO_L2TAB:
> >> +            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
> >> +             * are normal but only a few will have Xen mappings.
> > Internally Xen has PGT_pae_xen_l2, I wonder if it could be coaxed into
> > exposing that here too?
> 
> That would require exposing xen struct page_info's for specific mfns. 
> This way seems less hacky.

I meant via the introduction of XEN_DOMCTL_PFIINF_L2XENTAB (name TBD)
(AIUI this interface is essentially exposing that bit of the page info,
isn't it?)

> >> ... normalize_page
> >> +    local_page = malloc(PAGE_SIZE);
> > How bad is the overhead of (what I pressume are) all these allocs and
> > frees? (I've not come across the caller in this patch, maybe it will
> > become clear later).
> >
> > Would it be worth keeping an array around shadowing the "live" batch of
> > pages?
> 
> The allocation and freeing of pages has allowed valgrind to be
> fantastically useful at pointing out when my code was doing something silly.

Fair enough (I wonder if there is some valgrind mechanism for marking
memory as explicitly uninitialised as if it had been freed/reallocated)

> I would expect the overhead is negligible (the slow parts of migration
> are the blocking fd reads/writes, mapping hypercalls and memcpy()s), and
> the aid to debugging is far more important.

True, AIUI glibc's malloc is pretty speedy.

> 
> >
> >> +static int x86_pv_localise_page(struct context *ctx, uint32_t type, void *page)
> > "localise" means the inverse of "normalise"?
> 
> Yes.  I specifically wanted to avoid the term "canonicalise" from the
> old code in reference to pages and pagetables, as a canonical 64bit
> address is an architectural term.

OK.

> >> +
> >> +    if ( nr_pages > 0 )
> >> +    {
> >> +        mapping = guest_page = xc_map_foreign_bulk(
> >> +            xch, ctx->domid, PROT_READ | PROT_WRITE,
> >> +            mfns, map_errs, nr_pages);
> >> +        if ( !mapping )
> >> +        {
> >> +            PERROR("Unable to map %u mfns for %u pages of data",
> >> +                   nr_pages, count);
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    for ( i = 0, j = 0; i < count; ++i )
> >> +    {
> >> +        switch ( types[i] )
> >> +        {
> >> +        case XEN_DOMCTL_PFINFO_XTAB:
> >> +        case XEN_DOMCTL_PFINFO_BROKEN:
> >> +            /* Nothing at all to do */
> > Is this a deliberate fall-thru? Please comment if so.
> >
> >> +        case XEN_DOMCTL_PFINFO_XALLOC:
> >> +            /* Nothing futher to do */
> > "further"
> 
> They are all fall-though, indicated by the continue on the next line,
> and lack of default.

The existing comment half way through the list is what breaks this
though, since it's no longer clear that it does indicate that.

> 
> >
> >> +    for ( i = 0; i < pages->count; ++i )
> >> +    {
> >> +        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
> >> +        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
> >> +        {
> >> +            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
> >> +            goto err;
> >> +        }
> >> +
> >> +        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
> >> +        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
> >> +             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
> > What are 5 and 8?
> >
> > This might all be clearer with the use of a #define or two, and perhaps
> > using a switch over the expected valid types and explicitly invalid
> > types.
> 
> 5 to 8 are the "holes" in the otherwise complete PFINFO_TYPE numerspace.
> 
> My views are that the XEN_DOMCTL_PFINFO_* macros are unconditionally
> horrible to use, whatever you are trying to do, which is why the use of
> them is gauged for local clarity rather than global consistency.
> 
> I am not sure a switch statement would help here.

In that case a comment would help.

> >> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
> >> index c013e62..e842e6c 100644
> >> --- a/tools/libxc/saverestore/save.c
> >> +++ b/tools/libxc/saverestore/save.c
> >> @@ -1,5 +1,47 @@
> >> +#include <arpa/inet.h>
> >> +
> >>  #include "common.h"
> >>  
> >> +int write_headers(struct context *ctx, uint16_t guest_type)
> >> +{
> >> +    xc_interface *xch = ctx->xch;
> >> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
> >> +    struct ihdr ihdr =
> >> +        {
> > I think coding style puts this on the previous line and the body should
> > be 4 spaces further in.
> 
> Not according to the emacs mode at the bottom, which would appear to
> create a contradiction in CODING_SYTLE.

CODING_STYLE would take precedence if it said anything about this
specific case, which it doesn't. The prevailing style everywhere else I
could find was with the { on the same line.

(not sure which bit of the emacs marker you think contradicts this,
BSD's style(9) doesn't cover this case either)

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 6/9] tools/libxc: x86 pv save implementation
  2014-05-07 15:20     ` Andrew Cooper
@ 2014-05-07 16:15       ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 16:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-05-07 at 16:20 +0100, Andrew Cooper wrote:
> On 07/05/14 14:58, Ian Campbell wrote:
> > On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
> >
> > The previous "common" patch seemed to have a fair bit of x86 pv save
> > code in it, and there seems to be a fair bit of common code here. I
> > suppose the split was pretty artificial to make the patches manageable?
> >
> >> @@ -121,6 +123,9 @@ struct context
> >>      };
> >>  };
> >>  
> >> +/* Saves an x86 PV domain. */
> > I should hope so!
> >
> >> +int save_x86_pv(struct context *ctx);
> >> +
> >>  /*
> >>   * Write the image and domain headers to the stream.
> >>   * (to eventually make static in save.c)
> >> @@ -137,6 +142,21 @@ struct record
> >>      void *data;
> >>  };
> >>  
> >> +/* Gets a field from an *_any union */
> > Unless the #undefs which were above before I trimmed them relate to
> > patch #2/9 they should go right before the redefine IMHO.
> 
> At the moment the series is specifically designed to allow the legacy
> and v2 code to live together for testing purposes.
> 
> When this series leaves RFC, the old macros will be killed with fire. 
> (See also gross abortions with scoped state)

I think it was later on where I requested for things subject to that 2/9
related future path be explicily tagged as such.

> >> +    if ( !rec_pfns )
> >> +    {
> >> +        ERROR("Unable to allocate memory for page data pfn list");
> >> +        goto err;
> >> +    }
> >> +
> >> +    for ( i = 0; i < nr_pfns; ++i )
> >> +        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->batch_pfns[i];
> > Could this not have a more structured type? (I noticed this in the
> > decode bit of the previous patch too).
> 
> Not without resorting to bitfields, which I would prefer to avoid for
> data ending up being written into a stream.

Fair enough.

> >> +static int flush_batch(struct context *ctx)
> >> +{
> >> +    int rc = 0;
> >> +
> >> +    if ( ctx->nr_batch_pfns == 0 )
> >> +        return rc;
> >> +
> >> +    rc = write_batch(ctx);
> >> +
> >> +    if ( !rc )
> >> +    {
> >> +        /* Valgrind sanity check */
> > What now?
> 
> This results in the array being marked as unused again, which was useful
> when I found two pieces of code playing with the index and write_batch()
> finding some plausible-but-wrong pfns to send.

If it is going to stay then the comment needs to be more explicit about
this.

Maybe make all this stuff #idndef NDEBUG? Except we don't have that for
tools, nevermind.

> >> +    /* Older GCC cant initialise anonymous unions */
> > How old?
> 
> 4.7 in debian is fine.  4.4 in CentOS 6 isn't

How irritating. Please mention it so someone looking at this in 5 years
can decide to clean it up.

> >
> >> +    /* 64bit guests: Convert CR1 (guest pagetables) to PFN */
> >> +    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
> >> +    {
> >> +        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
> >> +        if ( !mfn_in_pseudophysmap(ctx, mfn) )
> >> +        {
> >> +            ERROR("Bad MFN for vcpu%u's cr1", id);
> >> +            pseudophysmap_walk(ctx, mfn);
> >> +            errno = ERANGE;
> >> +            goto err;
> >> +        }
> >> +
> >> +        pfn = mfn_to_pfn(ctx, mfn);
> >> +        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
> >> +    }
> >> +
> >> +    if ( ctx->x86_pv.width == 8 )
> > The use of x86_pv.levels vs x86_pv.width to determine the type of guest
> > is rather inconsistent. .width would seem to be the right one to use
> > almost all the time I think.
> 
> I (believe I) have consistently used levels when dealing with pagetable
> stuff, and width when dealing with bitness issues.

I think I first noticed it when there was one within the other, but I
don't remember which patch that was in.

> I suppose the difference was more pronounced in the legacy code with
> 32bit 2 and 3 level guests, but I still think this split is the better
> way of distinguishing.

OK then.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 7/9] tools/libxc: x86 pv restore implementation
  2014-05-07 15:37     ` Andrew Cooper
@ 2014-05-07 16:17       ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 16:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-05-07 at 16:37 +0100, Andrew Cooper wrote:
> On 07/05/14 15:10, Ian Campbell wrote:
> >
> >> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
> >> +    {
> >> +        PERROR("Failed to get domain info");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if ( ctx.dominfo.domid != dom )
> >> +    {
> >> +        ERROR("Domain %d does not exist", dom);
> >> +        return -1;
> >> +    }
> >> +
> >> +    ctx.domid = dom;
> >> +    IPRINTF("Restoring domain %d", dom);
> >> +
> >> +    if ( read_headers(&ctx) )
> >> +        return -1;
> >> +
> >> +    if ( ctx.dominfo.hvm )
> >> +    {
> >> +        ERROR("HVM Restore not supported yet");
> >> +        return -1;
> >> +    }
> >> +    else
> > Unneeded.
> 
> Perhaps, but it makes more readable patches as hvm restore is added later.

OK, I think this one got dumped in with all the other actually
unnecessary elses in my mind.

> 
> Furthermore, all of this will need tweaking when this series stops
> trying to coexist alongside the legacy code.

>From a reviewer's point of view the sooner this happens the better.


> >
> >
> >> +        return -1;
> >> +    }
> >> +    else if ( rec->length > sizeof *vcpu + 128 )
> > There's an awful lot of magic numbers throughout this series. Can we try
> > and use meaningful symbolic names for things please.
> 
> This one, no.  It is magic even in the legacy stream, and is sizeof
> domctl.u on the sending toolstack, which thinking forward is not
> necessarily sizeof domctl.u on the receiving toolstack.

OK. Where a legitimate magic number remains then please add a comment.

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/9] tools/libxc: common code
  2014-05-07 16:08       ` Ian Campbell
@ 2014-05-07 16:30         ` Andrew Cooper
  2014-05-07 16:35           ` Ian Campbell
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2014-05-07 16:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 07/05/14 17:08, Ian Campbell wrote:
> On Wed, 2014-05-07 at 15:38 +0100, Andrew Cooper wrote:
>> On 07/05/14 14:03, Ian Campbell wrote:
>>> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
>>>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>>>> ---
>>>>  tools/libxc/saverestore/common.c         |   87 ++++++
>>>>  tools/libxc/saverestore/common.h         |  172 ++++++++++++
>>>>  tools/libxc/saverestore/common_x86.c     |   54 ++++
>>>>  tools/libxc/saverestore/common_x86.h     |   21 ++
>>>>  tools/libxc/saverestore/common_x86_hvm.c |   53 ++++
>>>>  tools/libxc/saverestore/common_x86_pv.c  |  431 ++++++++++++++++++++++++++++++
>>>>  tools/libxc/saverestore/common_x86_pv.h  |  104 +++++++
>>>>  tools/libxc/saverestore/restore.c        |  288 ++++++++++++++++++++
>>>>  tools/libxc/saverestore/save.c           |   42 +++
>>>>  9 files changed, 1252 insertions(+)
>>>>  create mode 100644 tools/libxc/saverestore/common_x86.c
>>>>  create mode 100644 tools/libxc/saverestore/common_x86.h
>>>>  create mode 100644 tools/libxc/saverestore/common_x86_hvm.c
>>>>  create mode 100644 tools/libxc/saverestore/common_x86_pv.c
>>>>  create mode 100644 tools/libxc/saverestore/common_x86_pv.h
>>>>
>>>> diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
>>>> index de2e727..b159c4c 100644
>>>> --- a/tools/libxc/saverestore/common.c
>>>> +++ b/tools/libxc/saverestore/common.c
>>>> @@ -1,3 +1,5 @@
>>>> +#include <assert.h>
>>>> +
>>>>  #include "common.h"
>>>>  
>>>>  static const char *dhdr_types[] =
>>>> @@ -52,6 +54,91 @@ const char *rec_type_to_str(uint32_t type)
>>>>      return "Reserved";
>>>>  }
>>>>  
>>>> +int write_split_record(struct context *ctx, struct record *rec,
>>>> +                       void *buf, size_t sz)
>>>> +{
>>>> +    static const char zeroes[7] = { 0 };
>>>> +    xc_interface *xch = ctx->xch;
>>>> +    uint32_t combined_length = rec->length + sz;
>>>> +    size_t record_length = (combined_length + 7) & ~7UL;
>>> Isn't this ROUNDUP(combined_length, 3)? (Where 3 and/or 7 perhaps ought
>>> to be given names)
>> It is.  I am not certain that 3/7 need specific named, given the
>> specification mandating that all records have trailing 0s to align the
>> subsequent record on an 8 byte boundary.
> If the specification mandates it then that is an argument *for* a
> SAVE_RECORD_ALIGNMENT type #define.

Fair enough.  I will include that.

>>>> +    datasz = (rhdr.length + 7) & ~7U;
>>> Another ROUNDUP and #define XXX 3 or 7 opportunity. (I'm not going to
>>> mention any more of these I find).
>>>
>>> Is it not a bug in the input for datasz to not be aligned? I thought you
>>> were padding everything.
>> It is certainly not a bug.  For the blobs which can have an arbitrary
>> length, the record length field reflects the exact length.  Trailing 0s
>> are then inserted into the stream to align the start of the next record
>> on an 8 byte boundary.
> Make sense. In which case can you add a comment /* Include padding not
> covered by rhdr.length */ or some such.

Ok.

>
>>>> +// Hack out junk from the namespace
>>>  ... namespace, because/in order to...
>>>
>>> (to prevent accidents I suppose?)
>> because mfn_to_pfn and pfn_to_mfn are gross abortions of macros which
>> look like regular functions but take implicit scoped state.
> So don't use them?

But the names themselves are the only logical choice for real functions
of the same purpose.

>
>
>> I have been collapsing patches from 3 separate people ;)  I will do a
>> consistency check before this becomes non-rfc.
> nb: these patches are missing the RFC tag if that's what they are.

The RFC tag appears to have dropped off some time during formatting
them.  My appologies.  The series as a whole is still not functionally
complete but is getting there.

>
>>>> @@ -0,0 +1,431 @@
>>>> +#include <assert.h>
>>>> +
>>>> +#include "common_x86_pv.h"
>>>> +
>>>> +xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
>>>> +{
>>>> +    assert(mfn <= ctx->x86_pv.max_mfn);
>>> Just to confirm that anywhere there is an assert like this then if it is
>>> based on potentially user controller input then it has already been
>>> validated elsewhere first? (with the abstraction I couldn't spot where
>>> that was)
>> It is validated in every case I am aware of, usually by
>> "mfn_in_pseudophysmap()"
> Good, because it would be a security issue if not...

In what way?  The absolute worst that could happen is a controlled stop
of the process handling migration for this domain.

It is certainly better than walking off the end of the mapped m2p.

>
>>>> +        switch ( type )
>>>> +        {
>>>> +        case XEN_DOMCTL_PFINFO_L4TAB:
>>>> +            ERROR("??? Found L4 table for 32bit guest");
>>>> +            errno = EINVAL;
>>>> +            return -1;
>>>> +
>>>> +        case XEN_DOMCTL_PFINFO_L3TAB:
>>>> +            /* 32bit guests can only use the first 4 entries of their L3 tables.
>>>> +             * All other are potentially used by Xen. */
>>>> +            xen_first = 4;
>>>> +            xen_last = 512;
>>>> +            break;
>>>> +
>>>> +        case XEN_DOMCTL_PFINFO_L2TAB:
>>>> +            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
>>>> +             * are normal but only a few will have Xen mappings.
>>> Internally Xen has PGT_pae_xen_l2, I wonder if it could be coaxed into
>>> exposing that here too?
>> That would require exposing xen struct page_info's for specific mfns. 
>> This way seems less hacky.
> I meant via the introduction of XEN_DOMCTL_PFIINF_L2XENTAB (name TBD)
> (AIUI this interface is essentially exposing that bit of the page info,
> isn't it?)

Hmm.  There are some spare numbers available to be used.  It might be an
idea.

>
>>>> ... normalize_page
>>>> +    local_page = malloc(PAGE_SIZE);
>>> How bad is the overhead of (what I pressume are) all these allocs and
>>> frees? (I've not come across the caller in this patch, maybe it will
>>> become clear later).
>>>
>>> Would it be worth keeping an array around shadowing the "live" batch of
>>> pages?
>> The allocation and freeing of pages has allowed valgrind to be
>> fantastically useful at pointing out when my code was doing something silly.
> Fair enough (I wonder if there is some valgrind mechanism for marking
> memory as explicitly uninitialised as if it had been freed/reallocated)

I have searched but cant find anything.  As valgrind is designed to wrap
compiled binaries, making function calls into it is probably non-trivial.

>>>> +
>>>> +    if ( nr_pages > 0 )
>>>> +    {
>>>> +        mapping = guest_page = xc_map_foreign_bulk(
>>>> +            xch, ctx->domid, PROT_READ | PROT_WRITE,
>>>> +            mfns, map_errs, nr_pages);
>>>> +        if ( !mapping )
>>>> +        {
>>>> +            PERROR("Unable to map %u mfns for %u pages of data",
>>>> +                   nr_pages, count);
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    for ( i = 0, j = 0; i < count; ++i )
>>>> +    {
>>>> +        switch ( types[i] )
>>>> +        {
>>>> +        case XEN_DOMCTL_PFINFO_XTAB:
>>>> +        case XEN_DOMCTL_PFINFO_BROKEN:
>>>> +            /* Nothing at all to do */
>>> Is this a deliberate fall-thru? Please comment if so.
>>>
>>>> +        case XEN_DOMCTL_PFINFO_XALLOC:
>>>> +            /* Nothing futher to do */
>>> "further"
>> They are all fall-though, indicated by the continue on the next line,
>> and lack of default.
> The existing comment half way through the list is what breaks this
> though, since it's no longer clear that it does indicate that.

The comment is partially stale in this context anyway (following
refactoring).  I shall see about collapsing it down to a single if
statement which can be made to be more clear.

>
>>>> +    for ( i = 0; i < pages->count; ++i )
>>>> +    {
>>>> +        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
>>>> +        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
>>>> +        {
>>>> +            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
>>>> +            goto err;
>>>> +        }
>>>> +
>>>> +        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
>>>> +        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
>>>> +             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
>>> What are 5 and 8?
>>>
>>> This might all be clearer with the use of a #define or two, and perhaps
>>> using a switch over the expected valid types and explicitly invalid
>>> types.
>> 5 to 8 are the "holes" in the otherwise complete PFINFO_TYPE numerspace.
>>
>> My views are that the XEN_DOMCTL_PFINFO_* macros are unconditionally
>> horrible to use, whatever you are trying to do, which is why the use of
>> them is gauged for local clarity rather than global consistency.
>>
>> I am not sure a switch statement would help here.
> In that case a comment would help.

I might be able to make it a little clearer.

>
>>>> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
>>>> index c013e62..e842e6c 100644
>>>> --- a/tools/libxc/saverestore/save.c
>>>> +++ b/tools/libxc/saverestore/save.c
>>>> @@ -1,5 +1,47 @@
>>>> +#include <arpa/inet.h>
>>>> +
>>>>  #include "common.h"
>>>>  
>>>> +int write_headers(struct context *ctx, uint16_t guest_type)
>>>> +{
>>>> +    xc_interface *xch = ctx->xch;
>>>> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
>>>> +    struct ihdr ihdr =
>>>> +        {
>>> I think coding style puts this on the previous line and the body should
>>> be 4 spaces further in.
>> Not according to the emacs mode at the bottom, which would appear to
>> create a contradiction in CODING_SYTLE.
> CODING_STYLE would take precedence if it said anything about this
> specific case, which it doesn't. The prevailing style everywhere else I
> could find was with the { on the same line.
>
> (not sure which bit of the emacs marker you think contradicts this,
> BSD's style(9) doesn't cover this case either)
>
> Ian.
>

That would be emacs' cc-mode interpretation of the marker, when
instructed to indent the entire file.

If it isn't specifically covered in BSD's style(9), I am sure there are
multiple interpretations going.  I will see about persuading emacs not
to indent them.

~Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/9] tools/libxc: common code
  2014-05-07 16:30         ` Andrew Cooper
@ 2014-05-07 16:35           ` Ian Campbell
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Campbell @ 2014-05-07 16:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-05-07 at 17:30 +0100, Andrew Cooper wrote:
> >
> >>>> @@ -0,0 +1,431 @@
> >>>> +#include <assert.h>
> >>>> +
> >>>> +#include "common_x86_pv.h"
> >>>> +
> >>>> +xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
> >>>> +{
> >>>> +    assert(mfn <= ctx->x86_pv.max_mfn);
> >>> Just to confirm that anywhere there is an assert like this then if it is
> >>> based on potentially user controller input then it has already been
> >>> validated elsewhere first? (with the abstraction I couldn't spot where
> >>> that was)
> >> It is validated in every case I am aware of, usually by
> >> "mfn_in_pseudophysmap()"
> > Good, because it would be a security issue if not...
> 
> In what way?  The absolute worst that could happen is a controlled stop
> of the process handling migration for this domain.

AKA the toolstack daemon, which would be a DoS depending on the
toolstack's architecture.

> 
> >
> >>>> ... normalize_page
> >>>> +    local_page = malloc(PAGE_SIZE);
> >>> How bad is the overhead of (what I pressume are) all these allocs and
> >>> frees? (I've not come across the caller in this patch, maybe it will
> >>> become clear later).
> >>>
> >>> Would it be worth keeping an array around shadowing the "live" batch of
> >>> pages?
> >> The allocation and freeing of pages has allowed valgrind to be
> >> fantastically useful at pointing out when my code was doing something silly.
> > Fair enough (I wonder if there is some valgrind mechanism for marking
> > memory as explicitly uninitialised as if it had been freed/reallocated)
> 
> I have searched but cant find anything.  As valgrind is designed to wrap
> compiled binaries, making function calls into it is probably non-trivial.

It was a long shot...

I was imagining you'd have to have a local
	void valgrind_invalidate(void *p) {}
which valgrind would trap calls to in its usual way "somehow"

Ian.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 5/9] tools/libxc: common code
  2014-05-07 14:38     ` Andrew Cooper
  2014-05-07 16:08       ` Ian Campbell
@ 2014-05-08  8:29       ` David Vrabel
  1 sibling, 0 replies; 39+ messages in thread
From: David Vrabel @ 2014-05-08  8:29 UTC (permalink / raw)
  To: Andrew Cooper, Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 07/05/2014 15:38, Andrew Cooper wrote:
> 
> because mfn_to_pfn and pfn_to_mfn are gross abortions of macros which
> look like regular functions but take implicit scoped state.

Please do not use such insensitive and offensive language on xen-devel.

xen-devel needs to be a safe and welcoming space for all contributors.

David

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2014-05-08  8:29 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW() Andrew Cooper
2014-05-07 11:45   ` Ian Campbell
2014-05-07 12:00     ` David Vrabel
2014-05-07 12:06       ` Ian Campbell
2014-04-30 18:36 ` [PATCH v4 2/9] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 3/9] tools/libxc: Stream specification and some common code Andrew Cooper
2014-05-07 11:57   ` Ian Campbell
2014-05-07 12:06     ` David Vrabel
2014-05-07 12:14     ` Andrew Cooper
2014-05-07 13:07       ` Ian Campbell
2014-05-07 13:20         ` Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
2014-05-07 12:03   ` Ian Campbell
2014-05-07 12:08     ` David Vrabel
2014-05-07 12:27     ` Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 5/9] tools/libxc: common code Andrew Cooper
2014-05-07 13:03   ` Ian Campbell
2014-05-07 14:38     ` Andrew Cooper
2014-05-07 16:08       ` Ian Campbell
2014-05-07 16:30         ` Andrew Cooper
2014-05-07 16:35           ` Ian Campbell
2014-05-08  8:29       ` David Vrabel
2014-04-30 18:36 ` [PATCH v4 6/9] tools/libxc: x86 pv save implementation Andrew Cooper
2014-05-07 13:58   ` Ian Campbell
2014-05-07 15:20     ` Andrew Cooper
2014-05-07 16:15       ` Ian Campbell
2014-04-30 18:36 ` [PATCH v4 7/9] tools/libxc: x86 pv restore implementation Andrew Cooper
2014-05-07 14:10   ` Ian Campbell
2014-05-07 15:37     ` Andrew Cooper
2014-05-07 16:17       ` Ian Campbell
2014-04-30 18:36 ` [PATCH v4 8/9] tools/libxc: x86 hvm save implementation Andrew Cooper
2014-05-07 14:14   ` Ian Campbell
2014-05-07 15:39     ` Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation Andrew Cooper
2014-05-07 14:15   ` Ian Campbell
2014-05-01 16:41 ` [PATCH v4 0/9] Migration Stream v2 Konrad Rzeszutek Wilk
2014-05-01 17:32   ` Andrew Cooper
2014-05-07 14:23 ` Ian Campbell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.