* [PATCH v5 0/14] Migration Stream v2
@ 2014-06-11 18:14 Andrew Cooper
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
                   ` (15 more replies)
  0 siblings, 16 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Ian Jackson,
	Tim Deegan, Frediano Ziglio, David Vrabel, Jan Beulich

Hello,

Presented here for review is v5 of the Migration Stream v2 work.

Major work in v5 is the legacy conversion python script which is capable of
converting from legacy to v2 format on the fly.  It was tested using live
migration saving in the legacy format, piping through the script and restoring
using the v2 code.

Other work includes a substantial refactoring of the code structure, allowing
for single generic save() and restore() functions, with function pointer
hooks for each type of guest to implement.  The spec has been updated to
include PV MSRs in the migration stream.
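
To give a feel for the shape of that refactoring, here is a rough sketch of
the hook pattern in Python (illustration only: the real implementation is C
inside libxc, and all class/method names here are hypothetical, not the
in-tree ones):

```python
# Sketch of the "generic save() with per-guest hooks" structure.
# Hypothetical names; the in-tree code is C with function pointer tables.

class GuestOps(object):
    """Hooks each guest type must implement (cf. C function pointers)."""
    def write_info_records(self, stream):
        raise NotImplementedError
    def write_vcpu_records(self, stream):
        raise NotImplementedError

class X86PVOps(GuestOps):
    def write_info_records(self, stream):
        stream.append("X86_PV_INFO")
    def write_vcpu_records(self, stream):
        stream.append("X86_PV_VCPU_BASIC")

class X86HVMOps(GuestOps):
    def write_info_records(self, stream):
        stream.append("HVM_CONTEXT")
    def write_vcpu_records(self, stream):
        pass  # HVM vcpu state travels inside the HVM context blob

def generic_save(ops):
    """One generic driver; guest-specific detail lives behind the hooks."""
    stream = ["IMAGE_HDR", "DOMAIN_HDR"]
    ops.write_info_records(stream)
    ops.write_vcpu_records(stream)
    stream.append("END")
    return stream
```

The point is that save()/restore() are written once, and only the
record-producing hooks differ per guest type (this sketch elides the page
data loop and most record types).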


This series depends on several prerequisite fixes which have recently been
committed into staging (and have not passed the push gate at the time of
writing).  The series also depends on the VM Generation ID series from David.

I now consider the core of the v2 code stable.  I do not expect it to change
too much, other than the identified TODOs (and code review, of course).


The next area of work is the libxl integration, which will seek to undo the
current layering violations.  It will involve introducing a new libxl framing
format (which will happen to look curiously similar to the libxc framing
format), as well as providing legacy compatibility using the legacy conversion
scripts so migrations from older libxl/libxc toolstacks will continue to work.


The code is presented here for comment/query/criticism.

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12  9:45   ` David Vrabel
                     ` (3 more replies)
  2014-06-11 18:14 ` [PATCH v5 RFC 02/14] scripts: Scripts for inspection/validation of legacy and new streams Andrew Cooper
                   ` (14 subsequent siblings)
  15 siblings, 4 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Ian Jackson, Ian Campbell

The pandoc markdown format is a superset of plain markdown, and while
`markdown` is capable of making plausible documents from it, the results are
not fantastic.

Differentiate the two via file extension to avoid running `markdown` on a
pandoc document.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>

---

In some copious free time (a doc day perhaps), I will see about getting pandoc
-> html working, but I don't have time right now.
---
 docs/Makefile                            |    9 +-
 docs/specs/libxc-migration-stream.pandoc |  640 ++++++++++++++++++++++++++++++
 2 files changed, 648 insertions(+), 1 deletion(-)
 create mode 100644 docs/specs/libxc-migration-stream.pandoc

diff --git a/docs/Makefile b/docs/Makefile
index 46e8f22..126aa04 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -14,6 +14,8 @@ MAN5SRC-y := $(wildcard man/xl*.pod.5)
 
 MARKDOWNSRC-y := $(wildcard misc/*.markdown)
 
+PANDOCSRC-y += $(wildcard specs/*.pandoc)
+
 TXTSRC-y := $(wildcard misc/*.txt)
 
 
@@ -28,7 +30,8 @@ DOC_TXT  := $(patsubst %.txt,txt/%.txt,$(TXTSRC-y)) \
             $(patsubst %.markdown,txt/%.txt,$(MARKDOWNSRC-y)) \
             $(patsubst man/%.pod.1,txt/man/%.1.txt,$(MAN1SRC-y)) \
             $(patsubst man/%.pod.5,txt/man/%.5.txt,$(MAN5SRC-y))
-DOC_PDF  := $(patsubst %.markdown,pdf/%.pdf,$(MARKDOWNSRC-y))
+DOC_PDF  := $(patsubst %.markdown,pdf/%.pdf,$(MARKDOWNSRC-y)) \
+            $(patsubst %.pandoc,pdf/%.pdf,$(PANDOCSRC-y))
 
 .PHONY: all
 all: build
@@ -191,6 +194,10 @@ pdf/%.pdf: %.markdown
 	$(INSTALL_DIR) $(@D)
 	pandoc -N --toc --standalone $< --output $@
 
+pdf/%.pdf: %.pandoc
+	$(INSTALL_DIR) $(@D)
+	pandoc --number-sections --toc --standalone $< --output $@
+
 ifeq (,$(findstring clean,$(MAKECMDGOALS)))
 $(XEN_ROOT)/config/Docs.mk:
 	$(error You have to run ./configure before building docs)
diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
new file mode 100644
index 0000000..4769356
--- /dev/null
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -0,0 +1,640 @@
+% LibXenCtrl Domain Image Format
+% David Vrabel <<david.vrabel@citrix.com>>
+  Andrew Cooper <<andrew.cooper3@citrix.com>>
+% Draft G
+
+Introduction
+============
+
+Purpose
+-------
+
+The _domain save image_ is the context of a running domain used for
+snapshots of a domain or for transferring domains between hosts during
+migration.
+
+There are a number of problems with the format of the domain save
+image used in Xen 4.4 and earlier (the _legacy format_).
+
+* Dependent on toolstack word size.  A number of fields within the
+  image are native types such as `unsigned long` which have different
+  sizes between 32-bit and 64-bit toolstacks.  This prevents domains
+  from being migrated between hosts running 32-bit and 64-bit
+  toolstacks.
+
+* There is no header identifying the image.
+
+* The image has no version information.
+
+A new format that addresses the above is required.
+
+ARM does not yet have a domain save image format specified, and the
+format described in this specification should be suitable.
+
+Not Yet Included
+----------------
+
+The following features are not yet fully specified and will be
+included in a future draft.
+
+* Remus
+
+* Page data compression.
+
+* ARM
+
+
+Overview
+========
+
+The image format consists of two main sections:
+
+* _Headers_
+* _Records_
+
+Headers
+-------
+
+There are two headers: the _image header_, and the _domain header_.
+The image header describes the format of the image (version etc.).
+The _domain header_ contains general information about the domain
+(architecture, type etc.).
+
+Records
+-------
+
+The main part of the format is a sequence of different _records_.
+Each record type contains information about the domain context.  At a
+minimum there is an END record marking the end of the records section.
+
+
+Fields
+------
+
+All the fields within the headers and records have a fixed width.
+
+Fields are always aligned to their size.
+
+Padding and reserved fields are set to zero on save and must be
+ignored during restore.
+
+Integer (numeric) fields in the image header are always in big-endian
+byte order.
+
+Integer fields in the domain header and in the records are in the
+endianness described in the image header (which will typically be the
+native ordering).
+
+\clearpage
+
+Headers
+=======
+
+Image Header
+------------
+
+The image header identifies an image as a Xen domain save image.  It
+includes the version of this specification that the image complies
+with.
+
+Tools supporting version _V_ of the specification shall always save
+images using version _V_.  Tools shall support restoring from version
+_V_.  If the previous Xen release produced version _V_ - 1 images,
+tools shall support restoring from these.  Tools may additionally
+support restoring from earlier versions.
+
+The marker field can be used to distinguish between legacy images and
+those corresponding to this specification.  Legacy images will have
+one or more zero bits within the first 8 octets of the image.
+
+Fields within the image header are always in _big-endian_ byte order,
+regardless of the setting of the endianness bit.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | marker                                          |
+    +-----------------------+-------------------------+
+    | id                    | version                 |
+    +-----------+-----------+-------------------------+
+    | options   | (reserved)                          |
+    +-----------+-------------------------------------+
+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+marker      0xFFFFFFFFFFFFFFFF.
+
+id          0x58454E46 ("XENF" in ASCII).
+
+version     0x00000001.  The version of this specification.
+
+options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
+
+            bit 1-15: Reserved.
+--------------------------------------------------------------------
+
+The endianness shall be 0 (little-endian) for images generated on an
+i386, x86_64, or arm host.
+
+\clearpage
+
+Domain Header
+-------------
+
+The domain header includes general properties of the domain.
+
+     0      1     2     3     4     5     6     7 octet
+    +-----------------------+-----------+-------------+
+    | type                  | page_shift| (reserved)  |
+    +-----------------------+-----------+-------------+
+    | xen_major             | xen_minor               |
+    +-----------------------+-------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+type        0x0000: Reserved.
+
+            0x0001: x86 PV.
+
+            0x0002: x86 HVM.
+
+            0x0003: x86 PVH.
+
+            0x0004: ARM.
+
+            0x0005 - 0xFFFFFFFF: Reserved.
+
+page_shift  Size of a guest page as a power of two.
+
+            i.e., page size = 2^page_shift^.
+
+xen_major   The Xen major version when this image was saved.
+
+xen_minor   The Xen minor version when this image was saved.
+--------------------------------------------------------------------
+
+The legacy stream conversion tool writes a `xen_major` version of 0, and
+sets `xen_minor` to its own version.
+
+\clearpage
+
+Records
+=======
+
+A record has a record header, type specific data and a trailing
+footer.  If `body_length` is not a multiple of 8, the body is padded
+with zeroes to align the end of the record on an 8 octet boundary.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x00000000: END
+
+             0x00000001: PAGE_DATA
+
+             0x00000002: X86_PV_INFO
+
+             0x00000003: X86_PV_P2M_FRAMES
+
+             0x00000004: X86_PV_VCPU_BASIC
+
+             0x00000005: X86_PV_VCPU_EXTENDED
+
+             0x00000006: X86_PV_VCPU_XSAVE
+
+             0x00000007: SHARED_INFO
+
+             0x00000008: TSC_INFO
+
+             0x00000009: HVM_CONTEXT
+
+             0x0000000A: HVM_PARAMS
+
+             0x0000000B: TOOLSTACK
+
+             0x0000000C: X86_PV_VCPU_MSRS
+
+             0x0000000D - 0x7FFFFFFF: Reserved for future _mandatory_
+             records.
+
+             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
+             records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a multiple
+             of 8 octets.
+--------------------------------------------------------------------
+
+Records may be _mandatory_ or _optional_.  Optional records have bit
+31 set in their type.  Restoring an image that has unrecognized or
+unsupported mandatory records must fail.  The contents of optional
+records may be ignored during a restore.
+
+The following sub-sections specify the record body format for each of
+the record types.
+
+\clearpage
+
+END
+----
+
+An end record marks the end of the image.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The end record contains no fields; its body_length is 0.
+
+\clearpage
+
+PAGE_DATA
+---------
+
+The bulk of an image consists of many PAGE_DATA records containing the
+memory contents.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+    | page_data[0]...                                 |
+    ...
+    +-------------------------------------------------+
+    | page_data[N-1]...                               |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+count       Number of pages described in this record.
+
+pfn         An array of count PFNs and their types.
+
+            Bit 63-60: XEN\_DOMCTL\_PFINFO\_* type (from
+            `public/domctl.h` but shifted by 32 bits)
+
+            Bit 59-52: Reserved.
+
+            Bit 51-0: PFN.
+
+page\_data  page\_size octets of uncompressed page contents for each
+            page set as present in the pfn array.
+--------------------------------------------------------------------
+
+Note: Count is strictly > 0.  N is strictly <= C and it is possible for there
+to be no page_data in the record if all pfns are of invalid types.
+
+--------------------------------------------------------------------
+PFINFO type    Value      Description
+-------------  ---------  ------------------------------------------
+NOTAB          0x0        Normal page.
+
+L1TAB          0x1        L1 page table page.
+
+L2TAB          0x2        L2 page table page.
+
+L3TAB          0x3        L3 page table page.
+
+L4TAB          0x4        L4 page table page.
+
+               0x5-0x8    Reserved.
+
+L1TAB_PIN      0x9        L1 page table page (pinned).
+
+L2TAB_PIN      0xA        L2 page table page (pinned).
+
+L3TAB_PIN      0xB        L3 page table page (pinned).
+
+L4TAB_PIN      0xC        L4 page table page (pinned).
+
+BROKEN         0xD        Broken page.
+
+XALLOC         0xE        Allocate only.
+
+XTAB           0xF        Invalid page.
+--------------------------------------------------------------------
+
+Table: XEN\_DOMCTL\_PFINFO\_* Page Types.
+
+PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
+corresponding `page_data`.
+
+The saver uses the `XTAB` type for PFNs that become invalid in the
+guest's P2M table during a live migration[^2].
+
+Restoring an image with unrecognized page types shall fail.
+
+[^2]: In the legacy format, this is the list of unmapped PFNs in the
+tail.
+
+\clearpage
+
+X86_PV_INFO
+-----------
+
+     0     1     2     3     4     5     6     7 octet
+    +-----+-----+-----------+-------------------------+
+    | w   | ptl | (reserved)                          |
+    +-----+-----+-----------+-------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+guest_width (w)  Guest width in octets (either 4 or 8).
+
+pt_levels (ptl)  Number of page table levels (either 3 or 4).
+--------------------------------------------------------------------
+
+\clearpage
+
+X86_PV_P2M_FRAMES
+-----------------
+
+     0     1     2     3     4     5     6     7 octet
+    +-----+-----+-----+-----+-------------------------+
+    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
+    +-----+-----+-----+-----+-------------------------+
+    | p2m_pfn[p2m frame containing pfn S]             |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | p2m_pfn[p2m frame containing pfn E]             |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-------------    ---------------------------------------------------
+p2m_start_pfn    First pfn index in the p2m_pfn array.
+
+p2m_end_pfn      Last pfn index in the p2m_pfn array.
+
+p2m_pfn          Array of PFNs containing the guest's P2M table, for
+                 the PFN frames containing the PFN range S to E
+                 (inclusive).
+
+--------------------------------------------------------------------
+
+\clearpage
+
+X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
+----------------------------------------
+
+The format of these records is identical.  They are all binary blobs
+of data which are accessed using specific pairs of domctl hypercalls.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | vcpu_id               | (reserved)              |
+    +-----------------------+-------------------------+
+    | context...                                      |
+    ...
+    +-------------------------------------------------+
+
+---------------------------------------------------------------------
+Field            Description
+-----------      ----------------------------------------------------
+vcpu_id          The VCPU ID.
+
+context          Binary data for this VCPU.
+---------------------------------------------------------------------
+
+---------------------------------------------------------------------
+Record type                  Accessor hypercalls
+-----------------------      ----------------------------------------
+X86\_PV\_VCPU\_BASIC         XEN\_DOMCTL\_{get,set}vcpucontext
+
+X86\_PV\_VCPU\_EXTENDED      XEN\_DOMCTL\_{get,set}\_ext\_vcpucontext
+
+X86\_PV\_VCPU\_XSAVE         XEN\_DOMCTL\_{get,set}vcpuextstate
+
+X86\_PV\_VCPU\_MSRS          XEN\_DOMCTL\_{get,set}\_vcpu\_msrs
+---------------------------------------------------------------------
+
+\clearpage
+
+SHARED_INFO
+------------------
+
+The content of the Shared Info page.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | shared_info                                     |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+shared_info      Contents of the shared info page.  This record
+                 should be exactly 1 page long.
+--------------------------------------------------------------------
+
+\clearpage
+
+TSC_INFO
+--------
+
+Domain TSC information, as accessed by the
+XEN\_DOMCTL\_{get,set}tscinfo hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | mode                   | khz                    |
+    +------------------------+------------------------+
+    | nsec                                            |
+    +------------------------+------------------------+
+    | incarnation            | (reserved)             |
+    +------------------------+------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+mode             TSC mode, TSC\_MODE\_* constant.
+
+khz              TSC frequency, in kHz.
+
+nsec             Elapsed time, in nanoseconds.
+
+incarnation      Incarnation.
+--------------------------------------------------------------------
+
+\clearpage
+
+HVM_CONTEXT
+-----------
+
+HVM Domain context, as accessed by the
+XEN\_DOMCTL\_{get,set}hvmcontext hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | hvm_ctx                                         |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+hvm_ctx          The HVM Context blob from Xen.
+--------------------------------------------------------------------
+
+\clearpage
+
+HVM_PARAMS
+----------
+
+HVM Domain parameters, as accessed by the
+HVMOP\_{get,set}\_param hypercall sub-ops.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | count (C)              | (reserved)             |
+    +------------------------+------------------------+
+    | param[0].index                                  |
+    +-------------------------------------------------+
+    | param[0].value                                  |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | param[C-1].index                                |
+    +-------------------------------------------------+
+    | param[C-1].value                                |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+count            The number of parameters contained in this record.
+                 Each parameter in the record contains an index and
+                 value.
+
+param index      Parameter index.
+
+param value      Parameter value.
+--------------------------------------------------------------------
+
+\clearpage
+
+TOOLSTACK
+---------
+
+An opaque blob provided by and supplied to the higher layers of the
+toolstack (e.g., libxl) during save and restore.
+
+> This is only temporary -- the intention is that the toolstack takes
+> care of this itself.  This record is only present for early
+> development purposes and will be removed before submission, along
+> with changes to libxl which cause libxl to handle this data itself.
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | data                                            |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field            Description
+-----------      ---------------------------------------------------
+data             Blob of toolstack-specific data.
+--------------------------------------------------------------------
+
+\clearpage
+
+Layout
+======
+
+The set of valid records depends on the guest architecture and type.
+
+x86 PV Guest
+------------
+
+An x86 PV guest image will have this order of records:
+
+1. Image header
+2. Domain header
+3. X86\_PV\_INFO record
+4. Many PAGE\_DATA records
+5. TSC\_INFO
+6. X86\_PV\_P2M\_FRAMES record
+7. SHARED\_INFO record
+8. VCPU context records for each online VCPU
+    a. X86\_PV\_VCPU\_BASIC record
+    b. X86\_PV\_VCPU\_EXTENDED record
+    c. X86\_PV\_VCPU\_XSAVE record
+    d. X86\_PV\_VCPU\_MSRS record
+9. END record
+
+x86 HVM Guest
+-------------
+
+1. Image header
+2. Domain header
+3. Many PAGE\_DATA records
+4. TSC\_INFO
+5. HVM\_CONTEXT
+6. HVM\_PARAMS
+7. END record
+
+Legacy Images (x86 only)
+========================
+
+Restoring legacy images from older tools shall be handled by
+translating the legacy format image into this new format.
+
+It shall not be possible to save in the legacy format.
+
+There are two different legacy images depending on whether they were
+generated by a 32-bit or a 64-bit toolstack. These shall be
+distinguished by inspecting octets 4-7 in the image.  If these are
+zero then it is a 64-bit image.
+
+Toolstack  Field                            Value
+---------  -----                            -----
+64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
+32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
+32-bit     Chunk type (HVM)                 < 0
+32-bit     Page count (HVM)                 > 0
+
+Table: Possible values for octet 4-7 in legacy images
+
+This assumes the presence of the extended-info chunk which was
+introduced in Xen 3.0.
+
+
+Future Extensions
+=================
+
+All changes to the image or domain headers require the image version
+to be increased.
+
+The format may be extended by adding additional record types.
+
+Extending an existing record type must be done by adding a new record
+type.  This allows old images with the old record to still be
+restored.
+
+The image header may only be extended by _appending_ additional
+fields.  In particular, the `marker`, `id` and `version` fields must
+never change size or location.
-- 
1.7.10.4
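
As a quick illustration of the framing the spec above defines, the image
header and a record can be packed as follows.  This is a sketch only: the
constant names mirror those used by streamspec.py in the next patch, but the
struct format strings are my own reading of the field layout, not the
in-tree definitions.

```python
import struct

# Sketch of v2 stream framing per the spec: image header (24 octets,
# big-endian) and a record (header + body, zero-padded to 8 octets).

IHDR_MARKER  = 0xFFFFFFFFFFFFFFFF  # distinguishes v2 from legacy images
IHDR_IDENT   = 0x58454E46          # "XENF" in ASCII
IHDR_VERSION = 1
IHDR_OPT_LE  = 0                   # options bit 0 clear: little-endian

def pack_image_header():
    # marker (8 octets), id (4), version (4), options (2), reserved (6);
    # image header fields are always big-endian.
    return struct.pack(">QIIHHI", IHDR_MARKER, IHDR_IDENT,
                       IHDR_VERSION, IHDR_OPT_LE, 0, 0)

REC_TYPE_END = 0x00000000

def pack_record(rtype, body):
    # Record header: type (4 octets), body_length (4), then the body
    # padded with zeroes to the next 8-octet boundary.  Records use the
    # endianness declared in the image header (little-endian here).
    pad = (8 - (len(body) & 7)) & 7
    return struct.pack("<II", rtype, len(body)) + body + b"\x00" * pad

end_record = pack_record(REC_TYPE_END, b"")   # END: body_length is 0
```

Note how the marker works as a discriminator: a legacy image always has at
least one zero bit in its first 8 octets, whereas a v2 image starts with 64
set bits.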

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 02/14] scripts: Scripts for inspection/validation of legacy and new streams
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12  9:48   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

TODO: move to tools/python and install properly...

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/scripts/legacy.py     |  576 +++++++++++++++++++++++++
 tools/libxc/saverestore/scripts/streamspec.py |  136 ++++++
 tools/libxc/saverestore/scripts/verify.py     |  377 ++++++++++++++++
 3 files changed, 1089 insertions(+)
 create mode 100755 tools/libxc/saverestore/scripts/legacy.py
 create mode 100644 tools/libxc/saverestore/scripts/streamspec.py
 create mode 100755 tools/libxc/saverestore/scripts/verify.py

diff --git a/tools/libxc/saverestore/scripts/legacy.py b/tools/libxc/saverestore/scripts/legacy.py
new file mode 100755
index 0000000..a1b0070
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/legacy.py
@@ -0,0 +1,576 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import sys
+import struct
+import streamspec
+import os
+
+__version__ = 1
+
+fin = None     # Input file/fd
+fout = None    # Output file/fd
+twidth = 0     # Legacy toolstack bitness (32 or 64)
+pv = None      # Boolean (pv or hvm)
+
+def stream_read(_ = None):
+    return fin.read(_)
+
+def stream_write(_):
+    return fout.write(_)
+
+class StreamError(StandardError):
+    pass
+
+class VM(object):
+
+    def __init__(self):
+        # Common
+        self.p2m_size = 0
+
+        # PV
+        self.max_vcpu_id = 0
+        self.online_vcpu_map = []
+        self.width = 0
+        self.levels = 0
+        self.basic_len = 0
+        self.extd = False
+        self.xsave_len = 0
+
+def write_ihdr():
+    stream_write(struct.pack(streamspec.IHDR_FORMAT,
+                             streamspec.IHDR_MARKER, # Marker
+                             streamspec.IHDR_IDENT,  # Ident
+                             1,                      # Version
+                             streamspec.IHDR_OPT_LE, # Options
+                             0, 0))                  # Reserved
+
+def write_dhdr():
+    if pv:
+        dtype = streamspec.DHDR_TYPE_x86_pv
+    else:
+        dtype = streamspec.DHDR_TYPE_x86_hvm
+
+    stream_write(struct.pack(streamspec.DHDR_FORMAT,
+                             dtype,        # Type
+                             12,           # Page size
+                             0,            # Reserved
+                             0,            # Xen major (converted)
+                             __version__)) # Xen minor (converted)
+
+def write_record(rt, *argl):
+    alldata = ''.join(argl)
+    length = len(alldata)
+
+    record = struct.pack(streamspec.RH_FORMAT, rt, length) + alldata
+    plen = (8 - (length & 7)) & 7
+    record += '\x00' * plen
+
+    stream_write(record)
+
+def write_pv_info(vm):
+    write_record(streamspec.REC_TYPE_x86_pv_info,
+                 struct.pack(streamspec.X86_PV_INFO_FORMAT,
+                             vm.width, vm.levels, 0, 0))
+
+def write_pv_p2m_frames(vm, pfns):
+    write_record(streamspec.REC_TYPE_x86_pv_p2m_frames,
+                 struct.pack(streamspec.X86_PV_P2M_FRAMES_FORMAT,
+                             0, vm.p2m_size - 1),
+                 struct.pack("Q" * len(pfns), *pfns))
+
+def write_pv_vcpu_basic(vcpu_id, data):
+    write_record(streamspec.REC_TYPE_x86_pv_vcpu_basic,
+                 struct.pack(streamspec.X86_PV_VCPU_FORMAT_HDR, vcpu_id, 0),
+                 data)
+
+def write_pv_vcpu_extd(vcpu_id, data):
+    write_record(streamspec.REC_TYPE_x86_pv_vcpu_extended,
+                 struct.pack(streamspec.X86_PV_VCPU_FORMAT_HDR, vcpu_id, 0),
+                 data)
+
+def write_pv_vcpu_xsave(vcpu_id, data):
+    write_record(streamspec.REC_TYPE_x86_pv_vcpu_xsave,
+                 struct.pack(streamspec.X86_PV_VCPU_FORMAT_HDR, vcpu_id, 0),
+                 data)
+
+def write_page_data(pfns, pages):
+    if fout is None: # Save copying 1M buffers around for no reason
+        return
+
+    new_pfns = [(((x & 0xf0000000) << 32) | (x & 0x0fffffff)) for x in pfns]
+
+    # Optimise the needless buffer copying in write_record()
+    stream_write(struct.pack(streamspec.RH_FORMAT,
+                             streamspec.REC_TYPE_page_data,
+                             8 + (len(new_pfns) * 8) + len(pages)))
+    stream_write(struct.pack(streamspec.PAGE_DATA_FORMAT, len(new_pfns), 0))
+    stream_write(struct.pack("Q" * len(new_pfns), *new_pfns))
+    stream_write(pages)
+
+def write_tsc_info(mode, khz, nsec, incarn):
+    write_record(streamspec.REC_TYPE_tsc_info,
+                 struct.pack(streamspec.TSC_INFO_FORMAT,
+                             mode, khz, nsec, incarn, 0))
+
+def write_hvm_params(params):
+    if pv:
+        raise StreamError("HVM-only param in PV stream")
+    elif len(params) % 2:
+        raise RuntimeError("Expected even length list of hvm parameters")
+
+    write_record(streamspec.REC_TYPE_hvm_params,
+                 struct.pack(streamspec.HVM_PARAMS_FORMAT, len(params) / 2, 0),
+                 struct.pack("Q" * len(params), *params))
+
+
+def rdexact(nr_bytes):
+    """Read exactly nr_bytes from fin"""
+    _ = stream_read(nr_bytes)
+    if len(_) != nr_bytes:
+        raise IOError("Stream truncated")
+    return _
+
+def unpack_exact(fmt):
+    """Unpack a format from fin"""
+    sz = struct.calcsize(fmt)
+    return struct.unpack(fmt, rdexact(sz))
+
+def unpack_ulongs(nr_ulongs):
+    if twidth == 32:
+        return unpack_exact("I" * nr_ulongs)
+    else:
+        return unpack_exact("Q" * nr_ulongs)
+
+def skip_xl_header():
+
+    hdr = rdexact(32)
+    if hdr != "Xen saved domain, xl format\n \0 \r":
+        raise StreamError("No xl header")
+
+    opts = rdexact(16)
+    _, _, _, optlen = struct.unpack("=IIII", opts)
+
+    optdata = rdexact(optlen)
+
+    print "Skipped xl header"
+
+    stream_write(hdr)
+    stream_write(opts)
+    stream_write(optdata)
+
+def read_pv_extended_info(vm):
+
+    marker, = unpack_ulongs(1)
+
+    if twidth == 32:
+        expected = 0xffffffff
+    else:
+        expected = 0xffffffffffffffff
+
+    if marker != expected:
+        raise StreamError("Unexpected extended info marker 0x%x" % (marker, ))
+
+    total_length, = unpack_exact("I")
+    so_far = 0
+
+    print "Extended Info: length 0x%x" % (total_length, )
+
+    while so_far < total_length:
+
+        blkid, datasz = unpack_exact("=4sI")
+        so_far += 8
+
+        print "  Record type: %s, size 0x%x" % (blkid, datasz)
+
+        data = rdexact(datasz)
+        so_far += datasz
+
+        # Eww, but this is how it is done :(
+        if blkid == "vcpu":
+
+            vm.basic_len = datasz
+
+            if datasz == 0x1430:
+                vm.width = 8
+                vm.levels = 4
+                print "    64bit domain, 4 levels"
+            elif datasz == 0xaf0:
+                vm.width = 4
+                vm.levels = 3
+                print "    32bit domain, 3 levels"
+            else:
+                raise StreamError("Unable to determine guest width/level")
+
+            write_pv_info(vm)
+
+        elif blkid == "extv":
+            vm.extd = True
+
+        elif blkid == "xcnt":
+            vm.xsave_len, = struct.unpack_from("I", data, 0)
+            print "    xcnt sz 0x%x" % (vm.xsave_len, )
+
+        else:
+            raise StreamError("Unrecognised extended block '%s'" % (blkid, ))
+
+
+    if so_far != total_length:
+        raise StreamError("Overshot Extended Info size by %d bytes"
+                          % (so_far - total_length,))
+
+def read_pv_p2m_frames(vm):
+    fpp = 4096 / vm.width
+    p2m_frame_len = (vm.p2m_size - 1) / fpp + 1
+
+    print "P2M frames: fpp %d, p2m_frame_len %d" % (fpp, p2m_frame_len)
+    write_pv_p2m_frames(vm, unpack_ulongs(p2m_frame_len))
+
+def read_pv_tail(vm):
+
+    nr_unmapped_pfns, = unpack_exact("I")
+
+    if nr_unmapped_pfns != 0:
+        # "Unmapped" pfns are bogus
+        _ = unpack_ulongs(nr_unmapped_pfns)
+        print "discarding %d bogus 'unmapped pfns'" % (nr_unmapped_pfns, )
+        #raise StreamError("Found bogus 'unmapped pfns'")
+
+    for vcpu_id in vm.online_vcpu_map:
+
+        basic = rdexact(vm.basic_len)
+        print "Got VCPU basic (size 0x%x)" % (vm.basic_len, ),
+        write_pv_vcpu_basic(vcpu_id, basic)
+
+        if vm.extd:
+            extd = rdexact(128)
+            print "extd (size 0x%x)" % (128, ),
+            write_pv_vcpu_extd(vcpu_id, extd)
+
+        if vm.xsave_len:
+            mask, size = unpack_exact("QQ")
+            assert vm.xsave_len - 16 == size
+
+            xsave = rdexact(size)
+            print "xsave (mask 0x%x, size 0x%x)" % (mask, size),
+            write_pv_vcpu_xsave(vcpu_id, xsave)
+        print ""
+
+    shinfo = rdexact(4096)
+    print "Got shinfo"
+
+    write_record(streamspec.REC_TYPE_shared_info, shinfo)
+    write_record(streamspec.REC_TYPE_end, "")
+
+
+def read_chunks(vm):
+
+    hvm_params = []
+
+    while True:
+
+        marker, = unpack_exact("=i")
+        if marker <= 0:
+            print "Chunk: type 0x%x" % (marker, )
+
+        if marker == 0:
+            print "  End"
+
+            if hvm_params:
+                write_hvm_params(hvm_params)
+
+            return
+
+        elif marker > 0:
+
+            if marker > 1024:
+                raise StreamError("Page batch (%d) exceeded MAX_BATCH"
+                                  % (marker, ))
+            pfns = unpack_ulongs(marker)
+
+            # xc_domain_save() leaves many XEN_DOMCTL_PFINFO_XTAB records for
+            # sequences of pfns it can't map.  Drop these.
+            pfns = [ x for x in pfns if x != 0xf0000000 ]
+
+            if len(set(pfns)) != len(pfns):
+                raise StreamError("Duplicate pfns in batch")
+
+                # print "0x[",
+                # for pfn in pfns:
+                #     print "%x" % (pfn, ),
+                # print "]"
+
+            nr_pages = len([x for x in pfns if (x & 0xf0000000) < 0xd0000000])
+
+            #print "  Page Batch, %d PFNs, %d pages" % (marker, nr_pages)
+            pages = rdexact(nr_pages * 4096)
+
+            write_page_data(pfns, pages)
+
+        elif marker == -1: # XC_SAVE_ID_ENABLE_VERIFY_MODE
+            # Verify mode... Seemingly nothing to do...
+            pass
+
+        elif marker == -2: # XC_SAVE_ID_VCPU_INFO
+            max_id, = unpack_exact("i")
+
+            if max_id > 4095:
+                raise StreamError("Vcpu max_id out of range: %d > 4095"
+                                  % (max_id, ) )
+
+            vm.max_vcpu_id = max_id
+            bitmap = unpack_exact("Q" * ((max_id/64) + 1))
+
+            for idx, word in enumerate(bitmap):
+                bit_idx = 0
+
+                while word > 0:
+                    if word & 1:
+                        vm.online_vcpu_map.append((idx * 64) + bit_idx)
+
+                    bit_idx += 1
+                    word >>= 1
+
+            print "  Vcpu info: max_id %d, online map %s" % (vm.max_vcpu_id,
+                                                             vm.online_vcpu_map)
+
+        elif marker == -3: # XC_SAVE_ID_HVM_IDENT_PT
+            _, ident_pt = unpack_exact("=IQ")
+            print "  EPT Identity Pagetable 0x%x" % (ident_pt, )
+            hvm_params.extend([12, # HVM_PARAM_IDENT_PT
+                               ident_pt])
+
+        elif marker == -4: # XC_SAVE_ID_HVM_VM86_TSS
+            _, vm86_tss = unpack_exact("=IQ")
+            print "  VM86 TSS: 0x%x" % (vm86_tss, )
+            hvm_params.extend([15, # HVM_PARAM_VM86_TSS
+                               vm86_tss])
+
+        elif marker == -5: # XC_SAVE_ID_TMEM
+            raise RuntimeError("todo")
+
+        elif marker == -6: # XC_SAVE_ID_TMEM_EXTRA
+            raise RuntimeError("todo")
+
+        elif marker == -7: # XC_SAVE_ID_TSC_INFO
+            mode, nsec, khz, incarn = unpack_exact("=IQII")
+            print ("  TSC_INFO: mode %s, %d ns, %d khz, %d incarn"
+                   % (mode, nsec, khz, incarn))
+            write_tsc_info(mode, khz, nsec, incarn)
+
+        elif marker == -8: # XC_SAVE_ID_HVM_CONSOLE_PFN
+            _, console_pfn = unpack_exact("=IQ")
+            print "  Console pfn 0x%x" % (console_pfn, )
+            hvm_params.extend([17, # HVM_PARAM_CONSOLE_PFN
+                               console_pfn])
+
+        elif marker == -9: # XC_SAVE_ID_LAST_CHECKPOINT
+            print "  Last Checkpoint"
+            # Nothing to do
+
+        elif marker == -10: # XC_SAVE_ID_HVM_ACPI_IOPORTS_LOCATION
+            _, loc = unpack_exact("=IQ")
+            print "  ACPI ioport location 0x%x" % (loc, )
+            hvm_params.extend([19, # HVM_PARAM_ACPI_IOPORTS_LOCATION
+                               loc])
+
+        elif marker == -11: # XC_SAVE_ID_HVM_VIRIDIAN
+            _, loc = unpack_exact("=IQ")
+            print "  Viridian location 0x%x" % (loc, )
+            hvm_params.extend([9, # HVM_PARAM_VIRIDIAN
+                               loc])
+
+        elif marker == -12: # XC_SAVE_ID_COMPRESSED_DATA
+            sz, = unpack_exact("I")
+            data = rdexact(sz)
+            print "  Compressed Data: sz 0x%x" % (sz, )
+            raise RuntimeError("todo")
+
+        elif marker == -13: # XC_SAVE_ID_ENABLE_COMPRESSION
+            raise RuntimeError("todo")
+
+        elif marker == -14: # XC_SAVE_ID_HVM_GENERATION_ID_ADDR
+            _, genid_loc = unpack_exact("=IQ")
+            print "  Generation ID Address 0x%x" % (genid_loc, )
+            hvm_params.extend([32, # HVM_PARAM_VM_GENERATION_ID_ADDR
+                               genid_loc])
+
+        elif marker == -15: # XC_SAVE_ID_HVM_PAGING_RING_PFN
+            _, paging_ring_pfn = unpack_exact("=IQ")
+            print "  Paging ring pfn 0x%x" % (paging_ring_pfn, )
+            hvm_params.extend([27, # HVM_PARAM_PAGING_RING_PFN
+                               paging_ring_pfn])
+
+        elif marker == -16: # XC_SAVE_ID_HVM_ACCESS_RING_PFN
+            _, access_ring_pfn = unpack_exact("=IQ")
+            print "  Access ring pfn 0x%x" % (access_ring_pfn, )
+            hvm_params.extend([28, # HVM_PARAM_ACCESS_RING_PFN
+                               access_ring_pfn])
+
+        elif marker == -17: # XC_SAVE_ID_HVM_SHARING_RING_PFN
+            _, sharing_ring_pfn = unpack_exact("=IQ")
+            print "  Sharing ring pfn 0x%x" % (sharing_ring_pfn, )
+            hvm_params.extend([29, # HVM_PARAM_SHARING_RING_PFN
+                               sharing_ring_pfn])
+
+        elif marker == -18: # XC_SAVE_ID_TOOLSTACK
+            sz, = unpack_exact("I")
+            data = rdexact(sz)
+            print "  Toolstack Data: sz 0x%x" % (sz, )
+            print >> sys.stderr, "TODO - fix libxl's use of this"
+            write_record(streamspec.REC_TYPE_toolstack, data)
+
+        else:
+            raise StreamError("Unrecognised chunk %d" % (marker, ))
+
+def read_hvm_tail(vm):
+
+    io, bufio, store = unpack_exact("QQQ")
+    print "Magic pfns: 0x%x 0x%x 0x%x" % (io, bufio, store)
+    write_hvm_params([5, io,     # HVM_PARAM_IOREQ_PFN
+                      6, bufio,  # HVM_PARAM_BUFIOREQ_PFN
+                      1, store]) # HVM_PARAM_STORE_PFN
+
+    blobsz, = unpack_exact("I")
+    print "Got HVM Context (0x%x bytes)" % (blobsz, )
+    blob = rdexact(blobsz)
+
+    write_record(streamspec.REC_TYPE_hvm_context, blob)
+    write_record(streamspec.REC_TYPE_end, "")
+
+
+
+def read_qemu(vm):
+
+    rawsig = rdexact(21)
+    sig, = struct.unpack("21s", rawsig)
+    print "Qemu signature: %s" % (sig, )
+
+    if sig == "DeviceModelRecord0002":
+        rawsz = rdexact(4)
+        sz, = struct.unpack("I", rawsz)
+        qdata = rdexact(sz)
+
+        stream_write(rawsig)
+        stream_write(rawsz)
+        stream_write(qdata)
+
+    else:
+        raise RuntimeError("Unrecognised Qemu sig '%s'" % (sig, ))
+
+
+def read_vm(vm):
+
+    try:
+
+        vm.p2m_size, = unpack_ulongs(1)
+        print "P2M Size: 0x%x" % (vm.p2m_size,)
+
+        write_ihdr()
+        write_dhdr()
+
+        if pv:
+            read_pv_extended_info(vm)
+            read_pv_p2m_frames(vm)
+
+        read_chunks(vm)
+
+        if pv:
+            read_pv_tail(vm)
+        else:
+            read_hvm_tail(vm)
+            read_qemu(vm)
+
+    except (IOError, StreamError, ) as e:
+        print >> sys.stderr, "Error: ", e
+        return 1
+
+    except RuntimeError as e:
+        print >> sys.stderr, "Script error", e
+        print >> sys.stderr, "Please fix me"
+        return 2
+    return 0
+
+def open_file_or_fd(val, mode):
+    """
+    If 'val' looks like a decimal integer, open it as an fd.  If not, try to
+    open it as a regular file.
+    """
+
+    fd = -1
+    try:
+        # Does it look like an integer?
+        try:
+            fd = int(val, 10)
+        except ValueError:
+            pass
+
+        # Try to open it...
+        if fd != -1:
+            return os.fdopen(fd, mode)
+        else:
+            return open(val, mode)
+
+    except StandardError, e:
+        if fd != -1:
+            print >> sys.stderr, "Unable to open fd %d: %s" % (fd, e)
+        else:
+            print >> sys.stderr, "Unable to open file '%s': %s" % (val, e)
+
+    raise SystemExit(1)
+
+
+def main(argv):
+    from optparse import OptionParser
+    global fin, fout, twidth, pv
+
+    # Change stdout to be line-buffered.
+    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 1)
+
+    parser = OptionParser(version = __version__,
+                          usage = ("%prog -i INPUT -o OUTPUT"
+                                   " -w WIDTH -g GUEST [-x]"))
+
+    parser.add_option("-v", action = "store_true", default = False,
+                      help = "More verbose")
+
+    # Required options
+    parser.add_option("-i", "--in", dest = "fin", metavar = "<FD or FILE>",
+                      help = "Legacy input to convert")
+    parser.add_option("-o", "--out", dest = "fout", metavar = "<FD or FILE>",
+                      help = "v2 format output")
+    parser.add_option("-w", "--width", dest = "twidth",
+                      metavar = "<32/64>", choices = ["32", "64"],
+                      help = "Legacy toolstack bitness")
+    parser.add_option("-g", "--guest-type", dest = "gtype",
+                      metavar = "<pv/hvm>", choices = ["pv", "hvm"],
+                      help = "Type of guest in stream")
+
+    # Optional options
+    parser.add_option("-x", "--xl", action = "store_true", default = False,
+                      help = ("Is an `xl` header present in the stream?"
+                              " (default no)"))
+
+    opts, args = parser.parse_args()
+
+    if (opts.fin is None or opts.fout is None or
+        opts.twidth is None or opts.gtype is None):
+
+        parser.print_help(sys.stderr)
+        raise SystemExit(1)
+
+    fin    = open_file_or_fd(opts.fin,  "rb")
+    fout   = open_file_or_fd(opts.fout, "wb")
+    twidth = int(opts.twidth)
+    pv     = opts.gtype == "pv"
+
+    if opts.xl:
+        skip_xl_header()
+
+    rc = read_vm(VM())
+    fout.close()
+
+    return rc
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv))
diff --git a/tools/libxc/saverestore/scripts/streamspec.py b/tools/libxc/saverestore/scripts/streamspec.py
new file mode 100644
index 0000000..9f97b46
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/streamspec.py
@@ -0,0 +1,136 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# Python structures for the Migration v2 stream format.
+# See docs/specs/libxc-migration-stream.pandoc
+
+# Image Header
+IHDR_FORMAT = "!QIIHHI"
+
+IHDR_MARKER = 0xffffffffffffffff
+IHDR_IDENT  = 0x58454E46 # "XENF" in ASCII
+
+IHDR_OPT_ENDIAN_ = 0
+IHDR_OPT_LE = (0 << IHDR_OPT_ENDIAN_)
+IHDR_OPT_BE = (1 << IHDR_OPT_ENDIAN_)
+
+IHDR_OPT_RESZ_MASK = 0xfffe
+
+# Domain Header
+DHDR_FORMAT = "IHHII"
+
+DHDR_TYPE_x86_pv  = 0x00000001
+DHDR_TYPE_x86_hvm = 0x00000002
+DHDR_TYPE_x86_pvh = 0x00000003
+DHDR_TYPE_arm     = 0x00000004
+
+dhdr_type_to_str = {
+    DHDR_TYPE_x86_pv  : "x86 PV",
+    DHDR_TYPE_x86_hvm : "x86 HVM",
+    DHDR_TYPE_x86_pvh : "x86 PVH",
+    DHDR_TYPE_arm     : "ARM",
+}
+
+RH_FORMAT = "II"
+
+REC_TYPE_end                  = 0x00000000
+REC_TYPE_page_data            = 0x00000001
+REC_TYPE_x86_pv_info          = 0x00000002
+REC_TYPE_x86_pv_p2m_frames    = 0x00000003
+REC_TYPE_x86_pv_vcpu_basic    = 0x00000004
+REC_TYPE_x86_pv_vcpu_extended = 0x00000005
+REC_TYPE_x86_pv_vcpu_xsave    = 0x00000006
+REC_TYPE_shared_info          = 0x00000007
+REC_TYPE_tsc_info             = 0x00000008
+REC_TYPE_hvm_context          = 0x00000009
+REC_TYPE_hvm_params           = 0x0000000a
+REC_TYPE_toolstack            = 0x0000000b
+REC_TYPE_x86_pv_vcpu_msrs     = 0x0000000c
+
+rec_type_to_str = {
+    REC_TYPE_end                  : "End",
+    REC_TYPE_page_data            : "Page data",
+    REC_TYPE_x86_pv_info          : "x86 PV info",
+    REC_TYPE_x86_pv_p2m_frames    : "x86 PV P2M frames",
+    REC_TYPE_x86_pv_vcpu_basic    : "x86 PV vcpu basic",
+    REC_TYPE_x86_pv_vcpu_extended : "x86 PV vcpu extended",
+    REC_TYPE_x86_pv_vcpu_xsave    : "x86 PV vcpu xsave",
+    REC_TYPE_shared_info          : "Shared info",
+    REC_TYPE_tsc_info             : "TSC info",
+    REC_TYPE_hvm_context          : "HVM context",
+    REC_TYPE_hvm_params           : "HVM params",
+    REC_TYPE_toolstack            : "Toolstack",
+    REC_TYPE_x86_pv_vcpu_msrs     : "x86 PV vcpu msrs",
+}
+
+# page_data
+PAGE_DATA_FORMAT             = "II"
+PAGE_DATA_PFN_MASK           = (1L << 52) - 1
+PAGE_DATA_PFN_RESZ_MASK      = ((1L << 60) - 1) & ~((1L << 52) - 1)
+
+# flags from xen/public/domctl.h: XEN_DOMCTL_PFINFO_* shifted by 32 bits
+PAGE_DATA_TYPE_SHIFT         = 60
+PAGE_DATA_TYPE_LTABTYPE_MASK = (0x7L << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LTAB_MASK     = (0xfL << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LPINTAB       = (0x8L << PAGE_DATA_TYPE_SHIFT) # Pinned pagetable
+
+PAGE_DATA_TYPE_NOTAB         = (0x0L << PAGE_DATA_TYPE_SHIFT) # Regular page
+PAGE_DATA_TYPE_L1TAB         = (0x1L << PAGE_DATA_TYPE_SHIFT) # L1 pagetable
+PAGE_DATA_TYPE_L2TAB         = (0x2L << PAGE_DATA_TYPE_SHIFT) # L2 pagetable
+PAGE_DATA_TYPE_L3TAB         = (0x3L << PAGE_DATA_TYPE_SHIFT) # L3 pagetable
+PAGE_DATA_TYPE_L4TAB         = (0x4L << PAGE_DATA_TYPE_SHIFT) # L4 pagetable
+PAGE_DATA_TYPE_BROKEN        = (0xdL << PAGE_DATA_TYPE_SHIFT) # Broken
+PAGE_DATA_TYPE_XALLOC        = (0xeL << PAGE_DATA_TYPE_SHIFT) # Allocate-only
+PAGE_DATA_TYPE_XTAB          = (0xfL << PAGE_DATA_TYPE_SHIFT) # Invalid
+
+# x86_pv_info
+X86_PV_INFO_FORMAT        = "BBHI"
+
+X86_PV_P2M_FRAMES_FORMAT  = "II"
+
+# x86_pv_vcpu_{basic,extended,xsave,msrs}
+X86_PV_VCPU_FORMAT_HDR    = "II"
+
+# tsc_info
+TSC_INFO_FORMAT           = "IIQII"
+
+# hvm_params
+HVM_PARAMS_ENTRY_FORMAT   = "QQ"
+HVM_PARAMS_FORMAT         = "II"
+
+#
+# libxl format
+#
+
+LIBXL_QEMU_SIGNATURE = "DeviceModelRecord0002"
+LIBXL_QEMU_RECORD_HDR = "=%dsI" % (len(LIBXL_QEMU_SIGNATURE), )
+
+
+# If run as a python script alone, confirm some expected sizes
+if __name__ == "__main__":
+    import sys
+    from struct import calcsize
+
+    ok = True
+    for fmt, sz in [ ("IHDR_FORMAT", 24),
+                     ("DHDR_FORMAT", 16),
+                     ("RH_FORMAT", 8),
+
+                     ("PAGE_DATA_FORMAT", 8),
+                     ("X86_PV_INFO_FORMAT", 8),
+                     ("X86_PV_P2M_FRAMES_FORMAT", 8),
+                     ("X86_PV_VCPU_FORMAT_HDR", 8),
+                     ("TSC_INFO_FORMAT", 24),
+                     ("HVM_PARAMS_ENTRY_FORMAT", 16),
+                     ("HVM_PARAMS_FORMAT", 8),
+                     ]:
+
+        realsz = calcsize(getattr(sys.modules[__name__], fmt))
+        if realsz != sz:
+            print "%s is %d bytes but expected %d" % (fmt, realsz, sz)
+            ok = False
+
+    if ok:
+        sys.exit(0)
+    else:
+        sys.exit(1)
diff --git a/tools/libxc/saverestore/scripts/verify.py b/tools/libxc/saverestore/scripts/verify.py
new file mode 100755
index 0000000..883cc18
--- /dev/null
+++ b/tools/libxc/saverestore/scripts/verify.py
@@ -0,0 +1,377 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import sys
+import struct
+
+from streamspec import *
+
+class StreamError(StandardError):
+    pass
+
+class RecordError(StandardError):
+    pass
+
+def skip_xl_header(stream):
+
+    # verify_ihdr() has already consumed the first 24 bytes of the 32 byte
+    # xl signature, leaving "mat\n \0 \r" in the stream.
+    magic = stream.read(8)
+    if magic != "mat\n \0 \r":
+        return False
+
+    header = stream.read(16)
+    if len(header) != 16:
+        return False
+
+    _, _, _, optlen = struct.unpack("=IIII", header)
+
+    optdata = stream.read(optlen)
+    if len(optdata) != optlen:
+        return False
+
+    return True
+
+
+def verify_ihdr(stream):
+    """ Verify an image header """
+
+    datasz = struct.calcsize(IHDR_FORMAT)
+    data = stream.read(datasz)
+
+    # xl header record?
+    if data == "Xen saved domain, xl for":
+        if skip_xl_header(stream):
+            data = stream.read(datasz)
+        else:
+            raise StreamError("Invalid looking xl header on the stream")
+
+    if len(data) != datasz:
+        raise IOError("Truncated stream")
+
+    marker, id, version, options, res1, res2 = struct.unpack(IHDR_FORMAT, data)
+
+    if marker != 0xffffffffffffffff:
+        raise StreamError("Bad image marker: Expected 0xffffffffffffffff, "
+                          "got 0x%x" % (marker, ))
+
+    if id != 0x58454e46:
+        raise StreamError("Bad image id: Expected 0x58454e46, got 0x%x"
+                          % (id, ))
+
+    if version != 1:
+        raise StreamError("Unknown image version: Expected 1, got %d"
+                          % (version, ))
+
+    if options & IHDR_OPT_RESZ_MASK:
+        raise StreamError("Reserved bits set in image options field: 0x%x"
+                          % (options & IHDR_OPT_RESZ_MASK))
+
+    if res1 != 0 or res2 != 0:
+        raise StreamError("Reserved bits set in image header: 0x%04x:0x%08x"
+                          % (res1, res2))
+
+    # IHDR_OPT_ENDIAN_ is the bit position (0), so mask with the flag
+    # IHDR_OPT_BE itself rather than the shift count.
+    if ( sys.byteorder == "little" and
+         (options & IHDR_OPT_BE) != IHDR_OPT_LE ):
+        raise StreamError("Stream is not native endianness - unable to validate")
+
+    print "Valid Image Header:",
+    if options & IHDR_OPT_BE:
+        print "big endian"
+    else:
+        print "little endian"
+
+def verify_dhdr(stream):
+    """ Verify a domain header """
+
+    datasz = struct.calcsize(DHDR_FORMAT)
+    data = stream.read(datasz)
+
+    if len(data) != datasz:
+        raise IOError("Truncated stream")
+
+    type, page_shift, res1, major, minor = struct.unpack(DHDR_FORMAT, data)
+
+    if type not in dhdr_type_to_str:
+        raise StreamError("Unrecognised domain type 0x%x" % (type, ))
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in domain header 0x%04x"
+                          % (res1, ))
+
+    if page_shift != 12:
+        raise StreamError("Page shift expected to be 12.  Got %d"
+                          % (page_shift, ))
+
+    print "Valid Domain Header: %s from Xen %d.%d (page sz %d)" \
+        % (dhdr_type_to_str[type], major, minor, 2**page_shift)
+
+
+def verify_record_end(content):
+
+    if len(content) != 0:
+        raise RecordError("End record with non-zero length")
+
+def verify_page_data(content):
+    minsz = struct.calcsize(PAGE_DATA_FORMAT)
+
+    if len(content) <= minsz:
+        raise RecordError("PAGE_DATA record must be at least %d bytes long"
+                          % (minsz, ))
+
+    count, res1 = struct.unpack_from(PAGE_DATA_FORMAT, content)
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in PAGE_DATA record 0x%04x"
+                          % (res1, ))
+
+    pfnsz = count * 8
+    if (len(content) - minsz) < pfnsz:
+        raise RecordError("PAGE_DATA record must contain an 8-byte pfn entry "
+                          "for each of its %d pfns" % (count, ))
+
+    pfns = list(struct.unpack_from("=%dQ" % (count,), content, minsz))
+
+    nr_pages = 0
+    for idx, pfn in enumerate(pfns):
+
+        if pfn & PAGE_DATA_PFN_RESZ_MASK:
+            raise RecordError("Reserved bits set in pfn[%d]: 0x%016x"
+                              % (idx, pfn & PAGE_DATA_PFN_RESZ_MASK))
+
+        if pfn >> PAGE_DATA_TYPE_SHIFT in (5, 6, 7, 8):
+            raise RecordError("Invalid type value in pfn[%d]: 0x%016x"
+                              % (idx, pfn & PAGE_DATA_TYPE_LTAB_MASK))
+
+        # We expect page data for each normal page or pagetable
+        if (PAGE_DATA_TYPE_NOTAB <= (pfn & PAGE_DATA_TYPE_LTABTYPE_MASK)
+                <= PAGE_DATA_TYPE_L4TAB):
+            nr_pages += 1
+
+    pagesz = nr_pages * 4096
+    if len(content) != minsz + pfnsz + pagesz:
+        raise RecordError("Expected %u + %u + %u, got %u" % (minsz, pfnsz, pagesz, len(content)))
+
+
+def verify_record_x86_pv_vcpu_generic(content, name):
+    # Generic for all REC_TYPE_x86_pv_vcpu_{basic,extended,xsave,msrs}
+    minsz = struct.calcsize(X86_PV_VCPU_FORMAT_HDR)
+
+    if len(content) <= minsz:
+        raise RecordError("X86_PV_VCPU_%s record must be at least %d bytes"
+                          " long" % (name, minsz))
+
+    vcpuid, res1 = struct.unpack_from(X86_PV_VCPU_FORMAT_HDR, content)
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in x86_pv_vcpu_%s record 0x%04x"
+                          % (name, res1))
+
+    print "  vcpu%d %s context, %d bytes" % (vcpuid, name, len(content) - minsz)
+
+
+def verify_x86_pv_info(content):
+
+    expectedsz = struct.calcsize(X86_PV_INFO_FORMAT)
+    if len(content) != expectedsz:
+        raise RecordError("x86_pv_info: expected length of %d, got %d"
+                          % (expectedsz, len(content)))
+
+    width, levels, res1, res2 = struct.unpack(X86_PV_INFO_FORMAT, content)
+
+    if width not in (4, 8):
+        raise RecordError("Expected width of 4 or 8, got %d" % (width, ))
+
+    if levels not in (3, 4):
+        raise RecordError("Expected levels of 3 or 4, got %d" % (levels, ))
+
+    if res1 != 0 or res2 != 0:
+        raise StreamError("Reserved bits set in X86_PV_INFO: 0x%04x 0x%08x"
+                          % (res1, res2))
+
+    bitness = {4:32, 8:64}[width]
+
+    print "  %sbit guest, %d levels of pagetables" % (bitness, levels)
+
+def verify_x86_pv_p2m_frames(content):
+
+    if len(content) % 8 != 0:
+        raise RecordError("Length expected to be a multiple of 8, not %d"
+                          % (len(content), ))
+
+    start, end = struct.unpack_from("=II", content)
+
+    print "  Start pfn 0x%x, End 0x%x" % (start, end)
+
+def verify_record_shared_info(content):
+
+    if len(content) != 4096:
+        raise RecordError("Length expected to be 4096 bytes, not %d"
+                          % (len(content), ))
+
+def verify_record_tsc_info(content):
+
+    sz = struct.calcsize(TSC_INFO_FORMAT)
+
+    if len(content) != sz:
+        raise RecordError("Length should be %u bytes" % (sz, ))
+
+    mode, khz, nsec, incarn, res1 = struct.unpack(TSC_INFO_FORMAT, content)
+
+    if res1 != 0:
+        raise StreamError("Reserved bits set in TSC_INFO: 0x%08x" % (res1, ))
+
+    print ("  Mode %u, %u kHz, %u ns, incarnation %d"
+           % (mode, khz, nsec, incarn))
+
+def verify_record_hvm_context(content):
+
+    if len(content) == 0:
+        raise RecordError("Zero length HVM context")
+
+def verify_record_hvm_params(content):
+
+    sz = struct.calcsize(HVM_PARAMS_FORMAT)
+
+    if len(content) < sz:
+        raise RecordError("Length should be at least %u bytes" % (sz, ))
+
+    count, rsvd = struct.unpack(HVM_PARAMS_FORMAT, content[:sz])
+
+    if rsvd != 0:
+        raise RecordError("Reserved field not zero (0x%04x)" % (rsvd, ))
+
+    sz += count * struct.calcsize(HVM_PARAMS_ENTRY_FORMAT)
+
+    if len(content) != sz:
+        raise RecordError("Length should be %u bytes" % (sz, ))
+
+def verify_toolstack(content):
+    # Opaque blob -- nothing to verify.
+    pass
+
+record_verifiers = {
+    REC_TYPE_end : verify_record_end,
+    REC_TYPE_page_data : verify_page_data,
+
+    REC_TYPE_x86_pv_info: verify_x86_pv_info,
+    REC_TYPE_x86_pv_p2m_frames: verify_x86_pv_p2m_frames,
+
+    REC_TYPE_x86_pv_vcpu_basic :
+        lambda x: verify_record_x86_pv_vcpu_generic(x, "basic"),
+    REC_TYPE_x86_pv_vcpu_extended :
+        lambda x: verify_record_x86_pv_vcpu_generic(x, "extended"),
+    REC_TYPE_x86_pv_vcpu_xsave :
+        lambda x: verify_record_x86_pv_vcpu_generic(x, "xsave"),
+    REC_TYPE_x86_pv_vcpu_msrs :
+        lambda x: verify_record_x86_pv_vcpu_generic(x, "msrs"),
+
+    REC_TYPE_shared_info: verify_record_shared_info,
+    REC_TYPE_tsc_info: verify_record_tsc_info,
+
+    REC_TYPE_hvm_context: verify_record_hvm_context,
+    REC_TYPE_hvm_params: verify_record_hvm_params,
+    REC_TYPE_toolstack: verify_toolstack,
+}
+
+_squashed_data_records = 0
+def verify_record(stream):
+    """ Verify a record """
+    global _squashed_data_records
+
+    datasz = struct.calcsize(RH_FORMAT)
+    data = stream.read(datasz)
+
+    if len(data) != datasz:
+        raise IOError("Truncated stream")
+
+    type, length = struct.unpack(RH_FORMAT, data)
+
+    if type not in rec_type_to_str:
+        raise StreamError("Unrecognised record type %x" % (type, ))
+
+    contentsz = (length + 7) & ~7
+    content = stream.read(contentsz)
+
+    if len(content) != contentsz:
+        raise IOError("Truncated stream")
+
+    padding = content[length:]
+    if padding != "\x00" * len(padding):
+        raise StreamError("Padding containing non-zero bytes found")
+
+    if type != REC_TYPE_page_data:
+
+        if _squashed_data_records > 0:
+            print ("Squashed %d valid Page Data records together"
+                   % (_squashed_data_records, ))
+            _squashed_data_records = 0
+
+        print ("Valid Record Header: %s, length %d"
+               % (rec_type_to_str[type], length))
+
+    else:
+        _squashed_data_records += 1
+
+    if type not in record_verifiers:
+        raise RuntimeError("No verification function")
+    else:
+        record_verifiers[type](content[:length])
+
+    return type
+
+def verify_qemu_record(fin):
+
+    sz = struct.calcsize(LIBXL_QEMU_RECORD_HDR)
+
+    hdr = fin.read(sz)
+
+    if len(hdr) == 0:
+        return
+
+    if len(hdr) < sz:
+        raise StreamError("Junk found at the end of the stream")
+
+    (sig, length) = struct.unpack(LIBXL_QEMU_RECORD_HDR, hdr)
+
+    if sig != LIBXL_QEMU_SIGNATURE or length == 0:
+        raise StreamError("Junk found at the end of the stream")
+
+    qemu_record = fin.read(length)
+
+    if len(qemu_record) != length:
+        raise StreamError("Truncated qemu save record")
+
+    print("Libxl qemu save record, length %u" % (length, ))
+
+def main(argv = sys.argv):
+
+    if len(argv) == 2:
+        fin = open(argv[1], "rb")
+    else:
+        fin = sys.stdin
+
+    try:
+        verify_ihdr(fin)
+        verify_dhdr(fin)
+
+        while verify_record(fin) != REC_TYPE_end:
+            pass
+
+        verify_qemu_record(fin)
+
+        if fin.read(1) != "":
+            raise StreamError("Junk found at the end of the stream")
+
+    except (IOError, StreamError, RecordError) as e:
+        print "Error: ", e
+        return 1
+
+    except RuntimeError as e:
+        print "Script error", e
+        print "Please fix me"
+        return 2
+
+    print "Done"
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv))
-- 
1.7.10.4


* [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
  2014-06-11 18:14 ` [PATCH v5 RFC 02/14] scripts: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-17 16:00   ` Ian Campbell
  2014-06-11 18:14 ` [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format Andrew Cooper
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

For testing purposes, the environment variable "XG_MIGRATION_V2" allows the
two save/restore codepaths to coexist, with a runtime switch between them.

It is intended that once this series is less RFC, the v2 framework will
completely replace v1.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/Makefile              |    1 +
 tools/libxc/saverestore/common.h  |   15 +++++++++++++++
 tools/libxc/saverestore/restore.c |   23 +++++++++++++++++++++++
 tools/libxc/saverestore/save.c    |   19 +++++++++++++++++++
 tools/libxc/xc_domain_restore.c   |    8 ++++++++
 tools/libxc/xc_domain_save.c      |    6 ++++++
 tools/libxc/xenguest.h            |   13 +++++++++++++
 7 files changed, 85 insertions(+)
 create mode 100644 tools/libxc/saverestore/common.h
 create mode 100644 tools/libxc/saverestore/restore.c
 create mode 100644 tools/libxc/saverestore/save.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 215101d..e8460c7 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -44,6 +44,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c xc_suspend.c
 ifeq ($(CONFIG_MIGRATE),y)
 GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += $(wildcard saverestore/*.c)
 GUEST_SRCS-y += xc_offline_page.c xc_compression.c
 else
 GUEST_SRCS-y += xc_nomigrate.c
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
new file mode 100644
index 0000000..f1aff44
--- /dev/null
+++ b/tools/libxc/saverestore/common.h
@@ -0,0 +1,15 @@
+#ifndef __COMMON__H
+#define __COMMON__H
+
+#include "../xg_private.h"
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
new file mode 100644
index 0000000..6624baa
--- /dev/null
+++ b/tools/libxc/saverestore/restore.c
@@ -0,0 +1,23 @@
+#include "common.h"
+
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
new file mode 100644
index 0000000..f6ad734
--- /dev/null
+++ b/tools/libxc/saverestore/save.c
@@ -0,0 +1,19 @@
+#include "common.h"
+
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm)
+{
+    IPRINTF("In experimental %s", __func__);
+    return -1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index fe44e4c..2996a0b 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1490,6 +1490,14 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     struct restore_ctx *ctx = &_ctx;
     struct domain_info_context *dinfo = &ctx->dinfo;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_restore2(
+            xch, io_fd, dom, store_evtchn, store_mfn,
+            store_domid, console_evtchn, console_mfn, console_domid,
+            hvm,  pae,  superpages, checkpointed_stream, callbacks);
+    }
+
     DPRINTF("%s: starting restore of new domid %u", __func__, dom);
 
     pagebuf_init(&pagebuf);
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index edb7096..0244f36 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -894,6 +894,12 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     int completed = 0;
 
+    if ( getenv("XG_MIGRATION_V2") )
+    {
+        return xc_domain_save2(xch, io_fd, dom, max_iters,
+                               max_factor, flags, callbacks, hvm);
+    }
+
     DPRINTF("%s: starting save of domid %u", __func__, dom);
 
     if ( hvm && !callbacks->switch_qemu_logdirty )
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 40bbac8..55755cf 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -88,6 +88,10 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                    uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
                    struct save_callbacks* callbacks, int hvm);
 
+/* Domain Save v2 */
+int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+                    uint32_t max_factor, uint32_t flags,
+                    struct save_callbacks* callbacks, int hvm);
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
@@ -124,6 +128,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int hvm, unsigned int pae, int superpages,
                       int checkpointed_stream,
                       struct restore_callbacks *callbacks);
+
+/* Domain Restore v2 */
+int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
+                       unsigned int store_evtchn, unsigned long *store_mfn,
+                       domid_t store_domid, unsigned int console_evtchn,
+                       unsigned long *console_mfn, domid_t console_domid,
+                       unsigned int hvm, unsigned int pae, int superpages,
+                       int checkpointed_stream,
+                       struct restore_callbacks *callbacks);
 /**
  * xc_domain_restore writes a file to disk that contains the device
  * model saved state.
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (2 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12  9:52   ` David Vrabel
  2014-06-12 15:31   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 05/14] tools/libxc: noarch common code Andrew Cooper
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, David Vrabel

Still TODO: Consider namespacing/prefixing

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.c        |   79 +++++++++++++++++
 tools/libxc/saverestore/common.h        |    8 ++
 tools/libxc/saverestore/stream_format.h |  147 +++++++++++++++++++++++++++++++
 3 files changed, 234 insertions(+)
 create mode 100644 tools/libxc/saverestore/common.c
 create mode 100644 tools/libxc/saverestore/stream_format.h

diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
new file mode 100644
index 0000000..de86a90
--- /dev/null
+++ b/tools/libxc/saverestore/common.c
@@ -0,0 +1,79 @@
+#include "common.h"
+
+static const char *dhdr_types[] =
+{
+    [DHDR_TYPE_X86_PV]  = "x86 PV",
+    [DHDR_TYPE_X86_HVM] = "x86 HVM",
+    [DHDR_TYPE_X86_PVH] = "x86 PVH",
+    [DHDR_TYPE_ARM]     = "ARM",
+};
+
+const char *dhdr_type_to_str(uint32_t type)
+{
+    if ( type < ARRAY_SIZE(dhdr_types) && dhdr_types[type] )
+        return dhdr_types[type];
+
+    return "Reserved";
+}
+
+static const char *mandatory_rec_types[] =
+{
+    [REC_TYPE_END]                  = "End",
+    [REC_TYPE_PAGE_DATA]            = "Page data",
+    [REC_TYPE_X86_PV_INFO]          = "x86 PV info",
+    [REC_TYPE_X86_PV_P2M_FRAMES]    = "x86 PV P2M frames",
+    [REC_TYPE_X86_PV_VCPU_BASIC]    = "x86 PV vcpu basic",
+    [REC_TYPE_X86_PV_VCPU_EXTENDED] = "x86 PV vcpu extended",
+    [REC_TYPE_X86_PV_VCPU_XSAVE]    = "x86 PV vcpu xsave",
+    [REC_TYPE_SHARED_INFO]          = "Shared info",
+    [REC_TYPE_TSC_INFO]             = "TSC info",
+    [REC_TYPE_HVM_CONTEXT]          = "HVM context",
+    [REC_TYPE_HVM_PARAMS]           = "HVM params",
+    [REC_TYPE_TOOLSTACK]            = "Toolstack",
+    [REC_TYPE_X86_PV_VCPU_MSRS]     = "x86 PV vcpu msrs",
+};
+
+/*
+static const char *optional_rec_types[] =
+{
+     None yet...
+};
+*/
+
+const char *rec_type_to_str(uint32_t type)
+{
+    if ( type & REC_TYPE_OPTIONAL )
+        return "Reserved";
+
+    if ( ((type & REC_TYPE_OPTIONAL) == 0 ) &&
+         (type < ARRAY_SIZE(mandatory_rec_types)) &&
+         (mandatory_rec_types[type]) )
+        return mandatory_rec_types[type];
+
+    return "Reserved";
+}
+
+static void __attribute__((unused)) build_assertions(void)
+{
+    XC_BUILD_BUG_ON(sizeof(struct ihdr) != 24);
+    XC_BUILD_BUG_ON(sizeof(struct dhdr) != 16);
+    XC_BUILD_BUG_ON(sizeof(struct rhdr) != 8);
+
+    XC_BUILD_BUG_ON(sizeof(struct rec_page_data_header)  != 8);
+    XC_BUILD_BUG_ON(sizeof(struct rec_x86_pv_info)       != 8);
+    XC_BUILD_BUG_ON(sizeof(struct rec_x86_pv_p2m_frames) != 8);
+    XC_BUILD_BUG_ON(sizeof(struct rec_x86_pv_vcpu_hdr)   != 8);
+    XC_BUILD_BUG_ON(sizeof(struct rec_tsc_info)          != 24);
+    XC_BUILD_BUG_ON(sizeof(struct rec_hvm_params_entry)  != 16);
+    XC_BUILD_BUG_ON(sizeof(struct rec_hvm_params)        != 8);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index f1aff44..cbecf0a 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -3,6 +3,14 @@
 
 #include "../xg_private.h"
 
+#include "stream_format.h"
+
+/* String representation of Domain Header types. */
+const char *dhdr_type_to_str(uint32_t type);
+
+/* String representation of Record types. */
+const char *rec_type_to_str(uint32_t type);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/saverestore/stream_format.h b/tools/libxc/saverestore/stream_format.h
new file mode 100644
index 0000000..7e1ab76
--- /dev/null
+++ b/tools/libxc/saverestore/stream_format.h
@@ -0,0 +1,147 @@
+#ifndef __STREAM_FORMAT__H
+#define __STREAM_FORMAT__H
+
+/*
+ * C structures for the Migration v2 stream format.
+ * See docs/specs/libxc-migration-stream.pandoc
+ */
+
+#include <inttypes.h>
+
+/*
+ * Image Header
+ */
+struct ihdr
+{
+    uint64_t marker;
+    uint32_t id;
+    uint32_t version;
+    uint16_t options;
+    uint16_t _res1;
+    uint32_t _res2;
+};
+
+#define IHDR_MARKER  0xffffffffffffffffULL
+#define IHDR_ID      0x58454E46U
+#define IHDR_VERSION 1
+
+#define _IHDR_OPT_ENDIAN 0
+#define IHDR_OPT_LITTLE_ENDIAN (0 << _IHDR_OPT_ENDIAN)
+#define IHDR_OPT_BIG_ENDIAN    (1 << _IHDR_OPT_ENDIAN)
+
+/*
+ * Domain Header
+ */
+struct dhdr
+{
+    uint32_t type;
+    uint16_t page_shift;
+    uint16_t _res1;
+    uint32_t xen_major;
+    uint32_t xen_minor;
+};
+
+#define DHDR_TYPE_X86_PV  0x00000001U
+#define DHDR_TYPE_X86_HVM 0x00000002U
+#define DHDR_TYPE_X86_PVH 0x00000003U
+#define DHDR_TYPE_ARM     0x00000004U
+
+/*
+ * Record Header
+ */
+struct rhdr
+{
+    uint32_t type;
+    uint32_t length;
+};
+
+/* All records must be aligned up to an 8 octet boundary */
+#define REC_ALIGN_ORDER               (3U)
+/* Somewhat arbitrary - 8MB */
+#define REC_LENGTH_MAX                (8U << 20)
+
+#define REC_TYPE_END                  0x00000000U
+#define REC_TYPE_PAGE_DATA            0x00000001U
+#define REC_TYPE_X86_PV_INFO          0x00000002U
+#define REC_TYPE_X86_PV_P2M_FRAMES    0x00000003U
+#define REC_TYPE_X86_PV_VCPU_BASIC    0x00000004U
+#define REC_TYPE_X86_PV_VCPU_EXTENDED 0x00000005U
+#define REC_TYPE_X86_PV_VCPU_XSAVE    0x00000006U
+#define REC_TYPE_SHARED_INFO          0x00000007U
+#define REC_TYPE_TSC_INFO             0x00000008U
+#define REC_TYPE_HVM_CONTEXT          0x00000009U
+#define REC_TYPE_HVM_PARAMS           0x0000000aU
+#define REC_TYPE_TOOLSTACK            0x0000000bU
+#define REC_TYPE_X86_PV_VCPU_MSRS     0x0000000cU
+
+#define REC_TYPE_OPTIONAL             0x80000000U
+
+/* PAGE_DATA */
+struct rec_page_data_header
+{
+    uint32_t count;
+    uint32_t _res1;
+    uint64_t pfn[0];
+};
+
+#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
+#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
+
+/* X86_PV_INFO */
+struct rec_x86_pv_info
+{
+    uint8_t guest_width;
+    uint8_t pt_levels;
+    uint8_t _res[6];
+};
+
+/* X86_PV_P2M_FRAMES */
+struct rec_x86_pv_p2m_frames
+{
+    uint32_t start_pfn;
+    uint32_t end_pfn;
+    uint64_t p2m_pfns[0];
+};
+
+/* X86_PV_VCPU_{BASIC,EXTENDED,XSAVE,MSRS} */
+struct rec_x86_pv_vcpu_hdr
+{
+    uint32_t vcpu_id;
+    uint32_t _res1;
+    uint8_t context[0];
+};
+
+/* TSC_INFO */
+struct rec_tsc_info
+{
+    uint32_t mode;
+    uint32_t khz;
+    uint64_t nsec;
+    uint32_t incarnation;
+    uint32_t _res1;
+};
+
+/* HVM_PARAMS */
+struct rec_hvm_params_entry
+{
+    uint64_t index;
+    uint64_t value;
+};
+
+struct rec_hvm_params
+{
+    uint32_t count;
+    uint32_t _res1;
+    struct rec_hvm_params_entry param[0];
+};
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (3 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12  9:55   ` David Vrabel
  2014-06-17 16:10   ` Ian Campbell
  2014-06-11 18:14 ` [PATCH v5 RFC 06/14] tools/libxc: x86 " Andrew Cooper
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.c |   50 +++++++
 tools/libxc/saverestore/common.h |  269 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 319 insertions(+)

diff --git a/tools/libxc/saverestore/common.c b/tools/libxc/saverestore/common.c
index de86a90..b02f518 100644
--- a/tools/libxc/saverestore/common.c
+++ b/tools/libxc/saverestore/common.c
@@ -1,3 +1,5 @@
+#include <assert.h>
+
 #include "common.h"
 
 static const char *dhdr_types[] =
@@ -53,6 +55,54 @@ const char *rec_type_to_str(uint32_t type)
     return "Reserved";
 }
 
+int write_split_record(struct context *ctx, struct record *rec,
+                       void *buf, size_t sz)
+{
+    static const char zeroes[7] = { 0 };
+    xc_interface *xch = ctx->xch;
+    uint32_t combined_length = rec->length + sz;
+    size_t record_length = ROUNDUP(combined_length, REC_ALIGN_ORDER);
+
+    if ( record_length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
+              " exceeds max (0x%"PRIx32")", rec->type,
+              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    if ( rec->length )
+        assert(rec->data);
+    if ( sz )
+        assert(buf);
+
+    if ( write_exact(ctx->fd, &rec->type, sizeof(rec->type)) ||
+         write_exact(ctx->fd, &combined_length, sizeof(rec->length)) ||
+         (rec->length && write_exact(ctx->fd, rec->data, rec->length)) ||
+         (sz && write_exact(ctx->fd, buf, sz)) ||
+         write_exact(ctx->fd, zeroes, record_length - combined_length) )
+    {
+        PERROR("Unable to write record to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
+int write_record_header(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( write_exact(ctx->fd, &rec->type, sizeof(rec->type)) ||
+         write_exact(ctx->fd, &rec->length, sizeof(rec->length)) )
+    {
+        PERROR("Unable to write record to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
 static void __attribute__((unused)) build_assertions(void)
 {
     XC_BUILD_BUG_ON(sizeof(struct ihdr) != 24);
diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index cbecf0a..d9a3655 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -1,7 +1,20 @@
 #ifndef __COMMON__H
 #define __COMMON__H
 
+#include <stdbool.h>
+
+// Hack out junk from the namespace
+#define mfn_to_pfn __UNUSED_mfn_to_pfn
+#define pfn_to_mfn __UNUSED_pfn_to_mfn
+
 #include "../xg_private.h"
+#include "../xg_save_restore.h"
+#include "../xc_dom.h"
+#include "../xc_bitops.h"
+
+#undef mfn_to_pfn
+#undef pfn_to_mfn
+
 
 #include "stream_format.h"
 
@@ -11,6 +24,262 @@ const char *dhdr_type_to_str(uint32_t type);
 /* String representation of Record types. */
 const char *rec_type_to_str(uint32_t type);
 
+struct context;
+struct record;
+
+/**
+ * Guest common operations, to be implemented for each type of domain. Used by
+ * both the save and restore code.
+ */
+struct common_ops
+{
+    /* Check to see whether a PFN is valid. */
+    bool (*pfn_is_valid)(const struct context *ctx, xen_pfn_t pfn);
+
+    /* Convert a PFN to GFN.  May return ~0UL for an invalid mapping. */
+    xen_pfn_t (*pfn_to_gfn)(const struct context *ctx, xen_pfn_t pfn);
+
+    /* Set the GFN of a PFN. */
+    void (*set_gfn)(struct context *ctx, xen_pfn_t pfn, xen_pfn_t gfn);
+
+    /* Set the type of a PFN. */
+    void (*set_page_type)(struct context *ctx, xen_pfn_t pfn, xen_pfn_t type);
+};
+
+
+/**
+ * Save operations.  To be implemented for each type of guest, for use by the
+ * common save algorithm.
+ *
+ * Every function must be implemented, even if only with a no-op stub.
+ */
+struct save_ops
+{
+    /**
+     * Optionally transform the contents of a page from being specific to the
+     * sending environment, to being generic for the stream.
+     *
+     * The page of data referenced by '*page' may be a read-only mapping of a
+     * running guest; it must not be modified.  If no transformation is
+     * required, the callee should leave '*page' untouched.
+     *
+     * If a transformation is required, the callee should allocate themselves
+     * a local page using malloc() and return it via '*page'.
+     *
+     * The caller shall free() '*page' in all cases.  If the callee
+     * encounters an error, it should *NOT* free() the memory it has
+     * allocated for '*page'.
+     *
+     * It is valid to fail with EAGAIN if the transformation cannot be
+     * completed at this point.  The page shall be retried later.
+     *
+     * @returns 0 for success, -1 for failure, with errno appropriately set.
+     */
+    int (*normalise_page)(struct context *ctx, xen_pfn_t type, void **page);
+
+    /**
+     * Set up local environment to restore a domain.  This is called before
+     * any records are written to the stream.  (Typically querying running
+     * domain state, setting up mappings etc.)
+     */
+    int (*setup)(struct context *ctx);
+
+    /**
+     * Write records which need to be at the start of the stream.  This is
+     * called after the Image and Domain headers are written.  (Any records
+     * which need to be ahead of the memory.)
+     */
+    int (*start_of_stream)(struct context *ctx);
+
+    /**
+     * Write records which need to be at the end of the stream, following the
+     * complete memory contents.  The caller shall handle writing the END
+     * record into the stream.  (Any records which need to be after the memory
+     * is complete.)
+     */
+    int (*end_of_stream)(struct context *ctx);
+
+    /**
+     * Clean up the local environment.  Will be called exactly once, either
+     * after a successful save, or upon encountering an error.
+     */
+    int (*cleanup)(struct context *ctx);
+};
+
+
+/**
+ * Restore operations.  To be implemented for each type of guest, for use by
+ * the common restore algorithm.
+ *
+ * Every function must be implemented, even if only with a no-op stub.
+ */
+struct restore_ops
+{
+    /**
+     * Optionally transform the contents of a page from being generic in the
+     * stream, to being specific to the restoring environment.
+     *
+     * 'page' is expected to be modified in-place if a transformation is
+     * required.
+     *
+     * @returns 0 for success, -1 for failure, with errno appropriately set.
+     */
+    int (*localise_page)(struct context *ctx, uint32_t type, void *page);
+
+    /**
+     * Set up local environment to restore a domain.  This is called before
+     * any records are read from the stream.
+     */
+    int (*setup)(struct context *ctx);
+
+    /**
+     * Process an individual record from the stream.  The caller shall take
+     * care of processing the END and PAGE_DATA records.
+     *
+     * Unknown mandatory records, or invalid records for the type of domain
+     * should result in failure.
+     */
+    int (*process_record)(struct context *ctx, struct record *rec);
+
+    /**
+     * Perform any actions required after the stream has been finished. Called
+     * after the END record has been received.
+     */
+    int (*stream_complete)(struct context *ctx);
+
+    /**
+     * Clean up the local environment.  Will be called exactly once, either
+     * after a successful restore, or upon encountering an error.
+     */
+    int (*cleanup)(struct context *ctx);
+};
+
+
+struct context
+{
+    xc_interface *xch;
+    uint32_t domid;
+    int fd;
+
+    xc_dominfo_t dominfo;
+
+    struct common_ops ops;
+
+    union
+    {
+        struct
+        {
+            struct restore_ops ops;
+            struct restore_callbacks *callbacks;
+
+            /* From Image Header */
+            uint32_t format_version;
+
+            /* From Domain Header */
+            uint32_t guest_type;
+            uint32_t guest_page_size;
+
+            /* Xenstore and Console parameters. Some as input from caller,
+             * some as input from stream, some output. */
+            unsigned long xenstore_mfn, console_mfn;
+            unsigned int xenstore_evtchn, console_evtchn;
+            domid_t xenstore_domid, console_domid;
+
+            /* Bitmap of currently populated PFNs during restore. */
+            unsigned long *populated_pfns;
+            unsigned int max_populated_pfn;
+        } restore;
+
+        struct
+        {
+            struct save_ops ops;
+            struct save_callbacks *callbacks;
+
+            unsigned long p2m_size;
+
+            xen_pfn_t *batch_pfns;
+            unsigned nr_batch_pfns;
+            unsigned long *deferred_pages;
+        } save;
+    };
+
+    union
+    {
+        struct
+        {
+            /* 4 or 8; 32 or 64 bit domain */
+            unsigned int width;
+            /* 3 or 4 pagetable levels */
+            unsigned int levels;
+
+            /* Maximum Xen frame */
+            unsigned long max_mfn;
+            /* Read-only machine to phys map */
+            xen_pfn_t *m2p;
+            /* first mfn of the compat m2p (Only needed for 32bit PV guests) */
+            xen_pfn_t compat_m2p_mfn0;
+            /* Number of m2p frames mapped */
+            unsigned long nr_m2p_frames;
+
+            /* Maximum guest frame */
+            unsigned long max_pfn;
+
+            /* Number of frames making up the p2m */
+            unsigned int p2m_frames;
+            /* Guest's phys to machine map.  Mapped read-only (save) or
+             * allocated locally (restore).  Uses guest unsigned longs. */
+            void *p2m;
+            /* The guest pfns containing the p2m leaves */
+            xen_pfn_t *p2m_pfns;
+            /* Types for each page */
+            uint32_t *pfn_types;
+
+            /* Read-only mapping of the guest's shared info page */
+            shared_info_any_t *shinfo;
+        } x86_pv;
+    };
+};
+
+
+struct record
+{
+    uint32_t type;
+    uint32_t length;
+    void *data;
+};
+
+/*
+ * Writes a record header into the stream.  The caller is responsible for
+ * ensuring that they subsequently write the correct amount of data into the
+ * stream, including appropriate padding.
+ */
+int write_record_header(struct context *ctx, struct record *rec);
+
+/*
+ * Writes a split record to the stream, applying correct padding where
+ * appropriate.  It is common when sending records containing blobs from Xen
+ * that the header and blob data are separate.  This function accepts a second
+ * buffer and length, and will merge it with the main record when sending.
+ *
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non-zero on failure.
+ */
+int write_split_record(struct context *ctx, struct record *rec, void *buf, size_t sz);
+
+/*
+ * Writes a record to the stream, applying correct padding where appropriate.
+ * Records with a non-zero length must provide a valid data field; records
+ * with a 0 length shall have their data field ignored.
+ *
+ * Returns 0 on success and non-zero on failure.
+ */
+static inline int write_record(struct context *ctx, struct record *rec)
+{
+    return write_split_record(ctx, rec, NULL, 0);
+}
+
 #endif
 /*
  * Local variables:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 06/14] tools/libxc: x86 common code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (4 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 05/14] tools/libxc: noarch common code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12  9:57   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 07/14] tools/libxc: x86 PV " Andrew Cooper
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common_x86.c |   54 ++++++++++++++++++++++++++++++++++
 tools/libxc/saverestore/common_x86.h |   26 ++++++++++++++++
 2 files changed, 80 insertions(+)
 create mode 100644 tools/libxc/saverestore/common_x86.c
 create mode 100644 tools/libxc/saverestore/common_x86.h

diff --git a/tools/libxc/saverestore/common_x86.c b/tools/libxc/saverestore/common_x86.c
new file mode 100644
index 0000000..8907454
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86.c
@@ -0,0 +1,54 @@
+#include "common_x86.h"
+
+int write_tsc_info(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_tsc_info tsc = { 0 };
+    struct record rec =
+    {
+        .type = REC_TYPE_TSC_INFO,
+        .length = sizeof(tsc),
+        .data = &tsc
+    };
+
+    if ( xc_domain_get_tsc_info(xch, ctx->domid, &tsc.mode,
+                                &tsc.nsec, &tsc.khz, &tsc.incarnation) < 0 )
+    {
+        PERROR("Unable to obtain TSC information");
+        return -1;
+    }
+
+    return write_record(ctx, &rec);
+}
+
+int handle_tsc_info(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_tsc_info *tsc = rec->data;
+
+    if ( rec->length != sizeof(*tsc) )
+    {
+        ERROR("TSC_INFO record wrong size: length %"PRIu32", expected %zu",
+              rec->length, sizeof(*tsc));
+        return -1;
+    }
+
+    if ( xc_domain_set_tsc_info(xch, ctx->domid, tsc->mode,
+                                tsc->nsec, tsc->khz, tsc->incarnation) )
+    {
+        PERROR("Unable to set TSC information");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86.h b/tools/libxc/saverestore/common_x86.h
new file mode 100644
index 0000000..8c6ebc0
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86.h
@@ -0,0 +1,26 @@
+#ifndef __COMMON_X86__H
+#define __COMMON_X86__H
+
+#include "common.h"
+
+/*
+ * Obtains a domain's TSC information from Xen and writes a TSC_INFO record
+ * into the stream.
+ */
+int write_tsc_info(struct context *ctx);
+
+/*
+ * Parses a TSC_INFO record and applies the result to the domain.
+ */
+int handle_tsc_info(struct context *ctx, struct record *rec);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 07/14] tools/libxc: x86 PV common code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (5 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 06/14] tools/libxc: x86 " Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12  9:59   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 08/14] tools/libxc: x86 PV save code Andrew Cooper
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h        |    5 +
 tools/libxc/saverestore/common_x86_pv.c |  250 +++++++++++++++++++++++++++++++
 tools/libxc/saverestore/common_x86_pv.h |  144 ++++++++++++++++++
 3 files changed, 399 insertions(+)
 create mode 100644 tools/libxc/saverestore/common_x86_pv.c
 create mode 100644 tools/libxc/saverestore/common_x86_pv.h

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index d9a3655..5f6af00 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -14,6 +14,10 @@
 
 #undef mfn_to_pfn
 #undef pfn_to_mfn
+#undef GET_FIELD
+#undef SET_FIELD
+#undef MEMCPY_FIELD
+#undef MEMSET_ARRAY_FIELD
 
 
 #include "stream_format.h"
@@ -240,6 +244,7 @@ struct context
     };
 };
 
+extern struct common_ops common_ops_x86_pv;
 
 struct record
 {
diff --git a/tools/libxc/saverestore/common_x86_pv.c b/tools/libxc/saverestore/common_x86_pv.c
new file mode 100644
index 0000000..02a006b
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_pv.c
@@ -0,0 +1,250 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+/* common_ops function */
+static bool x86_pv_pfn_is_valid(const struct context *ctx, xen_pfn_t pfn)
+{
+    return pfn <= ctx->x86_pv.max_pfn;
+}
+
+/* common_ops function */
+static xen_pfn_t x86_pv_pfn_to_gfn(const struct context *ctx, xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    if ( ctx->x86_pv.width == sizeof(uint64_t) )
+        /* 64 bit guest.  Need to truncate their pfns for 32 bit toolstacks */
+        return ((uint64_t *)ctx->x86_pv.p2m)[pfn];
+    else
+    {
+        /* 32 bit guest.  Need to expand INVALID_MFN for 64 bit toolstacks */
+        uint32_t mfn = ((uint32_t *)ctx->x86_pv.p2m)[pfn];
+
+        return mfn == ~0U ? INVALID_MFN : mfn;
+    }
+}
+
+/* common_ops function */
+static void x86_pv_set_page_type(struct context *ctx, xen_pfn_t pfn,
+                                 unsigned long type)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    ctx->x86_pv.pfn_types[pfn] = type;
+}
+
+/* common_ops function */
+static void x86_pv_set_gfn(struct context *ctx, xen_pfn_t pfn,
+                           xen_pfn_t mfn)
+{
+    assert(pfn <= ctx->x86_pv.max_pfn);
+
+    if ( ctx->x86_pv.width == sizeof(uint64_t) )
+        /* 64 bit guest.  Need to expand INVALID_MFN for 32 bit toolstacks */
+        ((uint64_t *)ctx->x86_pv.p2m)[pfn] = mfn == INVALID_MFN ? ~0ULL : mfn;
+    else
+        /* 32 bit guest.  Can safely truncate INVALID_MFN for 64 bit toolstacks */
+        ((uint32_t *)ctx->x86_pv.p2m)[pfn] = mfn;
+}
+
+struct common_ops common_ops_x86_pv = {
+    .pfn_is_valid   = x86_pv_pfn_is_valid,
+    .pfn_to_gfn     = x86_pv_pfn_to_gfn,
+    .set_page_type  = x86_pv_set_page_type,
+    .set_gfn        = x86_pv_set_gfn,
+};
+
+xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn)
+{
+    assert(mfn <= ctx->x86_pv.max_mfn);
+    return ctx->x86_pv.m2p[mfn];
+}
+
+bool mfn_in_pseudophysmap(struct context *ctx, xen_pfn_t mfn)
+{
+    return ( (mfn <= ctx->x86_pv.max_mfn) &&
+             (mfn_to_pfn(ctx, mfn) <= ctx->x86_pv.max_pfn) &&
+             (ctx->ops.pfn_to_gfn(ctx, mfn_to_pfn(ctx, mfn)) == mfn) );
+}
+
+void dump_bad_pseudophysmap_entry(struct context *ctx, xen_pfn_t mfn)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn = ~0UL;
+
+    ERROR("mfn %#lx, max %#lx", mfn, ctx->x86_pv.max_mfn);
+
+    if ( (mfn != ~0UL) && (mfn <= ctx->x86_pv.max_mfn) )
+    {
+        pfn = ctx->x86_pv.m2p[mfn];
+        ERROR("  m2p[%#lx] = %#lx, max_pfn %#lx",
+              mfn, pfn, ctx->x86_pv.max_pfn);
+    }
+
+    if ( (pfn != ~0UL) && (pfn <= ctx->x86_pv.max_pfn) )
+        ERROR("  p2m[%#lx] = %#lx",
+              pfn, ctx->ops.pfn_to_gfn(ctx, pfn));
+}
+
+xen_pfn_t cr3_to_mfn(struct context *ctx, uint64_t cr3)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return cr3 >> 12;
+    else
+        return (((uint32_t)cr3 >> 12) | ((uint32_t)cr3 << 20));
+}
+
+uint64_t mfn_to_cr3(struct context *ctx, xen_pfn_t mfn)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return ((uint64_t)mfn) << 12;
+    else
+        return (((uint32_t)mfn << 12) | ((uint32_t)mfn >> 20));
+}
+
+int x86_pv_domain_info(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned int guest_width, guest_levels, fpp;
+    int max_pfn;
+
+    /* Get the domain width */
+    if ( xc_domain_get_guest_width(xch, ctx->domid, &guest_width) )
+    {
+        PERROR("Unable to determine dom%d's width", ctx->domid);
+        return -1;
+    }
+
+    if ( guest_width == 4 )
+        guest_levels = 3;
+    else if ( guest_width == 8 )
+        guest_levels = 4;
+    else
+    {
+        ERROR("Invalid guest width %d.  Expected 32 or 64", guest_width * 8);
+        return -1;
+    }
+    ctx->x86_pv.width = guest_width;
+    ctx->x86_pv.levels = guest_levels;
+    fpp = PAGE_SIZE / ctx->x86_pv.width;
+
+    DPRINTF("%d bits, %d levels", guest_width * 8, guest_levels);
+
+    /* Get the domain's size */
+    max_pfn = xc_domain_maximum_gpfn(xch, ctx->domid);
+    if ( max_pfn < 0 )
+    {
+        PERROR("Unable to obtain the guest's max pfn");
+        return -1;
+    }
+
+    if ( max_pfn > 0 )
+    {
+        ctx->x86_pv.max_pfn = max_pfn;
+        ctx->x86_pv.p2m_frames = (ctx->x86_pv.max_pfn + fpp) / fpp;
+
+        DPRINTF("max_pfn %#x, p2m_frames %d", max_pfn, ctx->x86_pv.p2m_frames);
+    }
+
+    return 0;
+}
+
+int x86_pv_map_m2p(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    long max_page = xc_maximum_ram_page(xch);
+    unsigned long m2p_chunks, m2p_size;
+    privcmd_mmap_entry_t *entries = NULL;
+    xen_pfn_t *extents_start = NULL;
+    int rc = -1, i;
+
+    if ( max_page < 0 )
+    {
+        PERROR("Failed to get maximum ram page");
+        goto err;
+    }
+
+    ctx->x86_pv.max_mfn = max_page;
+    m2p_size   = M2P_SIZE(ctx->x86_pv.max_mfn);
+    m2p_chunks = M2P_CHUNKS(ctx->x86_pv.max_mfn);
+
+    extents_start = malloc(m2p_chunks * sizeof(xen_pfn_t));
+    if ( !extents_start )
+    {
+        ERROR("Unable to allocate %zu bytes for m2p mfns",
+              m2p_chunks * sizeof(xen_pfn_t));
+        goto err;
+    }
+
+    if ( xc_machphys_mfn_list(xch, m2p_chunks, extents_start) )
+    {
+        PERROR("Failed to get m2p mfn list");
+        goto err;
+    }
+
+    entries = malloc(m2p_chunks * sizeof(privcmd_mmap_entry_t));
+    if ( !entries )
+    {
+        ERROR("Unable to allocate %zu bytes for m2p mapping mfns",
+              m2p_chunks * sizeof(privcmd_mmap_entry_t));
+        goto err;
+    }
+
+    for ( i = 0; i < m2p_chunks; ++i )
+        entries[i].mfn = extents_start[i];
+
+    ctx->x86_pv.m2p = xc_map_foreign_ranges(
+        xch, DOMID_XEN, m2p_size, PROT_READ,
+        M2P_CHUNK_SIZE, entries, m2p_chunks);
+
+    if ( !ctx->x86_pv.m2p )
+    {
+        PERROR("Failed to mmap m2p ranges");
+        goto err;
+    }
+
+    ctx->x86_pv.nr_m2p_frames = (M2P_CHUNK_SIZE >> PAGE_SHIFT) * m2p_chunks;
+
+#ifdef __i386__
+    /* 32 bit toolstacks automatically get the compat m2p */
+    ctx->x86_pv.compat_m2p_mfn0 = entries[0].mfn;
+#else
+    /* 64 bit toolstacks need to ask Xen specially for it */
+    {
+        struct xen_machphys_mfn_list xmml = {
+            .max_extents = 1,
+            .extent_start = { &ctx->x86_pv.compat_m2p_mfn0 }
+        };
+
+        rc = do_memory_op(xch, XENMEM_machphys_compat_mfn_list,
+                          &xmml, sizeof(xmml));
+        if ( rc || xmml.nr_extents != 1 )
+        {
+            PERROR("Failed to get compat mfn list from Xen");
+            rc = -1;
+            goto err;
+        }
+    }
+#endif
+
+    /* All Done */
+    rc = 0;
+    DPRINTF("max_mfn %#lx", ctx->x86_pv.max_mfn);
+
+err:
+    free(entries);
+    free(extents_start);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxc/saverestore/common_x86_pv.h b/tools/libxc/saverestore/common_x86_pv.h
new file mode 100644
index 0000000..bb2e9fc
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_pv.h
@@ -0,0 +1,144 @@
+#ifndef __COMMON_X86_PV_H
+#define __COMMON_X86_PV_H
+
+#include "common_x86.h"
+
+/* Gets a field from an *_any union */
+#define GET_FIELD(_c, _p, _f)                   \
+    ({ (_c)->x86_pv.width == 8 ?                \
+            (_p)->x64._f:                       \
+            (_p)->x32._f;                       \
+    })
+
+/* Sets a field in an *_any union */
+#define SET_FIELD(_c, _p, _f, _v)               \
+    ({ if ( (_c)->x86_pv.width == 8 )           \
+            (_p)->x64._f = (_v);                \
+        else                                    \
+            (_p)->x32._f = (_v);                \
+    })
+
+/* memcpy field _f from _s to _d, of an *_any union */
+#define MEMCPY_FIELD(_c, _d, _s, _f)                                    \
+    ({ if ( (_c)->x86_pv.width == 8 )                                   \
+            memcpy(&(_d)->x64._f, &(_s)->x64._f, sizeof((_d)->x64._f)); \
+        else                                                            \
+            memcpy(&(_d)->x32._f, &(_s)->x32._f, sizeof((_d)->x32._f)); \
+    })
+
+/* memset array field _f with value _v, from an *_any union */
+#define MEMSET_ARRAY_FIELD(_c, _d, _f, _v)                              \
+    ({ if ( (_c)->x86_pv.width == 8 )                                   \
+           memset(&(_d)->x64._f[0], (_v), sizeof((_d)->x64._f));        \
+       else                                                             \
+           memset(&(_d)->x32._f[0], (_v), sizeof((_d)->x32._f));        \
+    })
+
+/*
+ * Convert an mfn to a pfn, given Xen's m2p table.
+ *
+ * Caller must ensure that the requested mfn is in range.
+ */
+xen_pfn_t mfn_to_pfn(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Convert a pfn to an mfn, given the guest's p2m table.
+ *
+ * Caller must ensure that the requested pfn is in range.
+ */
+xen_pfn_t pfn_to_mfn(struct context *ctx, xen_pfn_t pfn);
+
+/*
+ * Set a mapping in the p2m table.
+ *
+ * Caller must ensure that the requested pfn is in range.
+ */
+void set_p2m(struct context *ctx, xen_pfn_t pfn, xen_pfn_t mfn);
+
+/*
+ * Query whether a particular mfn is valid in the physmap of a guest.
+ */
+bool mfn_in_pseudophysmap(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Debug a particular mfn by walking the p2m and m2p.
+ */
+void dump_bad_pseudophysmap_entry(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Convert a PV cr3 field to an mfn.
+ *
+ * Adjusts for Xen's extended-cr3 format to pack a 44bit physical address into
+ * a 32bit architectural cr3.
+ */
+xen_pfn_t cr3_to_mfn(struct context *ctx, uint64_t cr3);
+
+/*
+ * Convert an mfn to a PV cr3 field.
+ *
+ * Adjusts for Xen's extended-cr3 format to pack a 44bit physical address into
+ * a 32bit architectural cr3.
+ */
+uint64_t mfn_to_cr3(struct context *ctx, xen_pfn_t mfn);
+
+/*
+ * Extract an mfn from a Pagetable Entry.
+ */
+static inline xen_pfn_t pte_to_frame(struct context *ctx, uint64_t pte)
+{
+    if ( ctx->x86_pv.width == 8 )
+        return (pte >> PAGE_SHIFT) & ((1ULL << (52 - PAGE_SHIFT)) - 1);
+    else
+        return (pte >> PAGE_SHIFT) & ((1ULL << (44 - PAGE_SHIFT)) - 1);
+}
+
+/*
+ * Change the mfn in a Pagetable Entry while leaving the flags alone.
+ */
+static inline void update_pte(struct context *ctx, uint64_t *pte, xen_pfn_t mfn)
+{
+    if ( ctx->x86_pv.width == 8 )
+        *pte &= ~(((1ULL << (52 - PAGE_SHIFT)) - 1) << PAGE_SHIFT);
+    else
+        *pte &= ~(((1ULL << (44 - PAGE_SHIFT)) - 1) << PAGE_SHIFT);
+
+    *pte |= (uint64_t)mfn << PAGE_SHIFT;
+}
+
+/*
+ * Get current domain information.
+ *
+ * Fills ctx->x86_pv
+ * - .width
+ * - .levels
+ * - .max_pfn
+ * - .p2m_frames
+ *
+ * Used by the save side to create the X86_PV_INFO record, and by the restore
+ * side to verify the incoming stream.
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_domain_info(struct context *ctx);
+
+/*
+ * Maps the Xen M2P.
+ *
+ * Fills ctx->x86_pv.
+ * - .max_mfn
+ * - .m2p
+ *
+ * Returns 0 on success and non-zero on error.
+ */
+int x86_pv_map_m2p(struct context *ctx);
+
+#endif
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 08/14] tools/libxc: x86 PV save code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (6 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 07/14] tools/libxc: x86 PV " Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:04   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code Andrew Cooper
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h      |    2 +
 tools/libxc/saverestore/save_x86_pv.c |  839 +++++++++++++++++++++++++++++++++
 2 files changed, 841 insertions(+)
 create mode 100644 tools/libxc/saverestore/save_x86_pv.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 5f6af00..5c8a370 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -246,6 +246,8 @@ struct context
 
 extern struct common_ops common_ops_x86_pv;
 
+extern struct save_ops save_ops_x86_pv;
+
 struct record
 {
     uint32_t type;
diff --git a/tools/libxc/saverestore/save_x86_pv.c b/tools/libxc/saverestore/save_x86_pv.c
new file mode 100644
index 0000000..1def92e
--- /dev/null
+++ b/tools/libxc/saverestore/save_x86_pv.c
@@ -0,0 +1,839 @@
+#include "common_x86_pv.h"
+
+/*
+ * Maps the guest's shared info page.
+ */
+static int map_shinfo(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    ctx->x86_pv.shinfo = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ, ctx->dominfo.shared_info_frame);
+    if ( !ctx->x86_pv.shinfo )
+    {
+        PERROR("Failed to map shared info frame at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Copy a list of mfns from a guest, accounting for differences between guest
+ * and toolstack width.
+ */
+static void copy_mfns_from_guest(const struct context *ctx, xen_pfn_t *dst,
+                                 const void *src, size_t count)
+{
+    size_t x;
+
+    if ( ctx->x86_pv.width == sizeof(unsigned long) )
+        memcpy(dst, src, count * sizeof(*dst));
+    else
+    {
+        for ( x = 0; x < count; ++x )
+        {
+#ifdef __x86_64__
+            /* 64bit toolstack, 32bit guest.  Expand any INVALID_MFN. */
+            uint32_t s = ((uint32_t *)src)[x];
+
+            dst[x] = s == ~0U ? INVALID_MFN : s;
+#else
+            /* 32bit toolstack, 64bit guest.  Truncate their pointers */
+            dst[x] = ((uint64_t *)src)[x];
+#endif
+        }
+    }
+}
+
+/*
+ * Walk the guests frame list list and frame list to identify and map the
+ * frames making up the guests p2m table.  Construct a list of pfns making up
+ * the table.
+ */
+static int map_p2m(struct context *ctx)
+{
+    /* Terminology:
+     *
+     * fll   - frame list list, top level p2m, list of fl mfns
+     * fl    - frame list, mid level p2m, list of leaf mfns
+     * local - own allocated buffers, adjusted for bitness
+     * guest - mappings into the domain
+     */
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    unsigned x, fpp, fll_entries, fl_entries;
+    xen_pfn_t fll_mfn;
+
+    xen_pfn_t *local_fll = NULL;
+    void *guest_fll = NULL;
+    size_t local_fll_size;
+
+    xen_pfn_t *local_fl = NULL;
+    void *guest_fl = NULL;
+    size_t local_fl_size;
+
+    fpp = PAGE_SIZE / ctx->x86_pv.width;
+    fll_entries = (ctx->x86_pv.max_pfn / (fpp * fpp)) + 1;
+    fl_entries  = (ctx->x86_pv.max_pfn / fpp) + 1;
+
+    fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list);
+    if ( fll_mfn == 0 || fll_mfn > ctx->x86_pv.max_mfn )
+    {
+        ERROR("Bad mfn %#lx for p2m frame list list", fll_mfn);
+        goto err;
+    }
+
+    /* Map the guest top p2m. */
+    guest_fll = xc_map_foreign_range(xch, ctx->domid, PAGE_SIZE,
+                                     PROT_READ, fll_mfn);
+    if ( !guest_fll )
+    {
+        PERROR("Failed to map p2m frame list list at %#lx", fll_mfn);
+        goto err;
+    }
+
+    local_fll_size = fll_entries * sizeof(*local_fll);
+    local_fll = malloc(local_fll_size);
+    if ( !local_fll )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list list",
+              local_fll_size);
+        goto err;
+    }
+
+    copy_mfns_from_guest(ctx, local_fll, guest_fll, fll_entries);
+
+    /* Check for bad mfns in frame list list. */
+    for ( x = 0; x < fll_entries; ++x )
+    {
+        if ( local_fll[x] == 0 || local_fll[x] > ctx->x86_pv.max_mfn )
+        {
+            ERROR("Bad mfn %#lx at index %u (of %u) in p2m frame list list",
+                  local_fll[x], x, fll_entries);
+            goto err;
+        }
+    }
+
+    /* Map the guest mid p2m frames. */
+    guest_fl = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                    local_fll, fll_entries);
+    if ( !guest_fl )
+    {
+        PERROR("Failed to map p2m frame list");
+        goto err;
+    }
+
+    local_fl_size = fl_entries * sizeof(*local_fl);
+    local_fl = malloc(local_fl_size);
+    if ( !local_fl )
+    {
+        ERROR("Cannot allocate %zu bytes for local p2m frame list",
+              local_fl_size);
+        goto err;
+    }
+
+    copy_mfns_from_guest(ctx, local_fl, guest_fl, fl_entries);
+
+    for ( x = 0; x < fl_entries; ++x )
+    {
+        if ( local_fl[x] == 0 || local_fl[x] > ctx->x86_pv.max_mfn )
+        {
+            ERROR("Bad mfn %#lx at index %u (of %u) in p2m frame list",
+                  local_fl[x], x, fl_entries);
+            goto err;
+        }
+    }
+
+    /* Map the p2m leaves themselves. */
+    ctx->x86_pv.p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_READ,
+                                           local_fl, fl_entries);
+    if ( !ctx->x86_pv.p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    ctx->x86_pv.p2m_frames = fl_entries;
+    ctx->x86_pv.p2m_pfns = malloc(local_fl_size);
+    if ( !ctx->x86_pv.p2m_pfns )
+    {
+        ERROR("Cannot allocate %zu bytes for p2m pfns list",
+              local_fl_size);
+        goto err;
+    }
+
+    /* Convert leaf frames from mfns to pfns. */
+    for ( x = 0; x < fl_entries; ++x )
+    {
+        if ( !mfn_in_pseudophysmap(ctx, local_fl[x]) )
+        {
+            ERROR("Bad mfn in p2m_frame_list[%u]", x);
+            dump_bad_pseudophysmap_entry(ctx, local_fl[x]);
+            errno = ERANGE;
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[x] = mfn_to_pfn(ctx, local_fl[x]);
+    }
+
+    rc = 0;
+err:
+
+    free(local_fl);
+    if ( guest_fl )
+        munmap(guest_fl, fll_entries * PAGE_SIZE);
+
+    free(local_fll);
+    if ( guest_fll )
+        munmap(guest_fll, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Obtain a specific vcpu's basic state and write an X86_PV_VCPU_BASIC record
+ * into the stream.  Performs mfn->pfn conversion on architectural state.
+ */
+static int write_one_vcpu_basic(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn;
+    unsigned i;
+    int rc = -1;
+    vcpu_guest_context_any_t vcpu;
+    struct rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_BASIC,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+
+    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
+    {
+        PERROR("Failed to get vcpu%"PRIu32" context", id);
+        goto err;
+    }
+
+    /* Vcpu0 is special: Convert the suspend record to a pfn. */
+    if ( id == 0 )
+    {
+        mfn = GET_FIELD(ctx, &vcpu, user_regs.edx);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for suspend record");
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(ctx, &vcpu, user_regs.edx, mfn_to_pfn(ctx, mfn));
+    }
+
+    /* Convert GDT frames to pfns. */
+    for ( i = 0; (i * 512) < GET_FIELD(ctx, &vcpu, gdt_ents); ++i )
+    {
+        mfn = GET_FIELD(ctx, &vcpu, gdt_frames[i]);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for frame %u of vcpu%"PRIu32"'s GDT", i, id);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        SET_FIELD(ctx, &vcpu, gdt_frames[i], mfn_to_pfn(ctx, mfn));
+    }
+
+    /* Convert CR3 to a pfn. */
+    mfn = cr3_to_mfn(ctx, GET_FIELD(ctx, &vcpu, ctrlreg[3]));
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Bad mfn for vcpu%"PRIu32"'s cr3", id);
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        errno = ERANGE;
+        goto err;
+    }
+    pfn = mfn_to_pfn(ctx, mfn);
+    SET_FIELD(ctx, &vcpu, ctrlreg[3], mfn_to_cr3(ctx, pfn));
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to pfn. */
+    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
+    {
+        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("Bad mfn for vcpu%"PRIu32"'s cr1", id);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            errno = ERANGE;
+            goto err;
+        }
+        pfn = mfn_to_pfn(ctx, mfn);
+        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
+    }
+
+    if ( ctx->x86_pv.width == 8 )
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x64));
+    else
+        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x32));
+
+    if ( rc )
+        goto err;
+
+    DPRINTF("Writing vcpu%"PRIu32" basic context", id);
+    rc = 0;
+ err:
+
+    return rc;
+}
+
+/*
+ * Obtain a specific vcpu's extended state and write an X86_PV_VCPU_EXTENDED
+ * record into the stream.
+ */
+static int write_one_vcpu_extended(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+    struct rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_EXTENDED,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
+        .domain = ctx->domid,
+        .u.ext_vcpucontext.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%"PRIu32" extended context", id);
+        return -1;
+    }
+
+    rc = write_split_record(ctx, &rec, &domctl.u.ext_vcpucontext,
+                            domctl.u.ext_vcpucontext.size);
+    if ( rc )
+        return rc;
+
+    DPRINTF("Writing vcpu%"PRIu32" extended context", id);
+
+    return 0;
+}
+
+/*
+ * Query to see whether a specific vcpu has xsave state and if so, write an
+ * X86_PV_VCPU_XSAVE record into the stream.
+ */
+static int write_one_vcpu_xsave(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_XSAVE,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_getvcpuextstate,
+        .domain = ctx->domid,
+        .u.vcpuextstate.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%"PRIu32"'s xsave context", id);
+        goto err;
+    }
+
+    if ( !domctl.u.vcpuextstate.xfeature_mask )
+    {
+        DPRINTF("vcpu%"PRIu32" has no xsave context - skipping", id);
+        goto out;
+    }
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, domctl.u.vcpuextstate.size);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %"PRIx64" bytes for vcpu%"PRIu32"'s xsave context",
+              domctl.u.vcpuextstate.size, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%"PRIu32"'s xsave context", id);
+        goto err;
+    }
+
+    rc = write_split_record(ctx, &rec, buffer, domctl.u.vcpuextstate.size);
+    if ( rc )
+        goto err;
+
+    DPRINTF("Writing vcpu%"PRIu32" xsave context", id);
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * Query to see whether a specific vcpu has msr state and if so, write an
+ * X86_PV_VCPU_MSRS record into the stream.
+ */
+static int write_one_vcpu_msrs(struct context *ctx, uint32_t id)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = -1;
+    size_t buffersz;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    struct rec_x86_pv_vcpu_hdr vhdr =
+    {
+        .vcpu_id = id,
+    };
+    struct record rec =
+    {
+        .type = REC_TYPE_X86_PV_VCPU_MSRS,
+        .length = sizeof(vhdr),
+        .data = &vhdr,
+    };
+    struct xen_domctl domctl =
+    {
+        .cmd = XEN_DOMCTL_get_vcpu_msrs,
+        .domain = ctx->domid,
+        .u.vcpu_msrs.vcpu = id,
+    };
+
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%"PRIu32"'s msrs", id);
+        goto err;
+    }
+
+    if ( !domctl.u.vcpu_msrs.msr_count )
+    {
+        DPRINTF("vcpu%"PRIu32" has no msrs - skipping", id);
+        goto out;
+    }
+
+    buffersz = domctl.u.vcpu_msrs.msr_count * sizeof(xen_domctl_vcpu_msr_t);
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, buffersz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for vcpu%"PRIu32"'s msrs",
+              buffersz, id);
+        goto err;
+    }
+
+    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+    if ( xc_domctl(xch, &domctl) < 0 )
+    {
+        PERROR("Unable to get vcpu%"PRIu32"'s msrs", id);
+        goto err;
+    }
+
+    rc = write_split_record(ctx, &rec, buffer,
+                            domctl.u.vcpu_msrs.msr_count *
+                            sizeof(xen_domctl_vcpu_msr_t));
+    if ( rc )
+        goto err;
+
+    DPRINTF("Writing vcpu%"PRIu32" msrs", id);
+
+ out:
+    rc = 0;
+
+ err:
+    xc_hypercall_buffer_free(xch, buffer);
+
+    return rc;
+}
+
+/*
+ * For each vcpu, if it is online, write its state into the stream.
+ */
+static int write_all_vcpu_information(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xc_vcpuinfo_t vinfo;
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
+    {
+        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
+        if ( rc )
+        {
+            PERROR("Failed to get vcpu%"PRIu32" information", i);
+            return rc;
+        }
+
+        if ( !vinfo.online )
+        {
+            DPRINTF("vcpu%"PRIu32" offline - skipping", i);
+            continue;
+        }
+
+        rc = write_one_vcpu_basic(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_extended(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_xsave(ctx, i);
+        if ( rc )
+            return rc;
+
+        rc = write_one_vcpu_msrs(ctx, i);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
+/*
+ * Writes an X86_PV_INFO record into the stream.
+ */
+static int write_x86_pv_info(struct context *ctx)
+{
+    struct rec_x86_pv_info info =
+        {
+            .guest_width = ctx->x86_pv.width,
+            .pt_levels = ctx->x86_pv.levels,
+        };
+    struct record rec =
+        {
+            .type = REC_TYPE_X86_PV_INFO,
+            .length = sizeof(info),
+            .data = &info
+        };
+
+    return write_record(ctx, &rec);
+}
+
+/*
+ * Writes an X86_PV_P2M_FRAMES record into the stream.  This contains the list
+ * of pfns making up the p2m table.
+ */
+static int write_x86_pv_p2m_frames(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+    unsigned i;
+    size_t datasz = ctx->x86_pv.p2m_frames * sizeof(uint64_t);
+    uint64_t *data = NULL;
+    struct rec_x86_pv_p2m_frames hdr =
+        {
+            .start_pfn = 0,
+            .end_pfn = ctx->x86_pv.max_pfn,
+        };
+    struct record rec =
+        {
+            .type = REC_TYPE_X86_PV_P2M_FRAMES,
+            .length = sizeof(hdr),
+            .data = &hdr,
+        };
+
+    /* No need to translate if sizeof(uint64_t) == sizeof(xen_pfn_t). */
+    if ( sizeof(uint64_t) != sizeof(*ctx->x86_pv.p2m_pfns) )
+    {
+        if ( !(data = malloc(datasz)) )
+        {
+            ERROR("Cannot allocate %zu bytes for X86_PV_P2M_FRAMES data", datasz);
+            return -1;
+        }
+
+        for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+            data[i] = ctx->x86_pv.p2m_pfns[i];
+    }
+    else
+        data = (uint64_t *)ctx->x86_pv.p2m_pfns;
+
+    rc = write_split_record(ctx, &rec, data, datasz);
+
+    if ( data != (uint64_t *)ctx->x86_pv.p2m_pfns )
+        free(data);
+
+    return rc;
+}
+
+/*
+ * Writes an SHARED_INFO record into the stream.
+ */
+static int write_shared_info(struct context *ctx)
+{
+    struct record rec =
+    {
+        .type = REC_TYPE_SHARED_INFO,
+        .length = PAGE_SIZE,
+        .data = ctx->x86_pv.shinfo,
+    };
+
+    return write_record(ctx, &rec);
+}
+
+/*
+ * Normalise a pagetable for the migration stream.  Performs mfn->pfn
+ * conversions on the ptes.
+ */
+static int normalise_pagetable(struct context *ctx, const uint64_t *src,
+                               uint64_t *dst, unsigned long type)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t pte;
+    unsigned i, xen_first = -1, xen_last = -1; /* Indices of Xen mappings. */
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( ctx->x86_pv.levels == 4 )
+    {
+        /* 64bit guests only have Xen mappings in their L4 tables. */
+        if ( type == XEN_DOMCTL_PFINFO_L4TAB )
+        {
+            xen_first = 256;
+            xen_last = 271;
+        }
+    }
+    else
+    {
+        switch ( type )
+        {
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            ERROR("??? Found L4 table for 32bit guest");
+            errno = EINVAL;
+            return -1;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            /* 32bit guests can only use the first 4 entries of their L3 tables.
+             * All others are potentially used by Xen. */
+            xen_first = 4;
+            xen_last = 511;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            /* It is hard to spot Xen mappings in a 32bit guest's L2.  Most
+             * are normal but only a few will have Xen mappings.
+             *
+             * 428 = (HYPERVISOR_VIRT_START_PAE >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff
+             *
+             * ...which is conveniently unavailable to us in a 64bit build.
+             */
+            if ( pte_to_frame(ctx, src[428]) == ctx->x86_pv.compat_m2p_mfn0 )
+            {
+                xen_first = 428;
+                xen_last = 511;
+            }
+            break;
+        }
+    }
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        xen_pfn_t mfn, pfn;
+
+        pte = src[i];
+
+        /* Remove Xen mappings: Xen will reconstruct on the other side. */
+        if ( i >= xen_first && i <= xen_last )
+            pte = 0;
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            mfn = pte_to_frame(ctx, pte);
+
+            if ( pte & _PAGE_PSE )
+            {
+                ERROR("Cannot migrate superpage (L%lu[%u]: 0x%016"PRIx64")",
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i, pte);
+                errno = E2BIG;
+                return -1;
+            }
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                /*
+                 * This is expected during the live part of migration given
+                 * split pagetable updates, active grant mappings etc.  The
+                 * pagetable will need to be resent after pausing.  It is
+                 * however fatal if we have already paused the domain.
+                 */
+                if ( !ctx->dominfo.paused )
+                    errno = EAGAIN;
+                else
+                {
+                    ERROR("Bad mfn for L%lu[%u]",
+                          type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i);
+                    dump_bad_pseudophysmap_entry(ctx, mfn);
+                    errno = ERANGE;
+                }
+                return -1;
+            }
+
+            pfn = mfn_to_pfn(ctx, mfn);
+            update_pte(ctx, &pte, pfn);
+        }
+
+        dst[i] = pte;
+    }
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Performs pagetable normalisation on appropriate pages.
+ */
+static int x86_pv_normalise_page(struct context *ctx, xen_pfn_t type,
+                                 void **page)
+{
+    xc_interface *xch = ctx->xch;
+    void *local_page;
+    int rc;
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    local_page = malloc(PAGE_SIZE);
+    if ( !local_page )
+    {
+        ERROR("Unable to allocate scratch page");
+        rc = -1;
+        goto out;
+    }
+
+    rc = normalise_pagetable(ctx, *page, local_page, type);
+    *page = local_page;
+
+  out:
+    return rc;
+}
+
+/*
+ * save_ops function.  Queries domain information and maps the Xen m2p and the
+ * guest's shinfo and p2m table.
+ */
+static int x86_pv_setup(struct context *ctx)
+{
+    int rc;
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        return rc;
+
+    rc = map_shinfo(ctx);
+    if ( rc )
+        return rc;
+
+    rc = map_p2m(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Writes PV header records into the stream.
+ */
+static int x86_pv_start_of_stream(struct context *ctx)
+{
+    int rc;
+
+    rc = write_x86_pv_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_x86_pv_p2m_frames(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Writes tail records information into the stream.
+ */
+static int x86_pv_end_of_stream(struct context *ctx)
+{
+    int rc;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_shared_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = write_all_vcpu_information(ctx);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+/*
+ * save_ops function.  Cleanup.
+ */
+static int x86_pv_cleanup(struct context *ctx)
+{
+    free(ctx->x86_pv.p2m_pfns);
+
+    if ( ctx->x86_pv.p2m )
+        munmap(ctx->x86_pv.p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    if ( ctx->x86_pv.shinfo )
+        munmap(ctx->x86_pv.shinfo, PAGE_SIZE);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return 0;
+}
+
+struct save_ops save_ops_x86_pv =
+{
+    .normalise_page  = x86_pv_normalise_page,
+    .setup           = x86_pv_setup,
+    .start_of_stream = x86_pv_start_of_stream,
+    .end_of_stream   = x86_pv_end_of_stream,
+    .cleanup         = x86_pv_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (7 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 08/14] tools/libxc: x86 PV save code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:08   ` David Vrabel
  2014-06-12 15:49   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code Andrew Cooper
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h         |    2 +
 tools/libxc/saverestore/restore_x86_pv.c |  965 ++++++++++++++++++++++++++++++
 2 files changed, 967 insertions(+)
 create mode 100644 tools/libxc/saverestore/restore_x86_pv.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 5c8a370..bb21e01 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -248,6 +248,8 @@ extern struct common_ops common_ops_x86_pv;
 
 extern struct save_ops save_ops_x86_pv;
 
+extern struct restore_ops restore_ops_x86_pv;
+
 struct record
 {
     uint32_t type;
diff --git a/tools/libxc/saverestore/restore_x86_pv.c b/tools/libxc/saverestore/restore_x86_pv.c
new file mode 100644
index 0000000..3174d4c
--- /dev/null
+++ b/tools/libxc/saverestore/restore_x86_pv.c
@@ -0,0 +1,965 @@
+#include <assert.h>
+
+#include "common_x86_pv.h"
+
+/*
+ * Expand our local tracking information for the p2m table and the domain's maximum
+ * size.  Normally this will be called once to expand from 0 to max_pfn, but
+ * is liable to expand multiple times if the domain grows on the sending side
+ * after migration has started.
+ */
+static int expand_p2m(struct context *ctx, unsigned long max_pfn)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long old_max = ctx->x86_pv.max_pfn, i;
+    unsigned int fpp = PAGE_SIZE / ctx->x86_pv.width;
+    unsigned long end_frame = (max_pfn + fpp) / fpp;
+    unsigned long old_end_frame = (old_max + fpp) / fpp;
+    xen_pfn_t *p2m = NULL, *p2m_pfns = NULL;
+    uint32_t *pfn_types = NULL;
+    size_t p2msz, p2m_pfnsz, pfn_typesz;
+
+    assert(max_pfn > old_max);
+
+    p2msz = (max_pfn + 1) * ctx->x86_pv.width;
+    p2m = realloc(ctx->x86_pv.p2m, p2msz);
+    if ( !p2m )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m", p2msz);
+        return -1;
+    }
+    ctx->x86_pv.p2m = p2m;
+
+    pfn_typesz = (max_pfn + 1) * sizeof(*pfn_types);
+    pfn_types = realloc(ctx->x86_pv.pfn_types, pfn_typesz);
+    if ( !pfn_types )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for pfn_types", pfn_typesz);
+        return -1;
+    }
+    ctx->x86_pv.pfn_types = pfn_types;
+
+    p2m_pfnsz = (end_frame + 1) * sizeof(*p2m_pfns);
+    p2m_pfns = realloc(ctx->x86_pv.p2m_pfns, p2m_pfnsz);
+    if ( !p2m_pfns )
+    {
+        ERROR("Failed to (re)alloc %zu bytes for p2m frame list", p2m_pfnsz);
+        return -1;
+    }
+    ctx->x86_pv.p2m_frames = end_frame;
+    ctx->x86_pv.p2m_pfns = p2m_pfns;
+
+    ctx->x86_pv.max_pfn = max_pfn;
+    for ( i = (old_max ? old_max + 1 : 0); i <= max_pfn; ++i )
+    {
+        ctx->ops.set_gfn(ctx, i, INVALID_MFN);
+        ctx->ops.set_page_type(ctx, i, 0);
+    }
+
+    for ( i = (old_end_frame ? old_end_frame + 1 : 0); i <= end_frame; ++i )
+        ctx->x86_pv.p2m_pfns[i] = INVALID_MFN;
+
+    DPRINTF("Expanded p2m from %#lx to %#lx", old_max, max_pfn);
+    return 0;
+}
+
+/*
+ * Pin all of the pagetables.  TODO - batch the hypercalls.
+ */
+static int pin_pagetables(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long i;
+    struct mmuext_op pin;
+
+    DPRINTF("Pinning pagetables");
+
+    for ( i = 0; i <= ctx->x86_pv.max_pfn; ++i )
+    {
+        if ( (ctx->x86_pv.pfn_types[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( ctx->x86_pv.pfn_types[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            pin.cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin.cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin.cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin.cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+        default:
+            continue;
+        }
+
+        pin.arg1.mfn = ctx->ops.pfn_to_gfn(ctx, i);
+
+        if ( xc_mmuext_op(xch, &pin, 1, ctx->domid) != 0 )
+        {
+            PERROR("Failed to pin page table for pfn %#lx", i);
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Update details in a guest's start_info structure.
+ */
+static int process_start_info(struct context *ctx, vcpu_guest_context_any_t *vcpu)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t pfn, mfn;
+    start_info_any_t *guest_start_info = NULL;
+    int rc = -1;
+
+    pfn = GET_FIELD(ctx, vcpu, user_regs.edx);
+
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Start Info pfn %#lx out of range", pfn);
+        goto err;
+    }
+    else if ( ctx->x86_pv.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+    {
+        ERROR("Start Info pfn %#lx has bad type %"PRIu32, pfn,
+              ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Start Info has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    guest_start_info = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn);
+    if ( !guest_start_info )
+    {
+        PERROR("Failed to map Start Info at mfn %#lx", mfn);
+        goto err;
+    }
+
+    /* Deal with xenstore stuff */
+    pfn = GET_FIELD(ctx, guest_start_info, store_mfn);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("XenStore pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("XenStore pfn has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.xenstore_mfn = mfn;
+    SET_FIELD(ctx, guest_start_info, store_mfn, mfn);
+    SET_FIELD(ctx, guest_start_info, store_evtchn, ctx->restore.xenstore_evtchn);
+
+    /* Deal with console stuff */
+    pfn = GET_FIELD(ctx, guest_start_info, console.domU.mfn);
+    if ( pfn > ctx->x86_pv.max_pfn )
+    {
+        ERROR("Console pfn %#lx out of range", pfn);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("Console pfn has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    ctx->restore.console_mfn = mfn;
+    SET_FIELD(ctx, guest_start_info, console.domU.mfn, mfn);
+    SET_FIELD(ctx, guest_start_info, console.domU.evtchn, ctx->restore.console_evtchn);
+
+    /* Set other information */
+    SET_FIELD(ctx, guest_start_info, nr_pages, ctx->x86_pv.max_pfn + 1);
+    SET_FIELD(ctx, guest_start_info, shared_info,
+              ctx->dominfo.shared_info_frame << PAGE_SHIFT);
+    SET_FIELD(ctx, guest_start_info, flags, 0);
+
+    SET_FIELD(ctx, vcpu, user_regs.edx, mfn);
+    rc = 0;
+
+err:
+    if ( guest_start_info )
+        munmap(guest_start_info, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Copy the p2m which has been constructed locally as memory has been
+ * allocated, over the p2m in guest, so the guest can find its memory again on
+ * resume.
+ */
+static int update_guest_p2m(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t mfn, pfn, *guest_p2m = NULL;
+    unsigned i;
+    int rc = -1;
+
+    for ( i = 0; i < ctx->x86_pv.p2m_frames; ++i )
+    {
+        pfn = ctx->x86_pv.p2m_pfns[i];
+
+        if ( pfn > ctx->x86_pv.max_pfn )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] out of range",
+                  pfn, i);
+            goto err;
+        }
+        else if ( ctx->x86_pv.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+        {
+            ERROR("pfn (%#lx) for p2m_frame_list[%u] has bad type %"PRIu32, pfn, i,
+                  ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("p2m_frame_list[%u] has bad mfn", i);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        ctx->x86_pv.p2m_pfns[i] = mfn;
+    }
+
+    guest_p2m = xc_map_foreign_pages(xch, ctx->domid, PROT_WRITE,
+                                     ctx->x86_pv.p2m_pfns,
+                                     ctx->x86_pv.p2m_frames );
+    if ( !guest_p2m )
+    {
+        PERROR("Failed to map p2m frames");
+        goto err;
+    }
+
+    memcpy(guest_p2m, ctx->x86_pv.p2m,
+           (ctx->x86_pv.max_pfn + 1) * ctx->x86_pv.width);
+    rc = 0;
+ err:
+    if ( guest_p2m )
+        munmap(guest_p2m, ctx->x86_pv.p2m_frames * PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * Process a toolstack record.  TODO - remove from spec and code once libxl
+ * framing is sorted.
+ */
+static int handle_toolstack(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore )
+        return 0;
+
+    rc = ctx->restore.callbacks->toolstack_restore(ctx->domid, rec->data, rec->length,
+                                                   ctx->restore.callbacks->data);
+    if ( rc < 0 )
+        PERROR("restoring toolstack");
+    return rc;
+}
+
+/*
+ * Process an X86_PV_INFO record.
+ */
+static int handle_x86_pv_info(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_info *info = rec->data;
+
+    if ( rec->length < sizeof(*info) )
+    {
+        ERROR("X86_PV_INFO record truncated: length %"PRIu32", expected %zu",
+              rec->length, sizeof(*info));
+        return -1;
+    }
+    else if ( info->guest_width != 4 &&
+              info->guest_width != 8 )
+    {
+        ERROR("Unexpected guest width %"PRIu32", expected 4 or 8",
+              info->guest_width);
+        return -1;
+    }
+    else if ( info->guest_width != ctx->x86_pv.width )
+    {
+        int rc;
+        struct xen_domctl domctl;
+
+        /* Try to set the address size; the domain is always created 64-bit. */
+        memset(&domctl, 0, sizeof(domctl));
+        domctl.domain = ctx->domid;
+        domctl.cmd    = XEN_DOMCTL_set_address_size;
+        domctl.u.address_size.size = info->guest_width * 8;
+        rc = do_domctl(xch, &domctl);
+        if ( rc != 0 )
+        {
+            ERROR("Width of guest in stream (%"PRIu32
+                  " bits) differs from existing domain (%"PRIu32" bits)",
+                  info->guest_width * 8, ctx->x86_pv.width * 8);
+            return -1;
+        }
+
+        /* Domain information has changed; refresh it. */
+        rc = x86_pv_domain_info(ctx);
+        if ( rc != 0 )
+        {
+            ERROR("Unable to refresh guest information");
+            return -1;
+        }
+    }
+    else if ( info->pt_levels != 3 &&
+              info->pt_levels != 4 )
+    {
+        ERROR("Unexpected pagetable levels %"PRIu32", expected 3 or 4",
+              info->pt_levels);
+        return -1;
+    }
+    else if ( info->pt_levels != ctx->x86_pv.levels )
+    {
+        ERROR("Pagetable levels of guest in stream (%"PRIu32
+              ") differ from existing domain (%"PRIu32")",
+              info->pt_levels, ctx->x86_pv.levels);
+        return -1;
+    }
+
+    DPRINTF("X86_PV_INFO record: %d bits, %d levels",
+            ctx->x86_pv.width * 8, ctx->x86_pv.levels);
+    return 0;
+}
+
+/*
+ * Process an X86_PV_P2M_FRAMES record.  Takes care of expanding the local p2m
+ * state if needed.
+ */
+static int handle_x86_pv_p2m_frames(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_p2m_frames *data = rec->data;
+    unsigned start, end, x, fpp = PAGE_SIZE / ctx->x86_pv.width;
+    int rc;
+
+    if ( rec->length < sizeof(*data) + sizeof(uint64_t) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record truncated: length %"PRIu32", min %zu",
+              rec->length, sizeof(*data) + sizeof(uint64_t));
+        return -1;
+    }
+    else if ( data->start_pfn > data->end_pfn )
+    {
+        ERROR("Start pfn in stream (%#"PRIx32") exceeds end pfn (%#"PRIx32")",
+              data->start_pfn, data->end_pfn);
+        return -1;
+    }
+
+    start =  data->start_pfn / fpp;
+    end = data->end_pfn / fpp + 1;
+
+    if ( rec->length != sizeof(*data) + ((end - start) * sizeof(uint64_t)) )
+    {
+        ERROR("X86_PV_P2M_FRAMES record wrong size: start_pfn %#"PRIx32
+              ", end_pfn %#"PRIx32", length %"PRIu32
+              ", expected %zu + (%u - %u) * %zu",
+              data->start_pfn, data->end_pfn, rec->length,
+              sizeof(*data), end, start, sizeof(uint64_t));
+        return -1;
+    }
+
+    if ( data->end_pfn > ctx->x86_pv.max_pfn )
+    {
+        rc = expand_p2m(ctx, data->end_pfn);
+        if ( rc )
+            return rc;
+    }
+
+    for ( x = 0; x < (end - start); ++x )
+        ctx->x86_pv.p2m_pfns[start + x] = data->p2m_pfns[x];
+
+    DPRINTF("X86_PV_P2M_FRAMES record: GFNs %#"PRIx32"->%#"PRIx32,
+            data->start_pfn, data->end_pfn);
+    return 0;
+}
+
+/*
+ * Process an X86_PV_VCPU_BASIC record from the stream.
+ */
+static int handle_x86_pv_vcpu_basic(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu_hdr *vhdr = rec->data;
+    vcpu_guest_context_any_t vcpu;
+    size_t vcpusz = ctx->x86_pv.width == 8 ? sizeof(vcpu.x64) : sizeof(vcpu.x32);
+    xen_pfn_t pfn, mfn;
+    unsigned long tmp;
+    unsigned i;
+    int rc = -1;
+
+    if ( rec->length <= sizeof(*vhdr) )
+    {
+        ERROR("X86_PV_VCPU_BASIC record truncated: length %"PRIu32", min %zu",
+              rec->length, sizeof(*vhdr) + 1);
+        goto err;
+    }
+    else if ( rec->length != sizeof(*vhdr) + vcpusz )
+    {
+        ERROR("X86_PV_VCPU_BASIC record wrong size: length %"PRIu32
+              ", expected %zu", rec->length, sizeof(*vhdr) + vcpusz);
+        goto err;
+    }
+    else if ( vhdr->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_BASIC record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vhdr->vcpu_id, ctx->dominfo.max_vcpu_id);
+        goto err;
+    }
+
+    memcpy(&vcpu, &vhdr->context, vcpusz);
+
+    SET_FIELD(ctx, &vcpu, flags, GET_FIELD(ctx, &vcpu, flags) | VGCF_online);
+
+    /* Vcpu 0 is special: Convert the suspend record to an mfn. */
+    if ( vhdr->vcpu_id == 0 )
+    {
+        rc = process_start_info(ctx, &vcpu);
+        if ( rc )
+            return rc;
+        rc = -1;
+    }
+
+    tmp = GET_FIELD(ctx, &vcpu, gdt_ents);
+    if ( tmp > 8192 )
+    {
+        ERROR("GDT entry count (%lu) out of range", tmp);
+        errno = ERANGE;
+        goto err;
+    }
+
+    /* Convert GDT frames to mfns. */
+    for ( i = 0; (i * 512) < tmp; ++i )
+    {
+        pfn = GET_FIELD(ctx, &vcpu, gdt_frames[i]);
+        if ( pfn >= ctx->x86_pv.max_pfn )
+        {
+            ERROR("GDT frame %u (pfn %#lx) out of range", i, pfn);
+            goto err;
+        }
+        else if ( ctx->x86_pv.pfn_types[pfn] != XEN_DOMCTL_PFINFO_NOTAB )
+        {
+            ERROR("GDT frame %u (pfn %#lx) has bad type %"PRIu32, i, pfn,
+                  ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT);
+            goto err;
+        }
+
+        mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("GDT frame %u has bad mfn", i);
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        SET_FIELD(ctx, &vcpu, gdt_frames[i], mfn);
+    }
+
+    /* Convert CR3 to an mfn. */
+    pfn = cr3_to_mfn(ctx, GET_FIELD(ctx, &vcpu, ctrlreg[3]));
+    if ( pfn >= ctx->x86_pv.max_pfn )
+    {
+        ERROR("cr3 (pfn %#lx) out of range", pfn);
+        goto err;
+    }
+    else if ( (ctx->x86_pv.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) !=
+              (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+    {
+        ERROR("cr3 (pfn %#lx) has bad type %"PRIu32", expected %"PRIu32, pfn,
+              ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT,
+              ctx->x86_pv.levels);
+        goto err;
+    }
+
+    mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+    if ( !mfn_in_pseudophysmap(ctx, mfn) )
+    {
+        ERROR("cr3 has bad mfn");
+        dump_bad_pseudophysmap_entry(ctx, mfn);
+        goto err;
+    }
+
+    SET_FIELD(ctx, &vcpu, ctrlreg[3], mfn_to_cr3(ctx, mfn));
+
+    /* 64bit guests: Convert CR1 (guest pagetables) to mfn. */
+    if ( ctx->x86_pv.levels == 4 && (vcpu.x64.ctrlreg[1] & 1) )
+    {
+        pfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
+
+        if ( pfn >= ctx->x86_pv.max_pfn )
+        {
+            ERROR("cr1 (pfn %#lx) out of range", pfn);
+            goto err;
+        }
+        else if ( (ctx->x86_pv.pfn_types[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) !=
+                  (((xen_pfn_t)ctx->x86_pv.levels) << XEN_DOMCTL_PFINFO_LTAB_SHIFT) )
+        {
+            ERROR("cr1 (pfn %#lx) has bad type %"PRIu32", expected %"PRIu32, pfn,
+                  ctx->x86_pv.pfn_types[pfn] >> XEN_DOMCTL_PFINFO_LTAB_SHIFT,
+                  ctx->x86_pv.levels);
+            goto err;
+        }
+
+        mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+        if ( !mfn_in_pseudophysmap(ctx, mfn) )
+        {
+            ERROR("cr1 has bad mfn");
+            dump_bad_pseudophysmap_entry(ctx, mfn);
+            goto err;
+        }
+
+        vcpu.x64.ctrlreg[1] = (uint64_t)mfn << PAGE_SHIFT;
+    }
+
+    if ( xc_vcpu_setcontext(xch, ctx->domid, vhdr->vcpu_id, &vcpu) )
+    {
+        PERROR("Failed to set vcpu%"PRIu32"'s basic info", vhdr->vcpu_id);
+        goto err;
+    }
+
+    rc = 0;
+    DPRINTF("vcpu%"PRIu32" X86_PV_VCPU_BASIC record", vhdr->vcpu_id);
+ err:
+    return rc;
+}
+
+/*
+ * Process an X86_PV_VCPU_EXTENDED record from the stream.
+ */
+static int handle_x86_pv_vcpu_extended(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu_hdr *vcpu = rec->data;
+    DECLARE_DOMCTL;
+
+    if ( rec->length <= sizeof(*vcpu) )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record truncated: length %"PRIu32", min %zu",
+              rec->length, sizeof(*vcpu) + 1);
+        return -1;
+    }
+    else if ( rec->length > sizeof(*vcpu) + 128 )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record too long: length %"PRIu32", max %zu",
+              rec->length, sizeof(*vcpu) + 128);
+        return -1;
+    }
+    else if ( vcpu->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_EXTENDED record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vcpu->vcpu_id, ctx->dominfo.max_vcpu_id);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_set_ext_vcpucontext;
+    domctl.domain = ctx->domid;
+    memcpy(&domctl.u.ext_vcpucontext, &vcpu->context,
+           rec->length - sizeof(*vcpu));
+
+    if ( xc_domctl(xch, &domctl) != 0 )
+    {
+        PERROR("Failed to set vcpu%"PRIu32"'s extended info", vcpu->vcpu_id);
+        return -1;
+    }
+
+    DPRINTF("vcpu%"PRIu32" X86_PV_VCPU_EXTENDED record", vcpu->vcpu_id);
+    return 0;
+}
+
+/*
+ * Process an X86_PV_VCPU_XSAVE record from the stream.
+ */
+static int handle_x86_pv_vcpu_xsave(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu_hdr *vhdr = rec->data;
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    size_t buffersz;
+
+    if ( rec->length <= sizeof(*vhdr) )
+    {
+        ERROR("X86_PV_VCPU_XSAVE record truncated: length %"PRIu32", min %zu",
+              rec->length, sizeof(*vhdr) + 1);
+        return -1;
+    }
+    else if ( vhdr->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_XSAVE record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vhdr->vcpu_id, ctx->dominfo.max_vcpu_id);
+        return -1;
+    }
+
+    buffersz = rec->length - sizeof(*vhdr);
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, buffersz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for xsave hypercall buffer",
+              buffersz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setvcpuextstate;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpuextstate.vcpu = vhdr->vcpu_id;
+    domctl.u.vcpuextstate.size = buffersz;
+    set_xen_guest_handle(domctl.u.vcpuextstate.buffer, buffer);
+
+    memcpy(buffer, vhdr->context, buffersz);
+
+    rc = xc_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    if ( rc )
+        PERROR("Failed to set vcpu%"PRIu32"'s xsave info", vhdr->vcpu_id);
+    else
+        DPRINTF("vcpu%"PRIu32" X86_PV_VCPU_XSAVE record", vhdr->vcpu_id);
+
+    return rc;
+}
+
+/*
+ * Process an X86_PV_VCPU_MSRS record from the stream.
+ */
+static int handle_x86_pv_vcpu_msrs(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_x86_pv_vcpu_hdr *vhdr = rec->data;
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(void, buffer);
+    size_t buffersz = rec->length - sizeof(*vhdr);
+
+    if ( rec->length <= sizeof(*vhdr) )
+    {
+        ERROR("X86_PV_VCPU_MSRS record truncated: length %"PRIu32", min %zu",
+              rec->length, sizeof(*vhdr) + 1);
+        return -1;
+    }
+    else if ( vhdr->vcpu_id > ctx->dominfo.max_vcpu_id )
+    {
+        ERROR("X86_PV_VCPU_MSRS record vcpu_id (%"PRIu32
+              ") exceeds domain max (%u)",
+              vhdr->vcpu_id, ctx->dominfo.max_vcpu_id);
+        return -1;
+    }
+    else if ( buffersz % sizeof(xen_domctl_vcpu_msr_t) != 0 )
+    {
+        ERROR("X86_PV_VCPU_MSRS payload size %zu"
+              " expected to be a multiple of %zu",
+              buffersz, sizeof(xen_domctl_vcpu_msr_t));
+        return -1;
+    }
+
+    buffer = xc_hypercall_buffer_alloc(xch, buffer, buffersz);
+    if ( !buffer )
+    {
+        ERROR("Unable to allocate %zu bytes for msr hypercall buffer",
+              buffersz);
+        return -1;
+    }
+
+    domctl.cmd = XEN_DOMCTL_set_vcpu_msrs;
+    domctl.domain = ctx->domid;
+    domctl.u.vcpu_msrs.vcpu = vhdr->vcpu_id;
+    domctl.u.vcpu_msrs.msr_count = buffersz / sizeof(xen_domctl_vcpu_msr_t);
+    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+
+    memcpy(buffer, vhdr->context, buffersz);
+
+    rc = xc_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, buffer);
+
+    if ( rc )
+        PERROR("Failed to set vcpu%"PRIu32"'s msrs", vhdr->vcpu_id);
+    else
+        DPRINTF("vcpu%"PRIu32" X86_PV_VCPU_MSRS record", vhdr->vcpu_id);
+
+    return rc;
+}
+
+/*
+ * Process a SHARED_INFO record from the stream.
+ */
+static int handle_shared_info(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned i;
+    int rc = -1;
+    shared_info_any_t *guest_shared_info = NULL;
+    shared_info_any_t *stream_shared_info = rec->data;
+
+    if ( rec->length != PAGE_SIZE )
+    {
+        ERROR("SHARED_INFO record wrong size: length %"PRIu32
+              ", expected %lu", rec->length, PAGE_SIZE);
+        goto err;
+    }
+
+    guest_shared_info = xc_map_foreign_range(
+        xch, ctx->domid, PAGE_SIZE, PROT_READ | PROT_WRITE,
+        ctx->dominfo.shared_info_frame);
+    if ( !guest_shared_info )
+    {
+        PERROR("Failed to map Shared Info at mfn %#lx",
+               ctx->dominfo.shared_info_frame);
+        goto err;
+    }
+
+    MEMCPY_FIELD(ctx, guest_shared_info, stream_shared_info, vcpu_info);
+    MEMCPY_FIELD(ctx, guest_shared_info, stream_shared_info, arch);
+
+    SET_FIELD(ctx, guest_shared_info, arch.pfn_to_mfn_frame_list_list, 0);
+
+    MEMSET_ARRAY_FIELD(ctx, guest_shared_info, evtchn_pending, 0);
+    for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
+        SET_FIELD(ctx, guest_shared_info, vcpu_info[i].evtchn_pending_sel, 0);
+
+    MEMSET_ARRAY_FIELD(ctx, guest_shared_info, evtchn_mask, 0xff);
+
+    rc = 0;
+ err:
+
+    if ( guest_shared_info )
+        munmap(guest_shared_info, PAGE_SIZE);
+
+    return rc;
+}
+
+/*
+ * restore_ops function.  Convert pfns back to mfns in pagetables.  Possibly
+ * needs to populate new frames if a PTE is found referring to a frame which
+ * hasn't yet been seen in PAGE_DATA records.
+ */
+static int x86_pv_localise_page(struct context *ctx, uint32_t type, void *page)
+{
+    xc_interface *xch = ctx->xch;
+    uint64_t *table = page;
+    uint64_t pte;
+    unsigned i;
+
+    type &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+
+    /* Only page tables need localisation. */
+    if ( type < XEN_DOMCTL_PFINFO_L1TAB || type > XEN_DOMCTL_PFINFO_L4TAB )
+        return 0;
+
+    for ( i = 0; i < (PAGE_SIZE / sizeof(uint64_t)); ++i )
+    {
+        pte = table[i];
+
+        if ( pte & _PAGE_PRESENT )
+        {
+            xen_pfn_t mfn, pfn;
+
+            pfn = pte_to_frame(ctx, pte);
+            mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+
+            if ( mfn == INVALID_MFN )
+            {
+                if ( populate_pfns(ctx, 1, &pfn, &type) )
+                    return -1;
+
+                mfn = ctx->ops.pfn_to_gfn(ctx, pfn);
+            }
+
+            if ( !mfn_in_pseudophysmap(ctx, mfn) )
+            {
+                ERROR("Bad mfn for L%"PRIu32"[%u]",
+                      type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT, i);
+                dump_bad_pseudophysmap_entry(ctx, mfn);
+                errno = ERANGE;
+                return -1;
+            }
+
+            update_pte(ctx, &pte, mfn);
+
+            table[i] = pte;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * restore_ops function.  Confirm that the incoming stream matches the type of
+ * domain we are attempting to restore into.
+ */
+static int x86_pv_setup(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_X86_PV )
+    {
+        ERROR("Unable to restore %s domain into an x86_pv domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != PAGE_SIZE )
+    {
+        ERROR("Invalid page size %d for x86_pv domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    rc = x86_pv_domain_info(ctx);
+    if ( rc )
+        return rc;
+
+    rc = x86_pv_map_m2p(ctx);
+    if ( rc )
+        return rc;
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_pv_process_record(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+
+    switch ( rec->type )
+    {
+    case REC_TYPE_X86_PV_INFO:
+        return handle_x86_pv_info(ctx, rec);
+
+    case REC_TYPE_X86_PV_P2M_FRAMES:
+        return handle_x86_pv_p2m_frames(ctx, rec);
+
+    case REC_TYPE_X86_PV_VCPU_BASIC:
+        return handle_x86_pv_vcpu_basic(ctx, rec);
+
+    case REC_TYPE_X86_PV_VCPU_EXTENDED:
+        return handle_x86_pv_vcpu_extended(ctx, rec);
+
+    case REC_TYPE_X86_PV_VCPU_XSAVE:
+        return handle_x86_pv_vcpu_xsave(ctx, rec);
+
+    case REC_TYPE_SHARED_INFO:
+        return handle_shared_info(ctx, rec);
+
+    case REC_TYPE_TOOLSTACK:
+        return handle_toolstack(ctx, rec);
+
+    case REC_TYPE_TSC_INFO:
+        return handle_tsc_info(ctx, rec);
+
+    case REC_TYPE_X86_PV_VCPU_MSRS:
+        return handle_x86_pv_vcpu_msrs(ctx, rec);
+
+    default:
+        if ( rec->type & REC_TYPE_OPTIONAL )
+        {
+            IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
+                    rec->type, rec_type_to_str(rec->type));
+            return 0;
+        }
+
+        ERROR("Invalid record type (0x%"PRIx32", %s) for x86_pv domains",
+              rec->type, rec_type_to_str(rec->type));
+        return -1;
+    }
+}
+
+/*
+ * restore_ops function.  Pin the pagetables, rewrite the p2m and seed the
+ * grant table.
+ */
+static int x86_pv_stream_complete(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = pin_pagetables(ctx);
+    if ( rc )
+        return rc;
+
+    rc = update_guest_p2m(ctx);
+    if ( rc )
+        return rc;
+
+    rc = xc_dom_gnttab_seed(xch, ctx->domid,
+                            ctx->restore.console_mfn,
+                            ctx->restore.xenstore_mfn,
+                            ctx->restore.console_domid,
+                            ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        return rc;
+    }
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_pv_cleanup(struct context *ctx)
+{
+    free(ctx->x86_pv.p2m);
+    free(ctx->x86_pv.p2m_pfns);
+    free(ctx->x86_pv.pfn_types);
+
+    if ( ctx->x86_pv.m2p )
+        munmap(ctx->x86_pv.m2p, ctx->x86_pv.nr_m2p_frames * PAGE_SIZE);
+
+    return 0;
+}
+
+struct restore_ops restore_ops_x86_pv =
+{
+    .localise_page   = x86_pv_localise_page,
+    .setup           = x86_pv_setup,
+    .process_record  = x86_pv_process_record,
+    .stream_complete = x86_pv_stream_complete,
+    .cleanup         = x86_pv_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (8 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:11   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code Andrew Cooper
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h         |    1 +
 tools/libxc/saverestore/common_x86_hvm.c |   39 ++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)
 create mode 100644 tools/libxc/saverestore/common_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index bb21e01..7ba9c4f 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -245,6 +245,7 @@ struct context
 };
 
 extern struct common_ops common_ops_x86_pv;
+extern struct common_ops common_ops_x86_hvm;
 
 extern struct save_ops save_ops_x86_pv;
 
diff --git a/tools/libxc/saverestore/common_x86_hvm.c b/tools/libxc/saverestore/common_x86_hvm.c
new file mode 100644
index 0000000..3701add
--- /dev/null
+++ b/tools/libxc/saverestore/common_x86_hvm.c
@@ -0,0 +1,39 @@
+#include "common.h"
+
+static bool x86_hvm_pfn_is_valid(const struct context *ctx, xen_pfn_t pfn)
+{
+    return true;
+}
+
+static xen_pfn_t x86_hvm_pfn_to_gfn(const struct context *ctx, xen_pfn_t pfn)
+{
+    return pfn;
+}
+
+static void x86_hvm_set_gfn(struct context *ctx, xen_pfn_t pfn,
+                            xen_pfn_t gfn)
+{
+    /* no-op */
+}
+
+static void x86_hvm_set_page_type(struct context *ctx, xen_pfn_t pfn, xen_pfn_t type)
+{
+    /* no-op */
+}
+
+struct common_ops common_ops_x86_hvm = {
+    .pfn_is_valid   = x86_hvm_pfn_is_valid,
+    .pfn_to_gfn     = x86_hvm_pfn_to_gfn,
+    .set_gfn        = x86_hvm_set_gfn,
+    .set_page_type  = x86_hvm_set_page_type,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread
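[Editorial note] The common code above reduces guest-type differences to a table of function pointers, with the HVM implementations being trivial. As a small standalone illustration (not the libxc code itself; all `demo_` names are invented for this sketch), the dispatch pattern looks like this, mirroring the identity mapping of `x86_hvm_pfn_to_gfn()`:

```c
#include <stdint.h>

typedef uint64_t xen_pfn_t;

/*
 * Sketch of the common_ops dispatch pattern: guest-type specific
 * behaviour hangs off function pointers so the generic save/restore
 * loops stay type-agnostic.  For HVM guests, every pfn is valid and
 * the pfn-to-gfn mapping is the identity.
 */
struct demo_common_ops {
    int       (*pfn_is_valid)(xen_pfn_t pfn);
    xen_pfn_t (*pfn_to_gfn)(xen_pfn_t pfn);
};

static int demo_hvm_pfn_is_valid(xen_pfn_t pfn)
{
    (void)pfn;          /* every pfn is valid for an HVM guest */
    return 1;
}

static xen_pfn_t demo_hvm_pfn_to_gfn(xen_pfn_t pfn)
{
    return pfn;         /* identity mapping */
}

static const struct demo_common_ops demo_ops_hvm = {
    .pfn_is_valid = demo_hvm_pfn_is_valid,
    .pfn_to_gfn   = demo_hvm_pfn_to_gfn,
};
```

A PV implementation would instead consult the p2m table behind the same two hooks, leaving callers unchanged.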

* [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (9 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:12   ` David Vrabel
  2014-06-12 15:55   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 12/14] tools/libxc: x86 HVM restore code Andrew Cooper
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h       |    1 +
 tools/libxc/saverestore/save_x86_hvm.c |  223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 tools/libxc/saverestore/save_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 7ba9c4f..9e70ed6 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -248,6 +248,7 @@ extern struct common_ops common_ops_x86_pv;
 extern struct common_ops common_ops_x86_hvm;
 
 extern struct save_ops save_ops_x86_pv;
+extern struct save_ops save_ops_x86_hvm;
 
 extern struct restore_ops restore_ops_x86_pv;
 
diff --git a/tools/libxc/saverestore/save_x86_hvm.c b/tools/libxc/saverestore/save_x86_hvm.c
new file mode 100644
index 0000000..6548eae
--- /dev/null
+++ b/tools/libxc/saverestore/save_x86_hvm.c
@@ -0,0 +1,223 @@
+#include <assert.h>
+
+#include "common_x86.h"
+
+#include <xen/hvm/params.h>
+
+/*
+ * Query for the HVM context and write an HVM_CONTEXT record into the stream.
+ */
+static int write_hvm_context(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long hvm_buf_size;
+    int rc;
+    struct record hvm_rec =
+    {
+        .type = REC_TYPE_HVM_CONTEXT,
+    };
+
+    hvm_buf_size = xc_domain_hvm_getcontext(xch, ctx->domid, 0, 0);
+    if ( hvm_buf_size == -1 )
+    {
+        PERROR("Couldn't get HVM context size from Xen");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_rec.data = malloc(hvm_buf_size);
+    if ( !hvm_rec.data )
+    {
+        PERROR("Couldn't allocate memory");
+        rc = -1;
+        goto out;
+    }
+
+    hvm_rec.length = xc_domain_hvm_getcontext(xch, ctx->domid,
+                                              hvm_rec.data, hvm_buf_size);
+    if ( hvm_rec.length < 0 )
+    {
+        PERROR("Couldn't get HVM context from Xen");
+        rc = -1;
+        goto out;
+    }
+
+    rc = write_record(ctx, &hvm_rec);
+    if ( rc < 0 )
+    {
+        PERROR("Error writing HVM_CONTEXT record");
+        goto out;
+    }
+
+ out:
+    free(hvm_rec.data);
+    return rc;
+}
+
+/*
+ * Query for a range of HVM parameters and write an HVM_PARAMS record into the
+ * stream.
+ */
+static int write_hvm_params(struct context *ctx)
+{
+    static const unsigned int params[] = {
+        HVM_PARAM_STORE_PFN,
+        HVM_PARAM_IOREQ_PFN,
+        HVM_PARAM_BUFIOREQ_PFN,
+        HVM_PARAM_PAGING_RING_PFN,
+        HVM_PARAM_ACCESS_RING_PFN,
+        HVM_PARAM_SHARING_RING_PFN,
+        HVM_PARAM_VM86_TSS,
+        HVM_PARAM_CONSOLE_PFN,
+        HVM_PARAM_ACPI_IOPORTS_LOCATION,
+        HVM_PARAM_VIRIDIAN,
+        HVM_PARAM_IDENT_PT,
+        HVM_PARAM_PAE_ENABLED,
+        HVM_PARAM_VM_GENERATION_ID_ADDR,
+    };
+
+    xc_interface *xch = ctx->xch;
+    struct rec_hvm_params_entry entries[ARRAY_SIZE(params)];
+    struct rec_hvm_params hdr = {
+        .count = 0,
+    };
+    struct record rec = {
+        .type   = REC_TYPE_HVM_PARAMS,
+        .length = sizeof(hdr),
+        .data   = &hdr,
+    };
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < ARRAY_SIZE(params); i++ )
+    {
+        uint32_t index = params[i];
+        uint64_t value;
+
+        rc = xc_get_hvm_param(xch, ctx->domid, index, (unsigned long *)&value);
+        if ( rc )
+        {
+            /* Gross XenServer hack.  Consider HVM_PARAM_CONSOLE_PFN failure
+             * nonfatal, as it is impossible to distinguish "no console"
+             * from a console at pfn/evtchn 0.
+             *
+             * TODO - find a compatible way to fix this.
+             */
+            if ( index == HVM_PARAM_CONSOLE_PFN )
+                continue;
+
+            PERROR("Failed to get HVMPARAM at index %u", index);
+            return rc;
+        }
+
+        if ( value != 0 )
+        {
+            entries[hdr.count].index = index;
+            entries[hdr.count].value = value;
+            hdr.count++;
+        }
+    }
+
+    rc = write_split_record(ctx, &rec, entries, hdr.count * sizeof(*entries));
+    if ( rc )
+        PERROR("Failed to write HVM_PARAMS record");
+
+    return rc;
+}
+
+/* TODO - remove. */
+static int write_toolstack(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct record rec = {
+        .type = REC_TYPE_TOOLSTACK,
+        .length = 0,
+    };
+    uint8_t *buf;
+    uint32_t len;
+    int rc;
+
+    if ( !ctx->save.callbacks || !ctx->save.callbacks->toolstack_save )
+        return 0;
+
+    if ( ctx->save.callbacks->toolstack_save(ctx->domid, &buf, &len, ctx->save.callbacks->data) < 0 )
+    {
+        PERROR("Error calling toolstack_save");
+        return -1;
+    }
+
+    rc = write_split_record(ctx, &rec, buf, len);
+    if ( rc < 0 )
+        PERROR("Error writing TOOLSTACK record");
+    free(buf);
+    return rc;
+}
+
+static int x86_hvm_normalise_page(struct context *ctx, xen_pfn_t type, void **page)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_setup(struct context *ctx)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_start_of_stream(struct context *ctx)
+{
+    /* no-op */
+    return 0;
+}
+
+static int x86_hvm_end_of_stream(struct context *ctx)
+{
+    int rc;
+
+    rc = write_tsc_info(ctx);
+    if ( rc )
+        return rc;
+
+    /* TODO - remove. */
+    rc = write_toolstack(ctx);
+    if ( rc )
+        return rc;
+
+    /* Write the HVM_CONTEXT record. */
+    rc = write_hvm_context(ctx);
+    if ( rc )
+        return rc;
+
+    /* Write an HVM_PARAMS record containing the applicable HVM params. */
+    rc = write_hvm_params(ctx);
+    if ( rc )
+        return rc;
+
+    return rc;
+}
+
+static int x86_hvm_cleanup(struct context *ctx)
+{
+    /* no-op */
+    return 0;
+}
+
+struct save_ops save_ops_x86_hvm =
+{
+    .normalise_page  = x86_hvm_normalise_page,
+    .setup           = x86_hvm_setup,
+    .start_of_stream = x86_hvm_start_of_stream,
+    .end_of_stream   = x86_hvm_end_of_stream,
+    .cleanup         = x86_hvm_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread
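[Editorial note] `write_hvm_params()` above builds a record body consisting of a count header followed by index/value pairs. The layout can be sketched standalone as follows; the `demo_` names are invented, the field names follow `rec_hvm_params{,_entry}`, and the explicit padding field is an assumption about the on-wire layout:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the HVM_PARAMS record body: a small header carrying a
 * count, followed by 'count' index/value pairs.
 */
struct demo_hvm_params_entry {
    uint64_t index;
    uint64_t value;
};

struct demo_hvm_params {
    uint32_t count;
    uint32_t _res;                        /* assumed padding */
    struct demo_hvm_params_entry param[]; /* 'count' entries follow */
};

/* Total record payload length for a given number of entries. */
static size_t demo_hvm_params_len(uint32_t count)
{
    return sizeof(struct demo_hvm_params)
        + (size_t)count * sizeof(struct demo_hvm_params_entry);
}
```

This is the length the save side advertises via `rec.length` plus the split-record payload, and the quantity the restore side must validate against `rec->length`.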

* [PATCH v5 RFC 12/14] tools/libxc: x86 HVM restore code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (10 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:14   ` David Vrabel
  2014-06-11 18:14 ` [PATCH v5 RFC 13/14] tools/libxc: noarch save code Andrew Cooper
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h          |    1 +
 tools/libxc/saverestore/restore_x86_hvm.c |  280 +++++++++++++++++++++++++++++
 2 files changed, 281 insertions(+)
 create mode 100644 tools/libxc/saverestore/restore_x86_hvm.c

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 9e70ed6..e16e0de 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -251,6 +251,7 @@ extern struct save_ops save_ops_x86_pv;
 extern struct save_ops save_ops_x86_hvm;
 
 extern struct restore_ops restore_ops_x86_pv;
+extern struct restore_ops restore_ops_x86_hvm;
 
 struct record
 {
diff --git a/tools/libxc/saverestore/restore_x86_hvm.c b/tools/libxc/saverestore/restore_x86_hvm.c
new file mode 100644
index 0000000..299d77f
--- /dev/null
+++ b/tools/libxc/saverestore/restore_x86_hvm.c
@@ -0,0 +1,280 @@
+#include <assert.h>
+
+#include "common_x86.h"
+
+/* TODO: remove */
+static int handle_toolstack(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore )
+        return 0;
+
+    rc = ctx->restore.callbacks->toolstack_restore(ctx->domid, rec->data, rec->length,
+                                                   ctx->restore.callbacks->data);
+    if ( rc < 0 )
+        PERROR("restoring toolstack");
+    return rc;
+}
+
+/*
+ * Process an HVM_CONTEXT record from the stream.
+ */
+static int handle_hvm_context(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = xc_domain_hvm_setcontext(xch, ctx->domid, rec->data, rec->length);
+    if ( rc < 0 )
+        PERROR("Unable to restore HVM context");
+    return rc;
+}
+
+/*
+ * Process an HVM_PARAMS record from the stream.
+ */
+static int handle_hvm_params(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_hvm_params *hdr = rec->data;
+    struct rec_hvm_params_entry *entry = hdr->param;
+    unsigned int i;
+    int rc;
+
+    if ( rec->length < sizeof(*hdr)
+         || (rec->length - sizeof(*hdr)) / sizeof(*entry) < hdr->count )
+    {
+        ERROR("hvm_params record is too short");
+        return -1;
+    }
+
+    for ( i = 0; i < hdr->count; i++, entry++ )
+    {
+        switch ( entry->index )
+        {
+        case HVM_PARAM_CONSOLE_PFN:
+            ctx->restore.console_mfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_STORE_PFN:
+            ctx->restore.xenstore_mfn = entry->value;
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        case HVM_PARAM_IOREQ_PFN:
+        case HVM_PARAM_BUFIOREQ_PFN:
+            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            break;
+        }
+
+        rc = xc_set_hvm_param(xch, ctx->domid, entry->index, entry->value);
+        if ( rc < 0 )
+        {
+            PERROR("set HVM param %"PRIu64" = 0x%016"PRIx64,
+                   entry->index, entry->value);
+            return rc;
+        }
+    }
+    return 0;
+}
+
+/* TODO: remove */
+static int dump_qemu(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    char qemusig[21], path[256];
+    uint32_t qlen;
+    void *qbuf = NULL;
+    int rc = -1;
+    FILE *fp = NULL;
+
+    if ( read_exact(ctx->fd, qemusig, sizeof(qemusig)) )
+    {
+        PERROR("Error reading QEMU signature");
+        goto out;
+    }
+
+    if ( memcmp(qemusig, "DeviceModelRecord0002", sizeof(qemusig)) )
+    {
+        qemusig[20] = '\0';
+        ERROR("Invalid device model state signature %s", qemusig);
+        goto out;
+    }
+
+    if ( read_exact(ctx->fd, &qlen, sizeof(qlen)) )
+    {
+        PERROR("Error reading QEMU record length");
+        goto out;
+    }
+
+    qbuf = malloc(qlen);
+    if ( !qbuf )
+    {
+        PERROR("no memory for device model state");
+        goto out;
+    }
+
+    if ( read_exact(ctx->fd, qbuf, qlen) )
+    {
+        PERROR("Error reading device model state");
+        goto out;
+    }
+
+    snprintf(path, sizeof(path), XC_DEVICE_MODEL_RESTORE_FILE".%u", ctx->domid);
+    fp = fopen(path, "wb");
+    if ( !fp )
+    {
+        PERROR("Failed to open '%s' for writing", path);
+        goto out;
+    }
+
+    DPRINTF("Writing %u bytes of QEMU data", qlen);
+    if ( fwrite(qbuf, 1, qlen, fp) != qlen )
+    {
+        PERROR("Failed to write %u bytes of QEMU data", qlen);
+        goto out;
+    }
+
+    rc = 0;
+
+ out:
+    if ( fp )
+        fclose(fp);
+    free(qbuf);
+
+    return rc;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_hvm_localise_page(struct context *ctx, uint32_t type, void *page)
+{
+    /* no-op */
+    return 0;
+}
+
+/*
+ * restore_ops function. Confirms the stream matches the domain.
+ */
+static int x86_hvm_setup(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( ctx->restore.guest_type != DHDR_TYPE_X86_HVM )
+    {
+        ERROR("Unable to restore %s domain into an x86_hvm domain",
+              dhdr_type_to_str(ctx->restore.guest_type));
+        return -1;
+    }
+    else if ( ctx->restore.guest_page_size != PAGE_SIZE )
+    {
+        ERROR("Invalid page size %d for x86_hvm domains",
+              ctx->restore.guest_page_size);
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * restore_ops function.
+ */
+static int x86_hvm_process_record(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+
+    switch ( rec->type )
+    {
+    case REC_TYPE_TSC_INFO:
+        return handle_tsc_info(ctx, rec);
+
+    case REC_TYPE_HVM_CONTEXT:
+        return handle_hvm_context(ctx, rec);
+
+    case REC_TYPE_HVM_PARAMS:
+        return handle_hvm_params(ctx, rec);
+
+    case REC_TYPE_TOOLSTACK:
+        return handle_toolstack(ctx, rec);
+
+    default:
+        if ( rec->type & REC_TYPE_OPTIONAL )
+        {
+            IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
+                    rec->type, rec_type_to_str(rec->type));
+            return 0;
+        }
+
+        ERROR("Invalid record type (0x%"PRIx32", %s) for x86_hvm domains",
+              rec->type, rec_type_to_str(rec->type));
+        return -1;
+    }
+}
+
+/*
+ * restore_ops function.  Sets extra hvm parameters and seeds the grant table.
+ */
+static int x86_hvm_stream_complete(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    rc = xc_set_hvm_param(xch, ctx->domid, HVM_PARAM_STORE_EVTCHN,
+                          ctx->restore.xenstore_evtchn);
+    if ( rc )
+    {
+        PERROR("Failed to set HVM_PARAM_STORE_EVTCHN");
+        return rc;
+    }
+
+    rc = xc_dom_gnttab_hvm_seed(xch, ctx->domid,
+                                ctx->restore.console_mfn,
+                                ctx->restore.xenstore_mfn,
+                                ctx->restore.console_domid,
+                                ctx->restore.xenstore_domid);
+    if ( rc )
+    {
+        PERROR("Failed to seed grant table");
+        return rc;
+    }
+
+    /*
+     * FIXME: reading the device model state from the stream should be
+     * done by libxl.
+     */
+    rc = dump_qemu(ctx);
+    if ( rc )
+    {
+        ERROR("Failed to dump qemu");
+        return rc;
+    }
+
+    return rc;
+}
+
+static int x86_hvm_cleanup(struct context *ctx)
+{
+    /* no-op */
+    return 0;
+}
+
+struct restore_ops restore_ops_x86_hvm =
+{
+    .localise_page   = x86_hvm_localise_page,
+    .setup           = x86_hvm_setup,
+    .process_record  = x86_hvm_process_record,
+    .stream_complete = x86_hvm_stream_complete,
+    .cleanup         = x86_hvm_cleanup,
+};
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread
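[Editorial note] A record arriving from an untrusted stream must have its advertised entry count checked against its length, as `handle_hvm_params()` does above. A standalone sketch of that validation, written in a division form so a hostile count cannot overflow a multiplication (the `demo_` name and the size parameters are illustrative, not libxc definitions):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of record-length validation: first ensure the fixed header
 * fits, then that 'count' entries of 'entry_size' bytes fit in the
 * remainder.  The division avoids overflow in count * entry_size.
 */
static bool demo_params_len_ok(uint32_t length, uint64_t count,
                               size_t hdr_size, size_t entry_size)
{
    if ( length < hdr_size )
        return false;

    return count <= (length - hdr_size) / entry_size;
}
```

Rejecting the record here, before iterating over `count` entries, keeps the restore path from reading past the end of the received buffer.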

* [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (11 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 12/14] tools/libxc: x86 HVM restore code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:24   ` David Vrabel
                     ` (2 more replies)
  2014-06-11 18:14 ` [PATCH v5 RFC 14/14] tools/libxc: noarch restore code Andrew Cooper
                   ` (2 subsequent siblings)
  15 siblings, 3 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/save.c |  545 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 544 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index f6ad734..9ad43a5 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -1,11 +1,554 @@
+#include <assert.h>
+#include <arpa/inet.h>
+
 #include "common.h"
 
+/*
+ * Writes an Image header and Domain header into the stream.
+ */
+static int write_headers(struct context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
+    struct ihdr ihdr =
+        {
+            .marker  = IHDR_MARKER,
+            .id      = htonl(IHDR_ID),
+            .version = htonl(IHDR_VERSION),
+            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
+        };
+    struct dhdr dhdr =
+        {
+            .type       = guest_type,
+            .page_shift = XC_PAGE_SHIFT,
+            .xen_major  = (xen_version >> 16) & 0xffff,
+            .xen_minor  = (xen_version)       & 0xffff,
+        };
+
+    if ( xen_version < 0 )
+    {
+        PERROR("Unable to obtain Xen Version");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
+    {
+        PERROR("Unable to write Image Header to stream");
+        return -1;
+    }
+
+    if ( write_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
+    {
+        PERROR("Unable to write Domain Header to stream");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Writes an END record into the stream.
+ */
+static int write_end_record(struct context *ctx)
+{
+    struct record end = { REC_TYPE_END, 0, NULL };
+
+    return write_record(ctx, &end);
+}
+
+/*
+ * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
+ * is constructed in ctx->save.batch_pfns.
+ *
+ * This function:
+ * - gets the types for each pfn in the batch.
+ * - for each pfn with real data:
+ *   - maps and attempts to normalise the pages.
+ * - construct and writes a PAGE_DATA record into the stream.
+ */
+static int write_batch(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = NULL, *types = NULL;
+    void *guest_mapping = NULL;
+    void **guest_data = NULL;
+    void **local_pages = NULL;
+    int *errors = NULL, rc = -1;
+    unsigned i, p, nr_pages = 0;
+    unsigned nr_pfns = ctx->save.nr_batch_pfns;
+    void *page, *orig_page;
+    uint64_t *rec_pfns = NULL;
+    struct rec_page_data_header hdr = { 0 };
+    struct record rec =
+    {
+        .type = REC_TYPE_PAGE_DATA,
+    };
+
+    assert(nr_pfns != 0);
+
+    /* Mfns of the batch pfns. */
+    mfns = malloc(nr_pfns * sizeof(*mfns));
+    /* Types of the batch pfns. */
+    types = malloc(nr_pfns * sizeof(*types));
+    /* Errors from attempting to map the mfns. */
+    errors = malloc(nr_pfns * sizeof(*errors));
+    /* Pointers to page data to send.  Either mapped mfns or local allocations. */
+    guest_data = calloc(nr_pfns, sizeof(*guest_data));
+    /* Pointers to locally allocated pages.  Need freeing. */
+    local_pages = calloc(nr_pfns, sizeof(*local_pages));
+
+    if ( !mfns || !types || !errors || !guest_data || !local_pages )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->save.batch_pfns[i]);
+
+        /* Likely a ballooned page. */
+        if ( mfns[i] == INVALID_MFN )
+            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+    }
+
+    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
+    if ( rc )
+    {
+        PERROR("Failed to get types for pfn batch");
+        goto err;
+    }
+    rc = -1;
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        mfns[nr_pages++] = mfns[i];
+    }
+
+    if ( nr_pages > 0 )
+    {
+        guest_mapping = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
+        if ( !guest_mapping )
+        {
+            PERROR("Failed to map guest pages");
+            goto err;
+        }
+    }
+
+    for ( i = 0, p = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            continue;
+        }
+
+        if ( errors[p] )
+        {
+            ERROR("Mapping of pfn %#lx (mfn %#lx) failed %d",
+                  ctx->save.batch_pfns[i], mfns[p], errors[p]);
+            goto err;
+        }
+
+        orig_page = page = guest_mapping + (p * PAGE_SIZE);
+        rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
+        if ( rc )
+        {
+            if ( rc == -1 && errno == EAGAIN )
+            {
+                set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+                types[i] = XEN_DOMCTL_PFINFO_XTAB;
+                --nr_pages;
+            }
+            else
+                goto err;
+        }
+        else
+            guest_data[i] = page;
+
+        if ( page != orig_page )
+            local_pages[i] = page;
+        rc = -1;
+
+        ++p;
+    }
+
+    rec_pfns = malloc(nr_pfns * sizeof(*rec_pfns));
+    if ( !rec_pfns )
+    {
+        ERROR("Unable to allocate %zu bytes of memory for page data pfn list",
+              nr_pfns * sizeof(*rec_pfns));
+        goto err;
+    }
+
+    hdr.count = nr_pfns;
+
+    rec.length = sizeof(hdr);
+    rec.length += nr_pfns * sizeof(*rec_pfns);
+    rec.length += nr_pages * PAGE_SIZE;
+
+    for ( i = 0; i < nr_pfns; ++i )
+        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.batch_pfns[i];
+
+    if ( write_record_header(ctx, &rec) ||
+         write_exact(ctx->fd, &hdr, sizeof(hdr)) ||
+         write_exact(ctx->fd, rec_pfns, nr_pfns * sizeof(*rec_pfns)) )
+    {
+        PERROR("Failed to write PAGE_DATA header to stream");
+        goto err;
+    }
+
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        if ( guest_data[i] )
+        {
+            if ( write_exact(ctx->fd, guest_data[i], PAGE_SIZE) )
+            {
+                PERROR("Failed to write page into stream");
+                goto err;
+            }
+
+            --nr_pages;
+        }
+    }
+
+    /* Sanity check we have sent all the pages we expected to. */
+    assert(nr_pages == 0);
+    rc = ctx->save.nr_batch_pfns = 0;
+
+ err:
+    free(rec_pfns);
+    if ( guest_mapping )
+        munmap(guest_mapping, nr_pages * PAGE_SIZE);
+    for ( i = 0; local_pages && i < nr_pfns; ++i )
+        free(local_pages[i]);
+    free(local_pages);
+    free(guest_data);
+    free(errors);
+    free(types);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Flush a batch of pfns into the stream.
+ */
+static int flush_batch(struct context *ctx)
+{
+    int rc = 0;
+
+    if ( ctx->save.nr_batch_pfns == 0 )
+        return rc;
+
+    rc = write_batch(ctx);
+
+    if ( !rc )
+    {
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+                                    MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
+    }
+
+    return rc;
+}
+
+/*
+ * Add a single pfn to the batch, flushing the batch if full.
+ */
+static int add_to_batch(struct context *ctx, xen_pfn_t pfn)
+{
+    int rc = 0;
+
+    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
+        rc = flush_batch(ctx);
+
+    if ( rc == 0 )
+        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
+
+    return rc;
+}
+
+/*
+ * Pause the domain.
+ */
+static int pause_domain(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc;
+
+    if ( !ctx->dominfo.paused )
+    {
+        /* TODO: Properly specify the return value from this callback. */
+        rc = (ctx->save.callbacks->suspend(ctx->save.callbacks->data) != 1);
+        if ( rc )
+        {
+            ERROR("Failed to suspend domain");
+            return rc;
+        }
+    }
+
+    IPRINTF("Domain now paused");
+    return 0;
+}
+
+/*
+ * Send all domain memory.  This is the heart of the live migration loop.
+ */
+static int send_domain_memory(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
+    xc_shadow_op_stats_t stats = { -1, -1 };
+    unsigned pages_written;
+    unsigned x, max_iter = 5, dirty_threshold = 50;
+    xen_pfn_t p;
+    int rc = -1;
+
+    to_send = xc_hypercall_buffer_alloc_pages(
+        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
+
+    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
+    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
+
+    if ( !ctx->save.batch_pfns || !to_send || !ctx->save.deferred_pages )
+    {
+        ERROR("Unable to allocate memory for to_send/deferred bitmaps");
+        goto out;
+    }
+
+    if ( xc_shadow_control(xch, ctx->domid,
+                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                           NULL, 0, NULL, 0, NULL) < 0 )
+    {
+        PERROR("Failed to enable logdirty");
+        goto out;
+    }
+
+    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
+    {
+        if ( x == 0 )
+        {
+            /* First iteration, send all pages. */
+            memset(to_send, 0xff, bitmap_size(ctx->save.p2m_size));
+        }
+        else
+        {
+            /* Else consult the dirty bitmap. */
+            if ( xc_shadow_control(
+                     xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                     HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+                     NULL, 0, &stats) != ctx->save.p2m_size )
+            {
+                PERROR("Failed to retrieve logdirty bitmap");
+                rc = -1;
+                goto out;
+            }
+            else
+                DPRINTF("  Wrote %u pages; stats: faults %"PRIu32", dirty %"PRIu32,
+                        pages_written, stats.fault_count, stats.dirty_count);
+            pages_written = 0;
+
+            if ( stats.dirty_count < dirty_threshold )
+                break;
+        }
+
+        DPRINTF("Iteration %u", x);
+
+        for ( p = 0 ; p < ctx->save.p2m_size; ++p )
+        {
+            if ( test_bit(p, to_send) )
+            {
+                rc = add_to_batch(ctx, p);
+                if ( rc )
+                    goto out;
+                ++pages_written;
+            }
+        }
+
+        rc = flush_batch(ctx);
+        if ( rc )
+            goto out;
+    }
+
+    rc = pause_domain(ctx);
+    if ( rc )
+        goto out;
+
+    if ( xc_shadow_control(
+             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
+             NULL, 0, &stats) != ctx->save.p2m_size )
+    {
+        PERROR("Failed to retrieve logdirty bitmap");
+        rc = -1;
+        goto out;
+    }
+
+    for ( p = 0, pages_written = 0 ; p < ctx->save.p2m_size; ++p )
+    {
+        if ( test_bit(p, to_send) || test_bit(p, ctx->save.deferred_pages) )
+        {
+            rc = add_to_batch(ctx, p);
+            if ( rc )
+                goto out;
+            ++pages_written;
+        }
+    }
+
+    rc = flush_batch(ctx);
+    if ( rc )
+        goto out;
+
+    DPRINTF("  Wrote %u pages", pages_written);
+    IPRINTF("Sent all pages");
+
+  out:
+    xc_hypercall_buffer_free_pages(xch, to_send,
+                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
+    free(ctx->save.deferred_pages);
+    free(ctx->save.batch_pfns);
+    return rc;
+}
+
+/*
+ * Save a domain.
+ */
+static int save(struct context *ctx, uint16_t guest_type)
+{
+    xc_interface *xch = ctx->xch;
+    int rc, saved_rc = 0, saved_errno = 0;
+
+    IPRINTF("Saving domain %d, type %s",
+            ctx->domid, dhdr_type_to_str(guest_type));
+
+    rc = ctx->save.ops.setup(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_headers(ctx, guest_type);
+    if ( rc )
+        goto err;
+
+    rc = ctx->save.ops.start_of_stream(ctx);
+    if ( rc )
+        goto err;
+
+    rc = send_domain_memory(ctx);
+    if ( rc )
+        goto err;
+
+    /* Refresh domain information now that it has paused. */
+    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
+         (ctx->dominfo.domid != ctx->domid) )
+    {
+        PERROR("Unable to refresh domain information");
+        rc = -1;
+        goto err;
+    }
+    else if ( (!ctx->dominfo.shutdown ||
+               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
+              !ctx->dominfo.paused )
+    {
+        ERROR("Domain has not been suspended");
+        rc = -1;
+        goto err;
+    }
+
+    rc = ctx->save.ops.end_of_stream(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_end_record(ctx);
+    if ( rc )
+        goto err;
+
+    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                      NULL, 0, NULL, 0, NULL);
+
+    IPRINTF("Save successful");
+    goto done;
+
+ err:
+    saved_errno = errno;
+    saved_rc = rc;
+    PERROR("Save failed");
+
+ done:
+    rc = ctx->save.ops.cleanup(ctx);
+    if ( rc )
+        PERROR("Failed to clean up");
+
+    if ( saved_rc )
+    {
+        rc = saved_rc;
+        errno = saved_errno;
+    }
+
+    return rc;
+}
+
 int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
                     uint32_t max_factor, uint32_t flags,
                     struct save_callbacks* callbacks, int hvm)
 {
+    struct context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions :( */
+    ctx.save.callbacks = callbacks;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %d does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+    IPRINTF("Saving domain %d", dom);
+
+    ctx.save.p2m_size = xc_domain_maximum_gpfn(xch, dom) + 1;
+    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
+    {
+        errno = E2BIG;
+        ERROR("Cannot save this big a guest");
+        return -1;
+    }
+
+    if ( ctx.dominfo.hvm )
+    {
+        ctx.ops = common_ops_x86_hvm;
+        ctx.save.ops = save_ops_x86_hvm;
+        return save(&ctx, DHDR_TYPE_X86_HVM);
+    }
+    else
+    {
+        ctx.ops = common_ops_x86_pv;
+        ctx.save.ops = save_ops_x86_pv;
+        return save(&ctx, DHDR_TYPE_X86_PV);
+    }
 }
 
 /*
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 RFC 14/14] tools/libxc: noarch restore code
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (12 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 13/14] tools/libxc: noarch save code Andrew Cooper
@ 2014-06-11 18:14 ` Andrew Cooper
  2014-06-12 10:27   ` David Vrabel
                     ` (2 more replies)
  2014-06-12  3:17 ` [PATCH v5 0/14] Migration Stream v2 Hongyang Yang
  2014-06-12  9:38 ` David Vrabel
  15 siblings, 3 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-11 18:14 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Frediano Ziglio, David Vrabel

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 tools/libxc/saverestore/common.h  |    6 +
 tools/libxc/saverestore/restore.c |  556 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 561 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index e16e0de..2d44961 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -292,6 +292,12 @@ static inline int write_record(struct context *ctx, struct record *rec)
     return write_split_record(ctx, rec, NULL, 0);
 }
 
+/* TODO - find a better way of hiding this.  It should be private to
+ * restore.c, but is needed by x86_pv_localise_page()
+ */
+int populate_pfns(struct context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types);
+
 #endif
 /*
  * Local variables:
diff --git a/tools/libxc/saverestore/restore.c b/tools/libxc/saverestore/restore.c
index 6624baa..c00742d 100644
--- a/tools/libxc/saverestore/restore.c
+++ b/tools/libxc/saverestore/restore.c
@@ -1,5 +1,499 @@
+#include <arpa/inet.h>
+
 #include "common.h"
 
+/*
+ * Read and validate the Image and Domain headers.
+ */
+static int read_headers(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct ihdr ihdr;
+    struct dhdr dhdr;
+
+    if ( read_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
+    {
+        PERROR("Failed to read Image Header from stream");
+        return -1;
+    }
+
+    ihdr.id      = ntohl(ihdr.id);
+    ihdr.version = ntohl(ihdr.version);
+    ihdr.options = ntohs(ihdr.options);
+
+    if ( ihdr.marker != IHDR_MARKER )
+    {
+        ERROR("Invalid marker: Got 0x%016"PRIx64, ihdr.marker);
+        return -1;
+    }
+    else if ( ihdr.id != IHDR_ID )
+    {
+        ERROR("Invalid ID: Expected 0x%08"PRIx32", Got 0x%08"PRIx32,
+              IHDR_ID, ihdr.id);
+        return -1;
+    }
+    else if ( ihdr.version != IHDR_VERSION )
+    {
+        ERROR("Invalid Version: Expected %d, Got %d", IHDR_VERSION, ihdr.version);
+        return -1;
+    }
+    else if ( ihdr.options & IHDR_OPT_BIG_ENDIAN )
+    {
+        ERROR("Unable to handle big endian streams");
+        return -1;
+    }
+
+    ctx->restore.format_version = ihdr.version;
+
+    if ( read_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
+    {
+        PERROR("Failed to read Domain Header from stream");
+        return -1;
+    }
+
+    ctx->restore.guest_type = dhdr.type;
+    ctx->restore.guest_page_size = (1U << dhdr.page_shift);
+
+    IPRINTF("Found %s domain from Xen %d.%d",
+            dhdr_type_to_str(dhdr.type), dhdr.xen_major, dhdr.xen_minor);
+    return 0;
+}
+
+/**
+ * Reads a record from the stream, and fills in the record structure.
+ *
+ * Returns 0 on success and non-0 on failure.
+ *
+ * On success, the record's type and size shall be valid.
+ * - If size is 0, data shall be NULL.
+ * - If size is non-0, data shall be a buffer allocated by malloc() which must
+ *   be passed to free() by the caller.
+ *
+ * On failure, the contents of the record structure are undefined.
+ */
+static int read_record(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rhdr rhdr;
+    size_t datasz;
+
+    if ( read_exact(ctx->fd, &rhdr, sizeof(rhdr)) )
+    {
+        PERROR("Failed to read Record Header from stream");
+        return -1;
+    }
+    else if ( rhdr.length > REC_LENGTH_MAX )
+    {
+        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
+              " exceeds max (0x%"PRIx32")",
+              rhdr.type, rec_type_to_str(rhdr.type),
+              rhdr.length, REC_LENGTH_MAX);
+        return -1;
+    }
+
+    datasz = ROUNDUP(rhdr.length, REC_ALIGN_ORDER);
+
+    if ( datasz )
+    {
+        rec->data = malloc(datasz);
+
+        if ( !rec->data )
+        {
+            ERROR("Unable to allocate %zu bytes for record data (0x%08"PRIx32", %s)",
+                  datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+
+        if ( read_exact(ctx->fd, rec->data, datasz) )
+        {
+            free(rec->data);
+            rec->data = NULL;
+            PERROR("Failed to read %zu bytes of data for record (0x%08"PRIx32", %s)",
+                   datasz, rhdr.type, rec_type_to_str(rhdr.type));
+            return -1;
+        }
+    }
+    else
+        rec->data = NULL;
+
+    rec->type   = rhdr.type;
+    rec->length = rhdr.length;
+
+    return 0;
+}
+
+/*
+ * Is a pfn populated?
+ */
+static bool pfn_is_populated(const struct context *ctx, xen_pfn_t pfn)
+{
+    if ( !ctx->restore.populated_pfns || pfn > ctx->restore.max_populated_pfn )
+        return false;
+    return test_bit(pfn, ctx->restore.populated_pfns);
+}
+
+/*
+ * Set a pfn as populated, expanding the tracking structures if needed.
+ */
+static int pfn_set_populated(struct context *ctx, xen_pfn_t pfn)
+{
+    xc_interface *xch = ctx->xch;
+
+    if ( !ctx->restore.populated_pfns || pfn > ctx->restore.max_populated_pfn )
+    {
+        unsigned long new_max_pfn = ((pfn + 1024) & ~1023) - 1;
+        size_t old_sz, new_sz;
+        unsigned long *p;
+
+        old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
+        new_sz = bitmap_size(new_max_pfn + 1);
+
+        p = realloc(ctx->restore.populated_pfns, new_sz);
+        if ( !p )
+        {
+            PERROR("Failed to realloc populated bitmap");
+            return -1;
+        }
+
+        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+
+        ctx->restore.populated_pfns    = p;
+        ctx->restore.max_populated_pfn = new_max_pfn;
+    }
+
+    set_bit(pfn, ctx->restore.populated_pfns);
+
+    return 0;
+}
+
+/*
+ * Given a set of pfns, obtain memory from Xen to fill the physmap for the
+ * unpopulated subset.
+ */
+int populate_pfns(struct context *ctx, unsigned count,
+                  const xen_pfn_t *original_pfns, const uint32_t *types)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof(*mfns)),
+        *pfns = malloc(count * sizeof(*pfns));
+    unsigned i, nr_pfns = 0;
+    int rc = -1;
+
+    if ( !mfns || !pfns )
+    {
+        ERROR("Failed to allocate %zu bytes for populating the physmap",
+              2 * count * sizeof(*mfns));
+        goto err;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        if ( types[i] != XEN_DOMCTL_PFINFO_XTAB &&
+             types[i] != XEN_DOMCTL_PFINFO_BROKEN &&
+             !pfn_is_populated(ctx, original_pfns[i]) )
+        {
+            pfns[nr_pfns] = mfns[nr_pfns] = original_pfns[i];
+            ++nr_pfns;
+        }
+    }
+
+    if ( nr_pfns )
+    {
+        rc = xc_domain_populate_physmap_exact(xch, ctx->domid, nr_pfns, 0, 0, mfns);
+        if ( rc )
+        {
+            PERROR("Failed to populate physmap");
+            goto err;
+        }
+
+        for ( i = 0; i < nr_pfns; ++i )
+        {
+            rc = pfn_set_populated(ctx, pfns[i]);
+            if ( rc )
+                goto err;
+            ctx->ops.set_gfn(ctx, pfns[i], mfns[i]);
+        }
+    }
+
+    rc = 0;
+
+ err:
+    free(pfns);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Given a list of pfns, their types, and a block of page data from the
+ * stream, populate and record their types, map the relevant subset and copy
+ * the data into the guest.
+ */
+static int process_page_data(struct context *ctx, unsigned count,
+                             xen_pfn_t *pfns, uint32_t *types, void *page_data)
+{
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
+    int *map_errs = malloc(count * sizeof(*map_errs));
+    int rc = -1;
+    void *mapping = NULL, *guest_page = NULL;
+    unsigned i,    /* i indexes the pfns from the record. */
+        j,         /* j indexes the subset of pfns we decide to map. */
+        nr_pages;
+
+    if ( !mfns || !map_errs )
+    {
+        ERROR("Failed to allocate %zu bytes to process page data",
+              count * (sizeof(*mfns) + sizeof(*map_errs)));
+        goto err;
+    }
+
+    rc = populate_pfns(ctx, count, pfns, types);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", count);
+        goto err;
+    }
+    rc = -1;
+
+    for ( i = 0, nr_pages = 0; i < count; ++i )
+    {
+        ctx->ops.set_page_type(ctx, pfns[i], types[i]);
+
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_NOTAB:
+
+        case XEN_DOMCTL_PFINFO_L1TAB:
+        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
+
+            mfns[nr_pages++] = ctx->ops.pfn_to_gfn(ctx, pfns[i]);
+            break;
+        }
+
+    }
+
+    if ( nr_pages > 0 )
+    {
+        mapping = guest_page = xc_map_foreign_bulk(
+            xch, ctx->domid, PROT_READ | PROT_WRITE,
+            mfns, map_errs, nr_pages);
+        if ( !mapping )
+        {
+            PERROR("Unable to map %u mfns for %u pages of data",
+                   nr_pages, count);
+            goto err;
+        }
+    }
+
+    for ( i = 0, j = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_XTAB:
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+            /* No page data to deal with. */
+            continue;
+        }
+
+        if ( map_errs[j] )
+        {
+            ERROR("Mapping pfn %lx (mfn %lx, type %#"PRIx32") failed with %d",
+                  pfns[i], mfns[j], types[i], map_errs[j]);
+            goto err;
+        }
+
+        memcpy(guest_page, page_data, PAGE_SIZE);
+
+        /* Undo page normalisation done by the saver. */
+        rc = ctx->restore.ops.localise_page(ctx, types[i], guest_page);
+        if ( rc )
+        {
+            DPRINTF("Failed to localise");
+            goto err;
+        }
+
+        ++j;
+        guest_page += PAGE_SIZE;
+        page_data += PAGE_SIZE;
+    }
+
+    rc = 0;
+
+ err:
+    if ( mapping )
+        munmap(mapping, nr_pages * PAGE_SIZE);
+
+    free(map_errs);
+    free(mfns);
+
+    return rc;
+}
+
+/*
+ * Validate a PAGE_DATA record from the stream, and pass the results to
+ * process_page_data() to actually perform the legwork.
+ */
+static int handle_page_data(struct context *ctx, struct record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct rec_page_data_header *pages = rec->data;
+    unsigned i, pages_of_data = 0;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL, pfn;
+    uint32_t *types = NULL, type;
+
+    if ( rec->length < sizeof(*pages) )
+    {
+        ERROR("PAGE_DATA record truncated: length %"PRIu32", min %zu",
+              rec->length, sizeof(*pages));
+        goto err;
+    }
+    else if ( pages->count < 1 )
+    {
+        ERROR("Expected at least 1 pfn in PAGE_DATA record");
+        goto err;
+    }
+    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
+    {
+        ERROR("PAGE_DATA record (length %"PRIu32") too short to contain %"
+              PRIu32" pfns worth of information", rec->length, pages->count);
+        goto err;
+    }
+
+    pfns = malloc(pages->count * sizeof(*pfns));
+    types = malloc(pages->count * sizeof(*types));
+    if ( !pfns || !types )
+    {
+        ERROR("Unable to allocate enough memory for %"PRIu32" pfns",
+              pages->count);
+        goto err;
+    }
+
+    for ( i = 0; i < pages->count; ++i )
+    {
+        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
+        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
+        {
+            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
+            goto err;
+        }
+
+        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
+        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
+             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
+        {
+            ERROR("Invalid type %#"PRIx32" for pfn %#lx (index %u)", type, pfn, i);
+            goto err;
+        }
+        else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
+            /* NOTAB and all L1 thru L4 tables (including pinned) should have
+             * a page worth of data in the record. */
+            pages_of_data++;
+
+        pfns[i] = pfn;
+        types[i] = type;
+    }
+
+    if ( rec->length != (sizeof(*pages) +
+                         (sizeof(uint64_t) * pages->count) +
+                         (PAGE_SIZE * pages_of_data)) )
+    {
+        ERROR("PAGE_DATA record wrong size: length %"PRIu32", expected "
+              "%zu + %zu + %zu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
+        goto err;
+    }
+
+    rc = process_page_data(ctx, pages->count, pfns, types,
+                           &pages->pfn[pages->count]);
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+/*
+ * Restore a domain.
+ */
+static int restore(struct context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    struct record rec;
+    int rc, saved_rc = 0, saved_errno = 0;
+
+    IPRINTF("Restoring domain");
+
+    rc = ctx->restore.ops.setup(ctx);
+    if ( rc )
+        goto err;
+
+    do
+    {
+        rc = read_record(ctx, &rec);
+        if ( rc )
+            goto err;
+
+        switch ( rec.type )
+        {
+        case REC_TYPE_END:
+            DPRINTF("End record");
+            break;
+
+        case REC_TYPE_PAGE_DATA:
+            rc = handle_page_data(ctx, &rec);
+            break;
+
+        default:
+            rc = ctx->restore.ops.process_record(ctx, &rec);
+            break;
+        }
+
+        free(rec.data);
+        if ( rc )
+            goto err;
+
+    } while ( rec.type != REC_TYPE_END );
+
+    rc = ctx->restore.ops.stream_complete(ctx);
+    if ( rc )
+        goto err;
+
+    IPRINTF("Restore successful");
+    goto done;
+
+ err:
+    saved_errno = errno;
+    saved_rc = rc;
+    PERROR("Restore failed");
+
+ done:
+    free(ctx->restore.populated_pfns);
+    rc = ctx->restore.ops.cleanup(ctx);
+    if ( rc )
+        PERROR("Failed to clean up");
+
+    if ( saved_rc )
+    {
+        rc = saved_rc;
+        errno = saved_errno;
+    }
+
+    return rc;
+}
+
 int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        unsigned int store_evtchn, unsigned long *store_mfn,
                        domid_t store_domid, unsigned int console_evtchn,
@@ -8,8 +502,68 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
                        int checkpointed_stream,
                        struct restore_callbacks *callbacks)
 {
+    struct context ctx =
+        {
+            .xch = xch,
+            .fd = io_fd,
+        };
+
+    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions :( */
+    ctx.restore.console_evtchn = console_evtchn;
+    ctx.restore.console_domid = console_domid;
+    ctx.restore.xenstore_evtchn = store_evtchn;
+    ctx.restore.xenstore_domid = store_domid;
+    ctx.restore.callbacks = callbacks;
+
     IPRINTF("In experimental %s", __func__);
-    return -1;
+
+    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    {
+        PERROR("Failed to get domain info");
+        return -1;
+    }
+
+    if ( ctx.dominfo.domid != dom )
+    {
+        ERROR("Domain %d does not exist", dom);
+        return -1;
+    }
+
+    ctx.domid = dom;
+    IPRINTF("Restoring domain %d", dom);
+
+    if ( read_headers(&ctx) )
+        return -1;
+
+    if ( ctx.dominfo.hvm )
+    {
+        ctx.ops = common_ops_x86_hvm;
+        ctx.restore.ops = restore_ops_x86_hvm;
+        if ( restore(&ctx) )
+            return -1;
+    }
+    else
+    {
+        ctx.ops = common_ops_x86_pv;
+        ctx.restore.ops = restore_ops_x86_pv;
+        if ( restore(&ctx) )
+            return -1;
+    }
+
+    DPRINTF("XenStore: mfn %#lx, dom %d, evt %u",
+            ctx.restore.xenstore_mfn,
+            ctx.restore.xenstore_domid,
+            ctx.restore.xenstore_evtchn);
+
+    DPRINTF("Console: mfn %#lx, dom %d, evt %u",
+            ctx.restore.console_mfn,
+            ctx.restore.console_domid,
+            ctx.restore.console_evtchn);
+
+    *console_mfn = ctx.restore.console_mfn;
+    *store_mfn = ctx.restore.xenstore_mfn;
+
+    return 0;
 }
 
 /*
-- 
1.7.10.4


* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (13 preceding siblings ...)
  2014-06-11 18:14 ` [PATCH v5 RFC 14/14] tools/libxc: noarch restore code Andrew Cooper
@ 2014-06-12  3:17 ` Hongyang Yang
  2014-06-12 13:27   ` Andrew Cooper
  2014-06-12  9:38 ` David Vrabel
  15 siblings, 1 reply; 76+ messages in thread
From: Hongyang Yang @ 2014-06-12  3:17 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson,
	Frediano Ziglio, David Vrabel, Jan Beulich

Hi, Andrew,

On 06/12/2014 02:14 AM, Andrew Cooper wrote:
> Hello,
>
> Presented here for review is v5 of the Migration Stream v2 work.
>
> Major work in v5 is the legacy conversion python script which is capable of
> converting from legacy to v2 format on the fly.  It was tested using live
> migration saving in the legacy format, piping through the script and restoring
> using the v2 code.
>
> Other work includes a substantial refactoring of the code structure, allowing
> for a single generic save() and restore() functions, with function pointer
> hooks for each type of guest to implement.  The spec has been updated to
> include PV MSRs in the migration stream.
>
>
> This series depends on several prerequisite fixes which have recently been
> committed into staging (and have not passed the push gate at the time of
> writing).  The series also depends on the VM Generation ID series from David.
>
> I now consider the core of the v2 code stable.  I do not expect it to change
> much too much, other than the identified TODOs (and code review of course).
>
>
> The next area of work is the libxl integration, which will seek to undo the
> current layering violations.  It will involve introducing a new libxl framing
> format (which will happen to look curiously similar to the libxc framing
> format), as well as providing legacy compatibility using the legacy conversion
> scripts so migrations from older libxl/libxc toolstacks will continue to work.

What are your plans for the libxl integration? Do you have any design specs,
or is it already implemented at an early stage?

>
>
> The code is presented here for comment/query/critism.
>
> ~Andrew
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> .
>

-- 
Thanks,
Yang.


* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
                   ` (14 preceding siblings ...)
  2014-06-12  3:17 ` [PATCH v5 0/14] Migration Stream v2 Hongyang Yang
@ 2014-06-12  9:38 ` David Vrabel
  2014-06-17 15:57   ` Ian Campbell
  15 siblings, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:38 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson,
	Frediano Ziglio, David Vrabel, Jan Beulich

On 11/06/14 19:14, Andrew Cooper wrote:
> Hello,
> 
> Presented here for review is v5 of the Migration Stream v2 work.

The commit messages are rather bare.  I shall go through and suggest
some text to help other people reviewing.

David


* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
@ 2014-06-12  9:45   ` David Vrabel
  2014-06-12 15:26   ` David Vrabel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:45 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Ian Jackson, Ian Campbell

Add the specification for a new migration stream format.  The document
includes all the details but to summarize:

The existing (legacy) format is dependent on the word size of the
toolstack.  This prevents domains from migrating from hosts running
32-bit toolstacks to hosts running 64-bit toolstacks (and vice-versa).

The legacy format lacks any version information, making it difficult to
extend in a compatible way.

The new format has a header (the image header) with version information,
a domain header with basic information of the domain and a stream of
records for the image data.

The format will be used for future domain types (such as on ARM).

The specification is in pandoc format (an extended markdown format), and
the documentation build system is extended to generate PDFs from pandoc
documents.
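As a rough C sketch of the framing described above (the field names follow the restore code later in this series; the constant values and the reserved fields are illustrative placeholders, not the normative ones from the spec):

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Placeholder constants -- the normative values are in the spec document. */
#define IHDR_MARKER  0xffffffffffffffffULL
#define IHDR_ID      0x58454e46U
#define IHDR_VERSION 2U
#define IHDR_OPT_BIG_ENDIAN (1U << 0)

/* Image header: identifies the stream and its version. */
struct ihdr {
    uint64_t marker;     /* endian-invariant marker */
    uint32_t id;         /* big-endian on the wire */
    uint32_t version;    /* big-endian on the wire */
    uint16_t options;    /* big-endian on the wire */
    uint16_t reserved1;  /* assumed padding/reserved fields */
    uint32_t reserved2;
};

/* Byte-swap the wire fields and check them, in the same order as
 * read_headers() in the restore patch.  Returns 0 on success. */
static int ihdr_validate(struct ihdr *h)
{
    h->id      = ntohl(h->id);
    h->version = ntohl(h->version);
    h->options = ntohs(h->options);

    if ( h->marker != IHDR_MARKER )         return -1;
    if ( h->id != IHDR_ID )                 return -1;
    if ( h->version != IHDR_VERSION )       return -1;
    if ( h->options & IHDR_OPT_BIG_ENDIAN ) return -1; /* not handled */
    return 0;
}
```

Note the restore code checks the marker without byte-swapping it, which works because the marker value is chosen to read the same under any byte order.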


* Re: [PATCH v5 RFC 02/14] scripts: Scripts for inspection/validation of legacy and new streams
  2014-06-11 18:14 ` [PATCH v5 RFC 02/14] scripts: Scripts for inspection/validation of legacy and new streams Andrew Cooper
@ 2014-06-12  9:48   ` David Vrabel
  0 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:48 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Add python scripts to manage image formats.

legacy.py converts a legacy format image into the new format.  This
avoids having to retain the restore code for the legacy format.
Toolstacks will identify a legacy image and call the conversion script
prior to restoring the domain.

verify.py validates that a new format image complies with the specification.


* Re: [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format
  2014-06-11 18:14 ` [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format Andrew Cooper
@ 2014-06-12  9:52   ` David Vrabel
  2014-06-12 15:31   ` David Vrabel
  1 sibling, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:52 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: David Vrabel

Provide the C structures matching the binary (wire) format of the new
stream format.  All header/record fields are naturally aligned and
explicit padding fields are used to ensure the correct layout (i.e.,
there is no need for any non-standard structure packing pragma or
attribute).

Provide some helper functions for converting types to string for
diagnostic purposes.
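A sketch of what "naturally aligned with explicit padding" buys: the record header needs only two 32-bit fields, and record bodies are padded to an alignment boundary before the next header (the ROUNDUP in read_record()). The 8-byte alignment order below is an assumption for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Record header as consumed by read_record()/write_record(). */
struct rhdr {
    uint32_t type;
    uint32_t length;   /* body length, excluding trailing padding */
};

/* Naturally aligned fields: the same layout on 32-bit and 64-bit
 * toolstacks, with no packing pragmas or attributes required. */
_Static_assert(sizeof(struct rhdr) == 8, "record header must be 8 bytes");
_Static_assert(offsetof(struct rhdr, length) == 4, "no hidden padding");

/* Assumed alignment order: record data rounded up to 8-byte multiples. */
#define REC_ALIGN_ORDER 3U
#define ROUNDUP(_x, _o) (((_x) + (1UL << (_o)) - 1) & ~((1UL << (_o)) - 1))
```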


* Re: [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-11 18:14 ` [PATCH v5 RFC 05/14] tools/libxc: noarch common code Andrew Cooper
@ 2014-06-12  9:55   ` David Vrabel
  2014-06-17 16:10   ` Ian Campbell
  1 sibling, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:55 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Add the context structure used to keep state during the save/restore
process.

Define the set of architecture or domain type specific operations with a
set of callbacks (common_ops, save_ops, and restore_ops).

Add common functions for writing records.


* Re: [PATCH v5 RFC 06/14] tools/libxc: x86 common code
  2014-06-11 18:14 ` [PATCH v5 RFC 06/14] tools/libxc: x86 " Andrew Cooper
@ 2014-06-12  9:57   ` David Vrabel
  2014-06-17 16:11     ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:57 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Save/restore records common to all x86 domain types (HVM, PV).

This is only the TSC_INFO record.


* Re: [PATCH v5 RFC 07/14] tools/libxc: x86 PV common code
  2014-06-11 18:14 ` [PATCH v5 RFC 07/14] tools/libxc: x86 PV " Andrew Cooper
@ 2014-06-12  9:59   ` David Vrabel
  0 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12  9:59 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Add functions common to save and restore of x86 PV guests.  This
includes functions for dealing with the P2M and M2P and the VCPU context.


* Re: [PATCH v5 RFC 08/14] tools/libxc: x86 PV save code
  2014-06-11 18:14 ` [PATCH v5 RFC 08/14] tools/libxc: x86 PV save code Andrew Cooper
@ 2014-06-12 10:04   ` David Vrabel
  0 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:04 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Save the x86 PV specific parts of a domain.  This is the X86_PV_INFO
record, the P2M_FRAMES, the X86_PV_SHARED_INFO, the three different VCPU
context records, and the MSR records.

The normalise_page callback used by the common code when writing the
PAGE_DATA records, converts MFNs in page tables to PFNs.


* Re: [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code
  2014-06-11 18:14 ` [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code Andrew Cooper
@ 2014-06-12 10:08   ` David Vrabel
  2014-06-12 15:49   ` David Vrabel
  1 sibling, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:08 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Restore the x86 PV specific parts.  The X86_PV_INFO, the P2M_FRAMES,
SHARED_INFO, and VCPU context records.

The localise_page callback is called from the common PAGE_DATA code to
convert PFNs in page tables to MFNs.

Page tables are pinned and the guest's P2M is updated when the stream is
complete.


* Re: [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code
  2014-06-11 18:14 ` [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code Andrew Cooper
@ 2014-06-12 10:11   ` David Vrabel
  2014-06-17 16:22     ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:11 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

The callbacks common to x86 HVM save and restore.  Since HVM guests do
not have a guest-maintained P2M, these are very trivial.

A similar implementation would be suitable for ARM guests.


* Re: [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code
  2014-06-11 18:14 ` [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code Andrew Cooper
@ 2014-06-12 10:12   ` David Vrabel
  2014-06-12 15:55   ` David Vrabel
  1 sibling, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:12 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Save the x86 HVM specific parts of the domain.  This is considerably
simpler than an x86 PV domain.  Only the HVM_CONTEXT and HVM_PARAMS
records are needed.

There is no need for any page normalisation.


* Re: [PATCH v5 RFC 12/14] tools/libxc: x86 HVM restore code
  2014-06-11 18:14 ` [PATCH v5 RFC 12/14] tools/libxc: x86 HVM restore code Andrew Cooper
@ 2014-06-12 10:14   ` David Vrabel
  0 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:14 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Restore the x86 HVM specific parts of a domain.  This is the HVM_CONTEXT
and HVM_PARAMS records.

There is no need for any page localisation.

This also includes writing the trailing qemu save record to a file
because this is what libxc currently does.  This is intended to be moved
into libxl proper in the future.


* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-11 18:14 ` [PATCH v5 RFC 13/14] tools/libxc: noarch save code Andrew Cooper
@ 2014-06-12 10:24   ` David Vrabel
  2014-06-17 16:28     ` Ian Campbell
  2014-06-18  6:59   ` Hongyang Yang
  2014-06-19  2:48   ` Wen Congyang
  2 siblings, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:24 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Save a domain, calling domain type specific function at the appropriate
points.  This implements the xc_domain_save2() API function which is
equivalent to the existing xc_domain_save().

This writes the image and domain headers, and writes all the PAGE_DATA
records using a "live" process.


* Re: [PATCH v5 RFC 14/14] tools/libxc: noarch restore code
  2014-06-11 18:14 ` [PATCH v5 RFC 14/14] tools/libxc: noarch restore code Andrew Cooper
@ 2014-06-12 10:27   ` David Vrabel
  2014-06-12 16:05   ` David Vrabel
  2014-06-19  6:16   ` Hongyang Yang
  2 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12 10:27 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

Restore a domain from the new format.  This reads and validates the
domain and image header and loads the guest memory from the PAGE_DATA
records, populating the p2m as it does so.

This provides the xc_domain_restore2() function as an alternative to the
existing xc_domain_restore().

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-12  3:17 ` [PATCH v5 0/14] Migration Stream v2 Hongyang Yang
@ 2014-06-12 13:27   ` Andrew Cooper
  2014-06-12 13:49     ` Wei Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-12 13:27 UTC (permalink / raw)
  To: Hongyang Yang
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel,
	Frediano Ziglio, David Vrabel, Jan Beulich, Wei Liu

On 12/06/14 04:17, Hongyang Yang wrote:
> Hi, Andrew,
>
> On 06/12/2014 02:14 AM, Andrew Cooper wrote:
>> Hello,
>>
>> Presented here for review is v5 of the Migration Stream v2 work.
>>
>> Major work in v5 is the legacy conversion python script which is
>> capable of
>> converting from legacy to v2 format on the fly.  It was tested using
>> live
>> migration saving in the legacy format, piping through the script and
>> restoring
>> using the v2 code.
>>
>> Other work includes a substantial refactoring of the code structure,
>> allowing
>> for a single generic save() and restore() functions, with function
>> pointer
>> hooks for each type of guest to implement.  The spec has been updated to
>> include PV MSRs in the migration stream.
>>
>>
>> This series depends on several prerequisite fixes which have recently
>> been
>> committed into staging (and have not passed the push gate at the time of
>> writing).  The series also depends on the VM Generation ID series
>> from David.
>>
>> I now consider the core of the v2 code stable.  I do not expect it to
>> change
>> much too much, other than the identified TODOs (and code review of
>> course).
>>
>>
>> The next area of work is the libxl integration, which will seek to
>> undo the
>> current layering violations.  It will involve introducing a new libxl
>> framing
>> format (which will happen to look curiously similar to the libxc framing
>> format), as well as providing legacy compatibility using the legacy
>> conversion
>> scripts so migrations from older libxl/libxc toolstacks will continue
>> to work.
>
> What are your plans for libxl integration, do you have any design
> specs or is
> it already implemented in the early stage?


There is no code in place yet, and my first course of action is to
design how this is going to look.

It is going to be conceptually similar, with libxl gaining a framing
format.  I also need to coordinate with Wei and his JSON series which is
moving data from xl to libxl.  There will be "save libxl v2", "restore
libxl v2" and "automatically convert old libxl to new libxl" which will
mostly involve the legacy conversion script.

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-12 13:27   ` Andrew Cooper
@ 2014-06-12 13:49     ` Wei Liu
  2014-06-12 14:18       ` Andrew Cooper
  0 siblings, 1 reply; 76+ messages in thread
From: Wei Liu @ 2014-06-12 13:49 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel,
	Frediano Ziglio, David Vrabel, Jan Beulich, Wei Liu,
	Hongyang Yang

On Thu, Jun 12, 2014 at 02:27:39PM +0100, Andrew Cooper wrote:
> On 12/06/14 04:17, Hongyang Yang wrote:
> > Hi, Andrew,
> >
> > On 06/12/2014 02:14 AM, Andrew Cooper wrote:
> >> Hello,
> >>
> >> Presented here for review is v5 of the Migration Stream v2 work.
> >>
> >> Major work in v5 is the legacy conversion python script which is
> >> capable of
> >> converting from legacy to v2 format on the fly.  It was tested using
> >> live
> >> migration saving in the legacy format, piping through the script and
> >> restoring
> >> using the v2 code.
> >>
> >> Other work includes a substantial refactoring of the code structure,
> >> allowing
> >> for a single generic save() and restore() functions, with function
> >> pointer
> >> hooks for each type of guest to implement.  The spec has been updated to
> >> include PV MSRs in the migration stream.
> >>
> >>
> >> This series depends on several prerequisite fixes which have recently
> >> been
> >> committed into staging (and have not passed the push gate at the time of
> >> writing).  The series also depends on the VM Generation ID series
> >> from David.
> >>
> >> I now consider the core of the v2 code stable.  I do not expect it to
> >> change
> >> much too much, other than the identified TODOs (and code review of
> >> course).
> >>
> >>
> >> The next area of work is the libxl integration, which will seek to
> >> undo the
> >> current layering violations.  It will involve introducing a new libxl
> >> framing
> >> format (which will happen to look curiously similar to the libxc framing
> >> format), as well as providing legacy compatibility using the legacy
> >> conversion
> >> scripts so migrations from older libxl/libxc toolstacks will continue
> >> to work.
> >
> > What are your plans for libxl integration, do you have any design
> > specs or is
> > it already implemented in the early stage?
> 
> 
> There is no code in place yet, and my first course of action is to
> design how this is going to look.
> 
> It is going to be conceptually similar, with libxl gaining a framing
> format.  I also need to coordinate with Wei and his JSON series which is
> moving data from xl to libxl.  There will be "save libxl v2", "restore
> libxl v2" and "automatically convert old libxl to new libxl" which will
> mostly involve the legacy conversion script.
> 

AIUI you're going to use both the new JSON infrastructure and the
configuration synchronisation bits, right? I shall upstream the rest of
my JSON infrastructure when xen-unstable is unblocked, but the
synchronisation bits will take longer.

Wei.

> ~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-12 13:49     ` Wei Liu
@ 2014-06-12 14:18       ` Andrew Cooper
  2014-06-12 14:27         ` Wei Liu
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-12 14:18 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel,
	Frediano Ziglio, David Vrabel, Jan Beulich, Hongyang Yang

On 12/06/14 14:49, Wei Liu wrote:
> On Thu, Jun 12, 2014 at 02:27:39PM +0100, Andrew Cooper wrote:
>> On 12/06/14 04:17, Hongyang Yang wrote:
>>> Hi, Andrew,
>>>
>>> On 06/12/2014 02:14 AM, Andrew Cooper wrote:
>>>> Hello,
>>>>
>>>> Presented here for review is v5 of the Migration Stream v2 work.
>>>>
>>>> Major work in v5 is the legacy conversion python script which is
>>>> capable of
>>>> converting from legacy to v2 format on the fly.  It was tested using
>>>> live
>>>> migration saving in the legacy format, piping through the script and
>>>> restoring
>>>> using the v2 code.
>>>>
>>>> Other work includes a substantial refactoring of the code structure,
>>>> allowing
>>>> for a single generic save() and restore() functions, with function
>>>> pointer
>>>> hooks for each type of guest to implement.  The spec has been updated to
>>>> include PV MSRs in the migration stream.
>>>>
>>>>
>>>> This series depends on several prerequisite fixes which have recently
>>>> been
>>>> committed into staging (and have not passed the push gate at the time of
>>>> writing).  The series also depends on the VM Generation ID series
>>>> from David.
>>>>
>>>> I now consider the core of the v2 code stable.  I do not expect it to
>>>> change
>>>> much too much, other than the identified TODOs (and code review of
>>>> course).
>>>>
>>>>
>>>> The next area of work is the libxl integration, which will seek to
>>>> undo the
>>>> current layering violations.  It will involve introducing a new libxl
>>>> framing
>>>> format (which will happen to look curiously similar to the libxc framing
>>>> format), as well as providing legacy compatibility using the legacy
>>>> conversion
>>>> scripts so migrations from older libxl/libxc toolstacks will continue
>>>> to work.
>>> What are your plans for libxl integration, do you have any design
>>> specs or is
>>> it already implemented in the early stage?
>>
>> There is no code in place yet, and my first course of action is to
>> design how this is going to look.
>>
>> It is going to be conceptually similar, with libxl gaining a framing
>> format.  I also need to coordinate with Wei and his JSON series which is
>> moving data from xl to libxl.  There will be "save libxl v2", "restore
>> libxl v2" and "automatically convert old libxl to new libxl" which will
>> mostly involve the legacy conversion script.
>>
> AIUI you're going to use both the new JSON infrastructure and the
> configuration synchronisation bits, right? I shall upstream the rest of
> my JSON infrastructure when xen-unstable is unblocked, but the
> synchronisation bits will take longer.
>
> Wei.

I care about which (hopefully singular) legacy libxl migration format
needs to be converted, and what the new stream is going to look like.

From my understanding (without any of your changes in place) the old
"libxl" stream has:

* xl header
   - contains "last boot .cfg file"
* 'libxl bits' which are passed straight to libxc


At a guess, the new stream wants to look something like:

* xl header
* libxl header
* libxl json "current configuration"
* libxc bits
* libxl xenstore key-value pairs record
* qemu record

where the legacy conversion script will have to deal with reframing the
old libxc bits in terms of the new libxl stream.

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-12 14:18       ` Andrew Cooper
@ 2014-06-12 14:27         ` Wei Liu
  0 siblings, 0 replies; 76+ messages in thread
From: Wei Liu @ 2014-06-12 14:27 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel,
	Frediano Ziglio, David Vrabel, Jan Beulich, Wei Liu,
	Hongyang Yang

On Thu, Jun 12, 2014 at 03:18:55PM +0100, Andrew Cooper wrote:
[...]
> 
> I care about which (hopefully singular) legacy libxl migration format
> needs to be converted, and what the new stream is going to look like.
> 
> From my understanding (without any of your changes in place) the old
> "libxl" stream has:
> 
> * xl header
>    - contains "last boot .cfg file"
> * 'libxl bits' which are passed straight to libxc
> 
> 
> At a guess, the new stream wants to look something like:
> 
> * xl header
> * libxl header
> * libxl json "current configuration"
> * libxc bits
> * libxl xenstore key-value pairs record
> * qemu record
> 

Looks sensible to me. I hope I can manage to finish the configuration
bit when you get there. ;-)

Wei.

> where the legacy conversion script will have to deal with reframing the
> old libxc bits in terms of the new libxl stream.
> 
> ~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
  2014-06-12  9:45   ` David Vrabel
@ 2014-06-12 15:26   ` David Vrabel
  2014-06-17 15:20   ` Ian Campbell
  2014-06-17 16:40   ` Ian Campbell
  3 siblings, 0 replies; 76+ messages in thread
From: David Vrabel @ 2014-06-12 15:26 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Ian Jackson, Ian Campbell

On 11/06/14 19:14, Andrew Cooper wrote:
> The pandoc markdown format is a superset of plain markdown, and while
> `markdown` is capable of making plausible documents from it, the results are
> not fantastic.
> 
> Differentiate the two via file extension to avoid running `markdown` on a
> pandoc document.
...
> +Image Header
> +------------
...
> +--------------------------------------------------------------------
> +Field       Description
> +----------- --------------------------------------------------------
> +marker      0xFFFFFFFFFFFFFFFF.
> +
> +id          0x58454E46 ("XENF" in ASCII).
> +
> +version     0x00000001.  The version of this specification.

Suggest version 2 here since that's what we've ended up calling it and
will reduce confusion if people mistakenly call the legacy format
"version 1".

At some point before the 4.5 release we'll want to update the version on
the title page to version 2 to match.

> +x86 PV Guest
> +------------
> +
> +An x86 PV guest image will have this order of records:
> +
> +1. Image header
> +2. Domain header
> +3. X86\_PV\_INFO record
> +4. Many PAGE\_DATA records
> +5. TSC\_INFO
> +6. X86\_PV\_P2M\_FRAMES record

This record is immediately after the X86_PV_INFO record and before the
PAGE_DATA records.

> +7. SHARED\_INFO record
> +8. VCPU context records for each online VCPU
> +    a. X86\_PV\_VCPU\_BASIC record
> +    b. X86\_PV\_VCPU\_EXTENDED record
> +    c. X86\_PV\_VCPU\_XSAVE record
> +    d. X86\_PV\_VCPU\_MSRS record
> +9. END record

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format
  2014-06-11 18:14 ` [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format Andrew Cooper
  2014-06-12  9:52   ` David Vrabel
@ 2014-06-12 15:31   ` David Vrabel
  2014-06-17 15:55     ` Ian Campbell
  1 sibling, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12 15:31 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: David Vrabel

On 11/06/14 19:14, Andrew Cooper wrote:
> Still TODO: Consider namespacing/prefixing

Perhaps xc_sr_?

Otherwise:

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Would prefer if you dropped this signed-off by throughout and instead
include a reviewed-by where I've provided one.

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code
  2014-06-11 18:14 ` [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code Andrew Cooper
  2014-06-12 10:08   ` David Vrabel
@ 2014-06-12 15:49   ` David Vrabel
  2014-06-12 17:01     ` Andrew Cooper
  1 sibling, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12 15:49 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

On 11/06/14 19:14, Andrew Cooper wrote:
> --- /dev/null
> +++ b/tools/libxc/saverestore/restore_x86_pv.c
...
> +static int x86_pv_process_record(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +
> +    switch ( rec->type )
> +    {
> +    case REC_TYPE_X86_PV_INFO:
> +        return handle_x86_pv_info(ctx, rec);
> +
...
> +
> +    case REC_TYPE_X86_PV_VCPU_MSRS:
> +        return handle_x86_pv_vcpu_msrs(ctx, rec);
> +
> +    default:
> +        if ( rec->type & REC_TYPE_OPTIONAL )
> +        {
> +            IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
> +                    rec->type, rec_type_to_str(rec->type));
> +            return 0;
> +        }
> +
> +        ERROR("Invalid record type (0x%"PRIx32", %s) for x86_pv domains",
> +              rec->type, rec_type_to_str(rec->type));
> +        return -1;
> +    }

I think this default case can be moved to common code.  Perhaps return
XC_SR_RECORD_UNHANDLED (== 1) to indicate the record wasn't handled.

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code
  2014-06-11 18:14 ` [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code Andrew Cooper
  2014-06-12 10:12   ` David Vrabel
@ 2014-06-12 15:55   ` David Vrabel
  2014-06-12 17:07     ` Andrew Cooper
  1 sibling, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12 15:55 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

On 11/06/14 19:14, Andrew Cooper wrote:
> --- /dev/null
> +++ b/tools/libxc/saverestore/save_x86_hvm.c
...
> +static int write_hvm_params(struct context *ctx)
> +{
...
> +
> +    for ( i = 0; i < ARRAY_SIZE(params); i++ )
> +    {
> +        uint32_t index = params[i];
> +        uint64_t value;
> +
> +        rc = xc_get_hvm_param(xch, ctx->domid, index, (unsigned long *)&value);
> +        if ( rc )
> +        {
> +            /* Gross XenServer hack. Consider HVM_PARAM_CONSOLE_PFN failure
> +             * nonfatal. This is related to the fact it is impossible to
> +             * distinguish "no console" from a console at pfn/evtchn 0.
> +             *
> +             * TODO - find a compatible way to fix this.
> +             */
> +            if ( index == HVM_PARAM_CONSOLE_PFN )
> +                continue;

Um. Is this really needed with upstream Xen?

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 14/14] tools/libxc: noarch restore code
  2014-06-11 18:14 ` [PATCH v5 RFC 14/14] tools/libxc: noarch restore code Andrew Cooper
  2014-06-12 10:27   ` David Vrabel
@ 2014-06-12 16:05   ` David Vrabel
  2014-06-12 17:16     ` Andrew Cooper
  2014-06-19  6:16   ` Hongyang Yang
  2 siblings, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-12 16:05 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

On 11/06/14 19:14, Andrew Cooper wrote:
> --- a/tools/libxc/saverestore/common.h
> +++ b/tools/libxc/saverestore/common.h
> @@ -292,6 +292,12 @@ static inline int write_record(struct context *ctx, struct record *rec)
>      return write_split_record(ctx, rec, NULL, 0);
>  }
>  
> +/* TODO - find a better way of hiding this.  It should be private to
> + * restore.c, but is needed by x86_pv_localise_page()
> + */
> +int populate_pfns(struct context *ctx, unsigned count,
> +                  const xen_pfn_t *original_pfns, const uint32_t *types);

I don't see a problem with this being here, if it's needed in those two
places.

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code
  2014-06-12 15:49   ` David Vrabel
@ 2014-06-12 17:01     ` Andrew Cooper
  2014-06-17 16:22       ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-12 17:01 UTC (permalink / raw)
  To: David Vrabel; +Cc: Frediano Ziglio, Xen-devel

On 12/06/14 16:49, David Vrabel wrote:
> On 11/06/14 19:14, Andrew Cooper wrote:
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/restore_x86_pv.c
> ...
>> +static int x86_pv_process_record(struct context *ctx, struct record *rec)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +
>> +    switch ( rec->type )
>> +    {
>> +    case REC_TYPE_X86_PV_INFO:
>> +        return handle_x86_pv_info(ctx, rec);
>> +
> ...
>> +
>> +    case REC_TYPE_X86_PV_VCPU_MSRS:
>> +        return handle_x86_pv_vcpu_msrs(ctx, rec);
>> +
>> +    default:
>> +        if ( rec->type & REC_TYPE_OPTIONAL )
>> +        {
>> +            IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
>> +                    rec->type, rec_type_to_str(rec->type));
>> +            return 0;
>> +        }
>> +
>> +        ERROR("Invalid record type (0x%"PRIx32", %s) for x86_pv domains",
>> +              rec->type, rec_type_to_str(rec->type));
>> +        return -1;
>> +    }
> I think this default case can be moved to common code.  Perhaps return
> XC_SR_RECORD_UNHANDLED (== 1) to indicate the record wasn't handled.
>
> David

I am looking to not adversely affect any attempt to standardise libxc
error reporting in the future, which is why all this code uses 0 or -1,
even though errno might or might not be relevant.  Even at the moment,
there are callbacks into higher level toolstacks which can return any
arbitrary error.

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code
  2014-06-12 15:55   ` David Vrabel
@ 2014-06-12 17:07     ` Andrew Cooper
  2014-06-17 16:25       ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-12 17:07 UTC (permalink / raw)
  To: David Vrabel; +Cc: Frediano Ziglio, Xen-devel

On 12/06/14 16:55, David Vrabel wrote:
> On 11/06/14 19:14, Andrew Cooper wrote:
>> --- /dev/null
>> +++ b/tools/libxc/saverestore/save_x86_hvm.c
> ...
>> +static int write_hvm_params(struct context *ctx)
>> +{
> ...
>> +
>> +    for ( i = 0; i < ARRAY_SIZE(params); i++ )
>> +    {
>> +        uint32_t index = params[i];
>> +        uint64_t value;
>> +
>> +        rc = xc_get_hvm_param(xch, ctx->domid, index, (unsigned long *)&value);
>> +        if ( rc )
>> +        {
>> +            /* Gross XenServer hack. Consider HVM_PARAM_CONSOLE_PFN failure
>> +             * nonfatal. This is related to the fact it is impossible to
>> +             * distinguish "no console" from a console at pfn/evtchn 0.
>> +             *
>> +             * TODO - find a compatible way to fix this.
>> +             */
>> +            if ( index == HVM_PARAM_CONSOLE_PFN )
>> +                continue;
> Um. Is this really needed with upstream Xen?
>
> David

What is needed with upstream Xen is a way of representing "no console".
The XenServer way of unconditionally failing the get_param hypercall
with -EINVAL is not suitable.

This particular hack however shouldn't have slipped in.  It comes as a
side effect of testing this series under XenServer.

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 14/14] tools/libxc: noarch restore code
  2014-06-12 16:05   ` David Vrabel
@ 2014-06-12 17:16     ` Andrew Cooper
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-12 17:16 UTC (permalink / raw)
  To: David Vrabel; +Cc: Frediano Ziglio, Xen-devel

On 12/06/14 17:05, David Vrabel wrote:
> On 11/06/14 19:14, Andrew Cooper wrote:
>> --- a/tools/libxc/saverestore/common.h
>> +++ b/tools/libxc/saverestore/common.h
>> @@ -292,6 +292,12 @@ static inline int write_record(struct context *ctx, struct record *rec)
>>      return write_split_record(ctx, rec, NULL, 0);
>>  }
>>  
>> +/* TODO - find a better way of hiding this.  It should be private to
>> + * restore.c, but is needed by x86_pv_localise_page()
>> + */
>> +int populate_pfns(struct context *ctx, unsigned count,
>> +                  const xen_pfn_t *original_pfns, const uint32_t *types);
> I don't see a problem with this being here, if it's needed in those two
> places.
>
> David
>

I was hoping to find a way of making it disappear as it is the one ugly
lump in the otherwise clean partitioning of common and arch specific code.

I am however struggling...

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
  2014-06-12  9:45   ` David Vrabel
  2014-06-12 15:26   ` David Vrabel
@ 2014-06-17 15:20   ` Ian Campbell
  2014-06-17 17:42     ` Andrew Cooper
  2014-06-17 16:40   ` Ian Campbell
  3 siblings, 1 reply; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 15:20 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> The pandoc markdown format is a superset of plain markdown, and while
> `markdown` is capable of making plausible documents from it, the results are
> not fantastic.

"not fantastic" doesn't sound like a reason to not generate the html and
txt versions of it.

Ian.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format
  2014-06-12 15:31   ` David Vrabel
@ 2014-06-17 15:55     ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 15:55 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Xen-devel

On Thu, 2014-06-12 at 16:31 +0100, David Vrabel wrote:
> On 11/06/14 19:14, Andrew Cooper wrote:
> > Still TODO: Consider namespacing/prefixing
> 
> Perhaps xc_sr_?

That's as good a colour for the shed as anything IMHO. It's all internal
I think, so it's not as critical as an external API, so long as it is
sensible in the scope of the library.

> Otherwise:
> 
> Reviewed-by: David Vrabel <david.vrabel@citrix.com>
> 
> > Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> 
> Would prefer if you dropped this signed-off by throughout and instead
> include a reviewed-by where I've provided one.
> 
> David
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/14] Migration Stream v2
  2014-06-12  9:38 ` David Vrabel
@ 2014-06-17 15:57   ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 15:57 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Andrew Cooper, Tim Deegan, Xen-devel,
	Frediano Ziglio, Jan Beulich, Ian Jackson


On Thu, 2014-06-12 at 10:38 +0100, David Vrabel wrote:
> On 11/06/14 19:14, Andrew Cooper wrote:
> > Hello,
> > 
> > Presented here for review is v5 of the Migration Stream v2 work.
> 
> The commit messages are rather bare.  I shall go through and suggest
> some text to help other people reviewing.

Thank you very much for doing this for Andrew.

The lack of intra-version commit logs is also rather annoying too.
Andrew, can you fix that for v5->v6 please.

Ian.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework
  2014-06-11 18:14 ` [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
@ 2014-06-17 16:00   ` Ian Campbell
  2014-06-17 16:17     ` Andrew Cooper
  0 siblings, 1 reply; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:00 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> For testing purposes, the environmental variable "XG_MIGRATION_V2" allows the
> two save/restore codepaths to coexist, and have a runtime switch.
> 
> It is intended that once this series is less RFC, the v2 framework will
> completely replace v1.

When do you imagine that will be? It seems to me like it could/should
reasonably happen now.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-11 18:14 ` [PATCH v5 RFC 05/14] tools/libxc: noarch common code Andrew Cooper
  2014-06-12  9:55   ` David Vrabel
@ 2014-06-17 16:10   ` Ian Campbell
  2014-06-17 16:28     ` Andrew Cooper
  1 sibling, 1 reply; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> +int write_split_record(struct context *ctx, struct record *rec,
> +                       void *buf, size_t sz)
> +{
> +    static const char zeroes[7] = { 0 };
> +    xc_interface *xch = ctx->xch;
> +    uint32_t combined_length = rec->length + sz;
> +    size_t record_length = ROUNDUP(combined_length, REC_ALIGN_ORDER);

I suppose the [7] must relate to REC_ALIGN_ORDER somehow, can you not
derive it? (1<<REC_ALIGN_ORDER perhaps)

> +
> +    if ( record_length > REC_LENGTH_MAX )
> +    {
> +        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
> +              " exceeds max (0x%"PRIx32")", rec->type,
> +              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
> +        return -1;
> +    }
> +
> +    if ( rec->length )
> +        assert(rec->data);
> +    if ( sz )
> +        assert(buf);
> +
> +    if ( write_exact(ctx->fd, &rec->type, sizeof(rec->type)) ||
> +         write_exact(ctx->fd, &combined_length, sizeof(rec->length)) ||
> +         (rec->length && write_exact(ctx->fd, rec->data, rec->length)) ||
> +         (sz && write_exact(ctx->fd, buf, sz)) ||
> +         write_exact(ctx->fd, zeroes, record_length - combined_length) )

For clarity I'd be inclined to split this into 4 separate ifs:
  write type & length
  optionally write data
  optionally write extra buf
  write padding.

either goto err or a context specific PERROR.

> +    {
> +        PERROR("Unable to write record to stream");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +int write_record_header(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +
> +    if ( write_exact(ctx->fd, &rec->type, sizeof(rec->type)) ||
> +         write_exact(ctx->fd, &rec->length, sizeof(rec->length)) )

No need to round this one up? Or assert that it is already aligned?

> +    {
> +        PERROR("Unable to write record to stream");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  static void __attribute__((unused)) build_assertions(void)
>  {
>      XC_BUILD_BUG_ON(sizeof(struct ihdr) != 24);
> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
> index cbecf0a..d9a3655 100644
> --- a/tools/libxc/saverestore/common.h
> +++ b/tools/libxc/saverestore/common.h
> @@ -1,7 +1,20 @@
>  #ifndef __COMMON__H
>  #define __COMMON__H
>  
> +#include <stdbool.h>
> +
> +// Hack out junk from the namespace

Do you have a plan to not need these hacks?

> +     * required, the callee should leave '*pages' unouched.

untouched.

> +     * callee enounceters an error, it should *NOT* free() the memory it

encounters.

> + * ensuring that they subseqently write the correct amount of data into the

subsequently.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 06/14] tools/libxc: x86 common code
  2014-06-12  9:57   ` David Vrabel
@ 2014-06-17 16:11     ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:11 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Frediano Ziglio, Xen-devel

On Thu, 2014-06-12 at 10:57 +0100, David Vrabel wrote:
> Save/restore records common to all x86 domain types (HVM, PV).
> 
> This is only the TSC_INFO record.

With this changelog in place:
        Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework
  2014-06-17 16:00   ` Ian Campbell
@ 2014-06-17 16:17     ` Andrew Cooper
  2014-06-17 16:47       ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-17 16:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xen-devel

On 17/06/14 17:00, Ian Campbell wrote:
> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>> For testing purposes, the environmental variable "XG_MIGRATION_V2" allows the
>> two save/restore codepaths to coexist, and have a runtime switch.
>>
>> It is intended that once this series is less RFC, the v2 framework will
>> completely replace v1.
> When do you imagine that will be? It seems to me like it could/should
> reasonably happen now.
>
>
>

When I have some plausible libxl compatibility in the series, which is
still ongoing work.

If this were to be done in two parts, we would end up with a single
identifiable xl stream containing possibly the new or possibly the old
libxc stream, which makes backwards compatibility much harder to retrofit.

~Andrew

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code
  2014-06-12 17:01     ` Andrew Cooper
@ 2014-06-17 16:22       ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Thu, 2014-06-12 at 18:01 +0100, Andrew Cooper wrote:
> On 12/06/14 16:49, David Vrabel wrote:
> > On 11/06/14 19:14, Andrew Cooper wrote:
> >> --- /dev/null
> >> +++ b/tools/libxc/saverestore/restore_x86_pv.c
> > ...
> >> +static int x86_pv_process_record(struct context *ctx, struct record *rec)
> >> +{
> >> +    xc_interface *xch = ctx->xch;
> >> +
> >> +    switch ( rec->type )
> >> +    {
> >> +    case REC_TYPE_X86_PV_INFO:
> >> +        return handle_x86_pv_info(ctx, rec);
> >> +
> > ...
> >> +
> >> +    case REC_TYPE_X86_PV_VCPU_MSRS:
> >> +        return handle_x86_pv_vcpu_msrs(ctx, rec);
> >> +
> >> +    default:
> >> +        if ( rec->type & REC_TYPE_OPTIONAL )
> >> +        {
> >> +            IPRINTF("Ignoring optional record (0x%"PRIx32", %s)",
> >> +                    rec->type, rec_type_to_str(rec->type));
> >> +            return 0;
> >> +        }
> >> +
> >> +        ERROR("Invalid record type (0x%"PRIx32", %s) for x86_pv domains",
> >> +              rec->type, rec_type_to_str(rec->type));
> >> +        return -1;
> >> +    }
> > I think this default case can be moved to common code.  Perhaps return
> > XC_SR_RECORD_UNHANDLED (== 1) to indicate the record wasn't handled.
> >
> > David
> 
> I am looking to not adversely affect any attempt to standardise libxc
> error reporting in the future, which is why all this code uses 0 or -1,
> even though errno might or might not be relevant.  Even at the moment,
> there are callbacks into higher level toolstacks which can return any
> arbitrary error.

That's fine, but what David is suggesting is a sentinel value which is
entirely between the common code and the arch hook, both of which you
are defining here; it should never escape past that common caller
AFAICT.

Wanting to clean things up and not wanting to make that harder is fine,
but let's not hobble new code which could legitimately make good use of
its return value; AFAICT this is one of those cases.

Ian.

* Re: [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code
  2014-06-12 10:11   ` David Vrabel
@ 2014-06-17 16:22     ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:22 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Frediano Ziglio, Xen-devel

On Thu, 2014-06-12 at 11:11 +0100, David Vrabel wrote:
> The callbacks common to x86 HVM save and restore.  Since HVM guests do
> not have a guest-maintained P2M, these are very trivial.
> 
> A similar implementation would be suitable for ARM guests.

With this commit log in place:
Acked-by: Ian Campbell <ian.campbell@citrix.com>

* Re: [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code
  2014-06-12 17:07     ` Andrew Cooper
@ 2014-06-17 16:25       ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:25 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Thu, 2014-06-12 at 18:07 +0100, Andrew Cooper wrote:
> On 12/06/14 16:55, David Vrabel wrote:
> > On 11/06/14 19:14, Andrew Cooper wrote:
> >> --- /dev/null
> >> +++ b/tools/libxc/saverestore/save_x86_hvm.c
> > ...
> >> +static int write_hvm_params(struct context *ctx)
> >> +{
> > ...
> >> +
> >> +    for ( i = 0; i < ARRAY_SIZE(params); i++ )
> >> +    {
> >> +        uint32_t index = params[i];
> >> +        uint64_t value;
> >> +
> >> +        rc = xc_get_hvm_param(xch, ctx->domid, index, (unsigned long *)&value);
> >> +        if ( rc )
> >> +        {
> >> +            /* Gross XenServer hack. Consider HVM_PARAM_CONSOLE_PFN failure
> >> +             * nonfatal. This is related to the fact it is impossible to
> >> +             * distinguish "no console" from a console at pfn/evtchn 0.
> >> +             *
> >> +             * TODO - find a compatible way to fix this.
> >> +             */
> >> +            if ( index == HVM_PARAM_CONSOLE_PFN )
> >> +                continue;
> > Um. Is this really needed with upstream Xen?
> >
> > David
> 
> What is needed with upstream Xen is a way of representing "no console".

When and how does this situation arise?

AFAICT with upstream Xen there is always a console in the special pfn
area.

Ian.

* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-12 10:24   ` David Vrabel
@ 2014-06-17 16:28     ` Ian Campbell
  2014-06-17 16:38       ` David Vrabel
  0 siblings, 1 reply; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:28 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Frediano Ziglio, Xen-devel

On Thu, 2014-06-12 at 11:24 +0100, David Vrabel wrote:
> Save a domain, calling domain type specific function at the appropriate
> points.  This implements the xc_domain_save2() API function which is
> equivalent to the existing xc_domain_save().
> 
> This writes the image and domain headers, and writes all the PAGE_DATA
> records using a "live" process.

It doesn't do "dead" migrate at all?

Ian.

* Re: [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-17 16:10   ` Ian Campbell
@ 2014-06-17 16:28     ` Andrew Cooper
  2014-06-17 16:53       ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-17 16:28 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 17/06/14 17:10, Ian Campbell wrote:
> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>> +int write_split_record(struct context *ctx, struct record *rec,
>> +                       void *buf, size_t sz)
>> +{
>> +    static const char zeroes[7] = { 0 };
>> +    xc_interface *xch = ctx->xch;
>> +    uint32_t combined_length = rec->length + sz;
>> +    size_t record_length = ROUNDUP(combined_length, REC_ALIGN_ORDER);
> I suppose the [7] must relate to REC_ALIGN_ORDER somehow, can you not
> derive it? (1<<REC_ALIGN_ORDER perhaps)

It does indeed.  I will change it to a calculation.

>
>> +
>> +    if ( record_length > REC_LENGTH_MAX )
>> +    {
>> +        ERROR("Record (0x%08"PRIx32", %s) length 0x%"PRIx32
>> +              " exceeds max (0x%"PRIx32")", rec->type,
>> +              rec_type_to_str(rec->type), rec->length, REC_LENGTH_MAX);
>> +        return -1;
>> +    }
>> +
>> +    if ( rec->length )
>> +        assert(rec->data);
>> +    if ( sz )
>> +        assert(buf);
>> +
>> +    if ( write_exact(ctx->fd, &rec->type, sizeof(rec->type)) ||
>> +         write_exact(ctx->fd, &combined_length, sizeof(rec->length)) ||
>> +         (rec->length && write_exact(ctx->fd, rec->data, rec->length)) ||
>> +         (sz && write_exact(ctx->fd, buf, sz)) ||
>> +         write_exact(ctx->fd, zeroes, record_length - combined_length) )
> For clarity I'd be inclined to split this into 4 separate ifs:
>   write type & length
>   optionally write data
>   optionally write extra buf
>   write padding.
>
> either goto err or a context specific PERROR.

Ok.

>
>> +    {
>> +        PERROR("Unable to write record to stream");
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int write_record_header(struct context *ctx, struct record *rec)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +
>> +    if ( write_exact(ctx->fd, &rec->type, sizeof(rec->type)) ||
>> +         write_exact(ctx->fd, &rec->length, sizeof(rec->length)) )
> No need to round this one up? Or assert that it is already aligned?

The comment by the prototype explains that this is for use by "callers
doing complicated records" (i.e. PAGE_INFO) and that the caller is
responsible for ensuring that length is correct.

>
>> +    {
>> +        PERROR("Unable to write record to stream");
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  static void __attribute__((unused)) build_assertions(void)
>>  {
>>      XC_BUILD_BUG_ON(sizeof(struct ihdr) != 24);
>> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
>> index cbecf0a..d9a3655 100644
>> --- a/tools/libxc/saverestore/common.h
>> +++ b/tools/libxc/saverestore/common.h
>> @@ -1,7 +1,20 @@
>>  #ifndef __COMMON__H
>>  #define __COMMON__H
>>  
>> +#include <stdbool.h>
>> +
>> +// Hack out junk from the namespace
> Do you have a plan to not need these hacks?

Not really.  There are enough other areas of libxc which still use these
macros, and I can't go and simply update all other areas as struct
context is meaningless outside of libxc/saverestore.

>
>> +     * required, the callee should leave '*pages' unouched.
> untouched.
>
>> +     * callee enounceters an error, it should *NOT* free() the memory it
> encounters.
>
>> + * ensuring that they subseqently write the correct amount of data into the
> subsequently.
>
>

Noted.

~Andrew

* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-17 16:28     ` Ian Campbell
@ 2014-06-17 16:38       ` David Vrabel
  2014-06-17 16:54         ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: David Vrabel @ 2014-06-17 16:38 UTC (permalink / raw)
  To: Ian Campbell, David Vrabel; +Cc: Andrew Cooper, Frediano Ziglio, Xen-devel

On 17/06/14 17:28, Ian Campbell wrote:
> On Thu, 2014-06-12 at 11:24 +0100, David Vrabel wrote:
>> Save a domain, calling domain type specific function at the appropriate
>> points.  This implements the xc_domain_save2() API function which is
>> equivalent to the existing xc_domain_save().
>>
>> This writes the image and domain headers, and writes all the PAGE_DATA
>> records using a "live" process.
> 
> It doesn't do "dead" migrate at all?

No, but I think it would be trivial to add.  Do you think it is necessary?

David

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
                     ` (2 preceding siblings ...)
  2014-06-17 15:20   ` Ian Campbell
@ 2014-06-17 16:40   ` Ian Campbell
  2014-06-17 18:04     ` Andrew Cooper
  3 siblings, 1 reply; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:40 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Xen-devel

On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> +The following features are not yet fully specified and will be
> +included in a future draft.
> +
> +* Remus

What is the plan for Remus here?

It has pretty large implications for the flow of a migration stream and
therefore on the code in the final two patches, I suspect it will
require high level changes to those functions, so I'm reluctant to spend
a lot of time on them as they are.

> +TOOLSTACK
> +---------
> +
> +An opaque blob provided by and supplied to the higher layers of the
> +toolstack (e.g., libxl) during save and restore.
> +
> +> This is only temporary -- the intension is the toolstack takes care

"intention"

(please run a spell checker over this document, if it's anything like
the code comments I've seen while reviewing it will be full of spelling
errors).

> +> of this itself.  This record is only present for early development
> +> purposes and will be removed before submissions, along with changes
> +> to libxl which cause libxl to handle this data itself.

How confident are you that this can be done before 4.5? I suspect that
it's going to involve an awful lot of replumbing.

I also think it will interact in exciting ways with the remus stuff.

I think we need to start seeing the start of concrete plans for both of
these things ASAP.

Ian.

* Re: [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework
  2014-06-17 16:17     ` Andrew Cooper
@ 2014-06-17 16:47       ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

On Tue, 2014-06-17 at 17:17 +0100, Andrew Cooper wrote:
> On 17/06/14 17:00, Ian Campbell wrote:
> > On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> >> For testing purposes, the environment variable "XG_MIGRATION_V2" allows the
> >> two save/restore codepaths to coexist, with a runtime switch.
> >>
> >> It is intended that once this series is less RFC, the v2 framework will
> >> completely replace v1.
> > When do you imagine that will be? It seems to me like it could/should
> > reasonably happen now.
> >
> >
> >
> 
> When I have some plausible libxl compatibility in the series, which is
> still ongoing work.
> 
> If this were to be done in two parts, we would end up with a single
> identifiable xl stream containing either the new or the old libxc
> stream, which would make backwards compatibility much harder to retrofit.

I keep finding myself guessing whether weird wrinkles are due to this
hack or are actual wrinkles which I should complain about.

Please make the change to the final form ASAP, otherwise the review
effort is going onto the wrong thing.

Until this series is committed there is no need for any backwards compat
with intermediate forms of the series. We won't commit to it until it is
complete and non-RFC, so no need to worry about creating a halfway
converted stream now.

Ian.

* Re: [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-17 16:28     ` Andrew Cooper
@ 2014-06-17 16:53       ` Ian Campbell
  2014-06-17 18:26         ` Andrew Cooper
  0 siblings, 1 reply; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:53 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Tue, 2014-06-17 at 17:28 +0100, Andrew Cooper wrote:
> >> +// Hack out junk from the namespace
> > Do you have a plan to not need these hacks?
> 
> Not really.  There are enough other areas of libxc which still use these
> macros, and I can't go and simply update all other areas as 

I (or rather git grep) can't see the existing definitions/uses
mfn_to_pfn and pfn_to_mfn outside of xc_domain_{save,restore}.c. Where
are the defined and used outside of those?

(I see some in mini-os, but you specifically said libxc)

Likewise for the *_FIELD stuff which is used in ~2 places outside the
save restore code according to grep.

> struct
> context is meaningless outside of libxc/saverestore.

So how are these used there?

Ian.

* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-17 16:38       ` David Vrabel
@ 2014-06-17 16:54         ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-17 16:54 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Frediano Ziglio, Xen-devel

On Tue, 2014-06-17 at 17:38 +0100, David Vrabel wrote:
> On 17/06/14 17:28, Ian Campbell wrote:
> > On Thu, 2014-06-12 at 11:24 +0100, David Vrabel wrote:
> >> Save a domain, calling domain type specific function at the appropriate
> >> points.  This implements the xc_domain_save2() API function which is
> >> equivalent to the existing xc_domain_save().
> >>
> >> This writes the image and domain headers, and writes all the PAGE_DATA
> >> records using a "live" process.
> > 
> > It doesn't do "dead" migrate at all?
> 
> No, but I think it would be trivial to add.  Do you think it is necessary?

I think so. libxl offers it as an option, and there might be occasions
when you care more about migrating ASAP than about minimising the
non-live downtime (e.g. if the h/w was about to catch fire).

Ian.

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-17 15:20   ` Ian Campbell
@ 2014-06-17 17:42     ` Andrew Cooper
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-17 17:42 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 17/06/14 16:20, Ian Campbell wrote:
> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>> The pandoc markdown format is a superset of plain markdown, and while
>> `markdown` is capable of making plausible documents from it, the results are
>> not fantastic.
> "not fantastic" doesn't sound like a reason to not generate the html and
> txt versions of it.
>
> Ian.
>

Actually it turns out that adding txt and html support with pandoc is
trivial, so I have done.

~Andrew

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-17 16:40   ` Ian Campbell
@ 2014-06-17 18:04     ` Andrew Cooper
  2014-06-19  9:13       ` Hongyang Yang
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-17 18:04 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, Xen-devel

On 17/06/14 17:40, Ian Campbell wrote:
> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>> +The following features are not yet fully specified and will be
>> +included in a future draft.
>> +
>> +* Remus
> What is the plan for Remus here?
>
> It has pretty large implications for the flow of a migration stream and
> therefore on the code in the final two patches, I suspect it will
> require high level changes to those functions, so I'm reluctant to spend
> a lot of time on them as they are.

I don't believe too much change will be required to the final two
patches, but it does depend on fixing the current qemu record layer
violations.

It will be much easier to do after a prototype to the libxl level fixes.

>> +> of this itself.  This record is only present for early development
>> +> purposes and will be removed before submissions, along with changes
>> +> to libxl which cause libxl to handle this data itself.
> How confident are you that this can be done before 4.5? I suspect that
> it's going to involve an awful lot of replumbing.

I suspect it will, and I frankly have no idea.  It is my experience in
the past that hacking on libxl turns out to be harder than expected.

>
> I also think it will interact in exciting ways with the remus stuff.

Less so, I think.  Remus mandates identical toolstacks on either side,
so the problem of legacy remus to new remus doesn't exist.

>
> I think we need to start seeing the start of concrete plans for both of
> these things ASAP.

Indeed.

The concrete plan is that it needs to be done, and it needs to be done
to a sufficient extent with general upstream approval that I can be
fairly confident that what upstream accepts isn't different to the
version which gets shipped with XenServer.$NEXT.  On the other hand,
getting Xapi + libxc working for XenServer.$NEXT is critical as
customers would like to be able to migrate their vms.

I would also like to sleep at some point.

~Andrew

* Re: [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-17 16:53       ` Ian Campbell
@ 2014-06-17 18:26         ` Andrew Cooper
  2014-06-18  9:19           ` Ian Campbell
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-17 18:26 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 17/06/14 17:53, Ian Campbell wrote:
> On Tue, 2014-06-17 at 17:28 +0100, Andrew Cooper wrote:
>>>> +// Hack out junk from the namespace
>>> Do you have a plan to not need these hacks?
>> Not really.  There are enough other areas of libxc which still use these
>> macros, and I can't go and simply update all other areas as 
> I (or rather git grep) can't see the existing definitions/uses
> mfn_to_pfn and pfn_to_mfn outside of xc_domain_{save,restore}.c. Where
> are the defined and used outside of those?

mfn_to_pfn, it turns out, isn't.  pfn_to_mfn is used once in xc_domain.c.
Open-coding it there might be a solution.

>
> (I see some in mini-os, but you specifically said libxc)

MiniOS will undoubtedly be using its kernel versions of these functions,
so it is not relevant here.

>
> Likewise for the *_FIELD stuff which is used in ~2 places outside the
> save restore code according to grep.

xc_core_x86.c defines GET_FIELD() itself, so clearly doesn't use
xg_save_restore.h.

xc_resume.c does use xg_save_restore.h, but could probably be
converted to be similar to xc_core_x86.c.

>
>> struct
>> context is meaningless outside of libxc/saverestore.
> So how are these used there?
>
> Ian.
>

They are not.  They are reimplemented in common_x86_pv.h so as not to
depend on magic locally-scoped variables with specific names.

~Andrew

* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-11 18:14 ` [PATCH v5 RFC 13/14] tools/libxc: noarch save code Andrew Cooper
  2014-06-12 10:24   ` David Vrabel
@ 2014-06-18  6:59   ` Hongyang Yang
  2014-06-18  7:08     ` Hongyang Yang
  2014-06-19  2:48   ` Wen Congyang
  2 siblings, 1 reply; 76+ messages in thread
From: Hongyang Yang @ 2014-06-18  6:59 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

On 06/12/2014 02:14 AM, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>   tools/libxc/saverestore/save.c |  545 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 544 insertions(+), 1 deletion(-)
>
> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
> index f6ad734..9ad43a5 100644
> --- a/tools/libxc/saverestore/save.c
> +++ b/tools/libxc/saverestore/save.c
> @@ -1,11 +1,554 @@
> +#include <assert.h>
> +#include <arpa/inet.h>
> +
>   #include "common.h"
>
> +/*
> + * Writes an Image header and Domain header into the stream.
> + */
> +static int write_headers(struct context *ctx, uint16_t guest_type)
> +{
> +    xc_interface *xch = ctx->xch;
...snip...
> +/*
> + * Send all domain memory.  This is the heart of the live migration loop.
> + */
> +static int send_domain_memory(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
> +    xc_shadow_op_stats_t stats = { -1, -1 };
> +    unsigned pages_written;
> +    unsigned x, max_iter = 5, dirty_threshold = 50;
> +    xen_pfn_t p;
> +    int rc = -1;
> +
> +    to_send = xc_hypercall_buffer_alloc_pages(
> +        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
> +
> +    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
> +    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
> +
> +    if ( !ctx->save.batch_pfns || !to_send || !ctx->save.deferred_pages )
> +    {
> +        ERROR("Unable to allocate memory for to_{send,fix}/batch bitmaps");
> +        goto out;
> +    }
> +
> +    if ( xc_shadow_control(xch, ctx->domid,
> +                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
> +                           NULL, 0, NULL, 0, NULL) < 0 )
> +    {
> +        PERROR("Failed to enable logdirty");
> +        goto out;
> +    }
> +
> +    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
> +    {
> +        if ( x == 0 )
> +        {
> +            /* First iteration, send all pages. */
> +            memset(to_send, 0xff, bitmap_size(ctx->save.p2m_size));
> +        }
> +        else
> +        {
> +            /* Else consult the dirty bitmap. */
> +            if ( xc_shadow_control(
> +                     xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
> +                     HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
> +                     NULL, 0, &stats) != ctx->save.p2m_size )
> +            {
> +                PERROR("Failed to retrieve logdirty bitmap");
> +                rc = -1;
> +                goto out;
> +            }
> +            else
> +                DPRINTF("  Wrote %u pages; stats: faults %"PRIu32", dirty %"PRIu32,
> +                        pages_written, stats.fault_count, stats.dirty_count);
> +            pages_written = 0;
> +
> +            if ( stats.dirty_count < dirty_threshold )
> +                break;
> +        }
> +
> +        DPRINTF("Iteration %u", x);
> +
> +        for ( p = 0 ; p < ctx->save.p2m_size; ++p )
> +        {
> +            if ( test_bit(p, to_send) )
> +            {
> +                rc = add_to_batch(ctx, p);
> +                if ( rc )
> +                    goto out;
> +                ++pages_written;
> +            }
> +        }
> +
> +        rc = flush_batch(ctx);
> +        if ( rc )
> +            goto out;
> +    }
> +
> +    rc = pause_domain(ctx);
> +    if ( rc )
> +        goto out;
> +
> +    if ( xc_shadow_control(
> +             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
> +             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
> +             NULL, 0, &stats) != ctx->save.p2m_size )
> +    {
> +        PERROR("Failed to retrieve logdirty bitmap");
> +        rc = -1;
> +        goto out;
> +    }
> +
> +    for ( p = 0, pages_written = 0 ; p < ctx->save.p2m_size; ++p )
> +    {
> +        if ( test_bit(p, to_send) || test_bit(p, ctx->save.deferred_pages) )
> +        {
> +            rc = add_to_batch(ctx, p);
> +            if ( rc )
> +                goto out;
> +            ++pages_written;
> +        }
> +    }
> +
> +    rc = flush_batch(ctx);
> +    if ( rc )
> +        goto out;
> +
> +    DPRINTF("  Wrote %u pages", pages_written);
> +    IPRINTF("Sent all pages");
> +
> +  out:
> +    xc_hypercall_buffer_free_pages(xch, to_send,
> +                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
> +    free(ctx->save.deferred_pages);
> +    free(ctx->save.batch_pfns);
> +    return rc;
> +}
> +
> +/*
> + * Save a domain.
> + */
> +static int save(struct context *ctx, uint16_t guest_type)
> +{
> +    xc_interface *xch = ctx->xch;
> +    int rc, saved_rc = 0, saved_errno = 0;
> +
> +    IPRINTF("Saving domain %d, type %s",
> +            ctx->domid, dhdr_type_to_str(guest_type));
> +
> +    rc = ctx->save.ops.setup(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    rc = write_headers(ctx, guest_type);
> +    if ( rc )
> +        goto err;
> +
> +    rc = ctx->save.ops.start_of_stream(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    rc = send_domain_memory(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    /* Refresh domain information now it has paused. */
> +    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
> +         (ctx->dominfo.domid != ctx->domid) )
> +    {
> +        PERROR("Unable to refresh domain information");
> +        rc = -1;
> +        goto err;
> +    }
> +    else if ( (!ctx->dominfo.shutdown ||
> +               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
> +              !ctx->dominfo.paused )
> +    {
> +        ERROR("Domain has not been suspended");
> +        rc = -1;
> +        goto err;
> +    }
> +
> +    rc = ctx->save.ops.end_of_stream(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    rc = write_end_record(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
> +                      NULL, 0, NULL, 0, NULL);

If migration fails because log-dirty is already enabled, or an error occurs
after we have enabled log-dirty, we should turn the shadow op off; otherwise
subsequent migrations will always fail. This op should therefore also run on
the error path. The following patch fixes it.

diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index 9ad43a5..6e9d325 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -474,9 +474,6 @@ static int save(struct context *ctx, uint16_t guest_type)
      if ( rc )
          goto err;

-    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                      NULL, 0, NULL, 0, NULL);
-
      IPRINTF("Save successful");
      goto done;

@@ -490,6 +487,9 @@ static int save(struct context *ctx, uint16_t guest_type)
      if ( rc )
          PERROR("Failed to clean up");

+    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                      NULL, 0, NULL, 0, NULL);
+
      if ( saved_rc )
      {
          rc = saved_rc;


> +
> +    IPRINTF("Save successful");
> +    goto done;
> +
> + err:
> +    saved_errno = errno;
> +    saved_rc = rc;
> +    PERROR("Save failed");
> +
> + done:
> +    rc = ctx->save.ops.cleanup(ctx);
> +    if ( rc )
> +        PERROR("Failed to clean up");
> +
> +    if ( saved_rc )
> +    {
> +        rc = saved_rc;
> +        errno = saved_errno;
> +    }
> +
> +    return rc;
> +};
> +
>   int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>                       uint32_t max_factor, uint32_t flags,
>                       struct save_callbacks* callbacks, int hvm)
>   {
> +    struct context ctx =
> +        {
> +            .xch = xch,
> +            .fd = io_fd,
> +        };
> +
> +    /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions :( */
> +    ctx.save.callbacks = callbacks;
> +
>       IPRINTF("In experimental %s", __func__);
> -    return -1;
> +
> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
> +    {
> +        PERROR("Failed to get domain info");
> +        return -1;
> +    }
> +
> +    if ( ctx.dominfo.domid != dom )
> +    {
> +        ERROR("Domain %d does not exist", dom);
> +        return -1;
> +    }
> +
> +    ctx.domid = dom;
> +    IPRINTF("Saving domain %d", dom);
> +
> +    ctx.save.p2m_size = xc_domain_maximum_gpfn(xch, dom) + 1;
> +    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
> +    {
> +        errno = E2BIG;
> +        ERROR("Cannot save this big a guest");
> +        return -1;
> +    }
> +
> +    if ( ctx.dominfo.hvm )
> +    {
> +        ctx.ops = common_ops_x86_hvm;
> +        ctx.save.ops = save_ops_x86_hvm;
> +        return save(&ctx, DHDR_TYPE_X86_HVM);
> +    }
> +    else
> +    {
> +        ctx.ops = common_ops_x86_pv;
> +        ctx.save.ops = save_ops_x86_pv;
> +        return save(&ctx, DHDR_TYPE_X86_PV);
> +    }
>   }
>
>   /*
>

-- 
Thanks,
Yang.

* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-18  6:59   ` Hongyang Yang
@ 2014-06-18  7:08     ` Hongyang Yang
  0 siblings, 0 replies; 76+ messages in thread
From: Hongyang Yang @ 2014-06-18  7:08 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

On 06/18/2014 02:59 PM, Hongyang Yang wrote:
> On 06/12/2014 02:14 AM, Andrew Cooper wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>   tools/libxc/saverestore/save.c |  545 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 544 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
>> index f6ad734..9ad43a5 100644
>> --- a/tools/libxc/saverestore/save.c
>> +++ b/tools/libxc/saverestore/save.c
>> @@ -1,11 +1,554 @@
>> +#include <assert.h>
>> +#include <arpa/inet.h>
>> +
>>   #include "common.h"
>>
>> +/*
>> + * Writes an Image header and Domain header into the stream.
>> + */
>> +static int write_headers(struct context *ctx, uint16_t guest_type)
>> +{
>> +    xc_interface *xch = ctx->xch;
> ...snip...
>> +/*
>> + * Send all domain memory.  This is the heart of the live migration loop.
>> + */
>> +static int send_domain_memory(struct context *ctx)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
>> +    xc_shadow_op_stats_t stats = { -1, -1 };
>> +    unsigned pages_written;
>> +    unsigned x, max_iter = 5, dirty_threshold = 50;
>> +    xen_pfn_t p;
>> +    int rc = -1;
>> +
>> +    to_send = xc_hypercall_buffer_alloc_pages(
>> +        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
>> +
>> +    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
>> +    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
>> +
>> +    if ( !ctx->save.batch_pfns || !to_send || !ctx->save.deferred_pages )
>> +    {
>> +        ERROR("Unable to allocate memory for to_{send,fix}/batch bitmaps");
>> +        goto out;
>> +    }
>> +
>> +    if ( xc_shadow_control(xch, ctx->domid,
>> +                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
>> +                           NULL, 0, NULL, 0, NULL) < 0 )
>> +    {
>> +        PERROR("Failed to enable logdirty");
>> +        goto out;
>> +    }
>> +
>> +    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
>> +    {
>> +        if ( x == 0 )
>> +        {
>> +            /* First iteration, send all pages. */
>> +            memset(to_send, 0xff, bitmap_size(ctx->save.p2m_size));
>> +        }
>> +        else
>> +        {
>> +            /* Else consult the dirty bitmap. */
>> +            if ( xc_shadow_control(
>> +                     xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
>> +                     HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
>> +                     NULL, 0, &stats) != ctx->save.p2m_size )
>> +            {
>> +                PERROR("Failed to retrieve logdirty bitmap");
>> +                rc = -1;
>> +                goto out;
>> +            }
>> +            else
>> +                DPRINTF("  Wrote %u pages; stats: faults %"PRIu32", dirty %"PRIu32,
>> +                        pages_written, stats.fault_count, stats.dirty_count);
>> +            pages_written = 0;
>> +
>> +            if ( stats.dirty_count < dirty_threshold )
>> +                break;
>> +        }
>> +
>> +        DPRINTF("Iteration %u", x);
>> +
>> +        for ( p = 0 ; p < ctx->save.p2m_size; ++p )
>> +        {
>> +            if ( test_bit(p, to_send) )
>> +            {
>> +                rc = add_to_batch(ctx, p);
>> +                if ( rc )
>> +                    goto out;
>> +                ++pages_written;
>> +            }
>> +        }
>> +
>> +        rc = flush_batch(ctx);
>> +        if ( rc )
>> +            goto out;
>> +    }
>> +
>> +    rc = pause_domain(ctx);
>> +    if ( rc )
>> +        goto out;
>> +
>> +    if ( xc_shadow_control(
>> +             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
>> +             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
>> +             NULL, 0, &stats) != ctx->save.p2m_size )
>> +    {
>> +        PERROR("Failed to retrieve logdirty bitmap");
>> +        rc = -1;
>> +        goto out;
>> +    }
>> +
>> +    for ( p = 0, pages_written = 0 ; p < ctx->save.p2m_size; ++p )
>> +    {
>> +        if ( test_bit(p, to_send) || test_bit(p, ctx->save.deferred_pages) )
>> +        {
>> +            rc = add_to_batch(ctx, p);
>> +            if ( rc )
>> +                goto out;
>> +            ++pages_written;
>> +        }
>> +    }
>> +
>> +    rc = flush_batch(ctx);
>> +    if ( rc )
>> +        goto out;
>> +
>> +    DPRINTF("  Wrote %u pages", pages_written);
>> +    IPRINTF("Sent all pages");
>> +
>> +  out:
>> +    xc_hypercall_buffer_free_pages(xch, to_send,
>> +                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
>> +    free(ctx->save.deferred_pages);
>> +    free(ctx->save.batch_pfns);
>> +    return rc;
>> +}
>> +
>> +/*
>> + * Save a domain.
>> + */
>> +static int save(struct context *ctx, uint16_t guest_type)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    int rc, saved_rc = 0, saved_errno = 0;
>> +
>> +    IPRINTF("Saving domain %d, type %s",
>> +            ctx->domid, dhdr_type_to_str(guest_type));
>> +
>> +    rc = ctx->save.ops.setup(ctx);
>> +    if ( rc )
>> +        goto err;
>> +
>> +    rc = write_headers(ctx, guest_type);
>> +    if ( rc )
>> +        goto err;
>> +
>> +    rc = ctx->save.ops.start_of_stream(ctx);
>> +    if ( rc )
>> +        goto err;
>> +
>> +    rc = send_domain_memory(ctx);
>> +    if ( rc )
>> +        goto err;
>> +
>> +    /* Refresh domain information now it has paused. */
>> +    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
>> +         (ctx->dominfo.domid != ctx->domid) )
>> +    {
>> +        PERROR("Unable to refresh domain information");
>> +        rc = -1;
>> +        goto err;
>> +    }
>> +    else if ( (!ctx->dominfo.shutdown ||
>> +               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
>> +              !ctx->dominfo.paused )
>> +    {
>> +        ERROR("Domain has not been suspended");
>> +        rc = -1;
>> +        goto err;
>> +    }
>> +
>> +    rc = ctx->save.ops.end_of_stream(ctx);
>> +    if ( rc )
>> +        goto err;
>> +
>> +    rc = write_end_record(ctx);
>> +    if ( rc )
>> +        goto err;
>> +
>> +    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
>> +                      NULL, 0, NULL, 0, NULL);
>
> If migration fails because log-dirty mode is already enabled, or an error
> occurs after we have enabled log-dirty, we should turn the shadow op off;
> otherwise every subsequent migration attempt will fail. So this op should
> also run on the error path. The following patch fixes it.

Alternatively, we could do what the old migration code does: if enabling
log-dirty fails, turn it off and try to enable it again.

>
> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
> index 9ad43a5..6e9d325 100644
> --- a/tools/libxc/saverestore/save.c
> +++ b/tools/libxc/saverestore/save.c
> @@ -474,9 +474,6 @@ static int save(struct context *ctx, uint16_t guest_type)
>       if ( rc )
>           goto err;
>
> -    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
> -                      NULL, 0, NULL, 0, NULL);
> -
>       IPRINTF("Save successful");
>       goto done;
>
> @@ -490,6 +487,9 @@ static int save(struct context *ctx, uint16_t guest_type)
>       if ( rc )
>           PERROR("Failed to clean up");
>
> +    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
> +                      NULL, 0, NULL, 0, NULL);
> +
>       if ( saved_rc )
>       {
>           rc = saved_rc;
>
>
>> +
>> +    IPRINTF("Save successful");
>> +    goto done;
>> +
>> + err:
>> +    saved_errno = errno;
>> +    saved_rc = rc;
>> +    PERROR("Save failed");
>> +
>> + done:
>> +    rc = ctx->save.ops.cleanup(ctx);
>> +    if ( rc )
>> +        PERROR("Failed to clean up");
>> +
>> +    if ( saved_rc )
>> +    {
>> +        rc = saved_rc;
>> +        errno = saved_errno;
>> +    }
>> +
>> +    return rc;
>> +};
>> +
>>   int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>>                       uint32_t max_factor, uint32_t flags,
>>                       struct save_callbacks* callbacks, int hvm)
>>   {
>> +    struct context ctx =
>> +        {
>> +            .xch = xch,
>> +            .fd = io_fd,
>> +        };
>> +
>> +    /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions :( */
>> +    ctx.save.callbacks = callbacks;
>> +
>>       IPRINTF("In experimental %s", __func__);
>> -    return -1;
>> +
>> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
>> +    {
>> +        PERROR("Failed to get domain info");
>> +        return -1;
>> +    }
>> +
>> +    if ( ctx.dominfo.domid != dom )
>> +    {
>> +        ERROR("Domain %d does not exist", dom);
>> +        return -1;
>> +    }
>> +
>> +    ctx.domid = dom;
>> +    IPRINTF("Saving domain %d", dom);
>> +
>> +    ctx.save.p2m_size = xc_domain_maximum_gpfn(xch, dom) + 1;
>> +    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
>> +    {
>> +        errno = E2BIG;
>> +        ERROR("Cannot save this big a guest");
>> +        return -1;
>> +    }
>> +
>> +    if ( ctx.dominfo.hvm )
>> +    {
>> +        ctx.ops = common_ops_x86_hvm;
>> +        ctx.save.ops = save_ops_x86_hvm;
>> +        return save(&ctx, DHDR_TYPE_X86_HVM);
>> +    }
>> +    else
>> +    {
>> +        ctx.ops = common_ops_x86_pv;
>> +        ctx.save.ops = save_ops_x86_pv;
>> +        return save(&ctx, DHDR_TYPE_X86_PV);
>> +    }
>>   }
>>
>>   /*
>>
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 05/14] tools/libxc: noarch common code
  2014-06-17 18:26         ` Andrew Cooper
@ 2014-06-18  9:19           ` Ian Campbell
  0 siblings, 0 replies; 76+ messages in thread
From: Ian Campbell @ 2014-06-18  9:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On Tue, 2014-06-17 at 19:26 +0100, Andrew Cooper wrote:
> On 17/06/14 17:53, Ian Campbell wrote:
> > On Tue, 2014-06-17 at 17:28 +0100, Andrew Cooper wrote:
> >>>> +// Hack out junk from the namespace
> >>> Do you have a plan to not need these hacks?
> >> Not really.  There are enough other areas of libxc which still use these
> >> macros, and I can't go and simply update all other areas as 
> > I (or rather git grep) can't see the existing definitions/uses
> > of mfn_to_pfn and pfn_to_mfn outside of xc_domain_{save,restore}.c. Where
> > are they defined and used outside of those?
> 
> mfn_to_pfn it turns out isn't.  pfn_to_mfn is used once in xc_domain.c. 

Ah, I grepped for mfn_to_pfn twice so never for pfn_to_mfn...

tools/libxc/xc_offline_page.c uses it too BTW.

I'd be inclined to namespace this existing function (i.e. prefix xc_*)
which will stop it clashing with any use you want to make of the more
generic name in the new migration code.

> > Likewise for the *_FIELD stuff which is used in ~2 places outside the
> > save restore code according to grep.
> 
> xc_core_x86.c defines itself GET_FIELD() so clearly doesn't use
> xg_save_restore.h
> 
> xc_resume.c clearly uses xg_save_restore.h but could probably be
> converted to be similar to xc_core_x86.c

Arguably xc_domain_resume could/should be made part of the new migration
infrastructure anyway.

> >> struct
> >> context is meaningless outside of libxc/saverestore.
> > So how are these used there?
> >
> > Ian.
> >
> 
> They are not.  They are reimplemented in common_x86_pv.h so as to not
> take magic locally scoped variables with specific names.

Either fix the existing ones to suit your tastes or just bite your
tongue and use the existing macros, don't add hacks.

Ian.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-11 18:14 ` [PATCH v5 RFC 13/14] tools/libxc: noarch save code Andrew Cooper
  2014-06-12 10:24   ` David Vrabel
  2014-06-18  6:59   ` Hongyang Yang
@ 2014-06-19  2:48   ` Wen Congyang
  2014-06-19  9:19     ` Andrew Cooper
  2 siblings, 1 reply; 76+ messages in thread
From: Wen Congyang @ 2014-06-19  2:48 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

At 06/12/2014 02:14 AM, Andrew Cooper Wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  tools/libxc/saverestore/save.c |  545 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 544 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
> index f6ad734..9ad43a5 100644
> --- a/tools/libxc/saverestore/save.c
> +++ b/tools/libxc/saverestore/save.c
> @@ -1,11 +1,554 @@
> +#include <assert.h>
> +#include <arpa/inet.h>
> +
>  #include "common.h"
>  
> +/*
> + * Writes an Image header and Domain header into the stream.
> + */
> +static int write_headers(struct context *ctx, uint16_t guest_type)
> +{
> +    xc_interface *xch = ctx->xch;
> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
> +    struct ihdr ihdr =
> +        {
> +            .marker  = IHDR_MARKER,
> +            .id      = htonl(IHDR_ID),
> +            .version = htonl(IHDR_VERSION),
> +            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
> +        };
> +    struct dhdr dhdr =
> +        {
> +            .type       = guest_type,
> +            .page_shift = XC_PAGE_SHIFT,
> +            .xen_major  = (xen_version >> 16) & 0xffff,
> +            .xen_minor  = (xen_version)       & 0xffff,
> +        };
> +
> +    if ( xen_version < 0 )
> +    {
> +        PERROR("Unable to obtain Xen Version");
> +        return -1;
> +    }
> +
> +    if ( write_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
> +    {
> +        PERROR("Unable to write Image Header to stream");
> +        return -1;
> +    }
> +
> +    if ( write_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
> +    {
> +        PERROR("Unable to write Domain Header to stream");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/*
> + * Writes an END record into the stream.
> + */
> +static int write_end_record(struct context *ctx)
> +{
> +    struct record end = { REC_TYPE_END, 0, NULL };
> +
> +    return write_record(ctx, &end);
> +}
> +
> +/*
> + * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
> + * is constructed in ctx->save.batch_pfns.
> + *
> + * This function:
> + * - gets the types for each pfn in the batch.
> + * - for each pfn with real data:
> + *   - maps and attempts to localise the pages.
> + * - construct and writes a PAGE_DATA record into the stream.
> + */
> +static int write_batch(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    xen_pfn_t *mfns = NULL, *types = NULL;
> +    void *guest_mapping = NULL;
> +    void **guest_data = NULL;
> +    void **local_pages = NULL;
> +    int *errors = NULL, rc = -1;
> +    unsigned i, p, nr_pages = 0;
> +    unsigned nr_pfns = ctx->save.nr_batch_pfns;
> +    void *page, *orig_page;
> +    uint64_t *rec_pfns = NULL;
> +    struct rec_page_data_header hdr = { 0 };
> +    struct record rec =
> +    {
> +        .type = REC_TYPE_PAGE_DATA,
> +    };
> +
> +    assert(nr_pfns != 0);
> +
> +    /* Mfns of the batch pfns. */
> +    mfns = malloc(nr_pfns * sizeof(*mfns));
> +    /* Types of the batch pfns. */
> +    types = malloc(nr_pfns * sizeof(*types));
> +    /* Errors from attempting to map the mfns. */
> +    errors = malloc(nr_pfns * sizeof(*errors));
> +    /* Pointers to page data to send.  Either mapped mfns or local allocations. */
> +    guest_data = calloc(nr_pfns, sizeof(*guest_data));
> +    /* Pointers to locally allocated pages.  Need freeing. */
> +    local_pages = calloc(nr_pfns, sizeof(*local_pages));

This function is called many times, so we allocate and free this
memory again and again, which may hurt performance.

I think we can allocate these buffers once at the setup stage, and only
clear guest_data/local_pages here.

> +
> +    if ( !mfns || !types || !errors || !guest_data || !local_pages )
> +    {
> +        ERROR("Unable to allocate arrays for a batch of %u pages",
> +              nr_pfns);
> +        goto err;
> +    }
> +
> +    for ( i = 0; i < nr_pfns; ++i )
> +    {
> +        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->save.batch_pfns[i]);
> +
> +        /* Likely a ballooned page. */
> +        if ( mfns[i] == INVALID_MFN )
> +            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
> +    }
> +
> +    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
> +    if ( rc )
> +    {
> +        PERROR("Failed to get types for pfn batch");
> +        goto err;
> +    }
> +    rc = -1;
> +
> +    for ( i = 0; i < nr_pfns; ++i )
> +    {
> +        switch ( types[i] )
> +        {
> +        case XEN_DOMCTL_PFINFO_BROKEN:
> +        case XEN_DOMCTL_PFINFO_XALLOC:
> +        case XEN_DOMCTL_PFINFO_XTAB:
> +            continue;
> +        }
> +
> +        mfns[nr_pages++] = mfns[i];
> +    }
> +
> +    if ( nr_pages > 0 )
> +    {
> +        guest_mapping = xc_map_foreign_bulk(
> +            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
> +        if ( !guest_mapping )
> +        {
> +            PERROR("Failed to map guest pages");
> +            goto err;
> +        }

To support Remus, we will map and unmap guest memory again and again,
which also hurts performance. We could cache the guest mappings here.

Thanks
Wen Congyang

> +    }
> +
> +    for ( i = 0, p = 0; i < nr_pfns; ++i )
> +    {
> +        switch ( types[i] )
> +        {
> +        case XEN_DOMCTL_PFINFO_BROKEN:
> +        case XEN_DOMCTL_PFINFO_XALLOC:
> +        case XEN_DOMCTL_PFINFO_XTAB:
> +            continue;
> +        }
> +
> +        if ( errors[p] )
> +        {
> +            ERROR("Mapping of pfn %#lx (mfn %#lx) failed %d",
> +                  ctx->save.batch_pfns[i], mfns[p], errors[p]);
> +            goto err;
> +        }
> +
> +        orig_page = page = guest_mapping + (p * PAGE_SIZE);
> +        rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
> +        if ( rc )
> +        {
> +            if ( rc == -1 && errno == EAGAIN )
> +            {
> +                set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
> +                types[i] = XEN_DOMCTL_PFINFO_XTAB;
> +                --nr_pages;
> +            }
> +            else
> +                goto err;
> +        }
> +        else
> +            guest_data[i] = page;
> +
> +        if ( page != orig_page )
> +            local_pages[i] = page;
> +        rc = -1;
> +
> +        ++p;
> +    }
> +
> +    rec_pfns = malloc(nr_pfns * sizeof(*rec_pfns));
> +    if ( !rec_pfns )
> +    {
> +        ERROR("Unable to allocate %zu bytes of memory for page data pfn list",
> +              nr_pfns * sizeof(*rec_pfns));
> +        goto err;
> +    }
> +
> +    hdr.count = nr_pfns;
> +
> +    rec.length = sizeof(hdr);
> +    rec.length += nr_pfns * sizeof(*rec_pfns);
> +    rec.length += nr_pages * PAGE_SIZE;
> +
> +    for ( i = 0; i < nr_pfns; ++i )
> +        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.batch_pfns[i];
> +
> +    if ( write_record_header(ctx, &rec) ||
> +         write_exact(ctx->fd, &hdr, sizeof(hdr)) ||
> +         write_exact(ctx->fd, rec_pfns, nr_pfns * sizeof(*rec_pfns)) )
> +    {
> +        PERROR("Failed to write page_type header to stream");
> +        goto err;
> +    }
> +
> +    for ( i = 0; i < nr_pfns; ++i )
> +    {
> +        if ( guest_data[i] )
> +        {
> +            if ( write_exact(ctx->fd, guest_data[i], PAGE_SIZE) )
> +            {
> +                PERROR("Failed to write page into stream");
> +                goto err;
> +            }
> +
> +            --nr_pages;
> +        }
> +    }
> +
> +    /* Sanity check we have sent all the pages we expected to. */
> +    assert(nr_pages == 0);
> +    rc = ctx->save.nr_batch_pfns = 0;
> +
> + err:
> +    free(rec_pfns);
> +    if ( guest_mapping )
> +        munmap(guest_mapping, nr_pages * PAGE_SIZE);
> +    for ( i = 0; local_pages && i < nr_pfns; ++i )
> +            free(local_pages[i]);
> +    free(local_pages);
> +    free(guest_data);
> +    free(errors);
> +    free(types);
> +    free(mfns);
> +
> +    return rc;
> +}
> +
> +/*
> + * Flush a batch of pfns into the stream.
> + */
> +static int flush_batch(struct context *ctx)
> +{
> +    int rc = 0;
> +
> +    if ( ctx->save.nr_batch_pfns == 0 )
> +        return rc;
> +
> +    rc = write_batch(ctx);
> +
> +    if ( !rc )
> +    {
> +        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
> +                                    MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
> +    }
> +
> +    return rc;
> +}
> +
> +/*
> + * Add a single pfn to the batch, flushing the batch if full.
> + */
> +static int add_to_batch(struct context *ctx, xen_pfn_t pfn)
> +{
> +    int rc = 0;
> +
> +    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
> +        rc = flush_batch(ctx);
> +
> +    if ( rc == 0 )
> +        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> +
> +    return rc;
> +}
> +
> +/*
> + * Pause the domain.
> + */
> +static int pause_domain(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    int rc;
> +
> +    if ( !ctx->dominfo.paused )
> +    {
> +        /* TODO: Properly specify the return value from this callback. */
> +        rc = (ctx->save.callbacks->suspend(ctx->save.callbacks->data) != 1);
> +        if ( rc )
> +        {
> +            ERROR("Failed to suspend domain");
> +            return rc;
> +        }
> +    }
> +
> +    IPRINTF("Domain now paused");
> +    return 0;
> +}
> +
> +/*
> + * Send all domain memory.  This is the heart of the live migration loop.
> + */
> +static int send_domain_memory(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    DECLARE_HYPERCALL_BUFFER(unsigned long, to_send);
> +    xc_shadow_op_stats_t stats = { -1, -1 };
> +    unsigned pages_written;
> +    unsigned x, max_iter = 5, dirty_threshold = 50;
> +    xen_pfn_t p;
> +    int rc = -1;
> +
> +    to_send = xc_hypercall_buffer_alloc_pages(
> +        xch, to_send, NRPAGES(bitmap_size(ctx->save.p2m_size)));
> +
> +    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE * sizeof(*ctx->save.batch_pfns));
> +    ctx->save.deferred_pages = calloc(1, bitmap_size(ctx->save.p2m_size));
> +
> +    if ( !ctx->save.batch_pfns || !to_send || !ctx->save.deferred_pages )
> +    {
> +        ERROR("Unable to allocate memory for to_{send,fix}/batch bitmaps");
> +        goto out;
> +    }
> +
> +    if ( xc_shadow_control(xch, ctx->domid,
> +                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
> +                           NULL, 0, NULL, 0, NULL) < 0 )
> +    {
> +        PERROR("Failed to enable logdirty");
> +        goto out;
> +    }
> +
> +    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
> +    {
> +        if ( x == 0 )
> +        {
> +            /* First iteration, send all pages. */
> +            memset(to_send, 0xff, bitmap_size(ctx->save.p2m_size));
> +        }
> +        else
> +        {
> +            /* Else consult the dirty bitmap. */
> +            if ( xc_shadow_control(
> +                     xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
> +                     HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
> +                     NULL, 0, &stats) != ctx->save.p2m_size )
> +            {
> +                PERROR("Failed to retrieve logdirty bitmap");
> +                rc = -1;
> +                goto out;
> +            }
> +            else
> +                DPRINTF("  Wrote %u pages; stats: faults %"PRIu32", dirty %"PRIu32,
> +                        pages_written, stats.fault_count, stats.dirty_count);
> +            pages_written = 0;
> +
> +            if ( stats.dirty_count < dirty_threshold )
> +                break;
> +        }
> +
> +        DPRINTF("Iteration %u", x);
> +
> +        for ( p = 0 ; p < ctx->save.p2m_size; ++p )
> +        {
> +            if ( test_bit(p, to_send) )
> +            {
> +                rc = add_to_batch(ctx, p);
> +                if ( rc )
> +                    goto out;
> +                ++pages_written;
> +            }
> +        }
> +
> +        rc = flush_batch(ctx);
> +        if ( rc )
> +            goto out;
> +    }
> +
> +    rc = pause_domain(ctx);
> +    if ( rc )
> +        goto out;
> +
> +    if ( xc_shadow_control(
> +             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
> +             HYPERCALL_BUFFER(to_send), ctx->save.p2m_size,
> +             NULL, 0, &stats) != ctx->save.p2m_size )
> +    {
> +        PERROR("Failed to retrieve logdirty bitmap");
> +        rc = -1;
> +        goto out;
> +    }
> +
> +    for ( p = 0, pages_written = 0 ; p < ctx->save.p2m_size; ++p )
> +    {
> +        if ( test_bit(p, to_send) || test_bit(p, ctx->save.deferred_pages) )
> +        {
> +            rc = add_to_batch(ctx, p);
> +            if ( rc )
> +                goto out;
> +            ++pages_written;
> +        }
> +    }
> +
> +    rc = flush_batch(ctx);
> +    if ( rc )
> +        goto out;
> +
> +    DPRINTF("  Wrote %u pages", pages_written);
> +    IPRINTF("Sent all pages");
> +
> +  out:
> +    xc_hypercall_buffer_free_pages(xch, to_send,
> +                                   NRPAGES(bitmap_size(ctx->save.p2m_size)));
> +    free(ctx->save.deferred_pages);
> +    free(ctx->save.batch_pfns);
> +    return rc;
> +}
> +
> +/*
> + * Save a domain.
> + */
> +static int save(struct context *ctx, uint16_t guest_type)
> +{
> +    xc_interface *xch = ctx->xch;
> +    int rc, saved_rc = 0, saved_errno = 0;
> +
> +    IPRINTF("Saving domain %d, type %s",
> +            ctx->domid, dhdr_type_to_str(guest_type));
> +
> +    rc = ctx->save.ops.setup(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    rc = write_headers(ctx, guest_type);
> +    if ( rc )
> +        goto err;
> +
> +    rc = ctx->save.ops.start_of_stream(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    rc = send_domain_memory(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    /* Refresh domain information now it has paused. */
> +    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
> +         (ctx->dominfo.domid != ctx->domid) )
> +    {
> +        PERROR("Unable to refresh domain information");
> +        rc = -1;
> +        goto err;
> +    }
> +    else if ( (!ctx->dominfo.shutdown ||
> +               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
> +              !ctx->dominfo.paused )
> +    {
> +        ERROR("Domain has not been suspended");
> +        rc = -1;
> +        goto err;
> +    }
> +
> +    rc = ctx->save.ops.end_of_stream(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    rc = write_end_record(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
> +                      NULL, 0, NULL, 0, NULL);
> +
> +    IPRINTF("Save successful");
> +    goto done;
> +
> + err:
> +    saved_errno = errno;
> +    saved_rc = rc;
> +    PERROR("Save failed");
> +
> + done:
> +    rc = ctx->save.ops.cleanup(ctx);
> +    if ( rc )
> +        PERROR("Failed to clean up");
> +
> +    if ( saved_rc )
> +    {
> +        rc = saved_rc;
> +        errno = saved_errno;
> +    }
> +
> +    return rc;
> +};
> +
>  int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>                      uint32_t max_factor, uint32_t flags,
>                      struct save_callbacks* callbacks, int hvm)
>  {
> +    struct context ctx =
> +        {
> +            .xch = xch,
> +            .fd = io_fd,
> +        };
> +
> +    /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions :( */
> +    ctx.save.callbacks = callbacks;
> +
>      IPRINTF("In experimental %s", __func__);
> -    return -1;
> +
> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
> +    {
> +        PERROR("Failed to get domain info");
> +        return -1;
> +    }
> +
> +    if ( ctx.dominfo.domid != dom )
> +    {
> +        ERROR("Domain %d does not exist", dom);
> +        return -1;
> +    }
> +
> +    ctx.domid = dom;
> +    IPRINTF("Saving domain %d", dom);
> +
> +    ctx.save.p2m_size = xc_domain_maximum_gpfn(xch, dom) + 1;
> +    if ( ctx.save.p2m_size > ~XEN_DOMCTL_PFINFO_LTAB_MASK )
> +    {
> +        errno = E2BIG;
> +        ERROR("Cannot save this big a guest");
> +        return -1;
> +    }
> +
> +    if ( ctx.dominfo.hvm )
> +    {
> +        ctx.ops = common_ops_x86_hvm;
> +        ctx.save.ops = save_ops_x86_hvm;
> +        return save(&ctx, DHDR_TYPE_X86_HVM);
> +    }
> +    else
> +    {
> +        ctx.ops = common_ops_x86_pv;
> +        ctx.save.ops = save_ops_x86_pv;
> +        return save(&ctx, DHDR_TYPE_X86_PV);
> +    }
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 14/14] tools/libxc: noarch restore code
  2014-06-11 18:14 ` [PATCH v5 RFC 14/14] tools/libxc: noarch restore code Andrew Cooper
  2014-06-12 10:27   ` David Vrabel
  2014-06-12 16:05   ` David Vrabel
@ 2014-06-19  6:16   ` Hongyang Yang
  2014-06-19  9:00     ` Andrew Cooper
  2 siblings, 1 reply; 76+ messages in thread
From: Hongyang Yang @ 2014-06-19  6:16 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Frediano Ziglio, David Vrabel

On 06/12/2014 02:14 AM, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>   tools/libxc/saverestore/common.h  |    6 +
>   tools/libxc/saverestore/restore.c |  556 ++++++++++++++++++++++++++++++++++++-
>   2 files changed, 561 insertions(+), 1 deletion(-)
>
> diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
> index e16e0de..2d44961 100644
> --- a/tools/libxc/saverestore/common.h
> +++ b/tools/libxc/saverestore/common.h
> @@ -292,6 +292,12 @@ static inline int write_record(struct context *ctx, struct record *rec)
>       return write_split_record(ctx, rec, NULL, 0);
>   }
>
...snip...
> +/*
> + * Given a list of pfns, their types, and a block of page data from the
> + * stream, populate and record their types, map the relevent subset and copy
> + * the data into the guest.
> + */
> +static int process_page_data(struct context *ctx, unsigned count,
> +                             xen_pfn_t *pfns, uint32_t *types, void *page_data)
> +{
> +    xc_interface *xch = ctx->xch;
> +    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
> +    int *map_errs = malloc(count * sizeof(*map_errs));
> +    int rc = -1;
> +    void *mapping = NULL, *guest_page = NULL;
> +    unsigned i,    /* i indexes the pfns from the record. */
> +        j,         /* j indexes the subset of pfns we decide to map. */
> +        nr_pages;
> +
> +    if ( !mfns || !map_errs )
> +    {
> +        ERROR("Failed to allocate %zu bytes to process page data",
> +              count * (sizeof(*mfns) + sizeof(*map_errs)));
> +        goto err;
> +    }
> +
> +    rc = populate_pfns(ctx, count, pfns, types);
> +    if ( rc )
> +    {
> +        ERROR("Failed to populate pfns for batch of %u pages", count);
> +        goto err;
> +    }
> +    rc = -1;
> +
> +    for ( i = 0, nr_pages = 0; i < count; ++i )
> +    {
> +        ctx->ops.set_page_type(ctx, pfns[i], types[i]);
> +
> +        switch ( types[i] )
> +        {
> +        case XEN_DOMCTL_PFINFO_NOTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L1TAB:
> +        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L2TAB:
> +        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L3TAB:
> +        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +        case XEN_DOMCTL_PFINFO_L4TAB:
> +        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
> +
> +            mfns[nr_pages++] = ctx->ops.pfn_to_gfn(ctx, pfns[i]);
> +            break;
> +        }
> +
> +    }
> +
> +    if ( nr_pages > 0 )
> +    {
> +        mapping = guest_page = xc_map_foreign_bulk(
> +            xch, ctx->domid, PROT_READ | PROT_WRITE,
> +            mfns, map_errs, nr_pages);
> +        if ( !mapping )
> +        {
> +            PERROR("Unable to map %u mfns for %u pages of data",
> +                   nr_pages, count);
> +            goto err;
> +        }
> +    }
> +
> +    for ( i = 0, j = 0; i < count; ++i )
> +    {
> +        switch ( types[i] )
> +        {
> +        case XEN_DOMCTL_PFINFO_XTAB:
> +        case XEN_DOMCTL_PFINFO_BROKEN:
> +        case XEN_DOMCTL_PFINFO_XALLOC:
> +            /* No page data to deal with. */
> +            continue;
> +        }
> +
> +        if ( map_errs[j] )
> +        {
> +            ERROR("Mapping pfn %lx (mfn %lx, type %#"PRIx32")failed with %d",
> +                  pfns[i], mfns[j], types[i], map_errs[j]);

rc = -1 is missing here: rc could still be 0 at this point, because a
previous loop iteration may have set it via the call below:
     rc = ctx->restore.ops.localise_page(ctx, types[i], guest_page);

> +            goto err;
> +        }
> +
> +        memcpy(guest_page, page_data, PAGE_SIZE);
> +
> +        /* Undo page normalisation done by the saver. */
> +        rc = ctx->restore.ops.localise_page(ctx, types[i], guest_page);
> +        if ( rc )
> +        {
> +            DPRINTF("Failed to localise");
> +            goto err;
> +        }
> +
> +        ++j;
> +        guest_page += PAGE_SIZE;
> +        page_data += PAGE_SIZE;
> +    }
> +
> +    rc = 0;
> +
> + err:
> +    if ( mapping )
> +        munmap(mapping, nr_pages * PAGE_SIZE);
> +
> +    free(map_errs);
> +    free(mfns);
> +
> +    return rc;
> +}
> +
> +/*
> + * Validate a PAGE_DATA record from the stream, and pass the results to
> + * process_page_data() to actually perform the legwork.
> + */
> +static int handle_page_data(struct context *ctx, struct record *rec)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct rec_page_data_header *pages = rec->data;
> +    unsigned i, pages_of_data = 0;
> +    int rc = -1;
> +
> +    xen_pfn_t *pfns = NULL, pfn;
> +    uint32_t *types = NULL, type;
> +
> +    if ( rec->length < sizeof(*pages) )
> +    {
> +        ERROR("PAGE_DATA record truncated: length %"PRIu32", min %zu",
> +              rec->length, sizeof(*pages));
> +        goto err;
> +    }
> +    else if ( pages->count < 1 )
> +    {
> +        ERROR("Expected at least 1 pfn in PAGE_DATA record");
> +        goto err;
> +    }
> +    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
> +    {
> +        ERROR("PAGE_DATA record (length %"PRIu32") too short to contain %"
> +              PRIu32" pfns worth of information", rec->length, pages->count);
> +        goto err;
> +    }
> +
> +    pfns = malloc(pages->count * sizeof(*pfns));
> +    types = malloc(pages->count * sizeof(*types));
> +    if ( !pfns || !types )
> +    {
> +        ERROR("Unable to allocate enough memory for %"PRIu32" pfns",
> +              pages->count);
> +        goto err;
> +    }
> +
> +    for ( i = 0; i < pages->count; ++i )
> +    {
> +        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
> +        if ( !ctx->ops.pfn_is_valid(ctx, pfn) )
> +        {
> +            ERROR("pfn %#lx (index %u) outside domain maximum", pfn, i);
> +            goto err;
> +        }
> +
> +        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
> +        if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
> +             ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
> +        {
> +            ERROR("Invalid type %#"PRIx32" for pfn %#lx (index %u)", type, pfn, i);
> +            goto err;
> +        }
> +        else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
> +            /* NOTAB and all L1 thru L4 tables (including pinned) should have
> +             * a page worth of data in the record. */
> +            pages_of_data++;
> +
> +        pfns[i] = pfn;
> +        types[i] = type;
> +    }
> +
> +    if ( rec->length != (sizeof(*pages) +
> +                         (sizeof(uint64_t) * pages->count) +
> +                         (PAGE_SIZE * pages_of_data)) )
> +    {
> +        ERROR("PAGE_DATA record wrong size: length %"PRIu32", expected "
> +              "%zu + %zu + %zu", rec->length, sizeof(*pages),
> +              (sizeof(uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
> +        goto err;
> +    }
> +
> +    rc = process_page_data(ctx, pages->count, pfns, types,
> +                           &pages->pfn[pages->count]);
> + err:
> +    free(types);
> +    free(pfns);
> +
> +    return rc;
> +}
> +
> +/*
> + * Restore a domain.
> + */
> +static int restore(struct context *ctx)
> +{
> +    xc_interface *xch = ctx->xch;
> +    struct record rec;
> +    int rc, saved_rc = 0, saved_errno = 0;
> +
> +    IPRINTF("Restoring domain");
> +
> +    rc = ctx->restore.ops.setup(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    do
> +    {
> +        rc = read_record(ctx, &rec);
> +        if ( rc )
> +            goto err;
> +
> +        switch ( rec.type )
> +        {
> +        case REC_TYPE_END:
> +            DPRINTF("End record");
> +            break;
> +
> +        case REC_TYPE_PAGE_DATA:
> +            rc = handle_page_data(ctx, &rec);
> +            break;
> +
> +        default:
> +            rc = ctx->restore.ops.process_record(ctx, &rec);
> +            break;
> +        }
> +
> +        free(rec.data);
> +        if ( rc )
> +            goto err;
> +
> +    } while ( rec.type != REC_TYPE_END );
> +
> +    rc = ctx->restore.ops.stream_complete(ctx);
> +    if ( rc )
> +        goto err;
> +
> +    IPRINTF("Restore successful");
> +    goto done;
> +
> + err:
> +    saved_errno = errno;
> +    saved_rc = rc;
> +    PERROR("Restore failed");
> +
> + done:
> +    free(ctx->restore.populated_pfns);
> +    rc = ctx->restore.ops.cleanup(ctx);
> +    if ( rc )
> +        PERROR("Failed to clean up");
> +
> +    if ( saved_rc )
> +    {
> +        rc = saved_rc;
> +        errno = saved_errno;
> +    }
> +
> +    return rc;
> +}
> +
>   int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>                          unsigned int store_evtchn, unsigned long *store_mfn,
>                          domid_t store_domid, unsigned int console_evtchn,
> @@ -8,8 +502,68 @@ int xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom,
>                          int checkpointed_stream,
>                          struct restore_callbacks *callbacks)
>   {
> +    struct context ctx =
> +        {
> +            .xch = xch,
> +            .fd = io_fd,
> +        };
> +
> +    /* GCC 4.4 (of CentOS 6.x vintage) can't initialise anonymous unions :( */
> +    ctx.restore.console_evtchn = console_evtchn;
> +    ctx.restore.console_domid = console_domid;
> +    ctx.restore.xenstore_evtchn = store_evtchn;
> +    ctx.restore.xenstore_domid = store_domid;
> +    ctx.restore.callbacks = callbacks;
> +
>       IPRINTF("In experimental %s", __func__);
> -    return -1;
> +
> +    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
> +    {
> +        PERROR("Failed to get domain info");
> +        return -1;
> +    }
> +
> +    if ( ctx.dominfo.domid != dom )
> +    {
> +        ERROR("Domain %d does not exist", dom);
> +        return -1;
> +    }
> +
> +    ctx.domid = dom;
> +    IPRINTF("Restoring domain %d", dom);
> +
> +    if ( read_headers(&ctx) )
> +        return -1;
> +
> +    if ( ctx.dominfo.hvm )
> +    {
> +        ctx.ops = common_ops_x86_hvm;
> +        ctx.restore.ops = restore_ops_x86_hvm;
> +        if ( restore(&ctx) )
> +            return -1;
> +    }
> +    else
> +    {
> +        ctx.ops = common_ops_x86_pv;
> +        ctx.restore.ops = restore_ops_x86_pv;
> +        if ( restore(&ctx) )
> +            return -1;
> +    }
> +
> +    DPRINTF("XenStore: mfn %#lx, dom %d, evt %u",
> +            ctx.restore.xenstore_mfn,
> +            ctx.restore.xenstore_domid,
> +            ctx.restore.xenstore_evtchn);
> +
> +    DPRINTF("Console: mfn %#lx, dom %d, evt %u",
> +            ctx.restore.console_mfn,
> +            ctx.restore.console_domid,
> +            ctx.restore.console_evtchn);
> +
> +    *console_mfn = ctx.restore.console_mfn;
> +    *store_mfn = ctx.restore.xenstore_mfn;
> +
> +    return 0;
>   }
>
>   /*
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 14/14] tools/libxc: noarch restore code
  2014-06-19  6:16   ` Hongyang Yang
@ 2014-06-19  9:00     ` Andrew Cooper
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-19  9:00 UTC (permalink / raw)
  To: Hongyang Yang; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 19/06/14 07:16, Hongyang Yang wrote:
> On 06/12/2014 02:14 AM, Andrew Cooper wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>   tools/libxc/saverestore/common.h  |    6 +
>>   tools/libxc/saverestore/restore.c |  556
>> ++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 561 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/saverestore/common.h
>> b/tools/libxc/saverestore/common.h
>> index e16e0de..2d44961 100644
>> --- a/tools/libxc/saverestore/common.h
>> +++ b/tools/libxc/saverestore/common.h
>> @@ -292,6 +292,12 @@ static inline int write_record(struct context
>> *ctx, struct record *rec)
>>       return write_split_record(ctx, rec, NULL, 0);
>>   }
>>
> ...snip...
>> +/*
>> + * Given a list of pfns, their types, and a block of page data from the
>> + * stream, populate and record their types, map the relevent subset
>> and copy
>> + * the data into the guest.
>> + */
>> +static int process_page_data(struct context *ctx, unsigned count,
>> +                             xen_pfn_t *pfns, uint32_t *types, void
>> *page_data)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
>> +    int *map_errs = malloc(count * sizeof(*map_errs));
>> +    int rc = -1;
>> +    void *mapping = NULL, *guest_page = NULL;
>> +    unsigned i,    /* i indexes the pfns from the record. */
>> +        j,         /* j indexes the subset of pfns we decide to map. */
>> +        nr_pages;
>> +
>> +    if ( !mfns || !map_errs )
>> +    {
>> +        ERROR("Failed to allocate %zu bytes to process page data",
>> +              count * (sizeof(*mfns) + sizeof(*map_errs)));
>> +        goto err;
>> +    }
>> +
>> +    rc = populate_pfns(ctx, count, pfns, types);
>> +    if ( rc )
>> +    {
>> +        ERROR("Failed to populate pfns for batch of %u pages", count);
>> +        goto err;
>> +    }
>> +    rc = -1;
>> +
>> +    for ( i = 0, nr_pages = 0; i < count; ++i )
>> +    {
>> +        ctx->ops.set_page_type(ctx, pfns[i], types[i]);
>> +
>> +        switch ( types[i] )
>> +        {
>> +        case XEN_DOMCTL_PFINFO_NOTAB:
>> +
>> +        case XEN_DOMCTL_PFINFO_L1TAB:
>> +        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
>> +
>> +        case XEN_DOMCTL_PFINFO_L2TAB:
>> +        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
>> +
>> +        case XEN_DOMCTL_PFINFO_L3TAB:
>> +        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
>> +
>> +        case XEN_DOMCTL_PFINFO_L4TAB:
>> +        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
>> +
>> +            mfns[nr_pages++] = ctx->ops.pfn_to_gfn(ctx, pfns[i]);
>> +            break;
>> +        }
>> +
>> +    }
>> +
>> +    if ( nr_pages > 0 )
>> +    {
>> +        mapping = guest_page = xc_map_foreign_bulk(
>> +            xch, ctx->domid, PROT_READ | PROT_WRITE,
>> +            mfns, map_errs, nr_pages);
>> +        if ( !mapping )
>> +        {
>> +            PERROR("Unable to map %u mfns for %u pages of data",
>> +                   nr_pages, count);
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    for ( i = 0, j = 0; i < count; ++i )
>> +    {
>> +        switch ( types[i] )
>> +        {
>> +        case XEN_DOMCTL_PFINFO_XTAB:
>> +        case XEN_DOMCTL_PFINFO_BROKEN:
>> +        case XEN_DOMCTL_PFINFO_XALLOC:
>> +            /* No page data to deal with. */
>> +            continue;
>> +        }
>> +
>> +        if ( map_errs[j] )
>> +        {
>> +            ERROR("Mapping pfn %lx (mfn %lx, type %#"PRIx32")failed
>> with %d",
>> +                  pfns[i], mfns[j], types[i], map_errs[j]);
>
> rc = -1 is missing here; rc could still be 0 from the previous iteration,
> where it was set by the call:
>     rc = ctx->restore.ops.localise_page(ctx, types[i], guest_page);

So it can - well spotted.

~Andrew


* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-17 18:04     ` Andrew Cooper
@ 2014-06-19  9:13       ` Hongyang Yang
  2014-06-19  9:36         ` Andrew Cooper
  0 siblings, 1 reply; 76+ messages in thread
From: Hongyang Yang @ 2014-06-19  9:13 UTC (permalink / raw)
  To: Andrew Cooper, Ian Campbell; +Cc: Ian Jackson, Xen-devel

Hi Andrew, Ian,

On 06/18/2014 02:04 AM, Andrew Cooper wrote:
> On 17/06/14 17:40, Ian Campbell wrote:
>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>>> +The following features are not yet fully specified and will be
>>> +included in a future draft.
>>> +
>>> +* Remus
>> What is the plan for Remus here?
>>
>> It has pretty large implications for the flow of a migration stream and
>> therefore on the code in the final two patches, I suspect it will
>> require high level changes to those functions, so I'm reluctant to spend
>> a lot of time on them as they are.
>
> I don't believe too much change will be required to the final two
> patches, but it does depend on fixing the current qemu record layer
> violations.
>
> It will be much easier to do after a prototype to the libxl level fixes.

I'm trying to port Remus to migration v2...

>
>>> +> of this itself.  This record is only present for early development
>>> +> purposes and will be removed before submissions, along with changes
>>> +> to libxl which cause libxl to handle this data itself.
>> How confident are you that this can be done before 4.5? I suspect that
>> it's going to involve an awful lot of replumbing.
>
> I suspect it will, and I frankly have no idea.  It is my experience in
> the past that hacking on libxl turns out to be harder than expected.
>
>>
>> I also think it will interact in exciting ways with the remus stuff.
>
> Less so, I think.  Remus mandates identical toolstacks on either side,
> so the problem of legacy remus to new remus doesn't exist.
>
>>
>> I think we need to start seeing the start of concrete plans for both of
>> these things ASAP.
>
> Indeed.
>
> The concrete plan is that it needs to be done, and it needs to be done
> to a sufficient extent with general upstream approval that I can be
> fairly confident that what upstream accepts isn't different to the
> version which gets shipped with XenServer.$NEXT.  On the other hand,
> getting Xapi + libxc working for XenServer.$NEXT is critical as
> customers would like to be able to migrate their vms.
>
> I would also like to sleep at some point.
>
> ~Andrew
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

-- 
Thanks,
Yang.


* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-19  2:48   ` Wen Congyang
@ 2014-06-19  9:19     ` Andrew Cooper
  2014-06-22 14:02       ` Shriram Rajagopalan
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-19  9:19 UTC (permalink / raw)
  To: Wen Congyang; +Cc: Frediano Ziglio, David Vrabel, Xen-devel

On 19/06/14 03:48, Wen Congyang wrote:
> At 06/12/2014 02:14 AM, Andrew Cooper Wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  tools/libxc/saverestore/save.c |  545 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 544 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
>> index f6ad734..9ad43a5 100644
>> --- a/tools/libxc/saverestore/save.c
>> +++ b/tools/libxc/saverestore/save.c
>> @@ -1,11 +1,554 @@
>> +#include <assert.h>
>> +#include <arpa/inet.h>
>> +
>>  #include "common.h"
>>  
>> +/*
>> + * Writes an Image header and Domain header into the stream.
>> + */
>> +static int write_headers(struct context *ctx, uint16_t guest_type)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
>> +    struct ihdr ihdr =
>> +        {
>> +            .marker  = IHDR_MARKER,
>> +            .id      = htonl(IHDR_ID),
>> +            .version = htonl(IHDR_VERSION),
>> +            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
>> +        };
>> +    struct dhdr dhdr =
>> +        {
>> +            .type       = guest_type,
>> +            .page_shift = XC_PAGE_SHIFT,
>> +            .xen_major  = (xen_version >> 16) & 0xffff,
>> +            .xen_minor  = (xen_version)       & 0xffff,
>> +        };
>> +
>> +    if ( xen_version < 0 )
>> +    {
>> +        PERROR("Unable to obtain Xen Version");
>> +        return -1;
>> +    }
>> +
>> +    if ( write_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
>> +    {
>> +        PERROR("Unable to write Image Header to stream");
>> +        return -1;
>> +    }
>> +
>> +    if ( write_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
>> +    {
>> +        PERROR("Unable to write Domain Header to stream");
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Writes an END record into the stream.
>> + */
>> +static int write_end_record(struct context *ctx)
>> +{
>> +    struct record end = { REC_TYPE_END, 0, NULL };
>> +
>> +    return write_record(ctx, &end);
>> +}
>> +
>> +/*
>> + * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
>> + * is constructed in ctx->save.batch_pfns.
>> + *
>> + * This function:
>> + * - gets the types for each pfn in the batch.
>> + * - for each pfn with real data:
>> + *   - maps and attempts to localise the pages.
>> + * - construct and writes a PAGE_DATA record into the stream.
>> + */
>> +static int write_batch(struct context *ctx)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xen_pfn_t *mfns = NULL, *types = NULL;
>> +    void *guest_mapping = NULL;
>> +    void **guest_data = NULL;
>> +    void **local_pages = NULL;
>> +    int *errors = NULL, rc = -1;
>> +    unsigned i, p, nr_pages = 0;
>> +    unsigned nr_pfns = ctx->save.nr_batch_pfns;
>> +    void *page, *orig_page;
>> +    uint64_t *rec_pfns = NULL;
>> +    struct rec_page_data_header hdr = { 0 };
>> +    struct record rec =
>> +    {
>> +        .type = REC_TYPE_PAGE_DATA,
>> +    };
>> +
>> +    assert(nr_pfns != 0);
>> +
>> +    /* Mfns of the batch pfns. */
>> +    mfns = malloc(nr_pfns * sizeof(*mfns));
>> +    /* Types of the batch pfns. */
>> +    types = malloc(nr_pfns * sizeof(*types));
>> +    /* Errors from attempting to map the mfns. */
>> +    errors = malloc(nr_pfns * sizeof(*errors));
>> +    /* Pointers to page data to send.  Either mapped mfns or local allocations. */
>> +    guest_data = calloc(nr_pfns, sizeof(*guest_data));
>> +    /* Pointers to locally allocated pages.  Need freeing. */
>> +    local_pages = calloc(nr_pfns, sizeof(*local_pages));
> This function is called too many times, so we will allocate/free
> memory again and again. It may affect the performance.
>
> I think we can allocate at setup stage, and only clear guest_data/
> local_pages here.

We likely can.  It is currently like this because it allowed valgrind to
do a fantastic job of spotting errors when flushing an incomplete batch
at the end of each iteration.

It should be possible to consolidate the allocations and use valgrind
client requests to achieve the same effect, although at this point my
effort is far more focused to getting something which works correctly
ready in the 4.5 timeframe.

>
>> +
>> +    if ( !mfns || !types || !errors || !guest_data || !local_pages )
>> +    {
>> +        ERROR("Unable to allocate arrays for a batch of %u pages",
>> +              nr_pfns);
>> +        goto err;
>> +    }
>> +
>> +    for ( i = 0; i < nr_pfns; ++i )
>> +    {
>> +        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->save.batch_pfns[i]);
>> +
>> +        /* Likely a ballooned page. */
>> +        if ( mfns[i] == INVALID_MFN )
>> +            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
>> +    }
>> +
>> +    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
>> +    if ( rc )
>> +    {
>> +        PERROR("Failed to get types for pfn batch");
>> +        goto err;
>> +    }
>> +    rc = -1;
>> +
>> +    for ( i = 0; i < nr_pfns; ++i )
>> +    {
>> +        switch ( types[i] )
>> +        {
>> +        case XEN_DOMCTL_PFINFO_BROKEN:
>> +        case XEN_DOMCTL_PFINFO_XALLOC:
>> +        case XEN_DOMCTL_PFINFO_XTAB:
>> +            continue;
>> +        }
>> +
>> +        mfns[nr_pages++] = mfns[i];
>> +    }
>> +
>> +    if ( nr_pages > 0 )
>> +    {
>> +        guest_mapping = xc_map_foreign_bulk(
>> +            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
>> +        if ( !guest_mapping )
>> +        {
>> +            PERROR("Failed to map guest pages");
>> +            goto err;
>> +        }
> To support remus, we will map/unmap guest memory again and again. It
> also affects the performance. We can cache guest mapping here.
>
> Thanks
> Wen Congyang
>
>

I am not aware of the current code caching mappings into the guest. 
64bit toolstacks would be fine setting up mappings for every gfn and
reusing the mappings, but this really won't work for 32bit toolstacks.

What remus does appear to do is have a limited cache of
previously-written pages for XOR+RLE compression.
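As a rough illustration only (hypothetical helper names, not the actual libxc/Remus implementation), the XOR+RLE idea XORs a page against the previously-sent copy so that unchanged bytes become zero, then run-length encodes the zero runs:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy XOR+RLE encoder.  Output is a sequence of chunks, each being:
 * one byte zero-run length (0-255), one byte literal count (0-255),
 * then that many XOR-diff literals.  Returns bytes written; `out`
 * must be sized for the worst case (len + 2 * ceil(len / 255)). */
static size_t xor_rle_encode(const uint8_t *prev, const uint8_t *cur,
                             size_t len, uint8_t *out)
{
    size_t i = 0, o = 0;

    while ( i < len )
    {
        size_t zeros = 0, lits = 0, start;

        while ( i < len && zeros < 255 && (prev[i] ^ cur[i]) == 0 )
            ++zeros, ++i;
        start = i;
        while ( i < len && lits < 255 && (prev[i] ^ cur[i]) != 0 )
            ++lits, ++i;

        out[o++] = (uint8_t)zeros;
        out[o++] = (uint8_t)lits;
        for ( size_t j = 0; j < lits; ++j )
            out[o++] = prev[start + j] ^ cur[start + j];
    }

    return o;
}

/* Reverse: rebuild the current page from the previous copy + stream. */
static void xor_rle_decode(const uint8_t *prev, const uint8_t *enc,
                           size_t enc_len, uint8_t *out)
{
    size_t i = 0, o = 0;

    while ( i < enc_len )
    {
        size_t zeros = enc[i++];
        size_t lits  = enc[i++];

        for ( size_t j = 0; j < zeros; ++j, ++o )
            out[o] = prev[o];                 /* unchanged byte */
        for ( size_t j = 0; j < lits; ++j, ++o )
            out[o] = prev[o] ^ enc[i++];      /* changed byte */
    }
}
```

A mostly-unchanged page (the common case between checkpoints) collapses to a handful of bytes, which is why the cache of previously-written pages pays for itself under Remus's frequent checkpointing.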

~Andrew


* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-19  9:13       ` Hongyang Yang
@ 2014-06-19  9:36         ` Andrew Cooper
  2014-06-19 10:23           ` Hongyang Yang
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-19  9:36 UTC (permalink / raw)
  To: Hongyang Yang; +Cc: Ian Jackson, Ian Campbell, Xen-devel

On 19/06/14 10:13, Hongyang Yang wrote:
> Hi Andrew, Ian,
>
> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
>> On 17/06/14 17:40, Ian Campbell wrote:
>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>>>> +The following features are not yet fully specified and will be
>>>> +included in a future draft.
>>>> +
>>>> +* Remus
>>> What is the plan for Remus here?
>>>
>>> It has pretty large implications for the flow of a migration stream and
>>> therefore on the code in the final two patches, I suspect it will
>>> require high level changes to those functions, so I'm reluctant to
>>> spend
>>> a lot of time on them as they are.
>>
>> I don't believe too much change will be required to the final two
>> patches, but it does depend on fixing the current qemu record layer
>> violations.
>>
>> It will be much easier to do after a prototype to the libxl level fixes.
>
> I'm trying to port Remus to migration v2...

Ah, fantastic! Here I was expecting to eventually have to brave that code
myself.

How is it going?  How are you finding hacking on v2 compared to the
legacy code? (I think you are the first person who isn't me trying to
extend it)  Is there anything I can do while still developing v2 to make
things easier?


I really need to get a prototype libxl framing document sorted, but in
principle my plan (given only a minimum understanding of the algorithm)
is this:

...
* Write page data update
* Write vcpu context etc
* Write a REMUS_CHECKPOINT record (or appropriate name)
* Call the checkpoint callback, passing ownership of the fd to libxl
** libxl writes a libxl qemu record into the stream
* checkpoint callback returns to libxl, returning ownership of the fd
* libxc chooses between sending an END record or looping
...

The fd ownership is expected to work exactly the same on the receiving
side, using the REMUS_CHECKPOINT record as an indicator.

Does this look plausible or sensible, or have I missed something?

~Andrew


* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-19  9:36         ` Andrew Cooper
@ 2014-06-19 10:23           ` Hongyang Yang
  2014-06-19 10:44             ` Andrew Cooper
  0 siblings, 1 reply; 76+ messages in thread
From: Hongyang Yang @ 2014-06-19 10:23 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Jackson, Ian Campbell, Xen-devel

On 06/19/2014 05:36 PM, Andrew Cooper wrote:
> On 19/06/14 10:13, Hongyang Yang wrote:
>> Hi Andrew, Ian,
>>
>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
>>> On 17/06/14 17:40, Ian Campbell wrote:
>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>>>>> +The following features are not yet fully specified and will be
>>>>> +included in a future draft.
>>>>> +
>>>>> +* Remus
>>>> What is the plan for Remus here?
>>>>
>>>> It has pretty large implications for the flow of a migration stream and
>>>> therefore on the code in the final two patches, I suspect it will
>>>> require high level changes to those functions, so I'm reluctant to
>>>> spend
>>>> a lot of time on them as they are.
>>>
>>> I don't believe too much change will be required to the final two
>>> patches, but it does depend on fixing the current qemu record layer
>>> violations.
>>>
>>> It will be much easier to do after a prototype to the libxl level fixes.
>>
>> I'm trying to port Remus to migration v2...
>
> Ah, fantastic! Here I was expecting to eventually have to brave that code
> myself.
>
> How is it going?  How are you finding hacking on v2 compared to the
> legacy code? (I think you are the first person who isn't me trying to
> extend it)  Is there anything I can do while still developing v2 to make
> things easier?

It's just getting started, and only on the libxc side, based on your patch
series.  The v2 code is much cleaner than the legacy code and easy to
understand, and yes, it makes hacking easier.  I may well need your help as
the hacking goes on...

>
>
> I really need to get a prototype libxl framing document sorted, but in
> principle my plan (given only a minimum understanding of the algorithm)
> is this:
>
> ...
> * Write page data update
> * Write vcpu context etc
> * Write a REMUS_CHECKPOINT record (or appropriate name)
> * Call the checkpoint callback, passing ownership of the fd to libxl
> ** libxl writes a libxl qemu record into the stream
> * checkpoint callback returns to libxl, returning ownership of the fd
> * libxc chooses between sending an END record or looping
> ...
>
> The fd ownership is expected to work exactly the same on the receiving
> side, using the REMUS_CHECKPOINT record as an indicator.

It mostly looks plausible, but the save side and the restore side need to be
synchronised, otherwise the following problem may occur:
   the sending side is in libxl and sends qemu records while the receiving
   side is still in libxc; by the time the receiver switches to libxl, part
   of a record may have been lost.
A handshake would solve the problem, whether it lives in libxl or libxc, but
the current migration framework does not support sending messages from the
receiving side back to the sending side, so it will need modification.  We
should support this feature.

>
> Does this look plausible or sensible, or have I missed something?
>
> ~Andrew
> .
>

-- 
Thanks,
Yang.


* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-19 10:23           ` Hongyang Yang
@ 2014-06-19 10:44             ` Andrew Cooper
  2014-06-22 14:36               ` Shriram Rajagopalan
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Cooper @ 2014-06-19 10:44 UTC (permalink / raw)
  To: Hongyang Yang; +Cc: Ian Jackson, Ian Campbell, Xen-devel

On 19/06/14 11:23, Hongyang Yang wrote:
> On 06/19/2014 05:36 PM, Andrew Cooper wrote:
>> On 19/06/14 10:13, Hongyang Yang wrote:
>>> Hi Andrew, Ian,
>>>
>>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
>>>> On 17/06/14 17:40, Ian Campbell wrote:
>>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
>>>>>> +The following features are not yet fully specified and will be
>>>>>> +included in a future draft.
>>>>>> +
>>>>>> +* Remus
>>>>> What is the plan for Remus here?
>>>>>
>>>>> It has pretty large implications for the flow of a migration
>>>>> stream and
>>>>> therefore on the code in the final two patches, I suspect it will
>>>>> require high level changes to those functions, so I'm reluctant to
>>>>> spend
>>>>> a lot of time on them as they are.
>>>>
>>>> I don't believe too much change will be required to the final two
>>>> patches, but it does depend on fixing the current qemu record layer
>>>> violations.
>>>>
>>>> It will be much easier to do after a prototype to the libxl level
>>>> fixes.
>>>
>>> I'm trying to port Remus to migration v2...
>>
>> Ah, fantastic! Here I was expecting to eventually have to brave that code
>> myself.
>>
>> How is it going?  How are you finding hacking on v2 compared to the
>> legacy code? (I think you are the first person who isn't me trying to
>> extend it)  Is there anything I can do while still developing v2 to make
>> things easier?
>
> It's just getting started, and only on the libxc side, based on your patch
> series.  The v2 code is much cleaner than the legacy code and easy to
> understand, and yes, it makes hacking easier.  I may well need your help as
> the hacking goes on...
>
>>
>>
>> I really need to get a prototype libxl framing document sorted, but in
>> principle my plan (given only a minimum understanding of the algorithm)
>> is this:
>>
>> ...
>> * Write page data update
>> * Write vcpu context etc
>> * Write a REMUS_CHECKPOINT record (or appropriate name)
>> * Call the checkpoint callback, passing ownership of the fd to libxl
>> ** libxl writes a libxl qemu record into the stream
>> * checkpoint callback returns to libxl, returning ownership of the fd
>> * libxc chooses between sending an END record or looping
>> ...
>>
>> The fd ownership is expected to work exactly the same on the receiving
>> side, using the REMUS_CHECKPOINT record as an indicator.
>
> It mostly looks plausible, but the save side and the restore side need to
> be synchronised, otherwise the following problem may occur:
>   the sending side is in libxl and sends qemu records while the receiving
>   side is still in libxc; by the time the receiver switches to libxl,
>   part of a record may have been lost.
> A handshake would solve the problem, whether it lives in libxl or libxc,
> but the current migration framework does not support sending messages
> from the receiving side back to the sending side, so it will need
> modification.  We should support this feature.

Ah yes I see.

How about this?

Libxc REMUS_CHECKPOINT is defined as a 0-length record (like the current
END record).
Libxl REMUS_CHECKPOINT is defined containing at least "last checkpoint"
bit in the header.

Libxc writes a libxc REMUS_CHECKPOINT record into the stream and always
hands the fd to libxl.
Libxl then writes a libxl REMUS_CHECKPOINT record, including the last
checkpoint bit if needed.

This means that it is libxl on the receiving side which determines
whether the last checkpoint has been reached, and libxc must always pass
the fd up.  This fixes the synchronisation issues, without requiring a
back channel, but still maintaining appropriate layering.
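A minimal sketch of what that framing could look like, loosely following the v2 record layout described in the spec patch (uint32 type, uint32 length, body padded to an 8-byte boundary); the type value and flag bit below are hypothetical placeholders, not values from the spec:

```c
#include <stdint.h>
#include <string.h>

#define REC_TYPE_REMUS_CHECKPOINT 0xf001U   /* hypothetical type value   */
#define LAST_CHECKPOINT           (1U << 0) /* hypothetical libxl flag   */

/* Serialise one record into `buf`: 8-byte header (type, length), body,
 * zero padding up to the next 8-byte boundary.  Returns bytes written. */
static size_t write_rec(uint8_t *buf, uint32_t type,
                        const void *body, uint32_t len)
{
    size_t padded = (len + 7) & ~(size_t)7;

    memcpy(buf, &type, 4);
    memcpy(buf + 4, &len, 4);
    if ( len )
        memcpy(buf + 8, body, len);
    memset(buf + 8 + len, 0, padded - len);

    return 8 + padded;
}
```

Under this scheme the libxc-level REMUS_CHECKPOINT is just the 8-byte header (like END), while the libxl-level record carries a flags word whose LAST_CHECKPOINT bit lets the receiving libxl decide when the stream is done.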

~Andrew


* Re: [PATCH v5 RFC 13/14] tools/libxc: noarch save code
  2014-06-19  9:19     ` Andrew Cooper
@ 2014-06-22 14:02       ` Shriram Rajagopalan
  0 siblings, 0 replies; 76+ messages in thread
From: Shriram Rajagopalan @ 2014-06-22 14:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Frediano Ziglio, David Vrabel, Wen Congyang, Xen-devel



On Jun 19, 2014 2:52 PM, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
> On 19/06/14 03:48, Wen Congyang wrote:
> > At 06/12/2014 02:14 AM, Andrew Cooper Wrote:
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> ---
> >>  tools/libxc/saverestore/save.c |  545
+++++++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 544 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/tools/libxc/saverestore/save.c
b/tools/libxc/saverestore/save.c
> >> index f6ad734..9ad43a5 100644
> >> --- a/tools/libxc/saverestore/save.c
> >> +++ b/tools/libxc/saverestore/save.c
> >> @@ -1,11 +1,554 @@
> >> +#include <assert.h>
> >> +#include <arpa/inet.h>
> >> +
> >>  #include "common.h"
> >>
> >> +/*
> >> + * Writes an Image header and Domain header into the stream.
> >> + */
> >> +static int write_headers(struct context *ctx, uint16_t guest_type)
> >> +{
> >> +    xc_interface *xch = ctx->xch;
> >> +    int32_t xen_version = xc_version(xch, XENVER_version, NULL);
> >> +    struct ihdr ihdr =
> >> +        {
> >> +            .marker  = IHDR_MARKER,
> >> +            .id      = htonl(IHDR_ID),
> >> +            .version = htonl(IHDR_VERSION),
> >> +            .options = htons(IHDR_OPT_LITTLE_ENDIAN),
> >> +        };
> >> +    struct dhdr dhdr =
> >> +        {
> >> +            .type       = guest_type,
> >> +            .page_shift = XC_PAGE_SHIFT,
> >> +            .xen_major  = (xen_version >> 16) & 0xffff,
> >> +            .xen_minor  = (xen_version)       & 0xffff,
> >> +        };
> >> +
> >> +    if ( xen_version < 0 )
> >> +    {
> >> +        PERROR("Unable to obtain Xen Version");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if ( write_exact(ctx->fd, &ihdr, sizeof(ihdr)) )
> >> +    {
> >> +        PERROR("Unable to write Image Header to stream");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if ( write_exact(ctx->fd, &dhdr, sizeof(dhdr)) )
> >> +    {
> >> +        PERROR("Unable to write Domain Header to stream");
> >> +        return -1;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +/*
> >> + * Writes an END record into the stream.
> >> + */
> >> +static int write_end_record(struct context *ctx)
> >> +{
> >> +    struct record end = { REC_TYPE_END, 0, NULL };
> >> +
> >> +    return write_record(ctx, &end);
> >> +}
> >> +
> >> +/*
> >> + * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
> >> + * is constructed in ctx->save.batch_pfns.
> >> + *
> >> + * This function:
> >> + * - gets the types for each pfn in the batch.
> >> + * - for each pfn with real data:
> >> + *   - maps and attempts to localise the pages.
> >> + * - construct and writes a PAGE_DATA record into the stream.
> >> + */
> >> +static int write_batch(struct context *ctx)
> >> +{
> >> +    xc_interface *xch = ctx->xch;
> >> +    xen_pfn_t *mfns = NULL, *types = NULL;
> >> +    void *guest_mapping = NULL;
> >> +    void **guest_data = NULL;
> >> +    void **local_pages = NULL;
> >> +    int *errors = NULL, rc = -1;
> >> +    unsigned i, p, nr_pages = 0;
> >> +    unsigned nr_pfns = ctx->save.nr_batch_pfns;
> >> +    void *page, *orig_page;
> >> +    uint64_t *rec_pfns = NULL;
> >> +    struct rec_page_data_header hdr = { 0 };
> >> +    struct record rec =
> >> +    {
> >> +        .type = REC_TYPE_PAGE_DATA,
> >> +    };
> >> +
> >> +    assert(nr_pfns != 0);
> >> +
> >> +    /* Mfns of the batch pfns. */
> >> +    mfns = malloc(nr_pfns * sizeof(*mfns));
> >> +    /* Types of the batch pfns. */
> >> +    types = malloc(nr_pfns * sizeof(*types));
> >> +    /* Errors from attempting to map the mfns. */
> >> +    errors = malloc(nr_pfns * sizeof(*errors));
> >> +    /* Pointers to page data to send.  Either mapped mfns or local allocations. */
> >> +    guest_data = calloc(nr_pfns, sizeof(*guest_data));
> >> +    /* Pointers to locally allocated pages.  Need freeing. */
> >> +    local_pages = calloc(nr_pfns, sizeof(*local_pages));
> > This function is called many times, so we will allocate/free
> > memory again and again, which may affect performance.
> >
> > I think we can allocate at the setup stage, and only clear guest_data/
> > local_pages here.
>
> We likely can.  It is currently like this because it allowed valgrind to
> do a fantastic job of spotting errors when flushing an incomplete batch
> at the end of each iteration.
>
> It should be possible to consolidate the allocations and use valgrind
> client requests to achieve the same effect, although at this point my
> effort is far more focused to getting something which works correctly
> ready in the 4.5 timeframe.
>
> >
> >> +
> >> +    if ( !mfns || !types || !errors || !guest_data || !local_pages )
> >> +    {
> >> +        ERROR("Unable to allocate arrays for a batch of %u pages",
> >> +              nr_pfns);
> >> +        goto err;
> >> +    }
> >> +
> >> +    for ( i = 0; i < nr_pfns; ++i )
> >> +    {
> >> +        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->save.batch_pfns[i]);
> >> +
> >> +        /* Likely a ballooned page. */
> >> +        if ( mfns[i] == INVALID_MFN )
> >> +            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
> >> +    }
> >> +
> >> +    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
> >> +    if ( rc )
> >> +    {
> >> +        PERROR("Failed to get types for pfn batch");
> >> +        goto err;
> >> +    }
> >> +    rc = -1;
> >> +
> >> +    for ( i = 0; i < nr_pfns; ++i )
> >> +    {
> >> +        switch ( types[i] )
> >> +        {
> >> +        case XEN_DOMCTL_PFINFO_BROKEN:
> >> +        case XEN_DOMCTL_PFINFO_XALLOC:
> >> +        case XEN_DOMCTL_PFINFO_XTAB:
> >> +            continue;
> >> +        }
> >> +
> >> +        mfns[nr_pages++] = mfns[i];
> >> +    }
> >> +
> >> +    if ( nr_pages > 0 )
> >> +    {
> >> +        guest_mapping = xc_map_foreign_bulk(
> >> +            xch, ctx->domid, PROT_READ, mfns, errors, nr_pages);
> >> +        if ( !guest_mapping )
> >> +        {
> >> +            PERROR("Failed to map guest pages");
> >> +            goto err;
> >> +        }
> > To support remus, we will map/unmap guest memory again and again,
> > which also affects performance. We could cache the guest mappings here.
> >
> > Thanks
> > Wen Congyang
> >
> >
>
> I am not aware of the current code caching mappings into the guest.
> 64bit toolstacks would be fine setting up mappings for every gfn and
> reusing the mappings, but this really won't work for 32bit toolstacks.
>
> What remus does appear to do is have a limited cache of
> previously-written pages for XOR+RLE compression.
>

I think Wen is referring to the mapping/unmapping in batches of 4MB. This
generally slows down the time to checkpoint by 2x, and tends to get worse
as the guest's memory size increases beyond 1GB.

What we used to do earlier (not in mainline libxc code) is to map the
entire domain memory into libxc before the infinite loop. This, as you
said, has some restrictions, but halves the checkpointing time (i.e.
halves the time for which guest execution is suspended).

Shriram

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-19 10:44             ` Andrew Cooper
@ 2014-06-22 14:36               ` Shriram Rajagopalan
  2014-06-22 16:01                 ` Andrew Cooper
  0 siblings, 1 reply; 76+ messages in thread
From: Shriram Rajagopalan @ 2014-06-22 14:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 5331 bytes --]

On Jun 19, 2014 4:16 PM, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
> On 19/06/14 11:23, Hongyang Yang wrote:
> > On 06/19/2014 05:36 PM, Andrew Cooper wrote:
> >> On 19/06/14 10:13, Hongyang Yang wrote:
> >>> Hi Andrew, Ian,
> >>>
> >>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
> >>>> On 17/06/14 17:40, Ian Campbell wrote:
> >>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> >>>>>> +The following features are not yet fully specified and will be
> >>>>>> +included in a future draft.
> >>>>>> +
> >>>>>> +* Remus
> >>>>> What is the plan for Remus here?
> >>>>>
> >>>>> It has pretty large implications for the flow of a migration
> >>>>> stream and
> >>>>> therefore on the code in the final two patches, I suspect it will
> >>>>> require high level changes to those functions, so I'm reluctant to
> >>>>> spend
> >>>>> a lot of time on them as they are.
> >>>>
> >>>> I don't believe too much change will be required to the final two
> >>>> patches, but it does depend on fixing the current qemu record layer
> >>>> violations.
> >>>>
> >>>> It will be much easier to do after a prototype of the libxl-level
> >>>> fixes.
> >>>
> >>> I'm trying to port Remus to migration v2...
> >>
> >> Ah fantastic! Here I was expecting to have to eventually brave that
> >> code myself.
> >>
> >> How is it going?  How are you finding hacking on v2 compared to the
> >> legacy code? (I think you are the first person who isn't me trying to
> >> extend it)  Is there anything I can do while still developing v2 to
> >> make things easier?
> >
> > It's just starting, and only on the libxc side, based on your patch
> > series. The v2 code is much cleaner than the legacy code and easy to
> > understand, and yes, it makes hacking easier. I may need your help as
> > the hacking goes on...
> >
> >>
> >>
> >> I really need to get a prototype libxl framing document sorted, but in
> >> principle my plan (given only a minimum understanding of the algorithm)
> >> is this:
> >>
> >> ...
> >> * Write page data update
> >> * Write vcpu context etc
> >> * Write a REMUS_CHECKPOINT record (or appropriate name)
> >> * Call the checkpoint callback, passing ownership of the fd to libxl
> >> ** libxl writes a libxl qemu record into the stream
> >> * checkpoint callback returns to libxl, returning ownership of the fd
> >> * libxc chooses between sending an END record or looping
> >> ...
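(The save-side loop above, as a rough sketch; the framing, type values and callback names are invented for illustration, not taken from the series:)

```python
import struct

# Hypothetical record type values for illustration.
REC_TYPE_END = 0
REC_TYPE_PAGE_DATA = 1
REC_TYPE_VCPU_CONTEXT = 2
REC_TYPE_CHECKPOINT = 3

def write_record(stream, rtype, body=b""):
    """v2-style framing: u32 type, u32 length, body padded to 8 bytes."""
    pad = (8 - len(body) % 8) % 8
    stream.write(struct.pack("<II", rtype, len(body)) + body + b"\x00" * pad)

def save(stream, get_dirty_pages, get_vcpu_context, checkpoint_callback):
    """One checkpoint per iteration: page data, vcpu state, CHECKPOINT
    record, then hand fd ownership to libxl via the callback.  The
    callback's return value tells libxc whether to send END or loop."""
    while True:
        write_record(stream, REC_TYPE_PAGE_DATA, get_dirty_pages())
        write_record(stream, REC_TYPE_VCPU_CONTEXT, get_vcpu_context())
        write_record(stream, REC_TYPE_CHECKPOINT)
        # libxl writes its qemu record here, then returns fd ownership.
        if checkpoint_callback(stream):      # True => last checkpoint
            write_record(stream, REC_TYPE_END)
            return
```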
> >>
> >> The fd ownership is expected to work exactly the same on the receiving
> >> side, using the REMUS_CHECKPOINT record as an indicator.
> >
> > It mostly looks plausible, but the save side and restore side need to
> > be synchronised; otherwise the following problem may exist:
> >   the sending side is in libxl and sends qemu records, while the
> >   receiving side is still in libxc; after it has switched to libxl,
> >   part of a record may be lost.
> > Maybe a handshake will solve the problem, whether it's in libxl or
> > libxc, but the current migration framework does not support sending
> > messages from the receiving side to the sending side, so it needs
> > modifications. We should support this feature.
>
> Ah yes I see.
>
> How about this?
>
> Libxc REMUS_CHECKPOINT is defined as a 0-length record (like the current
> END record).
> Libxl REMUS_CHECKPOINT is defined containing at least "last checkpoint"
> bit in the header.
>
> Libxc writes a libxc REMUS_CHECKPOINT record into the stream and always
> hands the fd to libxl.
> Libxl then writes a libxl REMUS_CHECKPOINT record, including the last
> checkpoint bit if needed.
>

I am a bit lost on this part. A silly question: the last I recall (a long
time ago), the v2 format didn't allow for the page compression to be done
asynchronously. Has this limitation changed?
IOW, in the current migration process, the dirty page data is written out
while the guest remains suspended. With remus, the compressed page data is
written out after resuming the guest. This deferred write out logic needs
to be incorporated into v2 code.

> This means that it is libxl on the receiving side which determines
> whether the last checkpoint has been reached, and libxc must always pass
> the fd up.  This fixes the synchronisation issues, without requiring a
> back channel, but still maintaining appropriate layering.
>

So there is a TODO item in the current libxl-remus patches. We need an
explicit acknowledgement from the receiver side that it has received the
memory checkpoint. Whether it comes from libxc or libxl on the receiver
side does not matter, as long as the ack signifies reception of the
memory checkpoint.
The need for an explicit memory ack is because the disk and memory
checkpoint channels are independent.
We need both acks before releasing the buffered network output on the
receiver side.
The disk channel (blktap2 or DRBD) has always sent an explicit ack, but
the memory channel has not. Though it's over TCP, on a given iteration
the memory checkpoint data may still reside in the sender-side socket
buffer while the disk checkpoint has reached the other end -- which
isn't good.

The existing libxc code does an fdatasync or fsync on the fd at the end
of each iteration. I don't think that works as intended on TCP sockets.
Please correct me if I am wrong about this.
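(For illustration, an application-level ack over the memory-checkpoint socket could look like this minimal sketch; the length-prefixed framing and helper names are invented, not from the Remus patches:)

```python
import struct

ACK = b"\x01"

def _recv_exact(sock, n):
    """Read exactly n bytes or fail if the peer disappears."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise IOError("connection closed mid-checkpoint")
        buf += chunk
    return buf

def send_checkpoint(sock, payload):
    """Send one length-prefixed memory checkpoint, then block for the ack.
    fsync()/fdatasync() on a TCP socket does not wait for remote delivery
    (on Linux it simply fails with EINVAL), so the ack has to be an
    explicit application-level message."""
    sock.sendall(struct.pack("<I", len(payload)) + payload)
    if _recv_exact(sock, 1) != ACK:
        raise IOError("unexpected byte instead of checkpoint ack")

def recv_checkpoint(sock):
    """Receive one checkpoint and ack it.  Only once this ack (and the
    disk channel's own ack) have arrived may the sender release its
    buffered network output."""
    (length,) = struct.unpack("<I", _recv_exact(sock, 4))
    payload = _recv_exact(sock, length)
    sock.sendall(ACK)
    return payload
```

Whether the ack is written by libxc or libxl only changes which layer owns the fd at that point; the wire behaviour is the same.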

> ~Andrew
>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 RFC 01/14] docs: libxc migration stream specification
  2014-06-22 14:36               ` Shriram Rajagopalan
@ 2014-06-22 16:01                 ` Andrew Cooper
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Cooper @ 2014-06-22 16:01 UTC (permalink / raw)
  To: rshriram; +Cc: FNST-Yang Hongyang, Ian Jackson, Ian Campbell, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 6087 bytes --]

On 22/06/14 15:36, Shriram Rajagopalan wrote:
>
>
> On Jun 19, 2014 4:16 PM, "Andrew Cooper" <andrew.cooper3@citrix.com
> <mailto:andrew.cooper3@citrix.com>> wrote:
> >
> > On 19/06/14 11:23, Hongyang Yang wrote:
> > > On 06/19/2014 05:36 PM, Andrew Cooper wrote:
> > >> On 19/06/14 10:13, Hongyang Yang wrote:
> > >>> Hi Andrew, Ian,
> > >>>
> > >>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
> > >>>> On 17/06/14 17:40, Ian Campbell wrote:
> > >>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> > >>>>>> +The following features are not yet fully specified and will be
> > >>>>>> +included in a future draft.
> > >>>>>> +
> > >>>>>> +* Remus
> > >>>>> What is the plan for Remus here?
> > >>>>>
> > >>>>> It has pretty large implications for the flow of a migration
> > >>>>> stream and
> > >>>>> therefore on the code in the final two patches, I suspect it will
> > >>>>> require high level changes to those functions, so I'm reluctant to
> > >>>>> spend
> > >>>>> a lot of time on them as they are.
> > >>>>
> > >>>> I don't believe too much change will be required to the final two
> > >>>> patches, but it does depend on fixing the current qemu record layer
> > >>>> violations.
> > >>>>
> > >>>> It will be much easier to do after a prototype of the libxl-level
> > >>>> fixes.
> > >>>
> > >>> I'm trying to port Remus to migration v2...
> > >>
> > >> Ah fantastic! Here I was expecting to have to eventually brave
> > >> that code myself.
> > >>
> > >> How is it going?  How are you finding hacking on v2 compared to the
> > >> legacy code? (I think you are the first person who isn't me trying to
> > >> extend it)  Is there anything I can do while still developing v2
> > >> to make things easier?
> > >
> > > It's just starting, and only on the libxc side, based on your patch
> > > series. The v2 code is much cleaner than the legacy code and easy to
> > > understand, and yes, it makes hacking easier. I may need your help
> > > as the hacking goes on...
> > >
> > >>
> > >>
> > >> I really need to get a prototype libxl framing document sorted, but
> > >> in principle my plan (given only a minimum understanding of the
> > >> algorithm) is this:
> > >>
> > >> ...
> > >> * Write page data update
> > >> * Write vcpu context etc
> > >> * Write a REMUS_CHECKPOINT record (or appropriate name)
> > >> * Call the checkpoint callback, passing ownership of the fd to libxl
> > >> ** libxl writes a libxl qemu record into the stream
> > >> * checkpoint callback returns to libxl, returning ownership of the fd
> > >> * libxc chooses between sending an END record or looping
> > >> ...
> > >>
> > >> The fd ownership is expected to work exactly the same on the
> > >> receiving side, using the REMUS_CHECKPOINT record as an indicator.
> > >
> > > It mostly looks plausible, but the save side and restore side need
> > > to be synchronised; otherwise the following problem may exist:
> > >   the sending side is in libxl and sends qemu records, while the
> > >   receiving side is still in libxc; after it has switched to libxl,
> > >   part of a record may be lost.
> > > Maybe a handshake will solve the problem, whether it's in libxl or
> > > libxc, but the current migration framework does not support sending
> > > messages from the receiving side to the sending side, so it needs
> > > modifications. We should support this feature.
> >
> > Ah yes I see.
> >
> > How about this?
> >
> > Libxc REMUS_CHECKPOINT is defined as a 0-length record (like the current
> > END record).
> > Libxl REMUS_CHECKPOINT is defined containing at least "last checkpoint"
> > bit in the header.
> >
> > Libxc writes a libxc REMUS_CHECKPOINT record into the stream and always
> > hands the fd to libxl.
> > Libxl then writes a libxl REMUS_CHECKPOINT record, including the last
> > checkpoint bit if needed.
> >
>
> I am a bit lost on this part. A silly question: the last I recall (a
> long time ago), the v2 format didn't allow for the page compression to
> be done asynchronously. Has this limitation changed?
>

The v2 format specifies records in a stream; nothing more.  It has no
bearing on whether the page compression happens asynchronously wrt
unpausing the domain or not.

I presume you actually mean the current implementation...

> IOW, in the current migration process, the dirty page data is written
> out while the guest remains suspended. With remus, the compressed page
> data is written out after resuming the guest. This deferred write out
> logic needs to be incorporated into v2 code.
>

... which is the way it is because the first implementation was done
with regular basic migration as a top priority.  This can certainly be
reworked when remus support is reintroduced.

> > This means that it is libxl on the receiving side which determines
> > whether the last checkpoint has been reached, and libxc must always pass
> > the fd up.  This fixes the synchronisation issues, without requiring a
> > back channel, but still maintaining appropriate layering.
> >
>
> So there is a TODO item in the current libxl-remus patches. We need an
> explicit acknowledgement from the receiver side that it has received
> the memory checkpoint. Whether it comes from libxc or libxl on the
> receiver side does not matter, as long as the ack signifies reception
> of the memory checkpoint.
> The need for an explicit memory ack is because the disk and memory
> checkpoint channels are independent.
> We need both acks before releasing the buffered network output on the
> receiver side.
> The disk channel (blktap2 or DRBD) has always sent an explicit ack,
> but the memory channel has not. Though it's over TCP, on a given
> iteration the memory checkpoint data may still reside in the
> sender-side socket buffer while the disk checkpoint has reached the
> other end -- which isn't good.
>
> The existing libxc code does an fdatasync or fsync on the fd at the
> end of each iteration. I don't think that works as intended on TCP
> sockets. Please correct me if I am wrong about this.
>

That is a very sensible need for an explicit ack, although it would seem
to make more sense at the libxl level rather than the libxc level.

~Andrew


^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2014-06-22 16:01 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-11 18:14 [PATCH v5 0/14] Migration Stream v2 Andrew Cooper
2014-06-11 18:14 ` [PATCH v5 RFC 01/14] docs: libxc migration stream specification Andrew Cooper
2014-06-12  9:45   ` David Vrabel
2014-06-12 15:26   ` David Vrabel
2014-06-17 15:20   ` Ian Campbell
2014-06-17 17:42     ` Andrew Cooper
2014-06-17 16:40   ` Ian Campbell
2014-06-17 18:04     ` Andrew Cooper
2014-06-19  9:13       ` Hongyang Yang
2014-06-19  9:36         ` Andrew Cooper
2014-06-19 10:23           ` Hongyang Yang
2014-06-19 10:44             ` Andrew Cooper
2014-06-22 14:36               ` Shriram Rajagopalan
2014-06-22 16:01                 ` Andrew Cooper
2014-06-11 18:14 ` [PATCH v5 RFC 02/14] scripts: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
2014-06-12  9:48   ` David Vrabel
2014-06-11 18:14 ` [PATCH v5 RFC 03/14] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
2014-06-17 16:00   ` Ian Campbell
2014-06-17 16:17     ` Andrew Cooper
2014-06-17 16:47       ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 04/14] tools/libxc: C implementation of stream format Andrew Cooper
2014-06-12  9:52   ` David Vrabel
2014-06-12 15:31   ` David Vrabel
2014-06-17 15:55     ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 05/14] tools/libxc: noarch common code Andrew Cooper
2014-06-12  9:55   ` David Vrabel
2014-06-17 16:10   ` Ian Campbell
2014-06-17 16:28     ` Andrew Cooper
2014-06-17 16:53       ` Ian Campbell
2014-06-17 18:26         ` Andrew Cooper
2014-06-18  9:19           ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 06/14] tools/libxc: x86 " Andrew Cooper
2014-06-12  9:57   ` David Vrabel
2014-06-17 16:11     ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 07/14] tools/libxc: x86 PV " Andrew Cooper
2014-06-12  9:59   ` David Vrabel
2014-06-11 18:14 ` [PATCH v5 RFC 08/14] tools/libxc: x86 PV save code Andrew Cooper
2014-06-12 10:04   ` David Vrabel
2014-06-11 18:14 ` [PATCH v5 RFC 09/14] tools/libxc: x86 PV restore code Andrew Cooper
2014-06-12 10:08   ` David Vrabel
2014-06-12 15:49   ` David Vrabel
2014-06-12 17:01     ` Andrew Cooper
2014-06-17 16:22       ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 10/14] tools/libxc: x86 HVM common code Andrew Cooper
2014-06-12 10:11   ` David Vrabel
2014-06-17 16:22     ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 11/14] tools/libxc: x86 HVM save code Andrew Cooper
2014-06-12 10:12   ` David Vrabel
2014-06-12 15:55   ` David Vrabel
2014-06-12 17:07     ` Andrew Cooper
2014-06-17 16:25       ` Ian Campbell
2014-06-11 18:14 ` [PATCH v5 RFC 12/14] tools/libxc: x86 HVM restore code Andrew Cooper
2014-06-12 10:14   ` David Vrabel
2014-06-11 18:14 ` [PATCH v5 RFC 13/14] tools/libxc: noarch save code Andrew Cooper
2014-06-12 10:24   ` David Vrabel
2014-06-17 16:28     ` Ian Campbell
2014-06-17 16:38       ` David Vrabel
2014-06-17 16:54         ` Ian Campbell
2014-06-18  6:59   ` Hongyang Yang
2014-06-18  7:08     ` Hongyang Yang
2014-06-19  2:48   ` Wen Congyang
2014-06-19  9:19     ` Andrew Cooper
2014-06-22 14:02       ` Shriram Rajagopalan
2014-06-11 18:14 ` [PATCH v5 RFC 14/14] tools/libxc: noarch restore code Andrew Cooper
2014-06-12 10:27   ` David Vrabel
2014-06-12 16:05   ` David Vrabel
2014-06-12 17:16     ` Andrew Cooper
2014-06-19  6:16   ` Hongyang Yang
2014-06-19  9:00     ` Andrew Cooper
2014-06-12  3:17 ` [PATCH v5 0/14] Migration Stream v2 Hongyang Yang
2014-06-12 13:27   ` Andrew Cooper
2014-06-12 13:49     ` Wei Liu
2014-06-12 14:18       ` Andrew Cooper
2014-06-12 14:27         ` Wei Liu
2014-06-12  9:38 ` David Vrabel
2014-06-17 15:57   ` Ian Campbell
