From: David Woodhouse <dwmw2@infradead.org>
To: Xen-devel <xen-devel@lists.xenproject.org>
Cc: "Stefano Stabellini" <sstabellini@kernel.org>,
	"Julien Grall" <julien@xen.org>, "Wei Liu" <wl@xen.org>,
	"Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>,
	"George Dunlap" <George.Dunlap@eu.citrix.com>,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Varad Gautam" <vrd@amazon.de>,
	paul@xen.org, "Ian Jackson" <ian.jackson@eu.citrix.com>,
	"Hongyan Xia" <hongyxia@amazon.com>, "Amit Shah" <aams@amazon.de>,
	"Roger Pau Monné" <roger.pau@citrix.com>
Subject: [Xen-devel] [RFC PATCH v3 15/22] Start documenting the live update handover
Date: Thu, 30 Jan 2020 16:13:23 +0000
Message-ID: <20200130161330.2324143-15-dwmw2@infradead.org>
In-Reply-To: <20200130161330.2324143-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
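For illustration only: a minimal sketch of how the breadcrumb described
in the "Breadcrumb" section of live-update-handover.pandoc might be
encoded and decoded, assuming x86_64 with 4KiB pages. The struct and the
names lu_breadcrumb, lu_breadcrumb_encode(), lu_breadcrumb_decode(),
LU_PAGE_SHIFT, LU_PAGE_MASK and LU_MAGIC are hypothetical and are not
taken from this series; the real breadcrumb is written by kexec_reloc().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for x86_64 with 4KiB pages; illustrative only. */
#define LU_PAGE_SHIFT 12
#define LU_PAGE_MASK  (~((UINT64_C(1) << LU_PAGE_SHIFT) - 1))

/* "LiveUpda" as given in the spec text. */
#define LU_MAGIC UINT64_C(0x4c69766555706461)

/* Hypothetical in-memory view of the three host-endian breadcrumb words. */
struct lu_breadcrumb {
    uint64_t live_update_magic;
    uint64_t mfn_array_physaddr;
    uint64_t shifted_nr_pages;
};

/* Build the breadcrumb values as they would appear after masking. */
static inline struct lu_breadcrumb
lu_breadcrumb_encode(uint64_t mfn_array_physaddr, uint64_t nr_pages)
{
    struct lu_breadcrumb bc;

    /* The MFN array is page-aligned, so masking loses nothing. */
    assert((mfn_array_physaddr & ~LU_PAGE_MASK) == 0);
    /* The page count must survive the left shift. */
    assert(nr_pages <= (UINT64_MAX >> LU_PAGE_SHIFT));

    bc.live_update_magic = LU_MAGIC & LU_PAGE_MASK;
    bc.mfn_array_physaddr = mfn_array_physaddr;
    bc.shifted_nr_pages = nr_pages << LU_PAGE_SHIFT;
    return bc;
}

/* Returns 1 and fills the outputs if the magic matches, else 0. */
static inline int
lu_breadcrumb_decode(const struct lu_breadcrumb *bc,
                     uint64_t *mfn_array_physaddr, uint64_t *nr_pages)
{
    if (bc->live_update_magic != (LU_MAGIC & LU_PAGE_MASK))
        return 0;

    *mfn_array_physaddr = bc->mfn_array_physaddr;
    *nr_pages = bc->shifted_nr_pages >> LU_PAGE_SHIFT;
    return 1;
}
```

Under these assumptions the masked magic is 0x4c69766555706000, and a
Xen built with a different page size or endianness would compare
against a different value and so reject the breadcrumb.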
 docs/specs/libxc-migration-stream.pandoc |  19 +-
 docs/specs/live-update-handover.pandoc   | 371 +++++++++++++++++++++++
 2 files changed, 388 insertions(+), 2 deletions(-)
 create mode 100644 docs/specs/live-update-handover.pandoc
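
For illustration only: a minimal sketch of helpers a consumer of the
live update data stream might use, based on the record type ranges and
the m2p_page_data example in live-update-handover.pandoc. All names
here are hypothetical. The sketch assumes that the "order" in the low
12 bits of m2p_page_data is log2 of the size in octets, which is what
the 1GiB example (0x4C0000000 encoded as 0x4C000001E) suggests; the
spec text does not state this explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Bit meanings taken from the record type ranges in the spec. */
#define REC_TYPE_OPTIONAL    UINT32_C(0x80000000) /* bit 31 */
#define REC_TYPE_LIVE_UPDATE UINT32_C(0x40000000) /* bit 30 */

/* Assumption: low 12 bits of m2p_page_data hold log2(size in octets). */
#define M2P_ORDER_MASK UINT64_C(0xfff)

static inline int rec_is_optional(uint32_t type)
{
    return (type & REC_TYPE_OPTIONAL) != 0;
}

static inline int rec_is_live_update(uint32_t type)
{
    return (type & REC_TYPE_LIVE_UPDATE) != 0;
}

/* Octets occupied by one record: 8-octet header, body, zero padding. */
static inline uint64_t rec_total_length(uint32_t body_length)
{
    return 8 + (((uint64_t)body_length + 7) & ~UINT64_C(7));
}

/* Combine a naturally aligned physical address with its order. */
static inline uint64_t m2p_entry_encode(uint64_t physaddr, unsigned int order)
{
    assert(order >= 12 && order < 64);
    assert((physaddr & ((UINT64_C(1) << order) - 1)) == 0);
    return physaddr | order;
}

static inline unsigned int m2p_entry_order(uint64_t entry)
{
    return (unsigned int)(entry & M2P_ORDER_MASK);
}

static inline uint64_t m2p_entry_base(uint64_t entry)
{
    return entry & ~M2P_ORDER_MASK;
}

/* An order of zero marks a discontiguity rather than a data page. */
static inline int m2p_entry_is_discontiguity(uint64_t entry)
{
    return m2p_entry_order(entry) == 0;
}
```

A reader would walk the stream by advancing rec_total_length() octets
per record and would fail on any unknown type for which
rec_is_optional() returns 0.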

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index a7a8a08936..9a6679f3de 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -227,12 +227,18 @@ type         0x00000000: END
 
              0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
 
-             0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
+             0x00000010 - 0x3FFFFFFF: Reserved for future _mandatory_
              records.
 
-             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
+             0x40000000 - 0x7FFFFFFF: Reserved for future _mandatory_
+             live update records.
+
+             0x80000000 - 0xBFFFFFFF: Reserved for future _optional_
              records.
 
+             0xC0000000 - 0xFFFFFFFF: Reserved for future _optional_
+             live update records.
+
 body_length  Length in octets of the record body.
 
 body         Content of the record.
@@ -246,6 +252,15 @@ Records may be _mandatory_ or _optional_.  Optional records have bit
 unsupported mandatory record must fail.  The contents of optional
 records may be ignored during a restore.
 
+Note: This basic record format, and some of the record types defined here,
+are also used for Live Update, as discussed in the Live Update Handover
+document: `docs/specs/live-update-handover.pandoc`.
+
+Records defined for live update have bit 30 set in their type value,
+are defined in that document, and are out of scope for this document.
+Such records shall not appear in the Domain Image Format as defined by
+this document.
+
 The following sub-sections specify the record body format for each of
 the record types.
 
diff --git a/docs/specs/live-update-handover.pandoc b/docs/specs/live-update-handover.pandoc
new file mode 100644
index 0000000000..31d23c7c90
--- /dev/null
+++ b/docs/specs/live-update-handover.pandoc
@@ -0,0 +1,371 @@
+% Live Update Handover Protocol
+% David Woodhouse <<dwmw@amazon.co.uk>>
+% Revision 1
+
+Introduction
+============
+
+Purpose
+-------
+
+Live update performs a _kexec_ from one running version of Xen to
+another, preserving all running domains in a form of guest-transparent
+live migration.
+
+This document outlines the memory layout requirements and data stream
+used in the handover protocol, to ensure that pages used by running
+domains are preserved during the transition from one version of Xen
+to the next.
+
+
+Compatibility
+-------------
+
+It cannot be repeated often enough that information passed over live
+update is an ABI. It is expected that live update can be performed from
+one major version of Xen to another, or even hypothetically to a system
+which is not Xen at all.
+
+It is necessary that some data are handed over "in place"; in
+particular the memory pages of the running domains. However, no
+internal Xen data structures may be transferred in this fashion; at
+least not without retrospectively declaring them to be ABI, with the
+restrictions that places on subsequent changes.
+
+
+
+Handover
+========
+
+
+Memory Usage Restrictions
+-------------------------
+
+The new Xen must take care not to use any memory pages which already
+belong to guests. To facilitate this, a contiguous region of memory
+is reserved for the boot allocator, known as _live update bootmem_.
+
+This region is reserved by the original Xen during its own boot, and
+the location made available to the _kexec(8)_ user space tool
+through the `kexec_get_range` hypercall using a new region type
+`KEXEC_RANGE_MA_LIVEUPDATE`. It is passed to the new Xen on the
+command line, using the `liveupdate=` parameter.
+
+The new Xen must not use any pages outside this region until it has
+consumed the live update data stream and determined which pages are
+already in use by running domains.
+
+At run time, Xen may use memory from the reserved region for any
+purpose that does not require preservation over a live update; in
+particular it must not be mapped to a domain.
+
+The new Xen executable image must be loaded by kexec to the same
+physical location as the running Xen, since that region of memory is
+known to be available. For that reason, freed init memory from the
+Xen image is also treated as reserved _live update bootmem_.
+
+
+Live Update Data Stream
+-----------------------
+
+During handover, the running Xen pauses all domains and creates a
+_live update data stream_ containing all the information required by
+the new Xen to restore them. This is largely the same as
+guest-transparent live migration.
+
+Data pages for this stream may be allocated anywhere in physical
+memory outside the _live update bootmem_ regions.
+
+Xen creates a physically contiguous array of MFNs of the allocated
+data pages, suitable for passing to `vmap()` to obtain a virtually
+contiguous mapping of the whole data stream.
+
+
+Breadcrumb
+----------
+
+The live update data stream is created during the final `kexec_exec`
+hypercall, so its address cannot be passed on the command line to the
+new Xen: the command line has to be set up by `kexec(8)` in userspace
+long beforehand.
+
+Thus, to allow the new Xen to find the data stream, the old Xen places
+a _breadcrumb_ in the first words of the _live update bootmem_, containing
+the number of data pages, and the physical address of the contiguous MFN
+array.
+
+The breadcrumb is written as the last action of the `kexec_reloc()`
+routine during the `kexec` handover, so cannot overwrite anything
+important by virtue of the existing guarantee that Xen will not place
+any data in that region which needs to survive across a live update.
+
+A restriction of the `kexec_reloc()` mechanism for writing the breadcrumb
+is that the values are host-endian and are masked with PAGE_MASK; the low
+bits are zeroed. This is actually perfect for the magic value used
+to recognise a live update breadcrumb, since it neatly prevents any attempt
+to live update to a Xen which uses a different endianness or page size.
+
+For the physical address of the MFN list it's perfectly fine, since
+that list is page-aligned anyway. For the number of pages, it means
+the value must be shifted accordingly. Hence the use of `shifted_nr_pages`
+in the breadcrumb structure below:
+
+
+     0      1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | live_update_magic                               |
+    +-------------------------------------------------+
+    | mfn_array_physaddr                              |
+    +-------------------------------------------------+
+    | shifted_nr_pages                                |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field               Description
+------------------- ------------------------------------------------
+live_update_magic   "LiveUpda" (0x4c69766555706461) stored in the host
+                    endianness and masked with PAGE_MASK.
+                    For example on x86_64: `00 60 70 55 65 76 69 4c`.
+
+mfn_array_physaddr  Machine address of MFN list for data stream.
+
+shifted_nr_pages    Number of data pages, shifted left by PAGE_SHIFT to
+                    avoid the limitation of kexec_reloc().
+--------------------------------------------------------------------
+
+
+IOMMU
+-----
+
+Where devices are passed through to domains, it may not be possible
+to quiesce those devices for the purpose of performing the update.
+
+If performing live update with assigned devices, the original Xen will
+leave the IOMMU mappings active during the handover (thus implying
+that IOMMU page tables may not be allocated in the _live update
+bootmem_ region either).
+
+The new Xen must resume control of the IOMMU without causing those mappings
+to become invalid even for a short period of time. On hardware which does not
+support Posted Interrupts, interrupts may need to be generated on resume.
+
+_This section will be expanded once we actually have it working._
+
+\clearpage
+
+Data Stream Overview
+====================
+
+Once discovered and mapped, the live update data stream forms a
+virtually contiguous stream of records following the basic form
+documented in the LibXenCtrl Domain Image Format at
+`docs/specs/libxc-migration-stream.pandoc`.
+
+Some record types from the LibXenCtrl Domain Image format are used
+as-is, such as the `X86_PV_INFO`, `X86_PV_VCPU_BASIC`, `HVM_CONTEXT`
+and other records containing domain-specific data.
+
+The Domain Header from that document is not used in that form, and a new
+record of type `LU_DOMAIN_INFO` is defined below.
+
+Other new record types specific to the live update process are defined in
+this document. Of those, some contain global state such as the M2P table
+information, while others are domain-specific.
+
+The live update data stream starts with records containing global
+information, followed any number of times by a `LU_DOMAIN_INFO` record
+and subsequent domain-specific records for that domain.
+
+There is a single `END` record at the end of the live update data stream,
+indicating that no more `LU_DOMAIN_INFO` records are present.
+
+\clearpage
+
+As defined in the LibXenCtrl Domain Image format document, a record
+has the following structure. Record type values defined for live update
+have bit 30 set, and are thus in the range 0x40000000-0x7FFFFFFF for
+mandatory live update records, and 0xC0000000-0xFFFFFFFF for optional
+live update records _(of which there are none at the present time)_.
+
+
+    0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x40000000: LU_VERSION
+
+             0x40000001: LU_M2P
+
+             0x40000002: LU_M2P_COMPAT
+
+             0x40000003: LU_DOMAIN_INFO
+
+             0x40000004 - 0x7FFFFFFF: Reserved for future _mandatory_
+             live update records.
+
+             0xC0000000 - 0xFFFFFFFF: Reserved for future _optional_
+             live update records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a multiple
+             of 8 octets.
+--------------------------------------------------------------------
+
+
+\clearpage
+
+Global Records
+==============
+
+LU_VERSION
+----------
+
+The version field indicates the version of Xen from which the system
+is live updating. In theory this should never be relevant, but it
+allows for version-specific workarounds to be implemented in the receiving
+Xen should they become necessary.
+
+     0      1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | xen_major             | xen_minor               |
+    +-----------------------+-------------------------+
+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+xen_major   The Xen major version from which the system is updating.
+
+xen_minor   The Xen minor version from which the system is updating.
+--------------------------------------------------------------------
+
+\clearpage
+
+LU_M2P / LU_M2P_COMPAT
+----------------------
+
+The M2P and compatibility M2P records contain a scatter/gather list of
+pages containing native or 32-bit M2P data.
+
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | m2p_page_data[0]...                             |
+    ...
+    +-------------------------------------------------+
+    | m2p_page_data[N-1]...                           |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field           Description
+--------------- --------------------------------------------------------
+m2p_page_data   A 64-bit value containing the physical address of the
+                next page of M2P data, encoding the _order_ of the page
+                into the low 12 bits. Thus, a 1GiB page at 0x4C0000000
+                would be encoded as 0x4C000001E.
+
+                In case the M2P does not contiguously cover pages starting
+                from MFN zero, a discontiguity is indicated by a field
+                with order set to zero. The high bits of the field then
+                provide the MFN for which the subsequent M2P data page
+                provides data.
+
+--------------------------------------------------------------------
+
+\clearpage
+
+Domain Specific Records
+=======================
+
+
+LU_DOMAIN_INFO
+--------------
+
+The domain info record contains general properties necessary to
+recreate a domain in the receiving Xen, and marks the start of a set
+of other domain-specific records pertaining to that domain.
+
+     0      1     2     3     4     5     6     7 octet
+    +-----------------------+-----------+-------------+
+    | type                  | page_shift| domain_id   |
+    +-----------------------+-----------+-------------+
+    | domain_handle[0-7]                              |
+    +-------------------------------------------------+
+    | domain_handle[8-15]                             |
+    +-----------------------+-------------------------+
+    | ssidref               | flags                   |
+    +-----------------------+-------------------------+
+    | max_vcpus             | emulation_flags         |
+    +-----------------------+-------------------------+
+    | extra_flags           | (padding)               |
+    +-----------------------+-------------------------+
+
+
+--------------------------------------------------------------------
+Field           Description
+--------------- --------------------------------------------------------
+type            0x00000000: Reserved.
+
+                0x00000001: x86 PV.
+
+                0x00000002: x86 HVM.
+
+                0x00000003 - 0xFFFFFFFF: Reserved.
+
+page_shift      Size of a guest page as a power of two.
+
+                i.e., page size = 2 ^page_shift^.
+
+domain_id       Domain ID
+
+
+domain_handle   UUID domain handle.
+
+ssidref         Security Identifier Index
+
+flags           Domain flags using `XEN_DOMCTL_CDF_`
+
+max_vcpus       Maximum vCPUs for domain.
+
+emulation_flags Emulation flags using `XEN_X86_EMU_`
+
+extra_flags     Additional flags:
+
+                0x00000001: Is privileged
+
+--------------------------------------------------------------------
+
+\clearpage
+
+Future Extensions
+=================
+
+All changes to this specification should bump the revision number in
+the title block.
+
+All changes to the image or domain headers require the image version
+to be increased.
+
+The format may be extended by adding additional record types.
+
+Extending an existing record type must be done by adding a new record
+type.  This allows old images with the old record to still be
+restored.
+
+The image header may only be extended by _appending_ additional
+fields.  In particular, the `marker`, `id` and `version` fields must
+never change size or location.
+
+
-- 
2.21.0


