From: Shriram Rajagopalan <rshriram@cs.ubc.ca>
To: David Vrabel <david.vrabel@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	"Xen-devel@lists.xen.org" <Xen-devel@lists.xen.org>
Subject: Re: Domain Save Image Format proposal (draft B)
Date: Mon, 10 Feb 2014 12:00:44 -0800
Message-ID: <CAP8mzPMtM0yXtv_th_rwPBMMx-nMEBCUoRnAtwXJUjYHy7AOdA@mail.gmail.com>
In-Reply-To: <52F90A71.40802@citrix.com>


On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> Here is a draft of a proposal for a new domain save image format.  It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).
>
> http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft A  6 Feb 2014   Initial draft.
>
> Draft B  10 Feb 2014  Corrected image header field widths.
>
>                       Minor updates and clarifications.
> --------------------------------------------------------------------
>
> Purpose
> -------
>
> The _domain save image_ is the context of a running domain used for
> snapshots of a domain or for transferring domains between hosts during
> migration.
>
> There are a number of problems with the format of the domain save
> image used in Xen 4.4 and earlier (the _legacy format_).
>
> * Dependent on toolstack word size.  A number of fields within the
>   image are native types such as `unsigned long` which have different
>   sizes between 32-bit and 64-bit hosts.  This prevents domains from
>   being migrated between 32-bit and 64-bit hosts.
>
> * There is no header identifying the image.
>
> * The image has no version information.
>
> A new format that addresses the above is required.
>
> ARM does not yet have a domain save image format specified and
> the format described in this specification should be suitable.
>
>

I suggest keeping the processing overhead in mind when designing the new
image format. Some key things have been addressed, such as making sure data
is always padded to maintain alignment. But some aspects of this proposal
seem unnecessary. More details below.


>
> Overview
> ========
>
> The image format consists of two main sections:
>
> * _Headers_
> * _Records_
>
> Headers
> -------
>
> There are two headers: the _image header_, and the _domain header_.
> The image header describes the format of the image (version etc.).
> The _domain header_ contains general information about the domain
> (architecture, type etc.).
>
> Records
> -------
>
> The main part of the format is a sequence of different _records_.
> Each record type contains information about the domain context.  At a
> minimum there is an END record marking the end of the records section.
>
>
> Fields
> ------
>
> All the fields within the headers and records have a fixed width.
>
> Fields are always aligned to their size.
>
> Padding and reserved fields are set to zero on save and must be
> ignored during restore.
>
>
So far so good.


> Integer (numeric) fields in the image header are always in big-endian
> byte order.
>
> Integer fields in the domain header and in the records are in the
> endianness described in the image header (which will typically be the
> native ordering).
>
>

It's tempting to adopt all the TCP-style madness for transferring a set of
structured data.  Why this endianness mess?  Am I missing something here?
I am assuming that the lion's share of Xen's deployment is on x86
(not including Amazon). So that leaves ARM.  Why not let these
processors take the hit of endianness conversion?

> Headers
> =======
>
> Image Header
> ------------
>
> The image header identifies an image as a Xen domain save image.  It
> includes the version of this specification that the image complies
> with.
>
> Tools supporting version _V_ of the specification shall always save
> images using version _V_.  Tools shall support restoring from version
> _V_ and version _V_ - 1.  Tools may additionally support restoring
> from earlier versions.
>
> The marker field can be used to distinguish between legacy images and
> those corresponding to this specification.  Legacy images will have
> one or more zero bits within the first 8 octets of the image.
>
> Fields within the image header are always in _big-endian_ byte order,
> regardless of the setting of the endianness bit.
>

and more endian-ness mess.


>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | marker                                          |
>     +-----------------------+-------------------------+
>     | id                    | version                 |
>     +-----------+-----------+-------------------------+
>     | options   |                                     |
>     +-----------+-------------------------------------+
>
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> marker      0xFFFFFFFFFFFFFFFF.
>
> id          0x58454E46 ("XENF" in ASCII).
>
> version     0x00000001.  The version of this specification.
>
> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
>
>             bit 1-15: Reserved.
> --------------------------------------------------------------------
>
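
For concreteness, here is a minimal Python sketch of reading the image header as quoted above. It assumes the third row is padded out to 8 octets (the diagram leaves that space blank), and the function names are mine, not part of the draft.

```python
import struct

# Layout per the quoted draft: marker (8 octets), id (4), version (4),
# options (2). The trailing 6 pad octets are an assumption read off the
# diagram. Image header fields are always big-endian.
IMAGE_HEADER = struct.Struct(">QIIH6x")
MARKER = 0xFFFFFFFFFFFFFFFF
XENF_ID = 0x58454E46  # "XENF" in ASCII


def build_image_header(version=1, big_endian=False):
    """Pack an image header; pad octets are written as zero."""
    options = 1 if big_endian else 0
    return IMAGE_HEADER.pack(MARKER, XENF_ID, version, options)


def parse_image_header(buf):
    """Return (version, byte_order), where byte_order is '<' or '>'."""
    marker, ident, version, options = IMAGE_HEADER.unpack_from(buf, 0)
    if marker != MARKER or ident != XENF_ID:
        raise ValueError("not a Xen domain save image")
    # options bit 0: 0 = little-endian, 1 = big-endian
    byte_order = ">" if options & 1 else "<"
    return version, byte_order
```

The returned byte order character can be fed directly into struct format strings for the domain header and records.
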
> Domain Header
> -------------
>
> The domain header includes general properties of the domain.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------+-----------+-----------+-------------+
>     | arch      | type      | page_shift| (reserved)  |
>     +-----------+-----------+-----------+-------------+
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> arch        0x0000: Reserved.
>
>             0x0001: x86.
>
>             0x0002: ARM.
>
> type        0x0000: Reserved.
>
>             0x0001: x86 PV.
>
>             0x0002 - 0xFFFF: Reserved.
>
> page_shift  Size of a guest page as a power of two.
>
>             i.e., page size = 2^page_shift^.
> --------------------------------------------------------------------
>
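
A rough sketch of decoding the domain header, assuming each of the four fields is 2 octets wide (inferred from the diagram, which splits 8 octets into four columns). The byte_order argument and the dictionary keys are illustrative choices of mine.

```python
import struct

ARCH_NAMES = {0x0001: "x86", 0x0002: "ARM"}
TYPE_NAMES = {0x0001: "x86 PV"}


def parse_domain_header(buf, byte_order="<"):
    """Decode arch, type and page_shift; byte_order is '<' or '>'.

    Assumes 16-bit fields followed by 2 reserved octets, which is
    ignored on restore as the draft requires.
    """
    arch, dom_type, page_shift = struct.unpack_from(byte_order + "HHH2x", buf, 0)
    return {
        "arch": ARCH_NAMES.get(arch, "reserved"),
        "type": TYPE_NAMES.get(dom_type, "reserved"),
        "page_shift": page_shift,
        "page_size": 1 << page_shift,  # page size = 2^page_shift
    }
```
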
>
> Records
> =======
>
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
>     ...
>     Record body of length body_length octets followed by
>     0 to 7 octets of padding.
>     ...
>     +-----------------------+-------------------------+
>     | checksum              | (reserved)              |
>     +-----------------------+-------------------------+
>
>
I am assuming that the checksum field is present only
for debugging purposes? Otherwise, I see no reason for the
computational overhead, given that we are already sending data
over a reliable channel and, IIRC, we already have an image-wide checksum
when saving the image to disk.

If debugging is the only use case, then I guess the type field
can be prefixed with a 1/0 bit, eliminating the need for the
options field (which carries a single checksum bit) and the reserved
padding after it. Similarly, if debugging mode is not set, why waste
another 8 bytes at the end for the checksum field?

That is, unless you think there may be more record types that need
special options.

Feel free to correct me if I am missing something elementary here.




> --------------------------------------------------------------------
> Field        Description
> -----------  -------------------------------------------------------
> type         0x00000000: END
>
>              0x00000001: PAGE_DATA
>
>              0x00000002: VCPU_INFO
>
>              0x00000003: VCPU_CONTEXT
>
>              0x00000004: X86_PV_INFO
>
>              0x00000005: P2M
>
>              0x00000006 - 0xFFFFFFFF: Reserved
>
> body_length  Length in octets of the record body.
>
> options      Bit 0: 0 = checksum invalid, 1 = checksum valid.
>
>              Bit 1-15: Reserved.
>
> checksum     CRC-32 checksum of the record body (including any trailing
>              padding), or 0x00000000 if the checksum field is invalid.
> --------------------------------------------------------------------
>
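
To make the per-record cost concrete, here is a sketch of the record framing as quoted: a 16-octet header, the body padded to a multiple of 8, and an 8-octet footer. The draft only says "CRC-32"; using zlib.crc32 (the common IEEE 802.3 polynomial) is my assumption, as are the function names.

```python
import struct
import zlib


def pad8(n):
    """Number of zero octets needed to round n up to a multiple of 8."""
    return (-n) % 8


def build_record(rec_type, body, byte_order="<", with_checksum=True):
    padded = body + b"\x00" * pad8(len(body))
    options = 1 if with_checksum else 0
    # Checksum covers the body including trailing padding; 0 if unused.
    checksum = zlib.crc32(padded) & 0xFFFFFFFF if with_checksum else 0
    header = struct.pack(byte_order + "IIH6x", rec_type, len(body), options)
    footer = struct.pack(byte_order + "I4x", checksum)
    return header + padded + footer


def parse_record(buf, offset=0, byte_order="<"):
    """Return (type, body, offset_of_next_record)."""
    rec_type, body_length, options = struct.unpack_from(
        byte_order + "IIH6x", buf, offset)
    start = offset + 16
    padded_len = body_length + pad8(body_length)
    (checksum,) = struct.unpack_from(byte_order + "I4x", buf, start + padded_len)
    if options & 1:  # bit 0 set: checksum field is valid
        actual = zlib.crc32(buf[start:start + padded_len]) & 0xFFFFFFFF
        if actual != checksum:
            raise ValueError("record checksum mismatch")
    body = bytes(buf[start:start + body_length])
    return rec_type, body, start + padded_len + 8
```

With this layout even an empty END record occupies 24 octets, and every record body is walked once more when the checksum bit is set.
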
> The following sub-sections specify the record body format for each of
> the record types.
>
> END
> ----
>
> An END record marks the end of the image.
>
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>
> The end record contains no fields; its body_length is 0.
>
> PAGE_DATA
> ---------
>
> The bulk of an image consists of many PAGE_DATA records containing the
> memory contents.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | count (C)             | (reserved)              |
>     +-----------------------+-------------------------+
>     | pfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | pfn[C-1]                                        |
>     +-------------------------------------------------+
>     | page_data[0]...                                 |
>     ...
>     +-------------------------------------------------+
>     | page_data[N-1]...                               |
>     ...
>     +-------------------------------------------------+
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> count       Number of pages described in this record.
>
> pfn         An array of count PFNs. Bits 63-60 contain
>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.
>
> page_data   page_size octets of uncompressed page contents for each page
>             set as present in the pfn array.
> --------------------------------------------------------------------
>
>
s/uncompressed/(compressed/uncompressed)/
(Remus sends compressed data)
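
For reference, the pfn array in the PAGE_DATA body quoted above could be decoded roughly as follows. This sketch only splits each entry into its type bits (63-60) and frame number; deciding which types count as "present", and whether the payload is compressed, is left out because the draft does not pin it down. The names are mine.

```python
import struct

PFN_TYPE_SHIFT = 60
PFN_MASK = (1 << PFN_TYPE_SHIFT) - 1


def parse_page_data_pfns(body, byte_order="<"):
    """Return ([(pfinfo_type, pfn), ...], offset_of_first_page).

    body is the PAGE_DATA record body: a 32-bit count, 4 reserved
    octets, then count 64-bit pfn entries.
    """
    (count,) = struct.unpack_from(byte_order + "I4x", body, 0)
    raw = struct.unpack_from(f"{byte_order}{count}Q", body, 8)
    entries = [(v >> PFN_TYPE_SHIFT, v & PFN_MASK) for v in raw]
    return entries, 8 + 8 * count
```
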


> VCPU_INFO
> ---------
>
> > [ This is a combination of parts of the extended-info and
> > XC_SAVE_ID_VCPU_INFO chunks. ]
>
> The VCPU_INFO record includes the maximum possible VCPU ID.  This will
> be followed by a VCPU_CONTEXT record for each online VCPU.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+------------------------+
>     | max_vcpu_id           | (reserved)             |
>     +-----------------------+------------------------+
>
> --------------------------------------------------------------------
> Field        Description
> -----------  ---------------------------------------------------
> max_vcpu_id  Maximum possible VCPU ID.
> --------------------------------------------------------------------
>
>
> VCPU_CONTEXT
> ------------
>
> The context for a single VCPU.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | vcpu_id               | (reserved)              |
>     +-----------------------+-------------------------+
>     | vcpu_ctx...                                     |
>     ...
>     +-------------------------------------------------+
>
> --------------------------------------------------------------------
> Field            Description
> -----------      ---------------------------------------------------
> vcpu_id          The VCPU ID.
>
> vcpu_ctx         Context for this VCPU.
> --------------------------------------------------------------------
>
> [ vcpu_ctx format TBD. ]
>
>
> X86_PV_INFO
> -----------
>
> > [ This record replaces part of the extended-info chunk. ]
>
>      0     1     2     3     4     5     6     7 octet
>     +-----+-----+-----+-------------------------------+
>     | w   | ptl | o   | (reserved)                    |
>     +-----+-----+-----+-------------------------------+
>
> --------------------------------------------------------------------
> Field            Description
> -----------      ---------------------------------------------------
> guest_width (w)  Guest width in octets (either 4 or 8).
>
> pt_levels (ptl)  Number of page table levels (either 3 or 4).
>
> options (o)      Bit 0: 0 - no VMASST_pae_extended_cr3,
>                  1 - VMASST_pae_extended_cr3.
>
>                  Bit 1-7: Reserved.
> --------------------------------------------------------------------
>
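
A small sketch of decoding the X86_PV_INFO body as drawn above (three single-octet fields followed by 5 reserved octets). The validation of the allowed values follows the field table; the function name and returned keys are hypothetical.

```python
import struct


def parse_x86_pv_info(body):
    """Decode guest_width, pt_levels and the options octet."""
    # Single-octet fields, so byte order does not matter here.
    width, pt_levels, options = struct.unpack_from("BBB5x", body, 0)
    if width not in (4, 8):
        raise ValueError("guest_width must be 4 or 8")
    if pt_levels not in (3, 4):
        raise ValueError("pt_levels must be 3 or 4")
    return {
        "guest_width": width,
        "pt_levels": pt_levels,
        # options bit 0: VMASST_pae_extended_cr3
        "pae_extended_cr3": bool(options & 1),
    }
```
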
>
> P2M
> ---
>
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]
>
> The P2M record contains a portion of the source domain's P2M.
> Multiple P2M records may be sent if the source P2M changes during the
> stream.
>
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | pfn_begin                                       |
>     +-------------------------------------------------+
>     | pfn_end                                         |
>     +-------------------------------------------------+
>     | mfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | mfn[N-1]                                        |
>     +-------------------------------------------------+
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> pfn_begin   The first PFN in this portion of the P2M
>
> pfn_end     One past the last PFN in this portion of the P2M.
>
> mfn         Array of (pfn_end - pfn_begin) MFNs corresponding to
>             the set of PFNs in the range [pfn_begin, pfn_end).
> --------------------------------------------------------------------
>
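
The P2M body above maps the half-open PFN range [pfn_begin, pfn_end) onto an array of MFNs. A minimal decoding sketch, with a length check that is my own addition rather than something the draft requires:

```python
import struct


def parse_p2m(body, byte_order="<"):
    """Return a dict mapping each PFN in [pfn_begin, pfn_end) to its MFN."""
    pfn_begin, pfn_end = struct.unpack_from(byte_order + "QQ", body, 0)
    if pfn_end < pfn_begin:
        raise ValueError("pfn_end precedes pfn_begin")
    n = pfn_end - pfn_begin
    if len(body) != 16 + 8 * n:
        raise ValueError("body length does not match PFN range")
    mfns = struct.unpack_from(f"{byte_order}{n}Q", body, 16)
    return {pfn_begin + i: mfn for i, mfn in enumerate(mfns)}
```

Because each record carries its own range, a later P2M record can simply overwrite entries from an earlier one when the source P2M changes mid-stream.
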
>
> Layout
> ======
>
> The set of valid records depends on the guest architecture and type.
>
> x86 PV Guest
> ------------
>
> An x86 PV guest image will have in this order:
>
> 1. Image header
> 2. Domain header
> 3. X86_PV_INFO record
> 4. At least one P2M record
> 5. At least one PAGE_DATA record
> 6. VCPU_INFO record
> 7. At least one VCPU_CONTEXT record
> 8. END record
>
>
There seems to be a bunch of info missing. Here are some
missing elements that I can recall at the moment:

a) There is no support for sending one-time markers that switch the
receiver's operating mode in the middle of a data stream,
e.g., XC_SAVE_ENABLE_COMPRESSION, XC_SAVE_ENABLE_VERIFY_MODE,
XC_SAVE_ID_LAST_CHECKPOINT, etc.

b) In the PV case, the tail also has a list of unmapped PFNs at the end of
every iteration.

c) XC_SAVE_ID_TOOLSTACK -- used by xl to pass device context information
(generally for HVMs).


>
> Legacy Images (x86 only)
> ========================
>
> Restoring legacy images from older tools shall be handled by
> translating the legacy format image into this new format.
>
> It shall not be possible to save in the legacy format.
>
> There are two different legacy images depending on whether they were
> generated by a 32-bit or a 64-bit toolstack. These shall be
> distinguished by inspecting octets 4-7 in the image.  If these are
> zero then it is a 64-bit image.
>
> Toolstack  Field                            Value
> ---------  -----                            -----
> 64-bit     Bit 32-63 of the p2m_size field  0 (since p2m_size < 2^32^)
> 32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
> 32-bit     Chunk type (HVM)                 < 0
> 32-bit     Page count (HVM)                 > 0
>
> Table: Possible values for octet 4-7 in legacy images
>
> This assumes the presence of the extended-info chunk which was
> introduced in Xen 3.0.
>
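
The detection rule above only needs the first 8 octets of the stream. A sketch of that classification, assuming (as the draft does) that a legacy image carries the extended-info chunk; the return labels are arbitrary names of mine.

```python
def classify_image(first8):
    """Classify a stream from its first 8 octets.

    Returns "new", "legacy-64" or "legacy-32".
    """
    if len(first8) < 8:
        raise ValueError("need at least 8 octets")
    head = bytes(first8[:8])
    # New format: the marker field is all ones.
    if head == b"\xff" * 8:
        return "new"
    # Legacy 64-bit toolstack: upper half of p2m_size is zero.
    if head[4:8] == b"\x00" * 4:
        return "legacy-64"
    # Otherwise octets 4-7 hold a chunk ID or page count written
    # by a 32-bit toolstack.
    return "legacy-32"
```
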
>
> Future Extensions
> =================
>
> All changes to this format require the image version to be increased.
>
> The format may be extended by adding additional record types.
>
> Extending an existing record type must be done by adding a new record
> type.  This allows old images with the old record to still be
> restored.
>
> The image header may be extended by _appending_ additional fields.  In
> particular, the `marker`, `id` and `version` fields must never change
> size or location.
>
>
