All of lore.kernel.org
 help / color / mirror / Atom feed
* Domain Save Image Format proposal (draft B)
@ 2014-02-10 17:20 David Vrabel
  2014-02-10 17:32 ` Andrew Cooper
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: David Vrabel @ 2014-02-10 17:20 UTC (permalink / raw)
  To: Xen-devel
  Cc: Shriram Rajagopalan, Ian Jackson, Ian Campbell, Stefano Stabellini

Here is a draft of a proposal for a new domain save image format.  It
does not currently cover all use cases (e.g., images for HVM guest are
not considered).

http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf

Introduction
============

Revision History
----------------

--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft A  6 Feb 2014   Initial draft.

Draft B  10 Feb 2014  Corrected image header field widths.

                      Minor updates and clarifications.
--------------------------------------------------------------------

Purpose
-------

The _domain save image_ is the context of a running domain used for
snapshots of a domain or for transferring domains between hosts during
migration.

There are a number of problems with the format of the domain save
image used in Xen 4.4 and earlier (the _legacy format_).

* Dependant on toolstack word size.  A number of fields within the
  image are native types such as `unsigned long` which have different
  sizes between 32-bit and 64-bit hosts.  This prevents domains from
  being migrated between 32-bit and 64-bit hosts.

* There is no header identifying the image.

* The image has no version information.

A new format that addresses the above is required.

ARM does not yet have have a domain save image format specified and
the format described in this specification should be suitable.


Overview
========

The image format consists of two main sections:

* _Headers_
* _Records_

Headers
-------

There are two headers: the _image header_, and the _domain header_.
The image header describes the format of the image (version etc.).
The _domain header_ contains general information about the domain
(architecture, type etc.).

Records
-------

The main part of the format is a sequence of different _records_.
Each record type contains information about the domain context.  At a
minimum there is a END record marking the end of the records section.


Fields
------

All the fields within the headers and records have a fixed width.

Fields are always aligned to their size.

Padding and reserved fields are set to zero on save and must be
ignored during restore.

Integer (numeric) fields in the image header are always in big-endian
byte order.

Integer fields in the domain header and in the records are in the
endianess described in the image header (which will typically be the
native ordering).

Headers
=======

Image Header
------------

The image header identifies an image as a Xen domain save image.  It
includes the version of this specification that the image complies
with.

Tools supporting version _V_ of the specification shall always save
images using version _V_.  Tools shall support restoring from version
_V_ and version _V_ - 1.  Tools may additionally support restoring
from earlier versions.

The marker field can be used to distinguish between legacy images and
those corresponding to this specification.  Legacy images will have at
one or more zero bits within the first 8 octets of the image.

Fields within the image header are always in _big-endian_ byte order,
regardless of the setting of the endianness bit.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | marker                                          |
    +-----------------------+-------------------------+
    | id                    | version                 |
    +-----------+-----------+-------------------------+
    | options   |                                     |
    +-----------+-------------------------------------+


--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
marker      0xFFFFFFFFFFFFFFFF.

id          0x58454E46 ("XENF" in ASCII).

version     0x00000001.  The version of this specification.

options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.

            bit 1-15: Reserved.
--------------------------------------------------------------------

Domain Header
-------------

The domain header includes general properties of the domain.

     0      1     2     3     4     5     6     7 octet
    +-----------+-----------+-----------+-------------+
    | arch      | type      | page_shift| (reserved)  |
    +-----------+-----------+-----------+-------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
arch        0x0000: Reserved.

            0x0001: x86.

            0x0002: ARM.

type        0x0000: Reserved.

            0x0001: x86 PV.

            0x0002 - 0xFFFF: Reserved.

page_shift  Size of a guest page as a power of two.

            i.e., page size = 2^page_shift^.
--------------------------------------------------------------------


Records
=======

A record has a record header, type specific data and a trailing
footer.  If body_length is not a multiple of 8, the body is padded
with zeroes to align the checksum field on an 8 octet boundary.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | type                  | body_length             |
    +-----------+-----------+-------------------------+
    | options   | (reserved)                          |
    +-----------+-------------------------------------+
    ...
    Record body of length body_length octets followed by
    0 to 7 octets of padding.
    ...
    +-----------------------+-------------------------+
    | checksum              | (reserved)              |
    +-----------------------+-------------------------+

--------------------------------------------------------------------
Field        Description
-----------  -------------------------------------------------------
type         0x00000000: END

             0x00000001: PAGE_DATA

             0x00000002: VCPU_INFO

             0x00000003: VCPU_CONTEXT

             0x00000004: X86_PV_INFO

             0x00000005: P2M

             0x00000006 - 0xFFFFFFFF: Reserved

body_length  Length in octets of the record body.

options      Bit 0: 0 - checksum invalid, 1 = checksum valid.

             Bit 1-15: Reserved.

checksum     CRC-32 checksum of the record body (including any trailing
             padding), or 0x00000000 if the checksum field is invalid.
--------------------------------------------------------------------

The following sub-sections specify the record body format for each of
the record types.

END
----

A end record marks the end of the image.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The end record contains no fields; its body_length is 0.

PAGE_DATA
---------

The bulk of an image consists of many PAGE_DATA records containing the
memory contents.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | count (C)             | (reserved)              |
    +-----------------------+-------------------------+
    | pfn[0]                                          |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | pfn[C-1]                                        |
    +-------------------------------------------------+
    | page_data[0]...                                 |
    ...
    +-------------------------------------------------+
    | page_data[N-1]...                               |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
count       Number of pages described in this record.

pfn         An array of count PFNs. Bits 63-60 contain
            the XEN\_DOMCTL\_PFINFO_* value for that PFN.

page_data   page_size octets of uncompressed page contents for each page
            set as present in the pfn array.
--------------------------------------------------------------------

VCPU_INFO
---------

> [ This is a combination of parts of the extended-info and
> XC_SAVE_ID_VCPU_INFO chunks. ]

The VCPU_INFO record includes the maximum possible VCPU ID.  This will
be followed a VCPU_CONTEXT record for each online VCPU.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+------------------------+
    | max_vcpu_id           | (reserved)             |
    +-----------------------+------------------------+

--------------------------------------------------------------------
Field        Description
-----------  ---------------------------------------------------
max_vcpu_id  Maximum possible VCPU ID.
--------------------------------------------------------------------


VCPU_CONTEXT
------------

The context for a single VCPU.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | vcpu_id               | (reserved)              |
    +-----------------------+-------------------------+
    | vcpu_ctx...                                     |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
vcpu_id          The VCPU ID.

vcpu_ctx         Context for this VCPU.
--------------------------------------------------------------------

[ vcpu_ctx format TBD. ]


X86_PV_INFO
-----------

> [ This record replaces part of the extended-info chunk. ]

     0     1     2     3     4     5     6     7 octet
    +-----+-----+-----+-------------------------------+
    | w   | ptl | o   | (reserved)                    |
    +-----+-----+-----+-------------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
guest_width (w)  Guest width in octets (either 4 or 8).

pt_levels (ptl)  Number of page table levels (either 3 or 4).

options (o)      Bit 0: 0 - no VMASST_pae_extended_cr3,
                 1 - VMASST_pae_extended_cr3.

                 Bit 1-7: Reserved.
--------------------------------------------------------------------


P2M
---

[ This is a more flexible replacement for the old p2m_size field and
p2m array. ]

The P2M record contains a portion of the source domain's P2M.
Multiple P2M records may be sent if the source P2M changes during the
stream.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | pfn_begin                                       |
    +-------------------------------------------------+
    | pfn_end                                         |
    +-------------------------------------------------+
    | mfn[0]                                          |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | mfn[N-1]                                        |
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
pfn_begin   The first PFN in this portion of the P2M

pfn_end     One past the last PFN in this portion of the P2M.

mfn         Array of (pfn_end - pfn-begin) MFNs corresponding to
            the set of PFNs in the range [pfn_begin, pfn_end).
--------------------------------------------------------------------


Layout
======

The set of valid records depends on the guest architecture and type.

x86 PV Guest
------------

An x86 PV guest image will have in this order:

1. Image header
2. Domain header
3. X86_PV_INFO record
4. At least one P2M record
5. At least one PAGE_DATA record
6. VCPU_INFO record
6. At least one VCPU_CONTEXT record
7. END record


Legacy Images (x86 only)
========================

Restoring legacy images from older tools shall be handled by
translating the legacy format image into this new format.

It shall not be possible to save in the legacy format.

There are two different legacy images depending on whether they were
generated by a 32-bit or a 64-bit toolstack. These shall be
distinguished by inspecting octets 4-7 in the image.  If these are
zero then it is a 64-bit image.

Toolstack  Field                            Value
---------  -----                            -----
64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
32-bit     Chunk type (HVM)                 < 0
32-bit     Page count (HVM)                 > 0

Table: Possible values for octet 4-7 in legacy images

This assumes the presence of the extended-info chunk which was
introduced in Xen 3.0.


Future Extensions
=================

All changes to this format require the image version to be increased.

The format may be extended by adding additional record types.

Extending an existing record type must be done by adding a new record
type.  This allows old images with the old record to still be
restored.

The image header may be extended by _appending_ additional fields.  In
particular, the `marker`, `id` and `version` fields must never change
size or location.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
@ 2014-02-10 17:32 ` Andrew Cooper
  2014-02-10 17:48 ` Frediano Ziglio
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Andrew Cooper @ 2014-02-10 17:32 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Jackson,
	Ian Campbell, Xen-devel

On 10/02/14 17:20, David Vrabel wrote:
> Here is a draft of a proposal for a new domain save image format.  It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).
>
> http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft A  6 Feb 2014   Initial draft.
>
> Draft B  10 Feb 2014  Corrected image header field widths.
>
>                       Minor updates and clarifications.
> --------------------------------------------------------------------
>
> Purpose
> -------
>
> The _domain save image_ is the context of a running domain used for
> snapshots of a domain or for transferring domains between hosts during
> migration.
>
> There are a number of problems with the format of the domain save
> image used in Xen 4.4 and earlier (the _legacy format_).
>
> * Dependant on toolstack word size.  A number of fields within the
>   image are native types such as `unsigned long` which have different
>   sizes between 32-bit and 64-bit hosts.  This prevents domains from
>   being migrated between 32-bit and 64-bit hosts.

s/hosts/toolstacks/ (or toolstack domains)

~Andrew

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
  2014-02-10 17:32 ` Andrew Cooper
@ 2014-02-10 17:48 ` Frediano Ziglio
  2014-02-10 20:00 ` Shriram Rajagopalan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Frediano Ziglio @ 2014-02-10 17:48 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Jackson,
	Ian Campbell, Xen-devel

On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote:
> Here is a draft of a proposal for a new domain save image format.  It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).
> 
> http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf
> 
....
> 
> Integer fields in the domain header and in the records are in the
> endianess described in the image header (which will typically be the
> native ordering).
> 

This mean that we could have a Intel image coded in big endian format.
It's fine but we have to support native and not native images.

...

Frediano

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
  2014-02-10 17:32 ` Andrew Cooper
  2014-02-10 17:48 ` Frediano Ziglio
@ 2014-02-10 20:00 ` Shriram Rajagopalan
  2014-02-11  1:35   ` Andrew Cooper
  2014-02-11 11:58   ` David Vrabel
  2014-02-11  9:30 ` Ian Campbell
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Shriram Rajagopalan @ 2014-02-10 20:00 UTC (permalink / raw)
  To: David Vrabel; +Cc: Stefano Stabellini, Ian Jackson, Ian Campbell, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 16425 bytes --]

On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@citrix.com>wrote:

> Here is a draft of a proposal for a new domain save image format.  It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).
>
> http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft A  6 Feb 2014   Initial draft.
>
> Draft B  10 Feb 2014  Corrected image header field widths.
>
>                       Minor updates and clarifications.
> --------------------------------------------------------------------
>
> Purpose
> -------
>
> The _domain save image_ is the context of a running domain used for
> snapshots of a domain or for transferring domains between hosts during
> migration.
>
> There are a number of problems with the format of the domain save
> image used in Xen 4.4 and earlier (the _legacy format_).
>
> * Dependant on toolstack word size.  A number of fields within the
>   image are native types such as `unsigned long` which have different
>   sizes between 32-bit and 64-bit hosts.  This prevents domains from
>   being migrated between 32-bit and 64-bit hosts.
>
> * There is no header identifying the image.
>
> * The image has no version information.
>
> A new format that addresses the above is required.
>
> ARM does not yet have have a domain save image format specified and
> the format described in this specification should be suitable.
>
>

I suggest keeping the processing overhead in mind when designing the new
image format. Some key things have been addressed, such as making sure data
is always padded to maintain alignment. But there are also some aspects of
this
proposal that seem awfully unnecessary.. More details below.


>
> Overview
> ========
>
> The image format consists of two main sections:
>
> * _Headers_
> * _Records_
>
> Headers
> -------
>
> There are two headers: the _image header_, and the _domain header_.
> The image header describes the format of the image (version etc.).
> The _domain header_ contains general information about the domain
> (architecture, type etc.).
>
> Records
> -------
>
> The main part of the format is a sequence of different _records_.
> Each record type contains information about the domain context.  At a
> minimum there is a END record marking the end of the records section.
>
>
> Fields
> ------
>
> All the fields within the headers and records have a fixed width.
>
> Fields are always aligned to their size.
>
> Padding and reserved fields are set to zero on save and must be
> ignored during restore.
>
>
So far so good.


> Integer (numeric) fields in the image header are always in big-endian
> byte order.
>
> Integer fields in the domain header and in the records are in the
> endianess described in the image header (which will typically be the
> native ordering).
>
>

Its tempting to adopt all the TCP-style madness for transferring a set of
structured data.  Why this endian-ness mess?  Am I missing something here?
I am assuming that a lion's share of Xen's deployment is on x86
(not including Amazon). So that leaves ARM.  Why not let these
processors take the hit of endian-ness conversion?

Headers
> =======
>
> Image Header
> ------------
>
> The image header identifies an image as a Xen domain save image.  It
> includes the version of this specification that the image complies
> with.
>
> Tools supporting version _V_ of the specification shall always save
> images using version _V_.  Tools shall support restoring from version
> _V_ and version _V_ - 1.  Tools may additionally support restoring
> from earlier versions.
>
> The marker field can be used to distinguish between legacy images and
> those corresponding to this specification.  Legacy images will have at
> one or more zero bits within the first 8 octets of the image.
>
> Fields within the image header are always in _big-endian_ byte order,
> regardless of the setting of the endianness bit.
>

and more endian-ness mess.


>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | marker                                          |
>     +-----------------------+-------------------------+
>     | id                    | version                 |
>     +-----------+-----------+-------------------------+
>     | options   |                                     |
>     +-----------+-------------------------------------+
>
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> marker      0xFFFFFFFFFFFFFFFF.
>
> id          0x58454E46 ("XENF" in ASCII).
>
> version     0x00000001.  The version of this specification.
>
> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
>
>             bit 1-15: Reserved.
> --------------------------------------------------------------------
>
> Domain Header
> -------------
>
> The domain header includes general properties of the domain.
>
>      0      1     2     3     4     5     6     7 octet
>     +-----------+-----------+-----------+-------------+
>     | arch      | type      | page_shift| (reserved)  |
>     +-----------+-----------+-----------+-------------+
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> arch        0x0000: Reserved.
>
>             0x0001: x86.
>
>             0x0002: ARM.
>
> type        0x0000: Reserved.
>
>             0x0001: x86 PV.
>
>             0x0002 - 0xFFFF: Reserved.
>
> page_shift  Size of a guest page as a power of two.
>
>             i.e., page size = 2^page_shift^.
> --------------------------------------------------------------------
>
>
> Records
> =======
>
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
>     ...
>     Record body of length body_length octets followed by
>     0 to 7 octets of padding.
>     ...
>     +-----------------------+-------------------------+
>     | checksum              | (reserved)              |
>     +-----------------------+-------------------------+
>
>
I am assuming that you the checksum field is present only
for debugging purposes? Otherwise, I see no reason for the
computational overhead, given that we are already sending data
over a reliable channel + IIRC we already have an image-wide checksum
when saving the image to disk.

If debugging is the only use case, then I guess, the type field
can be prefixed with a 1/0 bit, eliminating the need for the
1-bit checkum options field + 7-byte padding. Similarly, if debugging
mode is not set, why waste another 8 bytes in the end for the checksum
field.

Unless you think there may be more types with need of special options,

Feel free to correct me if I am missing something elementary here..




> --------------------------------------------------------------------
> Field        Description
> -----------  -------------------------------------------------------
> type         0x00000000: END
>
>              0x00000001: PAGE_DATA
>
>              0x00000002: VCPU_INFO
>
>              0x00000003: VCPU_CONTEXT
>
>              0x00000004: X86_PV_INFO
>
>              0x00000005: P2M
>
>              0x00000006 - 0xFFFFFFFF: Reserved
>
> body_length  Length in octets of the record body.
>
> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
>
>              Bit 1-15: Reserved.
>
> checksum     CRC-32 checksum of the record body (including any trailing
>              padding), or 0x00000000 if the checksum field is invalid.
> --------------------------------------------------------------------
>
> The following sub-sections specify the record body format for each of
> the record types.
>
> END
> ----
>
> A end record marks the end of the image.
>
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>
> The end record contains no fields; its body_length is 0.
>
> PAGE_DATA
> ---------
>
> The bulk of an image consists of many PAGE_DATA records containing the
> memory contents.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | count (C)             | (reserved)              |
>     +-----------------------+-------------------------+
>     | pfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | pfn[C-1]                                        |
>     +-------------------------------------------------+
>     | page_data[0]...                                 |
>     ...
>     +-------------------------------------------------+
>     | page_data[N-1]...                               |
>     ...
>     +-------------------------------------------------+
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> count       Number of pages described in this record.
>
> pfn         An array of count PFNs. Bits 63-60 contain
>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.
>
> page_data   page_size octets of uncompressed page contents for each page
>             set as present in the pfn array.
> --------------------------------------------------------------------
>
>
s/uncompressed/(compressed/uncompressed)/
(Remus sends compressed data)


> VCPU_INFO
> ---------
>
> > [ This is a combination of parts of the extended-info and
> > XC_SAVE_ID_VCPU_INFO chunks. ]
>
> The VCPU_INFO record includes the maximum possible VCPU ID.  This will
> be followed a VCPU_CONTEXT record for each online VCPU.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+------------------------+
>     | max_vcpu_id           | (reserved)             |
>     +-----------------------+------------------------+
>
> --------------------------------------------------------------------
> Field        Description
> -----------  ---------------------------------------------------
> max_vcpu_id  Maximum possible VCPU ID.
> --------------------------------------------------------------------
>
>
> VCPU_CONTEXT
> ------------
>
> The context for a single VCPU.
>
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | vcpu_id               | (reserved)              |
>     +-----------------------+-------------------------+
>     | vcpu_ctx...                                     |
>     ...
>     +-------------------------------------------------+
>
> --------------------------------------------------------------------
> Field            Description
> -----------      ---------------------------------------------------
> vcpu_id          The VCPU ID.
>
> vcpu_ctx         Context for this VCPU.
> --------------------------------------------------------------------
>
> [ vcpu_ctx format TBD. ]
>
>
> X86_PV_INFO
> -----------
>
> > [ This record replaces part of the extended-info chunk. ]
>
>      0     1     2     3     4     5     6     7 octet
>     +-----+-----+-----+-------------------------------+
>     | w   | ptl | o   | (reserved)                    |
>     +-----+-----+-----+-------------------------------+
>
> --------------------------------------------------------------------
> Field            Description
> -----------      ---------------------------------------------------
> guest_width (w)  Guest width in octets (either 4 or 8).
>
> pt_levels (ptl)  Number of page table levels (either 3 or 4).
>
> options (o)      Bit 0: 0 - no VMASST_pae_extended_cr3,
>                  1 - VMASST_pae_extended_cr3.
>
>                  Bit 1-7: Reserved.
> --------------------------------------------------------------------
>
>
> P2M
> ---
>
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]
>
> The P2M record contains a portion of the source domain's P2M.
> Multiple P2M records may be sent if the source P2M changes during the
> stream.
>
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | pfn_begin                                       |
>     +-------------------------------------------------+
>     | pfn_end                                         |
>     +-------------------------------------------------+
>     | mfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | mfn[N-1]                                        |
>     +-------------------------------------------------+
>
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> pfn_begin   The first PFN in this portion of the P2M
>
> pfn_end     One past the last PFN in this portion of the P2M.
>
> mfn         Array of (pfn_end - pfn-begin) MFNs corresponding to
>             the set of PFNs in the range [pfn_begin, pfn_end).
> --------------------------------------------------------------------
>
>
> Layout
> ======
>
> The set of valid records depends on the guest architecture and type.
>
> x86 PV Guest
> ------------
>
> An x86 PV guest image will have in this order:
>
> 1. Image header
> 2. Domain header
> 3. X86_PV_INFO record
> 4. At least one P2M record
> 5. At least one PAGE_DATA record

6. VCPU_INFO record
> 6. At least one VCPU_CONTEXT record

7. END record
>
>
There seems to be a bunch of info missing. Here are some
missing elements that I can recall at the moment:
a) there is no support for sending over one time markers that switch the
receiver's operating mode in the middle of a data stream.
E.g., XC_SAVE_ENABLE_COMPRESSION, XC_SAVE_ID_LAST_CHECKPOINT, etc.
XC_SAVE_ENABLE_VERIFY_MODE,

b) in pv case, the tail also has a list of unmapped PFNs at the end of
every iteration.

c) XC_SAVE_ID_TOOLSTACK -- used by xl to pass device context information
(generally
for HVMs).


>
> Legacy Images (x86 only)
> ========================
>
> Restoring legacy images from older tools shall be handled by
> translating the legacy format image into this new format.
>
> It shall not be possible to save in the legacy format.
>
> There are two different legacy images depending on whether they were
> generated by a 32-bit or a 64-bit toolstack. These shall be
> distinguished by inspecting octets 4-7 in the image.  If these are
> zero then it is a 64-bit image.
>
> Toolstack  Field                            Value
> ---------  -----                            -----
> 64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
> 32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
> 32-bit     Chunk type (HVM)                 < 0
> 32-bit     Page count (HVM)                 > 0
>
> Table: Possible values for octet 4-7 in legacy images
>
> This assumes the presence of the extended-info chunk which was
> introduced in Xen 3.0.
>
>
> Future Extensions
> =================
>
> All changes to this format require the image version to be increased.
>
> The format may be extended by adding additional record types.
>
> Extending an existing record type must be done by adding a new record
> type.  This allows old images with the old record to still be
> restored.
>
> The image header may be extended by _appending_ additional fields.  In
> particular, the `marker`, `id` and `version` fields must never change
> size or location.
>
>

[-- Attachment #1.2: Type: text/html, Size: 20803 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 20:00 ` Shriram Rajagopalan
@ 2014-02-11  1:35   ` Andrew Cooper
  2014-02-11  4:12     ` Shriram Rajagopalan
  2014-02-11 11:58   ` David Vrabel
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Cooper @ 2014-02-11  1:35 UTC (permalink / raw)
  To: rshriram, David Vrabel
  Cc: Xen-devel, Ian Jackson, Ian Campbell, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 18695 bytes --]

On 10/02/2014 20:00, Shriram Rajagopalan wrote:
> On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@citrix.com
> <mailto:david.vrabel@citrix.com>> wrote:
>
>     Here is a draft of a proposal for a new domain save image format.  It
>     does not currently cover all use cases (e.g., images for HVM guest are
>     not considered).
>
>     http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf
>
>     Introduction
>     ============
>
>     Revision History
>     ----------------
>
>     --------------------------------------------------------------------
>     Version  Date         Changes
>     -------  -----------  ----------------------------------------------
>     Draft A  6 Feb 2014   Initial draft.
>
>     Draft B  10 Feb 2014  Corrected image header field widths.
>
>                           Minor updates and clarifications.
>     --------------------------------------------------------------------
>
>     Purpose
>     -------
>
>     The _domain save image_ is the context of a running domain used for
>     snapshots of a domain or for transferring domains between hosts during
>     migration.
>
>     There are a number of problems with the format of the domain save
>     image used in Xen 4.4 and earlier (the _legacy format_).
>
>     * Dependant on toolstack word size.  A number of fields within the
>       image are native types such as `unsigned long` which have different
>       sizes between 32-bit and 64-bit hosts.  This prevents domains from
>       being migrated between 32-bit and 64-bit hosts.
>
>     * There is no header identifying the image.
>
>     * The image has no version information.
>
>     A new format that addresses the above is required.
>
>     ARM does not yet have have a domain save image format specified and
>     the format described in this specification should be suitable.
>
>
>
> I suggest keeping the processing overhead in mind when designing the new
> image format. Some key things have been addressed, such as making sure
> data
> is always padded to maintain alignment. But there are also some
> aspects of this
> proposal that seem awfully unnecessary.. More details below.
>  
>
>
>     Overview
>     ========
>
>     The image format consists of two main sections:
>
>     * _Headers_
>     * _Records_
>
>     Headers
>     -------
>
>     There are two headers: the _image header_, and the _domain header_.
>     The image header describes the format of the image (version etc.).
>     The _domain header_ contains general information about the domain
>     (architecture, type etc.).
>
>     Records
>     -------
>
>     The main part of the format is a sequence of different _records_.
>     Each record type contains information about the domain context.  At a
>     minimum there is a END record marking the end of the records section.
>
>
>     Fields
>     ------
>
>     All the fields within the headers and records have a fixed width.
>
>     Fields are always aligned to their size.
>
>     Padding and reserved fields are set to zero on save and must be
>     ignored during restore.
>
>
> So far so good.
>  
>
>     Integer (numeric) fields in the image header are always in big-endian
>     byte order.
>
>     Integer fields in the domain header and in the records are in the
>     endianess described in the image header (which will typically be the
>     native ordering).
>
>
>
> Its tempting to adopt all the TCP-style madness for transferring a set of
> structured data.  Why this endian-ness mess?  Am I missing something here?
> I am assuming that a lion's share of Xen's deployment is on x86 
> (not including Amazon). So that leaves ARM.  Why not let these 
> processors take the hit of endian-ness conversion?

The large majority is indeed x86, but don't discount ARM because it is
currently in the minority.  With the current requirements, the vast
majority of the data will still be little endian on x86.

>
>     Headers
>     =======
>
>     Image Header
>     ------------
>
>     The image header identifies an image as a Xen domain save image.  It
>     includes the version of this specification that the image complies
>     with.
>
>     Tools supporting version _V_ of the specification shall always save
>     images using version _V_.  Tools shall support restoring from version
>     _V_ and version _V_ - 1.  Tools may additionally support restoring
>     from earlier versions.
>
>     The marker field can be used to distinguish between legacy images and
>     those corresponding to this specification.  Legacy images will have at
>     one or more zero bits within the first 8 octets of the image.
>
>     Fields within the image header are always in _big-endian_ byte order,
>     regardless of the setting of the endianness bit.
>
>
> and more endian-ness mess.

Network order is perfectly valid.  Is is how all your network packets
arrive...

>
>
>          0     1     2     3     4     5     6     7 octet
>         +-------------------------------------------------+
>         | marker                                          |
>         +-----------------------+-------------------------+
>         | id                    | version                 |
>         +-----------+-----------+-------------------------+
>         | options   |                                     |
>         +-----------+-------------------------------------+
>
>
>     --------------------------------------------------------------------
>     Field       Description
>     ----------- --------------------------------------------------------
>     marker      0xFFFFFFFFFFFFFFFF.
>
>     id          0x58454E46 ("XENF" in ASCII).
>
>     version     0x00000001.  The version of this specification.
>
>     options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
>
>                 bit 1-15: Reserved.
>     --------------------------------------------------------------------
>
>     Domain Header
>     -------------
>
>     The domain header includes general properties of the domain.
>
>          0      1     2     3     4     5     6     7 octet
>         +-----------+-----------+-----------+-------------+
>         | arch      | type      | page_shift| (reserved)  |
>         +-----------+-----------+-----------+-------------+
>
>     --------------------------------------------------------------------
>     Field       Description
>     ----------- --------------------------------------------------------
>     arch        0x0000: Reserved.
>
>                 0x0001: x86.
>
>                 0x0002: ARM.
>
>     type        0x0000: Reserved.
>
>                 0x0001: x86 PV.
>
>                 0x0002 - 0xFFFF: Reserved.
>
>     page_shift  Size of a guest page as a power of two.
>
>                 i.e., page size = 2^page_shift^.
>     --------------------------------------------------------------------
>
>
>     Records
>     =======
>
>     A record has a record header, type specific data and a trailing
>     footer.  If body_length is not a multiple of 8, the body is padded
>     with zeroes to align the checksum field on an 8 octet boundary.
>
>          0     1     2     3     4     5     6     7 octet
>         +-----------------------+-------------------------+
>         | type                  | body_length             |
>         +-----------+-----------+-------------------------+
>         | options   | (reserved)                          |
>         +-----------+-------------------------------------+
>         ...
>         Record body of length body_length octets followed by
>         0 to 7 octets of padding.
>         ...
>         +-----------------------+-------------------------+
>         | checksum              | (reserved)              |
>         +-----------------------+-------------------------+
>
>
> I am assuming that you the checksum field is present only
> for debugging purposes? Otherwise, I see no reason for the
> computational overhead, given that we are already sending data
> over a reliable channel + IIRC we already have an image-wide checksum 
> when saving the image to disk.
>
> If debugging is the only use case, then I guess, the type field
> can be prefixed with a 1/0 bit, eliminating the need for the 
> 1-bit checkum options field + 7-byte padding. Similarly, if debugging 
> mode is not set, why waste another 8 bytes in the end for the checksum
> field.
>
> Unless you think there may be more types with need of special options,
>
> Feel free to correct me if I am missing something elementary here..

What image-wide checksum?

Are you certain that all your data is moving over reliable channels? 
Are you certain that your hard drives are bit perfect.  Are you certain
that your network connection is bit perfect?

Given the amount of data sent as part of a migration, 8 bytes per record
is not a substantial overhead.

>
>
>  
>
>     --------------------------------------------------------------------
>     Field        Description
>     -----------  -------------------------------------------------------
>     type         0x00000000: END
>
>                  0x00000001: PAGE_DATA
>
>                  0x00000002: VCPU_INFO
>
>                  0x00000003: VCPU_CONTEXT
>
>                  0x00000004: X86_PV_INFO
>
>                  0x00000005: P2M
>
>                  0x00000006 - 0xFFFFFFFF: Reserved
>
>     body_length  Length in octets of the record body.
>
>     options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
>
>                  Bit 1-15: Reserved.
>
>     checksum     CRC-32 checksum of the record body (including any
>     trailing
>                  padding), or 0x00000000 if the checksum field is invalid.
>     --------------------------------------------------------------------
>
>     The following sub-sections specify the record body format for each of
>     the record types.
>
>     END
>     ----
>
>     A end record marks the end of the image.
>
>          0     1     2     3     4     5     6     7 octet
>         +-------------------------------------------------+
>
>     The end record contains no fields; its body_length is 0.
>
>     PAGE_DATA
>     ---------
>
>     The bulk of an image consists of many PAGE_DATA records containing the
>     memory contents.
>
>          0     1     2     3     4     5     6     7 octet
>         +-----------------------+-------------------------+
>         | count (C)             | (reserved)              |
>         +-----------------------+-------------------------+
>         | pfn[0]                                          |
>         +-------------------------------------------------+
>         ...
>         +-------------------------------------------------+
>         | pfn[C-1]                                        |
>         +-------------------------------------------------+
>         | page_data[0]...                                 |
>         ...
>         +-------------------------------------------------+
>         | page_data[N-1]...                               |
>         ...
>         +-------------------------------------------------+
>
>     --------------------------------------------------------------------
>     Field       Description
>     ----------- --------------------------------------------------------
>     count       Number of pages described in this record.
>
>     pfn         An array of count PFNs. Bits 63-60 contain
>                 the XEN\_DOMCTL\_PFINFO_* value for that PFN.
>
>     page_data   page_size octets of uncompressed page contents for
>     each page
>                 set as present in the pfn array.
>     --------------------------------------------------------------------
>
>
> s/uncompressed/(compressed/uncompressed)/
> (Remus sends compressed data)

IIRC, remus sends XOR+RLE encoded pages?  Along with HVM domains, this
needs covering in a future draft.

~Andrew

>  
>
>     VCPU_INFO
>     ---------
>
>     > [ This is a combination of parts of the extended-info and
>     > XC_SAVE_ID_VCPU_INFO chunks. ]
>
>     The VCPU_INFO record includes the maximum possible VCPU ID.  This will
>     be followed a VCPU_CONTEXT record for each online VCPU.
>
>          0     1     2     3     4     5     6     7 octet
>         +-----------------------+------------------------+
>         | max_vcpu_id           | (reserved)             |
>         +-----------------------+------------------------+
>
>     --------------------------------------------------------------------
>     Field        Description
>     -----------  ---------------------------------------------------
>     max_vcpu_id  Maximum possible VCPU ID.
>     --------------------------------------------------------------------
>
>
>     VCPU_CONTEXT
>     ------------
>
>     The context for a single VCPU.
>
>          0     1     2     3     4     5     6     7 octet
>         +-----------------------+-------------------------+
>         | vcpu_id               | (reserved)              |
>         +-----------------------+-------------------------+
>         | vcpu_ctx...                                     |
>         ...
>         +-------------------------------------------------+
>
>     --------------------------------------------------------------------
>     Field            Description
>     -----------      ---------------------------------------------------
>     vcpu_id          The VCPU ID.
>
>     vcpu_ctx         Context for this VCPU.
>     --------------------------------------------------------------------
>
>     [ vcpu_ctx format TBD. ]
>
>
>     X86_PV_INFO
>     -----------
>
>     > [ This record replaces part of the extended-info chunk. ]
>
>          0     1     2     3     4     5     6     7 octet
>         +-----+-----+-----+-------------------------------+
>         | w   | ptl | o   | (reserved)                    |
>         +-----+-----+-----+-------------------------------+
>
>     --------------------------------------------------------------------
>     Field            Description
>     -----------      ---------------------------------------------------
>     guest_width (w)  Guest width in octets (either 4 or 8).
>
>     pt_levels (ptl)  Number of page table levels (either 3 or 4).
>
>     options (o)      Bit 0: 0 - no VMASST_pae_extended_cr3,
>                      1 - VMASST_pae_extended_cr3.
>
>                      Bit 1-7: Reserved.
>     --------------------------------------------------------------------
>
>
>     P2M
>     ---
>
>     [ This is a more flexible replacement for the old p2m_size field and
>     p2m array. ]
>
>     The P2M record contains a portion of the source domain's P2M.
>     Multiple P2M records may be sent if the source P2M changes during the
>     stream.
>
>          0     1     2     3     4     5     6     7 octet
>         +-------------------------------------------------+
>         | pfn_begin                                       |
>         +-------------------------------------------------+
>         | pfn_end                                         |
>         +-------------------------------------------------+
>         | mfn[0]                                          |
>         +-------------------------------------------------+
>         ...
>         +-------------------------------------------------+
>         | mfn[N-1]                                        |
>         +-------------------------------------------------+
>
>     --------------------------------------------------------------------
>     Field       Description
>     ----------- --------------------------------------------------------
>     pfn_begin   The first PFN in this portion of the P2M
>
>     pfn_end     One past the last PFN in this portion of the P2M.
>
>     mfn         Array of (pfn_end - pfn-begin) MFNs corresponding to
>                 the set of PFNs in the range [pfn_begin, pfn_end).
>     --------------------------------------------------------------------
>
>
>     Layout
>     ======
>
>     The set of valid records depends on the guest architecture and type.
>
>     x86 PV Guest
>     ------------
>
>     An x86 PV guest image will have in this order:
>
>     1. Image header
>     2. Domain header
>     3. X86_PV_INFO record
>     4. At least one P2M record
>     5. At least one PAGE_DATA record
>
>     6. VCPU_INFO record
>     6. At least one VCPU_CONTEXT record
>
>     7. END record
>
>
> There seems to be a bunch of info missing. Here are some
> missing elements that I can recall at the moment:
> a) there is no support for sending over one time markers that switch the
> receiver's operating mode in the middle of a data stream. 
> E.g., XC_SAVE_ENABLE_COMPRESSION, XC_SAVE_ID_LAST_CHECKPOINT, etc.
> XC_SAVE_ENABLE_VERIFY_MODE,
>
> b) in pv case, the tail also has a list of unmapped PFNs at the end of
> every iteration.
>
> c) XC_SAVE_ID_TOOLSTACK -- used by xl to pass device context
> information (generally
> for HVMs).
>  
>
>
>     Legacy Images (x86 only)
>     ========================
>
>     Restoring legacy images from older tools shall be handled by
>     translating the legacy format image into this new format.
>
>     It shall not be possible to save in the legacy format.
>
>     There are two different legacy images depending on whether they were
>     generated by a 32-bit or a 64-bit toolstack. These shall be
>     distinguished by inspecting octets 4-7 in the image.  If these are
>     zero then it is a 64-bit image.
>
>     Toolstack  Field                            Value
>     ---------  -----                            -----
>     64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
>     32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
>     32-bit     Chunk type (HVM)                 < 0
>     32-bit     Page count (HVM)                 > 0
>
>     Table: Possible values for octet 4-7 in legacy images
>
>     This assumes the presence of the extended-info chunk which was
>     introduced in Xen 3.0.
>
>
>     Future Extensions
>     =================
>
>     All changes to this format require the image version to be increased.
>
>     The format may be extended by adding additional record types.
>
>     Extending an existing record type must be done by adding a new record
>     type.  This allows old images with the old record to still be
>     restored.
>
>     The image header may be extended by _appending_ additional fields.  In
>     particular, the `marker`, `id` and `version` fields must never change
>     size or location.
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


[-- Attachment #1.2: Type: text/html, Size: 36445 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11  1:35   ` Andrew Cooper
@ 2014-02-11  4:12     ` Shriram Rajagopalan
  2014-02-11 10:58       ` Zoltan Kiss
  2014-02-12 15:34       ` Tim Deegan
  0 siblings, 2 replies; 38+ messages in thread
From: Shriram Rajagopalan @ 2014-02-11  4:12 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ian Campbell, Xen-devel, Ian Jackson, David Vrabel, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 10735 bytes --]

On Mon, Feb 10, 2014 at 7:35 PM, Andrew Cooper <andrew.cooper3@citrix.com>wrote:

>  On 10/02/2014 20:00, Shriram Rajagopalan wrote:
>
>  On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@citrix.com>wrote:
>
>> Here is a draft of a proposal for a new domain save image format.  It
>> does not currently cover all use cases (e.g., images for HVM guest are
>> not considered).
>>
>> http://xenbits.xen.org/people/dvrabel/domain-save-format-B.pdf
>>
>> Introduction
>> ============
>>
>> Revision History
>> ----------------
>>
>> --------------------------------------------------------------------
>> Version  Date         Changes
>> -------  -----------  ----------------------------------------------
>> Draft A  6 Feb 2014   Initial draft.
>>
>> Draft B  10 Feb 2014  Corrected image header field widths.
>>
>>                       Minor updates and clarifications.
>> --------------------------------------------------------------------
>>
>> Purpose
>> -------
>>
>> The _domain save image_ is the context of a running domain used for
>> snapshots of a domain or for transferring domains between hosts during
>> migration.
>>
>> There are a number of problems with the format of the domain save
>> image used in Xen 4.4 and earlier (the _legacy format_).
>>
>> * Dependant on toolstack word size.  A number of fields within the
>>   image are native types such as `unsigned long` which have different
>>   sizes between 32-bit and 64-bit hosts.  This prevents domains from
>>   being migrated between 32-bit and 64-bit hosts.
>>
>> * There is no header identifying the image.
>>
>> * The image has no version information.
>>
>> A new format that addresses the above is required.
>>
>> ARM does not yet have have a domain save image format specified and
>> the format described in this specification should be suitable.
>>
>>
>
>  I suggest keeping the processing overhead in mind when designing the new
> image format. Some key things have been addressed, such as making sure data
>  is always padded to maintain alignment. But there are also some aspects
> of this
> proposal that seem awfully unnecessary.. More details below.
>
>
>>
>> Overview
>> ========
>>
>> The image format consists of two main sections:
>>
>> * _Headers_
>> * _Records_
>>
>> Headers
>> -------
>>
>> There are two headers: the _image header_, and the _domain header_.
>> The image header describes the format of the image (version etc.).
>> The _domain header_ contains general information about the domain
>> (architecture, type etc.).
>>
>> Records
>> -------
>>
>> The main part of the format is a sequence of different _records_.
>> Each record type contains information about the domain context.  At a
>> minimum there is a END record marking the end of the records section.
>>
>>
>> Fields
>> ------
>>
>> All the fields within the headers and records have a fixed width.
>>
>> Fields are always aligned to their size.
>>
>> Padding and reserved fields are set to zero on save and must be
>> ignored during restore.
>>
>>
>  So far so good.
>
>
>> Integer (numeric) fields in the image header are always in big-endian
>> byte order.
>>
>> Integer fields in the domain header and in the records are in the
>> endianess described in the image header (which will typically be the
>> native ordering).
>>
>>
>
>   Its tempting to adopt all the TCP-style madness for transferring a set
> of
> structured data.  Why this endian-ness mess?  Am I missing something here?
> I am assuming that a lion's share of Xen's deployment is on x86
> (not including Amazon). So that leaves ARM.  Why not let these
> processors take the hit of endian-ness conversion?
>
>
> The large majority is indeed x86, but don't discount ARM because it is
> currently in the minority.  With the current requirements, the vast
> majority of the data will still be little endian on x86.
>
>
>
>  Headers
>> =======
>>
>> Image Header
>> ------------
>>
>> The image header identifies an image as a Xen domain save image.  It
>> includes the version of this specification that the image complies
>> with.
>>
>> Tools supporting version _V_ of the specification shall always save
>> images using version _V_.  Tools shall support restoring from version
>> _V_ and version _V_ - 1.  Tools may additionally support restoring
>> from earlier versions.
>>
>> The marker field can be used to distinguish between legacy images and
>> those corresponding to this specification.  Legacy images will have at
>> one or more zero bits within the first 8 octets of the image.
>>
>> Fields within the image header are always in _big-endian_ byte order,
>> regardless of the setting of the endianness bit.
>>
>
>  and more endian-ness mess.
>
>
> Network order is perfectly valid.  Is is how all your network packets
> arrive...
>
>
>

True. But why should we explicitly convert the application level data to
network byte order and then convert it back to host byte order, when its
already going to be done by the underlying stack, as you put it?


>
>
>>      0     1     2     3     4     5     6     7 octet
>>     +-------------------------------------------------+
>>     | marker                                          |
>>     +-----------------------+-------------------------+
>>     | id                    | version                 |
>>     +-----------+-----------+-------------------------+
>>     | options   |                                     |
>>     +-----------+-------------------------------------+
>>
>>
>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> marker      0xFFFFFFFFFFFFFFFF.
>>
>> id          0x58454E46 ("XENF" in ASCII).
>>
>> version     0x00000001.  The version of this specification.
>>
>> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
>>
>>             bit 1-15: Reserved.
>> --------------------------------------------------------------------
>>
>> Domain Header
>> -------------
>>
>> The domain header includes general properties of the domain.
>>
>>      0      1     2     3     4     5     6     7 octet
>>     +-----------+-----------+-----------+-------------+
>>     | arch      | type      | page_shift| (reserved)  |
>>     +-----------+-----------+-----------+-------------+
>>
>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> arch        0x0000: Reserved.
>>
>>             0x0001: x86.
>>
>>             0x0002: ARM.
>>
>> type        0x0000: Reserved.
>>
>>             0x0001: x86 PV.
>>
>>             0x0002 - 0xFFFF: Reserved.
>>
>> page_shift  Size of a guest page as a power of two.
>>
>>             i.e., page size = 2^page_shift^.
>> --------------------------------------------------------------------
>>
>>
>> Records
>> =======
>>
>> A record has a record header, type specific data and a trailing
>> footer.  If body_length is not a multiple of 8, the body is padded
>> with zeroes to align the checksum field on an 8 octet boundary.
>>
>>      0     1     2     3     4     5     6     7 octet
>>     +-----------------------+-------------------------+
>>     | type                  | body_length             |
>>     +-----------+-----------+-------------------------+
>>     | options   | (reserved)                          |
>>     +-----------+-------------------------------------+
>>     ...
>>     Record body of length body_length octets followed by
>>     0 to 7 octets of padding.
>>     ...
>>     +-----------------------+-------------------------+
>>     | checksum              | (reserved)              |
>>     +-----------------------+-------------------------+
>>
>>
>  I am assuming that you the checksum field is present only
> for debugging purposes? Otherwise, I see no reason for the
> computational overhead, given that we are already sending data
> over a reliable channel + IIRC we already have an image-wide checksum
> when saving the image to disk.
>
>  If debugging is the only use case, then I guess, the type field
> can be prefixed with a 1/0 bit, eliminating the need for the
> 1-bit checkum options field + 7-byte padding. Similarly, if debugging
> mode is not set, why waste another 8 bytes in the end for the checksum
> field.
>
>  Unless you think there may be more types with need of special options,
>
>  Feel free to correct me if I am missing something elementary here..
>
>
> What image-wide checksum?
>
>
May be I got it wrong. I vaguely recall some sort of a crc checksum being
stored along with
the saved memory snapshots. But that could have been someone else's
research code. Sorry..


> Are you certain that all your data is moving over reliable channels?
>

Lets see.. Am I certain that all migration is happening over TCP ? yes.
Worst case reliable UDP.
By reliable, I just mean no bit errors or such stuff. I am not talking
about security.


> Are you certain that your hard drives are bit perfect.
>

Absolutely not. Which is why I was under the impression that the image wide
checksum
would detect a corrupt image.

  Are you certain that your network connection is bit perfect?
>
>
Nope. But I am fairly certain that good old TCP and IP checksums + the
ethernet's checksum have
been put in place to detect these errors and recover transparent to the
application. Are you are implying
that there is some remote corner case that allows corrupt data to escape
all of these three checks in
the network stack and percolate to the application layer? I don't think so.

If you are implying that the DRAMs cause memory bit errors that flip bits
here and here, wreaking havoc,
then probably yes, checksums make sense. However, with ECC memory modules
being the norm (please
correct me if I wrong about this), why start bothering now, if we didn't
over the last 3 years? What has changed?

My point here being, checksums seem like unnecessary compute overhead when
doing live migration
or Remus.  One can simply set this field to 0 when doing live
migration/Remus.
And, as you said later in this mail, data transmission overhead is not that
much.

However, as far as storing snapshots in disks is concerned, I totally agree
that there needs to be some
form of a checksum to ensure that the data has not been corrupted. But why
have record-level checksums?
It is not as if we can recover the corrupted records. Majority of the use
cases are, IMO, do or die. If checksum
is correct, then start the restore process. Else abort.  So why not have an
image wide checksum?



>  Given the amount of data sent as part of a migration, 8 bytes per record
> is not a substantial overhead.
>
>
thanks
shriram

[-- Attachment #1.2: Type: text/html, Size: 19825 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
                   ` (2 preceding siblings ...)
  2014-02-10 20:00 ` Shriram Rajagopalan
@ 2014-02-11  9:30 ` Ian Campbell
  2014-02-11 11:40   ` David Vrabel
  2014-02-12 15:41   ` Tim Deegan
  2014-02-11 10:06 ` Jan Beulich
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Ian Campbell @ 2014-02-11  9:30 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Jackson, Xen-devel

On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote:
> It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).

Other currently missing cases:
      * live migration
      * remus
      * page compression
      * ...?

I think it is useful to enumerate them, and if necessary to note which
ones you aren't planning to consider in the immediate future.

> Headers
> =======
> 
> Image Header
> ------------
> 
> The image header identifies an image as a Xen domain save image.  It
> includes the version of this specification that the image complies
> with.
> 
> Tools supporting version _V_ of the specification shall always save
> images using version _V_.  Tools shall support restoring from version
> _V_ and version _V_ - 1.

This isn't quite right since it is in terms of image format version and
not Xen version (unless the two are to be linked somehow). The Xen
project supports migration from version X-1 to version X (where X is the
Xen version). It's not inconceivable that the image format version
wouldn't change over Xen releases.

This might become a more obvious issue once you add HVM to the spec and
introduce the necessary Xen HVM save state block.

> Tools may additionally support restoring from earlier versions.

> The marker field can be used to distinguish between legacy images and
> those corresponding to this specification.  Legacy images will have at
> one or more zero bits within the first 8 octets of the image.
> 
> Fields within the image header are always in _big-endian_ byte order,
> regardless of the setting of the endianness bit.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | marker                                          |
>     +-----------------------+-------------------------+
>     | id                    | version                 |
>     +-----------+-----------+-------------------------+
>     | options   |                                     |
>     +-----------+-------------------------------------+
> 
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> marker      0xFFFFFFFFFFFFFFFF.
> 
> id          0x58454E46 ("XENF" in ASCII).
> 
> version     0x00000001.  The version of this specification.
> 
> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.

Couldn't we just specify that things are in a specific endianness
related to the header's arch field?

I appreciate the desire to make the format endian neutral and explicit
about which is in use but (apart from the header) why would you ever
want to go non-native endian for a given arch?

>             bit 1-15: Reserved.
> --------------------------------------------------------------------
> 
> Domain Header
> -------------
> 
> The domain header includes general properties of the domain.
> 
>      0      1     2     3     4     5     6     7 octet
>     +-----------+-----------+-----------+-------------+
>     | arch      | type      | page_shift| (reserved)  |
>     +-----------+-----------+-----------+-------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> arch        0x0000: Reserved.
> 
>             0x0001: x86.
> 
>             0x0002: ARM.
> 
> type        0x0000: Reserved.
> 
>             0x0001: x86 PV.
> 
>             0x0002 - 0xFFFF: Reserved.

Is the type field per-arch? i.e. if arch=0x0002 can we use type = 0x0001
for ARM domains?

> page_shift  Size of a guest page as a power of two.
> 
>             i.e., page size = 2^page_shift^.

This doesn't really need 16 bits, no harm though I suppose.

> --------------------------------------------------------------------
> 
> 
> Records
> =======
> 
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
>     ...
>     Record body of length body_length octets followed by
>     0 to 7 octets of padding.
>     ...
>     +-----------------------+-------------------------+
>     | checksum              | (reserved)              |
>     +-----------------------+-------------------------+
> 
> --------------------------------------------------------------------
> Field        Description
> -----------  -------------------------------------------------------
> type         0x00000000: END
> 
>              0x00000001: PAGE_DATA
> 
>              0x00000002: VCPU_INFO
> 
>              0x00000003: VCPU_CONTEXT
> 
>              0x00000004: X86_PV_INFO
> 
>              0x00000005: P2M
> 
>              0x00000006 - 0xFFFFFFFF: Reserved
> 
> body_length  Length in octets of the record body.
> 
> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
> 
>              Bit 1-15: Reserved.
> 
> checksum     CRC-32 checksum of the record body (including any trailing
>              padding), or 0x00000000 if the checksum field is invalid.

Which CRC-32 :-P

(wikipedia claims there are 6 variants using various polynomials...)

> P2M
> ---
> 
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]


What is the latter for again?

I don't know if it really fits into this spec, but it might be useful to
include the rationale for the inclusion of each record type.

> PAGE_DATA
> ---------
> 
> The bulk of an image consists of many PAGE_DATA records containing the
> memory contents.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | count (C)             | (reserved)              |
>     +-----------------------+-------------------------+
>     | pfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | pfn[C-1]                                        |
>     +-------------------------------------------------+
>     | page_data[0]...                                 |
>     ...
>     +-------------------------------------------------+
>     | page_data[N-1]...                               |
>     ...
>     +-------------------------------------------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> count       Number of pages described in this record.
> 
> pfn         An array of count PFNs. Bits 63-60 contain
>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.

Now might be a good time to remove this intertwining? I suppose 60-bits
is a lot of pfn's, but if the VMs address space is sparse it isn't out
of the question.

It might be useful to be explicit (perhaps in a glossary) about what you
mean by "PFN" for various guest types -- some of the terminology can be
a bit confused in actual usage between the various guest types,
including in some of the existing interface definitions.

> Layout
> ======
> 
> The set of valid records depends on the guest architecture and type.
> 
> x86 PV Guest
> ------------
> 
> An x86 PV guest image will have in this order:
> 
> 1. Image header
> 2. Domain header
> 3. X86_PV_INFO record
> 4. At least one P2M record
> 5. At least one PAGE_DATA record
> 6. VCPU_INFO record
> 6. At least one VCPU_CONTEXT record
> 7. END record

It might be good to be more flexible in the ordering (or rather, less
prescriptive here), unless there is a reason for a strict ordering i.e.
dependencies between records.

For example it seems like the p2m cannot change after you start sending
page_data, which I don't think works for live migration (although I'm
not sure you've yet incorporated that as a feature).

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
                   ` (3 preceding siblings ...)
  2014-02-11  9:30 ` Ian Campbell
@ 2014-02-11 10:06 ` Jan Beulich
  2014-02-11 13:04   ` David Vrabel
  2014-02-11 16:45   ` Ian Jackson
  2014-02-11 16:38 ` Ian Jackson
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Jan Beulich @ 2014-02-11 10:06 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, Ian Campbell,
	Stefano Stabellini

>>> On 10.02.14 at 18:20, David Vrabel <david.vrabel@citrix.com> wrote:
> Fields
> ------
> 
> All the fields within the headers and records have a fixed width.
> 
> Fields are always aligned to their size.
> 
> Padding and reserved fields are set to zero on save and must be
> ignored during restore.

Meaning it would be impossible to assign a meaning to these fields
later. I'd rather mandate that the restore side has to check these
fields are zero, and bail if they aren't.

> Integer (numeric) fields in the image header are always in big-endian
> byte order.

Why would big endian be preferable when both currently
supported architectures use little endian?

> Domain Header
> -------------
> 
> The domain header includes general properties of the domain.
> 
>      0      1     2     3     4     5     6     7 octet
>     +-----------+-----------+-----------+-------------+
>     | arch      | type      | page_shift| (reserved)  |
>     +-----------+-----------+-----------+-------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> arch        0x0000: Reserved.
> 
>             0x0001: x86.
> 
>             0x0002: ARM.
> 
> type        0x0000: Reserved.
> 
>             0x0001: x86 PV.
> 
>             0x0002 - 0xFFFF: Reserved.

So how would ARM, x86 HVM, and x86 PVH be expressed?

> P2M
> ---
> 
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]
> 
> The P2M record contains a portion of the source domain's P2M.
> Multiple P2M records may be sent if the source P2M changes during the
> stream.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | pfn_begin                                       |
>     +-------------------------------------------------+
>     | pfn_end                                         |
>     +-------------------------------------------------+
>     | mfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | mfn[N-1]                                        |
>     +-------------------------------------------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> pfn_begin   The first PFN in this portion of the P2M
> 
> pfn_end     One past the last PFN in this portion of the P2M.

I'd favor an inclusive range here, such that if we ever reach a
fully populatable 64-bit PFN space (on some future architecture)
there'd still be no issue with special casing the then unavoidable
wraparound.

> Layout
> ======
> 
> The set of valid records depends on the guest architecture and type.
> 
> x86 PV Guest
> ------------
> 
> An x86 PV guest image will have in this order:
> 
> 1. Image header
> 2. Domain header
> 3. X86_PV_INFO record
> 4. At least one P2M record
> 5. At least one PAGE_DATA record
> 6. VCPU_INFO record
> 6. At least one VCPU_CONTEXT record
> 7. END record

Is it necessary to require this strict ordering? Obviously the headers
have to be first and the END record last, but at least some of the
other records don't seem to need to be at exactly the place you list
them.

> Legacy Images (x86 only)
> ========================
> 
> Restoring legacy images from older tools shall be handled by
> translating the legacy format image into this new format.
> 
> It shall not be possible to save in the legacy format.
> 
> There are two different legacy images depending on whether they were
> generated by a 32-bit or a 64-bit toolstack. These shall be
> distinguished by inspecting octets 4-7 in the image.  If these are
> zero then it is a 64-bit image.
> 
> Toolstack  Field                            Value
> ---------  -----                            -----
> 64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)

Afaics this is being determined via xc_domain_maximum_gpfn(),
which I don't think guarantees the result to be limited to 2^32.
Or in fact the libxc interface wrongly limits the value (by
truncating the "long" returned from the hypercall to an "int"). So
in practice consistent images would have the field limited to 2^31
on 64-bit tool stacks (since for larger values the negative function
return value would get converted by sign-extension, but all sorts
of other trouble would result due to the now huge p2m_size).

> Future Extensions
> =================
> 
> All changes to this format require the image version to be increased.

Oh, okay, this partly deals with the first question above. Question
is whether that's a useful requirement, i.e. whether that wouldn't
lead to an inflation of versions needing conversion (for a tool stack
that wants to support more than just migration from N-1).

Jan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11  4:12     ` Shriram Rajagopalan
@ 2014-02-11 10:58       ` Zoltan Kiss
  2014-02-12 15:34       ` Tim Deegan
  1 sibling, 0 replies; 38+ messages in thread
From: Zoltan Kiss @ 2014-02-11 10:58 UTC (permalink / raw)
  To: rshriram, Andrew Cooper
  Cc: David Vrabel, Ian Jackson, Ian Campbell, Stefano Stabellini, Xen-devel

On 11/02/14 04:12, Shriram Rajagopalan wrote:
> On Mon, Feb 10, 2014 at 7:35 PM, Andrew Cooper
>>         The marker field can be used to distinguish between legacy
>>         images and
>>         those corresponding to this specification.  Legacy images will
>>         have at
>>         one or more zero bits within the first 8 octets of the image.
>>
>>         Fields within the image header are always in _big-endian_ byte
>>         order,
>>         regardless of the setting of the endianness bit.
>>
>>
>>     and more endian-ness mess.
>
>     Network order is perfectly valid.  Is is how all your network
>     packets arrive...
>
>
>
>
> True. But why should we explicitly convert the application level data to
> network byte order and then convert it back to host byte order, when its
> already going to be done by the underlying stack, as you put it?

TCP doesn't change the byte order of the payload, only header fields are 
defined to be in big endian, AFAIK.

Zoli

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11  9:30 ` Ian Campbell
@ 2014-02-11 11:40   ` David Vrabel
  2014-02-11 12:09     ` Ian Campbell
  2014-02-12 15:41   ` Tim Deegan
  1 sibling, 1 reply; 38+ messages in thread
From: David Vrabel @ 2014-02-11 11:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, David Vrabel,
	Stefano Stabellini

On 11/02/14 09:30, Ian Campbell wrote:
> On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote:
>> 
>> Tools supporting version _V_ of the specification shall always save
>> images using version _V_.  Tools shall support restoring from version
>> _V_ and version _V_ - 1.
> 
> This isn't quite right since it is in terms of image format version and
> not Xen version (unless the two are to be linked somehow). The Xen
> project supports migration from version X-1 to version X (where X is the
> Xen version). It's not inconceivable that the image format version
> wouldn't change over Xen releases.

I was expecting it to be only necessary to bump the format version with
each Xen x.y release but I think you're right.  It may be needed to bump
the version of a x.y.z release.

>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> marker      0xFFFFFFFFFFFFFFFF.
>>
>> id          0x58454E46 ("XENF" in ASCII).
>>
>> version     0x00000001.  The version of this specification.
>>
>> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
> 
> Couldn't we just specify that things are in a specific endianness
> related to the header's arch field?
> 
> I appreciate the desire to make the format endian neutral and explicit
> about which is in use but (apart from the header) why would you ever
> want to go non-native endian for a given arch?

I am anticipating bi-endian architectures which could mean (for example)
migrating a little-endian guest from a little-endian host to a
big-endian host.

I would prefer to retain this bit, but I think we can specify that
certain architectures always use a specific endianness so initially we
wouldn't need to support anything other than the native endianness
(little-endian on both x86 and ARM).

>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> arch        0x0000: Reserved.
>>
>>             0x0001: x86.
>>
>>             0x0002: ARM.
>>
>> type        0x0000: Reserved.
>>
>>             0x0001: x86 PV.
>>
>>             0x0002 - 0xFFFF: Reserved.
> 
> Is the type field per-arch? i.e. if arch=0x0002 can we use type = 0x0001
> for ARM domains?

I think it would be best to avoid reusing types for different
architectures -- it's not like we're going to be short on types.

>> P2M
>> ---
>>
>> [ This is a more flexible replacement for the old p2m_size field and
>> p2m array. ]
> 
> 
> What is the latter for again?

Er.  I think I've misunderstood the code and gotten confused here.

>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> count       Number of pages described in this record.
>>
>> pfn         An array of count PFNs. Bits 63-60 contain
>>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.
> 
> Now might be a good time to remove this intertwining? I suppose 60-bits
> is a lot of pfn's, but if the VMs address space is sparse it isn't out
> of the question.

I don't think we want to consider systems with > 64 bits of address
space so 60-bits is more than enough for PFNs.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 20:00 ` Shriram Rajagopalan
  2014-02-11  1:35   ` Andrew Cooper
@ 2014-02-11 11:58   ` David Vrabel
  2014-02-11 16:13     ` Shriram Rajagopalan
  1 sibling, 1 reply; 38+ messages in thread
From: David Vrabel @ 2014-02-11 11:58 UTC (permalink / raw)
  To: rshriram; +Cc: Stefano Stabellini, Ian Jackson, Ian Campbell, Xen-devel

On 10/02/14 20:00, Shriram Rajagopalan wrote:
> On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@citrix.com
> <mailto:david.vrabel@citrix.com>> wrote:
> 
> 
> Its tempting to adopt all the TCP-style madness for transferring a set of
> structured data.  Why this endian-ness mess?  Am I missing something here?
> I am assuming that a lion's share of Xen's deployment is on x86 
> (not including Amazon). So that leaves ARM.  Why not let these 
> processors take the hit of endian-ness conversion?

I'm not sure I would characterize a spec being precise about byte
ordering as "endianness mess".

I think it would be a pretty poor specification if it didn't specify
byte ordering -- we can't have the tools having to make assumptions
about the ordering.

However, I do think it can be specified in such a way that all the
current use cases don't have to do any byte swapping (except for the
minimal header).

>         +-----------------------+-------------------------+
>         | checksum              | (reserved)              |
>         +-----------------------+-------------------------+
> 
> 
> I am assuming that you the checksum field is present only
> for debugging purposes? Otherwise, I see no reason for the
> computational overhead, given that we are already sending data
> over a reliable channel + IIRC we already have an image-wide checksum 
> when saving the image to disk.

I'm not aware of any image wide checksum.

The checksum seems like a potentially useful feature but I don't have a
requirement for it so if no one else thinks it is useful it can be removed.

>     PAGE_DATA
>     ---------
[...]
>     --------------------------------------------------------------------
>     Field       Description
>     ----------- --------------------------------------------------------
>     count       Number of pages described in this record.
> 
>     pfn         An array of count PFNs. Bits 63-60 contain
>                 the XEN\_DOMCTL\_PFINFO_* value for that PFN.
> 
>     page_data   page_size octets of uncompressed page contents for each page
>                 set as present in the pfn array.
>     --------------------------------------------------------------------
> 
> 
> s/uncompressed/(compressed/uncompressed)/
> (Remus sends compressed data)

No.  I think compressed page data should have its own record type. The
current scheme of mode flipping records seems crazy to me.

>     x86 PV Guest
>     ------------
> 
>     An x86 PV guest image will have in this order:
> 
>     1. Image header
>     2. Domain header
>     3. X86_PV_INFO record
>     4. At least one P2M record
>     5. At least one PAGE_DATA record
> 
>     6. VCPU_INFO record
>     6. At least one VCPU_CONTEXT record
> 
>     7. END record
> 
> 
> There seems to be a bunch of info missing. Here are some
> missing elements that I can recall at the moment:
> a) there is no support for sending over one time markers that switch the
> receiver's operating mode in the middle of a data stream. 
> E.g., XC_SAVE_ENABLE_COMPRESSION, XC_SAVE_ID_LAST_CHECKPOINT, etc.
> XC_SAVE_ENABLE_VERIFY_MODE,

Yes. As I noted, this specification is not yet complete.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 11:40   ` David Vrabel
@ 2014-02-11 12:09     ` Ian Campbell
  2014-02-11 13:10       ` Jan Beulich
  2014-02-11 13:12       ` Ian Campbell
  0 siblings, 2 replies; 38+ messages in thread
From: Ian Campbell @ 2014-02-11 12:09 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, Stefano Stabellini

On Tue, 2014-02-11 at 11:40 +0000, David Vrabel wrote:
> On 11/02/14 09:30, Ian Campbell wrote:
> > On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote:
> >> 
> >> Tools supporting version _V_ of the specification shall always save
> >> images using version _V_.  Tools shall support restoring from version
> >> _V_ and version _V_ - 1.
> > 
> > This isn't quite right since it is in terms of image format version and
> > not Xen version (unless the two are to be linked somehow). The Xen
> > project supports migration from version X-1 to version X (where X is the
> > Xen version). It's not inconceivable that the image format version
> > wouldn't change over Xen releases.
> 
> I was expecting it to be only necessary to bump the format version with
> each Xen x.y release but I think you're right.  It may be needed to bump
> the version of a x.y.z release.

We try very hard to avoid that (i.e. changing the save format in a
stable branch) actually. 

What I meant was:
	Xen = 4.5, V = 1
	Xen = 4.6, V = 1 (nothing changed in the save file spec for 4.5)
	Xen = 4.7, V = 2

The spec with it's current wording about V would seem to say the Xen 4.7
must support migration from Xen 4.5, which is not the case.

For HVM support there are going to be (at least) two additional record
types "opaque blob of Xen state" and "opaque blob of qemu state". It is
very unlikely that anyone will remember to bump V in the tools when
changing those.

I think the solution is to also include the Xen version in the header,
as well as the "file format" version V. You could do this as part of the
header of those opaque blobs, but it might be better in the actual
overall headers (/as well).

I've no idea how to deal with the qemu blob -- but I think qemu upstream
has been working to make cross version migration work better, so maybe
it isn't a problem any more.

> 
> >> --------------------------------------------------------------------
> >> Field       Description
> >> ----------- --------------------------------------------------------
> >> marker      0xFFFFFFFFFFFFFFFF.
> >>
> >> id          0x58454E46 ("XENF" in ASCII).
> >>
> >> version     0x00000001.  The version of this specification.
> >>
> >> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
> > 
> > Couldn't we just specify that things are in a specific endianness
> > related to the header's arch field?
> > 
> > I appreciate the desire to make the format endian neutral and explicit
> > about which is in use but (apart from the header) why would you ever
> > want to go non-native endian for a given arch?
> 
> I am anticipating bi-endian architectures which could mean (for example)
> migrating a little-endian guest from a little-endian host to a
> big-endian host.

It's possible that this would end up looking like a totally separate
arch anyway (cf "armbe"), not just at the save format level, but at the
hypercall and PVIO layers too.

> I would prefer to retain this bit, but I think we can specify that
> certain architectures always use a specific endianness so initially we
> wouldn't need to support anything other than the native endianness
> (little-endian on both x86 and ARM).

I think that makes sense initially.

It's always a bit tricky with a spec to separate the notion of what is
possible within the spec from what you actually intend to implement now.

Perhaps a reasonable compromise is to document a requirement that savers
populate this field accurately but only require restorers to support
their nativeness (leaving anything further as optional).

> >> --------------------------------------------------------------------
> >> Field       Description
> >> ----------- --------------------------------------------------------
> >> arch        0x0000: Reserved.
> >>
> >>             0x0001: x86.
> >>
> >>             0x0002: ARM.
> >>
> >> type        0x0000: Reserved.
> >>
> >>             0x0001: x86 PV.
> >>
> >>             0x0002 - 0xFFFF: Reserved.
> > 
> > Is the type field per-arch? i.e. if arch=0x0002 can we use type = 0x0001
> > for ARM domains?
> 
> I think it would be best to avoid reusing types for different
> architectures -- it's not like we're going to be short on types.

Any reason not to combine the arch+type into a single field then?

> >> --------------------------------------------------------------------
> >> Field       Description
> >> ----------- --------------------------------------------------------
> >> count       Number of pages described in this record.
> >>
> >> pfn         An array of count PFNs. Bits 63-60 contain
> >>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.
> > 
> > Now might be a good time to remove this intertwining? I suppose 60-bits
> > is a lot of pfn's, but if the VMs address space is sparse it isn't out
> > of the question.
> 
> I don't think we want to consider systems with > 64 bits of address
> space so 60-bits is more than enough for PFNs.

Is it? What about systems with 61..63 bits of address space?

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 10:06 ` Jan Beulich
@ 2014-02-11 13:04   ` David Vrabel
  2014-02-11 13:20     ` Jan Beulich
  2014-02-11 16:45   ` Ian Jackson
  1 sibling, 1 reply; 38+ messages in thread
From: David Vrabel @ 2014-02-11 13:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, Ian Campbell,
	Stefano Stabellini

On 11/02/14 10:06, Jan Beulich wrote:
>>>> On 10.02.14 at 18:20, David Vrabel <david.vrabel@citrix.com> wrote:
>> Fields
>> ------
>>
>> All the fields within the headers and records have a fixed width.
>>
>> Fields are always aligned to their size.
>>
>> Padding and reserved fields are set to zero on save and must be
>> ignored during restore.
> 
> Meaning it would be impossible to assign a meaning to these fields
> later. I'd rather mandate that the restore side has to check these
> fields are zero, and bail if they aren't.

Reserved fields/bits can be used but it would require a new record type
and a bump of the format version.

I was aiming to minimize the number of ways the format can be extended.

>> Integer (numeric) fields in the image header are always in big-endian
>> byte order.
> 
> Why would big endian be preferable when both currently
> supported architectures use little endian?

Mostly to encourage tools to pay attention to byte order rather than
assuming native and getting away with it...

>> Domain Header
>> -------------
>>
>> The domain header includes general properties of the domain.
>>
>>      0      1     2     3     4     5     6     7 octet
>>     +-----------+-----------+-----------+-------------+
>>     | arch      | type      | page_shift| (reserved)  |
>>     +-----------+-----------+-----------+-------------+
>>
>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> arch        0x0000: Reserved.
>>
>>             0x0001: x86.
>>
>>             0x0002: ARM.
>>
>> type        0x0000: Reserved.
>>
>>             0x0001: x86 PV.
>>
>>             0x0002 - 0xFFFF: Reserved.
> 
> So how would ARM, x86 HVM, and x86 PVH be expressed?

Something like:

  0x0001: x86 PV.
  0x0002: x86 HVM.
  0x0003: x86 PVH.
  0x0004: ARM.

Which does make the arch field a bit redundant, I suppose.

>> P2M
>> ---
>>
>> [ This is a more flexible replacement for the old p2m_size field and
>> p2m array. ]
>>
>> The P2M record contains a portion of the source domain's P2M.
>> Multiple P2M records may be sent if the source P2M changes during the
>> stream.
>>
>>      0     1     2     3     4     5     6     7 octet
>>     +-------------------------------------------------+
>>     | pfn_begin                                       |
>>     +-------------------------------------------------+
>>     | pfn_end                                         |
>>     +-------------------------------------------------+
>>     | mfn[0]                                          |
>>     +-------------------------------------------------+
>>     ...
>>     +-------------------------------------------------+
>>     | mfn[N-1]                                        |
>>     +-------------------------------------------------+
>>
>> --------------------------------------------------------------------
>> Field       Description
>> ----------- --------------------------------------------------------
>> pfn_begin   The first PFN in this portion of the P2M
>>
>> pfn_end     One past the last PFN in this portion of the P2M.
> 
> I'd favor an inclusive range here, such that if we ever reach a
> fully populatable 64-bit PFN space (on some future architecture)
> there'd still be no issue with special casing the then unavoidable
> wraparound.

Ok, but 64-bit PFN space would suggest 76 bit of address space which
seems somewhat far off.  Is that something we want to consider now?

>> Legacy Images (x86 only)
>> ========================
>>
>> Restoring legacy images from older tools shall be handled by
>> translating the legacy format image into this new format.
>>
>> It shall not be possible to save in the legacy format.
>>
>> There are two different legacy images depending on whether they were
>> generated by a 32-bit or a 64-bit toolstack. These shall be
>> distinguished by inspecting octets 4-7 in the image.  If these are
>> zero then it is a 64-bit image.
>>
>> Toolstack  Field                            Value
>> ---------  -----                            -----
>> 64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
> 
> Afaics this is being determined via xc_domain_maximum_gpfn(),
> which I don't think guarantees the result to be limited to 2^32.
> Or in fact the libxc interface wrongly limits the value (by
> truncating the "long" returned from the hypercall to an "int"). So
> in practice consistent images would have the field limited to 2^31
> on 64-bit tool stacks (since for larger values the negative function
> return value would get converted by sign-extension, but all sorts
> of other trouble would result due to the now huge p2m_size).

For the handling of legacy images I think we need to only consider
images that could have been practically generated by older tools.

>> Future Extensions
>> =================
>>
>> All changes to this format require the image version to be increased.
> 
> Oh, okay, this partly deals with the first question above. Question
> is whether that's a useful requirement, i.e. whether that wouldn't
> lead to an inflation of versions needing conversion (for a tool stack
> that wants to support more than just migration from N-1).

Only legacy images would be converted to the newest format.  I would
expect version V-1 images would be handled by (mostly) the same code as
V images.  Particularly if V is V-1 with extra record types.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 12:09     ` Ian Campbell
@ 2014-02-11 13:10       ` Jan Beulich
  2014-02-11 13:12       ` Ian Campbell
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2014-02-11 13:10 UTC (permalink / raw)
  To: David Vrabel, Ian Campbell
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, Stefano Stabellini

>>> On 11.02.14 at 13:09, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Tue, 2014-02-11 at 11:40 +0000, David Vrabel wrote:
>> On 11/02/14 09:30, Ian Campbell wrote:
>> > On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote:
>> >> --------------------------------------------------------------------
>> >> Field       Description
>> >> ----------- --------------------------------------------------------
>> >> count       Number of pages described in this record.
>> >>
>> >> pfn         An array of count PFNs. Bits 63-60 contain
>> >>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.
>> > 
>> > Now might be a good time to remove this intertwining? I suppose 60-bits
>> > is a lot of pfn's, but if the VMs address space is sparse it isn't out
>> > of the question.
>> 
>> I don't think we want to consider systems with > 64 bits of address
>> space so 60-bits is more than enough for PFNs.
> 
> Is it? What about systems with 61..63 bits of address space?

Their PFNs would, assuming 4k page size, still only be 49..51
bits wide.

That said - if reasonably cleanly doable within such a revised save
image format, I would second Ian's desire to split the type from
the PFN. Not so much because of limited PFN space, but because
of the chance of needing to introduce further types, requiring
more than 4 bits.

Jan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 12:09     ` Ian Campbell
  2014-02-11 13:10       ` Jan Beulich
@ 2014-02-11 13:12       ` Ian Campbell
  1 sibling, 0 replies; 38+ messages in thread
From: Ian Campbell @ 2014-02-11 13:12 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Jackson, Xen-devel

On Tue, 2014-02-11 at 12:09 +0000, Ian Campbell wrote:
> 
> 
> > >>
> --------------------------------------------------------------------
> > >> Field       Description
> > >> -----------
> --------------------------------------------------------
> > >> count       Number of pages described in this record.
> > >>
> > >> pfn         An array of count PFNs. Bits 63-60 contain
> > >>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.
> > > 
> > > Now might be a good time to remove this intertwining? I suppose
> 60-bits
> > > is a lot of pfn's, but if the VMs address space is sparse it isn't
> out
> > > of the question.
> > 
> > I don't think we want to consider systems with > 64 bits of address
> > space so 60-bits is more than enough for PFNs.
> 
> Is it? What about systems with 61..63 bits of address space?

Nevermind, another reply in the thread reminded me that these are PFNs,
so 64-bit PFN == 72 bits of address space.

Although, perhaps it would be better to spec this as nibble and a 60-bit
field rather than by pretending the top of a 64-bit field is special.

(or an octet and a 7 octet field if you prefer)

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 13:04   ` David Vrabel
@ 2014-02-11 13:20     ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2014-02-11 13:20 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, Ian Campbell,
	StefanoStabellini

>>> On 11.02.14 at 14:04, David Vrabel <david.vrabel@citrix.com> wrote:
> On 11/02/14 10:06, Jan Beulich wrote:
>>>>> On 10.02.14 at 18:20, David Vrabel <david.vrabel@citrix.com> wrote:
>>> Domain Header
>>> -------------
>>>
>>> The domain header includes general properties of the domain.
>>>
>>>      0      1     2     3     4     5     6     7 octet
>>>     +-----------+-----------+-----------+-------------+
>>>     | arch      | type      | page_shift| (reserved)  |
>>>     +-----------+-----------+-----------+-------------+
>>>
>>> --------------------------------------------------------------------
>>> Field       Description
>>> ----------- --------------------------------------------------------
>>> arch        0x0000: Reserved.
>>>
>>>             0x0001: x86.
>>>
>>>             0x0002: ARM.
>>>
>>> type        0x0000: Reserved.
>>>
>>>             0x0001: x86 PV.
>>>
>>>             0x0002 - 0xFFFF: Reserved.
>> 
>> So how would ARM, x86 HVM, and x86 PVH be expressed?
> 
> Something like:
> 
>   0x0001: x86 PV.
>   0x0002: x86 HVM.
>   0x0003: x86 PVH.
>   0x0004: ARM.

Ah, so the list above wasn't meant to be exhaustive (for the
current set of things to care about).

> Which does make the arch field a bit redundant, I suppose.

Indeed.

>>> P2M
>>> ---
>>>
>>> [ This is a more flexible replacement for the old p2m_size field and
>>> p2m array. ]
>>>
>>> The P2M record contains a portion of the source domain's P2M.
>>> Multiple P2M records may be sent if the source P2M changes during the
>>> stream.
>>>
>>>      0     1     2     3     4     5     6     7 octet
>>>     +-------------------------------------------------+
>>>     | pfn_begin                                       |
>>>     +-------------------------------------------------+
>>>     | pfn_end                                         |
>>>     +-------------------------------------------------+
>>>     | mfn[0]                                          |
>>>     +-------------------------------------------------+
>>>     ...
>>>     +-------------------------------------------------+
>>>     | mfn[N-1]                                        |
>>>     +-------------------------------------------------+
>>>
>>> --------------------------------------------------------------------
>>> Field       Description
>>> ----------- --------------------------------------------------------
>>> pfn_begin   The first PFN in this portion of the P2M
>>>
>>> pfn_end     One past the last PFN in this portion of the P2M.
>> 
>> I'd favor an inclusive range here, such that if we ever reach a
>> fully populatable 64-bit PFN space (on some future architecture)
>> there'd still be no issue with special casing the then unavoidable
>> wraparound.
> 
> Ok, but 64-bit PFN space would suggest 76 bit of address space which
> seems somewhat far off.  Is that something we want to consider now?

If it's as cheap as using an inclusive range instead of a half-inclusive
one, I'd say yes.

>>> Legacy Images (x86 only)
>>> ========================
>>>
>>> Restoring legacy images from older tools shall be handled by
>>> translating the legacy format image into this new format.
>>>
>>> It shall not be possible to save in the legacy format.
>>>
>>> There are two different legacy images depending on whether they were
>>> generated by a 32-bit or a 64-bit toolstack. These shall be
>>> distinguished by inspecting octets 4-7 in the image.  If these are
>>> zero then it is a 64-bit image.
>>>
>>> Toolstack  Field                            Value
>>> ---------  -----                            -----
>>> 64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
>> 
>> Afaics this is being determined via xc_domain_maximum_gpfn(),
>> which I don't think guarantees the result to be limited to 2^32.
>> Or in fact the libxc interface wrongly limits the value (by
>> truncating the "long" returned from the hypercall to an "int"). So
>> in practice consistent images would have the field limited to 2^31
>> on 64-bit tool stacks (since for larger values the negative function
>> return value would get converted by sign-extension, but all sorts
>> of other trouble would result due to the now huge p2m_size).
> 
> For the handling of legacy images I think we need to only consider
> images that could have been practically generated by older tools.

Right. That's what I meant to say with everything following the
first sentence.

>>> Future Extensions
>>> =================
>>>
>>> All changes to this format require the image version to be increased.
>> 
>> Oh, okay, this partly deals with the first question above. Question
>> is whether that's a useful requirement, i.e. whether that wouldn't
>> lead to an inflation of versions needing conversion (for a tool stack
>> that wants to support more than just migration from N-1).
> 
> Only legacy images would be converted to the newest format.  I would
> expect version V-1 images would be handled by (mostly) the same code as
> V images.  Particularly if V is V-1 with extra record types.

Just consider distros, namely such that have a lower release
frequency than we. They'd necessarily want to cover at least
the range between their V'-1 and V', which could just end up
being V-2 ... V from Xen pov, but - especially if we were to
further shorten the release cycle - could easily become a larger
range.

Jan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 11:58   ` David Vrabel
@ 2014-02-11 16:13     ` Shriram Rajagopalan
  0 siblings, 0 replies; 38+ messages in thread
From: Shriram Rajagopalan @ 2014-02-11 16:13 UTC (permalink / raw)
  To: David Vrabel; +Cc: Stefano Stabellini, Ian Jackson, Ian Campbell, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 4502 bytes --]

On Tue, Feb 11, 2014 at 5:58 AM, David Vrabel <david.vrabel@citrix.com>wrote:

> On 10/02/14 20:00, Shriram Rajagopalan wrote:
> > On Mon, Feb 10, 2014 at 9:20 AM, David Vrabel <david.vrabel@citrix.com
> > <mailto:david.vrabel@citrix.com>> wrote:
> >
> >
> > Its tempting to adopt all the TCP-style madness for transferring a set of
> > structured data.  Why this endian-ness mess?  Am I missing something
> here?
> > I am assuming that a lion's share of Xen's deployment is on x86
> > (not including Amazon). So that leaves ARM.  Why not let these
> > processors take the hit of endian-ness conversion?
>
> I'm not sure I would characterize a spec being precise about byte
> ordering as "endianness mess".
>
> I think it would be a pretty poor specification if it didn't specify
> byte ordering -- we can't have the tools having to make assumptions
> about the ordering.
>
>
Totally agree. But as someone else put it (and you did as well), my point
was
that its sufficient to specify it once, somewhere in the image header and
making
sure that (as you put it below), that the current use cases don't have to
go through
needless endian conversion.



> However, I do think it can be specified in such a way that all the
> current use cases don't have to do any byte swapping (except for the
> minimal header).
>
> >         +-----------------------+-------------------------+
> >         | checksum              | (reserved)              |
> >         +-----------------------+-------------------------+
> >
> >
> > I am assuming that you the checksum field is present only
> > for debugging purposes? Otherwise, I see no reason for the
> > computational overhead, given that we are already sending data
> > over a reliable channel + IIRC we already have an image-wide checksum
> > when saving the image to disk.
>
> I'm not aware of any image wide checksum.
>

Yep. I was mistaken.


> The checksum seems like a potentially useful feature but I don't have a
> requirement for it so if no one else thinks it is useful it can be removed.
>
>
My suggestion is that when saving the image to disk, why not have a single
image-wide checksum to ensure that the image from disk being restored is
still valid?


> >     PAGE_DATA
> >     ---------
> [...]
> >     --------------------------------------------------------------------
> >     Field       Description
> >     ----------- --------------------------------------------------------
> >     count       Number of pages described in this record.
> >
> >     pfn         An array of count PFNs. Bits 63-60 contain
> >                 the XEN\_DOMCTL\_PFINFO_* value for that PFN.
> >
> >     page_data   page_size octets of uncompressed page contents for each
> page
> >                 set as present in the pfn array.
> >     --------------------------------------------------------------------
> >
> >
> > s/uncompressed/(compressed/uncompressed)/
> > (Remus sends compressed data)
>
> No.  I think compressed page data should have its own record type. The
> current scheme of mode flipping records seems crazy to me.
>
>
What record flipping? For page compression, Remus basically has a simple
XOR+RLE encoded sequence of bytes, preceded by a 4-byte length field.
Instead of sending the usual 4K per-page page_data, this compressed chunk
is sent.
The additional code on the remote side is an additional "if" block, that
uses
 xc_uncompess instead of memcpy to get the uncompressed page.

It would not change the way the PAGE_DATA record would be transmitted.

Though, one potentially cooler addition could be to use the option field of
the record header
to indicate whether the data is compressed or not. Given that we have 64
bits, we could even
go as far as specifying the type of compression module used (e.g., none,
remus, gzip, etc.).
This might be really helpful when one wants to save/restore large images (a
8GB VM for example)
to/from disks. Is this better/worse than simply gzipping the entire saved
image? I don't know yet.

However, for live migration, this would be pretty helpful (especially when
migrating over long latency
networks).  Remus' compression technique cannot be used for live migration
as it requires a previous
version of pages for XOR+RLE compression.  However, gzip and other such
compression algorithms
would be pretty handy in the live migration case, over WAN or even a
clogged LAN, where there
are tons of VMs being moved back and forth.

Feel free to shoot down this idea if it seems unfeasible.

 Thanks
shriram

[-- Attachment #1.2: Type: text/html, Size: 6435 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
                   ` (4 preceding siblings ...)
  2014-02-11 10:06 ` Jan Beulich
@ 2014-02-11 16:38 ` Ian Jackson
  2014-02-11 17:04   ` Andrew Cooper
  2014-02-11 16:49 ` Ian Jackson
  2014-02-12 16:36 ` Tim Deegan
  7 siblings, 1 reply; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 16:38 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Campbell, Xen-devel

David Vrabel writes ("Domain Save Image Format proposal (draft B)"):
> Records
> =======
> 
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
...
> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.

There needs to be a flag saying what the receiver should do if it sees
a record it doesn't understand.  There are two possible behaviours:
ignore the record, and abandon the restore attempt.

> VCPU_INFO
> ---------
> 
> > [ This is a combination of parts of the extended-info and
> > XC_SAVE_ID_VCPU_INFO chunks. ]
> 
> The VCPU_INFO record includes the maximum possible VCPU ID.  This will
> be followed a VCPU_CONTEXT record for each online VCPU.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+------------------------+
>     | max_vcpu_id           | (reserved)             |
>     +-----------------------+------------------------+

For ease of extensibility, it might be worth saying that the VCPU_INFO
may contain additional data which should be ignored by receivers that
don't understand it.

> Future Extensions
> =================
> 
> All changes to this format require the image version to be increased.

It would be better to have a more fine-grained extensibility
mechanism.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 10:06 ` Jan Beulich
  2014-02-11 13:04   ` David Vrabel
@ 2014-02-11 16:45   ` Ian Jackson
  2014-02-11 17:08     ` Shriram Rajagopalan
  1 sibling, 1 reply; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 16:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Shriram Rajagopalan, Stefano Stabellini, Xen-devel, David Vrabel,
	Ian Campbell

Jan Beulich writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> On 10.02.14 at 18:20, David Vrabel <david.vrabel@citrix.com> wrote:
> > Fields
> > ------
> > 
> > All the fields within the headers and records have a fixed width.
> > 
> > Fields are always aligned to their size.
> > 
> > Padding and reserved fields are set to zero on save and must be
> > ignored during restore.
> 
> Meaning it would be impossible to assign a meaning to these fields
> later. I'd rather mandate that the restore side has to check these
> fields are zero, and bail if they aren't.

I disagree.  It is precisely the fact that they are ignored which
makes it possible to assign meaning to them later.

It is easy enough to make old readers bail.  What is good about
David's spec is the room for enhancement _without_ making old readers
bail.

I would like to see is that the version number does not normally need
to be incremented.  For example, the .deb package format which I
designed in 1993 has stayed at version 2.0 even though it has been
heavily extended in a number of directions.

> > Integer (numeric) fields in the image header are always in big-endian
> > byte order.
> 
> Why would big endian be preferable when both currently
> supported architectures use little endian?

Because (a) that's Network Byte Order (b) there are convenient
functions for dealing with endian-conversion.

> > 1. Image header
> > 2. Domain header
> > 3. X86_PV_INFO record
> > 4. At least one P2M record
> > 5. At least one PAGE_DATA record
> > 6. VCPU_INFO record
> > 6. At least one VCPU_CONTEXT record
> > 7. END record
> 
> Is it necessary to require this strict ordering? Obviously the headers
> have to be first and the END record last, but at least some of the
> other records don't seem to need to be at exactly the place you list
> them.

It's necessary to define some constraints on the ordering.  In
practice the receiving implementation may or may not work with all
orderings and defining a particular ordering is probably easiest.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
                   ` (5 preceding siblings ...)
  2014-02-11 16:38 ` Ian Jackson
@ 2014-02-11 16:49 ` Ian Jackson
  2014-02-11 17:10   ` David Vrabel
  2014-02-12 16:36 ` Tim Deegan
  7 siblings, 1 reply; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 16:49 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Campbell, Xen-devel

David Vrabel writes ("Domain Save Image Format proposal (draft B)"):
> Here is a draft of a proposal for a new domain save image format.  It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).

I think this is a good start.  I've made some other comments already.

> Overview
> ========

I would like to make another perhaps controversial suggestion.  We
should explicitly specify that the receiver may advertise its
capabilities to the sender, so that backwards-migration _can_ be
supported if we choose to do so.

In practice I think that means a capability advertisement block.
Probably, one bit per version, one bit per record type, etc.

I greatly prefer doing forward-compatibility with new record type
enums etc. than with version numbers.  Version numbers presuppose a
strict global order on all the implementations' capabilities, which is
of course not necessarily true in free software.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 16:38 ` Ian Jackson
@ 2014-02-11 17:04   ` Andrew Cooper
  2014-02-11 17:07     ` Ian Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Cooper @ 2014-02-11 17:04 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Shriram Rajagopalan, Ian Campbell, Xen-devel, David Vrabel,
	Stefano Stabellini

On 11/02/14 16:38, Ian Jackson wrote:
> David Vrabel writes ("Domain Save Image Format proposal (draft B)"):
>> Records
>> =======
>>
>> A record has a record header, type specific data and a trailing
>> footer.  If body_length is not a multiple of 8, the body is padded
>> with zeroes to align the checksum field on an 8 octet boundary.
>>
>>      0     1     2     3     4     5     6     7 octet
>>     +-----------------------+-------------------------+
>>     | type                  | body_length             |
>>     +-----------+-----------+-------------------------+
>>     | options   | (reserved)                          |
>>     +-----------+-------------------------------------+
> ...
>> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
> There needs to be a flag saying what the receiver should do if it sees
> a record it doesn't understand.  There are two possible behaviours:
> ignore the record, and abandon the restore attempt.

No need.  Any unrecognised records can be safely ignored.  Any record
which couldn't be ignored would be required to bump the main stream
version number at which point the older reader would bail on that basis.

This would allow adding new backward compatible features without
breaking older systems, which is the way compatibility is maintained in
the current code.

~Andrew

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:04   ` Andrew Cooper
@ 2014-02-11 17:07     ` Ian Jackson
  0 siblings, 0 replies; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 17:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Shriram Rajagopalan, Ian Campbell, Xen-devel, David Vrabel,
	Stefano Stabellini

Andrew Cooper writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> No need.  Any unrecognised records can be safely ignored.  Any record
> which couldn't be ignored would be required to bump the main stream
> version number at which point the older reader would bail on that basis.

See my other comment about why I don't like version numbers for
forward compatibility in free software.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 16:45   ` Ian Jackson
@ 2014-02-11 17:08     ` Shriram Rajagopalan
  2014-02-11 17:15       ` Ian Campbell
  0 siblings, 1 reply; 38+ messages in thread
From: Shriram Rajagopalan @ 2014-02-11 17:08 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Xen-devel, David Vrabel, Jan Beulich, Ian Campbell


[-- Attachment #1.1: Type: text/plain, Size: 3091 bytes --]

On Tue, Feb 11, 2014 at 10:45 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>wrote:

> Jan Beulich writes ("Re: [Xen-devel] Domain Save Image Format proposal
> (draft B)"):
> > On 10.02.14 at 18:20, David Vrabel <david.vrabel@citrix.com> wrote:
> > > Fields
> > > ------
> > >
> > > All the fields within the headers and records have a fixed width.
> > >
> > > Fields are always aligned to their size.
> > >
> > > Padding and reserved fields are set to zero on save and must be
> > > ignored during restore.
> >
> > Meaning it would be impossible to assign a meaning to these fields
> > later. I'd rather mandate that the restore side has to check these
> > fields are zero, and bail if they aren't.
>
> I disagree.  It is precisely the fact that they are ignored which
> makes it possible to assign meaning to them later.
>
> It is easy enough to make old readers bail.  What is good about
> David's spec is the room for enhancement _without_ making old readers
> bail.
>
> I would like to see is that the version number does not normally need
> to be incremented.  For example, the .deb package format which I
> designed in 1993 has stayed at version 2.0 even though it has been
> heavily extended in a number of directions.
>
> > > Integer (numeric) fields in the image header are always in big-endian
> > > byte order.
> >
> > Why would big endian be preferable when both currently
> > supported architectures use little endian?
>
> Because (a) that's Network Byte Order (b) there are convenient
> functions for dealing with endian-conversion.
>
>
Yes and Yes. But why resort to Network Byte Order at all when we
don't have support any architectures using big endian?

Are we thinking of transferring images or migrating data between machines
that have
different endian-ness? Its not like network elements such as middleboxes
(which could probably be big-endian ) are going to interpret the
application
level data and take routing decisions.

I am not that familiar with ARM, but from what I read, its bi-endian past
v3.
Don't think we have plans to support SPARC, which too is bi-endian
beyond a certain version IIRC.

That leaves legacy ARMs and SPARC that use big endian mode.

So am I missing something elementary here Ian? Why the emphasis on
network byte order? I certainly agree that the byte order should be declared
once in the header, so that there would be no confusion on how to interpret
it.



> > > 1. Image header
> > > 2. Domain header
> > > 3. X86_PV_INFO record
> > > 4. At least one P2M record
> > > 5. At least one PAGE_DATA record
> > > 6. VCPU_INFO record
> > > 6. At least one VCPU_CONTEXT record
> > > 7. END record
> >
> > Is it necessary to require this strict ordering? Obviously the headers
> > have to be first and the END record last, but at least some of the
> > other records don't seem to need to be at exactly the place you list
> > them.
>
> It's necessary to define some constraints on the ordering.  In
> practice the receiving implementation may or may not work with all
> orderings and defining a particular ordering is probably easiest.
>
> Ian.
>
>

[-- Attachment #1.2: Type: text/html, Size: 4313 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 16:49 ` Ian Jackson
@ 2014-02-11 17:10   ` David Vrabel
  2014-02-11 17:28     ` Ian Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: David Vrabel @ 2014-02-11 17:10 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Campbell, Xen-devel

On 11/02/14 16:49, Ian Jackson wrote:
> David Vrabel writes ("Domain Save Image Format proposal (draft B)"):
>> Here is a draft of a proposal for a new domain save image format.  It
>> does not currently cover all use cases (e.g., images for HVM guest are
>> not considered).
> 
> I think this is a good start.  I've made some other comments already.
> 
>> Overview
>> ========
> 
> I would like to make another perhaps controversial suggestion.  We
> should explicitly specify that the receiver may advertise its
> capabilities to the sender, so that backwards-migration _can_ be
> supported if we choose to do so.
> 
> In practice I think that means a capability advertisement block.
> Probably, one bit per version, one bit per record type, etc.
> 
> I greatly prefer doing forward-compatibility with new record type
> enums etc. than with version numbers.  Version numbers presuppose a
> strict global order on all the implementations' capabilities, which is
> of course not necessarily true in free software.

I think I quite like this idea.  There would still need to be a central
repository of record types.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:08     ` Shriram Rajagopalan
@ 2014-02-11 17:15       ` Ian Campbell
  2014-02-11 17:30         ` Ian Jackson
  2014-02-11 17:31         ` Frediano Ziglio
  0 siblings, 2 replies; 38+ messages in thread
From: Ian Campbell @ 2014-02-11 17:15 UTC (permalink / raw)
  To: rshriram
  Cc: Xen-devel, Ian Jackson, David Vrabel, Jan Beulich, Stefano Stabellini

On Tue, 2014-02-11 at 11:08 -0600, Shriram Rajagopalan wrote:

> I am not that familiar with ARM, but from what I read, its bi-endian
> past v3.

Modern ARMs can still do big endian, and are sometimes used that way
too.

I expect that this would be expressed as a new guest arch/type (ARMBE)
though.

But there's nothing wrong per-se with having the endianness of an image
be explicit in the file formats header, even if it is a bit redundant.

Note that the initial header has to be in some fixed endianness, but
it's a small handful of bytes which only occurs once, I don't think the
byte swapping is an issue there.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:10   ` David Vrabel
@ 2014-02-11 17:28     ` Ian Jackson
  0 siblings, 0 replies; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 17:28 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Campbell, Xen-devel

David Vrabel writes ("Re: Domain Save Image Format proposal (draft B)"):
> On 11/02/14 16:49, Ian Jackson wrote:
> > I greatly prefer doing forward-compatibility with new record type
> > enums etc. than with version numbers.  Version numbers presuppose a
> > strict global order on all the implementations' capabilities, which is
> > of course not necessarily true in free software.
> 
> I think I quite like this idea.  There would still need to be a central
> repository of record types.

Right.

One thing you can do is like PNG and have separate ranges of record
type numbers for (a) ok to ignore this if not understood and (b) fatal
if not understood.

I think PNG also has subset of the namespace for "private use".

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:15       ` Ian Campbell
@ 2014-02-11 17:30         ` Ian Jackson
  2014-02-11 17:31         ` Frediano Ziglio
  1 sibling, 0 replies; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 17:30 UTC (permalink / raw)
  To: Ian Campbell
  Cc: rshriram, Xen-devel, David Vrabel, Jan Beulich, Stefano Stabellini

Ian Campbell writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> But there's nothing wrong per-se with having the endianness of an image
> be explicit in the file formats header, even if it is a bit redundant.

Right.

> Note that the initial header has to be in some fixed endianness, but
> it's a small handful of bytes which only occurs once, I don't think the
> byte swapping is an issue there.

Given that it has to be some fixed endianness, and we don't want to
write code which will go wrong if we want to support a big endian
arcdh in the future, the code we write now must include byteswapping
calls.

Those calls are simple if the specified endianness is big endian.
Also, making it be the opposite of the native for popular platforms
means we're more likely to notice if we miss any endian fix calls.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:15       ` Ian Campbell
  2014-02-11 17:30         ` Ian Jackson
@ 2014-02-11 17:31         ` Frediano Ziglio
  2014-02-11 17:53           ` Ian Jackson
  1 sibling, 1 reply; 38+ messages in thread
From: Frediano Ziglio @ 2014-02-11 17:31 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Stefano Stabellini, Ian Jackson, Xen-devel, David Vrabel,
	Jan Beulich, rshriram

On Tue, 2014-02-11 at 17:15 +0000, Ian Campbell wrote:
> On Tue, 2014-02-11 at 11:08 -0600, Shriram Rajagopalan wrote:
> 
> > I am not that familiar with ARM, but from what I read, its bi-endian
> > past v3.
> 
> Modern ARMs can still do big endian, and are sometimes used that way
> too.
> 
> I expect that this would be expressed as a new guest arch/type (ARMBE)
> though.
> 
> But there's nothing wrong per-se with having the endianness of an image
> be explicit in the file formats header, even if it is a bit redundant.
> 
> Note that the initial header has to be in some fixed endianness, but
> it's a small handful of bytes which only occurs once, I don't think the
> byte swapping is an issue there.
> 
> Ian.
> 

I think we should just stick with the native format. It's easy to
handle.

If a Xen x86 receive an ARM image it probably cannot cope with it. It
can only store it in some form. On the other end is much easy to restore
as x86 is always little-endian and we know images to restore will be
little endian.

Considering architecture which support both endianess we will never
restore a little-endian image in a big-endian guest. Although we could
convert headers and stuff like that we can't convert memory inside the
guest so the guest must have the same endianess. If a machine can handle
both endianess just setting some CPU flags in this case it will make
sense (but it will still maintain the VM endianess!) but why slow down
on system that can't handle it?

Frediano

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:31         ` Frediano Ziglio
@ 2014-02-11 17:53           ` Ian Jackson
  2014-02-11 17:59             ` Ian Campbell
  0 siblings, 1 reply; 38+ messages in thread
From: Ian Jackson @ 2014-02-11 17:53 UTC (permalink / raw)
  To: Frediano Ziglio
  Cc: Ian Campbell, Stefano Stabellini, Xen-devel, David Vrabel,
	Jan Beulich, rshriram

Frediano Ziglio writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> [stuff]

I think this is in danger of becoming a bikeshed issue.  I am
unconvinced by the arguments made on the other side.  The performance
and code complexity implications of byteswapping the image header are
IMO minimal.  So I'm going to put my foot down and say this:

David is entirely correct to specify a fixed endianness for the image
header.  As a tools maintainer I would accept patches for David's
format (subject to the other comments that are being discussed).  I
would be very reluctant to accept patches for a similar format with a
variable-endianness image header.  I would certainly not accept
patches which purport to implement a nominally-fixed-endianness format
but where the implementation is not in fact portable to the other
endianness.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:53           ` Ian Jackson
@ 2014-02-11 17:59             ` Ian Campbell
  2014-02-12  9:07               ` Frediano Ziglio
  0 siblings, 1 reply; 38+ messages in thread
From: Ian Campbell @ 2014-02-11 17:59 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Xen-devel, Frediano Ziglio, David Vrabel,
	Jan Beulich, rshriram

On Tue, 2014-02-11 at 17:53 +0000, Ian Jackson wrote:
> Frediano Ziglio writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> David is entirely correct to specify a fixed endianness for the image
> header.

Full ack.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11 17:59             ` Ian Campbell
@ 2014-02-12  9:07               ` Frediano Ziglio
  2014-02-12 11:27                 ` Frediano Ziglio
  0 siblings, 1 reply; 38+ messages in thread
From: Frediano Ziglio @ 2014-02-12  9:07 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Stefano Stabellini, Ian Jackson, Xen-devel, David Vrabel,
	Jan Beulich, rshriram

On Tue, 2014-02-11 at 17:59 +0000, Ian Campbell wrote:
> On Tue, 2014-02-11 at 17:53 +0000, Ian Jackson wrote:
> > Frediano Ziglio writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> > David is entirely correct to specify a fixed endianness for the image
> > header.
> 
> Full ack.
> 
> 
Me too. I was not speaking about the image header but all others
records. Obviously by native endianess I mean the endianess of the VM.
If I have a ARMBE this would save as big endian because the VM is big
endian. On the same host an ARMLE would save using little endian.

Frediano

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-12  9:07               ` Frediano Ziglio
@ 2014-02-12 11:27                 ` Frediano Ziglio
  2014-02-12 11:34                   ` Ian Campbell
  0 siblings, 1 reply; 38+ messages in thread
From: Frediano Ziglio @ 2014-02-12 11:27 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Stefano Stabellini, Ian Jackson, Xen-devel, David Vrabel,
	Jan Beulich, rshriram

On Wed, 2014-02-12 at 09:07 +0000, Frediano Ziglio wrote:
> On Tue, 2014-02-11 at 17:59 +0000, Ian Campbell wrote:
> > On Tue, 2014-02-11 at 17:53 +0000, Ian Jackson wrote:
> > > Frediano Ziglio writes ("Re: [Xen-devel] Domain Save Image Format proposal (draft B)"):
> > > David is entirely correct to specify a fixed endianness for the image
> > > header.
> > 
> > Full ack.
> > 
> > 
> Me too. I was not speaking about the image header but all others
> records. Obviously by native endianess I mean the endianess of the VM.
> If I have a ARMBE this would save as big endian because the VM is big
> endian. On the same host an ARMLE would save using little endian.
> 
> Frediano
> 

More about endianess and protocols. Why the network defined a network
byte order?
1- you don't need to specify a flag
2- you always have to convert or not, that's no if

Consider point 2. Considering you have a function to get a 32 bit in the
native format

uint32_t get_native32(uint32_t n) {
   XXX
}

Now, in the case you use always the same endian XXX is either a NOP or a
a swap. The case you support both is more complicated, something like

  if (is_my_format)
    return n;
  else
    return swap(n);

If we inline it we expand the code, if we not inline we slow down a bit.
Considering that in many platforms swapping is so easy that can be
inlined implementing get_native32 can be easily inlined.

In the case we need to handle large structures and you have to call may
times these kind of functions perhaps the time spent for the conversion
and size of the code does not matter (we could use a template system to
optimize the two cases spending extra code or device to not inline) but
if choosing a protocol specification instead of another could help why
not choosing the proper endian schema considering the pro/cons of all
cases?

Frediano

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-12 11:27                 ` Frediano Ziglio
@ 2014-02-12 11:34                   ` Ian Campbell
  0 siblings, 0 replies; 38+ messages in thread
From: Ian Campbell @ 2014-02-12 11:34 UTC (permalink / raw)
  To: Frediano Ziglio
  Cc: Stefano Stabellini, Ian Jackson, Xen-devel, David Vrabel,
	Jan Beulich, rshriram

On Wed, 2014-02-12 at 11:27 +0000, Frediano Ziglio wrote:
> More about endianess and protocols.

I don't think we need more on this. It has already gotten far more
discussion than needed for such a minor issue and I trust David to take
on board any useful feedback and to update the spec accordingly as
needed.

It would be more useful if people were to spend their time thinking
about the core of the proposal instead of nitpicking around the edges.

Ian.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11  4:12     ` Shriram Rajagopalan
  2014-02-11 10:58       ` Zoltan Kiss
@ 2014-02-12 15:34       ` Tim Deegan
  1 sibling, 0 replies; 38+ messages in thread
From: Tim Deegan @ 2014-02-12 15:34 UTC (permalink / raw)
  To: Shriram Rajagopalan
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Xen-devel, David Vrabel

At 22:12 -0600 on 10 Feb (1392066773), Shriram Rajagopalan wrote:
> My point here being, checksums seem like unnecessary compute overhead when
> doing live migration
> or Remus.  One can simply set this field to 0 when doing live
> migration/Remus.

Remus _compresses_ the payload, right?  CRC32 is basically free
compared to that. 

But I think the point about checksumming the whole image is sound.
That will catch _more_ classes of corruption than a per-block data
checksum, and the class of bugs caught by per-block checksums (basically,
that the header and the data don't match for some reason) are pretty
small in this design.

Tim.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-11  9:30 ` Ian Campbell
  2014-02-11 11:40   ` David Vrabel
@ 2014-02-12 15:41   ` Tim Deegan
  1 sibling, 0 replies; 38+ messages in thread
From: Tim Deegan @ 2014-02-12 15:41 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Shriram Rajagopalan, Xen-devel, Ian Jackson, David Vrabel,
	Stefano Stabellini

At 09:30 +0000 on 11 Feb (1392107440), Ian Campbell wrote:
> > checksum     CRC-32 checksum of the record body (including any trailing
> >              padding), or 0x00000000 if the checksum field is invalid.
> 
> Which CRC-32 :-P

CRC32C please; it's got hardware acceleration in SSE4.2.  There are
also options like Adler which are supposed to be faster in normal
arithmetic (not sure if that's still the case if you include tricks
like using vector multiply to fold the payload).

Tim.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
                   ` (6 preceding siblings ...)
  2014-02-11 16:49 ` Ian Jackson
@ 2014-02-12 16:36 ` Tim Deegan
  2014-02-12 17:09   ` David Vrabel
  7 siblings, 1 reply; 38+ messages in thread
From: Tim Deegan @ 2014-02-12 16:36 UTC (permalink / raw)
  To: David Vrabel
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Jackson,
	Ian Campbell, Xen-devel

Hi,

This draft has my wholehearted support.  Even without addressing any
of the points under discussion something along these lines would be a
vast improvement on the current format.

I have two general questions:

 - The existing save-format definition is spread across a number of
   places: libxc for hypervisor state, qemu for DM state, and the main
   toolstack (libxl/xend/xapi/&c) for other config runes and a general
   wrapper.  This is clearly a reworking of the libxc parts -- do
   you think there's anything currently defined elsewhere that belongs
   in this spec?

 - Have you given any thought to making this into a wire protocol
   rather than just a file format?  Would there be any benefit to
   having records individually acked by the receiver in a live
   migration, or having the receiver send instructions about
   compatibility?  Or is that again left to the toolstack to manage?

and a few nits:

At 17:20 +0000 on 10 Feb (1392049249), David Vrabel wrote:
> Records
> =======
> 
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
>     ...
>     Record body of length body_length octets followed by
>     0 to 7 octets of padding.
>     ...
>     +-----------------------+-------------------------+
>     | checksum              | (reserved)              |
>     +-----------------------+-------------------------+
> 
> --------------------------------------------------------------------
> Field        Description
> -----------  -------------------------------------------------------
> type         0x00000000: END
> 
>              0x00000001: PAGE_DATA
> 
>              0x00000002: VCPU_INFO
> 
>              0x00000003: VCPU_CONTEXT
> 
>              0x00000004: X86_PV_INFO
> 
>              0x00000005: P2M
> 
>              0x00000006 - 0xFFFFFFFF: Reserved
> 
> body_length  Length in octets of the record body.
> 
> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
> 
>              Bit 1-15: Reserved.
> 
> checksum     CRC-32 checksum of the record body (including any trailing
>              padding), or 0x00000000 if the checksum field is invalid.

Apart from any discussion of the merits of per-record vs whole-file
checksums, it would be useful for this checksum to cover the header
too.  E.g., by declaring it to be the checksum of header+data where
the checksum field is 0, or by declaring that it shall be that pattern
which causes the finished header+data to checksum to 0.

> VCPU_INFO
> ---------
> 
> > [ This is a combination of parts of the extended-info and
> > XC_SAVE_ID_VCPU_INFO chunks. ]
> 
> The VCPU_INFO record includes the maximum possible VCPU ID.  This will
> be followed a VCPU_CONTEXT record for each online VCPU.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+------------------------+
>     | max_vcpu_id           | (reserved)             |
>     +-----------------------+------------------------+
> 
> --------------------------------------------------------------------
> Field        Description
> -----------  ---------------------------------------------------
> max_vcpu_id  Maximum possible VCPU ID.
> --------------------------------------------------------------------

If this is all that's in this record, maybe it should be called
VCPU_COUNT?

> P2M
> ---
> 
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]
> 
> The P2M record contains a portion of the source domain's P2M.
> Multiple P2M records may be sent if the source P2M changes during the
> stream.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | pfn_begin                                       |
>     +-------------------------------------------------+
>     | pfn_end                                         |
>     +-------------------------------------------------+
>     | mfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | mfn[N-1]                                        |
>     +-------------------------------------------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> pfn_begin   The first PFN in this portion of the P2M
> 
> pfn_end     One past the last PFN in this portion of the P2M.
> 
> mfn         Array of (pfn_end - pfn-begin) MFNs corresponding to
>             the set of PFNs in the range [pfn_begin, pfn_end).
> --------------------------------------------------------------------

The current save record doesn't contain the p2m itself, but rather the
p2m_frame_list, an array of the MFNs (in the save record, PFNs) that
hold the actual p2m.  Frames in that list are used to populate the p2m
as memory is allocated on the receiving side.

I'm not sure what it would mean to allow the guest to change the
location of its p2m table (as distinct from the contents) on the fly
during a migration.  We would at least have to re-send the contents of
any frames that are no longer in the p2m table, in case the receiver
has already overwritten them.  And I think it should be fine to just
send the whole list every time (or else we need to manage deltas
carefully too).

Also, while I'm thinking about record names, this probably ought to be
called X86_PV_P2M or something like that.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-12 16:36 ` Tim Deegan
@ 2014-02-12 17:09   ` David Vrabel
  2014-02-12 18:16     ` Frediano Ziglio
  0 siblings, 1 reply; 38+ messages in thread
From: David Vrabel @ 2014-02-12 17:09 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Shriram Rajagopalan, Stefano Stabellini, Ian Jackson,
	Ian Campbell, Xen-devel

On 12/02/14 16:36, Tim Deegan wrote:
> Hi,
> 
> This draft has my wholehearted support.  Even without addressing any
> of the points under discussion something along these lines would be a
> vast improvement on the current format.
> 
> I have two general questions:
> 
>  - The existing save-format definition is spread across a number of
>    places: libxc for hypervisor state, qemu for DM state, and the main
>    toolstack (libxl/xend/xapi/&c) for other config runes and a general
>    wrapper.  This is clearly a reworking of the libxc parts -- do
>    you think there's anything currently defined elsewhere that belongs
>    in this spec?

I was considering this format as a container for those blobs, but I
think there should be enough flexibility that additional things could be
moved into the spec in the future.

>  - Have you given any thought to making this into a wire protocol
>    rather than just a file format?  Would there be any benefit to
>    having records individually acked by the receiver in a live
>    migration, or having the receiver send instructions about
>    compatibility?  Or is that again left to the toolstack to manage?

I don't see how having the restorer send anything back to the saver
would work with image files[1] so any two way stuff must be optional so
this can be left for future.

Ian J had some suggestions for how to handle compatibility better
without having the restorer report its capabilities.

>> checksum     CRC-32 checksum of the record body (including any trailing
>>              padding), or 0x00000000 if the checksum field is invalid.
> 
> Apart from any discussion of the merits of per-record vs whole-file
> checksums, it would be useful for this checksum to cover the header
> too.  E.g., by declaring it to be the checksum of header+data where
> the checksum field is 0, or by declaring that it shall be that pattern
> which causes the finished header+data to checksum to 0.

A single checksum for a multi GB file doesn't seem robust enough, which
is why I made it per-record.  Per-record checksums also mean you can
discard records the restorer isn't interested in without having to read
them to calculate the checksum.

I'm not entirely convinced by the usefulness of checksums, though.  If
no one else thinks they would be useful I'll probably drop them.

>> P2M
>> ---
[...]
> The current save record doesn't contain the p2m itself, but rather the
> p2m_frame_list, an array of the MFNs (in the save record, PFNs) that
> hold the actual p2m.  Frames in that list are used to populate the p2m
> as memory is allocated on the receiving side.

Er. Yes, I got confused by the code here and misunderstood it.

David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Domain Save Image Format proposal (draft B)
  2014-02-12 17:09   ` David Vrabel
@ 2014-02-12 18:16     ` Frediano Ziglio
  0 siblings, 0 replies; 38+ messages in thread
From: Frediano Ziglio @ 2014-02-12 18:16 UTC (permalink / raw)
  To: David Vrabel
  Cc: Ian Campbell, Stefano Stabellini, Tim Deegan, Ian Jackson,
	Xen-devel, Shriram Rajagopalan

On Wed, 2014-02-12 at 17:09 +0000, David Vrabel wrote:
> On 12/02/14 16:36, Tim Deegan wrote:
> > Hi,
> > 
> > This draft has my wholehearted support.  Even without addressing any
> > of the points under discussion something along these lines would be a
> > vast improvement on the current format.
> > 
> > I have two general questions:
> > 
> >  - The existing save-format definition is spread across a number of
> >    places: libxc for hypervisor state, qemu for DM state, and the main
> >    toolstack (libxl/xend/xapi/&c) for other config runes and a general
> >    wrapper.  This is clearly a reworking of the libxc parts -- do
> >    you think there's anything currently defined elsewhere that belongs
> >    in this spec?
> 
> I was considering this format as a container for those blobs, but I
> think there should be enough flexibility that additional things could be
> moved into the spec in the future.
> 
> >  - Have you given any thought to making this into a wire protocol
> >    rather than just a file format?  Would there be any benefit to
> >    having records individually acked by the receiver in a live
> >    migration, or having the receiver send instructions about
> >    compatibility?  Or is that again left to the toolstack to manage?
> 
> I don't see how having the restorer send anything back to the saver
> would work with image files[1] so any two way stuff must be optional so
> this can be left for future.
> 
> Ian J had some suggestions for how to handle compatibility better
> without having the restorer report its capabilities.
> 
> >> checksum     CRC-32 checksum of the record body (including any trailing
> >>              padding), or 0x00000000 if the checksum field is invalid.
> > 
> > Apart from any discussion of the merits of per-record vs whole-file
> > checksums, it would be useful for this checksum to cover the header
> > too.  E.g., by declaring it to be the checksum of header+data where
> > the checksum field is 0, or by declaring that it shall be that pattern
> > which causes the finished header+data to checksum to 0.
> 
> A single checksum for a multi GB file doesn't seem robust enough, which
> is why I made it per-record.  Per-record checksums also mean you can
> discard records the restorer isn't interested in without having to read
> them to calculate the checksum.
> 
> I'm not entirely convinced by the usefulness of checksums, though.  If
> no one else thinks they would be useful I'll probably drop them.
> 

<paranoia mode>
I think it depends if you want to detect some type of corruption. Images
can be send through wire or saved to disk and then restored. Although
network put a lot of checks and disk know when data are corrupted in the
physical layer (as sectors have CRCs too) corruptions occurring on
memory transfers or bus (like SATA or PCI) are not detected.

CRC could also be useful for Remus to detect corruption and request
updating it.
</paranoia mode>

> >> P2M
> >> ---
> [...]
> > The current save record doesn't contain the p2m itself, but rather the
> > p2m_frame_list, an array of the MFNs (in the save record, PFNs) that
> > hold the actual p2m.  Frames in that list are used to populate the p2m
> > as memory is allocated on the receiving side.
> 
> Er. Yes, I got confused by the code here and misunderstood it.
> 
> David
> 

Frediano

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2014-02-12 18:16 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-10 17:20 Domain Save Image Format proposal (draft B) David Vrabel
2014-02-10 17:32 ` Andrew Cooper
2014-02-10 17:48 ` Frediano Ziglio
2014-02-10 20:00 ` Shriram Rajagopalan
2014-02-11  1:35   ` Andrew Cooper
2014-02-11  4:12     ` Shriram Rajagopalan
2014-02-11 10:58       ` Zoltan Kiss
2014-02-12 15:34       ` Tim Deegan
2014-02-11 11:58   ` David Vrabel
2014-02-11 16:13     ` Shriram Rajagopalan
2014-02-11  9:30 ` Ian Campbell
2014-02-11 11:40   ` David Vrabel
2014-02-11 12:09     ` Ian Campbell
2014-02-11 13:10       ` Jan Beulich
2014-02-11 13:12       ` Ian Campbell
2014-02-12 15:41   ` Tim Deegan
2014-02-11 10:06 ` Jan Beulich
2014-02-11 13:04   ` David Vrabel
2014-02-11 13:20     ` Jan Beulich
2014-02-11 16:45   ` Ian Jackson
2014-02-11 17:08     ` Shriram Rajagopalan
2014-02-11 17:15       ` Ian Campbell
2014-02-11 17:30         ` Ian Jackson
2014-02-11 17:31         ` Frediano Ziglio
2014-02-11 17:53           ` Ian Jackson
2014-02-11 17:59             ` Ian Campbell
2014-02-12  9:07               ` Frediano Ziglio
2014-02-12 11:27                 ` Frediano Ziglio
2014-02-12 11:34                   ` Ian Campbell
2014-02-11 16:38 ` Ian Jackson
2014-02-11 17:04   ` Andrew Cooper
2014-02-11 17:07     ` Ian Jackson
2014-02-11 16:49 ` Ian Jackson
2014-02-11 17:10   ` David Vrabel
2014-02-11 17:28     ` Ian Jackson
2014-02-12 16:36 ` Tim Deegan
2014-02-12 17:09   ` David Vrabel
2014-02-12 18:16     ` Frediano Ziglio

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.