From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: Domain Save Image Format proposal (draft B)
Date: Tue, 11 Feb 2014 09:30:40 +0000
Message-ID: <1392111040.26657.50.camel@kazak.uk.xensource.com>
References: <52F90A71.40802@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <52F90A71.40802@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: David Vrabel <david.vrabel@citrix.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>, Stefano Stabellini <stefano.stabellini@eu.citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, "Xen-devel@lists.xen.org" <Xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote:
> It
> does not currently cover all use cases (e.g., images for HVM guest are
> not considered).

Other currently missing cases:
      * live migration
      * remus
      * page compression
      * ...?

I think it is useful to enumerate them, and if necessary to note which
ones you aren't planning to consider in the immediate future.

> Headers
> =======
> 
> Image Header
> ------------
> 
> The image header identifies an image as a Xen domain save image.  It
> includes the version of this specification that the image complies
> with.
> 
> Tools supporting version _V_ of the specification shall always save
> images using version _V_.  Tools shall support restoring from version
> _V_ and version _V_ - 1.

This isn't quite right since it is in terms of image format version and
not Xen version (unless the two are to be linked somehow). The Xen
project supports migration from version X-1 to version X (where X is the
Xen version). It's not inconceivable that the image format version
wouldn't change over Xen releases.

This might become a more obvious issue once you add HVM to the spec and
introduce the necessary Xen HVM save state block.

> Tools may additionally support restoring from earlier versions.

> The marker field can be used to distinguish between legacy images and
> those corresponding to this specification.  Legacy images will have at
> one or more zero bits within the first 8 octets of the image.
> 
> Fields within the image header are always in _big-endian_ byte order,
> regardless of the setting of the endianness bit.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | marker                                          |
>     +-----------------------+-------------------------+
>     | id                    | version                 |
>     +-----------+-----------+-------------------------+
>     | options   |                                     |
>     +-----------+-------------------------------------+
> 
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> marker      0xFFFFFFFFFFFFFFFF.
> 
> id          0x58454E46 ("XENF" in ASCII).
> 
> version     0x00000001.  The version of this specification.
> 
> options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.

Couldn't we just specify that things are in a specific endianness
related to the header's arch field?

I appreciate the desire to make the format endian neutral and explicit
about which is in use but (apart from the header) why would you ever
want to go non-native endian for a given arch?

>             bit 1-15: Reserved.
> --------------------------------------------------------------------
> 
> Domain Header
> -------------
> 
> The domain header includes general properties of the domain.
> 
>      0      1     2     3     4     5     6     7 octet
>     +-----------+-----------+-----------+-------------+
>     | arch      | type      | page_shift| (reserved)  |
>     +-----------+-----------+-----------+-------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> arch        0x0000: Reserved.
> 
>             0x0001: x86.
> 
>             0x0002: ARM.
> 
> type        0x0000: Reserved.
> 
>             0x0001: x86 PV.
> 
>             0x0002 - 0xFFFF: Reserved.

Is the type field per-arch? i.e. if arch=0x0002 can we use type = 0x0001
for ARM domains?

> page_shift  Size of a guest page as a power of two.
> 
>             i.e., page size = 2^page_shift^.

This doesn't really need 16 bits, no harm though I suppose.

> --------------------------------------------------------------------
> 
> 
> Records
> =======
> 
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
>     ...
>     Record body of length body_length octets followed by
>     0 to 7 octets of padding.
>     ...
>     +-----------------------+-------------------------+
>     | checksum              | (reserved)              |
>     +-----------------------+-------------------------+
> 
> --------------------------------------------------------------------
> Field        Description
> -----------  -------------------------------------------------------
> type         0x00000000: END
> 
>              0x00000001: PAGE_DATA
> 
>              0x00000002: VCPU_INFO
> 
>              0x00000003: VCPU_CONTEXT
> 
>              0x00000004: X86_PV_INFO
> 
>              0x00000005: P2M
> 
>              0x00000006 - 0xFFFFFFFF: Reserved
> 
> body_length  Length in octets of the record body.
> 
> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
> 
>              Bit 1-15: Reserved.
> 
> checksum     CRC-32 checksum of the record body (including any trailing
>              padding), or 0x00000000 if the checksum field is invalid.

Which CRC-32 :-P

(wikipedia claims there are 6 variants using various polynomials...)

> P2M
> ---
> 
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]


What is the latter for again?

I don't know if it really fits into this spec, but it might be useful to
include the rationale for the inclusion of each record type.

> PAGE_DATA
> ---------
> 
> The bulk of an image consists of many PAGE_DATA records containing the
> memory contents.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | count (C)             | (reserved)              |
>     +-----------------------+-------------------------+
>     | pfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | pfn[C-1]                                        |
>     +-------------------------------------------------+
>     | page_data[0]...                                 |
>     ...
>     +-------------------------------------------------+
>     | page_data[N-1]...                               |
>     ...
>     +-------------------------------------------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> count       Number of pages described in this record.
> 
> pfn         An array of count PFNs. Bits 63-60 contain
>             the XEN\_DOMCTL\_PFINFO_* value for that PFN.

Now might be a good time to remove this intertwining? I suppose 60-bits
is a lot of pfn's, but if the VMs address space is sparse it isn't out
of the question.

It might be useful to be explicit (perhaps in a glossary) about what you
mean by "PFN" for various guest types -- some of the terminology can be
a bit confused in actual usage between the various guest types,
including in some of the existing interface definitions.

> Layout
> ======
> 
> The set of valid records depends on the guest architecture and type.
> 
> x86 PV Guest
> ------------
> 
> An x86 PV guest image will have in this order:
> 
> 1. Image header
> 2. Domain header
> 3. X86_PV_INFO record
> 4. At least one P2M record
> 5. At least one PAGE_DATA record
> 6. VCPU_INFO record
> 6. At least one VCPU_CONTEXT record
> 7. END record

It might be good to be more flexible in the ordering (or rather, less
prescriptive here), unless there is a reason for a strict ordering i.e.
dependencies between records.

For example it seems like the p2m cannot change after you start sending
page_data, which I don't think works for live migration (although I'm
not sure you've yet incorporated that as a feature).

Ian.