From mboxrd@z Thu Jan  1 00:00:00 1970
From: Tim Deegan <tim@xen.org>
Subject: Re: Domain Save Image Format proposal (draft B)
Date: Wed, 12 Feb 2014 17:36:25 +0100
Message-ID: <20140212163625.GE91459@deinos.phlegethon.org>
References: <52F90A71.40802@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Content-Disposition: inline
In-Reply-To: <52F90A71.40802@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: David Vrabel <david.vrabel@citrix.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>, Stefano Stabellini <stefano.stabellini@eu.citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, Ian Campbell <ian.campbell@citrix.com>, "Xen-devel@lists.xen.org" <Xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

Hi,

This draft has my wholehearted support.  Even without addressing any
of the points under discussion something along these lines would be a
vast improvement on the current format.

I have two general questions:

 - The existing save-format definition is spread across a number of
   places: libxc for hypervisor state, qemu for DM state, and the main
   toolstack (libxl/xend/xapi/&c) for other config runes and a general
   wrapper.  This is clearly a reworking of the libxc parts -- do
   you think there's anything currently defined elsewhere that belongs
   in this spec?

 - Have you given any thought to making this into a wire protocol
   rather than just a file format?  Would there be any benefit to
   having records individually acked by the receiver in a live
   migration, or having the receiver send instructions about
   compatibility?  Or is that again left to the toolstack to manage?

and a few nits:

At 17:20 +0000 on 10 Feb (1392049249), David Vrabel wrote:
> Records
> =======
> 
> A record has a record header, type specific data and a trailing
> footer.  If body_length is not a multiple of 8, the body is padded
> with zeroes to align the checksum field on an 8 octet boundary.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+-------------------------+
>     | type                  | body_length             |
>     +-----------+-----------+-------------------------+
>     | options   | (reserved)                          |
>     +-----------+-------------------------------------+
>     ...
>     Record body of length body_length octets followed by
>     0 to 7 octets of padding.
>     ...
>     +-----------------------+-------------------------+
>     | checksum              | (reserved)              |
>     +-----------------------+-------------------------+
> 
> --------------------------------------------------------------------
> Field        Description
> -----------  -------------------------------------------------------
> type         0x00000000: END
> 
>              0x00000001: PAGE_DATA
> 
>              0x00000002: VCPU_INFO
> 
>              0x00000003: VCPU_CONTEXT
> 
>              0x00000004: X86_PV_INFO
> 
>              0x00000005: P2M
> 
>              0x00000006 - 0xFFFFFFFF: Reserved
> 
> body_length  Length in octets of the record body.
> 
> options      Bit 0: 0 - checksum invalid, 1 = checksum valid.
> 
>              Bit 1-15: Reserved.
> 
> checksum     CRC-32 checksum of the record body (including any trailing
>              padding), or 0x00000000 if the checksum field is invalid.

Apart from any discussion of the merits of per-record vs whole-file
checksums, it would be useful for this checksum to cover the header
too.  E.g., by declaring it to be the checksum of header+data where
the checksum field is 0, or by declaring that it shall be that pattern
which causes the finished header+data to checksum to 0.

> VCPU_INFO
> ---------
> 
> > [ This is a combination of parts of the extended-info and
> > XC_SAVE_ID_VCPU_INFO chunks. ]
> 
> The VCPU_INFO record includes the maximum possible VCPU ID.  This will
> be followed a VCPU_CONTEXT record for each online VCPU.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-----------------------+------------------------+
>     | max_vcpu_id           | (reserved)             |
>     +-----------------------+------------------------+
> 
> --------------------------------------------------------------------
> Field        Description
> -----------  ---------------------------------------------------
> max_vcpu_id  Maximum possible VCPU ID.
> --------------------------------------------------------------------

If this is all that's in this record, maybe it should be called
VCPU_COUNT?

> P2M
> ---
> 
> [ This is a more flexible replacement for the old p2m_size field and
> p2m array. ]
> 
> The P2M record contains a portion of the source domain's P2M.
> Multiple P2M records may be sent if the source P2M changes during the
> stream.
> 
>      0     1     2     3     4     5     6     7 octet
>     +-------------------------------------------------+
>     | pfn_begin                                       |
>     +-------------------------------------------------+
>     | pfn_end                                         |
>     +-------------------------------------------------+
>     | mfn[0]                                          |
>     +-------------------------------------------------+
>     ...
>     +-------------------------------------------------+
>     | mfn[N-1]                                        |
>     +-------------------------------------------------+
> 
> --------------------------------------------------------------------
> Field       Description
> ----------- --------------------------------------------------------
> pfn_begin   The first PFN in this portion of the P2M
> 
> pfn_end     One past the last PFN in this portion of the P2M.
> 
> mfn         Array of (pfn_end - pfn-begin) MFNs corresponding to
>             the set of PFNs in the range [pfn_begin, pfn_end).
> --------------------------------------------------------------------

The current save record doesn't contain the p2m itself, but rather the
p2m_frame_list, an array of the MFNs (in the save record, PFNs) that
hold the actual p2m.  Frames in that list are used to populate the p2m
as memory is allocated on the receiving side.

I'm not sure what it would mean to allow the guest to change the
location of its p2m table (as distinct from the contents) on the fly
during a migration.  We would at least have to re-send the contents of
any frames that are no longer in the p2m table, in case the receiver
has already overwritten them.  And I think it should be fine to just
send the whole list every time (or else we need to manage deltas
carefully too).

Also, while I'm thinking about record names, this probably ought to be
called X86_PV_P2M or something like that.

Cheers,

Tim.