From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ian Campbell Subject: Re: Domain Save Image Format proposal (draft B) Date: Tue, 11 Feb 2014 09:30:40 +0000 Message-ID: <1392111040.26657.50.camel@kazak.uk.xensource.com> References: <52F90A71.40802@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <52F90A71.40802@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: David Vrabel Cc: Shriram Rajagopalan , Stefano Stabellini , Ian Jackson , "Xen-devel@lists.xen.org" List-Id: xen-devel@lists.xenproject.org On Mon, 2014-02-10 at 17:20 +0000, David Vrabel wrote: > It > does not currently cover all use cases (e.g., images for HVM guest are > not considered). Other currently missing cases: * live migration * remus * page compression * ...? I think it is useful to enumerate them, and if necessary to note which ones you aren't planning to consider in the immediate future. > Headers > ======= > > Image Header > ------------ > > The image header identifies an image as a Xen domain save image. It > includes the version of this specification that the image complies > with. > > Tools supporting version _V_ of the specification shall always save > images using version _V_. Tools shall support restoring from version > _V_ and version _V_ - 1. This isn't quite right since it is in terms of image format version and not Xen version (unless the two are to be linked somehow). The Xen project supports migration from version X-1 to version X (where X is the Xen version). It's not inconceivable that the image format version wouldn't change over Xen releases. This might become a more obvious issue once you add HVM to the spec and introduce the necessary Xen HVM save state block. > Tools may additionally support restoring from earlier versions. > The marker field can be used to distinguish between legacy images and > those corresponding to this specification. Legacy images will have at > one or more zero bits within the first 8 octets of the image. > > Fields within the image header are always in _big-endian_ byte order, > regardless of the setting of the endianness bit. > > 0 1 2 3 4 5 6 7 octet > +-------------------------------------------------+ > | marker | > +-----------------------+-------------------------+ > | id | version | > +-----------+-----------+-------------------------+ > | options | | > +-----------+-------------------------------------+ > > > -------------------------------------------------------------------- > Field Description > ----------- -------------------------------------------------------- > marker 0xFFFFFFFFFFFFFFFF. > > id 0x58454E46 ("XENF" in ASCII). > > version 0x00000001. The version of this specification. > > options bit 0: Endianness. 0 = little-endian, 1 = big-endian. Couldn't we just specify that things are in a specific endianness related to the header's arch field? I appreciate the desire to make the format endian neutral and explicit about which is in use but (apart from the header) why would you ever want to go non-native endian for a given arch? > bit 1-15: Reserved. > -------------------------------------------------------------------- > > Domain Header > ------------- > > The domain header includes general properties of the domain. > > 0 1 2 3 4 5 6 7 octet > +-----------+-----------+-----------+-------------+ > | arch | type | page_shift| (reserved) | > +-----------+-----------+-----------+-------------+ > > -------------------------------------------------------------------- > Field Description > ----------- -------------------------------------------------------- > arch 0x0000: Reserved. > > 0x0001: x86. > > 0x0002: ARM. > > type 0x0000: Reserved. > > 0x0001: x86 PV. > > 0x0002 - 0xFFFF: Reserved. Is the type field per-arch? i.e. if arch=0x0002 can we use type = 0x0001 for ARM domains? > page_shift Size of a guest page as a power of two. > > i.e., page size = 2^page_shift^. This doesn't really need 16 bits, no harm though I suppose. > -------------------------------------------------------------------- > > > Records > ======= > > A record has a record header, type specific data and a trailing > footer. If body_length is not a multiple of 8, the body is padded > with zeroes to align the checksum field on an 8 octet boundary. > > 0 1 2 3 4 5 6 7 octet > +-----------------------+-------------------------+ > | type | body_length | > +-----------+-----------+-------------------------+ > | options | (reserved) | > +-----------+-------------------------------------+ > ... > Record body of length body_length octets followed by > 0 to 7 octets of padding. > ... > +-----------------------+-------------------------+ > | checksum | (reserved) | > +-----------------------+-------------------------+ > > -------------------------------------------------------------------- > Field Description > ----------- ------------------------------------------------------- > type 0x00000000: END > > 0x00000001: PAGE_DATA > > 0x00000002: VCPU_INFO > > 0x00000003: VCPU_CONTEXT > > 0x00000004: X86_PV_INFO > > 0x00000005: P2M > > 0x00000006 - 0xFFFFFFFF: Reserved > > body_length Length in octets of the record body. > > options Bit 0: 0 - checksum invalid, 1 = checksum valid. > > Bit 1-15: Reserved. > > checksum CRC-32 checksum of the record body (including any trailing > padding), or 0x00000000 if the checksum field is invalid. Which CRC-32 :-P (wikipedia claims there are 6 variants using various polynomials...) > P2M > --- > > [ This is a more flexible replacement for the old p2m_size field and > p2m array. ] What is the latter for again? I don't know if it really fits into this spec, but it might be useful to include the rationale for the inclusion of each record type. > PAGE_DATA > --------- > > The bulk of an image consists of many PAGE_DATA records containing the > memory contents. > > 0 1 2 3 4 5 6 7 octet > +-----------------------+-------------------------+ > | count (C) | (reserved) | > +-----------------------+-------------------------+ > | pfn[0] | > +-------------------------------------------------+ > ... > +-------------------------------------------------+ > | pfn[C-1] | > +-------------------------------------------------+ > | page_data[0]... | > ... > +-------------------------------------------------+ > | page_data[N-1]... | > ... > +-------------------------------------------------+ > > -------------------------------------------------------------------- > Field Description > ----------- -------------------------------------------------------- > count Number of pages described in this record. > > pfn An array of count PFNs. Bits 63-60 contain > the XEN\_DOMCTL\_PFINFO_* value for that PFN. Now might be a good time to remove this intertwining? I suppose 60-bits is a lot of pfn's, but if the VMs address space is sparse it isn't out of the question. It might be useful to be explicit (perhaps in a glossary) about what you mean by "PFN" for various guest types -- some of the terminology can be a bit confused in actual usage between the various guest types, including in some of the existing interface definitions. > Layout > ====== > > The set of valid records depends on the guest architecture and type. > > x86 PV Guest > ------------ > > An x86 PV guest image will have in this order: > > 1. Image header > 2. Domain header > 3. X86_PV_INFO record > 4. At least one P2M record > 5. At least one PAGE_DATA record > 6. VCPU_INFO record > 6. At least one VCPU_CONTEXT record > 7. END record It might be good to be more flexible in the ordering (or rather, less prescriptive here), unless there is a reason for a strict ordering i.e. dependencies between records. For example it seems like the p2m cannot change after you start sending page_data, which I don't think works for live migration (although I'm not sure you've yet incorporated that as a feature). Ian.