Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability

From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Fabiano Rosas" <farosas@suse.de>,
	qemu-devel@nongnu.org, "Claudio Fontana" <cfontana@suse.de>,
	jfehlig@suse.com, dfaggioli@suse.com, dgilbert@redhat.com,
	"Juan Quintela" <quintela@redhat.com>,
	"Nikolay Borisov" <nborisov@suse.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>
Subject: Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
Date: Fri, 31 Mar 2023 12:13:32 -0400
Message-ID: <ZCcGrEkSA64z6MpV@x1n>
In-Reply-To: <ZCb9oVI6WUaGizwm@redhat.com>

On Fri, Mar 31, 2023 at 04:34:57PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 10:39:23AM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > > > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > > > From: Nikolay Borisov <nborisov@suse.com>
> > > > > 
> > > > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > > > each ram page of the migration stream has a specific offset in the
> > > > > resulting migration stream. The reason why we'd want such behavior are
> > > > > two fold:
> > > > > 
> > > > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > > > >    bounded size, since pages which are dirtied multiple times will
> > > > >    always go to a fixed location in the file, rather than constantly
> > > > >    being added to a sequential stream. This eliminates cases where a vm
> > > > >    with, say, 1G of ram can result in a migration file that's 10s of
> > > > >    GBs, provided that the workload constantly redirties memory.
> > > > > 
> > > > >  - It paves the way to implement DIO-enabled save/restore of the
> > > > >    migration stream as the pages are ensured to be written at aligned
> > > > >    offsets.
> > > > > 
> > > > > The feature requires changing the stream format. First, a bitmap is
> > > > > introduced which tracks which pages have been written (i.e are
> > > > > dirtied) during migration and subsequently it's being written in the
> > > > > resulting file, again at a fixed location for every ramblock. Zero
> > > > > pages are ignored as they'd be zero in the destination migration as
> > > > > well. With the changed format data would look like the following:
> > > > > 
> > > > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > > > 
> > > > What happens with huge pages?  Would page size matter here?
> > > > 
> > > > I would assume it's fine it uses a constant (small) page size, assuming
> > > > that should match with the granule that qemu tracks dirty (which IIUC is
> > > > the host page size not guest's).
> > > > 
> > > > But I didn't yet pay any further thoughts on that, maybe it would be
> > > > worthwhile in all cases to record page sizes here to be explicit or the
> > > > meaning of bitmap may not be clear (and then the bitmap_size will be a
> > > > field just for sanity check too).
> > > 
> > > I think recording the page sizes is an anti-feature in this case.
> > > 
> > > The migration format / state needs to reflect the guest ABI, but
> > > we need to be free to have different backend config behind that
> > > either side of the save/restore.
> > > 
> > > IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> > > small pages initially and after restore use 2 x 1 GB hugepages,
> > > or vice-versa.
> > > 
> > > The important thing with the pages that are saved into the file
> > > is that they are a 1:1 mapping guest RAM regions to file offsets.
> > > IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> > > in the file.
> > > 
> > > If the src VM used 1 GB pages, we would be writing a full 2 GB
> > > of data assuming both pages were dirty.
> > > 
> > > If the src VM used 4k pages, we would be writing some subset of
> > > the 2 GB of data, and the rest would be unwritten.
> > > 
> > > Either way, when reading back the data we restore it into either
> > > 1 GB pages or 4k pages, because any places that were unwritten
> > > originally will read back as zeros.
> > 
> > I think there's already the page size information, because there's a bitmap
> > embedded in the format at least in the current proposal, and the bitmap can
> > only be defined with a page size provided in some form.
> > 
> > Here I agree the backend can change before/after a migration (live or
> > not).  Though the question is whether page size matters in the snapshot
> > layout rather than what the loaded QEMU instance will use as backend.
> 
> IIUC, the page size information merely sets a constraint on the granularity
> of unwritten (sparse) regions in the file. If we didn't want to express
> page size directly in the file format we would need explicit start/end
> offsets for each written block. This is less convenient than just having
> a bitmap, so I think it's ok to use the page size bitmap

I'm perfectly fine with having the bitmap.  The original question was about
whether we should store page_size into the same header too along with the
bitmap.

Currently I think the page size can be implied by either the system
configuration (e.g. arch, cpu setups) or the size of the bitmap.  So I'm
wondering whether it'll be cleaner to replace the bitmap size with the page
size (hence one can calculate the bitmap size from the page size), or just
keep both of them for sanity checking.
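
To make the "calculate the bitmap size from the page size" part concrete,
here is a rough sketch.  It is not taken from the patch: the function names
are made up for illustration, and it assumes one bit per page with the
bitmap rounded up to whole bytes (the real code may well round to
unsigned long instead).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: bytes needed for a bitmap with
 * one bit per page covering used_len bytes of a ramblock.
 * Assumption: the bitmap is rounded up to whole bytes.
 */
static uint64_t fixed_ram_bitmap_bytes(uint64_t used_len, uint64_t page_size)
{
    uint64_t pages = (used_len + page_size - 1) / page_size;

    return (pages + 7) / 8;
}

/*
 * Hypothetical sanity check for the "keep both fields" option: the stored
 * bitmap_size must match what page_size and used_len imply.
 */
static bool fixed_ram_header_consistent(uint64_t used_len, uint64_t page_size,
                                        uint64_t bitmap_size)
{
    /* Reject a zero or non-power-of-two page size before dividing by it. */
    if (page_size == 0 || (page_size & (page_size - 1)) != 0) {
        return false;
    }

    return bitmap_size == fixed_ram_bitmap_bytes(used_len, page_size);
}
```

With only page_size stored, the loader would call the first helper; with
both stored, the second one gives the sanity check for free.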

Besides, since we seem to be defining a new header format to be stored on
disks, maybe it'll be worthwhile to leave some space for future extensions
of the image?

So the image format can start with a version field (perhaps also with a field
explaining what it contains).  Then if someday we want to extend the image,
the new qemu binary will still be able to load the old image even if the
format may change.  Or vice versa, where the old qemu binary would be able
to identify it's loading a new image that it doesn't really understand, so
it can properly notify the user rather than fail with weird loading errors.
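
Something like below is what I have in mind, only as a sketch.  The magic
string, the field names and the reserved space are all invented here and
not part of the proposed format; endianness handling is ignored too.

```c
#include <stdint.h>
#include <string.h>

/* Everything below is hypothetical, not part of the proposed format. */
#define FIXED_RAM_MAGIC   "FXRAMIMG"   /* 8 bytes, no trailing NUL stored */
#define FIXED_RAM_VERSION 1u           /* newest version this binary knows */

struct fixed_ram_image_hdr {
    char     magic[8];     /* identifies the file as a fixed-ram image */
    uint32_t version;      /* bumped whenever the layout changes */
    uint32_t features;     /* bitmask describing what the image contains */
    uint64_t reserved[4];  /* space left for future extensions */
};

enum fixed_ram_hdr_status {
    FIXED_RAM_HDR_OK,
    FIXED_RAM_HDR_BAD_MAGIC,
    FIXED_RAM_HDR_TOO_NEW,
};

/*
 * Old images (version <= ours) are accepted.  A newer image is reported
 * explicitly so the caller can tell the user, instead of failing later
 * with a confusing load error.
 */
static enum fixed_ram_hdr_status
fixed_ram_check_header(const struct fixed_ram_image_hdr *hdr)
{
    if (memcmp(hdr->magic, FIXED_RAM_MAGIC, sizeof(hdr->magic)) != 0) {
        return FIXED_RAM_HDR_BAD_MAGIC;
    }
    if (hdr->version > FIXED_RAM_VERSION) {
        return FIXED_RAM_HDR_TOO_NEW;
    }
    return FIXED_RAM_HDR_OK;
}
```

The point is only that the loader checks the version before touching
anything else, so the two cases above (new binary with old image, old binary
with new image) are told apart up front.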

-- 
Peter Xu