Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management

From: David Woodhouse <dwmw2@infradead.org>
To: Julien Grall <julien@xen.org>,
	Xen-devel <xen-devel@lists.xenproject.org>,
	 "Xia, Hongyan" <hongyax@amazon.com>
Cc: "Stefano Stabellini" <sstabellini@kernel.org>,
	"Wei Liu" <wl@xen.org>,
	paul@xen.org, "George Dunlap" <George.Dunlap@eu.citrix.com>,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>,
	"Ian Jackson" <ian.jackson@eu.citrix.com>,
	"Jan Beulich" <jbeulich@suse.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
Date: Tue, 14 Jan 2020 15:48:07 +0100	[thread overview]
Message-ID: <b24cf0a1b56f56167f51d5dd86fd81afb48a377c.camel@infradead.org> (raw)
In-Reply-To: <1743ee7c-e238-8b77-d40f-bd0e3d6bb0ed@xen.org>

[-- Attachment #1.1: Type: text/plain, Size: 4022 bytes --]

On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
> Hi David,
> 
> On 13/01/2020 11:54, David Woodhouse wrote:
> > On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> > > When doing a live update, Xen needs to be very careful not to scribble
> > > on pages which contain guest memory or state information for the
> > > domains which are being preserved.
> > > 
> > > The information about which pages are in use is contained in the live
> > > update state passed from the previous Xen — which is mostly just a
> > > guest-transparent live migration data stream, except that it points to
> > > the page tables in place in memory while traditional live migration
> > > obviously copies the pages separately.
> > > 
> > > Our initial implementation actually prepended a list of 'in-use' ranges
> > > to the live update state, and made the boot allocator treat them the
> > > same as 'bad pages'. That worked well enough for initial development
> > > but wouldn't scale to a live production system, mainly because the boot
> > > allocator has a limit of 512 memory ranges that it can keep track of,
> > > and a real system would end up more fragmented than that.
> > > 
> > > My other concern with that approach is that it required two passes over
> > > the domain-owned pages. We have to do a later pass *anyway*, as we set
> > > up ownership in the frametable for each page — and that has to happen
> > > after we've managed to allocate a 'struct domain' for each page_info to
> > > point to. If we want to keep the pause time due to a live update down
> > > to a bare minimum, doing two passes over the full set of domain pages
> > > isn't my favourite strategy.
> 
> We actually need one more pass for PV domain (at least). The pass is 
> used to allocate the page type (e.g L4, L1,...). This can't be done 
> before because we need the pages to belongs to the guest before going 
> through its page-tables.

All the more reason why I don't want to do an *additional* pass just
for the allocator.

> > > 
> > > So we've settled on a simpler approach \x02— reserve a contiguous region
> > > of physical memory which *won't* be used for domain pages. Let the boot
> > > allocator see *only* that region of memory, and plug the rest of the
> > > memory in later only after doing a full pass of the live update state.
> 
> It is a bit unclear what the region will be used for. If you plan to put 
> the state of the VMs in it, then you can't possibly use it for boot 
> allocation (e.g frametable) otherwise this may be overwritten when doing 
> the live update.

Right. This is only for boot time allocations by Xen#2, before it's
processed the LU data and knows which parts of the rest of memory it
can use. It allocates its frame table from there, as well as anything
else it needs to allocate before/while processing the LU data.

As an implementation detail, I anticipate that we'll be using the boot
allocator for that early part from the reserved region, and that the
switch to using the full available memory (less those pages already in-
use) will *coincide* with switching to the real heap allocator.

The reserved region *isn't* for the LU data itself. That can be
allocated from arbitrary pages *outside* the reserved area, in Xen#1.
Xen#2 can vmap those pages, and needs to avoid stomping on them just
like it needs to avoid stomping on actual domain-owned pages.

The plan is that Xen#1 allocates arbitrary pages to store the actual LU
data. Then another page (or higher order allocation if we need >2MiB of
actual LU data) containing the MFNs of all those data pages. Then we
need to somehow pass the address of that MFN-list to Xen#2.

My current plan is to put *that* in the first 64 bits of the reserved
LU bootmem region, and load it from there early in the Xen#2 boot
process. I'm looking at adding an IND_WRITE64 primitive to the kimage
processing, to allow it to be trivially appended for kexec_reloc() to
obey.

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel