xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
From: Jan Beulich <jbeulich@suse.com>
To: Julien Grall <julien@xen.org>
Cc: dwmw2@infradead.org, paul@xen.org, hongyxia@amazon.com,
	raphning@amazon.com, maghul@amazon.com,
	Julien Grall <jgrall@amazon.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	George Dunlap <george.dunlap@citrix.com>,
	Ian Jackson <iwj@xenproject.org>,
	Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wl@xen.org>,
Subject: Re: [PATCH RFC 1/2] docs/design: Add a design document for Live Update
Date: Fri, 7 May 2021 11:52:52 +0200	[thread overview]
Message-ID: <f51b2ef6-3998-7371-cea9-502c5c9f8afa@suse.com> (raw)
In-Reply-To: <20210506104259.16928-2-julien@xen.org>

On 06.05.2021 12:42, Julien Grall wrote:
> +## High-level overview
> +
> +Xen has a framework to bring a new image of the Xen hypervisor in memory using
> +kexec.  The existing framework does not meet the baseline functionality for
> +Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
> +and all the guests.
> +
> +The operation can be divided in roughly 4 parts:
> +
> +    1. Trigger: The operation will by triggered from outside the hypervisor
> +       (e.g. dom0 userspace).
> +    2. Save: The state will be stabilized by pausing the domains and
> +       serialized by xen#1.
> +    3. Hand-over: xen#1 will pass the serialized state and transfer control to
> +       xen#2.
> +    4. Restore: The state will be deserialized by xen#2.
> +
> +All the domains will be paused before xen#1 is starting to save the states,
> +and any domain that was running before Live Update will be unpaused after
> +xen#2 has finished to restore the states.  This is to prevent a domain to try
> +to modify the state of another domain while it is being saved/restored.
> +
> +The current approach could be seen as non-cooperative migration with a twist:
> +all the domains (including dom0) are not expected be involved in the Live
> +Update process.
> +
> +The major differences compare to live migration are:
> +
> +    * The state is not transferred to another host, but instead locally to
> +      xen#2.
> +    * The memory content or device state (for passthrough) does not need to
> +      be part of the stream. Instead we need to preserve it.
> +    * PV backends, device emulators, xenstored are not recreated but preserved
> +      (as these are part of dom0).

Isn't dom0 too limiting here?

> +## Trigger
> +
> +Live update is built on top of the kexec interface to prepare the command line,
> +load xen#2 and trigger the operation.  A new kexec type has been introduced
> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
> +
> +The Live Update will be triggered from outside the hypervisor (e.g. dom0
> +userspace).  Support for the operation has been added in kexec-tools 2.0.21.
> +
> +All the domains will be paused before xen#1 is starting to save the states.
> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
> +scheduled.  In other words, a pause request will not wait for asynchronous
> +requests (e.g. I/O) to finish.  For Live Update, this is not an ideal time to
> +pause because it will require more xen#1 internal state to be transferred.
> +Therefore, all the domains will be paused at an architectural restartable
> +boundary.

To me this leaves entirely unclear what this then means. domain_pause()
not being suitable is one thing, but what _is_ suitable seems worth
mentioning. Among other things I'd be curious to know what this would
mean for pending hypercall continuations.

> +## Save
> +
> +xen#1 will be responsible to preserve and serialize the state of each existing
> +domain and any system-wide state (e.g M2P).
> +
> +Each domain will be serialized independently using a modified migration stream,
> +if there is any dependency between domains (such as for IOREQ server) they will
> +be recorded using a domid. All the complexity of resolving the dependencies are
> +left to the restore path in xen#2 (more in the *Restore* section).
> +
> +At the moment, the domains are saved one by one in a single thread, but it
> +would be possible to consider multi-threading if it takes too long. Although
> +this may require some adjustment in the stream format.
> +
> +As we want to be able to Live Update between major versions of Xen (e.g Xen
> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
> +structure but instead the minimal information that allow us to recreate the
> +domains.
> +
> +For instance, we don't want to preserve the frametable (and therefore
> +*struct page\_info*) as-is because the refcounting may be different across
> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
> +*struct page\_info* based on minimal information that are considered stable
> +(such as the page type).

Perhaps leaving it at this very generic description is fine, but I can
easily see cases (which may not even be corner ones) where this quickly
gets problematic: What if xen#2 has state xen#1 didn't (properly) record?
Such information may not be possible to take out of thin air. Is the
consequence then that in such a case LU won't work? If so, is it perhaps
worthwhile having a Limitations section somewhere?

> +## Hand over
> +
> +### Memory usage restrictions
> +
> +xen#2 must take care not to use any memory pages which already belong to
> +guests.  To facilitate this, a number of contiguous region of memory are
> +reserved for the boot allocator, known as *live update bootmem*.
> +
> +xen#1 will always reserve a region just below Xen (the size is controlled by
> +the Xen command line parameter liveupdate) to allow Xen growing and provide
> +information about LiveUpdate (see the section *Breadcrumb*).  The region will be
> +passed to xen#2 using the same command line option but with the base address
> +specified.

I particularly don't understand the "to allow Xen growing" aspect here:
xen#2 needs to be placed in a different memory range anyway until xen#1
has handed over control. Are you suggesting it gets moved over to xen#1's
original physical range (not necessarily an exact match), and then
perhaps to start below where xen#1 started? Why would you do this? Xen
intentionally lives at a 2Mb boundary, such that in principle (for EFI:
in fact) large page mappings are possible. I also see no reason to reuse
the same physical area of memory for Xen itself - all you need is for
Xen's virtual addresses to be properly mapped to the new physical range.
I wonder what I'm missing here.

> +For simplicity, additional regions will be provided in the stream.  They will
> +consist of region that could be re-used by xen#2 during boot (such as the
> +xen#1's frametable memory).
> +
> +xen#2 must not use any pages outside those regions until it has consumed the
> +Live Update data stream and determined which pages are already in use by
> +running domains or need to be re-used as-is by Xen (e.g M2P).

Is the M2P really in the "need to be re-used" group, not just "can
be re-used for simplicity and efficiency reasons"?

> +## Restore
> +
> +After xen#2 initialized itself and map the stream, it will be responsible to
> +restore the state of the system and each domain.
> +
> +Unlike the save part, it is not possible to restore a domain in a single pass.
> +There are dependencies between:
> +
> +    1. different states of a domain.  For instance, the event channels ABI
> +       used (2l vs fifo) requires to be restored before restoring the event
> +       channels.
> +    2. the same "state" within a domain.  For instance, in case of PV domain,
> +       the pages' ownership requires to be restored before restoring the type
> +       of the page (e.g is it an L4, L1... table?).
> +
> +    3. domains.  For instance when restoring the grant mapping, it will be
> +       necessary to have the page's owner in hand to do proper refcounting.
> +       Therefore the pages' ownership have to be restored first.
> +
> +Dependencies will be resolved using either multiple passes (for dependency
> +type 2 and 3) or using a specific ordering between records (for dependency
> +type 1).
> +
> +Each domain will be restored in 3 passes:
> +
> +    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
> +      down in 3 parts:
> +      * Allocate a domain via _domain\_create()_ but skip part that requires
> +        extra records (e.g HAP, P2M).
> +      * Restore any parts which needs to be done before create the vCPUs. This
> +        including restoring the P2M and whether HAP is used.
> +      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
> +    * Pass 1: It will restore the pages' ownership and the grant-table frames
> +    * Pass 2: This steps will restore any domain states (e.g vCPU state, event
> +      channels) that wasn't

What about foreign mappings (which are part of the P2M)? Can they be
validly restored prior to restoring page ownership? In how far do you
fully trust xen#1's state to be fully consistent anyway, rather than
perhaps checking it?

> +A domain should not have a dependency on another domain within the same pass.
> +Therefore it would be possible to take advantage of all the CPUs to restore
> +domains in parallel and reduce the overall downtime.

"Dependency" may be ambiguous here. For example, an interdomain event
channel to me necessarily expresses a dependency between two domains.


  parent reply	other threads:[~2021-05-07  9:53 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-06 10:42 [PATCH RFC 0/2] Add a design document for Live Updating Xen Julien Grall
2021-05-06 10:42 ` [PATCH RFC 1/2] docs/design: Add a design document for Live Update Julien Grall
2021-05-06 14:43   ` Paul Durrant
2021-05-07  9:18   ` Hongyan Xia
2021-05-07 10:00     ` Julien Grall
2021-05-07  9:52   ` Jan Beulich [this message]
2021-05-07 11:44     ` Julien Grall
2021-05-07 12:15       ` Jan Beulich
2021-05-07 14:59         ` Xia, Hongyan
2021-05-07 15:28           ` Jan Beulich
2021-05-06 10:42 ` [PATCH RFC 2/2] xen/kexec: Reserve KEXEC_TYPE_LIVEUPDATE and KEXEC_RANGE_MA_LIVEUPDATE Julien Grall
2021-05-07  7:59   ` Paul Durrant
2021-05-07  8:24   ` Hongyan Xia
2021-05-07  8:30     ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f51b2ef6-3998-7371-cea9-502c5c9f8afa@suse.com \
    --to=jbeulich@suse.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=dwmw2@infradead.org \
    --cc=george.dunlap@citrix.com \
    --cc=hongyxia@amazon.com \
    --cc=iwj@xenproject.org \
    --cc=jgrall@amazon.com \
    --cc=julien@xen.org \
    --cc=maghul@amazon.com \
    --cc=paul@xen.org \
    --cc=raphning@amazon.com \
    --cc=sstabellini@kernel.org \
    --cc=wl@xen.org \
    --cc=xen-devel@lists.xenproject.org \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).