Subject: Re: [PATCH RFC 1/2] docs/design: Add a design document for Live Update
From: Julien Grall
To: Hongyan Xia, xen-devel@lists.xenproject.org
Cc: dwmw2@infradead.org, paul@xen.org, raphning@amazon.com,
 maghul@amazon.com, Julien Grall, Andrew Cooper, George Dunlap,
 Ian Jackson, Jan Beulich, Stefano Stabellini, Wei Liu
Date: Fri, 7 May 2021 11:00:02 +0100
References: <20210506104259.16928-1-julien@xen.org>
 <20210506104259.16928-2-julien@xen.org>
 <288e5af69a356060b9fce6c6fa77324950dac2c2.camel@xen.org>
In-Reply-To: <288e5af69a356060b9fce6c6fa77324950dac2c2.camel@xen.org>
Hi Hongyan,

On 07/05/2021 10:18, Hongyan Xia wrote:
> On Thu, 2021-05-06 at 11:42 +0100, Julien Grall wrote:
>> From: Julien Grall
>>
>> Administrators often require updating the Xen hypervisor to address
>> security vulnerabilities, introduce new features, or fix software
>> defects. Currently, we offer the following methods to perform the
>> update:
>>
>>  * Rebooting the guests and the host: this is highly disruptive to
>>    running guests.
>>  * Migrating off the guests, rebooting the host: this currently
>>    requires the guest to cooperate (see [1] for a non-cooperative
>>    solution) and it may not always be possible to migrate it off
>>    (i.e. lack of capacity, use of local storage...).
>>  * Live patching: this is the least disruptive of the existing
>>    methods. However, it can be difficult to prepare the livepatch
>>    if the change is large or there are data structures to update.
>
> Might want to mention that live patching slowly consumes memory,
> fragments the Xen image and degrades performance (especially when
> the patched code is on the critical path).

My goal wasn't to list all the drawbacks of each existing method.
Instead, I wanted to give one important reason for each of them.

I would prefer to keep the list as it is unless someone needs more
arguments about introducing a new approach.

>
>> This patch will introduce a new proposal called "Live Update" which
>> will activate new software without noticeable downtime (i.e. no -
>> or minimal - customer pain).
>>
>> Signed-off-by: Julien Grall
>> ---
>>  docs/designs/liveupdate.md | 254 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 254 insertions(+)
>>  create mode 100644 docs/designs/liveupdate.md
>>
>> diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
>> new file mode 100644
>> index 000000000000..32993934f4fe
>> --- /dev/null
>> +++ b/docs/designs/liveupdate.md
>> @@ -0,0 +1,254 @@
>> +# Live Updating Xen
>> +
>> +## Background
>> +
>> +Administrators often require updating the Xen hypervisor to address
>> +security vulnerabilities, introduce new features, or fix software
>> +defects. Currently, we offer the following methods to perform the
>> +update:
>> +
>> + * Rebooting the guests and the host: this is highly disruptive to
>> +   running guests.
>> + * Migrating off the guests, rebooting the host: this currently
>> +   requires the guest to cooperate (see [1] for a non-cooperative
>> +   solution) and it may not always be possible to migrate it off
>> +   (i.e. lack of capacity, use of local storage...).
>> + * Live patching: this is the least disruptive of the existing
>> +   methods. However, it can be difficult to prepare the livepatch
>> +   if the change is large or there are data structures to update.
>> +
>> +This document will present a new approach called "Live Update"
>> +which will activate new software without noticeable downtime (i.e.
>> +no - or minimal - customer pain).
>> +
>> +## Terminology
>> +
>> +xen#1: Xen version currently active and running on a droplet. This
>> +is the “source” for the Live Update operation. This version can
>> +actually be newer than xen#2 in case of a rollback operation.
>> +
>> +xen#2: Xen version that's the “target” of the Live Update
>> +operation. This version will become the active version after a
>> +successful Live Update. This version of Xen can actually be older
>> +than xen#1 in case of a rollback operation.
>
> A bit redundant since it was mentioned in xen#1 already.

Definitions tend to be redundant. So I would prefer to keep it like
that.

>
>> +
>> +## High-level overview
>> +
>> +Xen has a framework to bring a new image of the Xen hypervisor in
>> +memory using kexec. The existing framework does not meet the
>> +baseline functionality for Live Update, since kexec results in a
>> +restart for the hypervisor, host, Dom0, and all the guests.
>> +
>> +The operation can be divided into roughly 4 parts:
>> +
>> + 1. Trigger: The operation will be triggered from outside the
>> +    hypervisor (e.g. dom0 userspace).
>> + 2. Save: The state will be stabilized by pausing the domains and
>> +    serialized by xen#1.
>> + 3. Hand-over: xen#1 will pass the serialized state and transfer
>> +    control to xen#2.
>> + 4. Restore: The state will be deserialized by xen#2.
>> +
>> +All the domains will be paused before xen#1 starts to save the
>> +states, and any domain that was running before Live Update will be
>> +unpaused after xen#2 has finished restoring the states. This is to
>> +prevent a domain from trying to modify the state of another domain
>> +while it is being saved/restored.
>> +
>> +The current approach could be seen as non-cooperative migration
>> +with a twist: all the domains (including dom0) are not expected to
>> +be involved in the Live Update process.
>> +
>> +The major differences compared to live migration are:
>> +
>> + * The state is not transferred to another host, but instead
>> +   locally to xen#2.
>> + * The memory content or device state (for passthrough) does not
>> +   need to be part of the stream. Instead we need to preserve it.
>> + * PV backends, device emulators, xenstored are not recreated but
>> +   preserved (as these are part of dom0).
>> +
>> +Domains in the process of being destroyed
>> +(*XEN\_DOMCTL\_destroydomain*) will need to be preserved because
>> +another entity may have mappings (e.g. foreign, grant) on them.
>> +
>> +## Trigger
>> +
>> +Live update is built on top of the kexec interface to prepare the
>> +command line, load xen#2 and trigger the operation. A new kexec
>> +type has been introduced (*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify
>> +Xen to Live Update.
>> +
>> +The Live Update will be triggered from outside the hypervisor
>> +(e.g. dom0 userspace). Support for the operation has been added in
>> +kexec-tools 2.0.21.
>> +
>> +All the domains will be paused before xen#1 starts to save the
>> +states. In Xen, *domain\_pause()* will pause the vCPUs as soon as
>> +they can be re-scheduled. In other words, a pause request will not
>> +wait for asynchronous requests (e.g. I/O) to finish. For Live
>> +Update, this is not an ideal time to pause because it will require
>> +more xen#1 internal state to be transferred. Therefore, all the
>> +domains will be paused at an architecturally restartable boundary.
>> +
>> +Live update will not happen synchronously to the request but when
>> +all the domains are quiescent. As domains running device emulators
>> +(e.g. Dom0) will be part of the process to quiesce HVM domains, we
>> +will need to let them run until xen#1 actually starts to save the
>> +state. HVM vCPUs will be paused as soon as any pending asynchronous
>> +request has finished.
>> +
>> +In the current implementation, all PV domains will continue to run
>> +while the rest will be paused as soon as possible. Note this
>> +approach assumes that device emulators are only running in PV
>> +domains.
>> +
>> +It should be easy to extend to PVH domains not requiring device
>> +emulation. It will require more thought if we need to run device
>> +models in HVM domains as there might be inter-dependencies.
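
To make the trigger step more concrete, a dom0 userspace caller could
look roughly like the sketch below. This is illustrative only: it
assumes the KEXEC_TYPE_LIVE_UPDATE type introduced by this series,
and that xen#2 has already been loaded with xc_kexec_load() (which
kexec-tools >= 2.0.21 takes care of).

/* Illustrative sketch only -- not the actual kexec-tools code. */
#include <stdio.h>
#include <xenctrl.h>

int main(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    int rc;

    if ( !xch )
    {
        fprintf(stderr, "Cannot open a libxenctrl handle\n");
        return 1;
    }

    /* Ask xen#1 to quiesce the domains, serialize their state and
     * hand over to the previously loaded xen#2. Depending on the
     * implementation, a successful call may only appear to return
     * once xen#2 has resumed the domains. */
    rc = xc_kexec_exec(xch, KEXEC_TYPE_LIVE_UPDATE);

    xc_interface_close(xch);
    return rc ? 1 : 0;
}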
>> +
>> +## Save
>> +
>> +xen#1 will be responsible for preserving and serializing the state
>> +of each existing domain and any system-wide state (e.g. M2P).
>> +
>> +Each domain will be serialized independently using a modified
>> +migration stream. If there is any dependency between domains (such
>> +as for an IOREQ server), it will be recorded using a domid. All
>> +the complexity of resolving the dependencies is left to the
>> +restore path in xen#2 (more in the *Restore* section).
>> +
>> +At the moment, the domains are saved one by one in a single
>> +thread, but it would be possible to consider multi-threading if it
>> +takes too long, although this may require some adjustments to the
>> +stream format.
>> +
>> +As we want to be able to Live Update between major versions of Xen
>> +(e.g. Xen 4.11 -> Xen 4.15), the states preserved should not be a
>> +dump of Xen's internal structures but instead the minimal
>> +information that allows us to recreate the domains.
>> +
>> +For instance, we don't want to preserve the frametable (and
>> +therefore *struct page\_info*) as-is because the refcounting may
>> +be different between xen#1 and xen#2 (see XSA-299). Instead, we
>> +want to be able to recreate *struct page\_info* based on minimal
>> +information that is considered stable (such as the page type).
>> +
>> +Note that upgrading between versions of Xen will also require all
>> +the hypercalls to be stable. This will not be covered by this
>> +document.
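
To illustrate what "minimal information" could mean in practice, the
stream could carry self-describing records in the spirit of the libxc
migration stream. The layout below is purely hypothetical:

/* Hypothetical record layout, loosely modelled on the libxc
 * migration stream headers -- the actual format is still to be
 * defined. The key point is that records carry stable, minimal
 * information (domids, page types) rather than dumps of xen#1's
 * internal structures. */
#include <stdint.h>

struct lu_rec_hdr {
    uint32_t type;    /* record type (e.g. page ownership, vCPU) */
    uint32_t length;  /* length of the body in bytes */
    /* 'length' bytes of body follow, padded to a multiple of 8 */
};

/* Example body: a dependency between domains (such as an IOREQ
 * server) is recorded using domids and resolved by the restore path
 * in xen#2. */
struct lu_rec_ioreq_server {
    uint16_t target_domid;    /* domain the IOREQ server serves */
    uint16_t emulator_domid;  /* domain running the device emulator */
    uint32_t id;              /* IOREQ server id within the target */
};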
>> +
>> +## Hand over
>> +
>> +### Memory usage restrictions
>> +
>> +xen#2 must take care not to use any memory pages which already
>> +belong to guests. To facilitate this, a number of contiguous
>> +regions of memory are reserved for the boot allocator, known as
>> +*live update bootmem*.
>> +
>> +xen#1 will always reserve a region just below Xen (the size is
>> +controlled by the Xen command line parameter liveupdate) to allow
>> +Xen to grow and to provide information about Live Update (see the
>> +section *Breadcrumb*). The region will be passed to xen#2 using
>> +the same command line option but with the base address specified.
>
> The size of the command line option may not be the same depending
> on the size of xen#2.

Good point. I will update it.

>
>> +
>> +For simplicity, additional regions will be provided in the stream.
>> +They will consist of regions that could be re-used by xen#2 during
>> +boot (such as xen#1's frametable memory).
>> +
>> +xen#2 must not use any pages outside those regions until it has
>> +consumed the Live Update data stream and determined which pages
>> +are already in use by running domains or need to be re-used as-is
>> +by Xen (e.g. M2P).
>> +
>> +At run time, Xen may use memory from the reserved region for any
>> +purpose that does not require preservation over a Live Update; in
>> +particular it __must__ not be mapped to a domain or used by any
>> +Xen state requiring to be preserved (e.g. M2P). In other words,
>> +the xenheap pages could be allocated from the reserved regions if
>> +we remove the concept of shared xenheap pages.
>> +
>> +xen#2's binary may be bigger (or smaller) compared to xen#1's
>> +binary. So for the purpose of loading the xen#2 binary, kexec
>> +should treat the reserved memory right below xen#1 and its region
>> +as a single contiguous space. xen#2 will be loaded right at the
>> +top of the contiguous space and the rest of the memory will be the
>> +new reserved memory (this may shrink or grow). For that reason,
>> +freed init memory from the xen#1 image is also treated as reserved
>> +liveupdate update
>
> s/update//
>
> This is explained quite well actually, but I wonder if we can move
> this part closer to the liveupdate command line section (they both
> talk about the initial bootmem region and Xen size changes). After
> that, we then talk about multiple regions and how we should use
> them.

Just for clarification, do you mean moving it after "The region will
be passed to xen#2 using the same command line option but with the
base address specified."?

>> +bootmem.
>> +
>> +### Live Update data stream
>> +
>> +During handover, xen#1 creates a Live Update data stream
>> +containing all the information required by the new xen#2 to
>> +restore all the domains.
>> +
>> +Data pages for this stream may be allocated anywhere in physical
>> +memory outside the *live update bootmem* regions.
>> +
>> +As calling __vmap()__/__vunmap()__ has a cost on the downtime, we
>> +want to reduce the number of calls to __vmap()__ when restoring
>> +the stream. Therefore the stream will be contiguously virtually
>> +mapped in xen#2. xen#1 will create an array of
>
> Using vmap during restore for a contiguous range sounds more like an
> implementation and optimisation detail to me rather than an ABI
> requirement, so I would s/the stream will be/the stream can be/.

I will do.

>
>> +MFNs of the allocated data pages, suitable for passing to
>> +__vmap()__. The array will be physically contiguous but the MFNs
>> +don't need to be physically contiguous.
>> +
>> +### Breadcrumb
>> +
>> +Since the Live Update data stream is created during the final
>> +**kexec\_exec** hypercall, its address cannot be passed on the
>> +command line to the new Xen since the command line needs to have
>> +been set up by **kexec(8)** in userspace long beforehand.
>> +
>> +Thus, to allow the new Xen to find the data stream, xen#1 places a
>> +breadcrumb in the first words of the Live Update bootmem,
>> +containing the number of data pages, and the physical address of
>> +the contiguous MFN array.
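
Put together, the breadcrumb and the mapping of the stream by xen#2
could look roughly as follows. This is a sketch: the field and helper
names are invented, and only vmap() is the existing Xen interface
(it takes a physically contiguous array of MFNs and maps them at
consecutive virtual addresses).

/* Hypothetical breadcrumb layout -- field names are illustrative,
 * not the actual ABI. xen#1 writes this at the very start of the
 * Live Update bootmem during the final kexec_exec hypercall. */
#include <xen/types.h>
#include <xen/mm.h>
#include <xen/vmap.h>

struct lu_breadcrumb {
    uint64_t nr_pages;        /* number of data pages in the stream */
    uint64_t mfn_array_paddr; /* physical address of the MFN array */
};

/* Sketch of how xen#2 could locate and map the stream early during
 * boot, assuming lu_bootmem_base was taken from the liveupdate
 * command line option and is covered by the directmap. */
static void *lu_map_stream(paddr_t lu_bootmem_base)
{
    const struct lu_breadcrumb *bc = maddr_to_virt(lu_bootmem_base);
    const mfn_t *mfns = maddr_to_virt(bc->mfn_array_paddr);

    /* A single vmap() call maps the whole stream at contiguous
     * virtual addresses, keeping the number of vmap() calls -- and
     * therefore the downtime -- low. */
    return vmap(mfns, bc->nr_pages);
}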
>> +
>> +### IOMMU
>> +
>> +Where devices are passed through to domains, it may not be
>> +possible to quiesce those devices for the purpose of performing
>> +the update.
>> +
>> +If performing Live Update with assigned devices, xen#1 will leave
>> +the IOMMU mappings active during the handover (thus implying that
>> +IOMMU page tables may not be allocated in the *live update
>> +bootmem* region either).
>> +
>> +xen#2 must take control of the IOMMU without causing those
>> +mappings to become invalid even for a short period of time. In
>> +other words, xen#2 should not re-set up the IOMMUs. On hardware
>> +which does not support Posted Interrupts, interrupts may need to
>> +be generated on resume.
>> +
>> +## Restore
>> +
>> +After xen#2 has initialized itself and mapped the stream, it will
>> +be responsible for restoring the state of the system and of each
>> +domain.
>> +
>> +Unlike the save part, it is not possible to restore a domain in a
>> +single pass. There are dependencies between:
>> +
>> + 1. different states of a domain. For instance, the event
>> +    channels ABI used (2l vs fifo) needs to be restored before the
>> +    event channels themselves.
>> + 2. the same "state" within a domain. For instance, in the case
>> +    of a PV domain, the pages' ownership needs to be restored
>> +    before the type of each page (e.g. is it an L4, L1... table?).
>> + 3. domains. For instance, when restoring the grant mappings, it
>> +    will be necessary to have the page's owner in hand to do
>> +    proper refcounting. Therefore the pages' ownership has to be
>> +    restored first.
>> +
>> +Dependencies will be resolved using either multiple passes (for
>> +dependency types 2 and 3) or a specific ordering between records
>> +(for dependency type 1).
>> +
>> +Each domain will be restored in 3 passes:
>> +
>> + * Pass 0: Create the domain and restore the P2M for HVM. This
>> +   can be broken down into 3 parts:
>> +   * Allocate a domain via _domain\_create()_ but skip the parts
>> +     that require extra records (e.g. HAP, P2M).
>> +   * Restore any parts which need to be done before creating the
>> +     vCPUs. This includes restoring the P2M and whether HAP is
>> +     used.
>> +   * Create the vCPUs. Note this doesn't restore the state of the
>> +     vCPUs.
>> + * Pass 1: It will restore the pages' ownership and the
>> +   grant-table frames.
>> + * Pass 2: This step will restore any domain states (e.g. vCPU
>> +   state, event channels) that wasn't
>
> Sentence seems incomplete.

I can add 'already restored' if that clarifies it?
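
As an aside, to make the multi-pass restore easier to picture, the
flow in xen#2 could be sketched as below. struct lu_stream and every
helper and macro used here are invented for illustration:

/* Rough sketch of the restore flow in xen#2 -- not the actual
 * implementation. */
static int lu_restore_all(struct lu_stream *stream)
{
    unsigned int pass;
    int rc;

    for ( pass = 0; pass <= 2; pass++ )
    {
        struct lu_domain_ctx *ctx;

        /* Iterate over the serialized domains in the stream. */
        for_each_lu_domain ( stream, ctx )
        {
            switch ( pass )
            {
            case 0:
                /* Create the domain, restore the P2M for HVM and
                 * create the (still unrestored) vCPUs. */
                rc = lu_restore_create(ctx);
                break;
            case 1:
                /* Pages' ownership and grant-table frames, needed
                 * for proper refcounting in the next pass. */
                rc = lu_restore_pages(ctx);
                break;
            case 2:
                /* Remaining state: vCPU state, event channels... */
                rc = lu_restore_state(ctx);
                break;
            }
            if ( rc )
                return rc;
        }
    }

    return 0;
}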

Cheers,

-- 
Julien Grall