From: Paul Durrant
Reply-To: paul@xen.org
Subject: Re: [PATCH RFC 1/2] docs/design: Add a design document for Live Update
To: Julien Grall, xen-devel@lists.xenproject.org
Cc: dwmw2@infradead.org, hongyxia@amazon.com, raphning@amazon.com,
 maghul@amazon.com, Julien Grall, Andrew Cooper, George Dunlap,
 Ian Jackson, Jan Beulich, Stefano Stabellini, Wei Liu
Date: Thu, 6 May 2021 15:43:50 +0100
Message-ID: <00db0c29-337f-4afd-d3a9-ee44b5b74146@xen.org>
In-Reply-To: <20210506104259.16928-2-julien@xen.org>
References: <20210506104259.16928-1-julien@xen.org>
 <20210506104259.16928-2-julien@xen.org>

On 06/05/2021 11:42, Julien Grall wrote:
> From: Julien Grall
>

Looks good in general... just a few comments below...

> Administrators often require updating the Xen hypervisor to address
> security vulnerabilities, introduce new features, or fix software defects.
> Currently, we offer the following methods to perform the update:
>
>  * Rebooting the guests and the host: this is highly disrupting to running
>    guests.
>  * Migrating off the guests, rebooting the host: this currently requires
>    the guest to cooperate (see [1] for a non-cooperative solution) and it
>    may not always be possible to migrate it off (i.e lack of capacity, use
>    of local storage...).
>  * Live patching: This is the less disruptive of the existing methods.
>    However, it can be difficult to prepare the livepatch if the change is
>    large or there are data structures to update.
>
> This patch will introduce a new proposal called "Live Update" which will
> activate new software without noticeable downtime (i.e no - or minimal -
> customer).
>
> Signed-off-by: Julien Grall
> ---
>  docs/designs/liveupdate.md | 254 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 254 insertions(+)
>  create mode 100644 docs/designs/liveupdate.md
>
> diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
> new file mode 100644
> index 000000000000..32993934f4fe
> --- /dev/null
> +++ b/docs/designs/liveupdate.md
> @@ -0,0 +1,254 @@
> +# Live Updating Xen
> +
> +## Background
> +
> +Administrators often require updating the Xen hypervisor to address security
> +vulnerabilities, introduce new features, or fix software defects. Currently,
> +we offer the following methods to perform the update:
> +
> + * Rebooting the guests and the host: this is highly disrupting to running
> +   guests.
> + * Migrating off the guests, rebooting the host: this currently requires
> +   the guest to cooperate (see [1] for a non-cooperative solution) and it
> +   may not always be possible to migrate it off (i.e lack of capacity, use
> +   of local storage...).
> + * Live patching: This is the less disruptive of the existing methods.
> +   However, it can be difficult to prepare the livepatch if the change is
> +   large or there are data structures to update.
> +
> +This document will present a new approach called "Live Update" which will
> +activate new software without noticeable downtime (i.e no - or minimal -
> +customer pain).
> +
> +## Terminology
> +
> +xen#1: Xen version currently active and running on a droplet. This is the
> +“source” for the Live Update operation. This version can actually be newer
> +than xen#2 in case of a rollback operation.
> +
> +xen#2: Xen version that's the “target” of the Live Update operation. This
> +version will become the active version after successful Live Update. This
> +version of Xen can actually be older than xen#1 in case of a rollback
> +operation.
> +
> +## High-level overview
> +
> +Xen has a framework to bring a new image of the Xen hypervisor in memory using
> +kexec. The existing framework does not meet the baseline functionality for
> +Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
> +and all the guests.

Feels like there's a sentence or two missing here. The subject has jumped
from a framework that is not fit for purpose to 'the operation'.

> +
> +The operation can be divided in roughly 4 parts:
> +
> + 1. Trigger: The operation will by triggered from outside the hypervisor
> +    (e.g. dom0 userspace).
> + 2. Save: The state will be stabilized by pausing the domains and
> +    serialized by xen#1.
> + 3. Hand-over: xen#1 will pass the serialized state and transfer control to
> +    xen#2.
> + 4. Restore: The state will be deserialized by xen#2.
> +
> +All the domains will be paused before xen#1 is starting to save the states,

s/is starting/starts

> +and any domain that was running before Live Update will be unpaused after
> +xen#2 has finished to restore the states. This is to prevent a domain to try

s/finished to restore/finished restoring

and

s/domain to try/domain trying

> +to modify the state of another domain while it is being saved/restored.
> +
> +The current approach could be seen as non-cooperative migration with a twist:
> +all the domains (including dom0) are not expected be involved in the Live
> +Update process.
> +
> +The major differences compare to live migration are:

s/compare/compared

> +
> + * The state is not transferred to another host, but instead locally to
> +   xen#2.
> + * The memory content or device state (for passthrough) does not need to
> +   be part of the stream. Instead we need to preserve it.
> + * PV backends, device emulators, xenstored are not recreated but preserved
> +   (as these are part of dom0).
> +
> +
> +Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*) will need
> +to be preserved because another entity may have mappings (e.g foreign, grant)
> +on them.
> +
> +## Trigger
> +
> +Live update is built on top of the kexec interface to prepare the command line,
> +load xen#2 and trigger the operation. A new kexec type has been introduced
> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
> +
> +The Live Update will be triggered from outside the hypervisor (e.g. dom0
> +userspace). Support for the operation has been added in kexec-tools 2.0.21.
> +
> +All the domains will be paused before xen#1 is starting to save the states.

You already said this in the previous section.

> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
> +scheduled. In other words, a pause request will not wait for asynchronous
> +requests (e.g. I/O) to finish. For Live Update, this is not an ideal time to
> +pause because it will require more xen#1 internal state to be transferred.
> +Therefore, all the domains will be paused at an architectural restartable
> +boundary.
> +
> +Live update will not happen synchronously to the request but when all the
> +domains are quiescent. As domains running device emulators (e.g Dom0) will
> +be part of the process to quiesce HVM domains, we will need to let them run
> +until xen#1 is actually starting to save the state. HVM vCPUs will be paused
> +as soon as any pending asynchronous request has finished.
> +
> +In the current implementation, all PV domains will continue to run while the
> +rest will be paused as soon as possible. Note this approach is assuming that
> +device emulators are only running in PV domains.
> +
> +It should be easy to extend to PVH domains not requiring device emulations.
> +It will require more thought if we need to run device models in HVM domains as
> +there might be inter-dependency.
> +
> +## Save
> +
> +xen#1 will be responsible to preserve and serialize the state of each existing
> +domain and any system-wide state (e.g M2P).

s/to preserve and serialize/for preserving and serializing

> +
> +Each domain will be serialized independently using a modified migration stream,
> +if there is any dependency between domains (such as for IOREQ server) they will
> +be recorded using a domid. All the complexity of resolving the dependencies are
> +left to the restore path in xen#2 (more in the *Restore* section).
> +
> +At the moment, the domains are saved one by one in a single thread, but it
> +would be possible to consider multi-threading if it takes too long. Although
> +this may require some adjustment in the stream format.
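
Just to check I'm reading the save format correctly: I picture each record
in the modified stream looking vaguely like the sketch below, with any
cross-domain dependency (e.g. an IOREQ server) carried purely as domids and
left for xen#2 to resolve. The struct and field names here are invented by
me for illustration; they are not what the series actually defines.

/* Illustrative sketch only -- names and layout invented for this reply,
 * not the record format defined by the series. */
#include <stdint.h>

struct lu_rec_hdr {
    uint32_t type;    /* which kind of state follows (vCPU, event channel...) */
    uint32_t length;  /* length of the body in bytes, excluding this header */
};

/* Example body: a cross-domain dependency (e.g. an IOREQ server) is
 * recorded by domid only; xen#2 resolves the reference during restore. */
struct lu_rec_ioreq_server {
    uint16_t target_domid;    /* domain being emulated for */
    uint16_t emulator_domid;  /* domain running the device emulator */
    uint32_t flags;
};

If that is broadly right, carrying dependencies as domids is what keeps the
save side single-pass and pushes all the ordering complexity to the restore
path, as described in the *Restore* section below.
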
> +
> +As we want to be able to Live Update between major versions of Xen (e.g Xen
> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
> +structure but instead the minimal information that allow us to recreate the
> +domains.
> +
> +For instance, we don't want to preserve the frametable (and therefore
> +*struct page\_info*) as-is because the refcounting may be different across
> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
> +*struct page\_info* based on minimal information that are considered stable
> +(such as the page type).
> +
> +Note that upgrading between version of Xen will also require all the hypercalls
> +to be stable. This will not be covered by this document.
> +
> +## Hand over
> +
> +### Memory usage restrictions
> +
> +xen#2 must take care not to use any memory pages which already belong to
> +guests. To facilitate this, a number of contiguous region of memory are
> +reserved for the boot allocator, known as *live update bootmem*.
> +
> +xen#1 will always reserve a region just below Xen (the size is controlled by
> +the Xen command line parameter liveupdate) to allow Xen growing and provide
> +information about LiveUpdate (see the section *Breadcrumb*). The region will be
> +passed to xen#2 using the same command line option but with the base address
> +specified.
> +
> +For simplicity, additional regions will be provided in the stream. They will
> +consist of region that could be re-used by xen#2 during boot (such as the

s/region/a region

  Paul

> +xen#1's frametable memory).
> +
> +xen#2 must not use any pages outside those regions until it has consumed the
> +Live Update data stream and determined which pages are already in use by
> +running domains or need to be re-used as-is by Xen (e.g M2P).
> +
> +At run time, Xen may use memory from the reserved region for any purpose that
> +does not require preservation over a Live Update; in particular it __must__ not be
> +mapped to a domain or used by any Xen state requiring to be preserved (e.g
> +M2P). In other word, the xenheap pages could be allocated from the reserved
> +regions if we remove the concept of shared xenheap pages.
> +
> +The xen#2's binary may be bigger (or smaller) compare to xen#1's binary. So
> +for the purpose of loading xen#2 binary, kexec should treat the reserved memory
> +right below xen#1 and its region as a single contiguous space. xen#2 will be
> +loaded right at the top of the contiguous space and the rest of the memory will
> +be the new reserved memory (this may shrink or grow). For that reason, freed
> +init memory from xen#1 image is also treated as reserved liveupdate update
> +bootmem.
> +
> +### Live Update data stream
> +
> +During handover, xen#1 creates a Live Update data stream containing all the
> +information required by the new Xen#2 to restore all the domains.
> +
> +Data pages for this stream may be allocated anywhere in physical memory outside
> +the *live update bootmem* regions.
> +
> +As calling __vmap()__/__vunmap()__ has a cost on the downtime. We want to reduce the
> +number of call to __vmap()__ when restoring the stream. Therefore the stream
> +will be contiguously virtually mapped in xen#2. xen#1 will create an array of
> +MFNs of the allocated data pages, suitable for passing to __vmap()__. The
> +array will be physically contiguous but the MFNs don't need to be physically
> +contiguous.
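
On the vmap() point, just to make sure I follow: the consumer side in xen#2
would be something like the sketch below -- a single vmap() over the MFN
array that the breadcrumb (next section) points at. The struct, field and
helper names are mine, invented purely for illustration; only vmap() itself
is the existing interface, and this is just how I read the proposal, not
the implementation.

/* Rough sketch of how xen#2 might map the stream; the breadcrumb layout
 * and helper names are invented for this reply, not taken from the series. */
#include <xen/init.h>
#include <xen/mm.h>
#include <xen/vmap.h>
#include <asm/page.h>     /* maddr_to_virt() */

struct lu_breadcrumb {
    unsigned long nr_pages;      /* number of data pages in the stream */
    paddr_t       mfn_array_pa;  /* physical address of the MFN array */
};

static void *__init lu_map_stream(const struct lu_breadcrumb *bc)
{
    /* The MFN array is physically contiguous, so it can be read directly;
     * the data pages it lists need not be physically contiguous. */
    const mfn_t *mfns = maddr_to_virt(bc->mfn_array_pa);

    /* One call maps the whole stream virtually contiguously, so restore
     * pays the vmap() cost once rather than once per data page. */
    return vmap(mfns, bc->nr_pages);
}

i.e. the number of vmap() calls during restore stays at one, however many
data pages xen#1 ended up allocating for the stream.
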
> +
> +### Breadcrumb
> +
> +Since the Live Update data stream is created during the final **kexec\_exec**
> +hypercall, its address cannot be passed on the command line to the new Xen
> +since the command line needs to have been set up by **kexec(8)** in userspace
> +long beforehand.
> +
> +Thus, to allow the new Xen to find the data stream, xen#1 places a breadcrumb
> +in the first words of the Live Update bootmem, containing the number of data
> +pages, and the physical address of the contiguous MFN array.
> +
> +### IOMMU
> +
> +Where devices are passed through to domains, it may not be possible to quiesce
> +those devices for the purpose of performing the update.
> +
> +If performing Live Update with assigned devices, xen#1 will leave the IOMMU
> +mappings active during the handover (thus implying that IOMMU page tables may
> +not be allocated in the *live update bootmem* region either).
> +
> +xen#2 must take control of the IOMMU without causing those mappings to become
> +invalid even for a short period of time. In other words, xen#2 should not
> +re-setup the IOMMUs. On hardware which does not support Posted Interrupts,
> +interrupts may need to be generated on resume.
> +
> +## Restore
> +
> +After xen#2 initialized itself and map the stream, it will be responsible to
> +restore the state of the system and each domain.
> +
> +Unlike the save part, it is not possible to restore a domain in a single pass.
> +There are dependencies between:
> +
> + 1. different states of a domain. For instance, the event channels ABI
> +    used (2l vs fifo) requires to be restored before restoring the event
> +    channels.
> + 2. the same "state" within a domain. For instance, in case of PV domain,
> +    the pages' ownership requires to be restored before restoring the type
> +    of the page (e.g is it an L4, L1... table?).
> +
> + 3. domains. For instance when restoring the grant mapping, it will be
> +    necessary to have the page's owner in hand to do proper refcounting.
> +    Therefore the pages' ownership have to be restored first.
> +
> +Dependencies will be resolved using either multiple passes (for dependency
> +type 2 and 3) or using a specific ordering between records (for dependency
> +type 1).
> +
> +Each domain will be restored in 3 passes:
> +
> + * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
> +   down in 3 parts:
> +     * Allocate a domain via _domain\_create()_ but skip part that requires
> +       extra records (e.g HAP, P2M).
> +     * Restore any parts which needs to be done before create the vCPUs. This
> +       including restoring the P2M and whether HAP is used.
> +     * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
> + * Pass 1: It will restore the pages' ownership and the grant-table frames
> + * Pass 2: This steps will restore any domain states (e.g vCPU state, event
> +   channels) that wasn't
> +
> +A domain should not have a dependency on another domain within the same pass.
> +Therefore it would be possible to take advantage of all the CPUs to restore
> +domains in parallel and reduce the overall downtime.
> +
> +Once all the domains have been restored, they will be unpaused if they were
> +running before Live Update.
> +
> +* * *
> +[1] https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md;h=4b876d809fb5b8aac02d29fd7760a5c0d5b86d87;hb=HEAD
> +
>
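
One last, more hand-wavy thought on the Restore section: the pass structure
reads to me like the loop below. The function and type names are made up
for this reply (they don't correspond to the actual patches); it's only to
check my understanding of the ordering, and it makes it obvious where the
per-pass parallelism across CPUs you mention could slot in later.

/* Hand-wavy sketch of the restore ordering only -- names invented here. */
struct lu_stream;  /* opaque handle onto the mapped Live Update stream */

int lu_restore_pass0(struct lu_stream *s, unsigned int idx); /* create domain, P2M/HAP, vCPUs */
int lu_restore_pass1(struct lu_stream *s, unsigned int idx); /* page ownership, grant-table frames */
int lu_restore_pass2(struct lu_stream *s, unsigned int idx); /* remaining state (vCPUs, event channels...) */

static int lu_restore_all(struct lu_stream *s, unsigned int nr_domains)
{
    unsigned int i;
    int rc;

    /* Pass 0: every domain must exist (with vCPUs created) before any
     * cross-domain reference can be resolved in later passes. */
    for ( i = 0; i < nr_domains; i++ )
        if ( (rc = lu_restore_pass0(s, i)) )
            return rc;

    /* Pass 1: page ownership and grant-table frames. */
    for ( i = 0; i < nr_domains; i++ )
        if ( (rc = lu_restore_pass1(s, i)) )
            return rc;

    /* Pass 2: the rest of the per-domain state, which may take references
     * on pages owned by other domains now that ownership is known. */
    for ( i = 0; i < nr_domains; i++ )
        if ( (rc = lu_restore_pass2(s, i)) )
            return rc;

    /* Domains that were running before Live Update are unpaused elsewhere,
     * once every domain has been restored. */
    return 0;
}

Since domains within a pass don't depend on each other, each of the three
inner loops is what could later be farmed out across CPUs to cut the
overall downtime.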