From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-23.2 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8ACADC433ED for ; Tue, 13 Apr 2021 19:20:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6BDD8613C6 for ; Tue, 13 Apr 2021 19:20:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1343498AbhDMTUk (ORCPT ); Tue, 13 Apr 2021 15:20:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38196 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229458AbhDMTUe (ORCPT ); Tue, 13 Apr 2021 15:20:34 -0400 Received: from mail-io1-xd2e.google.com (mail-io1-xd2e.google.com [IPv6:2607:f8b0:4864:20::d2e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84FAAC061574 for ; Tue, 13 Apr 2021 12:20:14 -0700 (PDT) Received: by mail-io1-xd2e.google.com with SMTP id x16so18240499iob.1 for ; Tue, 13 Apr 2021 12:20:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=qUYPcj8zrA5dCXV+Qd7SV6lPwqpZcDSVkNd2rz1KGP4=; b=Gzs3YBuQdikVGHOQ0Jpqli7GUxYflzrfBuLoFvlAAy3NqLjZ2PRBtB8+1VwIE6isCY kIZAfWUdKN6XC8NSnk1z6P3jdfds4l2PBFqOb2ggsMHN1YV0iDxAKcwWfxh5PWfXJo8F 0UwJe6JNkq1vB5bMA+8S15rDSFnHzbSWvArJTsdlqGKNkiUkcgifJjjdc5+msvxn1KIs RQvX4ifvcr1Etk9Ogernfx6NjX1pQZLdYRT4sc9HxLtJs5OgBf2lu9Urc21zEe4/cPgY 2+eTltF7ldJ79SwjjCjExpTTN29ZmNlkyNVl36xG0zRcS0M+/otq98h6Cp0o0FnwXwEf kW3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=qUYPcj8zrA5dCXV+Qd7SV6lPwqpZcDSVkNd2rz1KGP4=; b=t0kRLAKhZPRcsqHlsUckprBPgpR+1+EkPXw5JmVX6DzWPCxiE2CME4V9Np1XEhD2iJ uvJfavrzy9QgqFpmS1EwhFGSkTM2vRbWn7sHOVCgnmzc7+Br0yJXouhEsIbbv+hA9cPd wU0njhF0vGa/xgzGz3WnIA33W9xJwiuD23/fAdKh8K6et45OfEqRaF0V9UsYbSbLh9HX ef1fqaJD4QMvLY/CayDB2t1QDsr1I5+hgGpzCeGtkx2VNBWJO+YhG546EgEC0xZBrpxY UgqffMHss0KCgp3Ryu5zuXcFBVfqXobtXlOr246NBf3uaOMpy7iKK4NiAhl0/Lk/W5zV 7QVQ== X-Gm-Message-State: AOAM532z4OVsnp8Q03chE21UCBQ4xEzIYOG4FMEAKe1aUzMG6nPTcVUd aNfamrTkDQwA+RnxMWzaOOA6O8rMHcgwANLQWFqHkg== X-Google-Smtp-Source: ABdhPJwSwBukOLXixI0WMfs2vnVZ5rnPEXzOUNbTKi4Duil4sVjTdcPyLwsKfy79OAMMKO+qB+v38GYYF6GQf0GUMig= X-Received: by 2002:a5d:83cf:: with SMTP id u15mr27662714ior.34.1618341613707; Tue, 13 Apr 2021 12:20:13 -0700 (PDT) MIME-Version: 1.0 References: <20210413014821.GA3276@ashkalra_ubuntu_server> <20210413114712.GA3996@ashkalra_ubuntu_server> In-Reply-To: <20210413114712.GA3996@ashkalra_ubuntu_server> From: Steve Rutherford Date: Tue, 13 Apr 2021 12:19:37 -0700 Message-ID: Subject: Re: [PATCH v12 13/13] x86/kvm: Add kexec support for SEV Live Migration. To: Ashish Kalra Cc: Paolo Bonzini , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Joerg Roedel , Borislav Petkov , Tom Lendacky , X86 ML , KVM list , LKML , Sean Christopherson , Venu Busireddy , Brijesh Singh , kexec@lists.infradead.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 13, 2021 at 4:47 AM Ashish Kalra wrote: > > On Mon, Apr 12, 2021 at 07:25:03PM -0700, Steve Rutherford wrote: > > On Mon, Apr 12, 2021 at 6:48 PM Ashish Kalra wrote: > > > > > > On Mon, Apr 12, 2021 at 06:23:32PM -0700, Steve Rutherford wrote: > > > > On Mon, Apr 12, 2021 at 5:22 PM Steve Rutherford wrote: > > > > > > > > > > On Mon, Apr 12, 2021 at 12:48 PM Ashish Kalra wrote: > > > > > > > > > > > > From: Ashish Kalra > > > > > > > > > > > > Reset the host's shared pages list related to kernel > > > > > > specific page encryption status settings before we load a > > > > > > new kernel by kexec. We cannot reset the complete > > > > > > shared pages list here as we need to retain the > > > > > > UEFI/OVMF firmware specific settings. > > > > > > > > > > > > The host's shared pages list is maintained for the > > > > > > guest to keep track of all unencrypted guest memory regions, > > > > > > therefore we need to explicitly mark all shared pages as > > > > > > encrypted again before rebooting into the new guest kernel. > > > > > > > > > > > > Signed-off-by: Ashish Kalra > > > > > > --- > > > > > > arch/x86/kernel/kvm.c | 24 ++++++++++++++++++++++++ > > > > > > 1 file changed, 24 insertions(+) > > > > > > > > > > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > > > > > > index bcc82e0c9779..4ad3ed547ff1 100644 > > > > > > --- a/arch/x86/kernel/kvm.c > > > > > > +++ b/arch/x86/kernel/kvm.c > > > > > > @@ -39,6 +39,7 @@ > > > > > > #include > > > > > > #include > > > > > > #include > > > > > > +#include > > > > > > > > > > > > DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled); > > > > > > > > > > > > @@ -384,6 +385,29 @@ static void kvm_pv_guest_cpu_reboot(void *unused) > > > > > > */ > > > > > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) > > > > > > wrmsrl(MSR_KVM_PV_EOI_EN, 0); > > > > > > + /* > > > > > > + * Reset the host's shared pages list related to kernel > > > > > > + * specific page encryption status settings before we load a > > > > > > + * new kernel by kexec. NOTE: We cannot reset the complete > > > > > > + * shared pages list here as we need to retain the > > > > > > + * UEFI/OVMF firmware specific settings. > > > > > > + */ > > > > > > + if (sev_live_migration_enabled & (smp_processor_id() == 0)) { > > > > > What happens if the reboot of CPU0 races with another CPU servicing a > > > > > device request (while the reboot is pending for that CPU)? > > > > > Seems like you could run into a scenario where you have hypercalls racing. > > > > > > > > > > Calling this on every core isn't free, but it is an easy way to avoid this race. > > > > > You could also count cores, and have only last core do the job, but > > > > > that seems more complicated. > > > > On second thought, I think this may be insufficient as a fix, since my > > > > read of kernel/reboot.c seems to imply that devices aren't shutdown > > > > until after these notifiers occur. As such, a single thread might be > > > > able to race with itself. I could be wrong here though. > > > > > > > > The heavy hammer would be to disable migration through the MSR (which > > > > the subsequent boot will re-enable). > > > > > > > > I'm curious if there is a less "blocking" way of handling kexecs (that > > > > strategy would block LM while the guest booted). > > > > > > > > One option that comes to mind would be for the guest to "mute" the > > > > encryption status hypercall after the call to reset the encryption > > > > status. The problem would be that the encryption status for pages > > > > would be very temporarily inaccurate in the window between that call > > > > and the start of the next boot. That isn't ideal, but, on the other > > > > hand, the VM was about to reboot anyway, so a corrupted shared page > > > > for device communication probably isn't super important. Still, I'm > > > > not really a fan of that. This would avoid corrupting the next boot, > > > > which is clearly an improvement. > > > > > > > > Each time the kernel boots it could also choose something like a > > > > generation ID, and pass that down each time it calls the hypercall. > > > > This would then let userspace identify which requests were coming from > > > > the subsequent boot. > > > > > > > > Everything here (except, perhaps, disabling migration through the MSR) > > > > seems kind of complicated. I somewhat hope my interpretation of > > > > kernel/reboot.c is wrong and this race just is not possible in the > > > > first place. > > > > > > > > > > Disabling migration through the MSR after resetting the page encryption > > > status is a reasonable approach. There is a similar window existing for > > > normal VM boot during which LM is disabled, from the point where OVMF > > > checks and adds support for SEV LM and the kernel boot checks for the > > > same and enables LM using the MSR. > > > > I'm not totally confident that disabling LM through the MSR is > > sufficient. I also think the newly booted kernel needs to reset the > > state itself, since nothing stops the hypercalls after the disable > > goes through. The host won't know the difference between early boot > > (pre-enablement) hypercalls and racy just-before-restart hypercalls. > > You might disable migration through the hypercall, get a late status > > change hypercall, reboot, then re-enable migration, but still have > > stale state. > > > > I _believe_ that the kernel doesn't mark it's RAM as private on boot > > as an optimization (might be wrong about this), since it would have > > been expensive to mark all of ram as encrypted previously. I believe > > that is no longer a limitation given the KVM_EXIT, so we can reset > > this during early boot instead of just before the kexec. > > > > I was wondering if disabling both migration (via the MSR) and "muting" > the hypercall using the "sev_live_migration_enabled" variable after the > page encryption status has been reset, will reset the page encryption > status of the guest to the (last known/good) configuration available to > the guest at boot time (i.e, all RAM pages marked as private and UEFI > setup shared MMIO/device regions, etc). > > But disabling migration and muting hypercalls after page encryption > status reset is still "racy" with hypercalls on other vCPUS, and that > can potentially mess-up the page encryption status available to guest > after kexec. > > So probably, as you mentioned above, resetting the page encryption > status during early boot (immediately after detecting host support for > migration and enabling the hypercalls) instead of just before the kexec > is a good fix. That strategy sounds good to me. Thanks, Steve > > Thanks, > Ashish > > > > > > > + int i; > > > > > > + unsigned long nr_pages; > > > > > > + > > > > > > + for (i = 0; i < e820_table->nr_entries; i++) { > > > > > > + struct e820_entry *entry = &e820_table->entries[i]; > > > > > > + > > > > > > + if (entry->type != E820_TYPE_RAM) > > > > > > + continue; > > > > > > + > > > > > > + nr_pages = DIV_ROUND_UP(entry->size, PAGE_SIZE); > > > > > > + > > > > > > + kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS, > > > > > > + entry->addr, nr_pages, 1); > > > > > > + } > > > > > > + } > > > > > > kvm_pv_disable_apf(); > > > > > > kvm_disable_steal_time(); > > > > > > } > > > > > > -- > > > > > > 2.17.1 > > > > > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-path: Received: from mail-io1-xd2f.google.com ([2607:f8b0:4864:20::d2f]) by bombadil.infradead.org with esmtps (Exim 4.94 #2 (Red Hat Linux)) id 1lWOa7-007HGK-0N for kexec@lists.infradead.org; Tue, 13 Apr 2021 19:20:16 +0000 Received: by mail-io1-xd2f.google.com with SMTP id h141so9836174iof.2 for ; Tue, 13 Apr 2021 12:20:14 -0700 (PDT) MIME-Version: 1.0 References: <20210413014821.GA3276@ashkalra_ubuntu_server> <20210413114712.GA3996@ashkalra_ubuntu_server> In-Reply-To: <20210413114712.GA3996@ashkalra_ubuntu_server> From: Steve Rutherford Date: Tue, 13 Apr 2021 12:19:37 -0700 Message-ID: Subject: Re: [PATCH v12 13/13] x86/kvm: Add kexec support for SEV Live Migration. List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "kexec" Errors-To: kexec-bounces+dwmw2=infradead.org@lists.infradead.org To: Ashish Kalra Cc: Paolo Bonzini , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Joerg Roedel , Borislav Petkov , Tom Lendacky , X86 ML , KVM list , LKML , Sean Christopherson , Venu Busireddy , Brijesh Singh , kexec@lists.infradead.org On Tue, Apr 13, 2021 at 4:47 AM Ashish Kalra wrote: > > On Mon, Apr 12, 2021 at 07:25:03PM -0700, Steve Rutherford wrote: > > On Mon, Apr 12, 2021 at 6:48 PM Ashish Kalra wrote: > > > > > > On Mon, Apr 12, 2021 at 06:23:32PM -0700, Steve Rutherford wrote: > > > > On Mon, Apr 12, 2021 at 5:22 PM Steve Rutherford wrote: > > > > > > > > > > On Mon, Apr 12, 2021 at 12:48 PM Ashish Kalra wrote: > > > > > > > > > > > > From: Ashish Kalra > > > > > > > > > > > > Reset the host's shared pages list related to kernel > > > > > > specific page encryption status settings before we load a > > > > > > new kernel by kexec. We cannot reset the complete > > > > > > shared pages list here as we need to retain the > > > > > > UEFI/OVMF firmware specific settings. > > > > > > > > > > > > The host's shared pages list is maintained for the > > > > > > guest to keep track of all unencrypted guest memory regions, > > > > > > therefore we need to explicitly mark all shared pages as > > > > > > encrypted again before rebooting into the new guest kernel. > > > > > > > > > > > > Signed-off-by: Ashish Kalra > > > > > > --- > > > > > > arch/x86/kernel/kvm.c | 24 ++++++++++++++++++++++++ > > > > > > 1 file changed, 24 insertions(+) > > > > > > > > > > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > > > > > > index bcc82e0c9779..4ad3ed547ff1 100644 > > > > > > --- a/arch/x86/kernel/kvm.c > > > > > > +++ b/arch/x86/kernel/kvm.c > > > > > > @@ -39,6 +39,7 @@ > > > > > > #include > > > > > > #include > > > > > > #include > > > > > > +#include > > > > > > > > > > > > DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled); > > > > > > > > > > > > @@ -384,6 +385,29 @@ static void kvm_pv_guest_cpu_reboot(void *unused) > > > > > > */ > > > > > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) > > > > > > wrmsrl(MSR_KVM_PV_EOI_EN, 0); > > > > > > + /* > > > > > > + * Reset the host's shared pages list related to kernel > > > > > > + * specific page encryption status settings before we load a > > > > > > + * new kernel by kexec. NOTE: We cannot reset the complete > > > > > > + * shared pages list here as we need to retain the > > > > > > + * UEFI/OVMF firmware specific settings. > > > > > > + */ > > > > > > + if (sev_live_migration_enabled & (smp_processor_id() == 0)) { > > > > > What happens if the reboot of CPU0 races with another CPU servicing a > > > > > device request (while the reboot is pending for that CPU)? > > > > > Seems like you could run into a scenario where you have hypercalls racing. > > > > > > > > > > Calling this on every core isn't free, but it is an easy way to avoid this race. > > > > > You could also count cores, and have only last core do the job, but > > > > > that seems more complicated. > > > > On second thought, I think this may be insufficient as a fix, since my > > > > read of kernel/reboot.c seems to imply that devices aren't shutdown > > > > until after these notifiers occur. As such, a single thread might be > > > > able to race with itself. I could be wrong here though. > > > > > > > > The heavy hammer would be to disable migration through the MSR (which > > > > the subsequent boot will re-enable). > > > > > > > > I'm curious if there is a less "blocking" way of handling kexecs (that > > > > strategy would block LM while the guest booted). > > > > > > > > One option that comes to mind would be for the guest to "mute" the > > > > encryption status hypercall after the call to reset the encryption > > > > status. The problem would be that the encryption status for pages > > > > would be very temporarily inaccurate in the window between that call > > > > and the start of the next boot. That isn't ideal, but, on the other > > > > hand, the VM was about to reboot anyway, so a corrupted shared page > > > > for device communication probably isn't super important. Still, I'm > > > > not really a fan of that. This would avoid corrupting the next boot, > > > > which is clearly an improvement. > > > > > > > > Each time the kernel boots it could also choose something like a > > > > generation ID, and pass that down each time it calls the hypercall. > > > > This would then let userspace identify which requests were coming from > > > > the subsequent boot. > > > > > > > > Everything here (except, perhaps, disabling migration through the MSR) > > > > seems kind of complicated. I somewhat hope my interpretation of > > > > kernel/reboot.c is wrong and this race just is not possible in the > > > > first place. > > > > > > > > > > Disabling migration through the MSR after resetting the page encryption > > > status is a reasonable approach. There is a similar window existing for > > > normal VM boot during which LM is disabled, from the point where OVMF > > > checks and adds support for SEV LM and the kernel boot checks for the > > > same and enables LM using the MSR. > > > > I'm not totally confident that disabling LM through the MSR is > > sufficient. I also think the newly booted kernel needs to reset the > > state itself, since nothing stops the hypercalls after the disable > > goes through. The host won't know the difference between early boot > > (pre-enablement) hypercalls and racy just-before-restart hypercalls. > > You might disable migration through the hypercall, get a late status > > change hypercall, reboot, then re-enable migration, but still have > > stale state. > > > > I _believe_ that the kernel doesn't mark it's RAM as private on boot > > as an optimization (might be wrong about this), since it would have > > been expensive to mark all of ram as encrypted previously. I believe > > that is no longer a limitation given the KVM_EXIT, so we can reset > > this during early boot instead of just before the kexec. > > > > I was wondering if disabling both migration (via the MSR) and "muting" > the hypercall using the "sev_live_migration_enabled" variable after the > page encryption status has been reset, will reset the page encryption > status of the guest to the (last known/good) configuration available to > the guest at boot time (i.e, all RAM pages marked as private and UEFI > setup shared MMIO/device regions, etc). > > But disabling migration and muting hypercalls after page encryption > status reset is still "racy" with hypercalls on other vCPUS, and that > can potentially mess-up the page encryption status available to guest > after kexec. > > So probably, as you mentioned above, resetting the page encryption > status during early boot (immediately after detecting host support for > migration and enabling the hypercalls) instead of just before the kexec > is a good fix. That strategy sounds good to me. Thanks, Steve > > Thanks, > Ashish > > > > > > > + int i; > > > > > > + unsigned long nr_pages; > > > > > > + > > > > > > + for (i = 0; i < e820_table->nr_entries; i++) { > > > > > > + struct e820_entry *entry = &e820_table->entries[i]; > > > > > > + > > > > > > + if (entry->type != E820_TYPE_RAM) > > > > > > + continue; > > > > > > + > > > > > > + nr_pages = DIV_ROUND_UP(entry->size, PAGE_SIZE); > > > > > > + > > > > > > + kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS, > > > > > > + entry->addr, nr_pages, 1); > > > > > > + } > > > > > > + } > > > > > > kvm_pv_disable_apf(); > > > > > > kvm_disable_steal_time(); > > > > > > } > > > > > > -- > > > > > > 2.17.1 > > > > > > _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec