From: Steve Rutherford
Date: Tue, 9 Mar 2021 19:47:26 -0800
Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
To: "Kalra, Ashish"
Cc: "Singh, Brijesh", Sean Christopherson, "pbonzini@redhat.com", "joro@8bytes.org", "Lendacky, Thomas", "kvm@vger.kernel.org", "linux-kernel@vger.kernel.org", "venu.busireddy@oracle.com", Will Deacon, Quentin Perret

On Tue, Mar 9, 2021 at 7:42 PM Kalra, Ashish wrote:
>
> > On Mar 9, 2021, at 3:22
> > AM, Steve Rutherford wrote:
> >
> > On Mon, Mar 8, 2021 at 1:11 PM Brijesh Singh wrote:
> >>
> >>
> >>> On 3/8/21 1:51 PM, Sean Christopherson wrote:
> >>> On Mon, Mar 08, 2021, Ashish Kalra wrote:
> >>>> On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote:
> >>>>> +Will and Quentin (arm64)
> >>>>>
> >>>>> Moving the non-KVM x86 folks to bcc, I don't think they care about KVM details at this
> >>>>> point.
> >>>>>
> >>>>> On Fri, Feb 26, 2021, Ashish Kalra wrote:
> >>>>>> On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote:
> >>>>>>> On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra wrote:
> >>>>>>> Thanks for grabbing the data!
> >>>>>>>
> >>>>>>> I am fine with both paths. Sean has stated an explicit desire for
> >>>>>>> hypercall exiting, so I think that would be the current consensus.
> >>>>> Yep, though it'd be good to get Paolo's input, too.
> >>>>>
> >>>>>>> If we want to do hypercall exiting, this should be in a follow-up
> >>>>>>> series where we implement something more generic, e.g. a hypercall
> >>>>>>> exiting bitmap or hypercall exit list. If we are taking the hypercall
> >>>>>>> exit route, we can drop the kvm side of the hypercall.
> >>>>> I don't think this is a good candidate for arbitrary hypercall interception. Or
> >>>>> rather, I think hypercall interception should be an orthogonal implementation.
> >>>>>
> >>>>> The guest, including guest firmware, needs to be aware that the hypercall is
> >>>>> supported, and the ABI needs to be well-defined. Relying on userspace VMMs to
> >>>>> implement a common ABI is an unnecessary risk.
> >>>>>
> >>>>> We could make KVM's default behavior be a nop, i.e. have KVM enforce the ABI but
> >>>>> require further VMM intervention. But, I just don't see the point, it would
> >>>>> save only a few lines of code. It would also limit what KVM could do in the
> >>>>> future, e.g.
> >>>>> if KVM wanted to do its own bookkeeping _and_ exit to userspace,
> >>>>> then mandatory interception would essentially make it impossible for KVM to do
> >>>>> bookkeeping while still honoring the interception request.
> >>>>>
> >>>>> However, I do think it would make sense to have the userspace exit be a generic
> >>>>> exit type. But hey, we already have the necessary ABI defined for that! It's
> >>>>> just not used anywhere.
> >>>>>
> >>>>>     /* KVM_EXIT_HYPERCALL */
> >>>>>     struct {
> >>>>>         __u64 nr;
> >>>>>         __u64 args[6];
> >>>>>         __u64 ret;
> >>>>>         __u32 longmode;
> >>>>>         __u32 pad;
> >>>>>     } hypercall;
> >>>>>
> >>>>>
> >>>>>>> Userspace could also handle the MSR using MSR filters (would need to
> >>>>>>> confirm that). Then userspace could also be in control of the cpuid bit.
> >>>>> An MSR is not a great fit; it's x86 specific and limited to 64 bits of data.
> >>>>> The data limitation could be fudged by shoving data into non-standard GPRs, but
> >>>>> that will result in truly heinous guest code, and extensibility issues.
> >>>>>
> >>>>> The data limitation is a moot point, because the x86-only thing is a deal
> >>>>> breaker. arm64's pKVM work has a near-identical use case for a guest to share
> >>>>> memory with a host. I can't think of a clever way to avoid having to support
> >>>>> TDX's and SNP's hypervisor-agnostic variants, but we can at least not have
> >>>>> multiple KVM variants.
> >>>>>
> >>>> Potentially, there is another reason for in-kernel hypercall handling
> >>>> considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state
> >>>> of each guest page, for instance pages in hypervisor state, i.e., pages
> >>>> with C=0 and pages in guest valid state with C=1.
> >>>>
> >>>> Now, there shouldn't be a need for page encryption status hypercalls on
> >>>> SEV-SNP as KVM can track & reference guest page status directly using
> >>>> the RMP table.
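
For readers following the hypercall-exit option, here is a minimal sketch of how a userspace VMM might consume the KVM_EXIT_HYPERCALL payload quoted above. Everything beyond the struct layout is an assumption made for illustration: the struct name hypercall_exit is a local mirror (a real VMM would read the payload from struct kvm_run), and HC_PAGE_ENC_STATUS, its value, the argument layout (args[0] = GPA, args[1] = page count, args[2] = encrypted flag), GUEST_PAGES and the shared_pages array are all hypothetical and not taken from the patch series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the KVM_EXIT_HYPERCALL payload quoted in the thread.
 * A real VMM would use struct kvm_run from <linux/kvm.h> instead. */
struct hypercall_exit {
    uint64_t nr;
    uint64_t args[6];
    uint64_t ret;
    uint32_t longmode;
    uint32_t pad;
};

/* Hypothetical hypercall number and argument layout, illustration only:
 * args[0] = guest physical address, args[1] = number of 4K pages,
 * args[2] = 1 for private (encrypted), 0 for shared. */
#define HC_PAGE_ENC_STATUS 0x1000u
#define GUEST_PAGES        1024u
#define PAGE_SHIFT_4K      12

/* Userspace-owned bookkeeping: one byte per guest page, 1 = shared. */
static uint8_t shared_pages[GUEST_PAGES];

/* Handle one hypercall exit; the result is handed back to the guest in ret. */
static void handle_hypercall_exit(struct hypercall_exit *hc)
{
    assert(hc != NULL);
    hc->ret = (uint64_t)-1; /* default: unsupported or invalid request */

    if (hc->nr != HC_PAGE_ENC_STATUS)
        return;

    uint64_t first = hc->args[0] >> PAGE_SHIFT_4K;
    uint64_t count = hc->args[1];

    /* Reject bad flags and ranges that fall outside the tracked memory. */
    if (hc->args[2] > 1 || first >= GUEST_PAGES || count > GUEST_PAGES - first)
        return;

    memset(&shared_pages[first], hc->args[2] ? 0 : 1, (size_t)count);
    hc->ret = 0;
}
```

With this shape the VMM owns the shared-page list itself, so no GET_LIST query to KVM would be needed during migration; whether KVM should additionally do its own bookkeeping is exactly the open question in the thread.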
> >>> Relying on the RMP table itself would require locking the RMP table for an
> >>> extended duration, and walking the entire RMP to find shared pages would be
> >>> very inefficient.
> >>>
> >>>> Since KVM maintains the RMP table, we will need SET/GET type of
> >>>> interfaces to provide the guest page encryption status to userspace.
> >>> Hrm, somehow I temporarily forgot about SNP and TDX adding their own hypercalls
> >>> for converting between shared and private. And in the case of TDX, the hypercall
> >>> can't be trusted, i.e. is just a hint, otherwise the guest could induce a #MC in
> >>> the host.
> >>>
> >>> But, the different guest behavior doesn't require KVM to maintain a list/tree,
> >>> e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page encryption
> >>> status changes would also suffice.
> >>>
> >>> Actually, that made me think of another argument against maintaining a list in
> >>> KVM: there's no way to notify userspace that a page's status has changed.
> >>> Userspace would need to query KVM to do GET_LIST after every GET_DIRTY.
> >>> Obviously not a huge issue, but it does make migration slightly less efficient.
> >>>
> >>> On a related topic, there are fatal race conditions that will require careful
> >>> coordination between guest and host, and will effectively be wired into the ABI.
> >>> SNP and TDX don't suffer these issues because host awareness of status is atomic
> >>> with respect to the guest actually writing the page with the new encryption
> >>> status.
> >>>
> >>> For SEV live migration...
> >>>
> >>> If the guest does the hypercall after writing the page, then the guest is hosed
> >>> if it gets migrated while writing the page (scenario #1):
> >>>
> >>>   vCPU                      Userspace
> >>>   zero_bytes[0:N]
> >>>
> >>>
> >>>   zero_bytes[N+1:4095]
> >>>   set_shared (dest)
> >>>   kaboom!
> >>
> >>
> >> Maybe I am missing something, this is not any different from a normal
> >> operation inside a guest. Making a page shared/private in the page table
> >> does not update the content of the page itself. In your above case, I
> >> assume zero_bytes[N+1:4095] are written by the destination VM. The
> >> memory region was private in the source VM page table, so those writes
> >> will be performed encrypted. The destination VM later changed the memory
> >> to shared, but nobody wrote to the memory after it was transitioned
> >> to shared, so a reader of the memory should get ciphertext
> >> unless there was a write after the set_shared (dest).
> >>
> >>
> >>> If userspace does GET_DIRTY after GET_LIST, then the host would transfer bad
> >>> data by consuming a stale list (scenario #2):
> >>>
> >>>   vCPU                      Userspace
> >>>                             get_list (from KVM or internally)
> >>>   set_shared (src)
> >>>   zero_page (src)
> >>>                             get_dirty
> >>>
> >>>
> >>>   kaboom!
> >>
> >>
> >> I don't remember how things are done in Ashish's recent QEMU/KVM patches,
> >> but in the previous series the get_dirty() happens before querying the
> >> encrypted state. There was some logic in the VMM to resync the encrypted
> >> bitmap during the final migration stage and perform any additional data
> >> transfer since the last sync.
> >>
> >>
> >>> If both guest and host order things to avoid #1 and #2, the host can still
> >>> migrate the wrong data (scenario #3):
> >>>
> >>>   vCPU                      Userspace
> >>>   set_private
> >>>   zero_bytes[0:4096]
> >>>                             get_dirty
> >>>   set_shared (src)
> >>>                             get_list
> >>>
> >>>
> >>>   set_private (dest)
> >>>   kaboom!
> >>
> >>
> >> Since there was no write to the memory after the set_shared (src),
> >> the content of the page should not have changed.
> >> After the set_private
> >> (dest), the caller should be seeing the same content written by the
> >> zero_bytes[0:4096]
> > I think Sean was going for the situation where the VM has moved to the
> > destination, which would have changed the VEK. That way the guest
> > would be decrypting the old ciphertext with the new (wrong) key.
> >>
>
> But how can this happen? If a page is migrated as private, then when it is received it will be decrypted using the transport key (TEK) and then re-encrypted using the destination VM's VEK on the destination VM.
>

It can happen if, as in scenario #3 above, the page is set to shared just
before being migrated. It would then be migrated in the clear, but be
interpreted on the target as encrypted (since, immediately
post-migration, the page is flipped to private without ever being
written). This is not a scenario that is expected to work, as it requires
violating (currently unspoken?) invariants.

Thanks,
Steve

> Thanks,
> Ashish
>
> >>
> >>> Scenario #3 is unlikely, but plausible, e.g. if the guest bails from its
> >>> conversion flow for whatever reason, after making the initial hypercall. Maybe
> >>> it goes without saying, but to address #3, the guest must consider existing data
> >>> as lost the instant it tells the host the page has been converted to a different
> >>> type.
> >>>
> >>>> For the above reason if we do in-kernel hypercall handling for page
> >>>> encryption status (which we probably won't require for SEV-SNP &
> >>>> correspondingly there will be no hypercall exiting),
> >>> As above, that doesn't preclude KVM from exiting to userspace on conversion.
> >>>
> >>>> then we can implement a standard GET/SET ioctl interface to get/set the guest
> >>>> page encryption status for userspace, which will work across SEV, SEV-ES and
> >>>> SEV-SNP.
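
The scenario #3 failure discussed in the reply above can be reproduced with a toy model. This is only a sketch under loud assumptions: "encryption" is a single-byte XOR standing in for the VM's VEK (real SEV hardware uses AES), the TEK transport step is collapsed into a decrypt and re-encrypt, and the names xor_page, migrate_page and guest_sees_zeros_after_migration are invented for this example. It shows only that a page copied as shared and then read as private under a different key no longer yields what the guest wrote.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096

/* Toy stand-in for memory encryption: XOR with a per-VM key byte.
 * It models nothing beyond "the wrong key yields garbage". */
static void xor_page(uint8_t *dst, const uint8_t *src, uint8_t key)
{
    for (size_t i = 0; i < PAGE_BYTES; i++)
        dst[i] = src[i] ^ key;
}

/* Migrate one page according to the host's view of its status. */
static void migrate_page(uint8_t *dest_mem, const uint8_t *src_mem,
                         int host_thinks_shared,
                         uint8_t src_vek, uint8_t dest_vek)
{
    if (host_thinks_shared) {
        /* Shared pages are copied byte for byte. */
        xor_page(dest_mem, src_mem, 0);
    } else {
        /* Private pages are decrypted, transported, and re-encrypted
         * under the destination key. */
        uint8_t plain[PAGE_BYTES];
        xor_page(plain, src_mem, src_vek);
        xor_page(dest_mem, plain, dest_vek);
    }
}

/* Returns 1 if the guest, reading the page as private on the destination,
 * still sees the zeros it wrote on the source, and 0 otherwise. */
static int guest_sees_zeros_after_migration(int host_thinks_shared)
{
    const uint8_t src_vek = 0x5a, dest_vek = 0xc3;
    static uint8_t zeros[PAGE_BYTES], src_mem[PAGE_BYTES];
    static uint8_t dest_mem[PAGE_BYTES], view[PAGE_BYTES];

    xor_page(src_mem, zeros, src_vek);   /* set_private; zero_bytes[0:4096] */
    /* set_shared (src) without a rewrite is what flips host_thinks_shared. */
    migrate_page(dest_mem, src_mem, host_thinks_shared, src_vek, dest_vek);
    xor_page(view, dest_mem, dest_vek);  /* set_private (dest), then read */

    for (size_t i = 0; i < PAGE_BYTES; i++)
        if (view[i] != 0)
            return 0;
    return 1;
}
```

In this model guest_sees_zeros_after_migration(0) returns 1 and guest_sees_zeros_after_migration(1) returns 0, which matches the rule stated in the thread: once the guest tells the host a page has changed type, it must treat the existing contents as lost.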