Date: Thu, 11 May 2023 13:22:31 -0700
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
From: Sean Christopherson
To: Vishal Annapurve
Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
 Joerg Roedel, "Maciej S. Szmigiero", Vlastimil Babka, Yu Zhang,
 "Kirill A. Shutemov", dhildenb@redhat.com, Quentin Perret, tabba@google.com,
 Michael Roth, wei.w.wang@intel.com, Mike Rapoport, Liam Merwick,
 Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, Hugh Dickins, Christian Brauner
X-Mailing-List: kvm@vger.kernel.org
Shutemov" , dhildenb@redhat.com, Quentin Perret , tabba@google.com, Michael Roth , wei.w.wang@intel.com, Mike Rapoport , Liam Merwick , Isaku Yamahata , Jarkko Sakkinen , Ackerley Tng , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Hugh Dickins , Christian Brauner Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Wed, May 10, 2023, Vishal Annapurve wrote: > On Wed, May 10, 2023 at 2:39=E2=80=AFPM Sean Christopherson wrote: > > > But I would still like to get this discussed here before we move on. > > > > > > I am wondering if it would make sense to implement > > > restricted_mem/guest_mem file to expose both private and shared memor= y > > > regions, inline with Kirill's original proposal now that the file > > > implementation is controlled by KVM. > > > > > > Thinking from userspace perspective: > > > 1) Userspace creates guest mem files and is able to mmap them but all > > > accesses to these files result into faults as no memory is allowed to > > > be mapped into userspace VMM pagetables. > > > > Never mapping anything into the userspace page table is infeasible. Te= chnically > > it's doable, but it'd effectively require all of the work of an fd-base= d approach > > (and probably significantly more), _and_ it'd require touching core mm = code. > > > > VMAs don't provide hva=3D>pfn information, they're the kernel's way of = implementing > > the abstraction provided to userspace by mmap(), mprotect() etc. Among= many other > > things, a VMA describes properties of what is mapped, e.g. hugetblfs ve= rsus > > anonymous, where memory is mapped (virtual address), how memory is mapp= ed, e.g. > > RWX protections, etc. But a VMA doesn't track the physical address, th= at info > > is all managed through the userspace page tables. > > > > To make it possible to allow userspace to mmap() but not access memory = (without > > redoing how the kernel fundamentally manages virtual=3D>physical mappin= gs), the > > simplest approach is to install PTEs into userspace page tables, but ne= ver mark > > them Present in hardware, i.e. prevent actually accessing the backing m= emory. > > This is is exactly what Kirill's series in link [3] below implemented. > > >=20 > Maybe it's simpler to do when mmaped regions are backed with files. >=20 > I see that shmem has fault handlers for accesses to VMA regions > associated with the files, In theory a file implementation can always > choose to not allocate physical pages for such faults (similar to > F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier). Ah, you're effectively suggesting a hybrid model where the file is the sing= le source of truth for what's private versus shared, ad KVM gets pfns through direct communication with the backing store via the file descriptor, but us= erspace can still control things via mmap() and friends. If you're not suggesting a backdoor, i.e. KVM still gets private pfns via h= vas, then we're back at Kirill's series, because otherwise there's no easy way f= or KVM to retrieve the pfn. A form of this was also discussed, though I don't know how much of the disc= ussion happened on-list. =20 KVM actually does something like this for s390's Ultravisor (UV), which is = quite a bit like TDX (UV is a trusted intermediary) except that it handles faults= much, much more gracefully. Specifically, when the untrusted host attempts to ac= cess a secure page, a fault occurs and the kernel responds by telling UV to export= the page. 
KVM actually does something like this for s390's Ultravisor (UV), which is quite
a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
much more gracefully.  Specifically, when the untrusted host attempts to access a
secure page, a fault occurs and the kernel responds by telling UV to export the
page.  The fault is gracefully handled even for kernel accesses (see
do_secure_storage_access()).  The kernel does BUG() if the export fails when
handling a fault from kernel context, but my understanding is that export can fail
if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
success, and goes straight to BUG()/panic() if something goes wrong.

On the guest side, accesses to exported (swapped) secure pages generate intercepts
and KVM faults in the page.  To do so, KVM freezes the page/folio refcount, tells
the UV to import the page, and then unfreezes the page/folio.  But very crucially,
when _anything_ in the untrusted host attempts to access the secure page, the
above fault handling for untrusted host accesses kicks in.  In other words, the
guest can cause thrash, but can't bring down the host.

TDX on the other hand silently poisons memory, i.e. doesn't even generate a
synchronous fault.  Thus the kernel needs to be 100% perfect on preventing _any_
accesses to private memory from the host, and doing that is non-trivial and
invasive.

SNP does synchronously fault, but automatically converting in the #PF handler
got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as
the guest can trigger concurrent RMP modifications.  So the end result ends up
being the same as TDX: host accesses need to be completely prevented.

Again, this is all doable, but costly.  And IMO, it provides very little value.

Allowing things like mbind() is nice-to-have at best, as implementing fbind()
isn't straightforward, and fbind() is arguably valuable to have irrespective of
this discussion, e.g. to allow userspace to say "use this policy regardless of
what process maps the file".

Using a common memory pool (same physical page is used for both shared and
private) is a similar story.  There are plenty of existing controls to limit
userspace/guest memory usage and to deal with OOM scenarios, so barring egregious
host accounting and/or limiting bugs, which would affect _all_ VM types, the worst
case scenario is that a VM is terminated because host userspace is buggy.  On the
flip side, using a common pool brings complexity into the kernel, as backing
stores would need to be taught to deny access to a subset of pages in their
mappings, and in multiple paths, e.g. faults, read()/write() and similar, page
migration, swap, etc.

[*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com

> > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> >
> >  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> >    is inefficient and inflexible compared to software defined structures, especially
> >    for the expected use cases for CoCo guests.
> >
> >  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> >    let alone a 1:1 pfn:gfn mapping.
>
> Maybe KVM can ensure that each page of the guest_mem file is
> associated with a single memslot.

This is a hard NAK.  Guest physical address space is guaranteed to have holes
and/or be discontiguous, e.g. for the PCI hole at the top of lower memory.
Allowing only a single binding would prevent userspace from backing all (or large
chunks) of guest memory with a single file.

> HVAs, when they are registered, can be associated with offsets into guest_mem files.

Enforcing 1:1 associations is doable if KVM inserts a shim/interposer, e.g.
essentially implements the exclusivity bits of restrictedmem.  But that's adding
even more complexity.
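To illustrate why multiple bindings per file matter, a userspace sketch of one
private-memory file backing two discontiguous GPA ranges (below and above the PCI
hole).  This assumes the uAPI looks roughly like the v10 proposal (restricted_fd
and restricted_offset bolted onto the classic memslot struct); none of this is
final ABI, so treat the struct layout and flag value as placeholders:

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Shape of the proposed extended memslot struct; not in mainline headers. */
struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;
	__u64 restricted_offset;
	__u32 restricted_fd;
	__u32 pad1;
	__u64 pad2[14];
};

#define KVM_MEM_PRIVATE	(1UL << 2)	/* also from the proposal */

static void back_guest_memory(int vm_fd, int gmem_fd, __u64 hva_lo, __u64 hva_hi)
{
	struct kvm_userspace_memory_region_ext lo = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0,
			.memory_size     = 3ULL << 30,	/* [0, 3GiB) */
			.userspace_addr  = hva_lo,	/* shared pages */
		},
		.restricted_fd     = gmem_fd,	/* one file for all private memory */
		.restricted_offset = 0,
	};
	struct kvm_userspace_memory_region_ext hi = {
		.region = {
			.slot            = 1,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 4ULL << 30,	/* resumes above the hole */
			.memory_size     = 1ULL << 30,
			.userspace_addr  = hva_hi,
		},
		.restricted_fd     = gmem_fd,	/* same fd... */
		.restricted_offset = 3ULL << 30,/* ...different offset */
	};

	/* The proposal extended the existing ioctl to read the larger struct. */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &lo);
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &hi);
}
```

A strict 1:1 file:memslot association would make exactly this layout impossible,
which is the crux of the NAK above.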
> >  - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> >    gain support for exposing encrypted memory regions to guests.
> >
> >  - Poking into the VMAs to convert memory would likely be less performant due
> >    to using infrastructure that is much "heavier", e.g. would require taking
> >    mmap_lock for write.
>
> Converting memory doesn't necessarily need to poke holes into the VMA, but
> rather just unmap pagetables, just like what would happen when mmapped
> files are punched to free the backing file offsets.

Sorry, bad choice of word on my part.  I didn't intend to imply poking holes; in
this case I used "poking" to mean "modifying".  munmap(), mprotect(), etc. all
require modifying VMAs, which means taking mmap_lock for write.

> > In short, shoehorning this into mmap() requires fighting how the kernel works at
> > pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
> >
> > > 2) Userspace registers mmaped HVA ranges with KVM with an additional
> > > KVM_MEM_PRIVATE flag
> > > 3) Userspace converts memory attributes, and this memory conversion
> > > allows userspace to access shared ranges of the file because those are
> > > allowed to be faulted in from guest_mem.  Shared to private conversion
> > > unmaps the file ranges from userspace VMM pagetables.
> > > 4) Granularity of userspace pagetable mappings for shared ranges will
> > > have to be dictated by the KVM guest_mem file implementation.
> > >
> > > Caveat here is that once private pages are mapped into userspace view.
> > >
> > > Benefits here:
> > > 1) Userspace view remains consistent while still being able to use HVA ranges
> > > 2) It would be possible to use HVA based APIs from userspace to do
> > > things like binding.
> > > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > > ranges possibly map to the same HPA ranges.
> >
> > #3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
> > and then the guest converts that memory from shared to private, the backing pages
> > for the previously shared mapping will still be mapped by process B unless userspace
> > also ensures process B unmaps on conversion.
>
> This should ideally be handled by something like: unmap_mapping_range()

That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
not for a generic VMA-based implementation.  If the file isn't the single source
of truth, then forcing all mappings to go away simply can't work.

> > #3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages, keeping
> > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > and possibly even if the guest converts a few MiB of memory.
>
> This caveat maybe can be lived with, as shared ranges most likely will
> not be backed by 1G pages anyway, possibly causing IO performance to
> take a hit.  This possibly needs more discussion about the conversion
> granularity used by guests.

Yes, it's not the end of the world.  My point is that separating shared and private
memory provides more flexibility.  Maybe that flexibility never ends up being
super important, but at the same time we shouldn't willingly paint ourselves into
a corner.
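For reference, the unmap_mapping_range() flow mentioned above would look roughly
like this in the hybrid model, where the file is the single source of truth.  This
is a sketch under stated assumptions, not code from any posted series, and
guest_mem_convert_to_private() is a made-up name:

```c
#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Shared->private conversion: because every mapping of the file, in every
 * process, hangs off the file's address_space, unmap_mapping_range() can
 * zap all userspace PTEs for the range in one call.  That's what makes
 * the "force process B to unmap too" requirement tractable here, and why
 * it can't work when the file isn't the single source of truth.
 */
static void guest_mem_convert_to_private(struct inode *inode,
					 pgoff_t start, pgoff_t nr)
{
	/*
	 * Real code would first flip the range to "private" in the file's
	 * own tracking so that new faults are refused, then zap existing
	 * PTEs; even_cows=1 also zaps private COW copies of the pages.
	 */
	unmap_mapping_range(inode->i_mapping,
			    (loff_t)start << PAGE_SHIFT,
			    (loff_t)nr << PAGE_SHIFT, 1);
}
```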