From: David Hildenbrand <david@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Borislav Petkov <bp@alien8.de>, Andy Lutomirski <luto@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Joerg Roedel <jroedel@suse.de>, Andi Kleen <ak@linux.intel.com>,
David Rientjes <rientjes@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
Tom Lendacky <thomas.lendacky@amd.com>,
Thomas Gleixner <tglx@linutronix.de>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Varad Gautam <varad.gautam@suse.com>,
Dario Faggioli <dfaggioli@suse.com>,
x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
"Kirill A . Shutemov" <kirill@shutemov.name>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Yu Zhang <yu.c.zhang@linux.intel.com>
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Wed, 1 Sep 2021 10:09:07 +0200
Message-ID: <f413cc20-66fc-cf1e-47ab-b8f099c89583@redhat.com>
In-Reply-To: <YS6lIg6kjNPI1EgF@google.com>
>> Do we have to protect from that? How would KVM protect from user space
>> replacing private pages by shared pages in any of the models we discuss?
>
> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> as both private and shared, where "shared" also incorporates any mapping from the
> host. Essentially it boils down to the kernel ensuring that a pfn is unmapped
> before it's converted to/from private, and KVM ensuring that it honors any
> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> as proposed in this RFC.

Okay, so the fallocate(PUNCH_HOLE) from user space could trigger the
respective unmapping and freeing of backing storage.

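For illustration, a minimal userspace sketch of the PUNCH_HOLE operation, assuming a plain memfd stands in for the private-memory fd. The fd type proposed in the RFC is not modeled, and punch_hole() and punch_hole_demo() are made-up helper names. It only shows the existing fallocate(2) semantics (the punched range reads back as zeroes, the file size is kept); the KVM unmap notification is not part of it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>   /* memfd_create() */
#include <sys/stat.h>
#include <unistd.h>

/* Free the backing storage for [off, off + len) while keeping the file size. */
int punch_hole(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);
}

/*
 * Fill a 4-page memfd with 0xaa, punch out page 1, then check that page 1
 * reads back as zeroes, page 0 is untouched, and the file size is unchanged.
 * Returns 0 on success, -1 on any failure.
 */
int punch_hole_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	struct stat st;
	int fd, ret = -1;

	assert(page > 0);
	buf = malloc(page);
	/* Assumption: an ordinary memfd stands in for the private-memory fd. */
	fd = memfd_create("toy-guest-mem", MFD_CLOEXEC);
	if (!buf || fd < 0)
		goto out;

	memset(buf, 0xaa, page);
	for (int i = 0; i < 4; i++)
		if (pwrite(fd, buf, page, (off_t)i * page) != page)
			goto out;

	if (punch_hole(fd, page, page))
		goto out;

	/* The file size must not change with FALLOC_FL_KEEP_SIZE. */
	if (fstat(fd, &st) || st.st_size != 4 * page)
		goto out;

	/* The punched page reads back as zeroes. */
	if (pread(fd, buf, page, page) != page)
		goto out;
	for (long i = 0; i < page; i++)
		if (buf[i])
			goto out;

	/* The neighbouring page keeps its contents. */
	if (pread(fd, buf, page, 0) != page || buf[0] != 0xaa)
		goto out;

	ret = 0;
out:
	if (fd >= 0)
		close(fd);
	free(buf);
	return ret;
}
```

Calling punch_hole_demo() from a small main() and checking for a 0 return is enough to try it out.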
>
> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> backing-store is destroyed; the backing-store needs to notify downstream MMUs
> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.

Right.

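The rule quoted above (a pfn is never mapped as both private and shared, and the backing store notifies downstream MMUs before freeing) can be sketched as a toy state machine. This is a userspace model under stated assumptions, not kernel code: struct toy_pfn, toy_map(), toy_kvm_unmap() and toy_free() are all hypothetical names, and the "notification" is just a function pointer.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_pfn_state { TOY_UNMAPPED, TOY_SHARED, TOY_PRIVATE };

struct toy_pfn {
	enum toy_pfn_state state;
	bool allocated;
};

/* Hypothetical stand-in for the unmap notification delivered to KVM. */
typedef void (*toy_unmap_fn)(struct toy_pfn *pfn);

/* KVM's side of the contract: honor the notification by dropping the mapping. */
void toy_kvm_unmap(struct toy_pfn *pfn)
{
	pfn->state = TOY_UNMAPPED;
}

/*
 * Map a pfn as shared or private. This is only allowed from the unmapped
 * state, so a conversion always has to go through an unmap first.
 */
bool toy_map(struct toy_pfn *pfn, enum toy_pfn_state want)
{
	if (!pfn->allocated || want == TOY_UNMAPPED || pfn->state != TOY_UNMAPPED)
		return false;
	pfn->state = want;
	return true;
}

/*
 * Backing-store side (PUNCH_HOLE or destruction): notify the downstream MMU
 * first, and only then free the memory.
 */
void toy_free(struct toy_pfn *pfn, toy_unmap_fn notify)
{
	notify(pfn);
	assert(pfn->state == TOY_UNMAPPED);
	pfn->allocated = false;
}
```

In this model, mapping a private pfn as shared fails until toy_kvm_unmap() has run, and toy_free() asserts that nothing is mapped at the time the memory is released.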
>
> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
> but my argument is that that behavior is not visible to userspace and thus
> out of scope for this discussion, e.g. zapping/restoring the direct map can
> be added/removed without impacting the userspace ABI.

Right. Removing it also shouldn't be required IMHO. There are other ways
to teach the kernel to not read/write some online pages (filter
/proc/kcore, disable hibernation, strict access checks for /dev/mem ...).

>
>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
>>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal
>>>> user memory slot. When creating an "encrypted" user memory slot, require that
>>>> the whole VMA is covered at creation time. You know the VMA can't change
>>>> later.
>>>
>>> This can work for the basic use cases, but even then I'd strongly prefer not to
>>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind
>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
>>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and
>>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers
>>> to handle mprotect()/munmap()/etc...
>>
>> Right, and for the existing use cases this worked. But encrypted memory
>> breaks many assumptions we once made ...
>>
>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
>> page tables but not actually mapped into user space page tables. There is no
>> way to reach these via the rmap.
>>
>> We have something like that already via vfio. And that is fundamentally
>> broken when it comes to mmu notifiers, page pinning, page migration, ...
>
> I'm not super familiar with VFIO internals, but the idea with the fd-based
> approach is that the backing-store would be in direct communication with KVM and
> would handle those operations through that direct channel.

Right. The problem I am seeing is that, e.g., try_to_unmap() might not be
able to actually fully unmap a page, because some non-synchronized KVM
MMU still maps the page. It would be great to evaluate how the fd
callbacks would fit into the whole picture, including the current rmap.

I guess I'm missing the bigger picture of how it all fits together on the
!KVM side.

>
>>> As is, I don't think KVM would get any kind of notification if userspace unmaps
>>> the VMA for a private memslot that does not have any entries in the host page
>>> tables. I'm sure it's a solvable problem, e.g. by ensuring at least one page
>>> is touched by the backing store, but I don't think the end result would be any
>>> prettier than a dedicated API for KVM to consume.
>>>
>>> Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
>>> to page migration or swap. For those types of operations, KVM currently just
>>> reacts to invalidation notifications by zapping guest PTEs, and then gets the
>>> new pfn when the guest re-faults on the page. That sequence doesn't work for
>>> TDX or SEV-SNP because the trusted agent needs to do the memcpy() of the page
>>> contents, i.e. the host needs to call into KVM for the actual migration.
>>
>> Right, but I still think this is a kernel internal. You can do such
>> handshake later in the kernel IMHO.
>
> It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the
> migration and that would invert the mmu_notifier API, e.g. instead of "telling"
> secondary MMUs to invalidate/change mappings, the mm would be "asking"
> secondary MMUs "can you move this?". More below.

In my thinking, the rmap via mmu notifiers would do the unmapping
just as we know it (from primary MMU -> secondary MMU). Once
try_to_unmap() succeeded, the fd provider could kick off the migration
via whatever callback.

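The ordering described here (unmap through the notifiers first, then let the fd provider kick off the migration) can be sketched as a toy userspace model. Everything below is an assumption for illustration: struct toy_page, toy_try_to_unmap(), toy_migrate() and the callback are hypothetical and do not match any kernel signature. The `secondary_notified` flag models whether the secondary MMU is reachable through the notifier; when it is not, the unmap cannot complete and migration is refused.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	unsigned long pfn;
	int primary_maps;	/* mappings in the primary MMU */
	int secondary_maps;	/* mappings in a secondary MMU (e.g. KVM) */
};

/* Hypothetical migration callback provided by the fd provider. */
typedef int (*toy_migrate_fn)(struct toy_page *pg, unsigned long new_pfn);

/*
 * Drop all mappings. The secondary mapping only goes away if the secondary
 * MMU actually receives the notification; otherwise the page stays mapped
 * there and the unmap is incomplete.
 */
bool toy_try_to_unmap(struct toy_page *pg, bool secondary_notified)
{
	pg->primary_maps = 0;
	if (secondary_notified)
		pg->secondary_maps = 0;
	return pg->primary_maps == 0 && pg->secondary_maps == 0;
}

/* Toy fd provider: "migrating" just means switching to the new pfn. */
int toy_fd_provider_migrate(struct toy_page *pg, unsigned long new_pfn)
{
	pg->pfn = new_pfn;
	return 0;
}

/* Unmap first; only a fully unmapped page is handed to the fd provider. */
int toy_migrate(struct toy_page *pg, unsigned long new_pfn,
		bool secondary_notified, toy_migrate_fn cb)
{
	assert(cb);
	if (!toy_try_to_unmap(pg, secondary_notified))
		return -1;	/* still mapped somewhere, do not migrate */
	return cb(pg, new_pfn);
}
```

With a non-synchronized secondary MMU, toy_migrate() returns -1 and the pfn is left alone; once the secondary MMU is notified, the callback runs and the page moves.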
>
>> But I also already thought: is it really KVM that is to perform the
>> migration or is it the fd-provider that performs the migration? Who says
>> memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just
>> knows how to migrate such a page?
>>
>> I'd love to have some details on how that's supposed to work, and which
>> information we'd need to migrate/swap/... in addition to the EPFN and a new
>> SPFN.
>
> KVM "needs" to do the migration. On TDX, the migration will be a SEAMCALL,
> a post-VMXON instruction that transfers control to the TDX-Module, that at
> minimum needs a per-VM identifier, the gfn, and the page table level. The call

The per-VM identifier and the GFN would be easy to grab. Page table
level, not so sure -- do you mean the general page table depth? Or if
it's mapped as 4k vs. 2M ... ? The latter could be answered by the fd
provider already I assume.

Does the page still have to be mapped into the secondary MMU when
performing the migration via TDX? I assume not, which would simplify
things a lot.

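If "page table level" means the mapping size (4k vs. 2M), the three inputs mentioned above could be bundled roughly as follows. This is only a sketch under that assumption: struct toy_migrate_args and the helpers are hypothetical, the real SEAMCALL interface is not modeled, and the level numbering (1 = 4k, 2 = 2M, 3 = 1G, each x86 level covering 512 entries of the one below) is chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed numbering: each level covers 512 entries of the level below. */
enum toy_level { TOY_LEVEL_4K = 1, TOY_LEVEL_2M = 2, TOY_LEVEL_1G = 3 };

/* Hypothetical bundle of the inputs named in the discussion. */
struct toy_migrate_args {
	uint64_t vm_id;		/* per-VM identifier */
	uint64_t gfn;		/* guest frame number */
	enum toy_level level;	/* mapping size, not page table depth */
};

/* Bytes covered by one mapping at the given level. */
uint64_t toy_level_size(enum toy_level level)
{
	assert(level >= TOY_LEVEL_4K && level <= TOY_LEVEL_1G);
	return UINT64_C(4096) << (9 * (level - 1));
}

/* First gfn of the (huge) page that contains args->gfn at args->level. */
uint64_t toy_gfn_base(const struct toy_migrate_args *args)
{
	uint64_t pages = toy_level_size(args->level) / 4096;

	return args->gfn & ~(pages - 1);
}
```

For example, gfn 0x12345 mapped at the 2M level belongs to the huge page starting at gfn 0x12200.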
> into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock)
> to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to
> the backing-store attempting to migrate memory that KVM is unmapping due to a
> memslot change.

Something like that might be handled by fixing private memory slots,
similar to what my draft does, right?

>
> The per-VM identifier may not apply to SEV-SNP, but I believe everything else
> holds true.
Thanks!
--
Thanks,
David / dhildenb