From: Sean Christopherson <seanjc@google.com>
To: Michael Roth <michael.roth@amd.com>
Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	erdemaktas@google.com, Sagi Shahar <sagis@google.com>,
	David Matlack <dmatlack@google.com>,
	Kai Huang <kai.huang@intel.com>,
	Zhi Wang <zhi.wang.linux@gmail.com>,
	chen.bo@intel.com, linux-coco@lists.linux.dev,
	Chao Peng <chao.p.peng@linux.intel.com>,
	Ackerley Tng <ackerleytng@google.com>,
	Vishal Annapurve <vannapurve@google.com>,
	Yuan Yao <yuan.yao@linux.intel.com>
Subject: Re: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory
Date: Fri, 18 Aug 2023 15:27:47 -0700
Message-ID: <ZN/wY53TF2aOZtLu@google.com>
In-Reply-To: <20230722003449.664x3xcu6ydi2vrz@amd.com>

Sorry for responding so late, lost track of this and only found it again when
reviewing the next version :-/

On Fri, Jul 21, 2023, Michael Roth wrote:
> On Fri, Jul 21, 2023 at 07:25:58AM -0700, Sean Christopherson wrote:
> > Practically speaking, hooking the fault path will result in undesirable behavior.
> > Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned
> > at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest
> > shares a single 4KiB page in a 2MiB chunk.  Dirty logging is another case where
> > the RMP can stay at 2MiB.  Or as a very silly example, imagine userspace pulls a
> > stupid and covers a single 2MiB chunk of gmem with 512 memslots.
> 
> Unfortunately I don't think things are quite that flexible with SNP. If the
> RMP entry is 2MB, and you map a sub-page as 4K in the NPT, you'll immediately
> get a PFERR_GUEST_SIZEM on the first access (presumably when the guest tries
> to PVALIDATE it before use). The RMP fault handler will then subsequently
> need to PSMASH the 2MB entry into 4K before that guest can access it. So you
> get an extra page fault for every 2MB page that's mapped this way.
> (APM Volume 2, Section 15.36.10).

Well that's just bloody stupid.  Why is that a restriction?  Obviously creating
an NPT mapping that's larger would be annoying to handle, e.g. would require
locking multiple entries or something, so I can understand why that's disallowed.
But why can't software map at a finer granularity?

Note, I'm not expecting a spec change, just expressing a bit of disbelief.

Anyways, IMO, we should eat the extra #NPF in that scenario and optimize for the
much, much more likely scenario of the RMP and NPT both being able to use 2MiB pages.
And that means not inserting into the RMP when handling #NPFs, e.g. so that userspace
can do fallocate() to prep the memory before mapping, and so that if the SPTEs
get zapped for whatever reason, faulting them back in doesn't trigger useless
RMP updates.
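
E.g. from the userspace side, something like this (pure sketch; gmem_fd and
the size are whatever the VMM already has on hand):

	#define _GNU_SOURCE
	#include <err.h>
	#include <fcntl.h>

	/* Pure sketch: preallocate guest_memfd up front so that the one-time
	 * assignment (RMPUPDATE for SNP) happens at fallocate() time, not
	 * piecemeal at #NPF time. */
	static void prep_gmem(int gmem_fd, off_t guest_memory_size)
	{
		if (fallocate(gmem_fd, 0, 0, guest_memory_size))
			err(1, "fallocate() on guest_memfd");
	}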

I guess the other way to optimize things would be for userspace to use the ioctl()
to map memory into the guest[*].  But even then, initializing when allocating
seems cleaner, especially for TDX.

[*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com

> That might not be a big deal for guests that are at least somewhat optimized
> to make use of 2MB pages, but another situation is:
> 
>   - gmem allocates 2MB page
>   - guest issues PVALIDATE on 2MB page
>   - guest later converts a subpage to shared but doesn't holepunch
>   - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
>   - KVM MMU splits NPT mapping to 4K
>   - guest converts that shared page back to private
> 
> At this point there are no mixed attributes, and KVM would normally
> allow for 2MB NPT mappings again, but this is actually not allowed
> because the RMP table mappings are validated/4K and cannot be promoted by
> the hypervisor, so the NPT mappings must still be limited to 4K to
> match this, otherwise we hit the reverse of the PFERR_GUEST_SIZEM
> scenario above, where the NPT mapping level is *larger* than the RMP
> entry level. Unfortunately that does not result in a PFERR_GUEST_SIZEM
> where we can fix things up in response, but instead it's a general
> RMP fault that would be tricky to distinguish from an RMP fault
> resulting from an implicit page conversion or some other guest weirdness
> without doing RMP table checks every time we get a general RMP fault.

This seems like a bug in the SNP code.  (a) why would KVM/SNP PSMASH in that
scenario and (b) why can't it zap/split the NPT before PSMASH?
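
I.e. I'd expect the ordering to look something like this (rough sketch only,
both helpers below are placeholders, not actual KVM/SNP APIs):

	/* Rough sketch of the ordering I'd expect when a 4KiB chunk of a 2MiB
	 * private range gets converted to shared: split the NPT mapping
	 * *before* touching the RMP, so that the NPT level is never larger
	 * than the RMP entry level. */
	static int snp_demote_private_range(struct kvm *kvm, gfn_t gfn, u64 pfn)
	{
		int r;

		/* Zap/split the 2MiB SPTE down to 4KiB first... */
		r = kvm_mmu_split_huge_page(kvm, gfn & ~(u64)511, PG_LEVEL_2M);
		if (r)
			return r;

		/* ...and only then PSMASH the 2MiB RMP entry into 512 4KiB
		 * entries, so there's no window where NPT level > RMP level. */
		return psmash(pfn & ~(u64)511);
	}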

> So for all intents and purposes the mapping size and RMP entry size do
> end up being intertwined and can't be totally decoupled, and if you
> don't take that into account
> when updating the RMP entry, you'll have to do some extra PSMASH's
> in response to PFERR_GUEST_SIZEM RMP faults later.
> 
> > 
> > That likely means KVM will need an additional hook to clamp the max_level at the
> > RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM
> > should be able to blindly do the conversion because it would be a kernel bug if
> > the page is already assigned to an ASID in the RMP, i.e. the additional hook
> > shouldn't incur an extra RMP lookup.
> 
> Yah we'd still need a hook in roughly this same place for clamping
> max_level. Previous versions of SNP hypervisor patches all had a separate
> hook for handling these cases, but since the work of updating the RMP table
> prior to mapping isn't too dissimilar from the work of determining max
> mapping size, I combined both of them into the kvm_x86_gmem_prepare()
> hook/implementation.
> 
> But I don't see any major issue with moving RMPUPDATE handling to an
> allocation-time hook. As mentioned above we'd get additional
> PFERR_GUEST_SIZEM faults by not taking MMU mapping level into account, but
> I previously had it implemented similarly via a hook in kvm_gmem_get_pfn()
> (because we need the GPA) and didn't notice anything major. But I'm not
> sure exactly where you're suggesting we do it now, so I could use some
> clarity on that if kvm_gmem_get_pfn() isn't what you had in mind.

I was thinking kvm_gmem_get_folio().  If userspace doesn't preallocate, then
kvm_gmem_get_pfn() will still eat the cost of the conversion, but it at least
gives userspace the option and handles the zap case.  Tracking which folios have
been assigned (HKID or RMP) should be pretty straightforward.
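
Something along these lines is what I'm thinking (sketch only;
kvm_arch_gmem_prepare() is a strawman name, and (ab)using the uptodate flag to
track which folios have been assigned is just one option):

	/* Sketch: do the one-time conversion when gmem allocates the folio,
	 * and use the uptodate flag to remember that it's been done so that
	 * re-faulting zapped SPTEs doesn't trigger useless RMP updates. */
	static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
	{
		struct folio *folio;

		folio = filemap_grab_folio(inode->i_mapping, index);
		if (IS_ERR_OR_NULL(folio))
			return folio;

		if (!folio_test_uptodate(folio)) {
			/* RMPUPDATE for SNP, HKID/PAMT stuff for TDX, etc. */
			int r = kvm_arch_gmem_prepare(inode, index, folio);

			if (r) {
				folio_unlock(folio);
				folio_put(folio);
				return ERR_PTR(r);
			}
			folio_mark_uptodate(folio);
		}

		return folio;
	}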

I'm not totally opposed to doing more work at map time, e.g. to avoid faults or
to play nice with PSMASH, but I would rather that not bleed into gmem.  And I
think/hope if we hook kvm_gmem_get_folio(), then TDX and SNP can use that as a
common base, i.e. whatever extra shenanigans are needed for SNP can be carried
in the SNP series.  For me, having such "quirks" in the vendor specific series
would be quite helpful as I'm having a hell of a time keeping track of all the
wrinkles of TDX and SNP.
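
And FWIW, the max_level clamping hook could stay dead simple, e.g. (sketch
only; the snp_lookup_rmpentry() usage is a guess at the eventual host API):

	/* Sketch of the RMP-clamping hook: don't let KVM map at a level
	 * larger than the RMP entry. */
	static void sev_gmem_clamp_fault_level(struct kvm *kvm, u64 pfn,
					       u8 *max_level)
	{
		bool assigned;
		int rmp_level;

		if (snp_lookup_rmpentry(pfn, &assigned, &rmp_level))
			return;

		if (assigned && *max_level > rmp_level)
			*max_level = rmp_level;
	}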
