* [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?)
@ 2023-10-16 11:50 Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 1/8] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory Michael Roth
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

This patchset is also available at:

  https://github.com/mdroth/linux/commits/gmem-x86-v1-rfc

and is based on top of the kvm-x86 gmem tree (e7af8d17224a):

  https://github.com/kvm-x86/linux/commits/guest_memfd


== OVERVIEW ==

This is a series of patches that are currently needed for implementing
SEV-SNP hypervisor support on top of guest_memfd, but that have a lot of
potential to overlap with changes that may also be needed for TDX.

The goal of this series is to help reach a consensus on what
functionality implemented here is indeed common between SNP/TDX, and
to try to finalize what these interfaces should look like so we can
incorporate them into a common gmem/x86 tree for the various
SNP/TDX-specific patchsets to base on top of, reducing potential churn.

A couple of the patches here are newer versions of patches that were
included in a similar series posted by Isaku here[1] that were revised
to incorporate feedback from Sean and others.

Some of the approaches implemented here have deviated somewhat from what
may have been discussed/suggested on the list. For these cases I've
tried to provide my rationale below along with some relevant background
so that we can continue these discussions from where we left off and
reach a consensus on what these need to look like to be usable for both
SNP and TDX, and acceptable for gmem in general.


== Hooks for preparing gmem pages (patch #3) ==

The design here is mainly driven by this discussion with Sean[2]. In the
prior version used by the v9 SNP hypervisor patchset[3] and included in
Isaku's patchset[1], this hook was triggered directly by the KVM MMU just
prior to mapping a page into the guest NPT/EPT, to handle things like
putting the page into the initial 'private' state as defined by the
architecture (e.g. the 'guest-owned' state in the SNP RMP table).

Sean was hoping to tie this update to allocation time rather than
mapping time, so it has more symmetry with the 'invalidation' side that
happens when the backing pages are freed back to the host, and allows
for better run-time performance if userspace opts to pre-allocate pages
in advance, since the cost of the 'preparation' hook could then also be
paid up front.

To accommodate this I changed this hook to trigger automatically when
a folio is allocated, either via kvm_gmem_get_pfn() or via a
fallocate() operation. There are a couple of places in this
implementation where things fall a bit short of the original design
goals, however:

  1) For SNP, preparing a page as 'guest-owned' in the RMP table
     requires the GPA, which can only be known after the guest_memfd
     range being allocated has been bound to a memslot, and there's
     no guarantee userspace won't attempt to fallocate() in advance of
     binding to a memslot unless we enforce that as part of the gmem
     API.

     Instead, this implementation simply attempts to call the prepare
     hook every time a folio is accessed via the common
     kvm_gmem_get_folio() path to ensure these 'deferred' preparation
     hooks will happen before KVM MMU maps any pages into a guest.
     Thus, it's up to the architecture/platform to keep track of
     whether a page is already in the 'prepared' state. For SNP this
     is tracked via the RMP table itself, so we sort of already have this
     for free.

  2) AIUI the design proposed by Sean would involve gmem internally
     keeping track of whether or not a page/folio has already been
     prepared. As mentioned above, in this version we instead simply
     punt that tracking to the architecture.

     I did experiment with tracking this inside gmem, but it was a
     bit of a false start. I tried using an xarray to keep track of 2
     states: 'allocated'/'prepared', since both would be needed if we
     wanted to be able to do things like handle deferred preparation
     hooks for situations like a memslot getting bound to a range that
     has already been allocated.

     The 'allocated' state was inferred based on whether an entry was
     present for a particular gmem offset, and the entry itself was a
     set of flags, 'prepared' being one of them. However, at the time
     there was a TODO[4] to investigate dropping the use of filemap in
     favor of doing that internally in gmem, and this seemed like it
     could be an intermediate step toward that, so I started heading
     down that road a bit by using higher-order/multi-index xarray
     entries with the thought of eventually being able to just track
     the folios themselves and drop reliance on filemap. This got messy
     quickly, however, and I dropped it once Paolo's investigations
     suggested that replacing filemap completely probably wouldn't be
     very worthwhile[3].

     So maybe internally tracking 'allocated'/'prepared' states in gmem
     is more doable if we take a simpler approach like 4K-granularity
     xarrays or sparse bitmaps (a rough sketch follows below), or
     something else, but it's still
     enough additional complexity that I think it would be good to
     understand better what we really gain by doing this tracking in
     gmem. The one thing I can think of is the ability to immediately
     move already-allocated pages into the 'prepared' state if userspace
     decides to pre-allocate them prior to binding the range to a
     memslot, but I'm not sure how much that buys us performance-wise.
     At least for SNP, the initial guest payload will necessarily be
     put into 'prepared' state prior to boot, and anything other than
     that will need to go through the whole shared->private+PVALIDATE
     dance where saving on an RMPUPDATE in that path probably wouldn't
     make a huge difference.
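
Purely for discussion, here is a rough sketch of what that simpler
4K-granularity xarray tracking might look like. The 'page_state' xarray
and the helper names below are hypothetical and not part of this series:

#define KVM_GMEM_PG_PREPARED	BIT(0)

/*
 * Hypothetical sketch only: presence of an entry for a gmem offset means
 * 'allocated'; the stored value records whether the prepare hook has
 * already been issued for that offset.
 */
static int kvm_gmem_mark_allocated(struct kvm_gmem *gmem, pgoff_t index)
{
	void *entry = xa_load(&gmem->page_state, index);

	if (entry)
		return 0;

	return xa_err(xa_store(&gmem->page_state, index, xa_mk_value(0),
			       GFP_KERNEL));
}

static int kvm_gmem_mark_prepared(struct kvm_gmem *gmem, pgoff_t index)
{
	return xa_err(xa_store(&gmem->page_state, index,
			       xa_mk_value(KVM_GMEM_PG_PREPARED), GFP_KERNEL));
}

static bool kvm_gmem_is_prepared(struct kvm_gmem *gmem, pgoff_t index)
{
	void *entry = xa_load(&gmem->page_state, index);

	return entry && (xa_to_value(entry) & KVM_GMEM_PG_PREPARED);
}

kvm_gmem_get_folio() could then consult kvm_gmem_is_prepared() before
issuing the prepare hook, but as noted above it's not clear the added
bookkeeping buys us much for SNP, where the RMP table already provides
this information.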


== Hooks for invalidating gmem pages (patch #4) ==

In the prior version used by the v9 SNP hypervisor patchset[3] and
included in Isaku's patchset[1], gmem called these hooks directly during
hole-punching operations or during kvm_gmem_release() to make gmem
pages accessible to the host before freeing them.

There was an open TODO[4] to look at making use of
.invalidate_folio, .evict_inode, and similar callbacks to handle this
sort of cleanup.

Toward that end I switched to using .free_folio to trigger these
arch-specific hooks since it seemed to be the most direct choice. What's
nice about that is that, even in the context of future support for things like
intra-host migration to support live update[5], where multiple gmem
instances might share an inode, there is less risk of one gmem instance
clobbering the other when it is released, since the reference counting on
the underlying inode will keep the inode alive.

One downside to using .free_folio is that there is no metadata to pass along
with it: the callback gets PFN/order and that's it. For SNP this is
fine, but maybe for other platforms that's not enough to handle the
cleanup work.

If some sort of metadata *is* needed, .invalidate_folio is an option
since it can also pass along information associated with the folio via
folio_attach_private() and otherwise behaves similarly to .free_folio.
One major exception, however, is hole-punching, where splitting the folio
results in the private data being lost. And unlike normal pagecache
users, there aren't obvious points to re-attach it, like read/write
operations on some file. So we'd probably need to do something like
scanning for split folios in the hugepage ranges that contain the range
that got hole-punched and re-attaching the private data immediately after
each hole-punch operation. That, or introducing some other flag to ask
mm/truncate.c to handle this for us. Or just stick with the prior
approach used in Isaku's patchset[1].
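
For illustration, a rough sketch of what that .invalidate_folio-based
approach might look like, assuming some hypothetical per-folio metadata
struct attached via folio_attach_private() (the struct and the
kvm_arch_gmem_invalidate_priv() helper below are made up for the sake of
the example):

static void kvm_gmem_invalidate_folio(struct folio *folio, size_t offset,
				      size_t len)
{
	if (folio_test_private(folio)) {
		struct kvm_gmem_folio_priv *priv = folio_get_private(folio);
		kvm_pfn_t start = folio_pfn(folio) + (offset >> PAGE_SHIFT);

		/* Hand the metadata to the arch-specific invalidation. */
		kvm_arch_gmem_invalidate_priv(start, len >> PAGE_SHIFT, priv);
	}

	/* Only drop the private data once the entire folio is going away. */
	if (offset == 0 && len == folio_size(folio))
		kfree(folio_detach_private(folio));
}

The catch, as mentioned above, is keeping that private data attached
across folio splits during hole-punching, which is what makes the
simpler .free_folio approach more appealing for SNP.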


== Determining whether #NPFs were for private access (patch #5-8) ==

This is mainly driven by these discussions[6][7], which suggest moving
toward special-cased handling based on VM type where necessary, but
consolidating around the use of the AMD-defined encryption bit to
encode whether a guest #NPF / EPT violation was for a private page or
not. Toward that end I defined the SNP vm type here and pulled in a
patch from the SNP patchset that introduces PFERR_GUEST_ENC_MASK, and
used those to initialize struct kvm_page_fault's .is_private field. My
hope is that with TDX synthesizing the same PFERR_GUEST_ENC_MASK, the
logic there would work the same for both the TDX and SNP vm types.
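
For reference, the classification logic added in patch 8 boils down to
something like the following (abridged from the kvm_mmu_fault_is_private()
helper introduced there):

	bool private_fault = false;

	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
		private_fault = !!(err & PFERR_GUEST_ENC_MASK);
	else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
		/* self-tests/guests that treat userspace as the authority */
		private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);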


== References ==

[1] https://lkml.kernel.org/kvm/CUU93XA8UKMG.X15YWDK533GB@suppilovahvero/t/
[2] https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
[3] https://lore.kernel.org/kvm/CABgObfZiS+e7oDbwuC1Uycsz8Mjsu-FSfSmu=3R0M71vUhpq_Q@mail.gmail.com/
[4] https://lore.kernel.org/kvm/ZOjpIL0SFH+E3Dj4@google.com/
[5] https://lore.kernel.org/lkml/ZN%2F81KNAWofRCaQK@google.com/t/
[6] https://lkml.kernel.org/kvm/ZJnXObRHhn5Q1dX2@google.com/
[7] https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#me8a395cdf682068d8e5152c358016bf2fa4328e5


Any suggestions and feedback are very much appreciated.

-Mike

----------------------------------------------------------------
Brijesh Singh (1):
      KVM: x86: Define RMP page fault error bits for #NPF

Michael Roth (7):
      mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory
      KVM: Use AS_INACCESSIBLE when creating guest_memfd inode
      KVM: x86: Add gmem hook for initializing memory
      KVM: x86: Add gmem hook for invalidating memory
      KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults
      KVM: x86: Add KVM_X86_SNP_VM vm_type
      KVM: x86: Determine shared/private faults based on vm_type

 arch/x86/include/asm/kvm-x86-ops.h |  2 ++
 arch/x86/include/asm/kvm_host.h    | 13 +++++++
 arch/x86/include/uapi/asm/kvm.h    |  1 +
 arch/x86/kvm/mmu/mmu.c             | 15 +++++---
 arch/x86/kvm/mmu/mmu_internal.h    | 24 +++++++++++--
 arch/x86/kvm/mmu/mmutrace.h        |  2 +-
 arch/x86/kvm/mmu/paging_tmpl.h     |  2 +-
 arch/x86/kvm/x86.c                 | 21 ++++++++++-
 include/linux/kvm_host.h           | 18 ++++++++++
 include/linux/pagemap.h            |  1 +
 mm/truncate.c                      |  3 +-
 virt/kvm/Kconfig                   |  8 +++++
 virt/kvm/guest_memfd.c             | 71 +++++++++++++++++++++++++++++++++++---
 13 files changed, 165 insertions(+), 16 deletions(-)




* [PATCH RFC gmem v1 1/8] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 2/8] KVM: Use AS_INACCESSIBLE when creating guest_memfd inode Michael Roth
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta,
	Matthew Wilcox

filemap users like guest_memfd may use page cache pages to
allocate/manage memory that is only intended to be accessed by guests
via hardware protections like encryption. Writes to memory of this sort
in common paths like truncation may cause unexpected behavior such as
writing garbage instead of zeros when attempting to zero pages, or
worse, triggering hardware protections that are considered fatal as far
as the kernel is concerned.

Introduce a new address_space flag, AS_INACCESSIBLE, and use this
initially to prevent zero'ing of pages during truncation, with the
understanding that it is up to the owner of the mapping to handle this
specially if needed.

Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
Cc: Matthew Wilcox <willy@infradead.org>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 include/linux/pagemap.h | 1 +
 mm/truncate.c           | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 82c9bf506b79..9e79cf48f67a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -205,6 +205,7 @@ enum mapping_flags {
 	AS_LARGE_FOLIO_SUPPORT = 6,
 	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
 	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
+	AS_INACCESSIBLE	= 9,	/* Do not attempt direct R/W access to the mapping */
 };
 
 /**
diff --git a/mm/truncate.c b/mm/truncate.c
index 8e3aa9e8618e..0d80bcc250af 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -233,7 +233,8 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 	 * doing a complex calculation here, and then doing the zeroing
 	 * anyway if the page split fails.
 	 */
-	folio_zero_range(folio, offset, length);
+	if (!test_bit(AS_INACCESSIBLE, &folio->mapping->flags))
+		folio_zero_range(folio, offset, length);
 
 	if (folio_has_private(folio))
 		folio_invalidate(folio, offset, length);
-- 
2.25.1



* [PATCH RFC gmem v1 2/8] KVM: Use AS_INACCESSIBLE when creating guest_memfd inode
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 1/8] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory Michael Roth
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

truncate_inode_pages_range() may attempt to zero pages before truncating
them, and this will occur before arch-specific invalidations can be
triggered via the .invalidate_folio/.free_folio hooks in kvm_gmem_aops. For
AMD SEV-SNP this would result in an RMP #PF being generated by the
hardware, which is currently treated as fatal (and even if specifically
allowed for, would not result in anything other than garbage being
written to guest pages due to encryption). On Intel TDX this would also
result in undesirable behavior.

Set the AS_INACCESSIBLE flag to prevent the MM from attempting
unexpected accesses of this sort during operations like truncation.

This may also in some cases yield a decent performance improvement for
guest_memfd userspace implementations that hole-punch ranges immediately
after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
the current implementation of truncate_inode_pages_range() always ends
up zero'ing an entire 4K range if it is backed by a 2M folio.

Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 virt/kvm/guest_memfd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9ffce54555ae..f6f1b17a319c 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -398,6 +398,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	set_bit(AS_INACCESSIBLE, &inode->i_mapping->flags);
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-- 
2.25.1



* [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 1/8] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 2/8] KVM: Use AS_INACCESSIBLE when creating guest_memfd inode Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2024-02-08 10:57   ` Suzuki K Poulose
  2023-10-16 11:50 ` [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory Michael Roth
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move these pages into a 'private' state by updating the
corresponding entries in the RMP table.

Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.

The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).

In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned', so also add a
helper that performs the same function as kvm_gmem_get_pfn(), but allows
for the preparation callback to be bypassed to allow for pages to be
accessed beforehand.

Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/x86.c                 |  6 ++++
 include/linux/kvm_host.h           | 14 ++++++++
 virt/kvm/Kconfig                   |  4 +++
 virt/kvm/guest_memfd.c             | 56 +++++++++++++++++++++++++++---
 6 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e3054e3e46d5..0c113f42d5c7 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 95018cc653f5..66fc89d1858f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1752,6 +1752,8 @@ struct kvm_x86_ops {
 	 * Returns vCPU specific APICv inhibit reasons
 	 */
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 767236b4d771..33a4cc33d86d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13301,6 +13301,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
 
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
+{
+	return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
+}
+#endif
 
 int kvm_spec_ctrl_test_value(u64 value)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c5c017ab4e9..c7f82c2f1bcf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2403,9 +2403,19 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
+int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 			      gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
 #else
+static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
+				     struct kvm_memory_slot *slot, gfn_t gfn,
+				     kvm_pfn_t *pfn, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
+
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   kvm_pfn_t *pfn, int *max_order)
@@ -2415,4 +2425,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
+#endif
+
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 2c964586aa14..992cf6ed86ef 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,3 +109,7 @@ config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_PRIVATE_MEM
        bool
+
+config HAVE_KVM_GMEM_PREPARE
+       bool
+       depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f6f1b17a319c..72ff8b7b31d5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -44,7 +44,40 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
 #endif
 }
 
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+	struct list_head *gmem_list = &inode->i_mapping->private_list;
+	struct kvm_gmem *gmem;
+
+	list_for_each_entry(gmem, gmem_list, entry) {
+		struct kvm_memory_slot *slot;
+		struct kvm *kvm = gmem->kvm;
+		struct page *page;
+		kvm_pfn_t pfn;
+		gfn_t gfn;
+		int rc;
+
+		slot = xa_load(&gmem->bindings, index);
+		if (!slot)
+			continue;
+
+		page = folio_file_page(folio, index);
+		pfn = page_to_pfn(page);
+		gfn = slot->base_gfn + index - slot->gmem.pgoff;
+		rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
+		if (rc) {
+			pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
+					    index, rc);
+			return rc;
+		}
+	}
+
+#endif
+	return 0;
+}
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prep)
 {
 	struct folio *folio;
 
@@ -74,6 +107,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 		folio_mark_uptodate(folio);
 	}
 
+	if (prep && kvm_gmem_prepare_folio(inode, index, folio)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return NULL;
+	}
+
 	/*
 	 * Ignore accessed, referenced, and dirty flags.  The memory is
 	 * unevictable and there is no storage to write back to.
@@ -178,7 +217,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
 			break;
 		}
 
-		folio = kvm_gmem_get_folio(inode, index);
+		folio = kvm_gmem_get_folio(inode, index, true);
 		if (!folio) {
 			r = -ENOMEM;
 			break;
@@ -537,8 +576,8 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 	fput(file);
 }
 
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
 {
 	pgoff_t index, huge_index;
 	struct kvm_gmem *gmem;
@@ -559,7 +598,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		goto out_fput;
 	}
 
-	folio = kvm_gmem_get_folio(file_inode(file), index);
+	folio = kvm_gmem_get_folio(file_inode(file), index, prep);
 	if (!folio) {
 		r = -ENOMEM;
 		goto out_fput;
@@ -600,4 +639,11 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	return r;
 }
+EXPORT_SYMBOL_GPL(__kvm_gmem_get_pfn);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+	return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
+}
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
-- 
2.25.1



* [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
                   ` (2 preceding siblings ...)
  2023-10-16 11:50 ` [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2024-02-09 10:11   ` Steven Price
  2023-10-16 11:50 ` [PATCH RFC gmem v1 5/8] KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults Michael Roth
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/x86.c                 |  7 +++++++
 include/linux/kvm_host.h           |  4 ++++
 virt/kvm/Kconfig                   |  4 ++++
 virt/kvm/guest_memfd.c             | 14 ++++++++++++++
 6 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 0c113f42d5c7..f1505a5fa781 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 66fc89d1858f..dbec74783f48 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
 
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33a4cc33d86d..0e95c3a95e59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
 }
 #endif
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+	static_call_cond(kvm_x86_gmem_invalidate)(start, end);
+}
+#endif
+
 int kvm_spec_ctrl_test_value(u64 value)
 {
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c7f82c2f1bcf..840a5be5962a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
 #endif
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#endif
+
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 992cf6ed86ef..7fd1362a7ebe 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
 config HAVE_KVM_GMEM_PREPARE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_INVALIDATE
+       bool
+       depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 72ff8b7b31d5..b4c4df259fb8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
 	return MF_DELAYED;
 }
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+static void kvm_gmem_free_folio(struct folio *folio)
+{
+	struct page *page = folio_page(folio, 0);
+	kvm_pfn_t pfn = page_to_pfn(page);
+	int order = folio_order(folio);
+
+	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+}
+#endif
+
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 #ifdef CONFIG_MIGRATION
 	.migrate_folio	= kvm_gmem_migrate_folio,
 #endif
 	.error_remove_page = kvm_gmem_error_page,
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+	.free_folio = kvm_gmem_free_folio,
+#endif
 };
 
 static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
-- 
2.25.1



* [PATCH RFC gmem v1 5/8] KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
                   ` (3 preceding siblings ...)
  2023-10-16 11:50 ` [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 6/8] KVM: x86: Add KVM_X86_SNP_VM vm_type Michael Roth
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

In some cases the full 64-bit error code for the KVM page fault will be
needed to determine things like whether or not a fault was for a private
or shared guest page, so update related code to accept the full 64-bit
value so it can be plumbed all the way through to where it is needed.

The accessors of fault->error_code are changed as follows:

- FNAME(page_fault): change to explicitly use lower_32_bits() since that
                     is no longer done in kvm_mmu_page_fault()
- kvm_mmu_page_fault(): explicit mask with PFERR_RSVD_MASK,
                        PFERR_NESTED_GUEST_PAGE
- mmutrace: changed u32 -> u64

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#mbd0b20c9a2cf50319d5d2a27b63f73c772112076
[mdr: drop references/changes to code not in current gmem tree, update
      commit message]
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/kvm/mmu/mmu.c          | 3 +--
 arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
 arch/x86/kvm/mmu/mmutrace.h     | 2 +-
 arch/x86/kvm/mmu/paging_tmpl.h  | 2 +-
 4 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bcb812a7f563..686f88c263a9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5802,8 +5802,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	}
 
 	if (r == RET_PF_INVALID) {
-		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-					  lower_32_bits(error_code), false,
+		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
 					  &emulation_type);
 		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
 			return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 71ba4f833dc1..759c8b718201 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
 struct kvm_page_fault {
 	/* arguments to kvm_mmu_do_page_fault.  */
 	const gpa_t addr;
-	const u32 error_code;
+	const u64 error_code;
 	const bool prefetch;
 
 	/* Derived from error_code.  */
@@ -280,7 +280,7 @@ enum {
 };
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-					u32 err, bool prefetch, int *emulation_type)
+					u64 err, bool prefetch, int *emulation_type)
 {
 	struct kvm_page_fault fault = {
 		.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..195d98bc8de8 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -260,7 +260,7 @@ TRACE_EVENT(
 	TP_STRUCT__entry(
 		__field(int, vcpu_id)
 		__field(gpa_t, cr2_or_gpa)
-		__field(u32, error_code)
+		__field(u64, error_code)
 		__field(u64 *, sptep)
 		__field(u64, old_spte)
 		__field(u64, new_spte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..2f60f68f5f2d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -787,7 +787,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * The bit needs to be cleared before walking guest page tables.
 	 */
 	r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
-			     fault->error_code & ~PFERR_RSVD_MASK);
+			     lower_32_bits(fault->error_code) & ~PFERR_RSVD_MASK);
 
 	/*
 	 * The page is not mapped by the guest.  Let the guest handle it.
-- 
2.25.1



* [PATCH RFC gmem v1 6/8] KVM: x86: Add KVM_X86_SNP_VM vm_type
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
                   ` (4 preceding siblings ...)
  2023-10-16 11:50 ` [PATCH RFC gmem v1 5/8] KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 7/8] KVM: x86: Define RMP page fault error bits for #NPF Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type Michael Roth
  7 siblings, 0 replies; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

In some cases, such as detecting whether a page fault should be handled
as a private fault or not, KVM will need to handle things differently
versus the existing KVM_X86_SW_PROTECTED_VM type.

Add a new KVM_X86_SNP_VM to allow for this, along with a helper to query
the vm_type.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/include/uapi/asm/kvm.h | 1 +
 arch/x86/kvm/x86.c              | 8 +++++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dbec74783f48..cdc235277a6f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2089,6 +2089,8 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 #define kvm_arch_has_private_mem(kvm) false
 #endif
 
+bool kvm_is_vm_type(struct kvm *kvm, unsigned long type);
+
 static inline u16 kvm_read_ldt(void)
 {
 	u16 ldt;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a448d0964fc0..57e4ba484aa2 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -564,5 +564,6 @@ struct kvm_pmu_event_filter {
 
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_SW_PROTECTED_VM	1
+#define KVM_X86_SNP_VM		3
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e95c3a95e59..12f9e99c7ad0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4444,10 +4444,16 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
 static bool kvm_is_vm_type_supported(unsigned long type)
 {
 	return type == KVM_X86_DEFAULT_VM ||
-	       (type == KVM_X86_SW_PROTECTED_VM &&
+	       ((type == KVM_X86_SW_PROTECTED_VM ||
+		 type == KVM_X86_SNP_VM) &&
 		IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
 }
 
+bool kvm_is_vm_type(struct kvm *kvm, unsigned long type)
+{
+	return kvm->arch.vm_type == type;
+}
+
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
 	int r = 0;
-- 
2.25.1



* [PATCH RFC gmem v1 7/8] KVM: x86: Define RMP page fault error bits for #NPF
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
                   ` (5 preceding siblings ...)
  2023-10-16 11:50 ` [PATCH RFC gmem v1 6/8] KVM: x86: Add KVM_X86_SNP_VM vm_type Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2023-10-16 11:50 ` [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type Michael Roth
  7 siblings, 0 replies; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta,
	Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled globally, the hardware places restrictions on all
memory accesses based on the RMP entry, regardless of whether the
hypervisor or a VM performs the access. When hardware encounters an RMP
access violation
during a guest access, it will cause a #VMEXIT(NPF) with a number of
additional bits set to indicate the reasons for the #NPF. Define those
here.

See APM2 section 16.36.10 for more details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
[mdr: add some additional details to commit message]
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cdc235277a6f..fa401cb1a552 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -253,9 +253,13 @@ enum x86_intercept_stage;
 #define PFERR_FETCH_BIT 4
 #define PFERR_PK_BIT 5
 #define PFERR_SGX_BIT 15
+#define PFERR_GUEST_RMP_BIT 31
 #define PFERR_GUEST_FINAL_BIT 32
 #define PFERR_GUEST_PAGE_BIT 33
 #define PFERR_IMPLICIT_ACCESS_BIT 48
+#define PFERR_GUEST_ENC_BIT 34
+#define PFERR_GUEST_SIZEM_BIT 35
+#define PFERR_GUEST_VMPL_BIT 36
 
 #define PFERR_PRESENT_MASK	BIT(PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK	BIT(PFERR_WRITE_BIT)
@@ -267,6 +271,10 @@ enum x86_intercept_stage;
 #define PFERR_GUEST_FINAL_MASK	BIT_ULL(PFERR_GUEST_FINAL_BIT)
 #define PFERR_GUEST_PAGE_MASK	BIT_ULL(PFERR_GUEST_PAGE_BIT)
 #define PFERR_IMPLICIT_ACCESS	BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
+#define PFERR_GUEST_RMP_MASK	BIT_ULL(PFERR_GUEST_RMP_BIT)
+#define PFERR_GUEST_ENC_MASK	BIT_ULL(PFERR_GUEST_ENC_BIT)
+#define PFERR_GUEST_SIZEM_MASK	BIT_ULL(PFERR_GUEST_SIZEM_BIT)
+#define PFERR_GUEST_VMPL_MASK	BIT_ULL(PFERR_GUEST_VMPL_BIT)
 
 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
 				 PFERR_WRITE_MASK |		\
-- 
2.25.1



* [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type
  2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
                   ` (6 preceding siblings ...)
  2023-10-16 11:50 ` [PATCH RFC gmem v1 7/8] KVM: x86: Define RMP page fault error bits for #NPF Michael Roth
@ 2023-10-16 11:50 ` Michael Roth
  2024-01-31  1:13   ` Sean Christopherson
  7 siblings, 1 reply; 22+ messages in thread
From: Michael Roth @ 2023-10-16 11:50 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
determine whether an #NPF is due to a private/shared access by the guest.
Implement that handling here. Also add handling needed to deal with
SNP guests which in some cases will make MMIO accesses with the
encryption bit set.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/kvm/mmu/mmu.c          | 12 ++++++++++--
 arch/x86/kvm/mmu/mmu_internal.h | 20 +++++++++++++++++++-
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 686f88c263a9..10c323e2faa4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4327,6 +4327,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
+	bool private_fault = fault->is_private;
 	bool async;
 
 	/*
@@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
-	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+	/*
+	 * In some cases SNP guests will make MMIO accesses with the encryption
+	 * bit set. Handle these via the normal MMIO fault path.
+	 */
+	if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
+		private_fault = false;
+
+	if (private_fault != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}
 
-	if (fault->is_private)
+	if (private_fault)
 		return kvm_faultin_pfn_private(vcpu, fault);
 
 	async = false;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 759c8b718201..e5b973051ad9 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -251,6 +251,24 @@ struct kvm_page_fault {
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
+static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
+{
+	bool private_fault = false;
+
+	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
+		private_fault = !!(err & PFERR_GUEST_ENC_MASK);
+	} else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
+		/*
+		 * This handling is for gmem self-tests and guests that treat
+		 * userspace as the authority on whether a fault should be
+		 * private or not.
+		 */
+		private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
+	}
+
+	return private_fault;
+}
+
 /*
  * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
  * and of course kvm_mmu_do_page_fault().
@@ -298,7 +316,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
-		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+		.is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
 	};
 	int r;
 
-- 
2.25.1



* Re: [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type
  2023-10-16 11:50 ` [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type Michael Roth
@ 2024-01-31  1:13   ` Sean Christopherson
  2024-02-08  0:24     ` Michael Roth
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2024-01-31  1:13 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, isaku.yamahata, ackerleytng, vbabka,
	ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

On Mon, Oct 16, 2023, Michael Roth wrote:
> For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
> determine whether an #NPF is due to a private/shared access by the guest.
> Implement that handling here. Also add handling needed to deal with
> SNP guests which in some cases will make MMIO accesses with the
> encryption bit set.

...

> @@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			return RET_PF_EMULATE;
>  	}
>  
> -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> +	/*
> +	 * In some cases SNP guests will make MMIO accesses with the encryption
> +	 * bit set. Handle these via the normal MMIO fault path.
> +	 */
> +	if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
> +		private_fault = false;

Why?  This is inarguably a guest bug.

> +	if (private_fault != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
>  		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>  		return -EFAULT;
>  	}
>  
> -	if (fault->is_private)
> +	if (private_fault)
>  		return kvm_faultin_pfn_private(vcpu, fault);
>  
>  	async = false;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 759c8b718201..e5b973051ad9 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -251,6 +251,24 @@ struct kvm_page_fault {
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  
> +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> +{
> +	bool private_fault = false;
> +
> +	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
> +		private_fault = !!(err & PFERR_GUEST_ENC_MASK);
> +	} else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
> +		/*
> +		 * This handling is for gmem self-tests and guests that treat
> +		 * userspace as the authority on whether a fault should be
> +		 * private or not.
> +		 */
> +		private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +	}

This can be more simply:

	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
		return !!(err & PFERR_GUEST_ENC_MASK);

	if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
		return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);


* Re: [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type
  2024-01-31  1:13   ` Sean Christopherson
@ 2024-02-08  0:24     ` Michael Roth
  2024-02-08 17:27       ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Roth @ 2024-02-08  0:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, isaku.yamahata, ackerleytng, vbabka,
	ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta,
	thomas.lendacky

On Tue, Jan 30, 2024 at 05:13:00PM -0800, Sean Christopherson wrote:
> On Mon, Oct 16, 2023, Michael Roth wrote:
> > For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
> > determine whether an #NPF is due to a private/shared access by the guest.
> > Implement that handling here. Also add handling needed to deal with
> > SNP guests which in some cases will make MMIO accesses with the
> > encryption bit set.
> 
> ...
> 
> > @@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > +	/*
> > +	 * In some cases SNP guests will make MMIO accesses with the encryption
> > +	 * bit set. Handle these via the normal MMIO fault path.
> > +	 */
> > +	if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
> > +		private_fault = false;
> 
> Why?  This is inarguably a guest bug.

AFAICT this isn't explicitly disallowed by the SNP spec. There was
however a set of security mitigations for SEV-ES that resulted in this
behavior being highly discouraged in Linux guest code:

  https://lkml.org/lkml/2020/10/20/464  

as well as OVMF guest code:

  https://edk2.groups.io/g/devel/message/69948

However the OVMF guest code still allows 1 exception for accesses to the
local APIC base address, which is the only case I'm aware of that
triggers this condition:

  https://github.com/tianocore/edk2/blob/master/OvmfPkg/Library/CcExitLib/CcExitVcHandler.c#L100

I think the rationale there is that if the guest absolutely *knows* that
encrypted information is not stored at a particular MMIO address, then
it can selectively choose to allow for exceptional cases like these. So
KVM would need to allow for these cases in order to be fully compatible
with existing SNP guests that do this.

> 
> > +	if (private_fault != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> >  		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> >  		return -EFAULT;
> >  	}
> >  
> > -	if (fault->is_private)
> > +	if (private_fault)
> >  		return kvm_faultin_pfn_private(vcpu, fault);
> >  
> >  	async = false;
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 759c8b718201..e5b973051ad9 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -251,6 +251,24 @@ struct kvm_page_fault {
> >  
> >  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >  
> > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> > +{
> > +	bool private_fault = false;
> > +
> > +	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
> > +		private_fault = !!(err & PFERR_GUEST_ENC_MASK);
> > +	} else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
> > +		/*
> > +		 * This handling is for gmem self-tests and guests that treat
> > +		 * userspace as the authority on whether a fault should be
> > +		 * private or not.
> > +		 */
> > +		private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > +	}
> 
> This can be more simply:
> 
> 	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
> 		return !!(err & PFERR_GUEST_ENC_MASK);
> 
> 	if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
> 		return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> 

Yes, indeed. But TDX has taken a different approach for the SW_PROTECTED_VM
case where they do this check in kvm_mmu_page_fault() and then synthesize
the PFERR_GUEST_ENC_MASK into error_code before calling
kvm_mmu_do_page_fault(). It's not in the v18 patchset AFAICT, but it's
in the tdx-upstream git branch that corresponds to it:

  https://github.com/intel/tdx/commit/3717a903ef453aa7b62e7eb65f230566b7f158d4

Would you prefer that SNP adopt the same approach?

-Mike


* Re: [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory
  2023-10-16 11:50 ` [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory Michael Roth
@ 2024-02-08 10:57   ` Suzuki K Poulose
  2024-02-08 17:29     ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Suzuki K Poulose @ 2024-02-08 10:57 UTC (permalink / raw)
  To: Michael Roth, kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

Hi,

On 16/10/2023 12:50, Michael Roth wrote:
> guest_memfd pages are generally expected to be in some arch-defined
> initial state prior to using them for guest memory. For SEV-SNP this
> initial state is 'private', or 'guest-owned', and requires additional
> operations to move these pages into a 'private' state by updating the
> corresponding entries in the RMP table.
> 
> Allow for an arch-defined hook to handle updates of this sort, and go
> ahead and implement one for x86 so KVM implementations like AMD SVM can
> register a kvm_x86_ops callback to handle these updates for SEV-SNP
> guests.
> 
> The preparation callback is always called when allocating/grabbing
> folios via gmem, and it is up to the architecture to keep track of
> whether or not the pages are already in the expected state (e.g. the RMP
> table in the case of SEV-SNP).
> 
> In some cases, it is necessary to defer the preparation of the pages to
> handle things like in-place encryption of initial guest memory payloads
> before marking these pages as 'private'/'guest-owned', so also add a
> helper that performs the same function as kvm_gmem_get_pfn(), but allows
> for the preparation callback to be bypassed to allow for pages to be
> accessed beforehand.

This will be useful for Arm CCA, where the pages need to be moved into
"Realm state". Some minor comments below.

> 
> Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>   arch/x86/include/asm/kvm-x86-ops.h |  1 +
>   arch/x86/include/asm/kvm_host.h    |  2 ++
>   arch/x86/kvm/x86.c                 |  6 ++++
>   include/linux/kvm_host.h           | 14 ++++++++
>   virt/kvm/Kconfig                   |  4 +++
>   virt/kvm/guest_memfd.c             | 56 +++++++++++++++++++++++++++---
>   6 files changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index e3054e3e46d5..0c113f42d5c7 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
>   KVM_X86_OP(complete_emulated_msr)
>   KVM_X86_OP(vcpu_deliver_sipi_vector)
>   KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> +KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
>   
>   #undef KVM_X86_OP
>   #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 95018cc653f5..66fc89d1858f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1752,6 +1752,8 @@ struct kvm_x86_ops {
>   	 * Returns vCPU specific APICv inhibit reasons
>   	 */
>   	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
> +
> +	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
>   };
>   
>   struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 767236b4d771..33a4cc33d86d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13301,6 +13301,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
>   }
>   EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
>   
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
> +{
> +	return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
> +}
> +#endif
>   
>   int kvm_spec_ctrl_test_value(u64 value)
>   {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8c5c017ab4e9..c7f82c2f1bcf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2403,9 +2403,19 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>   #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
>   
>   #ifdef CONFIG_KVM_PRIVATE_MEM
> +int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
>   int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   			      gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
>   #else
> +static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
> +				     struct kvm_memory_slot *slot, gfn_t gfn,
> +				     kvm_pfn_t *pfn, int *max_order)

Missing "bool prep" here ?

minor nit: Do we need to export both __kvm_gmem_get_pfn and 
kvm_gmem_get_pfn ? I don't see anyone else using the former.

We could have :

#ifdef CONFIG_KVM_PRIVATE_MEM
int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
#else
static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
				    struct kvm_memory_slot *slot, gfn_t gfn,
				    kvm_pfn_t *pfn, int *max_order,
				    bool prep)
{
	KVM_BUG_ON(1, kvm);
	return -EIO;
}
#endif

static inline int kvm_gmem_get_pfn(struct kvm *kvm,
				 struct kvm_memory_slot *slot, gfn_t gfn,
				kvm_pfn_t *pfn, int *max_order)
{
	return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
}


Suzuki

> +	KVM_BUG_ON(1, kvm);
> +	return -EIO;
> +}
> +
>   static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>   				   struct kvm_memory_slot *slot, gfn_t gfn,
>   				   kvm_pfn_t *pfn, int *max_order)
> @@ -2415,4 +2425,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>   }
>   #endif /* CONFIG_KVM_PRIVATE_MEM */
>   
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> +#endif
> +
>   #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 2c964586aa14..992cf6ed86ef 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,3 +109,7 @@ config KVM_GENERIC_PRIVATE_MEM
>          select KVM_GENERIC_MEMORY_ATTRIBUTES
>          select KVM_PRIVATE_MEM
>          bool
> +
> +config HAVE_KVM_GMEM_PREPARE
> +       bool
> +       depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index f6f1b17a319c..72ff8b7b31d5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -44,7 +44,40 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
>   #endif
>   }
>   
> -static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> +static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
> +{
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +	struct list_head *gmem_list = &inode->i_mapping->private_list;
> +	struct kvm_gmem *gmem;
> +
> +	list_for_each_entry(gmem, gmem_list, entry) {
> +		struct kvm_memory_slot *slot;
> +		struct kvm *kvm = gmem->kvm;
> +		struct page *page;
> +		kvm_pfn_t pfn;
> +		gfn_t gfn;
> +		int rc;
> +
> +		slot = xa_load(&gmem->bindings, index);
> +		if (!slot)
> +			continue;
> +
> +		page = folio_file_page(folio, index);
> +		pfn = page_to_pfn(page);
> +		gfn = slot->base_gfn + index - slot->gmem.pgoff;
> +		rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
> +		if (rc) {
> +			pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
> +					    index, rc);
> +			return rc;
> +		}
> +	}
> +
> +#endif
> +	return 0;
> +}
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prep)
>   {
>   	struct folio *folio;
>   
> @@ -74,6 +107,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   		folio_mark_uptodate(folio);
>   	}
>   
> +	if (prep && kvm_gmem_prepare_folio(inode, index, folio)) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		return NULL;
> +	}
> +
>   	/*
>   	 * Ignore accessed, referenced, and dirty flags.  The memory is
>   	 * unevictable and there is no storage to write back to.
> @@ -178,7 +217,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>   			break;
>   		}
>   
> -		folio = kvm_gmem_get_folio(inode, index);
> +		folio = kvm_gmem_get_folio(inode, index, true);
>   		if (!folio) {
>   			r = -ENOMEM;
>   			break;
> @@ -537,8 +576,8 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
>   	fput(file);
>   }
>   
> -int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> -		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
>   {
>   	pgoff_t index, huge_index;
>   	struct kvm_gmem *gmem;
> @@ -559,7 +598,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		goto out_fput;
>   	}
>   
> -	folio = kvm_gmem_get_folio(file_inode(file), index);
> +	folio = kvm_gmem_get_folio(file_inode(file), index, prep);
>   	if (!folio) {
>   		r = -ENOMEM;
>   		goto out_fput;
> @@ -600,4 +639,11 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   
>   	return r;
>   }
> +EXPORT_SYMBOL_GPL(__kvm_gmem_get_pfn);
> +
> +int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +{
> +	return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
> +}
>   EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type
  2024-02-08  0:24     ` Michael Roth
@ 2024-02-08 17:27       ` Sean Christopherson
  2024-02-08 17:30         ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2024-02-08 17:27 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, isaku.yamahata, ackerleytng, vbabka,
	ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta,
	thomas.lendacky

On Wed, Feb 07, 2024, Michael Roth wrote:
> On Tue, Jan 30, 2024 at 05:13:00PM -0800, Sean Christopherson wrote:
> > On Mon, Oct 16, 2023, Michael Roth wrote:
> > > For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
> > > determine whether an #NPF is due to a private/shared access by the guest.
> > > Implement that handling here. Also add handling needed to deal with
> > > SNP guests which in some cases will make MMIO accesses with the
> > > encryption bit.
> > 
> > ...
> > 
> > > @@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  			return RET_PF_EMULATE;
> > >  	}
> > >  
> > > -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > +	/*
> > > +	 * In some cases SNP guests will make MMIO accesses with the encryption
> > > +	 * bit set. Handle these via the normal MMIO fault path.
> > > +	 */
> > > +	if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
> > > +		private_fault = false;
> > 
> > Why?  This is inarguably a guest bug.
> 
> AFAICT this isn't explicitly disallowed by the SNP spec.

There are _lots_ of things that aren't explicitly disallowed by the APM, that
doesn't mean that _KVM_ needs to actively support them.

I am *not* taking on more broken crud in KVM to work around OVMF's stupidity; the
KVM_X86_QUIRK_CD_NW_CLEARED has taken up literally days of my time at this point.

> So KVM would need to allow for these cases in order to be fully compatible
> with existing SNP guests that do this.

No.  KVM does not yet support SNP, so as far as KVM's ABI goes, there are no
existing guests.  Yes, I realize that I am burying my head in the sand to some
extent, but it is simply not sustainable for KVM to keep trying to pick up the
pieces of poorly defined hardware specs and broken guest firmware.

> > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> > > +{
> > > +	bool private_fault = false;
> > > +
> > > +	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
> > > +		private_fault = !!(err & PFERR_GUEST_ENC_MASK);
> > > +	} else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
> > > +		/*
> > > +		 * This handling is for gmem self-tests and guests that treat
> > > +		 * userspace as the authority on whether a fault should be
> > > +		 * private or not.
> > > +		 */
> > > +		private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > > +	}
> > 
> > This can be more simply:
> > 
> > 	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
> > 		return !!(err & PFERR_GUEST_ENC_MASK);
> > 
> > 	if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
> > 		return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > 
> 
> Yes, indeed. But TDX has taken a different approach for SW_PROTECTED_VM
> case where they do this check in kvm_mmu_page_fault() and then synthesize
> the PFERR_GUEST_ENC_MASK into error_code before calling
> kvm_mmu_do_page_fault(). It's not in the v18 patchset AFAICT, but it's
> in the tdx-upstream git branch that corresponds to it:
> 
>   https://github.com/intel/tdx/commit/3717a903ef453aa7b62e7eb65f230566b7f158d4
> 
> Would you prefer that SNP adopt the same approach?

Ah, yes, 'twas my suggestion in the first place.  FWIW, I was just reviewing the
literal code here and wasn't paying much attention to the content.

https://lore.kernel.org/all/f474282d701aca7af00e4f7171445abb5e734c6f.1689893403.git.isaku.yamahata@intel.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory
  2024-02-08 10:57   ` Suzuki K Poulose
@ 2024-02-08 17:29     ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2024-02-08 17:29 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-crypto, x86,
	linux-kernel, linux-fsdevel, pbonzini, isaku.yamahata,
	ackerleytng, vbabka, ashish.kalra, nikunj.dadhania, jroedel,
	pankaj.gupta

On Thu, Feb 08, 2024, Suzuki K Poulose wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 8c5c017ab4e9..c7f82c2f1bcf 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2403,9 +2403,19 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> >   #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> >   #ifdef CONFIG_KVM_PRIVATE_MEM
> > +int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
> >   int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >   			      gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
> >   #else
> > +static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
> > +				     struct kvm_memory_slot *slot, gfn_t gfn,
> > +				     kvm_pfn_t *pfn, int *max_order)
> 
> Missing "bool prep" here ?
> 
> minor nit: Do we need to export both __kvm_gmem_get_pfn and kvm_gmem_get_pfn

Minor nit on the nit: s/export/expose.  My initial reaction was "we should *never*
export any of these" :-)

> ? I don't see anyone else using the former.
> 
> We could have :
> 
> #ifdef CONFIG_KVM_PRIVATE_MEM
> int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> 		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
> #else
> static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
> 				    struct kvm_memory_slot *slot, gfn_t gfn,
> 				    kvm_pfn_t *pfn, int *max_order,
> 				    bool prep)
> {
> 	KVM_BUG_ON(1, kvm);
> 	return -EIO;
> }
> #endif
> 
> static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> 				 struct kvm_memory_slot *slot, gfn_t gfn,
> 				kvm_pfn_t *pfn, int *max_order)
> {
> 	return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
> }

I suspect all of this will be moot.  As discussed on the PUCK call[1] and in the
SNP enabling series[2], the plan is to have guest_memfd do (or at least initiate)
the actual copying into the backing pages, e.g. to guarantee that the pages are
in the correct state, that the appropriate locks are held, etc.

[1] https://drive.google.com/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ&usp=drive_link
[2] https://lore.kernel.org/all/ZcLuGxZ-w4fPmFxd@google.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type
  2024-02-08 17:27       ` Sean Christopherson
@ 2024-02-08 17:30         ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2024-02-08 17:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-crypto, x86,
	linux-kernel, linux-fsdevel, isaku.yamahata, ackerleytng, vbabka,
	ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta,
	thomas.lendacky

On Thu, Feb 8, 2024 at 6:27 PM Sean Christopherson <seanjc@google.com> wrote:
> No.  KVM does not yet support SNP, so as far as KVM's ABI goes, there are no
> existing guests.  Yes, I realize that I am burying my head in the sand to some
> extent, but it is simply not sustainable for KVM to keep trying to pick up the
> pieces of poorly defined hardware specs and broken guest firmware.

101% agreed. There are cases in which we have to and should bend over
backwards for guests (e.g. older Linux kernels), but not for
code that---according to current practices---is chosen by the host
admin.

(I am of the opinion that "bring your own firmware" is the only sane
way to handle attestation/measurement, but that's not how things are
done currently).

Paolo

> > > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> > > > +{
> > > > + bool private_fault = false;
> > > > +
> > > > + if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
> > > > +         private_fault = !!(err & PFERR_GUEST_ENC_MASK);
> > > > + } else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
> > > > +         /*
> > > > +          * This handling is for gmem self-tests and guests that treat
> > > > +          * userspace as the authority on whether a fault should be
> > > > +          * private or not.
> > > > +          */
> > > > +         private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > > > + }
> > >
> > > This can be more simply:
> > >
> > >     if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
> > >             return !!(err & PFERR_GUEST_ENC_MASK);
> > >
> > >     if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
> > >             return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > >
> >
> > Yes, indeed. But TDX has taken a different approach for SW_PROTECTED_VM
> > case where they do this check in kvm_mmu_page_fault() and then synthesize
> > the PFERR_GUEST_ENC_MASK into error_code before calling
> > kvm_mmu_do_page_fault(). It's not in the v18 patchset AFAICT, but it's
> > in the tdx-upstream git branch that corresponds to it:
> >
> >   https://github.com/intel/tdx/commit/3717a903ef453aa7b62e7eb65f230566b7f158d4
> >
> > Would you prefer that SNP adopt the same approach?
>
> Ah, yes, 'twas my suggestion in the first place.  FWIW, I was just reviewing the
> literal code here and wasn't paying much attention to the content.
>
> https://lore.kernel.org/all/f474282d701aca7af00e4f7171445abb5e734c6f.1689893403.git.isaku.yamahata@intel.com
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2023-10-16 11:50 ` [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory Michael Roth
@ 2024-02-09 10:11   ` Steven Price
  2024-02-09 14:28     ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Price @ 2024-02-09 10:11 UTC (permalink / raw)
  To: Michael Roth, kvm, Suzuki K Poulose, tabba
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, seanjc, isaku.yamahata, ackerleytng,
	vbabka, ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta

On 16/10/2023 12:50, Michael Roth wrote:
> In some cases, like with SEV-SNP, guest memory needs to be updated in a
> platform-specific manner before it can be safely freed back to the host.
> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> allow for special handling of this sort when freeing memory in response
> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> ahead and define an arch-specific hook for x86 since it will be needed
> for handling memory used for SEV-SNP guests.

Hi all,

Arm CCA has a similar need to prepare/unprepare memory (granule
delegate/undelegate using our terminology) before it is used for
protected memory.

However I see a problem with the current gmem implementation that the
"invalidations" are not precise enough for our RMI API. When punching a
hole in the memfd the code currently hits the same path (ending in
kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
the shared version). The Arm CCA architecture doesn't allow the
protected memory to be removed and refaulted without the permission of
the guest (the memory contents would be wiped in this case).

One option that I've considered is to implement a separate CCA ioctl to
notify KVM whether the memory should be mapped protected. The
invalidations would then be ignored on ranges that are currently
protected for this guest.

This 'solves' the problem nicely except for the case where the VMM
deliberately punches holes in memory which the guest is using.

The issue in this case is that there's no way of failing the punch hole
operation - we can detect that the memory is in use and shouldn't be
freed, but this callback doesn't give the opportunity to actually block
the freeing of the memory.

Sadly there's no easy way to map from a physical page in a gmem back to
which VM (and where in the VM) the page is mapped. So actually ripping
the page out of the appropriate VM isn't really possible in this case.

How is this situation handled on x86? Is it possible to invalidate and
then refault a protected page without affecting the memory contents? My
guess is yes and that is a CCA specific problem - is my understanding
correct?

My current thoughts for CCA are one of three options:

1. Represent shared and protected memory as two separate memslots. This
matches the underlying architecture more closely (the top address bit is
repurposed as a 'shared' flag), but I don't like it because it's a
deviation from other CoCo architectures (notably pKVM).

2. Allow punch-hole to fail on CCA if the memory is mapped into the
guest's protected space. Again, this is CCA being different and also
creates nasty corner cases where the gmem descriptor could have to
outlive the VMM - so looks like a potential source of memory leaks.

3. 'Fix' the invalidation to provide more precise semantics. I haven't
yet prototyped it but it might be possible to simply provide a flag from
kvm_gmem_invalidate_begin specifying that the invalidation is for the
protected memory. KVM would then only unmap the protected memory when
this flag is set (avoiding issues with VMA updates causing spurious unmaps).

Fairly obviously (3) is my preferred option, but it relies on the
guarantees that the "invalidation" is actually a precise set of
addresses where the memory is actually being freed.

Comments, thoughts, objections welcome!

Steve

> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  1 +
>  arch/x86/kvm/x86.c                 |  7 +++++++
>  include/linux/kvm_host.h           |  4 ++++
>  virt/kvm/Kconfig                   |  4 ++++
>  virt/kvm/guest_memfd.c             | 14 ++++++++++++++
>  6 files changed, 31 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 0c113f42d5c7..f1505a5fa781 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
>  KVM_X86_OP(vcpu_deliver_sipi_vector)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
>  KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL(gmem_invalidate)
>  
>  #undef KVM_X86_OP
>  #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 66fc89d1858f..dbec74783f48 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
>  	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>  
>  	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> +	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 33a4cc33d86d..0e95c3a95e59 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
>  }
>  #endif
>  
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> +{
> +	static_call_cond(kvm_x86_gmem_invalidate)(start, end);
> +}
> +#endif
> +
>  int kvm_spec_ctrl_test_value(u64 value)
>  {
>  	/*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c7f82c2f1bcf..840a5be5962a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
>  #endif
>  
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> +#endif
> +
>  #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 992cf6ed86ef..7fd1362a7ebe 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
>  config HAVE_KVM_GMEM_PREPARE
>         bool
>         depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_GMEM_INVALIDATE
> +       bool
> +       depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 72ff8b7b31d5..b4c4df259fb8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
>  	return MF_DELAYED;
>  }
>  
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +static void kvm_gmem_free_folio(struct folio *folio)
> +{
> +	struct page *page = folio_page(folio, 0);
> +	kvm_pfn_t pfn = page_to_pfn(page);
> +	int order = folio_order(folio);
> +
> +	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> +}
> +#endif
> +
>  static const struct address_space_operations kvm_gmem_aops = {
>  	.dirty_folio = noop_dirty_folio,
>  #ifdef CONFIG_MIGRATION
>  	.migrate_folio	= kvm_gmem_migrate_folio,
>  #endif
>  	.error_remove_page = kvm_gmem_error_page,
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +	.free_folio = kvm_gmem_free_folio,
> +#endif
>  };
>  
>  static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-02-09 10:11   ` Steven Price
@ 2024-02-09 14:28     ` Sean Christopherson
  2024-02-09 15:02       ` Steven Price
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2024-02-09 14:28 UTC (permalink / raw)
  To: Steven Price
  Cc: Michael Roth, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

On Fri, Feb 09, 2024, Steven Price wrote:
> On 16/10/2023 12:50, Michael Roth wrote:
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> > allow for special handling of this sort when freeing memory in response
> > to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> > ahead and define an arch-specific hook for x86 since it will be needed
> > for handling memory used for SEV-SNP guests.
> 
> Hi all,
> 
> Arm CCA has a similar need to prepare/unprepare memory (granule
> delegate/undelegate using our terminology) before it is used for
> protected memory.
> 
> However I see a problem with the current gmem implementation that the
> "invalidations" are not precise enough for our RMI API. When punching a
> hole in the memfd the code currently hits the same path (ending in
> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
> the shared version).
>
> The Arm CCA architecture doesn't allow the protected memory to be removed and
> refaulted without the permission of the guest (the memory contents would be
> wiped in this case).

TDX behaves almost exactly like CCA.  Well, that's not technically true, strictly
speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
!PRESENT, but those aren't in play for invalidation events like this.

SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
would do, is destructive, so SNP also behaves the same way for all intents and
purposes.

> One option that I've considered is to implement a seperate CCA ioctl to
> notify KVM whether the memory should be mapped protected.

That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
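
A minimal sketch of that from the VMM side, assuming the uAPI as it landed
with the guest_memfd series (set_range_private() is just an illustrative
helper name; error handling omitted):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Mark [gpa, gpa + size) private; later faults are then served from gmem. */
	static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
	{
		struct kvm_memory_attributes attrs = {
			.address    = gpa,
			.size       = size,
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		};

		return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}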

> The invalidations would then be ignored on ranges that are currently
> protected for this guest.

That's backwards.  Invalidations on a guest_memfd should affect only *protected*
mappings.  And for that, the plan/proposal is to plumb only_{shared,private} flags
into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
mappings, and mmu_notifier invalidation don't zap private mappings.  Sample usage
in the TDX context[2] (disclaimer, I'm pretty sure I didn't write most of that
patch; I only provided a rough sketch).

[1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@google.com
[2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com
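
For reference, a rough sketch of that plumbing, with the flag names taken
from the patch in [1]; kvm_gmem_invalidate_range() below is a made-up
stand-in for the per-binding loop in kvm_gmem_invalidate_begin(), shown
only to illustrate the intent (gmem invalidations touch private mappings
only):

	struct kvm_gfn_range {
		struct kvm_memory_slot *slot;
		gfn_t start;
		gfn_t end;
		/* ... */
		bool only_private;	/* zap only gmem-backed (private) mappings */
		bool only_shared;	/* zap only host-mapped (shared) mappings */
		bool may_block;
	};

	/* guest_memfd side: PUNCH_HOLE/truncate should only zap private mappings */
	static void kvm_gmem_invalidate_range(struct kvm *kvm,
					      struct kvm_memory_slot *slot,
					      gfn_t start, gfn_t end)
	{
		struct kvm_gfn_range gfn_range = {
			.slot		= slot,
			.start		= start,
			.end		= end,
			.only_private	= true,
			.only_shared	= false,
			.may_block	= true,
		};

		if (kvm_mmu_unmap_gfn_range(kvm, &gfn_range))
			kvm_flush_remote_tlbs(kvm);
	}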

> This 'solves' the problem nicely except for the case where the VMM
> deliberately punches holes in memory which the guest is using.

I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
so don't do that.

> The issue in this case is that there's no way of failing the punch hole
> operation - we can detect that the memory is in use and shouldn't be
> freed, but this callback doesn't give the opportunity to actually block
> the freeing of the memory.

Why is this KVM's problem?  E.g. the same exact thing happens without guest_memfd
if userspace munmap()s memory the guest is using.

> Sadly there's no easy way to map from a physical page in a gmem back to
> which VM (and where in the VM) the page is mapped. So actually ripping
> the page out of the appropriate VM isn't really possible in this case.

I don't follow.  guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
not know what exactly needs to be invalidated?

> How is this situation handled on x86? Is it possible to invalidate and
> then refault a protected page without affecting the memory contents? My
> guess is yes and that is a CCA specific problem - is my understanding
> correct?
> 
> My current thoughts for CCA are one of three options:
> 
> 1. Represent shared and protected memory as two separate memslots. This
> matches the underlying architecture more closely (the top address bit is
> repurposed as a 'shared' flag), but I don't like it because it's a
> deviation from other CoCo architectures (notably pKVM).
> 
> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
> guest's protected space. Again, this is CCA being different and also
> creates nasty corner cases where the gmem descriptor could have to
> outlive the VMM - so looks like a potential source of memory leaks.
> 
> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
> yet prototyped it but it might be possible to simply provide a flag from
> kvm_gmem_invalidate_begin specifying that the invalidation is for the
> protected memory. KVM would then only unmap the protected memory when
> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
> 
> Fairly obviously (3) is my preferred option, but it relies on the
> guarantees that the "invalidation" is actually a precise set of
> addresses where the memory is actually being freed.

#3 is what we are planning for x86, and except for the only_{shared,private} flags,
the requisite functionality should already be in Linus' tree, though it does need
to be wired up for ARM.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-02-09 14:28     ` Sean Christopherson
@ 2024-02-09 15:02       ` Steven Price
  2024-02-09 15:13         ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Price @ 2024-02-09 15:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

Hi Sean,

Thanks for the reply.

On 09/02/2024 14:28, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
>> On 16/10/2023 12:50, Michael Roth wrote:
>>> In some cases, like with SEV-SNP, guest memory needs to be updated in a
>>> platform-specific manner before it can be safely freed back to the host.
>>> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
>>> allow for special handling of this sort when freeing memory in response
>>> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
>>> ahead and define an arch-specific hook for x86 since it will be needed
>>> for handling memory used for SEV-SNP guests.
>>
>> Hi all,
>>
>> Arm CCA has a similar need to prepare/unprepare memory (granule
>> delegate/undelegate using our terminology) before it is used for
>> protected memory.
>>
>> However I see a problem with the current gmem implementation that the
>> "invalidations" are not precise enough for our RMI API. When punching a
>> hole in the memfd the code currently hits the same path (ending in
>> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
>> the shared version).
>>
>> The Arm CCA architecture doesn't allow the protected memory to be removed and
>> refaulted without the permission of the guest (the memory contents would be
>> wiped in this case).
> 
> TDX behaves almost exactly like CCA.  Well, that's not technically true, strictly
> speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
> !PRESENT, but those aren't in play for invalidation events like this.

Ok, great I was under the impression they were similar.

> SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
> would do, is destructive, so SNP also behaves the same way for all intents and
> purposes.

Zapping page table mappings is what the invalidate calls imply. This is
something CCA can't do. Obviously fully removing the page would be
destructive.

>> One option that I've considered is to implement a seperate CCA ioctl to
>> notify KVM whether the memory should be mapped protected.
> 
> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

Sorry, I really didn't explain that well. Yes effectively this is the
attribute flag, but there's corner cases for destruction of the VM. My
thought was that if the VMM wanted to tear down part of the protected
range (without making it shared) then a separate ioctl would be needed
to notify KVM of the unmap.

>> The invalidations would then be ignored on ranges that are currently
>> protected for this guest.
> 
> That's backwards.  Invalidations on a guest_memfd should affect only *protected*
> mappings.  And for that, the plan/proposal is to plumb only_{shared,private} flags
> into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
> mappings, and mmu_notifier invalidation don't zap private mappings.  Sample usage
> in the TDX context[2] (disclaimer, I'm pretty sure I didn't write most of that
> patch despite, I only provided a rough sketch).

Aha, this sounds much like my option 3 below - a way to tell if the
invalidate comes from guest_memfd as opposed to VMA changes.

> [1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@google.com
> [2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com
> 
>> This 'solves' the problem nicely except for the case where the VMM
>> deliberately punches holes in memory which the guest is using.
> 
> I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> so don't do that.

A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
my concern here is a VMM which is trying to break the host. In this case
either the PUNCH_HOLE needs to fail, or we actually need to recover the
memory from the guest (effectively killing the guest in the process).

>> The issue in this case is that there's no way of failing the punch hole
>> operation - we can detect that the memory is in use and shouldn't be
>> freed, but this callback doesn't give the opportunity to actually block
>> the freeing of the memory.
> 
> Why is this KVM's problem?  E.g. the same exact thing happens without guest_memfd
> if userspace munmap()s memory the guest is using.

Indeed. The difference here is that for a normal non-realm guest the
pages can be removed from the page-table and refaulted on a later
access. Indeed, there's nothing stopping the VMM from freeing the
pages and reallocating them later.

For a realm guest if the memory is pulled from the guest then the guest
is effectively dead (at least until migration is implemented but even
then there's going to be a specific controlled mechanism).

>> Sadly there's no easy way to map from a physical page in a gmem back to
>> which VM (and where in the VM) the page is mapped. So actually ripping
>> the page out of the appropriate VM isn't really possible in this case.
> 
> I don't follow.  guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
> not know what exactly needs to be invalidated?

At the point that gmem calls kvm_mmu_unmap_gfn_range() the fact that the
range is a gmem is lost.

>> How is this situation handled on x86? Is it possible to invalidate and
>> then refault a protected page without affecting the memory contents? My
>> guess is yes and that is a CCA specific problem - is my understanding
>> correct?
>>
>> My current thoughts for CCA are one of three options:
>>
>> 1. Represent shared and protected memory as two separate memslots. This
>> matches the underlying architecture more closely (the top address bit is
>> repurposed as a 'shared' flag), but I don't like it because it's a
>> deviation from other CoCo architectures (notably pKVM).
>>
>> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
>> guest's protected space. Again, this is CCA being different and also
>> creates nasty corner cases where the gmem descriptor could have to
>> outlive the VMM - so looks like a potential source of memory leaks.
>>
>> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
>> yet prototyped it but it might be possible to simply provide a flag from
>> kvm_gmem_invalidate_begin specifying that the invalidation is for the
>> protected memory. KVM would then only unmap the protected memory when
>> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
>>
>> Fairly obviously (3) is my preferred option, but it relies on the
>> guarantees that the "invalidation" is actually a precise set of
>> addresses where the memory is actually being freed.
> 
> #3 is what we are planning for x86, and except for the only_{shared,private} flags,
> the requisite functionality should already be in Linus' tree, though it does need
> to be wired up for ARM.

Thanks, looks like the only_{shared,private} flags should do it. My only
worry about that solution was that it implicitly changes the
"invalidation" when only_private==1 to a precise list of pages that are
to be unmapped. Whereas for a normal guest it's only a performance issue
if a larger region is invalidated, for a CoCo guest it would be fatal to
the guest.

I'll cherry-pick the "KVM: Add new members to struct kvm_gfn_range to
operate on" patch from the TDX tree as I think this should do the trick.
I have hacked up something similar and it looks like it should work.

Thanks,

Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-02-09 15:02       ` Steven Price
@ 2024-02-09 15:13         ` Sean Christopherson
  2024-03-11 17:24           ` Michael Roth
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2024-02-09 15:13 UTC (permalink / raw)
  To: Steven Price
  Cc: Michael Roth, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

On Fri, Feb 09, 2024, Steven Price wrote:
> >> One option that I've considered is to implement a seperate CCA ioctl to
> >> notify KVM whether the memory should be mapped protected.
> > 
> > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> 
> Sorry, I really didn't explain that well. Yes effectively this is the
> attribute flag, but there's corner cases for destruction of the VM. My
> thought was that if the VMM wanted to tear down part of the protected
> range (without making it shared) then a separate ioctl would be needed
> to notify KVM of the unmap.

No new uAPI should be needed, because the only time a benign VMM should
do this is if the guest also knows the memory is being removed, in which case
PUNCH_HOLE will suffice.

> >> This 'solves' the problem nicely except for the case where the VMM
> >> deliberately punches holes in memory which the guest is using.
> > 
> > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > so don't do that.
> 
> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> my concern here is a VMM which is trying to break the host. In this case
> either the PUNCH_HOLE needs to fail, or we actually need to recover the
> memory from the guest (effectively killing the guest in the process).

The latter.  IIRC, we talked about this exact case somewhere in the hour-long
rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
this multiple times on-list, though I don't know that there is a single thread
that captures the entire plan.

The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
instance that's attached to a given guest_memfd inode when a page is being fully
removed, i.e. when a page is being freed back to the normal memory pool.  Something
like this proposed SNP patch[2].

Mike, do you have WIP patches you can share?

[1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
[2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-02-09 15:13         ` Sean Christopherson
@ 2024-03-11 17:24           ` Michael Roth
  2024-03-12 20:26             ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Roth @ 2024-03-11 17:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steven Price, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
> > >> One option that I've considered is to implement a seperate CCA ioctl to
> > >> notify KVM whether the memory should be mapped protected.
> > > 
> > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > 
> > Sorry, I really didn't explain that well. Yes effectively this is the
> > attribute flag, but there's corner cases for destruction of the VM. My
> > thought was that if the VMM wanted to tear down part of the protected
> > range (without making it shared) then a separate ioctl would be needed
> > to notify KVM of the unmap.
> 
> No new uAPI should be needed, because the only scenario time a benign VMM should
> do this is if the guest also knows the memory is being removed, in which case
> PUNCH_HOLE will suffice.
> 
> > >> This 'solves' the problem nicely except for the case where the VMM
> > >> deliberately punches holes in memory which the guest is using.
> > > 
> > > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > > so don't do that.
> > 
> > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > my concern here is a VMM which is trying to break the host. In this case
> > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > memory from the guest (effectively killing the guest in the process).
> 
> The latter.  IIRC, we talked about this exact case somewhere in the hour-long
> rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
> this multiple times on-list, though I don't know that there is a single thread
> that captures the entire plan.
> 
> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> instance that's attached to a given guest_memfd inode when a page is being fully
> removed, i.e. when a page is being freed back to the normal memory pool.  Something
> like this proposed SNP patch[2].
> 
> Mike, do have WIP patches you can share?

Sorry, I missed this query earlier. I'm a bit confused though, I thought
the kvm_arch_gmem_invalidate() hook provided in this patch was what we
ended up agreeing on during the PUCK call in question.

There was an open question about what to do if a use-case came along
where we needed to pass additional parameters to
kvm_arch_gmem_invalidate() other than just the start/end PFN range for
the pages being freed, but we'd determined that SNP and TDX did not
currently need this, so I didn't have any changes planned in this
regard.

If we now have such a need, what we had proposed was to modify
__filemap_remove_folio()/page_cache_delete() to defer setting
folio->mapping to NULL so that we could still access it in
kvm_gmem_free_folio() so that we can still access mapping->i_private_list
to get the list of gmem/KVM instances and pass them on via
kvm_arch_gmem_invalidate().
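
As a rough sketch of that deferred-mapping idea (it assumes folio->mapping
is still valid when .free_folio() runs, i.e. exactly the
__filemap_remove_folio()/page_cache_delete() change described above, and
kvm_arch_gmem_invalidate_vm() is a hypothetical per-VM flavor of the
existing hook):

	/*
	 * Sketch only: relies on page_cache_delete() clearing folio->mapping
	 * *after* .free_folio(), so the gmem instances bound to this inode
	 * are still reachable from the mapping's private list here.
	 */
	static void kvm_gmem_free_folio(struct folio *folio)
	{
		struct address_space *mapping = folio->mapping;
		kvm_pfn_t pfn = folio_pfn(folio);
		int order = folio_order(folio);
		struct kvm_gmem *gmem;

		list_for_each_entry(gmem, &mapping->i_private_list, entry)
			kvm_arch_gmem_invalidate_vm(gmem->kvm, pfn,
						    pfn + (1ul << order));
	}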

So that's doable, but it's not clear from this discussion that that's
needed. If the idea is to block/kill the guest when the VMM tries to hole-punch,
and ARM CCA already has plans to wire up the shared/private flags in
kvm_unmap_gfn_range(), wouldn't that have all the information needed to
kill that guest? At that point, kvm_gmem_free_folio() can handle
additional per-page cleanup (with additional gmem/KVM info plumbed in
if necessary).

-Mike


[1] https://lore.kernel.org/kvm/20240202230611.351544-1-seanjc@google.com/T/


> 
> [1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
> [2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-03-11 17:24           ` Michael Roth
@ 2024-03-12 20:26             ` Sean Christopherson
  2024-03-13 17:11               ` Steven Price
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2024-03-12 20:26 UTC (permalink / raw)
  To: Michael Roth
  Cc: Steven Price, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

On Mon, Mar 11, 2024, Michael Roth wrote:
> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> > On Fri, Feb 09, 2024, Steven Price wrote:
> > > >> One option that I've considered is to implement a seperate CCA ioctl to
> > > >> notify KVM whether the memory should be mapped protected.
> > > > 
> > > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > > 
> > > Sorry, I really didn't explain that well. Yes effectively this is the
> > > attribute flag, but there's corner cases for destruction of the VM. My
> > > thought was that if the VMM wanted to tear down part of the protected
> > > range (without making it shared) then a separate ioctl would be needed
> > > to notify KVM of the unmap.
> > 
> > No new uAPI should be needed, because the only scenario time a benign VMM should
> > do this is if the guest also knows the memory is being removed, in which case
> > PUNCH_HOLE will suffice.
> > 
> > > >> This 'solves' the problem nicely except for the case where the VMM
> > > >> deliberately punches holes in memory which the guest is using.
> > > > 
> > > > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > > > so don't do that.
> > > 
> > > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > > my concern here is a VMM which is trying to break the host. In this case
> > > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > > memory from the guest (effectively killing the guest in the process).
> > 
> > The latter.  IIRC, we talked about this exact case somewhere in the hour-long
> > rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
> > this multiple times on-list, though I don't know that there is a single thread
> > that captures the entire plan.
> > 
> > The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> > instance that's attached to a given guest_memfd inode when a page is being fully
> > removed, i.e. when a page is being freed back to the normal memory pool.  Something
> > like this proposed SNP patch[2].
> > 
> > Mike, do have WIP patches you can share?
> 
> Sorry, I missed this query earlier. I'm a bit confused though, I thought
> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
> ended up agreeing on during the PUCK call in question.

Heh, I trust your memory of things far more than I trust mine.  I'm just proving
Cunningham's Law.  :-)

> There was an open question about what to do if a use-case came along
> where we needed to pass additional parameters to
> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
> the pages being freed, but we'd determined that SNP and TDX did not
> currently need this, so I didn't have any changes planned in this
> regard.
> 
> If we now have such a need, what we had proposed was to modify
> __filemap_remove_folio()/page_cache_delete() to defer setting
> folio->mapping to NULL so that we could still access it in
> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
> to get the list of gmem/KVM instances and pass them on via
> kvm_arch_gmem_invalidate().

Yeah, this is what I was remembering.  I obviously forgot that we didn't have a
need to iterate over all bindings at this time.

> So that's doable, but it's not clear from this discussion that that's
> needed.

Same here.  And even if it is needed, it's not your problem to solve.  The above
blurb about needing to preserve folio->mapping until free_folio() is sufficient
to get the ARM code moving in the right direction.

Thanks!

> If the idea to block/kill the guest if VMM tries to hole-punch,
> and ARM CCA already has plans to wire up the shared/private flags in
> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
> kill that guest? At that point, kvm_gmem_free_folio() can handle
> additional per-page cleanup (with additional gmem/KVM info plumbed in
> if necessary).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-03-12 20:26             ` Sean Christopherson
@ 2024-03-13 17:11               ` Steven Price
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Price @ 2024-03-13 17:11 UTC (permalink / raw)
  To: Sean Christopherson, Michael Roth
  Cc: kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm, linux-crypto,
	x86, linux-kernel, linux-fsdevel, pbonzini, isaku.yamahata,
	ackerleytng, vbabka, ashish.kalra, nikunj.dadhania, jroedel,
	pankaj.gupta

On 12/03/2024 20:26, Sean Christopherson wrote:
> On Mon, Mar 11, 2024, Michael Roth wrote:
>> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
>>> On Fri, Feb 09, 2024, Steven Price wrote:
>>>>>> One option that I've considered is to implement a seperate CCA ioctl to
>>>>>> notify KVM whether the memory should be mapped protected.
>>>>>
>>>>> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
>>>>
>>>> Sorry, I really didn't explain that well. Yes effectively this is the
>>>> attribute flag, but there's corner cases for destruction of the VM. My
>>>> thought was that if the VMM wanted to tear down part of the protected
>>>> range (without making it shared) then a separate ioctl would be needed
>>>> to notify KVM of the unmap.
>>>
>>> No new uAPI should be needed, because the only scenario time a benign VMM should
>>> do this is if the guest also knows the memory is being removed, in which case
>>> PUNCH_HOLE will suffice.
>>>
>>>>>> This 'solves' the problem nicely except for the case where the VMM
>>>>>> deliberately punches holes in memory which the guest is using.
>>>>>
>>>>> I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
>>>>> so don't do that.
>>>>
>>>> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
>>>> my concern here is a VMM which is trying to break the host. In this case
>>>> either the PUNCH_HOLE needs to fail, or we actually need to recover the
>>>> memory from the guest (effectively killing the guest in the process).
>>>
>>> The latter.  IIRC, we talked about this exact case somewhere in the hour-long
>>> rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
>>> this multiple times on-list, though I don't know that there is a single thread
>>> that captures the entire plan.
>>>
>>> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
>>> instance that's attached to a given guest_memfd inode when a page is being fully
>>> removed, i.e. when a page is being freed back to the normal memory pool.  Something
>>> like this proposed SNP patch[2].
>>>
>>> Mike, do have WIP patches you can share?
>>
>> Sorry, I missed this query earlier. I'm a bit confused though, I thought
>> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
>> ended up agreeing on during the PUCK call in question.
> 
> Heh, I trust your memory of things far more than I trust mine.  I'm just proving
> Cunningham's Law.  :-)
> 
>> There was an open question about what to do if a use-case came along
>> where we needed to pass additional parameters to
>> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
>> the pages being freed, but we'd determined that SNP and TDX did not
>> currently need this, so I didn't have any changes planned in this
>> regard.
>>
>> If we now have such a need, what we had proposed was to modify
>> __filemap_remove_folio()/page_cache_delete() to defer setting
>> folio->mapping to NULL so that we could still access it in
>> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
>> to get the list of gmem/KVM instances and pass them on via
>> kvm_arch_gmem_invalidate().
> 
> Yeah, this is what I was remembering.  I obviously forgot that we didn't have a
> need to iterate over all bindings at this time.
> 
>> So that's doable, but it's not clear from this discussion that that's
>> needed.
> 
> Same here.  And even if it is needed, it's not your problem to solve.  The above
> blurb about needing to preserve folio->mapping being free_folio() is sufficient
> to get the ARM code moving in the right direction.
> 
> Thanks!
> 
>> If the idea to block/kill the guest if VMM tries to hole-punch,
>> and ARM CCA already has plans to wire up the shared/private flags in
>> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
>> kill that guest? At that point, kvm_gmem_free_folio() can handle
>> additional per-page cleanup (with additional gmem/KVM info plumbed in
>> if necessary).

Yes, the missing piece of the puzzle was provided by "KVM: Prepare for
handling only shared mappings in mmu_notifier events"[1] - namely the
"only_shared" flag. We don't need to actually block/kill the guest until
it attempts access to the memory which has been removed from the guest -
at that point the guest cannot continue because the security properties
have been violated (the protected memory contents have been lost) so
attempts to continue the guest will fail.

You can ignore most of my other ramblings - as long as everyone is happy
with that flag then Arm CCA should be fine. I was just looking at other
options.

Thanks,

Steve

[1]
https://lore.kernel.org/lkml/20231027182217.3615211-13-seanjc@google.com/

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-03-13 17:11 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-16 11:50 [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
2023-10-16 11:50 ` [PATCH RFC gmem v1 1/8] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory Michael Roth
2023-10-16 11:50 ` [PATCH RFC gmem v1 2/8] KVM: Use AS_INACCESSIBLE when creating guest_memfd inode Michael Roth
2023-10-16 11:50 ` [PATCH RFC gmem v1 3/8] KVM: x86: Add gmem hook for initializing memory Michael Roth
2024-02-08 10:57   ` Suzuki K Poulose
2024-02-08 17:29     ` Sean Christopherson
2023-10-16 11:50 ` [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory Michael Roth
2024-02-09 10:11   ` Steven Price
2024-02-09 14:28     ` Sean Christopherson
2024-02-09 15:02       ` Steven Price
2024-02-09 15:13         ` Sean Christopherson
2024-03-11 17:24           ` Michael Roth
2024-03-12 20:26             ` Sean Christopherson
2024-03-13 17:11               ` Steven Price
2023-10-16 11:50 ` [PATCH RFC gmem v1 5/8] KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults Michael Roth
2023-10-16 11:50 ` [PATCH RFC gmem v1 6/8] KVM: x86: Add KVM_X86_SNP_VM vm_type Michael Roth
2023-10-16 11:50 ` [PATCH RFC gmem v1 7/8] KVM: x86: Define RMP page fault error bits for #NPF Michael Roth
2023-10-16 11:50 ` [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type Michael Roth
2024-01-31  1:13   ` Sean Christopherson
2024-02-08  0:24     ` Michael Roth
2024-02-08 17:27       ` Sean Christopherson
2024-02-08 17:30         ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).