* [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
@ 2024-02-22 16:10 Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 01/26] KVM: Split KVM memory attributes into user and kernel attributes Fuad Tabba
                   ` (28 more replies)
  0 siblings, 29 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

This series adds restricted mmap() support to guest_memfd [1], as
well as support for guest_memfd on pKVM/arm64.

This series is based on Linux 6.8-rc4 + our pKVM core series [2].
The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but
the remainder (patches 7-26) require the pKVM core series. A git
repo with this series applied can be found here [3]. We have a
(WIP) kvmtool port capable of running the code in this series
[4]. For a technical deep dive into pKVM, please refer to Quentin
Perret's KVM Forum Presentation [5, 6].

I've covered some of the issues presented here in my LPC 2023
presentation [7].

We haven't started using this in Android yet, but we aim to move
away from anonymous memory to guest_memfd once we have the
necessary support merged upstream. Others (e.g., Gunyah [8]) are
also looking into guest_memfd for similar reasons.

By design, guest_memfd cannot be mapped, read, or written by the
host userspace. In pKVM, memory shared between a protected guest
and the host is shared in-place, unlike the other confidential
computing solutions that guest_memfd was originally envisaged for
(e.g., TDX). When initializing a guest, as well as when accessing
memory shared by the guest to the host, it would be useful to
support mapping that memory at the host to avoid copying its
contents.
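
As a rough illustration of the intended flow at the VMM (vm_fd,
guest_size, guest_image and image_size are placeholders, and error
handling is omitted), one could populate guest memory in-place like
this:

	struct kvm_create_guest_memfd gmem = {
		.size = guest_size,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* Only valid while the memory is shared with the host, e.g.,
	 * before the protected guest starts running. */
	void *va = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
			MAP_SHARED, gmem_fd, 0);
	memcpy(va, guest_image, image_size);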

One of the benefits of guest_memfd is that it prevents a
misbehaving host process from crashing the system when attempting
to access (deliberately or accidentally) protected guest memory,
since this memory isn't mapped to begin with. Without
guest_memfd, the hypervisor would still prevent such accesses,
but in certain cases the host kernel wouldn't be able to recover,
causing the system to crash.

Support for mmap() in this patch series maintains the invariant
that only memory shared with the host, either explicitly by the
guest or implicitly before the guest has started running (in
order to populate its memory), is allowed to be mapped. At no
time should private memory be mapped at the host.

This patch series is divided into two parts:

The first part applies to the KVM core code (patches 1-6), and is
based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support
for mapping guest memory only as long as it is shared. For that,
the host needs to know the sharing status of guest memory.
Therefore, the series adds a new KVM memory attribute, accessible
only by the host kernel, that specifies whether the memory is
allowed to be mapped by the host userspace.

The second part of the series (patches 7-26) adds guest_memfd
support for pKVM/arm64, and is based on the latest version of our
pKVM series [2]. It uses guest_memfd instead of the current
approach in Android (not upstreamed) of maintaining a long-term
GUP on anonymous memory donated to the guest. These patches
handle faulting in guest memory for a guest, as well as handling
sharing and unsharing of guest memory while maintaining the
invariant mentioned earlier.

In addition to general feedback, we would like feedback on how we
handle mmap() and faulting in guest pages at the host (patch 3,
"KVM: Add restricted support for mapping guestmem by the host").

We don't enforce the invariant that only memory shared with the
host can be mapped by the host userspace in
file_operations::mmap(), but in vm_operations_struct::fault(). In
vm_operations_struct::fault(), we check whether the page is
shared with the host. If not, we deliver a SIGBUS to the current
task. The reason for enforcing this at fault() is that mmap()
does not elevate the page count; it's the faulting in of the
page that does. Even if we were to check at mmap() whether an
address can be mapped, we would still need to check again at
fault(), since the status of the page can change between mmap()
and fault().

This creates a situation where accesses to successfully mmap()'d
memory might SIGBUS at page fault. I believe there is precedent
for similar behavior in the kernel, e.g., MADV_HWPOISON and the
hugetlbfs cgroup controller, which can SIGBUS at page-fault time
depending on the accounting limit.
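
A VMM that maps guest_memfd therefore needs to be prepared to take
a SIGBUS on access. A minimal, purely illustrative sketch of one
way to do that (shared_va, val and handle_now_private_page() are
placeholders):

	static sigjmp_buf env;

	static void sigbus_handler(int sig)
	{
		siglongjmp(env, 1);
	}

	/* ... */
	signal(SIGBUS, sigbus_handler);
	if (!sigsetjmp(env, 1))
		val = *(volatile char *)shared_va;	/* may SIGBUS */
	else
		handle_now_private_page();		/* placeholder */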

Another pKVM-specific aspect we would like feedback on is how to
handle memory mapped by the host being unshared by a guest. The
approach we've taken is that on an unshare call from the guest,
the host userspace is notified that the memory has been unshared,
in order to allow it to unmap the memory and mark it as PRIVATE
as acknowledgment. If the host does not unmap the memory, the
unshare call issued by the guest fails, which the guest is
informed about on return.
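
Roughly, we expect the VMM side of that handshake to look like the
sketch below, somewhere in its vcpu run loop. The notification
mechanism (KVM_CAP_EXIT_HYPERCALL) only appears later in the
series, so the exit layout and the shared_va_for() helper are
placeholders rather than the final ABI:

	case KVM_EXIT_HYPERCALL: {
		__u64 gpa = run->hypercall.args[0];	/* placeholder */
		struct kvm_memory_attributes attrs = {
			.address    = gpa,
			.size       = PAGE_SIZE,
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		};

		/* Unmap first, then acknowledge by marking PRIVATE. */
		munmap(shared_va_for(gpa), PAGE_SIZE);
		ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
		break;
	}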

Cheers,
/fuad

[1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/

[2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core

[3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1

[4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8

[5] Protected KVM on arm64 (slides)
https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf

[6] Protected KVM on arm64 (video)
https://www.youtube.com/watch?v=9npebeVFbFw

[7] Supporting guest private memory in Protected KVM on Android (presentation)
https://lpc.events/event/17/contributions/1487/

[8] Drivers for Gunyah (patch series)
https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/

Fuad Tabba (20):
  KVM: Split KVM memory attributes into user and kernel attributes
  KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
  KVM: Add restricted support for mapping guestmem by the host
  KVM: Don't allow private attribute to be set if mapped by host
  KVM: Don't allow private attribute to be removed for unmappable memory
  KVM: Implement kvm_(read|/write)_guest_page for private memory slots
  KVM: arm64: Create hypercall return handler
  KVM: arm64: Refactor code around handling return from host to guest
  KVM: arm64: Rename kvm_pinned_page to kvm_guest_page
  KVM: arm64: Add a field to indicate whether the guest page was pinned
  KVM: arm64: Do not allow changes to private memory slots
  KVM: arm64: Skip VMA checks for slots without userspace address
  KVM: arm64: Handle guest_memfd()-backed guest page faults
  KVM: arm64: Track sharing of memory from protected guest to host
  KVM: arm64: Mark a protected VM's memory as unmappable at
    initialization
  KVM: arm64: Handle unshare on way back to guest entry rather than exit
  KVM: arm64: Check that host unmaps memory unshared by guest
  KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes()
  KVM: arm64: Enable private memory support when pKVM is enabled
  KVM: arm64: Enable private memory kconfig for arm64

Keir Fraser (3):
  KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall
  KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall
  KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall

Quentin Perret (1):
  KVM: arm64: Turn llist of pinned pages into an rb-tree

Will Deacon (2):
  KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL
  KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications

 arch/arm64/include/asm/kvm_host.h             |  17 +-
 arch/arm64/include/asm/kvm_pkvm.h             |   1 +
 arch/arm64/kvm/Kconfig                        |   2 +
 arch/arm64/kvm/arm.c                          |  32 ++-
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |   1 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  24 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  67 +++++
 arch/arm64/kvm/hyp/nvhe/pkvm.c                |  89 +++++-
 arch/arm64/kvm/hyp/nvhe/switch.c              |   1 +
 arch/arm64/kvm/hypercalls.c                   | 117 +++++++-
 arch/arm64/kvm/mmu.c                          | 138 +++++++++-
 arch/arm64/kvm/pkvm.c                         |  83 +++++-
 include/linux/arm-smccc.h                     |   7 +
 include/linux/kvm_host.h                      |  34 +++
 include/uapi/linux/kvm.h                      |   4 +
 virt/kvm/Kconfig                              |   4 +
 virt/kvm/guest_memfd.c                        |  89 +++++-
 virt/kvm/kvm_main.c                           | 260 ++++++++++++++++--
 19 files changed, 904 insertions(+), 68 deletions(-)

-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 01/26] KVM: Split KVM memory attributes into user and kernel attributes
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 02/26] KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Currently userspace can set all KVM memory attributes. Future
patches will add new attributes that should only be set by the
kernel.  Split the attribute space into two parts, one that
userspace can set, and one that can only be set by the kernel.

This patch introduces two new functions,
kvm_vm_set_mem_attributes_kernel() and
kvm_vm_set_mem_attributes_userspace(), each of which sets the
attributes associated with the kernel or with userspace,
respectively, without clobbering the other's attributes.

Since these attributes are stored in an xarray, do the split at
bit 16 so that this still works on 32-bit architectures if
needed.
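
A small worked example of the per-gfn merge done in the loop below
(the values are purely illustrative):

	unsigned long old = 0x8;	/* KVM_MEMORY_ATTRIBUTE_PRIVATE, set by userspace */
	unsigned long new = 0x10000;	/* a kernel attribute at bit 16 */
	unsigned long merged;

	/* A kernel-side write keeps the userspace half of the old value: */
	merged = new | (old & ~KVM_MEMORY_ATTRIBUTES_KERNEL_MASK);
	/* merged == 0x10008, i.e. PRIVATE is preserved */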

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h |  3 +++
 include/uapi/linux/kvm.h |  3 +++
 virt/kvm/kvm_main.c      | 36 +++++++++++++++++++++++++++++++-----
 3 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7df0779ceaba..4cacf2a9a5d5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1438,6 +1438,9 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
 
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
 
+int kvm_vm_set_mem_attributes_kernel(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attributes);
+
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 					struct kvm_memory_slot *slot,
 					gfn_t gfn_offset,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 59e7f5fd74e1..0862d6cc3e66 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2225,6 +2225,9 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
+#define KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT     (16)
+#define KVM_MEMORY_ATTRIBUTES_KERNEL_MASK      GENMASK(63, KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
+
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
 struct kvm_create_guest_memfd {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8f0dec2fa0f1..fba4dc6e4107 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2536,8 +2536,8 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
 }
 
 /* Set @attributes for the gfn range [@start, @end). */
-static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long attributes)
+static int __kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				      unsigned long attributes, bool userspace)
 {
 	struct kvm_mmu_notifier_range pre_set_range = {
 		.start = start,
@@ -2559,8 +2559,6 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	void *entry;
 	int r = 0;
 
-	entry = attributes ? xa_mk_value(attributes) : NULL;
-
 	mutex_lock(&kvm->slots_lock);
 
 	/* Nothing to do if the entire range as the desired attributes. */
@@ -2580,6 +2578,17 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	kvm_handle_gfn_range(kvm, &pre_set_range);
 
 	for (i = start; i < end; i++) {
+		/* Maintain kernel/userspace attributes separately. */
+		unsigned long attr = xa_to_value(xa_load(&kvm->mem_attr_array, i));
+
+		if (userspace)
+			attr &= KVM_MEMORY_ATTRIBUTES_KERNEL_MASK;
+		else
+			attr &= ~KVM_MEMORY_ATTRIBUTES_KERNEL_MASK;
+
+		attributes |= attr;
+		entry = attributes ? xa_mk_value(attributes) : NULL;
+
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
 				    GFP_KERNEL_ACCOUNT));
 		KVM_BUG_ON(r, kvm);
@@ -2592,6 +2601,23 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	return r;
 }
+
+int kvm_vm_set_mem_attributes_kernel(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attributes)
+{
+	attributes &= KVM_MEMORY_ATTRIBUTES_KERNEL_MASK;
+
+	return __kvm_vm_set_mem_attributes(kvm, start, end, attributes, false);
+}
+
+static int kvm_vm_set_mem_attributes_userspace(struct kvm *kvm, gfn_t start, gfn_t end,
+					       unsigned long attributes)
+{
+	attributes &= ~KVM_MEMORY_ATTRIBUTES_KERNEL_MASK;
+
+	return __kvm_vm_set_mem_attributes(kvm, start, end, attributes, true);
+}
+
 static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 					   struct kvm_memory_attributes *attrs)
 {
@@ -2617,7 +2643,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 	 */
 	BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long));
 
-	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
+	return kvm_vm_set_mem_attributes_userspace(kvm, start, end, attrs->attributes);
 }
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 02/26] KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 01/26] KVM: Split KVM memory attributes into user and kernel attributes Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host Fuad Tabba
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Create a new variant of kvm_gmem_get_pfn(), which retains the
folio lock if it returns successfully.
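
A hypothetical caller would look roughly like this; it is
responsible for dropping both the folio lock and the page
reference once it is done:

	kvm_pfn_t pfn;
	struct page *page;
	int ret;

	ret = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
	if (ret)
		return ret;

	page = pfn_to_page(pfn);
	/* ... inspect the page; the folio lock is still held ... */
	unlock_page(page);
	put_page(page);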

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 11 +++++++++++
 virt/kvm/guest_memfd.c   | 21 ++++++++++++++++++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4cacf2a9a5d5..b96abeeb2b65 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2381,6 +2381,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2389,6 +2391,15 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
+
+static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
+					  struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn,
+					  int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..7e3ea7a3f086 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -482,8 +482,8 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 	fput(file);
 }
 
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
 {
 	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	struct kvm_gmem *gmem;
@@ -523,10 +523,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	r = 0;
 
 out_unlock:
-	folio_unlock(folio);
+	if (r)
+		folio_unlock(folio);
 out_fput:
 	fput(file);
 
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn_locked);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+	int r;
+
+	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, pfn, max_order);
+	if (r)
+		return r;
+
+	unlock_page(pfn_to_page(*pfn));
+	return 0;
+}
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 01/26] KVM: Split KVM memory attributes into user and kernel attributes Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 02/26] KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:28   ` David Hildenbrand
  2024-02-22 16:10 ` [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host Fuad Tabba
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Allow guestmem-backed memory to be mapped by the host if the
configuration option is enabled, and the (newly added) KVM memory
attribute KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE isn't set.

This new attribute is a kernel attribute, which cannot be
modified by userspace. This will be used in future patches so
that KVM can decide whether to allow the host to map guest memory
based on certain criteria, e.g., that the memory is shared with
the host.

This attribute has negative polarity (i.e., as opposed to being
an ALLOW attribute) to simplify the code, since it will usually
apply to memory already marked as KVM_MEMORY_ATTRIBUTE_PRIVATE
(i.e., memory that already has an entry in the xarray). Its
absence implies that there are no restrictions on mapping that
memory at the host.

An invariant that future patches will maintain is that memory
that is private for the guest, regardless of whether it's marked
with the PRIVATE attribute, must always be NOT_MAPPABLE.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 13 ++++++++
 include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig         |  4 +++
 virt/kvm/guest_memfd.c   | 68 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b96abeeb2b65..fad296baa84e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2402,4 +2402,17 @@ static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
+{
+	return !(kvm_get_memory_attributes(kvm, gfn) &
+		 KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
+}
+#else
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0862d6cc3e66..b8db8fb88bbe 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,6 +2227,7 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT     (16)
 #define KVM_MEMORY_ATTRIBUTES_KERNEL_MASK      GENMASK(63, KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
+#define KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE      (1ULL << KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 184dab4ee871..457019de9e6d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -108,3 +108,7 @@ config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_PRIVATE_MEM
        bool
+
+config KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+       bool
+       depends on KVM_GENERIC_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7e3ea7a3f086..ca3a5d8b1fa7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,7 +248,75 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static bool kvm_gmem_isfaultable(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct kvm_gmem *gmem = vma->vm_file->private_data;
+	pgoff_t pgoff = vmf->pgoff;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+
+	xa_for_each_range(&gmem->bindings, index, slot, pgoff, pgoff) {
+		pgoff_t base_gfn = slot->base_gfn;
+		pgoff_t gfn_pgoff = slot->gmem.pgoff;
+		pgoff_t gfn = base_gfn + max(gfn_pgoff, pgoff) - gfn_pgoff;
+
+		if (!kvm_gmem_is_mappable(kvm, gfn))
+			return false;
+	}
+
+	return true;
+}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
+	if (!folio)
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Check if the page is allowed to be faulted to the host, with the
+	 * folio lock held to ensure that the check and incrementing the page
+	 * count are protected by the same folio lock.
+	 */
+	if (!kvm_gmem_isfaultable(vmf)) {
+		folio_unlock(folio);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/* No support for private mappings to avoid COW.  */
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	file_accessed(file);
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+#else
+#define kvm_gmem_mmap NULL
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (2 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-04-17 23:27   ` Sean Christopherson
  2024-04-18 10:54   ` David Hildenbrand
  2024-02-22 16:10 ` [RFC PATCH v1 05/26] KVM: Don't allow private attribute to be removed for unmappable memory Fuad Tabba
                   ` (24 subsequent siblings)
  28 siblings, 2 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Guest private memory should never be mapped by the host.
Therefore, do not allow setting the private attribute to guest
memory if that memory is mapped by the host.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h |  7 ++++++
 virt/kvm/kvm_main.c      | 51 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fad296baa84e..f52d5503ddef 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2408,11 +2408,18 @@ static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
 	return !(kvm_get_memory_attributes(kvm, gfn) &
 		 KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
 }
+
+bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 #else
 static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
+
+static inline bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fba4dc6e4107..9f6ff314bda3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2516,6 +2516,48 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 		KVM_MMU_UNLOCK(kvm);
 }
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+	struct kvm_memslot_iter iter;
+
+	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), gfn_start, gfn_end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t start, end, i;
+
+		start = max(gfn_start, memslot->base_gfn);
+		end = min(gfn_end, memslot->base_gfn + memslot->npages);
+		if (WARN_ON_ONCE(start >= end))
+			continue;
+
+		for (i = start; i < end; i++) {
+			struct page *page;
+			bool is_mapped;
+			kvm_pfn_t pfn;
+			int ret;
+
+			/*
+			 * Check the page_mapcount with the page lock held to
+			 * avoid racing with kvm_gmem_fault().
+			 */
+			ret = kvm_gmem_get_pfn_locked(kvm, memslot, i, &pfn, NULL);
+			if (WARN_ON_ONCE(ret))
+				continue;
+
+			page = pfn_to_page(pfn);
+			is_mapped = page_mapcount(page);
+			unlock_page(page);
+			put_page(page);
+
+			if (is_mapped)
+				return true;
+		}
+	}
+
+	return false;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
 					  struct kvm_gfn_range *range)
 {
@@ -2565,6 +2607,15 @@ static int __kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
 		goto out_unlock;
 
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE) && userspace) {
+		/* Host-mapped memory cannot be private. */
+		if ((attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) &&
+		    kvm_is_gmem_mapped(kvm, start, end)) {
+			r = -EPERM;
+			goto out_unlock;
+		}
+	}
+
 	/*
 	 * Reserve memory ahead of time to avoid having to deal with failures
 	 * partway through setting the new attributes.
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 05/26] KVM: Don't allow private attribute to be removed for unmappable memory
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (3 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 06/26] KVM: Implement kvm_(read|/write)_guest_page for private memory slots Fuad Tabba
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Unmappable memory cannot be shared with the host. Ensure that the
private attribute cannot be removed from unmappable memory.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/kvm_main.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9f6ff314bda3..adfee6592f6c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
 #include <linux/suspend.h>
+#include <linux/rcupdate_wait.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -2464,6 +2465,40 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	return has_attrs;
 }
 
+/*
+ * Returns true if _any_ gfn in the range [@start, @end) has _any_ attribute
+ * matching @attr.
+ */
+static bool kvm_any_range_has_memory_attribute(struct kvm *kvm, gfn_t start,
+					       gfn_t end, unsigned long attr)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	bool has_attr = false;
+	void *entry;
+
+	rcu_read_lock();
+	xas_for_each(&xas, entry, end - 1) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		if (!xa_is_value(entry))
+			continue;
+
+		if ((xa_to_value(entry) & attr) == attr) {
+			has_attr = true;
+			break;
+		}
+
+		if (need_resched()) {
+			xas_pause(&xas);
+			cond_resched_rcu();
+		}
+	}
+	rcu_read_unlock();
+
+	return has_attr;
+}
+
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
 	if (!kvm || kvm_arch_has_private_mem(kvm))
@@ -2614,6 +2649,14 @@ static int __kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 			r = -EPERM;
 			goto out_unlock;
 		}
+
+		/* Unmappable memory cannot be shared. */
+		if (!(attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) &&
+		     kvm_any_range_has_memory_attribute(kvm, start, end,
+				KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE)) {
+			r = -EPERM;
+			goto out_unlock;
+		}
 	}
 
 	/*
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 06/26] KVM: Implement kvm_(read|/write)_guest_page for private memory slots
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (4 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 05/26] KVM: Don't allow private attribute to be removed for unmappable memory Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 07/26] KVM: arm64: Turn llist of pinned pages into an rb-tree Fuad Tabba
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Make __kvm_read_guest_page/__kvm_write_guest_page capable of
accessing guest memory if no userspace address is available.
Moreover, check that the memory being accessed is shared with the
host before attempting the access.

KVM at the host might need to access shared memory that is not
mapped in the host userspace but is in fact shared with the host,
e.g., when accounting for stolen time. This allows the access
without relying on the slot's userspace_addr being set.

This does not circumvent protection, since the access is only
attempted if the memory is shared with the host.
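
For example, a (hypothetical) caller updating a record the guest
has shared at gpa can keep using the generic accessor unchanged;
with this patch it works even if the slot has no userspace_addr,
and fails (-EPERM in this series) if the gfn is not currently
shared with the host. Here rec and gpa are placeholders:

	ret = kvm_write_guest_page(kvm, gpa >> PAGE_SHIFT, &rec,
				   offset_in_page(gpa), sizeof(rec));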

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/kvm_main.c | 130 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 114 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index adfee6592f6c..f5a0619cb520 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3446,17 +3446,107 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static int __kvm_read_private_guest_page(struct kvm *kvm,
+					 struct kvm_memory_slot *slot,
+					 gfn_t gfn, void *data, int offset,
+					 int len)
+{
+	u64 pfn;
+	struct page *page;
+	int r = 0;
+
+	if (size_add(offset, len) > PAGE_SIZE)
+		return -E2BIG;
+
+	mutex_lock(&kvm->slots_lock);
+	if (kvm_any_range_has_memory_attribute(kvm, gfn, gfn + 1,
+		KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE)) {
+		r = -EPERM;
+		goto out_unlock;
+	}
+
+	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
+	if (r)
+		goto out_unlock;
+
+	page = pfn_to_page(pfn);
+	memcpy(data, page_address(page) + offset, len);
+	unlock_page(page);
+	kvm_release_pfn_clean(pfn);
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+
+static int __kvm_write_private_guest_page(struct kvm *kvm,
+					  struct kvm_memory_slot *slot,
+					  gfn_t gfn, const void *data,
+					  int offset, int len)
+{
+	u64 pfn;
+	struct page *page;
+	int r = 0;
+
+	if (size_add(offset, len) > PAGE_SIZE)
+		return -E2BIG;
+
+	mutex_lock(&kvm->slots_lock);
+	if (kvm_any_range_has_memory_attribute(kvm, gfn, gfn + 1,
+		KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE)) {
+		r = -EPERM;
+		goto out_unlock;
+	}
+
+	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
+	if (r)
+		goto out_unlock;
+
+	page = pfn_to_page(pfn);
+	memcpy(page_address(page) + offset, data, len);
+	unlock_page(page);
+	kvm_release_pfn_dirty(pfn);
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+#else
+static int __kvm_read_private_guest_page(struct kvm *kvm,
+					 struct kvm_memory_slot *slot,
+					 gfn_t gfn, void *data, int offset,
+					 int len)
+{
+	BUG();
+	return -EIO;
+}
+
+static int __kvm_write_private_guest_page(struct kvm *kvm,
+					  struct kvm_memory_slot *slot,
+					  gfn_t gfn, const void *data,
+					  int offset, int len)
+{
+	BUG();
+	return -EIO;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gfn_t gfn, void *data, int offset, int len)
 {
-	int r;
 	unsigned long addr;
 
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE) &&
+	    kvm_slot_can_be_private(slot)) {
+		return __kvm_read_private_guest_page(kvm, slot, gfn, data,
+						     offset, len);
+	}
+
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_from_user(data, (void __user *)addr + offset, len);
-	if (r)
+	if (__copy_from_user(data, (void __user *)addr + offset, len))
 		return -EFAULT;
 	return 0;
 }
@@ -3466,7 +3556,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -3475,7 +3565,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -3547,19 +3637,27 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
+
 static int __kvm_write_guest_page(struct kvm *kvm,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
-			          const void *data, int offset, int len)
+				  const void *data, int offset, int len)
 {
-	int r;
-	unsigned long addr;
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE) &&
+	    kvm_slot_can_be_private(memslot)) {
+		int r = __kvm_write_private_guest_page(kvm, memslot, gfn, data,
+						       offset, len);
+
+		if (r)
+			return r;
+	} else {
+		unsigned long addr = gfn_to_hva_memslot(memslot, gfn);
+
+		if (kvm_is_error_hva(addr))
+			return -EFAULT;
+		if (__copy_to_user((void __user *)addr + offset, data, len))
+			return -EFAULT;
+	}
 
-	addr = gfn_to_hva_memslot(memslot, gfn);
-	if (kvm_is_error_hva(addr))
-		return -EFAULT;
-	r = __copy_to_user((void __user *)addr + offset, data, len);
-	if (r)
-		return -EFAULT;
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
 	return 0;
 }
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 07/26] KVM: arm64: Turn llist of pinned pages into an rb-tree
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (5 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 06/26] KVM: Implement kvm_(read|/write)_guest_page for private memory slots Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 08/26] KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall Fuad Tabba
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

From: Quentin Perret <qperret@google.com>

Index the pinned pages by IPA so that they can be looked up
efficiently.

Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  5 +++--
 arch/arm64/kvm/mmu.c              | 30 ++++++++++++++++++++++++++----
 arch/arm64/kvm/pkvm.c             | 12 +++++++-----
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2777b0fe1b12..55de71791233 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -207,8 +207,9 @@ struct kvm_smccc_features {
 };
 
 struct kvm_pinned_page {
-	struct list_head	link;
+	struct rb_node		node;
 	struct page		*page;
+	u64			ipa;
 };
 
 typedef unsigned int pkvm_handle_t;
@@ -216,7 +217,7 @@ typedef unsigned int pkvm_handle_t;
 struct kvm_protected_vm {
 	pkvm_handle_t handle;
 	struct kvm_hyp_memcache teardown_mc;
-	struct list_head pinned_pages;
+	struct rb_root pinned_pages;
 	bool enabled;
 };
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ac088dc198e6..f796e092a921 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -337,6 +337,7 @@ static void unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 si
 static void pkvm_stage2_flush(struct kvm *kvm)
 {
 	struct kvm_pinned_page *ppage;
+	struct rb_node *node;
 
 	/*
 	 * Contrary to stage2_apply_range(), we don't need to check
@@ -344,7 +345,8 @@ static void pkvm_stage2_flush(struct kvm *kvm)
 	 * from a vcpu thread, and the list is only ever freed on VM
 	 * destroy (which only occurs when all vcpu are gone).
 	 */
-	list_for_each_entry(ppage, &kvm->arch.pkvm.pinned_pages, link) {
+	for (node = rb_first(&kvm->arch.pkvm.pinned_pages); node; node = rb_next(node)) {
+		ppage = rb_entry(node, struct kvm_pinned_page, node);
 		__clean_dcache_guest_page(page_address(ppage->page), PAGE_SIZE);
 		cond_resched_rwlock_write(&kvm->mmu_lock);
 	}
@@ -913,7 +915,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
 	mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
 	mmu->vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
-	INIT_LIST_HEAD(&kvm->arch.pkvm.pinned_pages);
+	kvm->arch.pkvm.pinned_pages = RB_ROOT;
 	mmu->arch = &kvm->arch;
 
 	if (is_protected_kvm_enabled())
@@ -1412,6 +1414,26 @@ static int pkvm_host_map_guest(u64 pfn, u64 gfn)
 	return (ret == -EPERM) ? -EAGAIN : ret;
 }
 
+static int cmp_ppages(struct rb_node *node, const struct rb_node *parent)
+{
+	struct kvm_pinned_page *a = container_of(node, struct kvm_pinned_page, node);
+	struct kvm_pinned_page *b = container_of(parent, struct kvm_pinned_page, node);
+
+	if (a->ipa < b->ipa)
+		return -1;
+	if (a->ipa > b->ipa)
+		return 1;
+	return 0;
+}
+
+static int insert_ppage(struct kvm *kvm, struct kvm_pinned_page *ppage)
+{
+	if (rb_find_add(&ppage->node, &kvm->arch.pkvm.pinned_pages, cmp_ppages))
+		return -EEXIST;
+
+	return 0;
+}
+
 static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot)
 {
@@ -1479,8 +1501,8 @@ static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 	ppage->page = page;
-	INIT_LIST_HEAD(&ppage->link);
-	list_add(&ppage->link, &kvm->arch.pkvm.pinned_pages);
+	ppage->ipa = fault_ipa;
+	WARN_ON(insert_ppage(kvm, ppage));
 	write_unlock(&kvm->mmu_lock);
 
 	return 0;
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 10a619b257c4..11355980e18d 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -246,9 +246,9 @@ static bool pkvm_teardown_vm(struct kvm *host_kvm)
 
 void pkvm_destroy_hyp_vm(struct kvm *host_kvm)
 {
-	struct kvm_pinned_page *ppage, *tmp;
+	struct kvm_pinned_page *ppage;
 	struct mm_struct *mm = current->mm;
-	struct list_head *ppages;
+	struct rb_node *node;
 	unsigned long pages = 0;
 
 	if (!pkvm_teardown_vm(host_kvm))
@@ -256,14 +256,16 @@ void pkvm_destroy_hyp_vm(struct kvm *host_kvm)
 
 	free_hyp_memcache(&host_kvm->arch.pkvm.teardown_mc);
 
-	ppages = &host_kvm->arch.pkvm.pinned_pages;
-	list_for_each_entry_safe(ppage, tmp, ppages, link) {
+	node = rb_first(&host_kvm->arch.pkvm.pinned_pages);
+	while (node) {
+		ppage = rb_entry(node, struct kvm_pinned_page, node);
 		WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_reclaim_page,
 					  page_to_pfn(ppage->page)));
 		cond_resched();
 
 		unpin_user_pages_dirty_lock(&ppage->page, 1, true);
-		list_del(&ppage->link);
+		node = rb_next(node);
+		rb_erase(&ppage->node, &host_kvm->arch.pkvm.pinned_pages);
 		kfree(ppage);
 		pages++;
 	}
-- 
2.44.0.rc1.240.g4c46232300-goog


* [RFC PATCH v1 08/26] KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (6 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 07/26] KVM: arm64: Turn llist of pinned pages into an rb-tree Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 09/26] KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall Fuad Tabba
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

From: Keir Fraser <keirf@google.com>

This allows a VM running under pKVM to notify the hypervisor (and
the host) that it is returning pages to host ownership.
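
A hedged sketch of the guest-side call, using the function ID added
below (a real caller, e.g. a ballooning driver, would make sure the
page is no longer in use before relinquishing it):

	static int relinquish_page(struct page *page)
	{
		struct arm_smccc_res res;

		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID,
				     page_to_phys(page), &res);

		return res.a0 == SMCCC_RET_SUCCESS ? 0 : -EINVAL;
	}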

Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_pkvm.h             |  1 +
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  2 +
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |  1 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  4 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 30 +++++++
 arch/arm64/kvm/hyp/nvhe/pkvm.c                | 83 +++++++++++++++++--
 arch/arm64/kvm/hyp/nvhe/switch.c              |  1 +
 arch/arm64/kvm/hypercalls.c                   | 19 ++++-
 arch/arm64/kvm/pkvm.c                         | 35 ++++++++
 include/linux/arm-smccc.h                     |  7 ++
 10 files changed, 173 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index 60b2d4965e4a..ea9d9529e412 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -22,6 +22,7 @@ int pkvm_init_host_vm(struct kvm *kvm, unsigned long type);
 int pkvm_create_hyp_vm(struct kvm *kvm);
 void pkvm_destroy_hyp_vm(struct kvm *kvm);
 bool pkvm_is_hyp_created(struct kvm *kvm);
+void pkvm_host_reclaim_page(struct kvm *host_kvm, phys_addr_t ipa);
 
 /*
  * Definitions for features to be allowed or restricted for guest virtual
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 973983d78f31..a20e5b9426ce 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -75,6 +75,8 @@ int __pkvm_host_share_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu);
 int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu);
 int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *hyp_vcpu, u64 ipa);
 int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *hyp_vcpu, u64 ipa);
+int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
+				    u64 ipa, u64 *ppa);
 
 bool addr_is_memory(phys_addr_t phys);
 int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
index 7940a042289a..094599692187 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
@@ -113,6 +113,7 @@ int kvm_check_pvm_sysreg_table(void);
 void pkvm_reset_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu);
 
 bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code);
+bool kvm_hyp_handle_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code);
 
 struct pkvm_hyp_vcpu *pkvm_mpidr_to_hyp_vcpu(struct pkvm_hyp_vm *vm, u64 mpidr);
 
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 1fd419cef3db..1c93c225915b 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -85,6 +85,8 @@ static void handle_pvm_entry_hvc64(struct pkvm_hyp_vcpu *hyp_vcpu)
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
 		fallthrough;
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
+		fallthrough;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
 		vcpu_set_reg(&hyp_vcpu->vcpu, 0, SMCCC_RET_SUCCESS);
 		break;
 	default:
@@ -253,6 +255,8 @@ static void handle_pvm_exit_hvc64(struct pkvm_hyp_vcpu *hyp_vcpu)
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
 		fallthrough;
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
+		fallthrough;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
 		n = 4;
 		break;
 
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 899164515e0c..1dd8eee1ab28 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -321,6 +321,36 @@ void reclaim_guest_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc)
 	}
 }
 
+int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
+				    u64 ipa, u64 *ppa)
+{
+	struct kvm_pgtable_walker walker = {
+		.cb     = reclaim_walker,
+		.arg    = ppa,
+		.flags  = KVM_PGTABLE_WALK_LEAF
+	};
+	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu);
+	int ret;
+
+	host_lock_component();
+	guest_lock_component(vm);
+
+	/* Set default pa value to "not found". */
+	*ppa = 0;
+
+	/* If ipa is mapped: sets page flags, and gets the pa. */
+	ret = kvm_pgtable_walk(&vm->pgt, ipa, PAGE_SIZE, &walker);
+
+	/* Zap the guest stage2 pte. */
+	if (!ret)
+		kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
+
+	guest_unlock_component(vm);
+	host_unlock_component();
+
+	return ret;
+}
+
 int __pkvm_prot_finalize(void)
 {
 	struct kvm_s2_mmu *mmu = &host_mmu.arch.mmu;
diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c
index 199ad51f1169..4209c75e7fba 100644
--- a/arch/arm64/kvm/hyp/nvhe/pkvm.c
+++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c
@@ -1258,6 +1258,54 @@ static bool pkvm_memunshare_call(struct pkvm_hyp_vcpu *hyp_vcpu)
 	return true;
 }
 
+static bool pkvm_meminfo_call(struct pkvm_hyp_vcpu *hyp_vcpu)
+{
+	struct kvm_vcpu *vcpu = &hyp_vcpu->vcpu;
+	u64 arg1 = smccc_get_arg1(vcpu);
+	u64 arg2 = smccc_get_arg2(vcpu);
+	u64 arg3 = smccc_get_arg3(vcpu);
+
+	if (arg1 || arg2 || arg3)
+		goto out_guest_err;
+
+	smccc_set_retval(vcpu, PAGE_SIZE, 0, 0, 0);
+	return true;
+
+out_guest_err:
+	smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
+	return true;
+}
+
+static bool pkvm_memrelinquish_call(struct pkvm_hyp_vcpu *hyp_vcpu)
+{
+	struct kvm_vcpu *vcpu = &hyp_vcpu->vcpu;
+	u64 ipa = smccc_get_arg1(vcpu);
+	u64 arg2 = smccc_get_arg2(vcpu);
+	u64 arg3 = smccc_get_arg3(vcpu);
+	u64 pa = 0;
+	int ret;
+
+	if (arg2 || arg3)
+		goto out_guest_err;
+
+	ret = __pkvm_guest_relinquish_to_host(hyp_vcpu, ipa, &pa);
+	if (ret)
+		goto out_guest_err;
+
+	if (pa != 0) {
+		/* Now pass to host. */
+		return false;
+	}
+
+	/* This was a NOP as no page was actually mapped at the IPA. */
+	smccc_set_retval(vcpu, 0, 0, 0, 0);
+	return true;
+
+out_guest_err:
+	smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
+	return true;
+}
+
 /*
  * Handler for protected VM HVC calls.
  *
@@ -1288,20 +1336,16 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code)
 		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO);
 		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE);
 		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_UNSHARE);
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_RELINQUISH);
 		break;
 	case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID:
-		if (smccc_get_arg1(vcpu) ||
-		    smccc_get_arg2(vcpu) ||
-		    smccc_get_arg3(vcpu)) {
-			val[0] = SMCCC_RET_INVALID_PARAMETER;
-		} else {
-			val[0] = PAGE_SIZE;
-		}
-		break;
+		return pkvm_meminfo_call(hyp_vcpu);
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
 		return pkvm_memshare_call(hyp_vcpu, exit_code);
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
 		return pkvm_memunshare_call(hyp_vcpu);
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
+		return pkvm_memrelinquish_call(hyp_vcpu);
 	default:
 		return pkvm_handle_psci(hyp_vcpu);
 	}
@@ -1309,3 +1353,26 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code)
 	smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
 	return true;
 }
+
+/*
+ * Handler for non-protected VM HVC calls.
+ *
+ * Returns true if the hypervisor has handled the exit, and control should go
+ * back to the guest, or false if it hasn't.
+ */
+bool kvm_hyp_handle_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code)
+{
+	u32 fn = smccc_get_function(vcpu);
+	struct pkvm_hyp_vcpu *hyp_vcpu;
+
+	hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu);
+
+	switch (fn) {
+	case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID:
+		return pkvm_meminfo_call(hyp_vcpu);
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
+		return pkvm_memrelinquish_call(hyp_vcpu);
+	}
+
+	return false;
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 31c46491e65a..12b7d56d3842 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -185,6 +185,7 @@ static bool kvm_handle_pvm_sys64(struct kvm_vcpu *vcpu, u64 *exit_code)
 static const exit_handler_fn hyp_exit_handlers[] = {
 	[0 ... ESR_ELx_EC_MAX]		= NULL,
 	[ESR_ELx_EC_CP15_32]		= kvm_hyp_handle_cp15_32,
+	[ESR_ELx_EC_HVC64]		= kvm_hyp_handle_hvc64,
 	[ESR_ELx_EC_SYS64]		= kvm_hyp_handle_sysreg,
 	[ESR_ELx_EC_SVE]		= kvm_hyp_handle_fpsimd,
 	[ESR_ELx_EC_FP_ASIMD]		= kvm_hyp_handle_fpsimd,
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 5763d979d8ca..89b5b61bc9f7 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -5,6 +5,7 @@
 #include <linux/kvm_host.h>
 
 #include <asm/kvm_emulate.h>
+#include <asm/kvm_pkvm.h>
 
 #include <kvm/arm_hypercalls.h>
 #include <kvm/arm_psci.h>
@@ -13,8 +14,15 @@
 	GENMASK(KVM_REG_ARM_STD_BMAP_BIT_COUNT - 1, 0)
 #define KVM_ARM_SMCCC_STD_HYP_FEATURES				\
 	GENMASK(KVM_REG_ARM_STD_HYP_BMAP_BIT_COUNT - 1, 0)
-#define KVM_ARM_SMCCC_VENDOR_HYP_FEATURES			\
-	GENMASK(KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT - 1, 0)
+#define KVM_ARM_SMCCC_VENDOR_HYP_FEATURES ({				\
+	unsigned long f;						\
+	f = GENMASK(KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT - 1, 0);	\
+	if (is_protected_kvm_enabled()) {				\
+		f |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO);		\
+		f |= BIT(ARM_SMCCC_KVM_FUNC_MEM_RELINQUISH);		\
+	}								\
+	f;								\
+})
 
 static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
 {
@@ -116,6 +124,9 @@ static bool kvm_smccc_test_fw_bmap(struct kvm_vcpu *vcpu, u32 func_id)
 	case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
 		return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_PTP,
 				&smccc_feat->vendor_hyp_bmap);
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
+		return test_bit(ARM_SMCCC_KVM_FUNC_MEM_RELINQUISH,
+				&smccc_feat->vendor_hyp_bmap);
 	default:
 		return false;
 	}
@@ -364,6 +375,10 @@ int kvm_smccc_call_handler(struct kvm_vcpu *vcpu)
 	case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
 		kvm_ptp_get_time(vcpu, val);
 		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
+		pkvm_host_reclaim_page(vcpu->kvm, smccc_get_arg1(vcpu));
+		val[0] = SMCCC_RET_SUCCESS;
+		break;
 	case ARM_SMCCC_TRNG_VERSION:
 	case ARM_SMCCC_TRNG_FEATURES:
 	case ARM_SMCCC_TRNG_GET_UUID:
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 11355980e18d..713bbb023177 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -24,6 +24,14 @@ static unsigned int *hyp_memblock_nr_ptr = &kvm_nvhe_sym(hyp_memblock_nr);
 phys_addr_t hyp_mem_base;
 phys_addr_t hyp_mem_size;
 
+static int rb_ppage_cmp(const void *key, const struct rb_node *node)
+{
+       struct kvm_pinned_page *p = container_of(node, struct kvm_pinned_page, node);
+       phys_addr_t ipa = (phys_addr_t)key;
+
+       return (ipa < p->ipa) ? -1 : (ipa > p->ipa);
+}
+
 static int cmp_hyp_memblock(const void *p1, const void *p2)
 {
 	const struct memblock_region *r1 = p1;
@@ -330,3 +338,30 @@ static int __init finalize_pkvm(void)
 	return ret;
 }
 device_initcall_sync(finalize_pkvm);
+
+void pkvm_host_reclaim_page(struct kvm *host_kvm, phys_addr_t ipa)
+{
+	struct kvm_pinned_page *ppage;
+	struct mm_struct *mm = current->mm;
+	struct rb_node *node;
+
+	write_lock(&host_kvm->mmu_lock);
+	node = rb_find((void *)ipa, &host_kvm->arch.pkvm.pinned_pages,
+		       rb_ppage_cmp);
+	if (node)
+		rb_erase(node, &host_kvm->arch.pkvm.pinned_pages);
+	write_unlock(&host_kvm->mmu_lock);
+
+	WARN_ON(!node);
+	if (!node)
+		return;
+
+	ppage = container_of(node, struct kvm_pinned_page, node);
+
+	WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_reclaim_page,
+				  page_to_pfn(ppage->page)));
+
+	account_locked_vm(mm, 1, false);
+	unpin_user_pages_dirty_lock(&ppage->page, 1, true);
+	kfree(ppage);
+}
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 9cb7c95920b0..ec85f1be2040 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -118,6 +118,7 @@
 #define ARM_SMCCC_KVM_FUNC_HYP_MEMINFO		2
 #define ARM_SMCCC_KVM_FUNC_MEM_SHARE		3
 #define ARM_SMCCC_KVM_FUNC_MEM_UNSHARE		4
+#define ARM_SMCCC_KVM_FUNC_MEM_RELINQUISH	9
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
 #define ARM_SMCCC_KVM_NUM_FUNCS			128
 
@@ -158,6 +159,12 @@
 			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
 			   ARM_SMCCC_KVM_FUNC_MEM_UNSHARE)
 
+#define ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID			\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
+			   ARM_SMCCC_SMC_64,				\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
+			   ARM_SMCCC_KVM_FUNC_MEM_RELINQUISH)
+
 /* ptp_kvm counter type ID */
 #define KVM_PTP_VIRT_COUNTER			0
 #define KVM_PTP_PHYS_COUNTER			1
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 09/26] KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (7 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 08/26] KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 10/26] KVM: arm64: Avoid unnecessary unmap walk " Fuad Tabba
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

From: Keir Fraser <keirf@google.com>

The VM should only relinquish "normal" pages. For a protected VM,
this means PKVM_PAGE_OWNED; for a non-protected VM, it means
PKVM_PAGE_SHARED_BORROWED. All other page types are rejected, and
the failure is reported to the caller.
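
For reference, the guest side of this is just the vendor hypercall
with the page's IPA as the first argument; a minimal guest-side
sketch, assuming the standard SMCCC 1.1 helpers (the helper name is
hypothetical and not part of this series):

#include <linux/arm-smccc.h>

/* Hypothetical guest-side helper, for illustration only. */
static int guest_relinquish_page(phys_addr_t ipa)
{
	struct arm_smccc_res res;

	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID,
			     ipa, 0, 0, &res);

	/*
	 * With this patch, the call fails unless the page is
	 * PKVM_PAGE_OWNED (protected VM) or PKVM_PAGE_SHARED_BORROWED
	 * (non-protected VM).
	 */
	return res.a0 == SMCCC_RET_SUCCESS ? 0 : -EPERM;
}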

Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/hyp/nvhe/mem_protect.c | 45 ++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 1dd8eee1ab28..405d6e3e17e0 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -321,13 +321,44 @@ void reclaim_guest_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc)
 	}
 }
 
+struct relinquish_data {
+	enum pkvm_page_state expected_state;
+	u64 pa;
+};
+
+static int relinquish_walker(const struct kvm_pgtable_visit_ctx *ctx,
+			     enum kvm_pgtable_walk_flags visit)
+{
+	kvm_pte_t pte = *ctx->ptep;
+	struct hyp_page *page;
+	struct relinquish_data *data = ctx->arg;
+	enum pkvm_page_state state;
+
+	if (!kvm_pte_valid(pte))
+		return 0;
+
+	state = pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte));
+	if (state != data->expected_state)
+		return -EPERM;
+
+	page = hyp_phys_to_page(kvm_pte_to_phys(pte));
+	if (state == PKVM_PAGE_OWNED)
+		page->flags |= HOST_PAGE_NEED_POISONING;
+	page->flags |= HOST_PAGE_PENDING_RECLAIM;
+
+	data->pa = kvm_pte_to_phys(pte);
+
+	return 0;
+}
+
 int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
 				    u64 ipa, u64 *ppa)
 {
+	struct relinquish_data data;
 	struct kvm_pgtable_walker walker = {
-		.cb     = reclaim_walker,
-		.arg    = ppa,
-		.flags  = KVM_PGTABLE_WALK_LEAF
+		.cb     = relinquish_walker,
+		.flags  = KVM_PGTABLE_WALK_LEAF,
+		.arg    = &data,
 	};
 	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu);
 	int ret;
@@ -335,8 +366,13 @@ int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
 	host_lock_component();
 	guest_lock_component(vm);
 
+	/* Expected page state depends on VM type. */
+	data.expected_state = pkvm_hyp_vcpu_is_protected(vcpu) ?
+		PKVM_PAGE_OWNED :
+		PKVM_PAGE_SHARED_BORROWED;
+
 	/* Set default pa value to "not found". */
-	*ppa = 0;
+	data.pa = 0;
 
 	/* If ipa is mapped: sets page flags, and gets the pa. */
 	ret = kvm_pgtable_walk(&vm->pgt, ipa, PAGE_SIZE, &walker);
@@ -348,6 +384,7 @@ int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
 	guest_unlock_component(vm);
 	host_unlock_component();
 
+	*ppa = data.pa;
 	return ret;
 }
 
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 10/26] KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (8 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 09/26] KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 11/26] KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL Fuad Tabba
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

From: Keir Fraser <keirf@google.com>

If the mapping is determined to be not present in an earlier walk,
attempting the unmap is pointless.

Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 405d6e3e17e0..4889f0510c7e 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -378,7 +378,7 @@ int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
 	ret = kvm_pgtable_walk(&vm->pgt, ipa, PAGE_SIZE, &walker);
 
 	/* Zap the guest stage2 pte. */
-	if (!ret)
+	if (!ret && data.pa)
 		kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
 
 	guest_unlock_component(vm);
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 11/26] KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (9 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 10/26] KVM: arm64: Avoid unnecessary unmap walk " Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 12/26] KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications Fuad Tabba
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

From: Will Deacon <will@kernel.org>

Allow the VMM to hook into and handle a subset of guest hypercalls
advertised by the host. For now, no such hypercalls exist, and so the
new capability returns 0 when queried.
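
For the userspace side (not part of this patch), the VMM first
queries the capability on the VM fd to learn which hypercall exits
the kernel can forward, then opts in with KVM_ENABLE_CAP, passing
the requested bitmask in args[0]. A rough sketch, assuming vm_fd
came from KVM_CREATE_VM:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Illustrative only: opt in to every hypercall exit the kernel advertises. */
static int enable_hypercall_exits(int vm_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_EXIT_HYPERCALL };
	int mask;

	mask = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_EXIT_HYPERCALL);
	if (mask <= 0)
		return mask;	/* 0: nothing can be forwarded yet, as in this patch */

	cap.args[0] = mask;	/* must be a subset of the advertised mask */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}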

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/arm.c              | 25 +++++++++++++++++++++++++
 arch/arm64/kvm/hypercalls.c       | 19 +++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 55de71791233..f6187526685a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -325,6 +325,8 @@ struct kvm_arch {
 	 * the associated pKVM instance in the hypervisor.
 	 */
 	struct kvm_protected_vm pkvm;
+
+	u64 hypercall_exit_enabled;
 };
 
 struct kvm_vcpu_fault_info {
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index c0e683bde111..cd6c4df27c7b 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -60,6 +60,9 @@ static bool vgic_present, kvm_arm_initialised;
 static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
+/* KVM "vendor" hypercalls which may be forwarded to userspace on request. */
+#define KVM_EXIT_HYPERCALL_VALID_MASK	(0)
+
 bool is_kvm_arm_initialised(void)
 {
 	return kvm_arm_initialised;
@@ -123,6 +126,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->slots_lock);
 		break;
+	case KVM_CAP_EXIT_HYPERCALL:
+		if (cap->flags)
+			return -EINVAL;
+
+		if (cap->args[0] & ~KVM_EXIT_HYPERCALL_VALID_MASK)
+			return -EINVAL;
+
+		if (cap->args[1] || cap->args[2] || cap->args[3])
+			return -EINVAL;
+
+		WRITE_ONCE(kvm->arch.hypercall_exit_enabled, cap->args[0]);
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -334,6 +350,9 @@ static int kvm_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
 		r = BIT(0);
 		break;
+	case KVM_CAP_EXIT_HYPERCALL:
+		r = KVM_EXIT_HYPERCALL_VALID_MASK;
+		break;
 	default:
 		r = 0;
 	}
@@ -1071,6 +1090,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		ret = kvm_handle_mmio_return(vcpu);
 		if (ret <= 0)
 			return ret;
+	} else if (run->exit_reason == KVM_EXIT_HYPERCALL) {
+		smccc_set_retval(vcpu,
+				 vcpu->run->hypercall.ret,
+				 vcpu->run->hypercall.args[0],
+				 vcpu->run->hypercall.args[1],
+				 vcpu->run->hypercall.args[2]);
 	}
 
 	vcpu_load(vcpu);
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 89b5b61bc9f7..5e04be7c026a 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -132,6 +132,25 @@ static bool kvm_smccc_test_fw_bmap(struct kvm_vcpu *vcpu, u32 func_id)
 	}
 }
 
+static int __maybe_unused kvm_vcpu_exit_hcall(struct kvm_vcpu *vcpu, u32 nr, u32 nr_args)
+{
+	u64 mask = vcpu->kvm->arch.hypercall_exit_enabled;
+	u32 i;
+
+	if (nr_args > 6 || !(mask & BIT(nr))) {
+		smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
+		return 1;
+	}
+
+	vcpu->run->exit_reason		= KVM_EXIT_HYPERCALL;
+	vcpu->run->hypercall.nr		= nr;
+
+	for (i = 0; i < nr_args; ++i)
+		vcpu->run->hypercall.args[i] = vcpu_get_reg(vcpu, i + 1);
+
+	return 0;
+}
+
 #define SMC32_ARCH_RANGE_BEGIN	ARM_SMCCC_VERSION_FUNC_ID
 #define SMC32_ARCH_RANGE_END	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,		\
 						   ARM_SMCCC_SMC_32,		\
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 12/26] KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (10 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 11/26] KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 13/26] KVM: arm64: Create hypercall return handler Fuad Tabba
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

From: Will Deacon <will@kernel.org>

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/arm.c        | 3 ++-
 arch/arm64/kvm/hypercalls.c | 8 +++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index cd6c4df27c7b..6bba6f1fee88 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -61,7 +61,8 @@ static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
 /* KVM "vendor" hypercalls which may be forwarded to userspace on request. */
-#define KVM_EXIT_HYPERCALL_VALID_MASK	(0)
+#define KVM_EXIT_HYPERCALL_VALID_MASK	(BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE) |	\
+					 BIT(ARM_SMCCC_KVM_FUNC_MEM_UNSHARE))
 
 bool is_kvm_arm_initialised(void)
 {
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 5e04be7c026a..b93546dd222f 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -83,6 +83,8 @@ static bool kvm_smccc_default_allowed(u32 func_id)
 	 */
 	case ARM_SMCCC_VERSION_FUNC_ID:
 	case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
 		return true;
 	default:
 		/* PSCI 0.2 and up is in the 0:0x1f range */
@@ -132,7 +134,7 @@ static bool kvm_smccc_test_fw_bmap(struct kvm_vcpu *vcpu, u32 func_id)
 	}
 }
 
-static int __maybe_unused kvm_vcpu_exit_hcall(struct kvm_vcpu *vcpu, u32 nr, u32 nr_args)
+static int kvm_vcpu_exit_hcall(struct kvm_vcpu *vcpu, u32 nr, u32 nr_args)
 {
 	u64 mask = vcpu->kvm->arch.hypercall_exit_enabled;
 	u32 i;
@@ -398,6 +400,10 @@ int kvm_smccc_call_handler(struct kvm_vcpu *vcpu)
 		pkvm_host_reclaim_page(vcpu->kvm, smccc_get_arg1(vcpu));
 		val[0] = SMCCC_RET_SUCCESS;
 		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
+		return kvm_vcpu_exit_hcall(vcpu, ARM_SMCCC_KVM_FUNC_MEM_SHARE, 3);
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
+		return kvm_vcpu_exit_hcall(vcpu, ARM_SMCCC_KVM_FUNC_MEM_UNSHARE, 3);
 	case ARM_SMCCC_TRNG_VERSION:
 	case ARM_SMCCC_TRNG_FEATURES:
 	case ARM_SMCCC_TRNG_GET_UUID:
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 13/26] KVM: arm64: Create hypercall return handler
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (11 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 12/26] KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 14/26] KVM: arm64: Refactor code around handling return from host to guest Fuad Tabba
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Instead of handling the return to the guest from a host-handled
hypercall inline, create a handler function. More logic will be
added to this handler in subsequent patches.

No functional change intended.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/arm.c              |  8 +++-----
 arch/arm64/kvm/hypercalls.c       | 10 ++++++++++
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f6187526685a..fb7aff14fd1a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1100,6 +1100,7 @@ void kvm_mmio_write_buf(void *buf, unsigned int len, unsigned long data);
 unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
 
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
+int kvm_handle_hypercall_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
 /*
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 6bba6f1fee88..ab7e02acb17d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1092,11 +1092,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (ret <= 0)
 			return ret;
 	} else if (run->exit_reason == KVM_EXIT_HYPERCALL) {
-		smccc_set_retval(vcpu,
-				 vcpu->run->hypercall.ret,
-				 vcpu->run->hypercall.args[0],
-				 vcpu->run->hypercall.args[1],
-				 vcpu->run->hypercall.args[2]);
+		ret = kvm_handle_hypercall_return(vcpu);
+		if (ret <= 0)
+			return ret;
 	}
 
 	vcpu_load(vcpu);
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index b93546dd222f..b08e18128de4 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -24,6 +24,16 @@
 	f;								\
 })
 
+int kvm_handle_hypercall_return(struct kvm_vcpu *vcpu)
+{
+	smccc_set_retval(vcpu, vcpu->run->hypercall.ret,
+			 vcpu->run->hypercall.args[0],
+			 vcpu->run->hypercall.args[1],
+			 vcpu->run->hypercall.args[2]);
+
+	return 1;
+}
+
 static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
 {
 	struct system_time_snapshot systime_snapshot;
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 14/26] KVM: arm64: Refactor code around handling return from host to guest
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (12 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 13/26] KVM: arm64: Create hypercall return handler Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 15/26] KVM: arm64: Rename kvm_pinned_page to kvm_guest_page Fuad Tabba
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Make the code more consistent and easier to read.

No functional change intended.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/arm.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ab7e02acb17d..0a6991ee9615 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1085,17 +1085,14 @@ static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
 	struct kvm_run *run = vcpu->run;
-	int ret;
+	int ret = 1;
 
-	if (run->exit_reason == KVM_EXIT_MMIO) {
+	if (run->exit_reason == KVM_EXIT_MMIO)
 		ret = kvm_handle_mmio_return(vcpu);
-		if (ret <= 0)
-			return ret;
-	} else if (run->exit_reason == KVM_EXIT_HYPERCALL) {
+	else if (run->exit_reason == KVM_EXIT_HYPERCALL)
 		ret = kvm_handle_hypercall_return(vcpu);
-		if (ret <= 0)
-			return ret;
-	}
+	if (ret <= 0)
+		return ret;
 
 	vcpu_load(vcpu);
 
@@ -1106,7 +1103,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 	kvm_sigset_activate(vcpu);
 
-	ret = 1;
 	run->exit_reason = KVM_EXIT_UNKNOWN;
 	run->flags = 0;
 	while (ret > 0) {
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 15/26] KVM: arm64: Rename kvm_pinned_page to kvm_guest_page
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (13 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 14/26] KVM: arm64: Refactor code around handling return from host to guest Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 16/26] KVM: arm64: Add a field to indicate whether the guest page was pinned Fuad Tabba
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

With guestmem, pages won't be pinned. Change the name of the
structure to reflect that.

No functional change intended.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +-
 arch/arm64/kvm/mmu.c              | 12 ++++++------
 arch/arm64/kvm/pkvm.c             | 10 +++++-----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index fb7aff14fd1a..99bf2b534ff8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -206,7 +206,7 @@ struct kvm_smccc_features {
 	unsigned long vendor_hyp_bmap;
 };
 
-struct kvm_pinned_page {
+struct kvm_guest_page {
 	struct rb_node		node;
 	struct page		*page;
 	u64			ipa;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f796e092a921..ae6f65717178 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -336,7 +336,7 @@ static void unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 si
 
 static void pkvm_stage2_flush(struct kvm *kvm)
 {
-	struct kvm_pinned_page *ppage;
+	struct kvm_guest_page *ppage;
 	struct rb_node *node;
 
 	/*
@@ -346,7 +346,7 @@ static void pkvm_stage2_flush(struct kvm *kvm)
 	 * destroy (which only occurs when all vcpu are gone).
 	 */
 	for (node = rb_first(&kvm->arch.pkvm.pinned_pages); node; node = rb_next(node)) {
-		ppage = rb_entry(node, struct kvm_pinned_page, node);
+		ppage = rb_entry(node, struct kvm_guest_page, node);
 		__clean_dcache_guest_page(page_address(ppage->page), PAGE_SIZE);
 		cond_resched_rwlock_write(&kvm->mmu_lock);
 	}
@@ -1416,8 +1416,8 @@ static int pkvm_host_map_guest(u64 pfn, u64 gfn)
 
 static int cmp_ppages(struct rb_node *node, const struct rb_node *parent)
 {
-	struct kvm_pinned_page *a = container_of(node, struct kvm_pinned_page, node);
-	struct kvm_pinned_page *b = container_of(parent, struct kvm_pinned_page, node);
+	struct kvm_guest_page *a = container_of(node, struct kvm_guest_page, node);
+	struct kvm_guest_page *b = container_of(parent, struct kvm_guest_page, node);
 
 	if (a->ipa < b->ipa)
 		return -1;
@@ -1426,7 +1426,7 @@ static int cmp_ppages(struct rb_node *node, const struct rb_node *parent)
 	return 0;
 }
 
-static int insert_ppage(struct kvm *kvm, struct kvm_pinned_page *ppage)
+static int insert_ppage(struct kvm *kvm, struct kvm_guest_page *ppage)
 {
 	if (rb_find_add(&ppage->node, &kvm->arch.pkvm.pinned_pages, cmp_ppages))
 		return -EEXIST;
@@ -1440,7 +1440,7 @@ static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
 	struct mm_struct *mm = current->mm;
 	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
-	struct kvm_pinned_page *ppage;
+	struct kvm_guest_page *ppage;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_s2_mmu *mmu =  &kvm->arch.mmu;
 	struct page *page;
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 713bbb023177..0dbde37d21d0 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -26,7 +26,7 @@ phys_addr_t hyp_mem_size;
 
 static int rb_ppage_cmp(const void *key, const struct rb_node *node)
 {
-       struct kvm_pinned_page *p = container_of(node, struct kvm_pinned_page, node);
+       struct kvm_guest_page *p = container_of(node, struct kvm_guest_page, node);
        phys_addr_t ipa = (phys_addr_t)key;
 
        return (ipa < p->ipa) ? -1 : (ipa > p->ipa);
@@ -254,7 +254,7 @@ static bool pkvm_teardown_vm(struct kvm *host_kvm)
 
 void pkvm_destroy_hyp_vm(struct kvm *host_kvm)
 {
-	struct kvm_pinned_page *ppage;
+	struct kvm_guest_page *ppage;
 	struct mm_struct *mm = current->mm;
 	struct rb_node *node;
 	unsigned long pages = 0;
@@ -266,7 +266,7 @@ void pkvm_destroy_hyp_vm(struct kvm *host_kvm)
 
 	node = rb_first(&host_kvm->arch.pkvm.pinned_pages);
 	while (node) {
-		ppage = rb_entry(node, struct kvm_pinned_page, node);
+		ppage = rb_entry(node, struct kvm_guest_page, node);
 		WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_reclaim_page,
 					  page_to_pfn(ppage->page)));
 		cond_resched();
@@ -341,7 +341,7 @@ device_initcall_sync(finalize_pkvm);
 
 void pkvm_host_reclaim_page(struct kvm *host_kvm, phys_addr_t ipa)
 {
-	struct kvm_pinned_page *ppage;
+	struct kvm_guest_page *ppage;
 	struct mm_struct *mm = current->mm;
 	struct rb_node *node;
 
@@ -356,7 +356,7 @@ void pkvm_host_reclaim_page(struct kvm *host_kvm, phys_addr_t ipa)
 	if (!node)
 		return;
 
-	ppage = container_of(node, struct kvm_pinned_page, node);
+	ppage = container_of(node, struct kvm_guest_page, node);
 
 	WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_reclaim_page,
 				  page_to_pfn(ppage->page)));
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 16/26] KVM: arm64: Add a field to indicate whether the guest page was pinned
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (14 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 15/26] KVM: arm64: Rename kvm_pinned_page to kvm_guest_page Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 17/26] KVM: arm64: Do not allow changes to private memory slots Fuad Tabba
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

This is needed only during the transition phase from pinning to
using guestmem. Once pKVM moves to guestmem, this field will be
removed.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h | 1 +
 arch/arm64/kvm/mmu.c              | 1 +
 arch/arm64/kvm/pkvm.c             | 6 ++++--
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 99bf2b534ff8..ab61c3ecba0c 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -210,6 +210,7 @@ struct kvm_guest_page {
 	struct rb_node		node;
 	struct page		*page;
 	u64			ipa;
+	bool			is_pinned;
 };
 
 typedef unsigned int pkvm_handle_t;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ae6f65717178..391d168e95d0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1502,6 +1502,7 @@ static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	ppage->page = page;
 	ppage->ipa = fault_ipa;
+	ppage->is_pinned = true;
 	WARN_ON(insert_ppage(kvm, ppage));
 	write_unlock(&kvm->mmu_lock);
 
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 0dbde37d21d0..bfd4858a7bd1 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -271,7 +271,8 @@ void pkvm_destroy_hyp_vm(struct kvm *host_kvm)
 					  page_to_pfn(ppage->page)));
 		cond_resched();
 
-		unpin_user_pages_dirty_lock(&ppage->page, 1, true);
+		if (ppage->is_pinned)
+			unpin_user_pages_dirty_lock(&ppage->page, 1, true);
 		node = rb_next(node);
 		rb_erase(&ppage->node, &host_kvm->arch.pkvm.pinned_pages);
 		kfree(ppage);
@@ -362,6 +363,7 @@ void pkvm_host_reclaim_page(struct kvm *host_kvm, phys_addr_t ipa)
 				  page_to_pfn(ppage->page)));
 
 	account_locked_vm(mm, 1, false);
-	unpin_user_pages_dirty_lock(&ppage->page, 1, true);
+	if (ppage->is_pinned)
+		unpin_user_pages_dirty_lock(&ppage->page, 1, true);
 	kfree(ppage);
 }
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 17/26] KVM: arm64: Do not allow changes to private memory slots
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (15 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 16/26] KVM: arm64: Add a field to indicate whether the guest page was pinned Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 18/26] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Handling changes to private memory slots can be difficult, since
it would probably require some cooperation from the hypervisor
and/or the guest. Do not allow such changes for now.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 391d168e95d0..4d2881648b58 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2158,6 +2158,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		}
 	}
 
+	if ((change == KVM_MR_MOVE || change == KVM_MR_FLAGS_ONLY) &&
+	    ((kvm_slot_can_be_private(old)) || (kvm_slot_can_be_private(new))))
+		return -EPERM;
+
 	if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
 			change != KVM_MR_FLAGS_ONLY)
 		return 0;
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 18/26] KVM: arm64: Skip VMA checks for slots without userspace address
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (16 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 17/26] KVM: arm64: Do not allow changes to private memory slots Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 19/26] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Memory slots backed by guest memory might be created with no
intention of being mapped by the host. These are recognized by
not having a userspace address in the memory slot.

VMA checks are neither possible nor necessary for this kind of
slot, so skip them.
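
Concretely, this is the kind of slot a VMM creates when it binds
guest_memfd-backed memory to the guest without giving KVM a host
mapping. A sketch of the userspace side, with placeholder fds,
slot number and sizes:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Illustrative only: a guest_memfd-backed slot with no host mapping. */
static int add_private_slot(int vm_fd, int gmem_fd, __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region2 region = {
		.slot			= 0,
		.flags			= KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr	= gpa,
		.memory_size		= size,
		.userspace_addr		= 0,	/* no host mapping: VMA checks are skipped */
		.guest_memfd		= gmem_fd,
		.guest_memfd_offset	= 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}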

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4d2881648b58..6ad79390b15c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -973,6 +973,10 @@ static void stage2_unmap_memslot(struct kvm *kvm,
 	phys_addr_t size = PAGE_SIZE * memslot->npages;
 	hva_t reg_end = hva + size;
 
+	/* Host will not map this private memory without a userspace address. */
+	if (kvm_slot_can_be_private(memslot) && !hva)
+		return;
+
 	/*
 	 * A memory region could potentially cover multiple VMAs, and any holes
 	 * between them, so iterate over all of them to find out if we should
@@ -2176,6 +2180,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	hva = new->userspace_addr;
 	reg_end = hva + (new->npages << PAGE_SHIFT);
 
+	/* Host will not map this private memory without a userspace address. */
+	if ((kvm_slot_can_be_private(new)) && !hva)
+		return 0;
+
 	mmap_read_lock(current->mm);
 	/*
 	 * A memory region could potentially cover multiple VMAs, and any holes
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 19/26] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (17 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 18/26] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 20/26] KVM: arm64: Track sharing of memory from protected guest to host Fuad Tabba
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Introduce a new fault handler which responds to guest faults for
guestmem pages.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 75 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6ad79390b15c..570b14da16b1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1438,6 +1438,70 @@ static int insert_ppage(struct kvm *kvm, struct kvm_guest_page *ppage)
 	return 0;
 }
 
+static int guestmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			  struct kvm_memory_slot *memslot)
+{
+	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
+	struct kvm_guest_page *guest_page;
+	struct mm_struct *mm = current->mm;
+	gfn_t gfn = gpa_to_gfn(fault_ipa);
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_s2_mmu *mmu =  &kvm->arch.mmu;
+	struct page *page = NULL;
+	kvm_pfn_t pfn;
+	int ret;
+
+	ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(mmu));
+	if (ret)
+		return ret;
+
+	/*
+	 * Acquire the page lock to avoid racing with kvm_gmem_fault() when
+	 * checking the page_mapcount later on.
+	 */
+	ret = kvm_gmem_get_pfn_locked(kvm, memslot, gfn, &pfn, NULL);
+	if (ret)
+		return ret;
+
+	page = pfn_to_page(pfn);
+
+	if (!kvm_gmem_is_mappable(kvm, gfn) && page_mapcount(page)) {
+		ret = -EPERM;
+		goto rel_page;
+	}
+
+	guest_page = kmalloc(sizeof(*guest_page), GFP_KERNEL_ACCOUNT);
+	if (!guest_page) {
+		ret = -ENOMEM;
+		goto rel_page;
+	}
+
+	guest_page->page = page;
+	guest_page->ipa = fault_ipa;
+	guest_page->is_pinned = false;
+
+	ret = account_locked_vm(mm, 1, true);
+	if (ret)
+		goto free_gp;
+
+	write_lock(&kvm->mmu_lock);
+	ret = pkvm_host_map_guest(pfn, gfn);
+	if (!ret)
+		WARN_ON(insert_ppage(kvm, guest_page));
+	write_unlock(&kvm->mmu_lock);
+
+	if (ret)
+		account_locked_vm(mm, 1, false);
+free_gp:
+	if (ret)
+		kfree(guest_page);
+rel_page:
+	unlock_page(page);
+	put_page(page);
+
+	return ret != -EAGAIN ? ret : 0;
+}
+
 static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot)
 {
@@ -1887,11 +1951,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 	}
 
-	if (is_protected_kvm_enabled())
-		ret = pkvm_mem_abort(vcpu, fault_ipa, memslot);
-	else
+	if (is_protected_kvm_enabled()) {
+		if ((kvm_slot_can_be_private(memslot)))
+			ret = guestmem_abort(vcpu, fault_ipa, memslot);
+		else
+			ret = pkvm_mem_abort(vcpu, fault_ipa, memslot);
+	} else {
 		ret = user_mem_abort(vcpu, fault_ipa, memslot,
 				     esr_fsc_is_permission_fault(esr));
+	}
+
 
 	if (ret == 0)
 		ret = 1;
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 20/26] KVM: arm64: Track sharing of memory from protected guest to host
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (18 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 19/26] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 21/26] KVM: arm64: Mark a protected VM's memory as unmappable at initialization Fuad Tabba
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Track memory explicitly shared by the guest with the host, and
update the NOT_MAPPABLE attribute accordingly. This allows shared
memory to be mapped by the host, and ensures that memory the guest
has unshared is not mappable by the host.
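
For reference, these attribute updates are triggered by the guest's
MEM_SHARE and MEM_UNSHARE hypercalls, which take the page's IPA as
their only argument. A minimal guest-side sketch, assuming the
standard SMCCC 1.1 helpers (the helper name is hypothetical):

#include <linux/arm-smccc.h>

/* Hypothetical guest-side helper, for illustration only. */
static int guest_set_page_shared(phys_addr_t ipa, bool share)
{
	const u32 func_id = share ? ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID :
				    ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID;
	struct arm_smccc_res res;

	/* Arguments beyond the IPA must be zero. */
	arm_smccc_1_1_invoke(func_id, ipa, 0, 0, &res);

	return res.a0 == SMCCC_RET_SUCCESS ? 0 : -EPERM;
}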

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/hypercalls.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index b08e18128de4..56fb4fa70eec 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -163,6 +163,32 @@ static int kvm_vcpu_exit_hcall(struct kvm_vcpu *vcpu, u32 nr, u32 nr_args)
 	return 0;
 }
 
+static int kvm_vcpu_handle_xshare(struct kvm_vcpu *vcpu, u32 nr)
+{
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE)) {
+		u64 mask = vcpu->kvm->arch.hypercall_exit_enabled;
+		gfn_t gfn = vcpu_get_reg(vcpu, 1) >> PAGE_SHIFT;
+		unsigned long attributes = 0;
+		int ret;
+
+		if (!(mask & BIT(nr)))
+			goto err;
+
+		if (nr == ARM_SMCCC_KVM_FUNC_MEM_UNSHARE)
+			attributes = KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE;
+
+		ret = kvm_vm_set_mem_attributes_kernel(vcpu->kvm, gfn, gfn + 1, attributes);
+		if (ret)
+			goto err;
+	}
+
+	return kvm_vcpu_exit_hcall(vcpu, nr, 3);
+
+err:
+	smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
+	return 1;
+}
+
 #define SMC32_ARCH_RANGE_BEGIN	ARM_SMCCC_VERSION_FUNC_ID
 #define SMC32_ARCH_RANGE_END	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,		\
 						   ARM_SMCCC_SMC_32,		\
@@ -411,9 +437,9 @@ int kvm_smccc_call_handler(struct kvm_vcpu *vcpu)
 		val[0] = SMCCC_RET_SUCCESS;
 		break;
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
-		return kvm_vcpu_exit_hcall(vcpu, ARM_SMCCC_KVM_FUNC_MEM_SHARE, 3);
+		return kvm_vcpu_handle_xshare(vcpu, ARM_SMCCC_KVM_FUNC_MEM_SHARE);
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
-		return kvm_vcpu_exit_hcall(vcpu, ARM_SMCCC_KVM_FUNC_MEM_UNSHARE, 3);
+		return kvm_vcpu_handle_xshare(vcpu, ARM_SMCCC_KVM_FUNC_MEM_UNSHARE);
 	case ARM_SMCCC_TRNG_VERSION:
 	case ARM_SMCCC_TRNG_FEATURES:
 	case ARM_SMCCC_TRNG_GET_UUID:
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 21/26] KVM: arm64: Mark a protected VM's memory as unmappable at initialization
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (19 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 20/26] KVM: arm64: Track sharing of memory from protected guest to host Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 22/26] KVM: arm64: Handle unshare on way back to guest entry rather than exit Fuad Tabba
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

A protected VM's memory is private by default and not mappable by
the host until explicitly shared by the guest. Therefore, start
off with all the memory of a protected guest as NOT_MAPPABLE.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/pkvm.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index bfd4858a7bd1..75247a3ced3d 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -220,13 +220,41 @@ bool pkvm_is_hyp_created(struct kvm *host_kvm)
 	return READ_ONCE(host_kvm->arch.pkvm.handle);
 }
 
+static int pkvm_mark_protected_mem_not_mappable(struct kvm *kvm)
+{
+	struct kvm_memory_slot *memslot;
+	struct kvm_memslots *slots;
+	int bkt, r;
+
+	if (!IS_ENABLED(CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE))
+		return 0;
+
+	slots = kvm_memslots(kvm);
+	kvm_for_each_memslot(memslot, bkt, slots) {
+		if (!kvm_slot_can_be_private(memslot))
+			continue;
+
+		r = kvm_vm_set_mem_attributes_kernel(kvm,
+			memslot->base_gfn, memslot->base_gfn + memslot->npages,
+			KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
+		if (r)
+			return r;
+	}
+
+	return 0;
+}
+
 int pkvm_create_hyp_vm(struct kvm *host_kvm)
 {
 	int ret = 0;
 
 	mutex_lock(&host_kvm->lock);
-	if (!pkvm_is_hyp_created(host_kvm))
-		ret = __pkvm_create_hyp_vm(host_kvm);
+	if (!pkvm_is_hyp_created(host_kvm)) {
+		if (kvm_vm_is_protected(host_kvm))
+			ret = pkvm_mark_protected_mem_not_mappable(host_kvm);
+		if (!ret)
+			ret = __pkvm_create_hyp_vm(host_kvm);
+	}
 	mutex_unlock(&host_kvm->lock);
 
 	return ret;
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 22/26] KVM: arm64: Handle unshare on way back to guest entry rather than exit
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (20 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 21/26] KVM: arm64: Mark a protected VM's memory as unmappable at initialization Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 23/26] KVM: arm64: Check that host unmaps memory unshared by guest Fuad Tabba
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

The host might not be able to unmap memory that the guest unshares
from it. If that happens, the host denies the unshare, and the
guest is notified of the failure when it returns from its unshare
call.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/hyp/nvhe/hyp-main.c | 24 ++++++++++++++++++++----
 arch/arm64/kvm/hyp/nvhe/pkvm.c     | 22 ++++++++--------------
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 1c93c225915b..2198a146e773 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -79,16 +79,32 @@ static void handle_pvm_entry_psci(struct pkvm_hyp_vcpu *hyp_vcpu)
 
 static void handle_pvm_entry_hvc64(struct pkvm_hyp_vcpu *hyp_vcpu)
 {
-	u32 fn = smccc_get_function(&hyp_vcpu->vcpu);
+	struct kvm_vcpu *vcpu = &hyp_vcpu->vcpu;
+	u32 fn = smccc_get_function(vcpu);
 
 	switch (fn) {
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
 		fallthrough;
-	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
-		fallthrough;
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
-		vcpu_set_reg(&hyp_vcpu->vcpu, 0, SMCCC_RET_SUCCESS);
+		vcpu_set_reg(vcpu, 0, SMCCC_RET_SUCCESS);
+		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
+	{
+		/*
+		 * Get the host vcpu view of whether the unshare is successful.
+		 * If the host wasn't able to unmap it first, hyp cannot unshare
+		 * it as the host would have a mapping to a private guest page.
+		 */
+		int smccc_ret = vcpu_get_reg(hyp_vcpu->host_vcpu, 0);
+		u64 ipa = smccc_get_arg1(vcpu);
+
+		if (smccc_ret != SMCCC_RET_SUCCESS ||
+		    __pkvm_guest_unshare_host(hyp_vcpu, ipa))
+			smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
+		else
+			smccc_set_retval(vcpu, SMCCC_RET_SUCCESS, 0, 0, 0);
 		break;
+	}
 	default:
 		handle_pvm_entry_psci(hyp_vcpu);
 		break;
diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c
index 4209c75e7fba..fa94b88fe9a8 100644
--- a/arch/arm64/kvm/hyp/nvhe/pkvm.c
+++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c
@@ -1236,26 +1236,19 @@ static bool pkvm_memshare_call(struct pkvm_hyp_vcpu *hyp_vcpu, u64 *exit_code)
 	return false;
 }
 
-static bool pkvm_memunshare_call(struct pkvm_hyp_vcpu *hyp_vcpu)
+static bool pkvm_memunshare_check(struct pkvm_hyp_vcpu *hyp_vcpu)
 {
 	struct kvm_vcpu *vcpu = &hyp_vcpu->vcpu;
-	u64 ipa = smccc_get_arg1(vcpu);
+	u64 arg1 = smccc_get_arg1(vcpu);
 	u64 arg2 = smccc_get_arg2(vcpu);
 	u64 arg3 = smccc_get_arg3(vcpu);
-	int err;
-
-	if (arg2 || arg3)
-		goto out_guest_err;
 
-	err = __pkvm_guest_unshare_host(hyp_vcpu, ipa);
-	if (err)
-		goto out_guest_err;
+	if (!arg1 || arg2 || arg3) {
+		smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
+		return true;
+	}
 
 	return false;
-
-out_guest_err:
-	smccc_set_retval(vcpu, SMCCC_RET_INVALID_PARAMETER, 0, 0, 0);
-	return true;
 }
 
 static bool pkvm_meminfo_call(struct pkvm_hyp_vcpu *hyp_vcpu)
@@ -1343,7 +1336,8 @@ bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code)
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
 		return pkvm_memshare_call(hyp_vcpu, exit_code);
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
-		return pkvm_memunshare_call(hyp_vcpu);
+		/* Handle unshare on guest return because it could be denied by the host. */
+		return pkvm_memunshare_check(hyp_vcpu);
 	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_RELINQUISH_FUNC_ID:
 		return pkvm_memrelinquish_call(hyp_vcpu);
 	default:
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 23/26] KVM: arm64: Check that host unmaps memory unshared by guest
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (21 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 22/26] KVM: arm64: Handle unshare on way back to guest entry rather than exit Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 24/26] KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() Fuad Tabba
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

After an unshare call from the guest, check that the memory isn't
mapped by the host before returning to the guest.

If the host has acknowledged the unsharing by marking the memory
as private, but hasn't unmapped it, return an error to the host.

On the other hand, if the host has not acknowledged the unsharing,
return to the guest with an error, indicating that the unshare
has failed.
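
For illustration, the userspace side of this contract (not part of
this patch) is: on a forwarded MEM_UNSHARE exit, unmap any host
mapping of the page, flip its attribute back to private, and only
then report success in run->hypercall.ret. A rough sketch, with the
SMCCC return values open-coded and a 4K granule assumed:

#include <linux/kvm.h>
#include <sys/ioctl.h>

#define SMCCC_RET_SUCCESS		0
#define SMCCC_RET_INVALID_PARAMETER	(-3)

/* Illustrative only: VMM handling of a forwarded MEM_UNSHARE hypercall exit. */
static void handle_mem_unshare_exit(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs = {
		.address	= run->hypercall.args[0],
		.size		= 4096,
		.attributes	= KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	/* munmap() of any host mapping covering this range goes here. */

	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		run->hypercall.ret = SMCCC_RET_INVALID_PARAMETER;
	else
		run->hypercall.ret = SMCCC_RET_SUCCESS;
}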

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/hypercalls.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 56fb4fa70eec..23237ca400ec 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -24,8 +24,45 @@
 	f;								\
 })
 
+static int kvm_handle_unshare_return(struct kvm_vcpu *vcpu)
+{
+	gpa_t gpa = vcpu->run->hypercall.args[0];
+	gfn_t gfn = gpa_to_gfn(gpa);
+	struct kvm *kvm = vcpu->kvm;
+
+	if (!IS_ENABLED(CONFIG_KVM_PRIVATE_MEM))
+		return 1;
+
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		/* Inform the guest that host refused to unshare the memory. */
+		vcpu->run->hypercall.ret = SMCCC_RET_INVALID_PARAMETER;
+		WARN_ON(kvm_vm_set_mem_attributes_kernel(vcpu->kvm, gfn, gfn + 1, 0));
+
+		return 1;
+	}
+
+	/*
+	 * Host has acknowledged that the memory has been unshared by marking it
+	 * as private, so check if it still has mapping. If it does, exit back
+	 * to the host to fix it.
+	 * The exit reason should still be preserved.
+	 */
+	if (kvm_is_gmem_mapped(kvm, gfn, gfn + 1))
+		return -EPERM;
+
+	return 1;
+}
+
 int kvm_handle_hypercall_return(struct kvm_vcpu *vcpu)
 {
+	if (vcpu->run->hypercall.ret == SMCCC_RET_SUCCESS &&
+	    vcpu->run->hypercall.nr == ARM_SMCCC_KVM_FUNC_MEM_UNSHARE) {
+		int ret = kvm_handle_unshare_return(vcpu);
+
+		if (ret <= 0)
+			return ret;
+	}
+
 	smccc_set_retval(vcpu, vcpu->run->hypercall.ret,
 			 vcpu->run->hypercall.args[0],
 			 vcpu->run->hypercall.args[1],
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 24/26] KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes()
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (22 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 23/26] KVM: arm64: Check that host unmaps memory unshared by guest Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 25/26] KVM: arm64: Enable private memory support when pKVM is enabled Fuad Tabba
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

pKVM doesn't need to do anything specific in these handlers, but
they need to be implemented since the generic memory attributes
code used by guest_memfd expects them.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 570b14da16b1..36de5748fb0a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2386,3 +2386,19 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
 
 	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
 }
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range)
+{
+	return false;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range)
+{
+	return false;
+}
+
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 25/26] KVM: arm64: Enable private memory support when pKVM is enabled
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (23 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 24/26] KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 16:10 ` [RFC PATCH v1 26/26] KVM: arm64: Enable private memory kconfig for arm64 Fuad Tabba
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

For now, make support for private memory on arm64 dependent on
pKVM being enabled.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index ab61c3ecba0c..437509b5d881 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1263,4 +1263,10 @@ static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) is_protected_kvm_enabled()
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
 #endif /* __ARM64_KVM_HOST_H__ */
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [RFC PATCH v1 26/26] KVM: arm64: Enable private memory kconfig for arm64
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (24 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 25/26] KVM: arm64: Enable private memory support when pKVM is enabled Fuad Tabba
@ 2024-02-22 16:10 ` Fuad Tabba
  2024-02-22 23:43 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Elliot Berman
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-22 16:10 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, tabba

Now that the infrastructure is in place for arm64 to support
guest private memory, enable it in the arm64 kernel
configuration.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 6c3c8ca73e7f..559eb9d34447 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -40,6 +40,8 @@ menuconfig KVM
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
 	select XARRAY_MULTI
+	select KVM_GENERIC_PRIVATE_MEM
+	select KVM_GENERIC_PRIVATE_MEM_MAPPABLE
 	help
 	  Support hosting virtualized guest machines.
 
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host
  2024-02-22 16:10 ` [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host Fuad Tabba
@ 2024-02-22 16:28   ` David Hildenbrand
  2024-02-26  8:58     ` Fuad Tabba
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-22 16:28 UTC (permalink / raw)
  To: Fuad Tabba, kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> +	struct folio *folio;
> +
> +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> +	if (!folio)
> +		return VM_FAULT_SIGBUS;
> +
> +	/*
> +	 * Check if the page is allowed to be faulted to the host, with the
> +	 * folio lock held to ensure that the check and incrementing the page
> +	 * count are protected by the same folio lock.
> +	 */
> +	if (!kvm_gmem_isfaultable(vmf)) {
> +		folio_unlock(folio);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);

We won't currently get hugetlb (or even THP) here. It mimics what shmem 
would do.

finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(), 
getting the rmap involved.

Do we have some tests in place that make sure that 
fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap 
the page again (IOW, that the rmap does indeed work?).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (25 preceding siblings ...)
  2024-02-22 16:10 ` [RFC PATCH v1 26/26] KVM: arm64: Enable private memory kconfig for arm64 Fuad Tabba
@ 2024-02-22 23:43 ` Elliot Berman
  2024-02-23  0:35   ` folio_mmapped Matthew Wilcox
  2024-02-26  9:03   ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
  2024-02-23 12:00 ` Alexandru Elisei
  2024-02-26  9:47 ` David Hildenbrand
  28 siblings, 2 replies; 96+ messages in thread
From: Elliot Berman @ 2024-02-22 23:43 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf

On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote:
> This series adds restricted mmap() support to guest_memfd [1], as
> well as support guest_memfd on pKVM/arm64.
> 
> We haven't started using this in Android yet, but we aim to move
> away from anonymous memory to guest_memfd once we have the
> necessary support merged upstream. Others (e.g., Gunyah [8]) are
> also looking into guest_memfd for similar reasons as us.

I'm especially interested to see if we can factor out much of the
common implementation bits between KVM and Gunyah. In principle, we're
doing the same thing: the difference is the exact mechanics of
interacting with the hypervisor, which (I think) could easily be
extracted into an ops structure.
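
For instance, something along these lines (purely illustrative; neither
series defines such a structure today, and all of the names below are
made up):

#include <linux/fs.h>

struct gmem_hyp_ops {
	/* transition a page to guest-private (pKVM donate / Gunyah lend) */
	int (*make_private)(struct inode *inode, pgoff_t index);
	/* transition a page back to being shared with the host */
	int (*make_shared)(struct inode *inode, pgoff_t index);
	/* query whether the host may map/access the page right now */
	bool (*host_accessible)(struct inode *inode, pgoff_t index);
};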

[...]

> In addition to general feedback, we would like feedback on how we
> handle mmap() and faulting-in guest pages at the host (KVM: Add
> restricted support for mapping guest_memfd by the host).
> 
> We don't enforce the invariant that only memory shared with the
> host can be mapped by the host userspace in
> file_operations::mmap(), but in vm_operations_struct:fault(). On
> vm_operations_struct::fault(), we check whether the page is
> shared with the host. If not, we deliver a SIGBUS to the current
> task. The reason for enforcing this at fault() is that mmap()
> does not elevate the pagecount(); it's the faulting in of the
> page which does. Even if we were to check at mmap() whether an
> address can be mapped, we would still need to check again on
> fault(), since between mmap() and fault() the status of the page
> can change.
> 
> This creates the situation where access to successfully mmap()'d
> memory might SIGBUS at page fault. There is precedence for
> similar behavior in the kernel I believe, with MADV_HWPOISON and
> the hugetlbfs cgroups controller, which could SIGBUS at page
> fault time depending on the accounting limit.

I added a test: folio_mmapped() [1] which checks if there's a vma
covering the corresponding offset into the guest_memfd. I use this
test before trying to make page private to guest and I've been able to
ensure that userspace can't even mmap() private guest memory. If I try
to make memory private, I can test that it's not mmapped and not allow
memory to become private. In my testing so far, this is enough to
prevent SIGBUS from happening.

This test probably should be moved outside Gunyah specific code, and was
looking for maintainer to suggest the right home for it :)

[1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/
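
Roughly, the VMA-coverage part of that check amounts to walking the
file's i_mmap interval tree. A minimal sketch of the idea (hypothetical
helper, not the actual code in the patch linked above):

#include <linux/fs.h>
#include <linux/mm.h>

/* Does any VMA currently cover @index of @mapping? */
static bool gmem_offset_has_vma(struct address_space *mapping, pgoff_t index)
{
	struct vm_area_struct *vma;
	bool covered = false;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
		covered = true;
		break;
	}
	i_mmap_unlock_read(mapping);

	return covered;
}

On its own this only sees VMAs that exist at the time of the check,
which is why new mmap()s of the file also have to be refused while the
page is private.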

> 
> Another pKVM specific aspect we would like feedback on, is how to
> handle memory mapped by the host being unshared by a guest. The
> approach we've taken is that on an unshare call from the guest,
> the host userspace is notified that the memory has been unshared,
> in order to allow it to unmap it and mark it as PRIVATE as
> acknowledgment. If the host does not unmap the memory, the
> unshare call issued by the guest fails, which the guest is
> informed about on return.
> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/
> 
> [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core
> 
> [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1
> 
> [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8
> 
> [5] Protected KVM on arm64 (slides)
> https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf
> 
> [6] Protected KVM on arm64 (video)
> https://www.youtube.com/watch?v=9npebeVFbFw
> 
> [7] Supporting guest private memory in Protected KVM on Android (presentation)
> https://lpc.events/event/17/contributions/1487/
> 
> [8] Drivers for Gunyah (patch series)
> https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/

As of 5 minutes before I sent this, there's a v17:
https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/

Thanks,
Elliot


^ permalink raw reply	[flat|nested] 96+ messages in thread

* folio_mmapped
  2024-02-22 23:43 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Elliot Berman
@ 2024-02-23  0:35   ` Matthew Wilcox
  2024-02-26  9:28     ` folio_mmapped David Hildenbrand
  2024-02-26  9:03   ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
  1 sibling, 1 reply; 96+ messages in thread
From: Matthew Wilcox @ 2024-02-23  0:35 UTC (permalink / raw)
  To: Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf
  Cc: linux-mm

On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote:
> > This creates the situation where access to successfully mmap()'d
> > memory might SIGBUS at page fault. There is precedence for
> > similar behavior in the kernel I believe, with MADV_HWPOISON and
> > the hugetlbfs cgroups controller, which could SIGBUS at page
> > fault time depending on the accounting limit.
> 
> I added a test: folio_mmapped() [1] which checks if there's a vma
> covering the corresponding offset into the guest_memfd. I use this
> test before trying to make page private to guest and I've been able to
> ensure that userspace can't even mmap() private guest memory. If I try
> to make memory private, I can test that it's not mmapped and not allow
> memory to become private. In my testing so far, this is enough to
> prevent SIGBUS from happening.
> 
> This test probably should be moved outside Gunyah specific code, and was
> looking for maintainer to suggest the right home for it :)
> 
> [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/

You, um, might have wanted to send an email to linux-mm, not bury it in
the middle of a series of 35 patches?

So this isn't folio_mapped() because you're interested if anyone _could_
fault this page, not whether the folio is currently present in anyone's
page tables.

It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure
it's worth the effort to use walk_page_mapping(), but I would defer to
David.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (26 preceding siblings ...)
  2024-02-22 23:43 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Elliot Berman
@ 2024-02-23 12:00 ` Alexandru Elisei
  2024-02-26  9:05   ` Fuad Tabba
  2024-02-26  9:47 ` David Hildenbrand
  28 siblings, 1 reply; 96+ messages in thread
From: Alexandru Elisei @ 2024-02-23 12:00 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi,

I have a question regarding memory shared between the host and a protected
guest. I scanned the series, and the pKVM patches this series is based on,
but I couldn't easily find the answer.

When a page is shared, that page is not mapped in the stage 2 tables that
the host maintains for a regular VM (kvm->arch.mmu), right? It wouldn't
make much sense for KVM to maintain its own stage 2 that is never used, but
I thought I should double check that to make sure I'm not missing
something.

Thanks,
Alex

On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote:
> This series adds restricted mmap() support to guest_memfd [1], as
> well as support guest_memfd on pKVM/arm64.
> 
> This series is based on Linux 6.8-rc4 + our pKVM core series [2].
> The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but
> the remainder (patches 7-26) require the pKVM core series. A git
> repo with this series applied can be found here [3]. We have a
> (WIP) kvmtool port capable of running the code in this series
> [4]. For a technical deep dive into pKVM, please refer to Quentin
> Perret's KVM Forum Presentation [5, 6].
> 
> I've covered some of the issues presented here in my LPC 2023
> presentation [7].
> 
> We haven't started using this in Android yet, but we aim to move
> away from anonymous memory to guest_memfd once we have the
> necessary support merged upstream. Others (e.g., Gunyah [8]) are
> also looking into guest_memfd for similar reasons as us.
> 
> By design, guest_memfd cannot be mapped, read, or written by the
> host userspace. In pKVM, memory shared between a protected guest
> and the host is shared in-place, unlike the other confidential
> computing solutions that guest_memfd was originally envisaged for
> (e.g, TDX). When initializing a guest, as well as when accessing
> memory shared by the guest to the host, it would be useful to
> support mapping that memory at the host to avoid copying its
> contents.
> 
> One of the benefits of guest_memfd is that it prevents a
> misbehaving host process from crashing the system when attempting
> to access (deliberately or accidentally) protected guest memory,
> since this memory isn't mapped to begin with. Without
> guest_memfd, the hypervisor would still prevent such accesses,
> but in certain cases the host kernel wouldn't be able to recover,
> causing the system to crash.
> 
> Support for mmap() in this patch series maintains the invariant
> that only memory shared with the host, either explicitly by the
> guest or implicitly before the guest has started running (in
> order to populate its memory) is allowed to be mapped. At no time
> should private memory be mapped at the host.
> 
> This patch series is divided into two parts:
> 
> The first part is to the KVM core code (patches 1-6), and is
> based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support
> for mapping guest memory only as long as it is shared. For that,
> the host needs to know the sharing status of guest memory.
> Therefore, the series adds a new KVM memory attribute, accessible
> only by the host kernel, that specifies whether the memory is
> allowed to be mapped by the host userspace.
> 
> The second part of the series (patches 7-26) adds guest_memfd
> support for pKVM/arm64, and is based on the latest version of our
> pKVM series [2]. It uses guest_memfd instead of the current
> approach in Android (not upstreamed) of maintaining a long-term
> GUP on anonymous memory donated to the guest. These patches
> handle faulting in guest memory for a guest, as well as handling
> sharing and unsharing of guest memory while maintaining the
> invariant mentioned earlier.
> 
> In addition to general feedback, we would like feedback on how we
> handle mmap() and faulting-in guest pages at the host (KVM: Add
> restricted support for mapping guest_memfd by the host).
> 
> We don't enforce the invariant that only memory shared with the
> host can be mapped by the host userspace in
> file_operations::mmap(), but in vm_operations_struct:fault(). On
> vm_operations_struct::fault(), we check whether the page is
> shared with the host. If not, we deliver a SIGBUS to the current
> task. The reason for enforcing this at fault() is that mmap()
> does not elevate the pagecount(); it's the faulting in of the
> page which does. Even if we were to check at mmap() whether an
> address can be mapped, we would still need to check again on
> fault(), since between mmap() and fault() the status of the page
> can change.
> 
> This creates the situation where access to successfully mmap()'d
> memory might SIGBUS at page fault. There is precedence for
> similar behavior in the kernel I believe, with MADV_HWPOISON and
> the hugetlbfs cgroups controller, which could SIGBUS at page
> fault time depending on the accounting limit.
> 
> Another pKVM specific aspect we would like feedback on, is how to
> handle memory mapped by the host being unshared by a guest. The
> approach we've taken is that on an unshare call from the guest,
> the host userspace is notified that the memory has been unshared,
> in order to allow it to unmap it and mark it as PRIVATE as
> acknowledgment. If the host does not unmap the memory, the
> unshare call issued by the guest fails, which the guest is
> informed about on return.
> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/
> 
> [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core
> 
> [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1
> 
> [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8
> 
> [5] Protected KVM on arm64 (slides)
> https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf
> 
> [6] Protected KVM on arm64 (video)
> https://www.youtube.com/watch?v=9npebeVFbFw
> 
> [7] Supporting guest private memory in Protected KVM on Android (presentation)
> https://lpc.events/event/17/contributions/1487/
> 
> [8] Drivers for Gunyah (patch series)
> https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/
> 
> Fuad Tabba (20):
>   KVM: Split KVM memory attributes into user and kernel attributes
>   KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
>   KVM: Add restricted support for mapping guestmem by the host
>   KVM: Don't allow private attribute to be set if mapped by host
>   KVM: Don't allow private attribute to be removed for unmappable memory
>   KVM: Implement kvm_(read|/write)_guest_page for private memory slots
>   KVM: arm64: Create hypercall return handler
>   KVM: arm64: Refactor code around handling return from host to guest
>   KVM: arm64: Rename kvm_pinned_page to kvm_guest_page
>   KVM: arm64: Add a field to indicate whether the guest page was pinned
>   KVM: arm64: Do not allow changes to private memory slots
>   KVM: arm64: Skip VMA checks for slots without userspace address
>   KVM: arm64: Handle guest_memfd()-backed guest page faults
>   KVM: arm64: Track sharing of memory from protected guest to host
>   KVM: arm64: Mark a protected VM's memory as unmappable at
>     initialization
>   KVM: arm64: Handle unshare on way back to guest entry rather than exit
>   KVM: arm64: Check that host unmaps memory unshared by guest
>   KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes()
>   KVM: arm64: Enable private memory support when pKVM is enabled
>   KVM: arm64: Enable private memory kconfig for arm64
> 
> Keir Fraser (3):
>   KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall
>   KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall
>   KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall
> 
> Quentin Perret (1):
>   KVM: arm64: Turn llist of pinned pages into an rb-tree
> 
> Will Deacon (2):
>   KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL
>   KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications
> 
>  arch/arm64/include/asm/kvm_host.h             |  17 +-
>  arch/arm64/include/asm/kvm_pkvm.h             |   1 +
>  arch/arm64/kvm/Kconfig                        |   2 +
>  arch/arm64/kvm/arm.c                          |  32 ++-
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
>  arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |   1 +
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  24 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  67 +++++
>  arch/arm64/kvm/hyp/nvhe/pkvm.c                |  89 +++++-
>  arch/arm64/kvm/hyp/nvhe/switch.c              |   1 +
>  arch/arm64/kvm/hypercalls.c                   | 117 +++++++-
>  arch/arm64/kvm/mmu.c                          | 138 +++++++++-
>  arch/arm64/kvm/pkvm.c                         |  83 +++++-
>  include/linux/arm-smccc.h                     |   7 +
>  include/linux/kvm_host.h                      |  34 +++
>  include/uapi/linux/kvm.h                      |   4 +
>  virt/kvm/Kconfig                              |   4 +
>  virt/kvm/guest_memfd.c                        |  89 +++++-
>  virt/kvm/kvm_main.c                           | 260 ++++++++++++++++--
>  19 files changed, 904 insertions(+), 68 deletions(-)
> 
> -- 
> 2.44.0.rc1.240.g4c46232300-goog
> 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host
  2024-02-22 16:28   ` David Hildenbrand
@ 2024-02-26  8:58     ` Fuad Tabba
  2024-02-26  9:57       ` David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Fuad Tabba @ 2024-02-26  8:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi David,

On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > +{
> > +     struct folio *folio;
> > +
> > +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> > +     if (!folio)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     /*
> > +      * Check if the page is allowed to be faulted to the host, with the
> > +      * folio lock held to ensure that the check and incrementing the page
> > +      * count are protected by the same folio lock.
> > +      */
> > +     if (!kvm_gmem_isfaultable(vmf)) {
> > +             folio_unlock(folio);
> > +             return VM_FAULT_SIGBUS;
> > +     }
> > +
> > +     vmf->page = folio_file_page(folio, vmf->pgoff);
>
> We won't currently get hugetlb (or even THP) here. It mimics what shmem
> would do.

At the moment there isn't hugetlb support in guest_memfd(), and
neither in pKVM. Although we do plan on supporting it.

> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
> getting the rmap involved.
>
> Do we have some tests in place that make sure that
> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
> the page again (IOW, that the rmap does indeed work?).

I'm not sure if you mean kernel tests, or if I've tested it. There are
guest_memfd() tests for
fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) , which I have
run. I've also tested it manually with sample programs, and it behaves
as expected.

Otherwise, for Gunyah, Elliot has used folio_mmapped() [1], but Matthew
doesn't think that it would do what we'd like it to do, i.e., ensure
that _no one_ can fault in the page [2].

I would appreciate any ideas, comments, or suggestions regarding this.

Thanks!
/fuad

[1] https://lore.kernel.org/all/20240222141602976-0800.eberman@hu-eberman-lv.qualcomm.com/

[2] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/




> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-22 23:43 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Elliot Berman
  2024-02-23  0:35   ` folio_mmapped Matthew Wilcox
@ 2024-02-26  9:03   ` Fuad Tabba
  1 sibling, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-26  9:03 UTC (permalink / raw)
  To: Fuad Tabba, kvm, kvmarm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng,
	jarkko, amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf

Hi Elliot,

On Thu, Feb 22, 2024 at 11:44 PM Elliot Berman <quic_eberman@quicinc.com> wrote:
>
> On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote:
> > This series adds restricted mmap() support to guest_memfd [1], as
> > well as support guest_memfd on pKVM/arm64.
> >
> > We haven't started using this in Android yet, but we aim to move
> > away from anonymous memory to guest_memfd once we have the
> > necessary support merged upstream. Others (e.g., Gunyah [8]) are
> > also looking into guest_memfd for similar reasons as us.
>
> I'm especially interested to see if we can factor out much of the
> common implementation bits between KVM and Gunyah. In principle, we're
> doing the same thing: the difference is the exact mechanics of
> interacting with the hypervisor, which (I think) could easily be
> extracted into an ops structure.

I agree. We should share and reuse as much code as possible. I'll sync
with you before the V2 of this series.

> [...]
>
> > In addition to general feedback, we would like feedback on how we
> > handle mmap() and faulting-in guest pages at the host (KVM: Add
> > restricted support for mapping guest_memfd by the host).
> >
> > We don't enforce the invariant that only memory shared with the
> > host can be mapped by the host userspace in
> > file_operations::mmap(), but in vm_operations_struct:fault(). On
> > vm_operations_struct::fault(), we check whether the page is
> > shared with the host. If not, we deliver a SIGBUS to the current
> > task. The reason for enforcing this at fault() is that mmap()
> > does not elevate the pagecount(); it's the faulting in of the
> > page which does. Even if we were to check at mmap() whether an
> > address can be mapped, we would still need to check again on
> > fault(), since between mmap() and fault() the status of the page
> > can change.
> >
> > This creates the situation where access to successfully mmap()'d
> > memory might SIGBUS at page fault. There is precedence for
> > similar behavior in the kernel I believe, with MADV_HWPOISON and
> > the hugetlbfs cgroups controller, which could SIGBUS at page
> > fault time depending on the accounting limit.
>
> I added a test: folio_mmapped() [1] which checks if there's a vma
> covering the corresponding offset into the guest_memfd. I use this
> test before trying to make page private to guest and I've been able to
> ensure that userspace can't even mmap() private guest memory. If I try
> to make memory private, I can test that it's not mmapped and not allow
> memory to become private. In my testing so far, this is enough to
> prevent SIGBUS from happening.
>
> This test probably should be moved outside Gunyah specific code, and was
> looking for maintainer to suggest the right home for it :)
>
> [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/

Let's see what the mm-folks think about this [*].

Thanks!
/fuad

[*] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/


> >
> > Another pKVM specific aspect we would like feedback on, is how to
> > handle memory mapped by the host being unshared by a guest. The
> > approach we've taken is that on an unshare call from the guest,
> > the host userspace is notified that the memory has been unshared,
> > in order to allow it to unmap it and mark it as PRIVATE as
> > acknowledgment. If the host does not unmap the memory, the
> > unshare call issued by the guest fails, which the guest is
> > informed about on return.
> >
> > Cheers,
> > /fuad
> >
> > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/
> >
> > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core
> >
> > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1
> >
> > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8
> >
> > [5] Protected KVM on arm64 (slides)
> > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf
> >
> > [6] Protected KVM on arm64 (video)
> > https://www.youtube.com/watch?v=9npebeVFbFw
> >
> > [7] Supporting guest private memory in Protected KVM on Android (presentation)
> > https://lpc.events/event/17/contributions/1487/
> >
> > [8] Drivers for Gunyah (patch series)
> > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/
>
> As of 5 minutes before I sent this, there's a v17:
> https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/
>
> Thanks,
> Elliot
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-23 12:00 ` Alexandru Elisei
@ 2024-02-26  9:05   ` Fuad Tabba
  0 siblings, 0 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-26  9:05 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi Alex,

On Fri, Feb 23, 2024 at 12:01 PM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> I have a question regarding memory shared between the host and a protected
> guest. I scanned the series, and the pKVM patches this series is based on,
> but I couldn't easily find the answer.
>
> When a page is shared, that page is not mapped in the stage 2 tables that
> the host maintains for a regular VM (kvm->arch.mmu), right? It wouldn't
> make much sense for KVM to maintain its own stage 2 that is never used, but
> I thought I should double check that to make sure I'm not missing
> something.

You're right. In protected mode the stage-2 tables are maintained by
the hypervisor in EL2, since we don't trust the host kernel. It is
still KVM of course, but not the regular VM structure, like you said.

Cheers,
/fuad

>
> Thanks,
> Alex
>
> On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote:
> > This series adds restricted mmap() support to guest_memfd [1], as
> > well as support guest_memfd on pKVM/arm64.
> >
> > This series is based on Linux 6.8-rc4 + our pKVM core series [2].
> > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but
> > the remainder (patches 7-26) require the pKVM core series. A git
> > repo with this series applied can be found here [3]. We have a
> > (WIP) kvmtool port capable of running the code in this series
> > [4]. For a technical deep dive into pKVM, please refer to Quentin
> > Perret's KVM Forum Presentation [5, 6].
> >
> > I've covered some of the issues presented here in my LPC 2023
> > presentation [7].
> >
> > We haven't started using this in Android yet, but we aim to move
> > away from anonymous memory to guest_memfd once we have the
> > necessary support merged upstream. Others (e.g., Gunyah [8]) are
> > also looking into guest_memfd for similar reasons as us.
> >
> > By design, guest_memfd cannot be mapped, read, or written by the
> > host userspace. In pKVM, memory shared between a protected guest
> > and the host is shared in-place, unlike the other confidential
> > computing solutions that guest_memfd was originally envisaged for
> > (e.g, TDX). When initializing a guest, as well as when accessing
> > memory shared by the guest to the host, it would be useful to
> > support mapping that memory at the host to avoid copying its
> > contents.
> >
> > One of the benefits of guest_memfd is that it prevents a
> > misbehaving host process from crashing the system when attempting
> > to access (deliberately or accidentally) protected guest memory,
> > since this memory isn't mapped to begin with. Without
> > guest_memfd, the hypervisor would still prevent such accesses,
> > but in certain cases the host kernel wouldn't be able to recover,
> > causing the system to crash.
> >
> > Support for mmap() in this patch series maintains the invariant
> > that only memory shared with the host, either explicitly by the
> > guest or implicitly before the guest has started running (in
> > order to populate its memory) is allowed to be mapped. At no time
> > should private memory be mapped at the host.
> >
> > This patch series is divided into two parts:
> >
> > The first part is to the KVM core code (patches 1-6), and is
> > based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support
> > for mapping guest memory only as long as it is shared. For that,
> > the host needs to know the sharing status of guest memory.
> > Therefore, the series adds a new KVM memory attribute, accessible
> > only by the host kernel, that specifies whether the memory is
> > allowed to be mapped by the host userspace.
> >
> > The second part of the series (patches 7-26) adds guest_memfd
> > support for pKVM/arm64, and is based on the latest version of our
> > pKVM series [2]. It uses guest_memfd instead of the current
> > approach in Android (not upstreamed) of maintaining a long-term
> > GUP on anonymous memory donated to the guest. These patches
> > handle faulting in guest memory for a guest, as well as handling
> > sharing and unsharing of guest memory while maintaining the
> > invariant mentioned earlier.
> >
> > In addition to general feedback, we would like feedback on how we
> > handle mmap() and faulting-in guest pages at the host (KVM: Add
> > restricted support for mapping guest_memfd by the host).
> >
> > We don't enforce the invariant that only memory shared with the
> > host can be mapped by the host userspace in
> > file_operations::mmap(), but in vm_operations_struct:fault(). On
> > vm_operations_struct::fault(), we check whether the page is
> > shared with the host. If not, we deliver a SIGBUS to the current
> > task. The reason for enforcing this at fault() is that mmap()
> > does not elevate the pagecount(); it's the faulting in of the
> > page which does. Even if we were to check at mmap() whether an
> > address can be mapped, we would still need to check again on
> > fault(), since between mmap() and fault() the status of the page
> > can change.
> >
> > This creates the situation where access to successfully mmap()'d
> > memory might SIGBUS at page fault. There is precedence for
> > similar behavior in the kernel I believe, with MADV_HWPOISON and
> > the hugetlbfs cgroups controller, which could SIGBUS at page
> > fault time depending on the accounting limit.
> >
> > Another pKVM specific aspect we would like feedback on, is how to
> > handle memory mapped by the host being unshared by a guest. The
> > approach we've taken is that on an unshare call from the guest,
> > the host userspace is notified that the memory has been unshared,
> > in order to allow it to unmap it and mark it as PRIVATE as
> > acknowledgment. If the host does not unmap the memory, the
> > unshare call issued by the guest fails, which the guest is
> > informed about on return.
> >
> > Cheers,
> > /fuad
> >
> > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/
> >
> > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core
> >
> > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1
> >
> > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8
> >
> > [5] Protected KVM on arm64 (slides)
> > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf
> >
> > [6] Protected KVM on arm64 (video)
> > https://www.youtube.com/watch?v=9npebeVFbFw
> >
> > [7] Supporting guest private memory in Protected KVM on Android (presentation)
> > https://lpc.events/event/17/contributions/1487/
> >
> > [8] Drivers for Gunyah (patch series)
> > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/
> >
> > Fuad Tabba (20):
> >   KVM: Split KVM memory attributes into user and kernel attributes
> >   KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
> >   KVM: Add restricted support for mapping guestmem by the host
> >   KVM: Don't allow private attribute to be set if mapped by host
> >   KVM: Don't allow private attribute to be removed for unmappable memory
> >   KVM: Implement kvm_(read|/write)_guest_page for private memory slots
> >   KVM: arm64: Create hypercall return handler
> >   KVM: arm64: Refactor code around handling return from host to guest
> >   KVM: arm64: Rename kvm_pinned_page to kvm_guest_page
> >   KVM: arm64: Add a field to indicate whether the guest page was pinned
> >   KVM: arm64: Do not allow changes to private memory slots
> >   KVM: arm64: Skip VMA checks for slots without userspace address
> >   KVM: arm64: Handle guest_memfd()-backed guest page faults
> >   KVM: arm64: Track sharing of memory from protected guest to host
> >   KVM: arm64: Mark a protected VM's memory as unmappable at
> >     initialization
> >   KVM: arm64: Handle unshare on way back to guest entry rather than exit
> >   KVM: arm64: Check that host unmaps memory unshared by guest
> >   KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes()
> >   KVM: arm64: Enable private memory support when pKVM is enabled
> >   KVM: arm64: Enable private memory kconfig for arm64
> >
> > Keir Fraser (3):
> >   KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall
> >   KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall
> >   KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall
> >
> > Quentin Perret (1):
> >   KVM: arm64: Turn llist of pinned pages into an rb-tree
> >
> > Will Deacon (2):
> >   KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL
> >   KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications
> >
> >  arch/arm64/include/asm/kvm_host.h             |  17 +-
> >  arch/arm64/include/asm/kvm_pkvm.h             |   1 +
> >  arch/arm64/kvm/Kconfig                        |   2 +
> >  arch/arm64/kvm/arm.c                          |  32 ++-
> >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
> >  arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |   1 +
> >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  24 +-
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  67 +++++
> >  arch/arm64/kvm/hyp/nvhe/pkvm.c                |  89 +++++-
> >  arch/arm64/kvm/hyp/nvhe/switch.c              |   1 +
> >  arch/arm64/kvm/hypercalls.c                   | 117 +++++++-
> >  arch/arm64/kvm/mmu.c                          | 138 +++++++++-
> >  arch/arm64/kvm/pkvm.c                         |  83 +++++-
> >  include/linux/arm-smccc.h                     |   7 +
> >  include/linux/kvm_host.h                      |  34 +++
> >  include/uapi/linux/kvm.h                      |   4 +
> >  virt/kvm/Kconfig                              |   4 +
> >  virt/kvm/guest_memfd.c                        |  89 +++++-
> >  virt/kvm/kvm_main.c                           | 260 ++++++++++++++++--
> >  19 files changed, 904 insertions(+), 68 deletions(-)
> >
> > --
> > 2.44.0.rc1.240.g4c46232300-goog
> >
> >

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-23  0:35   ` folio_mmapped Matthew Wilcox
@ 2024-02-26  9:28     ` David Hildenbrand
  2024-02-26 21:14       ` folio_mmapped Elliot Berman
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-26  9:28 UTC (permalink / raw)
  To: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf
  Cc: linux-mm

On 23.02.24 01:35, Matthew Wilcox wrote:
> On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote:
>>> This creates the situation where access to successfully mmap()'d
>>> memory might SIGBUS at page fault. There is precedence for
>>> similar behavior in the kernel I believe, with MADV_HWPOISON and
>>> the hugetlbfs cgroups controller, which could SIGBUS at page
>>> fault time depending on the accounting limit.
>>
>> I added a test: folio_mmapped() [1] which checks if there's a vma
>> covering the corresponding offset into the guest_memfd. I use this
>> test before trying to make page private to guest and I've been able to
>> ensure that userspace can't even mmap() private guest memory. If I try
>> to make memory private, I can test that it's not mmapped and not allow
>> memory to become private. In my testing so far, this is enough to
>> prevent SIGBUS from happening.
>>
>> This test probably should be moved outside Gunyah specific code, and was
>> looking for maintainer to suggest the right home for it :)
>>
>> [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/
> 
> You, um, might have wanted to send an email to linux-mm, not bury it in
> the middle of a series of 35 patches?
> 
> So this isn't folio_mapped() because you're interested if anyone _could_
> fault this page, not whether the folio is currently present in anyone's
> page tables.
> 
> It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure
> it's worth the effort to use walk_page_mapping(), but I would defer to
> David.

First, I suspect we are not only concerned about current+future VMAs
covering the page; we are also interested in any page pins that could
have been derived from such a VMA?

Imagine user space mmap'ed the file, faulted in the page, took a pin on
the page using pin_user_pages() and friends, and then munmap()'ed the
VMA.

You likely want to catch that as well and not allow a conversion to private?

[I assume you want to convert the page to private only if you hold all 
the folio references -- i.e., if the refcount of a small folio is 1]
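
For a small folio that only lives in the page cache, that check boils
down to a refcount comparison; roughly (a sketch with a made-up helper
name, assuming the caller already holds one reference from its own
lookup of the folio):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* No references beyond the page cache's own plus the caller's lookup ref. */
static bool gmem_folio_has_no_extra_refs(struct folio *folio)
{
	/* the page cache holds one reference per page of the folio */
	long expected = folio_nr_pages(folio) + 1;

	return folio_ref_count(folio) == expected;
}

A pin taken via pin_user_pages() and friends elevates the refcount of a
small folio, so the munmap()-after-pin case above would show up as
extra references and fail such a check.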


Now, regarding the original question (disallow mapping the page), I see 
the following approaches:

1) SIGBUS during page fault. There are other cases that can trigger
    SIGBUS during page faults: hugetlb when we are out of free hugetlb
    pages, userfaultfd with UFFD_FEATURE_SIGBUS.

-> Simple and should get the job done.

2) folio_mmapped() + preventing new mmaps covering that folio

-> More complicated, requires an rmap walk on every conversion.

3) Disallow any mmaps of the file while any page is private

-> Likely not what you want.


Why was 1) abandoned? It looks a lot easier and harder to mess up. Why
are you trying to avoid page faults? What's the use case?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
                   ` (27 preceding siblings ...)
  2024-02-23 12:00 ` Alexandru Elisei
@ 2024-02-26  9:47 ` David Hildenbrand
  2024-02-27  9:37   ` Fuad Tabba
  28 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-26  9:47 UTC (permalink / raw)
  To: Fuad Tabba, kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

On 22.02.24 17:10, Fuad Tabba wrote:
> This series adds restricted mmap() support to guest_memfd [1], as
> well as support guest_memfd on pKVM/arm64.
> 
> This series is based on Linux 6.8-rc4 + our pKVM core series [2].
> The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but
> the remainder (patches 7-26) require the pKVM core series. A git
> repo with this series applied can be found here [3]. We have a
> (WIP) kvmtool port capable of running the code in this series
> [4]. For a technical deep dive into pKVM, please refer to Quentin
> Perret's KVM Forum Presentation [5, 6].
> 
> I've covered some of the issues presented here in my LPC 2023
> presentation [7].
> 
> We haven't started using this in Android yet, but we aim to move
> away from anonymous memory to guest_memfd once we have the
> necessary support merged upstream. Others (e.g., Gunyah [8]) are
> also looking into guest_memfd for similar reasons as us.
> 
> By design, guest_memfd cannot be mapped, read, or written by the
> host userspace. In pKVM, memory shared between a protected guest
> and the host is shared in-place, unlike the other confidential
> computing solutions that guest_memfd was originally envisaged for
> (e.g, TDX).

Can you elaborate (or point to a summary) why pKVM has to be special 
here? Why can't you use guest_memfd only for private memory and another 
(ordinary) memfd for shared memory, like the other confidential 
computing technologies are planning to?

What's the main reason for that decision and can it be avoided?

(s390x also shares in-place, but doesn't need any special-casing like 
guest_memfd provides)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host
  2024-02-26  8:58     ` Fuad Tabba
@ 2024-02-26  9:57       ` David Hildenbrand
  2024-02-26 17:30         ` Fuad Tabba
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-26  9:57 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

On 26.02.24 09:58, Fuad Tabba wrote:
> Hi David,
> 
> On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>> +{
>>> +     struct folio *folio;
>>> +
>>> +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
>>> +     if (!folio)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     /*
>>> +      * Check if the page is allowed to be faulted to the host, with the
>>> +      * folio lock held to ensure that the check and incrementing the page
>>> +      * count are protected by the same folio lock.
>>> +      */
>>> +     if (!kvm_gmem_isfaultable(vmf)) {
>>> +             folio_unlock(folio);
>>> +             return VM_FAULT_SIGBUS;
>>> +     }
>>> +
>>> +     vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>> We won't currently get hugetlb (or even THP) here. It mimics what shmem
>> would do.
> 
> At the moment there isn't hugetlb support in guest_memfd(), and
> neither in pKVM. Although we do plan on supporting it.
> 
>> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
>> getting the rmap involved.
>>
>> Do we have some tests in place that make sure that
>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
>> the page again (IOW, that the rmap does indeed work?).
> 
> I'm not sure if you mean kernel tests, or if I've tested it. There are
> guest_memfd() tests for
> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) , which I have
> run. I've also tested it manually with sample programs, and it behaves
> as expected.

Can you point me at the existing tests? I'm interested in 
mmap()-specific guest_memfd tests.

A test that would make sense to me:

1) Create guest_memfd() and size it to contain at least one page.

2) mmap() it

3) Write some pattern (e.g., all 1's) to the first page using the mmap

4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page

5) Make sure reading from the first page using the mmap reads all 0's

IOW, during 4) we properly unmapped (via rmap) and discarded the page, 
such that 5) will populate a fresh page in the page cache filled with 
0's and map that one.
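
Something along these lines, I assume (a sketch only: it presumes a
guest_memfd fd created elsewhere, e.g. via KVM_CREATE_GUEST_MEMFD, and
that mmap() of it is allowed as proposed in this series):

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

static void test_punch_hole_unmaps(int gmem_fd, size_t page_size)
{
	uint8_t *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			  MAP_SHARED, gmem_fd, 0);
	size_t i;

	assert(p != MAP_FAILED);

	/* 3) write a pattern through the mapping */
	memset(p, 0xff, page_size);

	/* 4) discard the backing page */
	assert(!fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  0, page_size));

	/* 5) re-faulting must observe a fresh, zeroed page */
	for (i = 0; i < page_size; i++)
		assert(p[i] == 0);

	munmap(p, page_size);
}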

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host
  2024-02-26  9:57       ` David Hildenbrand
@ 2024-02-26 17:30         ` Fuad Tabba
  2024-02-27  7:40           ` David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Fuad Tabba @ 2024-02-26 17:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi David,

Thank you very much for reviewing this.

On Mon, Feb 26, 2024 at 9:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.02.24 09:58, Fuad Tabba wrote:
> > Hi David,
> >
> > On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> >>> +{
> >>> +     struct folio *folio;
> >>> +
> >>> +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> >>> +     if (!folio)
> >>> +             return VM_FAULT_SIGBUS;
> >>> +
> >>> +     /*
> >>> +      * Check if the page is allowed to be faulted to the host, with the
> >>> +      * folio lock held to ensure that the check and incrementing the page
> >>> +      * count are protected by the same folio lock.
> >>> +      */
> >>> +     if (!kvm_gmem_isfaultable(vmf)) {
> >>> +             folio_unlock(folio);
> >>> +             return VM_FAULT_SIGBUS;
> >>> +     }
> >>> +
> >>> +     vmf->page = folio_file_page(folio, vmf->pgoff);
> >>
> >> We won't currently get hugetlb (or even THP) here. It mimics what shmem
> >> would do.
> >
> > At the moment there isn't hugetlb support in guest_memfd(), and
> > neither in pKVM. Although we do plan on supporting it.
> >
> >> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
> >> getting the rmap involved.
> >>
> >> Do we have some tests in place that make sure that
> >> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
> >> the page again (IOW, that the rmap does indeed work?).
> >
> > I'm not sure if you mean kernel tests, or if I've tested it. There are
> > guest_memfd() tests for
> > fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) , which I have
> > run. I've also tested it manually with sample programs, and it behaves
> > as expected.
>
> Can you point me at the existing tests? I'm interested in
> mmap()-specific guest_memfd tests.
>
> A test that would make sense to me:
>
> 1) Create guest_memfd() and size it to contain at least one page.
>
> 2) mmap() it
>
> 3) Write some pattern (e.g., all 1's) to the first page using the mmap
>
> 4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page
>
> 5) Make sure reading from the first page using the mmap reads all 0's
>
> IOW, during 4) we properly unmapped (via rmap) and discarded the page,
> such that 5) will populate a fresh page in the page cache filled with
> 0's and map that one.

The existing tests don't test mmap (or rather, they test the inability
to mmap). They do test FALLOC_FL_PUNCH_HOLE. [1]

The tests for mmap() are ones that I wrote myself. I will write a test
like the one you mentioned, and send it with V2, or earlier if you
prefer.

Thanks again,
/fuad


> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-02-26  9:28     ` folio_mmapped David Hildenbrand
@ 2024-02-26 21:14       ` Elliot Berman
  2024-02-27 14:59         ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-02-26 21:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, linux-mm

On Mon, Feb 26, 2024 at 10:28:11AM +0100, David Hildenbrand wrote:
> On 23.02.24 01:35, Matthew Wilcox wrote:
> > On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote:
> > > > This creates the situation where access to successfully mmap()'d
> > > > memory might SIGBUS at page fault. There is precedence for
> > > > similar behavior in the kernel I believe, with MADV_HWPOISON and
> > > > the hugetlbfs cgroups controller, which could SIGBUS at page
> > > > fault time depending on the accounting limit.
> > > 
> > > I added a test: folio_mmapped() [1] which checks if there's a vma
> > > covering the corresponding offset into the guest_memfd. I use this
> > > test before trying to make page private to guest and I've been able to
> > > ensure that userspace can't even mmap() private guest memory. If I try
> > > to make memory private, I can test that it's not mmapped and not allow
> > > memory to become private. In my testing so far, this is enough to
> > > prevent SIGBUS from happening.
> > > 
> > > This test probably should be moved outside Gunyah specific code, and was
> > > looking for maintainer to suggest the right home for it :)
> > > 
> > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/
> > 
> > You, um, might have wanted to send an email to linux-mm, not bury it in
> > the middle of a series of 35 patches?
> > 
> > So this isn't folio_mapped() because you're interested if anyone _could_
> > fault this page, not whether the folio is currently present in anyone's
> > page tables.
> > 
> > It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure
> > it's worth the effort to use walk_page_mapping(), but I would defer to
> > David.
> 
> First, I suspect we are not only concerned about current+future VMAs
> covering the page, we are also interested in any page pins that could have
> been derived from such a VMA?
> 
> Imagine user space mmap'ed the file, faulted in page, took a pin on the page
> using pin_user_pages() and friends, and then munmap()'ed the VMA.
> 
> You likely want to catch that as well and not allow a conversion to private?
> 
> [I assume you want to convert the page to private only if you hold all the
> folio references -- i.e., if the refcount of a small folio is 1]
> 

Ah, this was something I hadn't thought about. I think both Fuad and I
need to update our series to check the refcount rather than mapcount
(kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).

> 
> Now, regarding the original question (disallow mapping the page), I see the
> following approaches:
> 
> 1) SIGBUS during page fault. There are other cases that can trigger
>    SIGBUS during page faults: hugetlb when we are out of free hugetlb
>    pages, userfaultfd with UFFD_FEATURE_SIGBUS.
> 
> -> Simple and should get the job done.
> 
> 2) folio_mmapped() + preventing new mmaps covering that folio
> 
> -> More complicated, requires an rmap walk on every conversion.
> 
> 3) Disallow any mmaps of the file while any page is private
> 
> -> Likely not what you want.
> 
> 
> >> Why was 1) abandoned? It looks a lot easier and harder to mess up. Why are
> you trying to avoid page faults? What's the use case?
> 

We were chatting whether we could do better than the SIGBUS approach.
SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to
return errors early. One difference between hugetlb and this usecase is
that running out of free hugetlb pages isn't something we could detect
at mmap time. In guest_memfd usecase, we should be able to detect when
SIGBUS becomes possible due to memory being lent to guest.

I can't think of a reason why userspace would want/be able to resume
operation after trying to access a page that it shouldn't be allowed to, so
SIGBUS is functional. The advantage of trying to avoid SIGBUS was
better/easier reporting to userspace.

- Elliot

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host
  2024-02-26 17:30         ` Fuad Tabba
@ 2024-02-27  7:40           ` David Hildenbrand
  0 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-02-27  7:40 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

On 26.02.24 18:30, Fuad Tabba wrote:
> Hi David,
> 
> Thank you very much for reviewing this.
> 
> On Mon, Feb 26, 2024 at 9:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.02.24 09:58, Fuad Tabba wrote:
>>> Hi David,
>>>
>>> On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>>>> +{
>>>>> +     struct folio *folio;
>>>>> +
>>>>> +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
>>>>> +     if (!folio)
>>>>> +             return VM_FAULT_SIGBUS;
>>>>> +
>>>>> +     /*
>>>>> +      * Check if the page is allowed to be faulted to the host, with the
>>>>> +      * folio lock held to ensure that the check and incrementing the page
>>>>> +      * count are protected by the same folio lock.
>>>>> +      */
>>>>> +     if (!kvm_gmem_isfaultable(vmf)) {
>>>>> +             folio_unlock(folio);
>>>>> +             return VM_FAULT_SIGBUS;
>>>>> +     }
>>>>> +
>>>>> +     vmf->page = folio_file_page(folio, vmf->pgoff);
>>>>
>>>> We won't currently get hugetlb (or even THP) here. It mimics what shmem
>>>> would do.
>>>
>>> At the moment there isn't hugetlb support in guest_memfd(), and
>>> neither in pKVM. Although we do plan on supporting it.
>>>
>>>> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
>>>> getting the rmap involved.
>>>>
>>>> Do we have some tests in place that make sure that
>>>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
>>>> the page again (IOW, that the rmap does indeed work?).
>>>
>>> I'm not sure if you mean kernel tests, or if I've tested it. There are
>>> guest_memfd() tests for
>>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) , which I have
>>> run. I've also tested it manually with sample programs, and it behaves
>>> as expected.
>>
>> Can you point me at the existing tests? I'm interested in
>> mmap()-specific guest_memfd tests.
>>
>> A test that would make sense to me:
>>
>> 1) Create guest_memfd() and size it to contain at least one page.
>>
>> 2) mmap() it
>>
>> 3) Write some pattern (e.g., all 1's) to the first page using the mmap
>>
>> 4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page
>>
>> 5) Make sure reading from the first page using the mmap reads all 0's
>>
>> IOW, during 4) we properly unmapped (via rmap) and discarded the page,
>> such that 5) will populate a fresh page in the page cache filled with
>> 0's and map that one.
> 
> The existing tests don't test mmap (or rather, they test the inability
> to mmap). They do test FALLOC_FL_PUNCH_HOLE. [1]
> 
> The tests for mmap() are ones that I wrote myself. I will write a test
> like the one you mentioned, and send it with V2, or earlier if you
> prefer.

Thanks, no need to rush. As long as we have some simple test for that 
scenario at some point, all good!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-26  9:47 ` David Hildenbrand
@ 2024-02-27  9:37   ` Fuad Tabba
  2024-02-27 14:41     ` David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Fuad Tabba @ 2024-02-27  9:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi David,

On Mon, Feb 26, 2024 at 9:47 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.02.24 17:10, Fuad Tabba wrote:
> > This series adds restricted mmap() support to guest_memfd [1], as
> > well as support guest_memfd on pKVM/arm64.
> >
> > This series is based on Linux 6.8-rc4 + our pKVM core series [2].
> > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but
> > the remainder (patches 7-26) require the pKVM core series. A git
> > repo with this series applied can be found here [3]. We have a
> > (WIP) kvmtool port capable of running the code in this series
> > [4]. For a technical deep dive into pKVM, please refer to Quentin
> > Perret's KVM Forum Presentation [5, 6].
> >
> > I've covered some of the issues presented here in my LPC 2023
> > presentation [7].
> >
> > We haven't started using this in Android yet, but we aim to move
> > away from anonymous memory to guest_memfd once we have the
> > necessary support merged upstream. Others (e.g., Gunyah [8]) are
> > also looking into guest_memfd for similar reasons as us.
> >
> > By design, guest_memfd cannot be mapped, read, or written by the
> > host userspace. In pKVM, memory shared between a protected guest
> > and the host is shared in-place, unlike the other confidential
> > computing solutions that guest_memfd was originally envisaged for
> > (e.g, TDX).
>
> Can you elaborate (or point to a summary) why pKVM has to be special
> here? Why can't you use guest_memfd only for private memory and another
> (ordinary) memfd for shared memory, like the other confidential
> computing technologies are planning to?

Because the same memory location can switch back and forth between
being shared and private in-place. The host/vmm doesn't know
beforehand which parts of the guest's private memory might be shared
with it later, therefore, it cannot use guest_memfd() for the private
memory and anonymous memory for the shared memory without resorting to
copying. Even if it did know beforehand, it wouldn't help much since
that memory can change back to being private later on. Other
confidential computing proposals like TDX and Arm CCA don't share in
place, and need to copy shared data between private and shared memory.

If you're interested, there was also a more detailed discussion about
this in an earlier guest_memfd() thread [1]

> What's the main reason for that decision and can it be avoided?
> (s390x also shares in-place, but doesn't need any special-casing like
> guest_memfd provides)

In our current implementation of pKVM, we use anonymous memory with a
long-term gup, and the host ends up with valid mappings. This isn't
just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX,
accessing guest private memory can be fatal to the host and the system
as a whole since it could result in a machine check. In arm64 it's not
necessarily fatal to the system as a whole if a userspace process were
to attempt the access. However, a userspace process could trick the
host kernel to try to access the protected guest memory, e.g., by
having a process A strace a malicious process B which passes protected
guest memory as argument to a syscall.

What makes pKVM and Gunyah special is that both can easily share
memory (and its contents) in place, since it's not encrypted, and
convert memory locations between shared/unshared. I'm not familiar
with how s390x handles sharing in place, or how it handles memory
donated to the guest. I assume it's by donating anonymous memory. I
would be also interested to know how it handles and recovers from
similar situations, i.e., host (userspace or kernel) trying to access
guest protected memory.

Thank you,
/fuad

[1] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/

[2] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/

[3] https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/



>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-27  9:37   ` Fuad Tabba
@ 2024-02-27 14:41     ` David Hildenbrand
  2024-02-27 14:49       ` David Hildenbrand
  2024-02-28  9:57       ` Fuad Tabba
  0 siblings, 2 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-02-27 14:41 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi,

>> Can you elaborate (or point to a summary) why pKVM has to be special
>> here? Why can't you use guest_memfd only for private memory and another
>> (ordinary) memfd for shared memory, like the other confidential
>> computing technologies are planning to?
> 
> Because the same memory location can switch back and forth between
> being shared and private in-place. The host/vmm doesn't know
> beforehand which parts of the guest's private memory might be shared
> with it later, therefore, it cannot use guest_memfd() for the private
> memory and anonymous memory for the shared memory without resorting to

I don't remember the latest details about the guest_memfd incarnation in 
user space, but I thought we'd be using guest_memfd for private memory 
and an ordinary memfd for shared memory. But maybe it also works with 
anon memory instead of the memfd and that was just an implementation 
detail :)

> copying. Even if it did know beforehand, it wouldn't help much since
> that memory can change back to being private later on. Other
> confidential computing proposals like TDX and Arm CCA don't share in
> place, and need to copy shared data between private and shared memory.

Right.

> 
> If you're interested, there was also a more detailed discussion about
> this in an earlier guest_memfd() thread [1]

Thanks for the pointer!

> 
>> What's the main reason for that decision and can it be avoided?
>> (s390x also shares in-place, but doesn't need any special-casing like
>> guest_memfd provides)
> 
> In our current implementation of pKVM, we use anonymous memory with a
> long-term gup, and the host ends up with valid mappings. This isn't
> just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX,
> accessing guest private memory can be fatal to the host and the system
> as a whole since it could result in a machine check. In arm64 it's not
> necessarily fatal to the system as a whole if a userspace process were
> to attempt the access. However, a userspace process could trick the
> host kernel to try to access the protected guest memory, e.g., by
> having a process A strace a malicious process B which passes protected
> guest memory as argument to a syscall.

Right.

> 
> What makes pKVM and Gunyah special is that both can easily share
> memory (and its contents) in place, since it's not encrypted, and
> convert memory locations between shared/unshared. I'm not familiar
> with how s390x handles sharing in place, or how it handles memory
> donated to the guest. I assume it's by donating anonymous memory. I
> would be also interested to know how it handles and recovers from
> similar situations, i.e., host (userspace or kernel) trying to access
> guest protected memory.

I don't know all of the s390x "protected VM" details, but it is pretty 
similar. Take a look at arch/s390/kernel/uv.c if you are interested.

There are "ultravisor" calls that can convert a page
* from secure (inaccessible by the host) to non-secure (encrypted but
   accessible by the host)
* from non-secure to secure

Once the host tries to access a "secure" page -- either from the kernel 
or from user space, the host gets a page fault and calls 
arch_make_page_accessible(). That will encrypt page content such that 
the host can access it (migrate/swapout/whatsoever).

The host has to set aside some memory area for the ultravisor to 
"remember" page state.

So you can swapout/migrate these pages, but the host will only read 
encrypted garbage. In contrast to disallowing access to these pages.

So you don't need any guest_memfd games to protect from that -- and one 
doesn't have to travel back in time to have memory that isn't 
swappable/migratable and only comes in one page size.

[I'm not up-to-date on which obscure corner-case CCA requirements the s390x 
implementation cannot fulfill -- like replacing pages in page tables and 
such; I suspect pKVM also cannot cover all these corner-cases]


Extending guest_memfd (the one that was promised initially to not be 
mmappable) to be mmappable just to avoid some crashes in corner cases is 
the right approach. But I'm pretty sure that has all been discussed 
before, that's why I am asking about some details :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-27 14:41     ` David Hildenbrand
@ 2024-02-27 14:49       ` David Hildenbrand
  2024-02-28  9:57       ` Fuad Tabba
  1 sibling, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-02-27 14:49 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

I missed here "I'm wondering whether ..."

> Extending guest_memfd (the one that was promised initially to not be
> mmappable) to be mmappable just to avoid some crashes in corner cases is
> the right approach. But I'm pretty sure that has all been discussed
> before, that's why I am asking about some details :)
> 

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-26 21:14       ` folio_mmapped Elliot Berman
@ 2024-02-27 14:59         ` David Hildenbrand
  2024-02-28 10:48           ` folio_mmapped Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-27 14:59 UTC (permalink / raw)
  To: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, linux-mm

> 
> Ah, this was something I hadn't thought about. I think both Fuad and I
> need to update our series to check the refcount rather than mapcount
> (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).

An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). 
But checking for any unexpected references might be better (there are 
still some GUP users that don't use FOLL_PIN).

At least concurrent migration/swapout (that temporarily unmaps a folio 
and can give you folio_mapped() "false negatives", which both take a 
temporary folio reference and hold the page lock) should not be a 
concern because guest_memfd doesn't support that yet.
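
For illustration only (not code from the series, and the expected-refcount
accounting is a guess), the conversion-side check could be something like:

        /* Sketch: may this folio be converted to private? Called with the
         * folio locked so the result can't race with kvm_gmem_fault(). */
        static bool gmem_can_convert_to_private(struct folio *folio)
        {
                if (folio_mapped(folio) || folio_maybe_dma_pinned(folio))
                        return false;

                /* Also catch GUP users that don't use FOLL_PIN: assume the only
                 * legitimate references are the page cache's and our own. */
                return folio_ref_count(folio) <= folio_nr_pages(folio) + 1;
        }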

> 
>>
>> Now, regarding the original question (disallow mapping the page), I see the
>> following approaches:
>>
>> 1) SIGBUS during page fault. There are other cases that can trigger
>>     SIGBUS during page faults: hugetlb when we are out of free hugetlb
>>     pages, userfaultfd with UFFD_FEATURE_SIGBUS.
>>
>> -> Simple and should get the job done.
>>
>> 2) folio_mmapped() + preventing new mmaps covering that folio
>>
>> -> More complicated, requires an rmap walk on every conversion.
>>
>> 3) Disallow any mmaps of the file while any page is private
>>
>> -> Likely not what you want.
>>
>>
>> Why was 1) abandoned? It looks a lot easier and harder to mess up. Why are
>> you trying to avoid page faults? What's the use case?
>>
> 
> We were chatting whether we could do better than the SIGBUS approach.
> SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to
> return errors early. One difference between hugetlb and this usecase is
> that running out of free hugetlb pages isn't something we could detect

With hugetlb reservation one can try detecting it at mmap() time. But as 
reservations are not NUMA aware, it's not reliable.

> at mmap time. In guest_memfd usecase, we should be able to detect when
> SIGBUS becomes possible due to memory being lent to guest.
> 
> I can't think of a reason why userspace would want/be able to resume
> operation after trying to access a page that it shouldn't be allowed, so
> SIGBUS is functional. The advantage of trying to avoid SIGBUS was
> better/easier reporting to userspace.

To me, it sounds conceptually easier and less error-prone to

1) Converting a page to private only if there are no unexpected
    references (no mappings, GUP pins, ...)
2) Disallowing mapping private pages and failing the page fault.
3) Handling that small race window only (page lock?)

Instead of

1) Converting a page to private only if there are no unexpected
    references (no mappings, GUP pins, ...) and no VMAs covering it where
    we could fault it in later
2) Disallowing mmap when the range would contain any private page
3) Handling races between mmap and page conversion

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-27 14:41     ` David Hildenbrand
  2024-02-27 14:49       ` David Hildenbrand
@ 2024-02-28  9:57       ` Fuad Tabba
  2024-02-28 10:12         ` David Hildenbrand
  1 sibling, 1 reply; 96+ messages in thread
From: Fuad Tabba @ 2024-02-28  9:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

Hi David,

On Tue, Feb 27, 2024 at 2:41 PM David Hildenbrand <david@redhat.com> wrote:
>
> Hi,
>
> >> Can you elaborate (or point to a summary) why pKVM has to be special
> >> here? Why can't you use guest_memfd only for private memory and another
> >> (ordinary) memfd for shared memory, like the other confidential
> >> computing technologies are planning to?
> >
> > Because the same memory location can switch back and forth between
> > being shared and private in-place. The host/vmm doesn't know
> > beforehand which parts of the guest's private memory might be shared
> > with it later, therefore, it cannot use guest_memfd() for the private
> > memory and anonymous memory for the shared memory without resorting to
>
> I don't remember the latest details about the guest_memfd incarnation in
> user space, but I thought we'd be using guest_memfd for private memory
> and an ordinary memfd for shared memory. But maybe it also works with
> anon memory instead of the memfd and that was just an implementation
> detail :)
>
> > copying. Even if it did know beforehand, it wouldn't help much since
> > that memory can change back to being private later on. Other
> > confidential computing proposals like TDX and Arm CCA don't share in
> > place, and need to copy shared data between private and shared memory.
>
> Right.
>
> >
> > If you're interested, there was also a more detailed discussion about
> > this in an earlier guest_memfd() thread [1]
>
> Thanks for the pointer!
>
> >
> >> What's the main reason for that decision and can it be avoided?
> >> (s390x also shares in-place, but doesn't need any special-casing like
> >> guest_memfd provides)
> >
> > In our current implementation of pKVM, we use anonymous memory with a
> > long-term gup, and the host ends up with valid mappings. This isn't
> > just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX,
> > accessing guest private memory can be fatal to the host and the system
> > as a whole since it could result in a machine check. In arm64 it's not
> > necessarily fatal to the system as a whole if a userspace process were
> > to attempt the access. However, a userspace process could trick the
> > host kernel to try to access the protected guest memory, e.g., by
> > having a process A strace a malicious process B which passes protected
> > guest memory as argument to a syscall.
>
> Right.
>
> >
> > What makes pKVM and Gunyah special is that both can easily share
> > memory (and its contents) in place, since it's not encrypted, and
> > convert memory locations between shared/unshared. I'm not familiar
> > with how s390x handles sharing in place, or how it handles memory
> > donated to the guest. I assume it's by donating anonymous memory. I
> > would be also interested to know how it handles and recovers from
> > similar situations, i.e., host (userspace or kernel) trying to access
> > guest protected memory.
>
> I don't know all of the s390x "protected VM" details, but it is pretty
> similar. Take a look at arch/s390/kernel/uv.c if you are interested.
>
> There are "ultravisor" calls that can convert a page
> * from secure (inaccessible by the host) to non-secure (encrypted but
>    accessible by the host)
> * from non-secure to secure
>
> Once the host tries to access a "secure" page -- either from the kernel
> or from user space, the host gets a page fault and calls
> arch_make_page_accessible(). That will encrypt page content such that
> the host can access it (migrate/swapout/whatsoever).
>
> The host has to set aside some memory area for the ultravisor to
> "remember" page state.
>
> So you can swapout/migrate these pages, but the host will only read
> encrypted garbage. In contrast to disallowing access to these pages.
>
> So you don't need any guest_memfd games to protect from that -- and one
> doesn't have to travel back in time to have memory that isn't
> swappable/migratable and only comes in one page size.
>
> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x
> implementation cannot fulfill -- like replacing pages in page tables and
> such; I suspect pKVM also cannot cover all these corner-cases]

Thanks for this. I'll do some more reading on how things work with s390x.

Right, and of course, one key difference of course is that pKVM
doesn't encrypt anything, and only relies on stage-2 protection to
protect the guest.

>
> Extending guest_memfd (the one that was promised initially to not be
> mmappable) to be mmappable just to avoid some crashes in corner cases is
> the right approach. But I'm pretty sure that has all been discussed
> before, that's why I am asking about some details :)

Thank you very much for your reviews and comments. They've already
been very helpful. I noticed the gmap.h in the s390 source, which
might also be something that we could learn from. So please do ask for
as much details as you like.

Cheers,
/fuad

> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-28  9:57       ` Fuad Tabba
@ 2024-02-28 10:12         ` David Hildenbrand
  2024-02-28 14:01           ` Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-28 10:12 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

>> So you don't need any guest_memfd games to protect from that -- and one
>> doesn't have to travel back in time to have memory that isn't
>> swappable/migratable and only comes in one page size.
>>
>> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x
>> implementation cannot fulfill -- like replacing pages in page tables and
>> such; I suspect pKVM also cannot cover all these corner-cases]
> 
> Thanks for this. I'll do some more reading on how things work with s390x.
> 
> Right, and of course, one key difference of course is that pKVM
> doesn't encrypt anything, and only relies on stage-2 protection to
> protect the guest.

I don't remember what exactly s390x does, but I recall that it might 
only encrypt the memory content as it transitions a page from secure to 
non-secure.

Something like that could also be implemented using pKVM (unless I am 
missing something), but it might not be that trivial, of course :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-27 14:59         ` folio_mmapped David Hildenbrand
@ 2024-02-28 10:48           ` Quentin Perret
  2024-02-28 11:11             ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-02-28 10:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote:
> > 
> > Ah, this was something I hadn't thought about. I think both Fuad and I
> > need to update our series to check the refcount rather than mapcount
> > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).
> 
> An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But
> checking for any unexpected references might be better (there are still some
> GUP users that don't use FOLL_PIN).

As a non-mm person I'm not sure I understand the consequences of holding
a GUP pin to a page that is not covered by any VMA. The absence of VMAs
implies that userspace cannot access the page, right? Presumably the kernel
can't be coerced into accessing that page either? Is that correct?

> At least concurrent migration/swapout (that temporarily unmaps a folio and
> can give you folio_mapped() "false negatives", which both take a temporary
> folio reference and hold the page lock) should not be a concern because
> guest_memfd doesn't support that yet.
> 
> > 
> > > 
> > > Now, regarding the original question (disallow mapping the page), I see the
> > > following approaches:
> > > 
> > > 1) SIGBUS during page fault. There are other cases that can trigger
> > >     SIGBUS during page faults: hugetlb when we are out of free hugetlb
> > >     pages, userfaultfd with UFFD_FEATURE_SIGBUS.
> > > 
> > > -> Simple and should get the job done.
> > > 
> > > 2) folio_mmapped() + preventing new mmaps covering that folio
> > > 
> > > -> More complicated, requires an rmap walk on every conversion.
> > > 
> > > 3) Disallow any mmaps of the file while any page is private
> > > 
> > > -> Likely not what you want.
> > > 
> > > 
> > > Why was 1) abandoned? It looks a lot easier and harder to mess up. Why are
> > > you trying to avoid page faults? What's the use case?
> > > 
> > 
> > We were chatting whether we could do better than the SIGBUS approach.
> > SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to
> > return errors early. One difference between hugetlb and this usecase is
> > that running out of free hugetlb pages isn't something we could detect
> 
> With hugetlb reservation one can try detecting it at mmap() time. But as
> reservations are not NUMA aware, it's not reliable.
> 
> > at mmap time. In guest_memfd usecase, we should be able to detect when
> > SIGBUS becomes possible due to memory being lent to guest.
> > 
> > I can't think of a reason why userspace would want/be able to resume
> > operation after trying to access a page that it shouldn't be allowed, so
> > SIGBUS is functional. The advantage of trying to avoid SIGBUS was
> > better/easier reporting to userspace.
> 
> To me, it sounds conceptually easier and less error-prone to
> 
> 1) Converting a page to private only if there are no unexpected
>    references (no mappings, GUP pins, ...)
> 2) Disallowing mapping private pages and failing the page fault.
> 3) Handling that small race window only (page lock?)
> 
> Instead of
> 
> 1) Converting a page to private only if there are no unexpected
>    references (no mappings, GUP pins, ...) and no VMAs covering it where
>    we could fault it in later
> 2) Disallowing mmap when the range would contain any private page
> 3) Handling races between mmap and page conversion

The one thing that makes the second option cleaner from a userspace
perspective (IMO) is that the conversion to private is happening lazily
during guest faults. So whether or not an mmapped page can indeed be
accessed from userspace will be entirely nondeterministic as it depends
on the guest faulting pattern which userspace is entirely unaware of.
Elliot's suggestion would prevent spurious crashes caused by that
somewhat odd behaviour, though arguably sane userspace software
shouldn't be doing that to start with.

To add a layer of paint to the shed, the usage of SIGBUS for
something that is really a permission access problem doesn't feel
appropriate. Allocating memory via guestmem and donating that to a
protected guest is a way for userspace to voluntarily relinquish access
permissions to the memory it allocated. So a userspace process violating
that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
point that signal would be sent, the page would have been accounted
against that userspace process, so not sure the paging examples that
were discussed earlier are exactly comparable. To illustrate that
differently, given that pKVM and Gunyah use MMU-based protection, there
is nothing architecturally that prevents a guest from sharing a page
back with Linux as RO. Note that we don't currently support this, so I
don't want to conflate this use case, but that hopefully makes it a
little more obvious that this is a "there is a page, but you don't
currently have the permission to access it" problem rather than "sorry
but we ran out of pages" problem.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-28 10:48           ` folio_mmapped Quentin Perret
@ 2024-02-28 11:11             ` David Hildenbrand
  2024-02-28 12:44               ` folio_mmapped Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-28 11:11 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On 28.02.24 11:48, Quentin Perret wrote:
> On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote:
>>>
>>> Ah, this was something I hadn't thought about. I think both Fuad and I
>>> need to update our series to check the refcount rather than mapcount
>>> (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).
>>
>> An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But
>> checking for any unexpected references might be better (there are still some
>> GUP users that don't use FOLL_PIN).
> 
> As a non-mm person I'm not sure I understand the consequences of holding
> a GUP pin to a page that is not covered by any VMA. The absence of VMAs
> imply that userspace cannot access the page right? Presumably the kernel
> can't be coerced into accessing that page either? Is that correct?

Simple example: register the page using an iouring fixed buffer, then 
unmap the VMA. iouring now has the page pinned and can read/write it 
using an address in the kernel virtual address space (direct map).

Then, you can happily make the kernel read/write that page using 
iouring, even though no VMA still covers/maps that page.
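
For example, something like this (liburing sketch, unrelated to the series;
it just illustrates the pin outliving the VMA, assuming the guest_memfd is
mmap()able as proposed here):

#include <liburing.h>
#include <sys/mman.h>
#include <sys/uio.h>

/* Sketch: gmem_fd is an mmap()able guest_memfd, data_fd is any readable file. */
static void pin_outlives_vma(int gmem_fd, int data_fd)
{
        struct io_uring ring;
        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                         gmem_fd, 0);
        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
        struct io_uring_sqe *sqe;

        io_uring_queue_init(8, &ring, 0);
        io_uring_register_buffers(&ring, &iov, 1); /* long-term pin on the page */

        munmap(buf, 4096);      /* no VMA covers the page anymore */

        /* The kernel will still write the pinned page via the direct map. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read_fixed(sqe, data_fd, buf, 4096, 0, 0);
        io_uring_submit(&ring);
}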

[...]

>> Instead of
>>
>> 1) Converting a page to private only if there are no unexpected
>>     references (no mappings, GUP pins, ...) and no VMAs covering it where
>>     we could fault it in later
>> 2) Disallowing mmap when the range would contain any private page
>> 3) Handling races between mmap and page conversion
> 
> The one thing that makes the second option cleaner from a userspace
> perspective (IMO) is that the conversion to private is happening lazily
> during guest faults. So whether or not an mmapped page can indeed be
> accessed from userspace will be entirely undeterministic as it depends
> on the guest faulting pattern which userspace is entirely unaware of.
> Elliot's suggestion would prevent spurious crashes caused by that
> somewhat odd behaviour, though arguably sane userspace software
> shouldn't be doing that to start with.

The last sentence is the important one. User space should not access 
that memory. If it does, it gets a slap on the hand. Because it should 
not access that memory.

We might even be able to export to user space which pages are currently 
accessible and which ones not (e.g., pagemap), although it would be racy 
as long as the VM is running and can trigger a conversion.

> 
> To add a layer of paint to the shed, the usage of SIGBUS for
> something that is really a permission access problem doesn't feel

SIGBUS stands for "BUS error (bad memory access)."

Which makes sense, if you try accessing something that can no longer be 
accessed. It's now inaccessible. Even if it is temporarily.

Just like a page with an MCE error. Swapin errors. Etc. You cannot 
access it.

It might be a permission problem on the pKVM side, but it's not the 
traditional "permission problem" as in mprotect() and friends. You 
cannot resolve that permission problem yourself. It's a higher entity 
that turned that memory inaccessible.

> appropriate. Allocating memory via guestmem and donating that to a
> protected guest is a way for userspace to voluntarily relinquish access
> permissions to the memory it allocated. So a userspace process violating
> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
> point that signal would be sent, the page would have been accounted
> against that userspace process, so not sure the paging examples that
> were discussed earlier are exactly comparable. To illustrate that
> differently, given that pKVM and Gunyah use MMU-based protection, there
> is nothing architecturally that prevents a guest from sharing a page
> back with Linux as RO. 

Sure, then allow page faults that allow for reads and give a signal on 
write faults.

In the scenario, it even makes more sense to not constantly require new 
mmap's from user space just to access a now-shared page.

> Note that we don't currently support this, so I
> don't want to conflate this use case, but that hopefully makes it a
> little more obvious that this is a "there is a page, but you don't
> currently have the permission to access it" problem rather than "sorry
> but we ran out of pages" problem.

We could use other signals, as long as the semantics are clear and 
it's documented. Maybe SIGSEGV would be warranted.

I consider that a minor detail, though.

Requiring mmap()/munmap() dances just to access a page that is now 
shared from user space sounds a bit suboptimal. But I don't know all the 
details of the user space implementation.

"mmap() the whole thing once and only access what you are supposed to 
access" sounds reasonable to me. If you don't play by the rules, you get 
a signal.

But I'm happy to learn more details.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-28 11:11             ` folio_mmapped David Hildenbrand
@ 2024-02-28 12:44               ` Quentin Perret
  2024-02-28 13:00                 ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-02-28 12:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Wednesday 28 Feb 2024 at 12:11:30 (+0100), David Hildenbrand wrote:
> On 28.02.24 11:48, Quentin Perret wrote:
> > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote:
> > > > 
> > > > Ah, this was something I hadn't thought about. I think both Fuad and I
> > > > need to update our series to check the refcount rather than mapcount
> > > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).
> > > 
> > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But
> > > checking for any unexpected references might be better (there are still some
> > > GUP users that don't use FOLL_PIN).
> > 
> > As a non-mm person I'm not sure I understand the consequences of holding
> > a GUP pin to a page that is not covered by any VMA. The absence of VMAs
> > imply that userspace cannot access the page right? Presumably the kernel
> > can't be coerced into accessing that page either? Is that correct?
> 
> Simple example: register the page using an iouring fixed buffer, then unmap
> the VMA. iouring now has the page pinned and can read/write it using an
> address in the kernel virtual address space (direct map).
> 
> Then, you can happily make the kernel read/write that page using iouring,
> even though no VMA still covers/maps that page.

Makes sense, and yes that would be a major bug if we let that happen,
thanks for the explanation.

> [...]
> 
> > > Instead of
> > > 
> > > 1) Converting a page to private only if there are no unexpected
> > >     references (no mappings, GUP pins, ...) and no VMAs covering it where
> > >     we could fault it in later
> > > 2) Disallowing mmap when the range would contain any private page
> > > 3) Handling races between mmap and page conversion
> > 
> > The one thing that makes the second option cleaner from a userspace
> > perspective (IMO) is that the conversion to private is happening lazily
> > during guest faults. So whether or not an mmapped page can indeed be
> > accessed from userspace will be entirely undeterministic as it depends
> > on the guest faulting pattern which userspace is entirely unaware of.
> > Elliot's suggestion would prevent spurious crashes caused by that
> > somewhat odd behaviour, though arguably sane userspace software
> > shouldn't be doing that to start with.
> 
> The last sentence is the important one. User space should not access that
> memory. If it does, it gets a slap on the hand. Because it should not access
> that memory.
> 
> We might even be able to export to user space which pages are currently
> accessible and which ones not (e.g., pagemap), although it would be racy as
> long as the VM is running and can trigger a conversion.
> 
> > 
> > To add a layer of paint to the shed, the usage of SIGBUS for
> > something that is really a permission access problem doesn't feel
> 
> SIGBUS stands for "BUS error (bad memory access)."
> 
> Which makes sense, if you try accessing something that can no longer be
> accessed. It's now inaccessible. Even if it is temporarily.
> 
> Just like a page with an MCE error. Swapin errors. Etc. You cannot access
> it.
> 
> It might be a permission problem on the pKVM side, but it's not the
> traditional "permission problem" as in mprotect() and friends. You cannot
> resolve that permission problem yourself. It's a higher entity that turned
> that memory inaccessible.

Well that's where I'm not sure I agree. Userspace can, in fact, get
back all of that memory by simply killing the protected VM. With the
approach suggested here, the guestmem pages are entirely accessible to
the host until they are attached to a running protected VM which
triggers the protection. It is very much userspace saying "I promise not
to touch these pages from now on" when it does that, in a way that I
personally find very comparable to the mprotect case. It is not some
other entity that pulls the carpet from under userspace's feet, it is
userspace being inconsistent with itself that causes the issue here, and
that's why SIGBUS feels kinda wrong as it tends to be used to report
external errors of some sort.

> > appropriate. Allocating memory via guestmem and donating that to a
> > protected guest is a way for userspace to voluntarily relinquish access
> > permissions to the memory it allocated. So a userspace process violating
> > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
> > point that signal would be sent, the page would have been accounted
> > against that userspace process, so not sure the paging examples that
> > were discussed earlier are exactly comparable. To illustrate that
> > differently, given that pKVM and Gunyah use MMU-based protection, there
> > is nothing architecturally that prevents a guest from sharing a page
> > back with Linux as RO.
> 
> Sure, then allow page faults that allow for reads and give a signal on write
> faults.
> 
> In the scenario, it even makes more sense to not constantly require new
> mmap's from user space just to access a now-shared page.
> 
> > Note that we don't currently support this, so I
> > don't want to conflate this use case, but that hopefully makes it a
> > little more obvious that this is a "there is a page, but you don't
> > currently have the permission to access it" problem rather than "sorry
> > but we ran out of pages" problem.
> 
>> We could use other signals, as long as the semantics are clear and it's
> documented. Maybe SIGSEGV would be warranted.
> 
> I consider that a minor detail, though.
>
> Requiring mmap()/munmap() dances just to access a page that is now shared
> from user space sounds a bit suboptimal. But I don't know all the details of
> the user space implementation.

Agreed, if we could save having to mmap() each page that gets shared
back that would be a nice performance optimization.

> "mmap() the whole thing once and only access what you are supposed to
> access" sounds reasonable to me. If you don't play by the rules, you get a
> signal.

"... you get a signal, or maybe you don't". But yes I understand your
point, and as per the above there are real benefits to this approach so
why not.

What do we expect userspace to do when a page goes from shared back to
being guest-private, because e.g. the guest decides to unshare? Use
munmap() on that page? Or perhaps an madvise() call of some sort? Note
that this will be needed when starting a guest as well, as userspace
needs to copy the guest payload in the guestmem file prior to starting
the protected VM.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-28 12:44               ` folio_mmapped Quentin Perret
@ 2024-02-28 13:00                 ` David Hildenbrand
  2024-02-28 13:34                   ` folio_mmapped Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-02-28 13:00 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

>>> To add a layer of paint to the shed, the usage of SIGBUS for
>>> something that is really a permission access problem doesn't feel
>>
>> SIGBUS stands for "BUS error (bad memory access)."
>>
>> Which makes sense, if you try accessing something that can no longer be
>> accessed. It's now inaccessible. Even if it is temporarily.
>>
>> Just like a page with an MCE error. Swapin errors. Etc. You cannot access
>> it.
>>
>> It might be a permission problem on the pKVM side, but it's not the
>> traditional "permission problem" as in mprotect() and friends. You cannot
>> resolve that permission problem yourself. It's a higher entity that turned
>> that memory inaccessible.
> 
> Well that's where I'm not sure to agree. Userspace can, in fact, get
> back all of that memory by simply killing the protected VM. With the

Right, but that would likely "wipe" the pages so they can be made 
accessible again, right?

That's the whole point why we are handing the pages over to the "higher 
entity", and allow someone else (the VM) to turn them into a state where 
we can no longer read them.

(if you follow the other discussion, it would actually be nice if we 
could read them and would get encrypted content back, like s390x does; 
but that's a different discussion and I assume pretty much out of scope :) )

> approach suggested here, the guestmem pages are entirely accessible to
> the host until they are attached to a running protected VM which
> triggers the protection. It is very much userspace saying "I promise not
> to touch these pages from now on" when it does that, in a way that I
> personally find very comparable to the mprotect case. It is not some
> other entity that pulls the carpet from under userspace's feet, it is
> userspace being inconsistent with itself that causes the issue here, and
> that's why SIGBUS feels kinda wrong as it tends to be used to report
> external errors of some sort.

I recall that user space can also trigger SIGBUS when doing some 
mmap()+truncate() thingies, and probably a bunch more, that could be 
fixed up later.

I don't see a problem with SIGBUS here, but I do understand your view. 
I consider the exact signal a minor detail, though.

> 
>>> appropriate. Allocating memory via guestmem and donating that to a
>>> protected guest is a way for userspace to voluntarily relinquish access
>>> permissions to the memory it allocated. So a userspace process violating
>>> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
>>> point that signal would be sent, the page would have been accounted
>>> against that userspace process, so not sure the paging examples that
>>> were discussed earlier are exactly comparable. To illustrate that
>>> differently, given that pKVM and Gunyah use MMU-based protection, there
>>> is nothing architecturally that prevents a guest from sharing a page
>>> back with Linux as RO.
>>
>> Sure, then allow page faults that allow for reads and give a signal on write
>> faults.
>>
>> In the scenario, it even makes more sense to not constantly require new
>> mmap's from user space just to access a now-shared page.
>>
>>> Note that we don't currently support this, so I
>>> don't want to conflate this use case, but that hopefully makes it a
>>> little more obvious that this is a "there is a page, but you don't
>>> currently have the permission to access it" problem rather than "sorry
>>> but we ran out of pages" problem.
>>
>> We could use other signals, as long as the semantics are clear and it's
>> documented. Maybe SIGSEGV would be warranted.
>>
>> I consider that a minor detail, though.
>>
>> Requiring mmap()/munmap() dances just to access a page that is now shared
>> from user space sounds a bit suboptimal. But I don't know all the details of
>> the user space implementation.
> 
> Agreed, if we could save having to mmap() each page that gets shared
> back that would be a nice performance optimization.
> 
>> "mmap() the whole thing once and only access what you are supposed to
>> access" sounds reasonable to me. If you don't play by the rules, you get a
>> signal.
> 
> "... you get a signal, or maybe you don't". But yes I understand your
> point, and as per the above there are real benefits to this approach so
> why not.
> 
> What do we expect userspace to do when a page goes from shared back to
> being guest-private, because e.g. the guest decides to unshare? Use
> munmap() on that page? Or perhaps an madvise() call of some sort? Note
> that this will be needed when starting a guest as well, as userspace
> needs to copy the guest payload in the guestmem file prior to starting
> the protected VM.

Let's assume we have the whole guest_memfd mapped exactly once in our 
process, a single VMA.

When setting up the VM, we'll write the payload and then fire up the VM.

That will (I assume) trigger some shared -> private conversion.

When we want to convert shared -> private in the kernel, we would first 
check if the page is currently mapped. If it is, we could try unmapping 
that page using an rmap walk.

Then, we'd make sure that there are really no other references and 
protect against concurrent mapping of the page. Now we can convert the 
page to private.

As we want to avoid the rmap walk, user space can be nice and simply 
MADV_DONTNEED the shared memory portions once it's done with it. For 
example, after writing the payload.

Just a thought, maybe I am missing some details.
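
For illustration, a rough userspace-side sketch of that setup flow (this
assumes the guest_memfd mmap() support proposed in this series; guest_memfd
creation, memslot setup and error handling are all left out):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Sketch only: gmem_fd is assumed to be a guest_memfd file descriptor. */
static void load_payload(int gmem_fd, size_t size,
                         const void *payload, size_t payload_size)
{
        /* One VMA covering the whole (still shared) guest_memfd. */
        uint8_t *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, gmem_fd, 0);

        /* Write the guest payload while the memory is still shared. */
        memcpy(mem, payload, payload_size);

        /*
         * Be nice and zap our PTEs once we're done; the payload stays
         * in the guest_memfd page cache, but the later shared ->
         * private conversion no longer needs the rmap walk.
         */
        madvise(mem, size, MADV_DONTNEED);
}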

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-28 13:00                 ` folio_mmapped David Hildenbrand
@ 2024-02-28 13:34                   ` Quentin Perret
  2024-02-28 18:43                     ` folio_mmapped Elliot Berman
  2024-02-29 10:04                     ` folio_mmapped David Hildenbrand
  0 siblings, 2 replies; 96+ messages in thread
From: Quentin Perret @ 2024-02-28 13:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Wednesday 28 Feb 2024 at 14:00:50 (+0100), David Hildenbrand wrote:
> > > > To add a layer of paint to the shed, the usage of SIGBUS for
> > > > something that is really a permission access problem doesn't feel
> > > 
> > > SIGBUS stands for "BUS error (bad memory access)."
> > > 
> > > Which makes sense, if you try accessing something that can no longer be
> > > accessed. It's now inaccessible. Even if it is temporarily.
> > > 
> > > Just like a page with an MCE error. Swapin errors. Etc. You cannot access
> > > it.
> > > 
> > > It might be a permission problem on the pKVM side, but it's not the
> > > traditional "permission problem" as in mprotect() and friends. You cannot
> > > resolve that permission problem yourself. It's a higher entity that turned
> > > that memory inaccessible.
> > 
> > Well that's where I'm not sure I agree. Userspace can, in fact, get
> > back all of that memory by simply killing the protected VM. With the
> 
> Right, but that would likely "wipe" the pages so they can be made accessible
> again, right?

Yep, indeed.

> That's the whole point why we are handing the pages over to the "higher
> entity", and allow someone else (the VM) to turn them into a state where we
> can no longer read them.
> 
> (if you follow the other discussion, it would actually be nice if we could
> read them and would get encrypted content back, like s390x does; but that's
> a different discussion and I assume pretty much out of scope :) )

Interesting, I'll read up. On a side note, I'm also considering adding a
guest-facing hypervisor interface to let the guest decide to opt out of
the hypervisor wipe as discussed above. That would be useful for a guest
that is shutting itself down (which could be cooperating with the host
Linux) and that knows it has erased its secrets. That is in general
difficult to do for an OS, but a simple approach could be to poison all
its memory (or maybe encrypt it?) before opting out of that wipe.

The hypervisor wipe is done in hypervisor context (obviously), which is
non-preemptible, so avoiding wiping (or encrypting) loads of memory
there is highly desirable. Also pKVM doesn't have a linear map of all
memory for security reasons, so we need to map/unmap the pages one by
one, which sucks as much as it sounds.

But yes, we're digressing, that is all for later :)

> > approach suggested here, the guestmem pages are entirely accessible to
> > the host until they are attached to a running protected VM which
> > triggers the protection. It is very much userspace saying "I promise not
> > to touch these pages from now on" when it does that, in a way that I
> > personally find very comparable to the mprotect case. It is not some
> > other entity that pulls the carpet from under userspace's feet, it is
> > userspace being inconsistent with itself that causes the issue here, and
> > that's why SIGBUS feels kinda wrong as it tends to be used to report
> > external errors of some sort.
> 
> I recall that user space can also trigger SIGBUS when doing some
> mmap()+truncate() thingies, and probably a bunch more, that could be fixed
> up later.

Right, so that probably still falls into the "there is no page" bucket
rather than the "there is a page that is already accounted against the
userspace process, but it doesn't have the permission to access it"
bucket. But yes that's probably an infinite debate.

> I don't see a problem with SIGBUS here, but I do understand your view. I
> consider the exact signal a minor detail, though.
> 
> > 
> > > > appropriate. Allocating memory via guestmem and donating that to a
> > > > protected guest is a way for userspace to voluntarily relinquish access
> > > > permissions to the memory it allocated. So a userspace process violating
> > > > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
> > > > point that signal would be sent, the page would have been accounted
> > > > against that userspace process, so not sure the paging examples that
> > > > were discussed earlier are exactly comparable. To illustrate that
> > > > differently, given that pKVM and Gunyah use MMU-based protection, there
> > > > is nothing architecturally that prevents a guest from sharing a page
> > > > back with Linux as RO.
> > > 
> > > Sure, then allow page faults that allow for reads and give a signal on write
> > > faults.
> > > 
> > > In the scenario, it even makes more sense to not constantly require new
> > > mmap's from user space just to access a now-shared page.
> > > 
> > > > Note that we don't currently support this, so I
> > > > don't want to conflate this use case, but that hopefully makes it a
> > > > little more obvious that this is a "there is a page, but you don't
> > > > currently have the permission to access it" problem rather than "sorry
> > > > but we ran out of pages" problem.
> > > 
> > > We could use other signals, as long as the semantics are clear and it's
> > > documented. Maybe SIGSEGV would be warranted.
> > > 
> > > I consider that a minor detail, though.
> > > 
> > > Requiring mmap()/munmap() dances just to access a page that is now shared
> > > from user space sounds a bit suboptimal. But I don't know all the details of
> > > the user space implementation.
> > 
> > Agreed, if we could save having to mmap() each page that gets shared
> > back that would be a nice performance optimization.
> > 
> > > "mmap() the whole thing once and only access what you are supposed to
> > > access" sounds reasonable to me. If you don't play by the rules, you get a
> > > signal.
> > 
> > "... you get a signal, or maybe you don't". But yes I understand your
> > point, and as per the above there are real benefits to this approach so
> > why not.
> > 
> > What do we expect userspace to do when a page goes from shared back to
> > being guest-private, because e.g. the guest decides to unshare? Use
> > munmap() on that page? Or perhaps an madvise() call of some sort? Note
> > that this will be needed when starting a guest as well, as userspace
> > needs to copy the guest payload in the guestmem file prior to starting
> > the protected VM.
> 
> Let's assume we have the whole guest_memfd mapped exactly once in our
> process, a single VMA.
> 
> When setting up the VM, we'll write the payload and then fire up the VM.
> 
> That will (I assume) trigger some shared -> private conversion.
> 
> When we want to convert shared -> private in the kernel, we would first
> check if the page is currently mapped. If it is, we could try unmapping that
> page using an rmap walk.

I had not considered that. That would most certainly be slow, but a well
behaved userspace process shouldn't hit it, so that's probably not a
problem...

> Then, we'd make sure that there are really no other references and protect
> against concurrent mapping of the page. Now we can convert the page to
> private.

Right.

Alternatively, the shared->private conversion happens in the KVM vcpu
run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a
new exit reason saying "can't donate that page while it's shared" and
> have userspace use MADV_DONTNEED or munmap, or whatever on the back
of that. But I tend to prefer the rmap option if it's workable as that
avoids adding new KVM userspace ABI.
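
Roughly what I have in mind on the VMM side, purely as a sketch (the exit
reason and its payload below are made-up names, not existing uAPI):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* KVM_EXIT_GMEM_STILL_SHARED and run->gmem are hypothetical. */
static int run_vcpu(int vcpu_fd, struct kvm_run *run, uint8_t *guest_mem)
{
        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                        return -1;

                switch (run->exit_reason) {
                case KVM_EXIT_GMEM_STILL_SHARED:
                        /*
                         * KVM couldn't make the page private because we
                         * still have it mapped: zap our mapping and
                         * re-enter the guest.
                         */
                        madvise(guest_mem + run->gmem.offset,
                                run->gmem.size, MADV_DONTNEED);
                        break;
                default:
                        return 0;       /* hand other exits to the caller */
                }
        }
}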

> As we want to avoid the rmap walk, user space can be nice and simply
> MADV_DONTNEED the shared memory portions once it's done with it. For
> example, after writing the payload.

That makes sense to me.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-28 10:12         ` David Hildenbrand
@ 2024-02-28 14:01           ` Quentin Perret
  2024-02-29  9:51             ` David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-02-28 14:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf

On Wednesday 28 Feb 2024 at 11:12:18 (+0100), David Hildenbrand wrote:
> > > So you don't need any guest_memfd games to protect from that -- and one
> > > doesn't have to travel back in time to have memory that isn't
> > > swappable/migratable and only comes in one page size.
> > > 
> > > [I'm not up-to-date which obscure corner-cases CCA requirement the s390x
> > > implementation cannot fulfill -- like replacing pages in page tables and
> > > such; I suspect pKVM also cannot cover all these corner-cases]
> > 
> > Thanks for this. I'll do some more reading on how things work with s390x.
> > 
> > Right, and of course, one key difference of course is that pKVM
> > doesn't encrypt anything, and only relies on stage-2 protection to
> > protect the guest.
> 
> I don't remember what exactly s390x does, but I recall that it might only
> encrypt the memory content as it transitions a page from secure to
> non-secure.
> 
> Something like that could also be implemented using pKVM (unless I am
> missing something), but it might not be that trivial, of course :)

One of the complicated aspects of having the host migrate pages like so
is for the hypervisor to make sure the content of the page has not been
tampered with when the new page is re-mapped in the guest. That means
having additional tracking in the hypervisor of pages that have been
encrypted and returned to the host, indexed by IPA, with something like
a 'checksum' of some sort, which is non-trivial to do securely from a
cryptographic PoV.

A simpler and secure way to do this is (I think) is to do
hypervisor-assisted migration. IOW, pKVM exposes a new migrate_page(ipa,
old_pa, new_pa) hypercall which Linux can call to migrate a page.
pKVM unmaps the new page from the host stage-2, unmap the old page from
guest stage-2, does the copy, wipes the old page, maps the pages in the
respective page-tables, and off we go. That way the content is never
visible to Linux and that avoids the problems I highlighted above by
construction. The downside is that it doesn't work for swapping, but
that is quite hard to do in general...
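
To make the ordering explicit, the host-side sketch below is what I am
thinking of; the hypercall name and its exact arguments are hypothetical,
and all locking/refcounting is omitted:

/*
 * Sketch only: __pkvm_host_migrate_guest_page is a hypothetical
 * hypercall. It would be expected to unmap the new page from the host
 * stage-2, unmap the old page from the guest stage-2, copy old -> new,
 * wipe the old page and map both back in the respective page tables.
 */
static int pkvm_migrate_guest_page(u64 ipa, phys_addr_t old_pa)
{
        struct page *new_page = alloc_page(GFP_KERNEL_ACCOUNT);
        int ret;

        if (!new_page)
                return -ENOMEM;

        ret = kvm_call_hyp_nvhe(__pkvm_host_migrate_guest_page,
                                ipa, old_pa, page_to_phys(new_page));
        if (ret)
                __free_page(new_page);  /* migration refused */
        /* On success, old_pa has been wiped and is the host's to reuse. */
        return ret;
}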

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-02-28 13:34                   ` folio_mmapped Quentin Perret
@ 2024-02-28 18:43                     ` Elliot Berman
  2024-02-28 18:51                       ` Quentin Perret
  2024-02-29 10:04                     ` folio_mmapped David Hildenbrand
  1 sibling, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-02-28 18:43 UTC (permalink / raw)
  To: Quentin Perret
  Cc: David Hildenbrand, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng,
	jarkko, amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Wed, Feb 28, 2024 at 01:34:15PM +0000, Quentin Perret wrote:
> Alternatively, the shared->private conversion happens in the KVM vcpu
> run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a
> new exit reason saying "can't donate that page while it's shared" and
> have userspace use MADV_DONTNEED or munmap, or whatever on the back
> of that. But I tend to prefer the rmap option if it's workable as that
> avoids adding new KVM userspace ABI.
> 

You'll still probably need the new exit reason saying "can't donate that
page while it's shared" if the refcount tests fail. We can use David's
iouring example as a case where some other part of the kernel has a
reference to the page. I can't think of anything to do other than exiting to
userspace because we don't know how to drop that extra ref.

Thanks,
Elliot


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-02-28 18:43                     ` folio_mmapped Elliot Berman
@ 2024-02-28 18:51                       ` Quentin Perret
  0 siblings, 0 replies; 96+ messages in thread
From: Quentin Perret @ 2024-02-28 18:51 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng,
	jarkko, amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Wednesday 28 Feb 2024 at 10:43:27 (-0800), Elliot Berman wrote:
> On Wed, Feb 28, 2024 at 01:34:15PM +0000, Quentin Perret wrote:
> > Alternatively, the shared->private conversion happens in the KVM vcpu
> > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a
> > new exit reason saying "can't donate that page while it's shared" and
> > have userspace use MADV_DONTNEED or munmap, or whatever on the back
> > of that. But I tend to prefer the rmap option if it's workable as that
> > avoids adding new KVM userspace ABI.
> > 
> 
> You'll still probably need the new exit reason saying "can't donate that
> page while it's shared" if the refcount tests fail. We can use David's
> iouring example as a case where some other part of the kernel has a
> reference to the page. I can't think of anything to do other than exiting to
> userspace because we don't know how to drop that extra ref.

Ack, I realized that later on. I guess there may be cases where
userspace doesn't know how to drop that pin, but that's not the kernel's
fault and it can't do anything about it, so a userspace exit is our best
chance...

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
  2024-02-28 14:01           ` Quentin Perret
@ 2024-02-29  9:51             ` David Hildenbrand
  0 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-02-29  9:51 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf

On 28.02.24 15:01, Quentin Perret wrote:
> On Wednesday 28 Feb 2024 at 11:12:18 (+0100), David Hildenbrand wrote:
>>>> So you don't need any guest_memfd games to protect from that -- and one
>>>> doesn't have to travel back in time to have memory that isn't
>>>> swappable/migratable and only comes in one page size.
>>>>
>>>> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x
>>>> implementation cannot fulfill -- like replacing pages in page tables and
>>>> such; I suspect pKVM also cannot cover all these corner-cases]
>>>
>>> Thanks for this. I'll do some more reading on how things work with s390x.
>>>
>>> Right, and of course, one key difference of course is that pKVM
>>> doesn't encrypt anything, and only relies on stage-2 protection to
>>> protect the guest.
>>
>> I don't remember what exactly s390x does, but I recall that it might only
>> encrypt the memory content as it transitions a page from secure to
>> non-secure.
>>
>> Something like that could also be implemented using pKVM (unless I am
>> missing something), but it might not be that trivial, of course :)
> 
> One of the complicated aspects of having the host migrate pages like so
> is for the hypervisor to make sure the content of the page has not been
> tampered with when the new page is re-mapped in the guest. That means
> having additional tracking in the hypervisor of pages that have been
> encrypted and returned to the host, indexed by IPA, with something like
> a 'checksum' of some sort, which is non-trivial to do securely from a
> cryptographic PoV.
> 

I don't know what exactly s390x does, and how it does it -- and which 
CCA cases they can handle.

Details are scarce, for example:
 
https://www.ibm.com/docs/en/linux-on-systems?topic=virtualization-introducing-secure-execution-linux

I suspect they do it in some way you describe, and I fully agree with 
the "non-trivial" aspect :)

> A simpler and secure way to do this is (I think) is to do
> hypervisor-assisted migration. IOW, pKVM exposes a new migrate_page(ipa,
> old_pa, new_pa) hypercall which Linux can call to migrate a page.
> pKVM unmaps the new page from the host stage-2, unmap the old page from
> guest stage-2, does the copy, wipes the old page, maps the pages in the
> respective page-tables, and off we go. That way the content is never
> visible to Linux and that avoids the problems I highlighted above by
> construction. The downside is that it doesn't work for swapping, but
> that is quite hard to do in general...

The main "problem" with that is that you still end up with these 
inaccessible pages, that require the use of guest_memfd for your 
in-place conversion requirement in the first place.

I'm sure at some point we'll see migration support also for TDX and 
friends. For pKVM it might be even easier to support.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-28 13:34                   ` folio_mmapped Quentin Perret
  2024-02-28 18:43                     ` folio_mmapped Elliot Berman
@ 2024-02-29 10:04                     ` David Hildenbrand
  2024-02-29 19:01                       ` folio_mmapped Fuad Tabba
  2024-03-04 12:36                       ` folio_mmapped Quentin Perret
  1 sibling, 2 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-02-29 10:04 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On 28.02.24 14:34, Quentin Perret wrote:
> On Wednesday 28 Feb 2024 at 14:00:50 (+0100), David Hildenbrand wrote:
>>>>> To add a layer of paint to the shed, the usage of SIGBUS for
>>>>> something that is really a permission access problem doesn't feel
>>>>
>>>> SIGBUS stands for "BUS error (bad memory access)."
>>>>
>>>> Which makes sense, if you try accessing something that can no longer be
>>>> accessed. It's now inaccessible. Even if it is temporarily.
>>>>
>>>> Just like a page with an MCE error. Swapin errors. Etc. You cannot access
>>>> it.
>>>>
>>>> It might be a permission problem on the pKVM side, but it's not the
>>>> traditional "permission problem" as in mprotect() and friends. You cannot
>>>> resolve that permission problem yourself. It's a higher entity that turned
>>>> that memory inaccessible.
>>>
>>> Well that's where I'm not sure I agree. Userspace can, in fact, get
>>> back all of that memory by simply killing the protected VM. With the
>>
>> Right, but that would likely "wipe" the pages so they can be made accessible
>> again, right?
> 
> Yep, indeed.
> 
>> That's the whole point why we are handing the pages over to the "higher
>> entity", and allow someone else (the VM) to turn them into a state where we
>> can no longer read them.
>>
>> (if you follow the other discussion, it would actually be nice if we could
>> read them and would get encrypted content back, like s390x does; but that's
>> a different discussion and I assume pretty much out of scope :) )
> 
> Interesting, I'll read up. On a side note, I'm also considering adding a
> guest-facing hypervisor interface to let the guest decide to opt out of
> the hypervisor wipe as discussed above. That would be useful for a guest
> that is shutting itself down (which could be cooperating with the host
> Linux) and that knows it has erased its secrets. That is in general
> difficult to do for an OS, but a simple approach could be to poison all
> its memory (or maybe encrypt it?) before opting out of that wipe.
> 
> The hypervisor wipe is done in hypervisor context (obviously), which is
> non-preemptible, so avoiding wiping (or encrypting) loads of memory
> there is highly desirable. Also pKVM doesn't have a linear map of all
> memory for security reasons, so we need to map/unmap the pages one by
> one, which sucks as much as it sounds.
> 
> But yes, we're digressing, that is all for later :)

:) Sounds like an interesting optimization.

An alternative would be to remember in pKVM that a page needs a wipe 
before reaccess. Once re-accessed by anybody (hypervisor or new guest), 
it first has to be wiped by pKVM.

... but that also sounds complicated and similarly requires the linear 
map+unmap in pKVM page-by-page as they are reused. But at least a guest 
shutdown would be faster.

> 
>>> approach suggested here, the guestmem pages are entirely accessible to
>>> the host until they are attached to a running protected VM which
>>> triggers the protection. It is very much userspace saying "I promise not
>>> to touch these pages from now on" when it does that, in a way that I
>>> personally find very comparable to the mprotect case. It is not some
>>> other entity that pulls the carpet from under userspace's feet, it is
>>> userspace being inconsistent with itself that causes the issue here, and
>>> that's why SIGBUS feels kinda wrong as it tends to be used to report
>>> external errors of some sort.
>>
>> I recall that user space can also trigger SIGBUS when doing some
>> mmap()+truncate() thingies, and probably a bunch more, that could be fixed
>> up later.
> 
> Right, so that probably still falls into the "there is no page" bucket
> rather than the "there is a page that is already accounted against the
> userspace process, but it doesn't have the permission to access it"
> bucket. But yes that's probably an infinite debate.

Yes, we should rather focus on the bigger idea: have inaccessible memory 
that fails a pagefault instead of the mmap.

> 
>> I don't see a problem with SIGBUS here, but I do understand your view. I
>> consider the exact signal a minor detail, though.
>>
>>>
>>>>> appropriate. Allocating memory via guestmem and donating that to a
>>>>> protected guest is a way for userspace to voluntarily relinquish access
>>>>> permissions to the memory it allocated. So a userspace process violating
>>>>> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
>>>>> point that signal would be sent, the page would have been accounted
>>>>> against that userspace process, so not sure the paging examples that
>>>>> were discussed earlier are exactly comparable. To illustrate that
>>>>> differently, given that pKVM and Gunyah use MMU-based protection, there
>>>>> is nothing architecturally that prevents a guest from sharing a page
>>>>> back with Linux as RO.
>>>>
>>>> Sure, then allow page faults that allow for reads and give a signal on write
>>>> faults.
>>>>
>>>> In the scenario, it even makes more sense to not constantly require new
>>>> mmap's from user space just to access a now-shared page.
>>>>
>>>>> Note that we don't currently support this, so I
>>>>> don't want to conflate this use case, but that hopefully makes it a
>>>>> little more obvious that this is a "there is a page, but you don't
>>>>> currently have the permission to access it" problem rather than "sorry
>>>>> but we ran out of pages" problem.
>>>>
>>>> We could use other signals, as long as the semantics are clear and it's
>>>> documented. Maybe SIGSEGV would be warranted.
>>>>
>>>> I consider that a minor detail, though.
>>>>
>>>> Requiring mmap()/munmap() dances just to access a page that is now shared
>>>> from user space sounds a bit suboptimal. But I don't know all the details of
>>>> the user space implementation.
>>>
>>> Agreed, if we could save having to mmap() each page that gets shared
>>> back that would be a nice performance optimization.
>>>
>>>> "mmap() the whole thing once and only access what you are supposed to
>>>> access" sounds reasonable to me. If you don't play by the rules, you get a
>>>> signal.
>>>
>>> "... you get a signal, or maybe you don't". But yes I understand your
>>> point, and as per the above there are real benefits to this approach so
>>> why not.
>>>
>>> What do we expect userspace to do when a page goes from shared back to
>>> being guest-private, because e.g. the guest decides to unshare? Use
>>> munmap() on that page? Or perhaps an madvise() call of some sort? Note
>>> that this will be needed when starting a guest as well, as userspace
>>> needs to copy the guest payload in the guestmem file prior to starting
>>> the protected VM.
>>
>> Let's assume we have the whole guest_memfd mapped exactly once in our
>> process, a single VMA.
>>
>> When setting up the VM, we'll write the payload and then fire up the VM.
>>
>> That will (I assume) trigger some shared -> private conversion.
>>
>> When we want to convert shared -> private in the kernel, we would first
>> check if the page is currently mapped. If it is, we could try unmapping that
>> page using an rmap walk.
> 
> I had not considered that. That would most certainly be slow, but a well
> behaved userspace process shouldn't hit it so, that's probably not a
> problem...

If there really only is a single VMA that covers the page (or even mmaps 
the guest_memfd), it should not be too bad. For example, any 
fallocate(PUNCHHOLE) has to do the same, to unmap the page before 
discarding it from the pagecache.

But of course, no rmap walk is always better.
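
As a reminder, that unmap-then-discard is what a plain hole punch on a
guest_memfd already does today; a minimal sketch, assuming gmem_fd comes
from KVM_CREATE_GUEST_MEMFD:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/*
 * Punch a hole in the guest_memfd: the kernel has to unmap any mappings
 * of the range before dropping the pages from the page cache.
 */
static int gmem_discard_range(int gmem_fd, off_t offset, off_t len)
{
        return fallocate(gmem_fd,
                         FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}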

> 
>> Then, we'd make sure that there are really no other references and protect
>> against concurrent mapping of the page. Now we can convert the page to
>> private.
> 
> Right.
> 
> Alternatively, the shared->private conversion happens in the KVM vcpu
> run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a
> new exit reason saying "can't donate that page while it's shared" and
> have userspace use MADV_DONTNEED or munmap, or whatever on the back
> of that. But I tend to prefer the rmap option if it's workable as that
> avoids adding new KVM userspace ABI.

As discussed in the sub-thread, that might still be required.

One could think about completely forbidding GUP on these mmap'ed 
guest-memfds. But likely, there might be use cases in the future where 
you want to use GUP on shared memory inside a guest_memfd.

(the iouring example I gave might currently not work because 
FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and 
guest_memfd will likely not be detected as shmem; 8ac268436e6d contains 
some details)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-29 10:04                     ` folio_mmapped David Hildenbrand
@ 2024-02-29 19:01                       ` Fuad Tabba
  2024-03-01  0:40                         ` folio_mmapped Elliot Berman
  2024-03-01 11:06                         ` folio_mmapped David Hildenbrand
  2024-03-04 12:36                       ` folio_mmapped Quentin Perret
  1 sibling, 2 replies; 96+ messages in thread
From: Fuad Tabba @ 2024-02-29 19:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Quentin Perret, Matthew Wilcox, kvm, kvmarm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc,
	brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

Hi David,

...

> >>>> "mmap() the whole thing once and only access what you are supposed to
> >>>> access" sounds reasonable to me. If you don't play by the rules, you get a
> >>>> signal.
> >>>
> >>> "... you get a signal, or maybe you don't". But yes I understand your
> >>> point, and as per the above there are real benefits to this approach so
> >>> why not.
> >>>
> >>> What do we expect userspace to do when a page goes from shared back to
> >>> being guest-private, because e.g. the guest decides to unshare? Use
> >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note
> >>> that this will be needed when starting a guest as well, as userspace
> >>> needs to copy the guest payload in the guestmem file prior to starting
> >>> the protected VM.
> >>
> >> Let's assume we have the whole guest_memfd mapped exactly once in our
> >> process, a single VMA.
> >>
> >> When setting up the VM, we'll write the payload and then fire up the VM.
> >>
> >> That will (I assume) trigger some shared -> private conversion.
> >>
> >> When we want to convert shared -> private in the kernel, we would first
> >> check if the page is currently mapped. If it is, we could try unmapping that
> >> page using an rmap walk.
> >
> > I had not considered that. That would most certainly be slow, but a well
> > behaved userspace process shouldn't hit it so, that's probably not a
> > problem...
>
> If there really only is a single VMA that covers the page (or even mmaps
> the guest_memfd), it should not be too bad. For example, any
> fallocate(PUNCHHOLE) has to do the same, to unmap the page before
> discarding it from the pagecache.

I don't think that we can assume that only a single VMA covers a page.

> But of course, no rmap walk is always better.

We've been thinking some more about how to handle the case where the
host userspace has a mapping of a page that later becomes private.

One idea is to refuse to run the guest (i.e., exit vcpu_run() to back
to the host with a meaningful exit reason) until the host unmaps that
page, and check for the refcount to the page as you mentioned earlier.
This is essentially what the RFC I sent does (minus the bugs :) ) .

The other idea is to use the rmap walk as you suggested to zap that
page. If the host tries to access that page again, it would get a
SIGBUS on the fault. This has the advantage that, as you'd mentioned,
the host doesn't need to constantly mmap() and munmap() pages. It
could potentially be optimised further as suggested if we have a
cooperating VMM that would issue a MADV_DONTNEED or something like
that, but that's just an optimisation and we would still need to have
the option of the rmap walk. However, I was wondering how practical
this idea would be if more than a single VMA covers a page?

Also, there's the question of what to do if the page is gupped? In
this case I think the only thing we can do is refuse to run the guest
until the gup (and all references) are released, which also brings us
back to the way things (kind of) are...

Thanks,
/fuad

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-02-29 19:01                       ` folio_mmapped Fuad Tabba
@ 2024-03-01  0:40                         ` Elliot Berman
  2024-03-01 11:16                           ` folio_mmapped David Hildenbrand
  2024-03-01 11:06                         ` folio_mmapped David Hildenbrand
  1 sibling, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-03-01  0:40 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: David Hildenbrand, Quentin Perret, Matthew Wilcox, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Thu, Feb 29, 2024 at 07:01:51PM +0000, Fuad Tabba wrote:
> Hi David,
> 
> ...
> 
> > >>>> "mmap() the whole thing once and only access what you are supposed to
> > >>>> access" sounds reasonable to me. If you don't play by the rules, you get a
> > >>>> signal.
> > >>>
> > >>> "... you get a signal, or maybe you don't". But yes I understand your
> > >>> point, and as per the above there are real benefits to this approach so
> > >>> why not.
> > >>>
> > >>> What do we expect userspace to do when a page goes from shared back to
> > >>> being guest-private, because e.g. the guest decides to unshare? Use
> > >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note
> > >>> that this will be needed when starting a guest as well, as userspace
> > >>> needs to copy the guest payload in the guestmem file prior to starting
> > >>> the protected VM.
> > >>
> > >> Let's assume we have the whole guest_memfd mapped exactly once in our
> > >> process, a single VMA.
> > >>
> > >> When setting up the VM, we'll write the payload and then fire up the VM.
> > >>
> > >> That will (I assume) trigger some shared -> private conversion.
> > >>
> > >> When we want to convert shared -> private in the kernel, we would first
> > >> check if the page is currently mapped. If it is, we could try unmapping that
> > >> page using an rmap walk.
> > >
> > > I had not considered that. That would most certainly be slow, but a well
> > > behaved userspace process shouldn't hit it so, that's probably not a
> > > problem...
> >
> > If there really only is a single VMA that covers the page (or even mmaps
> > the guest_memfd), it should not be too bad. For example, any
> > fallocate(PUNCHHOLE) has to do the same, to unmap the page before
> > discarding it from the pagecache.
> 
> I don't think that we can assume that only a single VMA covers a page.
> 
> > But of course, no rmap walk is always better.
> 
> We've been thinking some more about how to handle the case where the
> host userspace has a mapping of a page that later becomes private.
> 
> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back
> to the host with a meaningful exit reason) until the host unmaps that
> page, and check for the refcount to the page as you mentioned earlier.
> This is essentially what the RFC I sent does (minus the bugs :) ) .
> 
> The other idea is to use the rmap walk as you suggested to zap that
> page. If the host tries to access that page again, it would get a
> SIGBUS on the fault. This has the advantage that, as you'd mentioned,
> the host doesn't need to constantly mmap() and munmap() pages. It
> could potentially be optimised further as suggested if we have a
> cooperating VMM that would issue a MADV_DONTNEED or something like
> that, but that's just an optimisation and we would still need to have
> the option of the rmap walk. However, I was wondering how practical
> this idea would be if more than a single VMA covers a page?
> 

Agree with all your points here. I changed Gunyah's implementation to do
the unmap instead of erroring out. I didn't observe a significant
performance difference. However, doing unmap might be a little faster
because we can check folio_mapped() before doing the rmap walk. When
erroring out at mmap() level, we always have to do the walk.
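
Roughly, the check boils down to something like this sketch (the folio
lock / invalidation locking and the exact "expected" refcount are glossed
over here; folio_mapped(), unmap_mapping_pages() and folio_ref_count()
are the existing helpers):

static int gmem_try_make_private(struct address_space *mapping,
                                 struct folio *folio)
{
        /* Cheap test first, only fall back to the rmap walk if needed. */
        if (folio_mapped(folio))
                unmap_mapping_pages(mapping, folio->index,
                                    folio_nr_pages(folio), false);

        /*
         * Anything beyond the page cache's references (and our own)
         * means someone still holds the folio (GUP pin, transient
         * speculative reference, ...) and we can't donate it yet.
         */
        if (folio_mapped(folio) ||
            folio_ref_count(folio) > folio_nr_pages(folio) + 1)
                return -EBUSY;  /* report back to userspace instead */

        return 0;               /* safe to convert to private */
}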

> Also, there's the question of what to do if the page is gupped? In
> this case I think the only thing we can do is refuse to run the guest
> until the gup (and all references) are released, which also brings us
> back to the way things (kind of) are...
> 

If there are gup users who don't do FOLL_PIN, I think we either need to
fix them or live with the possibility here? We don't have a reliable
way to tell from the refcount that a folio is safe to unmap: it might
be that another vCPU is trying to get the same page, has incremented
the refcount, and is waiting for the folio_lock. This problem exists
whether we block the
mmap() or do SIGBUS.

Thanks,
Elliot

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-29 19:01                       ` folio_mmapped Fuad Tabba
  2024-03-01  0:40                         ` folio_mmapped Elliot Berman
@ 2024-03-01 11:06                         ` David Hildenbrand
  1 sibling, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-01 11:06 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Quentin Perret, Matthew Wilcox, kvm, kvmarm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc,
	brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On 29.02.24 20:01, Fuad Tabba wrote:
> Hi David,
> 
> ...
> 

Hi!

>>>>>> "mmap() the whole thing once and only access what you are supposed to
>>>>>> access" sounds reasonable to me. If you don't play by the rules, you get a
>>>>>> signal.
>>>>>
>>>>> "... you get a signal, or maybe you don't". But yes I understand your
>>>>> point, and as per the above there are real benefits to this approach so
>>>>> why not.
>>>>>
>>>>> What do we expect userspace to do when a page goes from shared back to
>>>>> being guest-private, because e.g. the guest decides to unshare? Use
>>>>> munmap() on that page? Or perhaps an madvise() call of some sort? Note
>>>>> that this will be needed when starting a guest as well, as userspace
>>>>> needs to copy the guest payload in the guestmem file prior to starting
>>>>> the protected VM.
>>>>
>>>> Let's assume we have the whole guest_memfd mapped exactly once in our
>>>> process, a single VMA.
>>>>
>>>> When setting up the VM, we'll write the payload and then fire up the VM.
>>>>
>>>> That will (I assume) trigger some shared -> private conversion.
>>>>
>>>> When we want to convert shared -> private in the kernel, we would first
>>>> check if the page is currently mapped. If it is, we could try unmapping that
>>>> page using an rmap walk.
>>>
>>> I had not considered that. That would most certainly be slow, but a well
>>> behaved userspace process shouldn't hit it so, that's probably not a
>>> problem...
>>
>> If there really only is a single VMA that covers the page (or even mmaps
>> the guest_memfd), it should not be too bad. For example, any
>> fallocate(PUNCHHOLE) has to do the same, to unmap the page before
>> discarding it from the pagecache.
> 
> I don't think that we can assume that only a single VMA covers a page.
> 
>> But of course, no rmap walk is always better.
> 
> We've been thinking some more about how to handle the case where the
> host userspace has a mapping of a page that later becomes private.
> 
> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back
> to the host with a meaningful exit reason) until the host unmaps that
> page, and check for the refcount to the page as you mentioned earlier.
> This is essentially what the RFC I sent does (minus the bugs :) ) .

:)

> 
> The other idea is to use the rmap walk as you suggested to zap that
> page. If the host tries to access that page again, it would get a
> SIGBUS on the fault. This has the advantage that, as you'd mentioned,
> the host doesn't need to constantly mmap() and munmap() pages. It
> could potentially be optimised further as suggested if we have a
> cooperating VMM that would issue a MADV_DONTNEED or something like
> that, but that's just an optimisation and we would still need to have
> the option of the rmap walk. However, I was wondering how practical
> this idea would be if more than a single VMA covers a page?

A few VMAs won't make a difference I assume. Any idea how many "sharers" 
you'd expect in a sane configuration?

> 
> Also, there's the question of what to do if the page is gupped? In
> this case I think the only thing we can do is refuse to run the guest
> until the gup (and all references) are released, which also brings us
> back to the way things (kind of) are...

Indeed. There is no way you could possibly make progress.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-01  0:40                         ` folio_mmapped Elliot Berman
@ 2024-03-01 11:16                           ` David Hildenbrand
  2024-03-04 12:53                             ` folio_mmapped Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-01 11:16 UTC (permalink / raw)
  To: Fuad Tabba, Quentin Perret, Matthew Wilcox, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

>> I don't think that we can assume that only a single VMA covers a page.
>>
>>> But of course, no rmap walk is always better.
>>
>> We've been thinking some more about how to handle the case where the
>> host userspace has a mapping of a page that later becomes private.
>>
>> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back
>> to the host with a meaningful exit reason) until the host unmaps that
>> page, and check for the refcount to the page as you mentioned earlier.
>> This is essentially what the RFC I sent does (minus the bugs :) ) .
>>
>> The other idea is to use the rmap walk as you suggested to zap that
>> page. If the host tries to access that page again, it would get a
>> SIGBUS on the fault. This has the advantage that, as you'd mentioned,
>> the host doesn't need to constantly mmap() and munmap() pages. It
>> could potentially be optimised further as suggested if we have a
>> cooperating VMM that would issue a MADV_DONTNEED or something like
>> that, but that's just an optimisation and we would still need to have
>> the option of the rmap walk. However, I was wondering how practical
>> this idea would be if more than a single VMA covers a page?
>>
> 
> Agree with all your points here. I changed Gunyah's implementation to do
> the unmap instead of erroring out. I didn't observe a significant
> performance difference. However, doing unmap might be a little faster
> because we can check folio_mapped() before doing the rmap walk. When
> erroring out at mmap() level, we always have to do the walk.

Right. On the mmap() level you won't really have to walk page tables, as 
the munmap() already zapped the page and removed the "problematic" VMA.

Likely, you really want to avoid repeatedly calling mmap()+munmap() just 
to access shared memory; but that's just my best guess about your user 
space app :)

> 
>> Also, there's the question of what to do if the page is gupped? In
>> this case I think the only thing we can do is refuse to run the guest
>> until the gup (and all references) are released, which also brings us
>> back to the way things (kind of) are...
>>
> 
> If there are gup users who don't do FOLL_PIN, I think we either need to
> fix them or live with the possibility here? We don't have a reliable
> way to tell from the refcount that a folio is safe to unmap: it might
> be that another vCPU is trying to get the same page, has incremented
> the refcount, and is waiting for the folio_lock.

Likely there could be a way to detect that when only the vCPUs are your 
concern? But yes, it's nasty.

(has to be handled in either case :()

Disallowing any FOLL_GET|FOLL_PIN could work. Not sure how some 
core-kernel FOLL_GET users would react to that, though.

See vma_is_secretmem() and folio_is_secretmem() in mm/gup.c, where we 
disallow any FOLL_GET|FOLL_PIN of secretmem pages.

We'd need a way to teach core-mm similarly about guest_memfd, which 
might end up rather tricky, but not impossible :)
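
As a rough idea of what "teaching core-mm" could look like, modelled on
the secretmem helper; folio_is_guest_memfd() is a made-up name, and
kvm_gmem_aops would somehow have to be made visible to core-mm:

/*
 * Hypothetical helper, mirroring folio_is_secretmem(); gup.c would call
 * it next to the existing secretmem checks and refuse FOLL_GET|FOLL_PIN
 * for such folios. The mapping is read without holding a reference on
 * it, with the same caveats as the secretmem check.
 */
static inline bool folio_is_guest_memfd(struct folio *folio)
{
        struct address_space *mapping;

        if (folio_test_anon(folio) || folio_test_hugetlb(folio))
                return false;

        mapping = (struct address_space *)
                ((unsigned long)folio->mapping & ~PAGE_MAPPING_FLAGS);
        if (!mapping || mapping != folio->mapping)
                return false;

        /* kvm_gmem_aops is currently private to virt/kvm/guest_memfd.c */
        return mapping->a_ops == &kvm_gmem_aops;
}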

> This problem exists whether we block the
> mmap() or do SIGBUS.

There is work on doing more conversion to FOLL_PIN, but some cases are 
harder to convert. Most of O_DIRECT should be using it nowadays, but 
some other known use cases don't.

The simplest and readily-available example is still vmsplice(). I don't 
think it was fixed yet to use FOLL_PIN.

Use vmsplice() to pin the page in the pipe (read-only). Unmap the VMA. 
You can read the page any time later by reading from the pipe.
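
For completeness, a small self-contained demo of that (error handling is
left out and the 4096-byte page size is hard-coded):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct iovec iov = { .iov_base = p, .iov_len = 4096 };
        int pipefd[2];
        char buf[32];

        strcpy(p, "still here");
        pipe(pipefd);

        /* References the page (read-only) from the pipe... */
        vmsplice(pipefd[1], &iov, 1, 0);
        /* ...so we can drop the VMA... */
        munmap(p, 4096);
        /* ...and still read the content back later. */
        read(pipefd[0], buf, sizeof(buf));
        printf("%s\n", buf);
        return 0;
}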

So I wouldn't bet on all relevant cases being gone in the near future.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-02-29 10:04                     ` folio_mmapped David Hildenbrand
  2024-02-29 19:01                       ` folio_mmapped Fuad Tabba
@ 2024-03-04 12:36                       ` Quentin Perret
  2024-03-04 19:04                         ` folio_mmapped Sean Christopherson
  1 sibling, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-03-04 12:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Thursday 29 Feb 2024 at 11:04:09 (+0100), David Hildenbrand wrote:
> An alternative would be to remember in pKVM that a page needs a wipe before
> reaccess. Once re-accessed by anybody (hypervisor or new guest), it first
> has to be wiped by pKVM.
> 
> ... but that also sounds complicated and similar requires the linear
> map+unmap in pKVM page-by-page as they are reused. But at least a guest
> shutdown would be faster.

Yep, FWIW we did try that, but ended up having issues with Linux trying
to DMA to these pages before 'touching' them from the CPU side. pKVM can
keep the pages unmapped from the CPU and IOMMU stage-2 page-tables, and
we can easily handle the CPU faults lazily, but not faults from other
masters; our hardware generally doesn't support that.

<snip>
> As discussed in the sub-thread, that might still be required.
> 
> One could think about completely forbidding GUP on these mmap'ed
> guest-memfds. But likely, there might be use cases in the future where you
> want to use GUP on shared memory inside a guest_memfd.
> 
> (the iouring example I gave might currently not work because
> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> details)

Perhaps it would be wise to start with GUP being forbidden if the
current users do not need it (not sure if that is the case in Android,
I'll check) ? We can always relax this constraint later when/if the
use-cases arise, which is obviously much harder to do the other way
around.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-01 11:16                           ` folio_mmapped David Hildenbrand
@ 2024-03-04 12:53                             ` Quentin Perret
  2024-03-04 20:22                               ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-03-04 12:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Fuad Tabba, Matthew Wilcox, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, keirf, linux-mm

On Friday 01 Mar 2024 at 12:16:54 (+0100), David Hildenbrand wrote:
> > > I don't think that we can assume that only a single VMA covers a page.
> > > 
> > > > But of course, no rmap walk is always better.
> > > 
> > > We've been thinking some more about how to handle the case where the
> > > host userspace has a mapping of a page that later becomes private.
> > > 
> > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back
> > > to the host with a meaningful exit reason) until the host unmaps that
> > > page, and check for the refcount to the page as you mentioned earlier.
> > > This is essentially what the RFC I sent does (minus the bugs :) ) .
> > > 
> > > The other idea is to use the rmap walk as you suggested to zap that
> > > page. If the host tries to access that page again, it would get a
> > > SIGBUS on the fault. This has the advantage that, as you'd mentioned,
> > > the host doesn't need to constantly mmap() and munmap() pages. It
> > > could potentially be optimised further as suggested if we have a
> > > cooperating VMM that would issue a MADV_DONTNEED or something like
> > > that, but that's just an optimisation and we would still need to have
> > > the option of the rmap walk. However, I was wondering how practical
> > > this idea would be if more than a single VMA covers a page?
> > > 
> > 
> > Agree with all your points here. I changed Gunyah's implementation to do
> > the unmap instead of erroring out. I didn't observe a significant
> > performance difference. However, doing unmap might be a little faster
> > because we can check folio_mapped() before doing the rmap walk. When
> > erroring out at mmap() level, we always have to do the walk.
> 
> Right. On the mmap() level you won't really have to walk page tables, as the
> munmap() already zapped the page and removed the "problematic" VMA.
> 
> Likely, you really want to avoid repeatedly calling mmap()+munmap() just to
> access shared memory; but that's just my best guess about your user space
> app :)

Ack, and expecting userspace to munmap the pages whenever we hit a valid
mapping in userspace page-tables in the KVM fault path makes for a
somewhat unusual interface IMO. Userspace can munmap, mmap again, and if
it doesn't touch the pages, it can proceed to run the guest just fine;
is that the expectation? If so, it feels like we're 'leaking' internal
kernel state somehow. The kernel is normally well within its rights to
zap userspace mappings if it wants to e.g. swap. (Obviously mlock is a
weird case, but even in that case, IIRC the kernel still has a certain
amount of flexibility and can use compaction and friends). Similarly,
it should be well within its right to proactively create them. How
would this scheme work if, 10 years from now, something like
Speculative Page Faults makes it into the kernel in a different form?

Not requiring userspace to unmap makes the userspace interface a lot
simpler I think -- once a protected guest starts, you better not touch
its memory if it's not been shared back or you'll get slapped on the
wrist. Whether or not those pages have been accessed beforehand for
example is irrelevant.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-04 12:36                       ` folio_mmapped Quentin Perret
@ 2024-03-04 19:04                         ` Sean Christopherson
  2024-03-04 20:17                           ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Sean Christopherson @ 2024-03-04 19:04 UTC (permalink / raw)
  To: Quentin Perret
  Cc: David Hildenbrand, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Mon, Mar 04, 2024, Quentin Perret wrote:
> > As discussed in the sub-thread, that might still be required.
> > 
> > One could think about completely forbidding GUP on these mmap'ed
> > guest-memfds. But likely, there might be use cases in the future where you
> > want to use GUP on shared memory inside a guest_memfd.
> > 
> > (the iouring example I gave might currently not work because
> > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> > details)
> 
> Perhaps it would be wise to start with GUP being forbidden if the
> current users do not need it (not sure if that is the case in Android,
> I'll check) ? We can always relax this constraint later when/if the
> use-cases arise, which is obviously much harder to do the other way
> around.

+1000.  At least on the KVM side, I would like to be as conservative as possible
when it comes to letting anything other than the guest access guest_memfd.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-04 19:04                         ` folio_mmapped Sean Christopherson
@ 2024-03-04 20:17                           ` David Hildenbrand
  2024-03-04 21:43                             ` folio_mmapped Elliot Berman
  2024-03-18 17:06                             ` folio_mmapped Vishal Annapurve
  0 siblings, 2 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-04 20:17 UTC (permalink / raw)
  To: Sean Christopherson, Quentin Perret
  Cc: Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, keirf, linux-mm

On 04.03.24 20:04, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Quentin Perret wrote:
>>> As discussed in the sub-thread, that might still be required.
>>>
>>> One could think about completely forbidding GUP on these mmap'ed
>>> guest-memfds. But likely, there might be use cases in the future where you
>>> want to use GUP on shared memory inside a guest_memfd.
>>>
>>> (the iouring example I gave might currently not work because
>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
>>> details)
>>
>> Perhaps it would be wise to start with GUP being forbidden if the
>> current users do not need it (not sure if that is the case in Android,
>> I'll check) ? We can always relax this constraint later when/if the
>> use-cases arise, which is obviously much harder to do the other way
>> around.
> 
> +1000.  At least on the KVM side, I would like to be as conservative as possible
> when it comes to letting anything other than the guest access guest_memfd.

So we'll have to do it similar to any occurrences of "secretmem" in 
gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code 
similar to e.g., folio_is_secretmem().

IIRC, we might not be able to de-reference the actual mapping because it 
could get freed concurrently ...

That will then prohibit any kind of GUP access to these pages, including 
reading/writing for ptrace/debugging purposes, for core dumping purposes 
etc. But at least, you know that nobody was able to obtain page 
references using GUP that might be used for reading/writing later.
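
For illustration, the secretmem handling in the GUP-fast PTE walk looks
roughly like the snippet below (paraphrased from mm/gup.c around 6.8); a
guest_memfd check would presumably slot into the same place. Note that
folio_is_guest_memfd() is a made-up name here, not existing code, and it
would have to follow the same lifetime rules as folio_is_secretmem():

	/*
	 * Sketch only: bail out of GUP-fast for folios that must not be
	 * pinned; folio_is_guest_memfd() is hypothetical.
	 */
	if (unlikely(folio_is_secretmem(folio) ||
		     folio_is_guest_memfd(folio))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}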

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-04 12:53                             ` folio_mmapped Quentin Perret
@ 2024-03-04 20:22                               ` David Hildenbrand
  0 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-04 20:22 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Fuad Tabba, Matthew Wilcox, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, keirf, linux-mm

On 04.03.24 13:53, Quentin Perret wrote:
> On Friday 01 Mar 2024 at 12:16:54 (+0100), David Hildenbrand wrote:
>>>> I don't think that we can assume that only a single VMA covers a page.
>>>>
>>>>> But of course, no rmap walk is always better.
>>>>
>>>> We've been thinking some more about how to handle the case where the
>>>> host userspace has a mapping of a page that later becomes private.
>>>>
>>>> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back
>>>> to the host with a meaningful exit reason) until the host unmaps that
>>>> page, and check for the refcount to the page as you mentioned earlier.
>>>> This is essentially what the RFC I sent does (minus the bugs :) ) .
>>>>
>>>> The other idea is to use the rmap walk as you suggested to zap that
>>>> page. If the host tries to access that page again, it would get a
>>>> SIGBUS on the fault. This has the advantage that, as you'd mentioned,
>>>> the host doesn't need to constantly mmap() and munmap() pages. It
>>>> could potentially be optimised further as suggested if we have a
>>>> cooperating VMM that would issue a MADV_DONTNEED or something like
>>>> that, but that's just an optimisation and we would still need to have
>>>> the option of the rmap walk. However, I was wondering how practical
>>>> this idea would be if more than a single VMA covers a page?
>>>>
>>>
>>> Agree with all your points here. I changed Gunyah's implementation to do
>>> the unmap instead of erroring out. I didn't observe a significant
>>> performance difference. However, doing unmap might be a little faster
>>> because we can check folio_mapped() before doing the rmap walk. When
>>> erroring out at mmap() level, we always have to do the walk.
>>
>> Right. On the mmap() level you won't really have to walk page tables, as the
>> munmap() already zapped the page and removed the "problematic" VMA.
>>
>> Likely, you really want to avoid repeatedly calling mmap()+munmap() just to
>> access shared memory; but that's just my best guess about your user space
>> app :)
> 
> Ack, and expecting userspace to munmap the pages whenever we hit a valid
> mapping in userspace page-tables in the KVM fault path makes for a
> somewhat unusual interface IMO. Userspace can munmap, mmap again, and if
> it doesn't touch the pages, it can proceed to run the guest just fine,
> is that the expectation? If so, it feels like we're 'leaking' internal

It would be weird, and I would not suggest that. It's either

(1) you can leave it mmap'ed, but any access to private memory will 
SIGBUS. The kernel will try zapping pages inside a VMA itself.

(2) you cannot leave it mmap'ed. In order to convert shared -> private, 
you have to munmap. mmap will fail if it would cover a currently-private 
page.

So for (1) you could mmap once in user space and be done with it. For 
(2) you would have to mmap+munmap when accessing shared memory.
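
As a rough userspace-visible sketch of the difference (this assumes the
mmap() support proposed in this series; none of it is existing ABI, and
the behaviour of both options is just the scheme described above):

#include <sys/mman.h>

/*
 * Sketch only: gmem_fd is a guest_memfd mmap'ed as proposed in this
 * series, and the first page of the file is currently private.
 */
static void example(int gmem_fd, long page_size)
{
	char *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_SHARED, gmem_fd, 0);

	/*
	 * Option (1): the mmap() above succeeds; reading *p while the
	 * page is private raises SIGBUS, and the kernel zaps the PTE
	 * itself on a shared -> private conversion.
	 *
	 * Option (2): the mmap() above fails outright while the range
	 * covers a private page; userspace has to munmap() before a
	 * shared -> private conversion and mmap() again afterwards.
	 */
	(void)p;
}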

> kernel state somehow. The kernel is normally well within its rights to
> zap userspace mappings if it wants to e.g. swap. (Obviously mlock is a
> weird case, but even in that case, IIRC the kernel still has a certain
> amount of flexibility and can use compaction and friends). Similarly,
> it should be well within its right to proactively create them. How
> would this scheme work if, 10 years from now, something like
> Speculative Page Faults makes it into the kernel in a different form?

At least with (2), speculative page faults would never generate a 
SIGBUS. Just like HW must not generate a fault on speculative access.

> 
> Not requiring userspace to unmap makes the userspace interface a lot
> simpler I think -- once a protected guest starts, you better not touch
> its memory if it's not been shared back or you'll get slapped on the
> wrist. Whether or not those pages have been accessed beforehand for
> example is irrelevant.

Yes. So the theory :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-03-04 20:17                           ` folio_mmapped David Hildenbrand
@ 2024-03-04 21:43                             ` Elliot Berman
  2024-03-04 21:58                               ` folio_mmapped David Hildenbrand
  2024-03-18 17:06                             ` folio_mmapped Vishal Annapurve
  1 sibling, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-03-04 21:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, keirf, linux-mm

On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote:
> On 04.03.24 20:04, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Quentin Perret wrote:
> > > > As discussed in the sub-thread, that might still be required.
> > > > 
> > > > One could think about completely forbidding GUP on these mmap'ed
> > > > guest-memfds. But likely, there might be use cases in the future where you
> > > > want to use GUP on shared memory inside a guest_memfd.
> > > > 
> > > > (the iouring example I gave might currently not work because
> > > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> > > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> > > > details)
> > > 
> > > Perhaps it would be wise to start with GUP being forbidden if the
> > > current users do not need it (not sure if that is the case in Android,
> > > I'll check) ? We can always relax this constraint later when/if the
> > > use-cases arise, which is obviously much harder to do the other way
> > > around.
> > 
> > +1000.  At least on the KVM side, I would like to be as conservative as possible
> > when it comes to letting anything other than the guest access guest_memfd.
> 
> So we'll have to do it similar to any occurrences of "secretmem" in gup.c.
> We'll have to see how to marry KVM guest_memfd with core-mm code similar to
> e.g., folio_is_secretmem().
> 
> IIRC, we might not be able to de-reference the actual mapping because it
> could get freed concurrently ...
> 
> That will then prohibit any kind of GUP access to these pages, including
> reading/writing for ptrace/debugging purposes, for core dumping purposes
> etc. But at least, you know that nobody was able to obtain page references
> using GUP that might be used for reading/writing later.

Do you have any concerns about adding an AS_NOGUP flag to enum mapping_flags
and replacing folio_is_secretmem() with a test of that bit instead of
comparing the a_ops? I think it scales better.

Thanks,
Elliot


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-04 21:43                             ` folio_mmapped Elliot Berman
@ 2024-03-04 21:58                               ` David Hildenbrand
  2024-03-19  9:47                                 ` folio_mmapped Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-04 21:58 UTC (permalink / raw)
  To: Sean Christopherson, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, keirf, linux-mm

On 04.03.24 22:43, Elliot Berman wrote:
> On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote:
>> On 04.03.24 20:04, Sean Christopherson wrote:
>>> On Mon, Mar 04, 2024, Quentin Perret wrote:
>>>>> As discussed in the sub-thread, that might still be required.
>>>>>
>>>>> One could think about completely forbidding GUP on these mmap'ed
>>>>> guest-memfds. But likely, there might be use cases in the future where you
>>>>> want to use GUP on shared memory inside a guest_memfd.
>>>>>
>>>>> (the iouring example I gave might currently not work because
>>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
>>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
>>>>> details)
>>>>
>>>> Perhaps it would be wise to start with GUP being forbidden if the
>>>> current users do not need it (not sure if that is the case in Android,
>>>> I'll check) ? We can always relax this constraint later when/if the
>>>> use-cases arise, which is obviously much harder to do the other way
>>>> around.
>>>
>>> +1000.  At least on the KVM side, I would like to be as conservative as possible
>>> when it comes to letting anything other than the guest access guest_memfd.
>>
>> So we'll have to do it similar to any occurrences of "secretmem" in gup.c.
>> We'll have to see how to marry KVM guest_memfd with core-mm code similar to
>> e.g., folio_is_secretmem().
>>
>> IIRC, we might not be able to de-reference the actual mapping because it
>> could get freed concurrently ...
>>
>> That will then prohibit any kind of GUP access to these pages, including
>> reading/writing for ptrace/debugging purposes, for core dumping purposes
>> etc. But at least, you know that nobody was able to obtain page references
>> using GUP that might be used for reading/writing later.
> 
> Do you have any concerns about adding an AS_NOGUP flag to enum mapping_flags
> and replacing folio_is_secretmem() with a test of that bit instead of
> comparing the a_ops? I think it scales better.

The only concern I have are races, but let's look into the details:

In GUP-fast, we can essentially race with unmap of folios, munmap() of 
VMAs etc.

We had a similar discussion recently about possible races. It's 
documented in folio_fast_pin_allowed() regarding disabled IRQs and RCU 
grace periods.

"inodes and thus their mappings are freed under RCU, which means the 
mapping cannot be freed beneath us and thus we can safely dereference it."

So if we follow the same rules as folio_fast_pin_allowed(), we can 
de-reference folio->mapping, for example comparing mapping->a_ops.

[folio_is_secretmem should better follow the same approach]

So likely that should just work!
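
A minimal sketch of what that could look like, assuming the AS_NOGUP bit
suggested above (neither the flag nor the helper exist today; this just
mirrors the folio_is_secretmem()/folio_fast_pin_allowed() pattern):

/*
 * Hypothetical helper for GUP-fast, following the same lifetime rules
 * as folio_fast_pin_allowed(): IRQs are disabled and mappings are
 * freed via RCU, so the mapping pointer can be dereferenced safely.
 */
static inline bool folio_is_nogup(struct folio *folio)
{
	struct address_space *mapping = READ_ONCE(folio->mapping);

	/* Anon and movable folios encode flags in the low bits. */
	if (!mapping || ((unsigned long)mapping & PAGE_MAPPING_FLAGS))
		return false;

	return test_bit(AS_NOGUP, &mapping->flags);	/* AS_NOGUP: proposed */
}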

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-04 20:17                           ` folio_mmapped David Hildenbrand
  2024-03-04 21:43                             ` folio_mmapped Elliot Berman
@ 2024-03-18 17:06                             ` Vishal Annapurve
  2024-03-18 22:02                               ` folio_mmapped David Hildenbrand
  1 sibling, 1 reply; 96+ messages in thread
From: Vishal Annapurve @ 2024-03-18 17:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	keirf, linux-mm

On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 20:04, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Quentin Perret wrote:
> >>> As discussed in the sub-thread, that might still be required.
> >>>
> >>> One could think about completely forbidding GUP on these mmap'ed
> >>> guest-memfds. But likely, there might be use cases in the future where you
> >>> want to use GUP on shared memory inside a guest_memfd.
> >>>
> >>> (the iouring example I gave might currently not work because
> >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> >>> details)
> >>
> >> Perhaps it would be wise to start with GUP being forbidden if the
> >> current users do not need it (not sure if that is the case in Android,
> >> I'll check) ? We can always relax this constraint later when/if the
> >> use-cases arise, which is obviously much harder to do the other way
> >> around.
> >
> > +1000.  At least on the KVM side, I would like to be as conservative as possible
> > when it comes to letting anything other than the guest access guest_memfd.
>
> So we'll have to do it similar to any occurrences of "secretmem" in
> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
> similar to e.g., folio_is_secretmem().
>
> IIRC, we might not be able to de-reference the actual mapping because it
> could get freed concurrently ...
>
> That will then prohibit any kind of GUP access to these pages, including
> reading/writing for ptrace/debugging purposes, for core dumping purposes
> etc. But at least, you know that nobody was able to obtain page
> references using GUP that might be used for reading/writing later.
>

There has been little discussion about supporting 1G pages with
guest_memfd for TDX/SNP or pKVM. I would like to restart this
discussion [1]. 1G pages should be a very important usecase for guest
memfd, especially considering large VM sizes supporting confidential
GPU/TPU workloads.

Using separate backing stores for private and shared memory ranges is
not going to work effectively when using 1G pages. Consider the
following scenario of memory conversion when using 1G pages to back
private memory:
* Guest requests conversion of 4KB range from private to shared, host
in response ideally does following steps:
    a) Updates the guest memory attributes
    b) Unbacks the corresponding private memory
    c) Allocates corresponding shared memory or let it be faulted in
when guest accesses it

Step b above can't be skipped here, otherwise we would have two
physical pages (1 backing private memory, another backing the shared
memory) for the same GPA range causing "double allocation".
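
For concreteness, the host-side handling of such a private -> shared
request would look roughly like the sketch below under the current
upstream guest_memfd model (KVM_SET_MEMORY_ATTRIBUTES and guest_memfd
PUNCH_HOLE exist today; error handling omitted, offsets assumed to be
page-aligned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>

static void convert_to_shared(int vm_fd, int gmem_fd,
			      __u64 gpa, __u64 gmem_offset, __u64 len)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = len,
		.attributes = 0,	/* a) clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};

	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);

	/* b) unback the private copy so the range isn't doubly allocated */
	fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  gmem_offset, len);

	/* c) the shared backing is faulted in lazily on guest access */
}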

With 1G pages, it would be difficult to punch KBs or even MBs sized
hole since to support that:
1G page would need to be split (which hugetlbfs doesn't support today
because of right reasons), causing -
        - loss of vmemmap optimization [3]
        - losing ability to reconstitute the huge page again,
especially as private pages in CVMs are not relocatable today,
increasing overall fragmentation over time.
              - unless a smarter algorithm is devised for memory
reclaim to reconstitute large pages for unmovable memory.

With the above limitations in place, best thing could be to allow:
 - single backing store for both shared and private memory ranges
 - host userspace to mmap the guest memfd (as this series is trying to do)
 - allow userspace to fault in memfd file ranges that correspond to
shared GPA ranges
     - pagetable mappings will need to be restricted to shared memory
ranges causing higher granularity mappings (somewhat similar to what
HGM series from James [2] was trying to do) than 1G.
 - Allow IOMMU also to map those pages (pfns would be requested using
get_user_pages* APIs) to allow devices to access shared memory. IOMMU
management code would have to be enlightened or somehow restricted to
map only shared regions of guest memfd.
 - Upon conversion from shared to private, host will have to ensure
that there are no mappings/references present for the memory ranges
being converted to private.

If the above usecase sounds reasonable, GUP access to guest memfd
pages should be allowed.

[1] https://lore.kernel.org/lkml/CAGtprH_H1afUJ2cUnznWqYLTZVuEcOogRwXF6uBAeHbLMQsrsQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20230218002819.1486479-2-jthoughton@google.com/
[3] https://docs.kernel.org/mm/vmemmap_dedup.html



> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-18 17:06                             ` folio_mmapped Vishal Annapurve
@ 2024-03-18 22:02                               ` David Hildenbrand
  2024-03-18 23:07                                 ` folio_mmapped Vishal Annapurve
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-18 22:02 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Sean Christopherson, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	keirf, linux-mm

On 18.03.24 18:06, Vishal Annapurve wrote:
> On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.03.24 20:04, Sean Christopherson wrote:
>>> On Mon, Mar 04, 2024, Quentin Perret wrote:
>>>>> As discussed in the sub-thread, that might still be required.
>>>>>
>>>>> One could think about completely forbidding GUP on these mmap'ed
>>>>> guest-memfds. But likely, there might be use cases in the future where you
>>>>> want to use GUP on shared memory inside a guest_memfd.
>>>>>
>>>>> (the iouring example I gave might currently not work because
>>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
>>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
>>>>> details)
>>>>
>>>> Perhaps it would be wise to start with GUP being forbidden if the
>>>> current users do not need it (not sure if that is the case in Android,
>>>> I'll check) ? We can always relax this constraint later when/if the
>>>> use-cases arise, which is obviously much harder to do the other way
>>>> around.
>>>
>>> +1000.  At least on the KVM side, I would like to be as conservative as possible
>>> when it comes to letting anything other than the guest access guest_memfd.
>>
>> So we'll have to do it similar to any occurrences of "secretmem" in
>> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
>> similar to e.g., folio_is_secretmem().
>>
>> IIRC, we might not be able to de-reference the actual mapping because it
>> could get freed concurrently ...
>>
>> That will then prohibit any kind of GUP access to these pages, including
>> reading/writing for ptrace/debugging purposes, for core dumping purposes
>> etc. But at least, you know that nobody was able to obtain page
>> references using GUP that might be used for reading/writing later.
>>
> 
> There has been little discussion about supporting 1G pages with
> guest_memfd for TDX/SNP or pKVM. I would like to restart this
> discussion [1]. 1G pages should be a very important usecase for guest
> memfd, especially considering large VM sizes supporting confidential
> GPU/TPU workloads.
> 
> Using separate backing stores for private and shared memory ranges is
> not going to work effectively when using 1G pages. Consider the
> following scenario of memory conversion when using 1G pages to back
> private memory:
> * Guest requests conversion of 4KB range from private to shared, host
> in response ideally does following steps:
>      a) Updates the guest memory attributes
>      b) Unbacks the corresponding private memory
>      c) Allocates corresponding shared memory or let it be faulted in
> when guest accesses it
> 
> Step b above can't be skipped here, otherwise we would have two
> physical pages (1 backing private memory, another backing the shared
> memory) for the same GPA range causing "double allocation".
> 
> With 1G pages, it would be difficult to punch KBs or even MBs sized
> hole since to support that:
> 1G page would need to be split (which hugetlbfs doesn't support today
> because of right reasons), causing -
>          - loss of vmemmap optimization [3]
>          - losing ability to reconstitute the huge page again,
> especially as private pages in CVMs are not relocatable today,
> increasing overall fragmentation over time.
>                - unless a smarter algorithm is devised for memory
> reclaim to reconstitute large pages for unmovable memory.
> 
> With the above limitations in place, best thing could be to allow:
>   - single backing store for both shared and private memory ranges
>   - host userspace to mmap the guest memfd (as this series is trying to do)
>   - allow userspace to fault in memfd file ranges that correspond to
> shared GPA ranges
>       - pagetable mappings will need to be restricted to shared memory
> ranges causing higher granularity mappings (somewhat similar to what
> HGM series from James [2] was trying to do) than 1G.
>   - Allow IOMMU also to map those pages (pfns would be requested using
> get_user_pages* APIs) to allow devices to access shared memory. IOMMU
> management code would have to be enlightened or somehow restricted to
> map only shared regions of guest memfd.
>   - Upon conversion from shared to private, host will have to ensure
> that there are no mappings/references present for the memory ranges
> being converted to private.
> 
> If the above usecase sounds reasonable, GUP access to guest memfd
> pages should be allowed.

To say it with nice words: "Not a fan".

First, I don't think only 1 GiB will be problematic. Already 2 MiB ones 
will be problematic and so far it is even unclear how guest_memfd will 
consume them in a way acceptable to upstream MM. Likely not using 
hugetlb from what I recall after the previous discussions with Mike.

Second, we should find better ways to let an IOMMU map these pages, 
*not* using GUP. There were already discussions on providing a similar 
fd+offset-style interface instead. GUP really sounds like the wrong 
approach here. Maybe we should look into passing not only guest_memfd, 
but also "ordinary" memfds.

Third, I don't think we should be using huge pages where huge pages 
don't make any sense. Using a 1 GiB page so the VM will convert some 
pieces to map it using PTEs will destroy the whole purpose of using 1 
GiB pages. It doesn't make any sense.

A direction that might make sense is either (A) enlightening the VM about 
the granularity in which memory can be converted (but also problematic 
for 1 GiB pages) and/or (B) physically restricting the memory that can 
be converted.

For example, one could create a GPA layout where some regions are backed 
by gigantic pages that cannot be converted/can only be converted as a 
whole, and some are backed by 4k pages that can be converted back and 
forth. We'd use multiple guest_memfds for that. I recall that physically 
restricting such conversions/locations (e.g., for bounce buffers) in 
Linux was already discussed somewhere, but I don't recall the details.
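
To illustrate the layout idea with the existing uAPI (sizes and slot
numbers are illustrative, and a gigantic-page-backed guest_memfd is
hypothetical since guest_memfd has no hugepage support today; a real
setup would also wire up userspace_addr for the shared parts):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void setup_split_layout(int vm_fd)
{
	struct kvm_create_guest_memfd big = { .size = 4ULL << 30 };
	struct kvm_create_guest_memfd small = { .size = 1ULL << 30 };
	int big_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &big);
	int small_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &small);

	/* GPA region that is only ever converted as a whole, so it could
	 * one day be backed by gigantic pages. */
	struct kvm_userspace_memory_region2 bulk = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = 4ULL << 30,
		.guest_memfd = big_fd,
	};

	/* GPA region for memory the guest converts in 4k chunks (e.g.
	 * bounce buffers), backed by small pages. */
	struct kvm_userspace_memory_region2 convertible = {
		.slot = 1,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 4ULL << 30,
		.memory_size = 1ULL << 30,
		.guest_memfd = small_fd,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &bulk);
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &convertible);
}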

It's all not trivial and not easy to get "clean".

Concluding that individual pieces of a 1 GiB / 2 MiB huge page should 
not be converted back and forth might be reasonable. Although I'm sure 
people will argue the opposite and develop hackish solutions in 
desperate ways to make it work somehow.

Huge pages, and especially gigantic pages, are simply a bad fit if the 
VM will convert individual 4k pages.


But to answer your last question: we might be able to avoid GUP by using 
a different mapping API, similar to the ones KVM now provides.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-18 22:02                               ` folio_mmapped David Hildenbrand
@ 2024-03-18 23:07                                 ` Vishal Annapurve
  2024-03-19  0:10                                   ` folio_mmapped Sean Christopherson
  0 siblings, 1 reply; 96+ messages in thread
From: Vishal Annapurve @ 2024-03-18 23:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	keirf, linux-mm

On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.03.24 18:06, Vishal Annapurve wrote:
> > On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 04.03.24 20:04, Sean Christopherson wrote:
> >>> On Mon, Mar 04, 2024, Quentin Perret wrote:
> >>>>> As discussed in the sub-thread, that might still be required.
> >>>>>
> >>>>> One could think about completely forbidding GUP on these mmap'ed
> >>>>> guest-memfds. But likely, there might be use cases in the future where you
> >>>>> want to use GUP on shared memory inside a guest_memfd.
> >>>>>
> >>>>> (the iouring example I gave might currently not work because
> >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> >>>>> details)
> >>>>
> >>>> Perhaps it would be wise to start with GUP being forbidden if the
> >>>> current users do not need it (not sure if that is the case in Android,
> >>>> I'll check) ? We can always relax this constraint later when/if the
> >>>> use-cases arise, which is obviously much harder to do the other way
> >>>> around.
> >>>
> >>> +1000.  At least on the KVM side, I would like to be as conservative as possible
> >>> when it comes to letting anything other than the guest access guest_memfd.
> >>
> >> So we'll have to do it similar to any occurrences of "secretmem" in
> >> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
> >> similar to e.g., folio_is_secretmem().
> >>
> >> IIRC, we might not be able to de-reference the actual mapping because it
> >> could get freed concurrently ...
> >>
> >> That will then prohibit any kind of GUP access to these pages, including
> >> reading/writing for ptrace/debugging purposes, for core dumping purposes
> >> etc. But at least, you know that nobody was able to obtain page
> >> references using GUP that might be used for reading/writing later.
> >>
> >
> > There has been little discussion about supporting 1G pages with
> > guest_memfd for TDX/SNP or pKVM. I would like to restart this
> > discussion [1]. 1G pages should be a very important usecase for guest
> > memfd, especially considering large VM sizes supporting confidential
> > GPU/TPU workloads.
> >
> > Using separate backing stores for private and shared memory ranges is
> > not going to work effectively when using 1G pages. Consider the
> > following scenario of memory conversion when using 1G pages to back
> > private memory:
> > * Guest requests conversion of 4KB range from private to shared, host
> > in response ideally does following steps:
> >      a) Updates the guest memory attributes
> >      b) Unbacks the corresponding private memory
> >      c) Allocates corresponding shared memory or let it be faulted in
> > when guest accesses it
> >
> > Step b above can't be skipped here, otherwise we would have two
> > physical pages (1 backing private memory, another backing the shared
> > memory) for the same GPA range causing "double allocation".
> >
> > With 1G pages, it would be difficult to punch KBs or even MBs sized
> > hole since to support that:
> > 1G page would need to be split (which hugetlbfs doesn't support today
> > because of right reasons), causing -
> >          - loss of vmemmap optimization [3]
> >          - losing ability to reconstitute the huge page again,
> > especially as private pages in CVMs are not relocatable today,
> > increasing overall fragmentation over time.
> >                - unless a smarter algorithm is devised for memory
> > reclaim to reconstitute large pages for unmovable memory.
> >
> > With the above limitations in place, best thing could be to allow:
> >   - single backing store for both shared and private memory ranges
> >   - host userspace to mmap the guest memfd (as this series is trying to do)
> >   - allow userspace to fault in memfd file ranges that correspond to
> > shared GPA ranges
> >       - pagetable mappings will need to be restricted to shared memory
> > ranges causing higher granularity mappings (somewhat similar to what
> > HGM series from James [2] was trying to do) than 1G.
> >   - Allow IOMMU also to map those pages (pfns would be requested using
> > get_user_pages* APIs) to allow devices to access shared memory. IOMMU
> > management code would have to be enlightened or somehow restricted to
> > map only shared regions of guest memfd.
> >   - Upon conversion from shared to private, host will have to ensure
> > that there are no mappings/references present for the memory ranges
> > being converted to private.
> >
> > If the above usecase sounds reasonable, GUP access to guest memfd
> > pages should be allowed.
>
> To say it with nice words: "Not a fan".
>
> First, I don't think only 1 GiB will be problematic. Already 2 MiB ones
> will be problematic and so far it is even unclear how guest_memfd will
> consume them in a way acceptable to upstream MM. Likely not using
> hugetlb from what I recall after the previous discussions with Mike.
>

Agree, the support for 1G pages with guest memfd is yet to be figured
out, but it remains a scenario to be considered here.

> Second, we should find better ways to let an IOMMU map these pages,
> *not* using GUP. There were already discussions on providing a similar
> fd+offset-style interface instead. GUP really sounds like the wrong
> approach here. Maybe we should look into passing not only guest_memfd,
> but also "ordinary" memfds.

I need to dig into past discussions around this, but agree that
passing guest memfds to VFIO drivers in addition to HVAs seems worth
exploring. This may be required anyways for devices supporting TDX
connect [1].

If we are talking about the same file catering to both private and
shared memory, there has to be some way to keep track of references on
the shared memory from both host userspace and IOMMU.

>
> Third, I don't think we should be using huge pages where huge pages
> don't make any sense. Using a 1 GiB page so the VM will convert some
> pieces to map it using PTEs will destroy the whole purpose of using 1
> GiB pages. It doesn't make any sense.

I had started a discussion for this [2] using an RFC series. The main
challenges here remain:
1) Unifying all the conversions under one layer
2) Ensuring shared memory allocations are huge page aligned at boot
time and runtime.

Using any kind of unified shared memory allocator (today this part is
played by SWIOTLB) will need to support huge page aligned dynamic
increments, which can be only guaranteed by carving out enough memory
at boot time for CMA area and using CMA area for allocation at
runtime.
   - Since it's hard to come up with a maximum amount of shared memory
needed by VM, especially with GPUs/TPUs around, it's difficult to come
up with CMA area size at boot time.

I think it's arguable that even if a VM converts 10 % of its memory to
shared using 4k granularity, we still have fewer page table walks on
the rest of the memory when using 1G/2M pages, which is a significant
portion.

>
> A direction that might make sense is either (A) enlightening the VM about
> the granularity in which memory can be converted (but also problematic
> for 1 GiB pages) and/or (B) physically restricting the memory that can
> be converted.

Physically restricting the memory will still need a safe maximum bound
to be calculated based on all the shared memory usecases that VM can
encounter.

>
> For example, one could create a GPA layout where some regions are backed
> by gigantic pages that cannot be converted/can only be converted as a
> whole, and some are backed by 4k pages that can be converted back and
> forth. We'd use multiple guest_memfds for that. I recall that physically
> restricting such conversions/locations (e.g., for bounce buffers) in
> Linux was already discussed somewhere, but I don't recall the details.
>
> It's all not trivial and not easy to get "clean".

Yeah, agree with this point, it's difficult to get a clean solution
here, but the host side solution might be easier to deploy (not
necessarily easier to implement) and possibly cleaner than attempts to
regulate the guest side.

>
> Concluding that individual pieces of a 1 GiB / 2 MiB huge page should
> not be converted back and forth might be reasonable. Although I'm sure
> people will argue the opposite and develop hackish solutions in
> desperate ways to make it work somehow.
>
> Huge pages, and especially gigantic pages, are simply a bad fit if the
> VM will convert individual 4k pages.
>
>
> But to answer your last question: we might be able to avoid GUP by using
> a different mapping API, similar to the ones KVM now provides.
>
> --
> Cheers,
>
> David / dhildenb
>

[1] -> https://cdrdv2.intel.com/v1/dl/getContent/773614
[2] https://lore.kernel.org/lkml/20240112055251.36101-2-vannapurve@google.com/

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-18 23:07                                 ` folio_mmapped Vishal Annapurve
@ 2024-03-19  0:10                                   ` Sean Christopherson
  2024-03-19 10:26                                     ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Sean Christopherson @ 2024-03-19  0:10 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: David Hildenbrand, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	keirf, linux-mm

On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
> > Second, we should find better ways to let an IOMMU map these pages,
> > *not* using GUP. There were already discussions on providing a similar
> > fd+offset-style interface instead. GUP really sounds like the wrong
> > approach here. Maybe we should look into passing not only guest_memfd,
> > but also "ordinary" memfds.

+1.  I am not completely opposed to letting SNP and TDX effectively convert
pages between private and shared, but I also completely agree that letting
anything gup() guest_memfd memory is likely to end in tears.

> I need to dig into past discussions around this, but agree that
> passing guest memfds to VFIO drivers in addition to HVAs seems worth
> exploring. This may be required anyways for devices supporting TDX
> connect [1].
> 
> If we are talking about the same file catering to both private and
> shared memory, there has to be some way to keep track of references on
> the shared memory from both host userspace and IOMMU.
> 
> >
> > Third, I don't think we should be using huge pages where huge pages
> > don't make any sense. Using a 1 GiB page so the VM will convert some
> > pieces to map it using PTEs will destroy the whole purpose of using 1
> > GiB pages. It doesn't make any sense.

I don't disagree, but the fundamental problem is that we have no guarantees as to
what that guest will or will not do.  We can certainly make very educated guesses,
and probably be right 99.99% of the time, but being wrong 0.01% of the time
probably means a lot of broken VMs, and a lot of unhappy customers.

> I had started a discussion for this [2] using an RFC series. 

David is talking about the host side of things, AFAICT you're talking about the
guest side...

> challenges here remain:
> 1) Unifying all the conversions under one layer
> 2) Ensuring shared memory allocations are huge page aligned at boot
> time and runtime.
> 
> Using any kind of unified shared memory allocator (today this part is
> played by SWIOTLB) will need to support huge page aligned dynamic
> increments, which can be only guaranteed by carving out enough memory
> at boot time for CMA area and using CMA area for allocation at
> runtime.
>    - Since it's hard to come up with a maximum amount of shared memory
> needed by VM, especially with GPUs/TPUs around, it's difficult to come
> up with CMA area size at boot time.

...which is very relevant as carving out memory in the guest is nigh impossible,
but carving out memory in the host for systems whose sole purpose is to run VMs
is very doable.

> I think it's arguable that even if a VM converts 10 % of its memory to
> shared using 4k granularity, we still have fewer page table walks on
> the rest of the memory when using 1G/2M pages, which is a significant
> portion.

Performance is a secondary concern.  If this were _just_ about guest performance,
I would unequivocally side with David: the guest gets to keep the pieces if it
fragments a 1GiB page.

The main problem we're trying to solve is that we want to provision a host such
that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
run CoCo VMs, with 100% fungibility.  I.e. a host could run 100% non-CoCo VMs,
100% CoCo VMs, or more likely, some sliding mix of the two.  Ideally, CoCo VMs
would also get the benefits of 1GiB mappings, but that's not the driving motivation
for this discussion.

As HugeTLB exists today, supporting that use case isn't really feasible because
there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute
the 1GiB when the sliver is converted/freed back).

Peeking ahead at my next comment, I don't think that solving this in the guest
is a realistic option, i.e. IMO, we need to figure out a way to handle this in
the host, without relying on the guest to cooperate.  Luckily, we haven't added
hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate
to work with.

The other big advantage that we should lean into is that we can make assumptions
about guest_memfd usage that would never fly for a general purpose backing stores,
e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not
desirable, for (almost?) all of the CoCo use cases.

I don't have any concrete ideas at this time, but my gut feeling is that this
won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being
general purpose, and if we account for conversion scenarios when designing
hugepage support for guest_memfd.

> > For example, one could create a GPA layout where some regions are backed
> > by gigantic pages that cannot be converted/can only be converted as a
> > whole, and some are backed by 4k pages that can be converted back and
> > forth. We'd use multiple guest_memfds for that. I recall that physically
> > restricting such conversions/locations (e.g., for bounce buffers) in
> > Linux was already discussed somewhere, but I don't recall the details.
> >
> > It's all not trivial and not easy to get "clean".
> 
> Yeah, agree with this point, it's difficult to get a clean solution
> here, but the host side solution might be easier to deploy (not
> necessarily easier to implement) and possibly cleaner than attempts to
> regulate the guest side.

I think we missed the opportunity to regulate the guest side by several years.
To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer
VMs, KVM would need to be able to _enforce_ the scheme.  And while I am more than
willing to put my foot down on things where the guest is being blatantly ridiculous,
wanting to convert an arbitrary 4KiB chunk of memory between private and shared
isn't ridiculous (likely inefficient, but not ridiculous).  I.e. I'm not willing
to have KVM refuse conversions that are legal according to the SNP and TDX specs
(and presumably the CCA spec, too).

That's why I think we're years too late; this sort of restriction needs to go in
the "hardware" spec, and that ship has sailed.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-04 21:58                               ` folio_mmapped David Hildenbrand
@ 2024-03-19  9:47                                 ` Quentin Perret
  2024-03-19  9:54                                   ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-03-19  9:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On Monday 04 Mar 2024 at 22:58:49 (+0100), David Hildenbrand wrote:
> On 04.03.24 22:43, Elliot Berman wrote:
> > On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote:
> > > On 04.03.24 20:04, Sean Christopherson wrote:
> > > > On Mon, Mar 04, 2024, Quentin Perret wrote:
> > > > > > As discussed in the sub-thread, that might still be required.
> > > > > > 
> > > > > > One could think about completely forbidding GUP on these mmap'ed
> > > > > > guest-memfds. But likely, there might be use cases in the future where you
> > > > > > want to use GUP on shared memory inside a guest_memfd.
> > > > > > 
> > > > > > (the iouring example I gave might currently not work because
> > > > > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> > > > > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> > > > > > details)
> > > > > 
> > > > > Perhaps it would be wise to start with GUP being forbidden if the
> > > > > current users do not need it (not sure if that is the case in Android,
> > > > > I'll check) ? We can always relax this constraint later when/if the
> > > > > use-cases arise, which is obviously much harder to do the other way
> > > > > around.
> > > > 
> > > > +1000.  At least on the KVM side, I would like to be as conservative as possible
> > > > when it comes to letting anything other than the guest access guest_memfd.
> > > 
> > > So we'll have to do it similar to any occurrences of "secretmem" in gup.c.
> > > We'll have to see how to marry KVM guest_memfd with core-mm code similar to
> > > e.g., folio_is_secretmem().
> > > 
> > > IIRC, we might not be able to de-reference the actual mapping because it
> > > could get freed concurrently ...
> > > 
> > > That will then prohibit any kind of GUP access to these pages, including
> > > reading/writing for ptrace/debugging purposes, for core dumping purposes
> > > etc. But at least, you know that nobody was able to obtain page references
> > > using GUP that might be used for reading/writing later.
> > 
> > Do you have any concerns about adding an AS_NOGUP flag to enum mapping_flags
> > and replacing folio_is_secretmem() with a test of that bit instead of
> > comparing the a_ops? I think it scales better.
> 
> The only concern I have are races, but let's look into the details:
> 
> In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs
> etc.
> 
> We had a similar discussion recently about possible races. It's documented
> in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods.
> 
> "inodes and thus their mappings are freed under RCU, which means the mapping
> cannot be freed beneath us and thus we can safely dereference it."
> 
> So if we follow the same rules as folio_fast_pin_allowed(), we can
> de-reference folio->mapping, for example comparing mapping->a_ops.
> 
> [folio_is_secretmem should better follow the same approach]

Resurrecting this discussion, I had discussions internally and as it
turns out Android makes extensive use of vhost/vsock when communicating
with guest VMs, which requires GUP. So, my bad, not supporting GUP for
the pKVM variant of guest_memfd is a bit of a non-starter, we'll need to
support it from the start. But again this should be a matter of 'simply'
having a dedicated KVM exit reason so hopefully it's not too bad.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19  9:47                                 ` folio_mmapped Quentin Perret
@ 2024-03-19  9:54                                   ` David Hildenbrand
  0 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-19  9:54 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Sean Christopherson, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On 19.03.24 10:47, Quentin Perret wrote:
> On Monday 04 Mar 2024 at 22:58:49 (+0100), David Hildenbrand wrote:
>> On 04.03.24 22:43, Elliot Berman wrote:
>>> On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote:
>>>> On 04.03.24 20:04, Sean Christopherson wrote:
>>>>> On Mon, Mar 04, 2024, Quentin Perret wrote:
>>>>>>> As discussed in the sub-thread, that might still be required.
>>>>>>>
>>>>>>> One could think about completely forbidding GUP on these mmap'ed
>>>>>>> guest-memfds. But likely, there might be use cases in the future where you
>>>>>>> want to use GUP on shared memory inside a guest_memfd.
>>>>>>>
>>>>>>> (the iouring example I gave might currently not work because
>>>>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
>>>>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
>>>>>>> details)
>>>>>>
>>>>>> Perhaps it would be wise to start with GUP being forbidden if the
>>>>>> current users do not need it (not sure if that is the case in Android,
>>>>>> I'll check) ? We can always relax this constraint later when/if the
>>>>>> use-cases arise, which is obviously much harder to do the other way
>>>>>> around.
>>>>>
>>>>> +1000.  At least on the KVM side, I would like to be as conservative as possible
>>>>> when it comes to letting anything other than the guest access guest_memfd.
>>>>
>>>> So we'll have to do it similar to any occurrences of "secretmem" in gup.c.
>>>> We'll have to see how to marry KVM guest_memfd with core-mm code similar to
>>>> e.g., folio_is_secretmem().
>>>>
>>>> IIRC, we might not be able to de-reference the actual mapping because it
>>>> could get freed concurrently ...
>>>>
>>>> That will then prohibit any kind of GUP access to these pages, including
>>>> reading/writing for ptrace/debugging purposes, for core dumping purposes
>>>> etc. But at least, you know that nobody was able to obtain page references
>>>> using GUP that might be used for reading/writing later.
>>>
>>> Do you have any concerns about adding an AS_NOGUP flag to enum mapping_flags
>>> and replacing folio_is_secretmem() with a test of that bit instead of
>>> comparing the a_ops? I think it scales better.
>>
>> The only concern I have are races, but let's look into the details:
>>
>> In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs
>> etc.
>>
>> We had a similar discussion recently about possible races. It's documented
>> in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods.
>>
>> "inodes and thus their mappings are freed under RCU, which means the mapping
>> cannot be freed beneath us and thus we can safely dereference it."
>>
>> So if we follow the same rules as folio_fast_pin_allowed(), we can
>> de-reference folio->mapping, for example comparing mapping->a_ops.
>>
>> [folio_is_secretmem should better follow the same approach]
> 
> Resurecting this discussion, I had discussions internally and as it
> turns out Android makes extensive use of vhost/vsock when communicating
> with guest VMs, which requires GUP. So, my bad, not supporting GUP for
> the pKVM variant of guest_memfd is a bit of a non-starter, we'll need to
> support it from the start. But again this should be a matter of 'simply'
> having a dedicated KVM exit reason so hopefully it's not too bad.

Likely you should look into ways to teach these interfaces that require 
GUP to consume fd+offset instead.
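
Purely as a strawman, such an interface could accept something like the
following descriptor instead of a host virtual address; nothing like this
exists today, the struct and its name are hypothetical:

#include <linux/types.h>

/* Hypothetical fd+offset reference a vhost/vsock-style consumer could
 * take instead of an HVA, avoiding GUP on guest_memfd entirely. */
struct guest_mem_ref {
	int	fd;		/* guest_memfd or ordinary memfd */
	__u64	offset;		/* offset into the file */
	__u64	len;
};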

Yes, there is no such thing as free lunch :P

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19  0:10                                   ` folio_mmapped Sean Christopherson
@ 2024-03-19 10:26                                     ` David Hildenbrand
  2024-03-19 13:19                                       ` folio_mmapped David Hildenbrand
                                                         ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-19 10:26 UTC (permalink / raw)
  To: Sean Christopherson, Vishal Annapurve
  Cc: Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

On 19.03.24 01:10, Sean Christopherson wrote:
> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>>> Second, we should find better ways to let an IOMMU map these pages,
>>> *not* using GUP. There were already discussions on providing a similar
>>> fd+offset-style interface instead. GUP really sounds like the wrong
>>> approach here. Maybe we should look into passing not only guest_memfd,
>>> but also "ordinary" memfds.
> 
> +1.  I am not completely opposed to letting SNP and TDX effectively convert
> pages between private and shared, but I also completely agree that letting
> anything gup() guest_memfd memory is likely to end in tears.

Yes. Avoid it right from the start, if possible.

People wanted guest_memfd to *not* have to mmap guest memory ("even for 
ordinary VMs"). Now people are saying we have to be able to mmap it in 
order to GUP it. It's getting tiring, really.

> 
>> I need to dig into past discussions around this, but agree that
>> passing guest memfds to VFIO drivers in addition to HVAs seems worth
>> exploring. This may be required anyways for devices supporting TDX
>> connect [1].
>>
>> If we are talking about the same file catering to both private and
>> shared memory, there has to be some way to keep track of references on
>> the shared memory from both host userspace and IOMMU.
>>
>>>
>>> Third, I don't think we should be using huge pages where huge pages
>>> don't make any sense. Using a 1 GiB page so the VM will convert some
>>> pieces to map it using PTEs will destroy the whole purpose of using 1
>>> GiB pages. It doesn't make any sense.
> 
> I don't disagree, but the fundamental problem is that we have no guarantees as to
> what that guest will or will not do.  We can certainly make very educated guesses,
> and probably be right 99.99% of the time, but being wrong 0.01% of the time
> probably means a lot of broken VMs, and a lot of unhappy customers.
> 

Right, then don't use huge/gigantic pages. Because it doesn't make any 
sense. You have no guarantees about memory waste. You have no guarantee 
about performance. Then just don't use huge/gigantic pages.

Use them where reasonable, they are an expensive resource. See below.

>> I had started a discussion for this [2] using an RFC series.
> 
> David is talking about the host side of things, AFAICT you're talking about the
> guest side...
> 
>> challenges here remain:
>> 1) Unifying all the conversions under one layer
>> 2) Ensuring shared memory allocations are huge page aligned at boot
>> time and runtime.
>>
>> Using any kind of unified shared memory allocator (today this part is
>> played by SWIOTLB) will need to support huge page aligned dynamic
>> increments, which can be only guaranteed by carving out enough memory
>> at boot time for CMA area and using CMA area for allocation at
>> runtime.
>>     - Since it's hard to come up with a maximum amount of shared memory
>> needed by VM, especially with GPUs/TPUs around, it's difficult to come
>> up with CMA area size at boot time.
> 
> ...which is very relevant as carving out memory in the guest is nigh impossible,
> but carving out memory in the host for systems whose sole purpose is to run VMs
> is very doable.
> 
>> I think it's arguable that even if a VM converts 10 % of its memory to
>> shared using 4k granularity, we still have fewer page table walks on
>> the rest of the memory when using 1G/2M pages, which is a significant
>> portion.
> 
> Performance is a secondary concern.  If this were _just_ about guest performance,
> I would unequivocally side with David: the guest gets to keep the pieces if it
> fragments a 1GiB page.
> 
> The main problem we're trying to solve is that we want to provision a host such
> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
> run CoCo VMs, with 100% fungibility.  I.e. a host could run 100% non-CoCo VMs,
> 100% CoCo VMs, or more likely, some sliding mix of the two.  Ideally, CoCo VMs
> would also get the benefits of 1GiB mappings, but that's not the driving motivation
> for this discussion.

Supporting 1 GiB mappings there sounds like unnecessary complexity and 
opening a big can of worms, especially if "it's not the driving motivation".

If I understand you correctly, the scenario is

(1) We have free 1 GiB hugetlb pages lying around
(2) We want to start a CoCo VM
(3) We don't care about 1 GiB mappings for that CoCo VM, but hugetlb
     pages are all we have.
(4) We want to be able to use the 1 GiB hugetlb page in the future.

With hugetlb, it's possible to reserve a CMA area from which to later 
allocate 1 GiB pages. While not allocated, they can be used for movable 
allocations.

So in the scenario above, free the hugetlb pages back to CMA. Then, 
consume them as 4K pages for the CoCo VM. When wanting to start a 
non-CoCo VM, re-allocate them from CMA.

One catch with that is that
(a) CMA pages cannot get longterm-pinned: for obvious reasons, we
     wouldn't be able to migrate them in order to free up the 1 GiB page.
(b) guest_memfd pages are not movable and cannot currently end up on CMA
     memory.

But maybe that's not actually required in this scenario and we'd like to 
have slightly different semantics: if you were to give the CoCo VM the 1 
GiB pages, they would similarly be unusable until that VM quit and freed 
up the memory!

So it might be acceptable to get "selected" unmovable allocations (from 
guest_memfd) on selected (hugetlb) CMA area, that the "owner" will free 
up when wanting to re-allocate that memory. Otherwise, the CMA 
allocation will simply fail.
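[Rough sketch of that idea; kvm_gmem_cma and both helpers are made up, while
cma_alloc()/cma_release() are the existing CMA primitives:]

static struct folio *kvm_gmem_alloc_from_cma(unsigned int order)
{
        /*
         * Ask the (hugetlb) CMA area for pages, accepting that they are
         * unmovable until the owner frees them again.
         */
        struct page *page = cma_alloc(kvm_gmem_cma, 1 << order, order, false);

        /* If the CMA area cannot satisfy us, simply fail the allocation. */
        return page ? page_folio(page) : NULL;
}

static void kvm_gmem_release_to_cma(struct folio *folio, unsigned int order)
{
        /* The owner frees the memory so the 1 GiB page can be re-allocated. */
        cma_release(kvm_gmem_cma, &folio->page, 1 << order);
}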

If we need improvements in that area to support this case, we can talk. 
Just an idea to avoid HGM and friends just to make it somehow work with 
1 GiB pages ...

> 
> As HugeTLB exists today, supporting that use case isn't really feasible because
> there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute
> the 1GiB when the sliver is converted/freed back).
> 
> Peeking ahead at my next comment, I don't think that solving this in the guest
> is a realistic option, i.e. IMO, we need to figure out a way to handle this in
> the host, without relying on the guest to cooperate.  Luckily, we haven't added
> hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate
> to work with.
> 
> The other big advantage that we should lean into is that we can make assumptions
> about guest_memfd usage that would never fly for a general purpose backing stores,
> e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not
> desirable, for (almost?) all of the CoCo use cases.
> 
> I don't have any concrete ideas at this time, but my gut feeling is that this
> won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being
> general purpose, and if we account for conversion scenarios when designing
> hugepage support for guest_memfd.

I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;)

> 
>>> For example, one could create a GPA layout where some regions are backed
>>> by gigantic pages that cannot be converted/can only be converted as a
>>> whole, and some are backed by 4k pages that can be converted back and
>>> forth. We'd use multiple guest_memfds for that. I recall that physically
>>> restricting such conversions/locations (e.g., for bounce buffers) in
>>> Linux was already discussed somewhere, but I don't recall the details.
>>>
>>> It's all not trivial and not easy to get "clean".
>>
>> Yeah, agree with this point, it's difficult to get a clean solution
>> here, but the host side solution might be easier to deploy (not
>> necessarily easier to implement) and possibly cleaner than attempts to
>> regulate the guest side.
> 
> I think we missed the opportunity to regulate the guest side by several years.
> To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer
> VMs, KVM would need to be able to _enforce_ the scheme.  And while I am more than
> willing to put my foot down on things where the guest is being blatantly ridiculous,
> wanting to convert an arbitrary 4KiB chunk of memory between private and shared
> isn't ridiculous (likely inefficient, but not ridiculous).  I.e. I'm not willing
> to have KVM refuse conversions that are legal according to the SNP and TDX specs
> (and presumably the CCA spec, too).
> 
> That's why I think we're years too late; this sort of restriction needs to go in
> the "hardware" spec, and that ship has sailed.

One could consider extending the spec and gluing huge+gigantic page support 
to new hardware.

But ideally, we could just avoid any partial conversion / HGM just to 
support a scenario where we might not actually want 1 GiB pages, but 
somehow want to make it work with 1 GiB pages.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19 10:26                                     ` folio_mmapped David Hildenbrand
@ 2024-03-19 13:19                                       ` David Hildenbrand
  2024-03-19 14:31                                       ` folio_mmapped Will Deacon
  2024-03-19 15:04                                       ` folio_mmapped Sean Christopherson
  2 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-19 13:19 UTC (permalink / raw)
  To: Sean Christopherson, Vishal Annapurve
  Cc: Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, keirf, linux-mm

>>> I had started a discussion for this [2] using an RFC series.
>>
>> David is talking about the host side of things, AFAICT you're talking about the
>> guest side...
>>
>>> challenges here remain:
>>> 1) Unifying all the conversions under one layer
>>> 2) Ensuring shared memory allocations are huge page aligned at boot
>>> time and runtime.
>>>
>>> Using any kind of unified shared memory allocator (today this part is
>>> played by SWIOTLB) will need to support huge page aligned dynamic
>>> increments, which can be only guaranteed by carving out enough memory
>>> at boot time for CMA area and using CMA area for allocation at
>>> runtime.
>>>      - Since it's hard to come up with a maximum amount of shared memory
>>> needed by VM, especially with GPUs/TPUs around, it's difficult to come
>>> up with CMA area size at boot time.
>>
>> ...which is very relevant as carving out memory in the guest is nigh impossible,
>> but carving out memory in the host for systems whose sole purpose is to run VMs
>> is very doable.
>>
>>> I think it's arguable that even if a VM converts 10 % of its memory to
>>> shared using 4k granularity, we still have fewer page table walks on
>>> the rest of the memory when using 1G/2M pages, which is a significant
>>> portion.
>>
>> Performance is a secondary concern.  If this were _just_ about guest performance,
>> I would unequivocally side with David: the guest gets to keep the pieces if it
>> fragments a 1GiB page.
>>
>> The main problem we're trying to solve is that we want to provision a host such
>> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
>> run CoCo VMs, with 100% fungibility.  I.e. a host could run 100% non-CoCo VMs,
>> 100% CoCo VMs, or more likely, some sliding mix of the two.  Ideally, CoCo VMs
>> would also get the benefits of 1GiB mappings, but that's not the driving motivation
>> for this discussion.
> 
> Supporting 1 GiB mappings there sounds like unnecessary complexity and
> opening a big can of worms, especially if "it's not the driving motivation".
> 
> If I understand you correctly, the scenario is
> 
> (1) We have free 1 GiB hugetlb pages lying around
> (2) We want to start a CoCo VM
> (3) We don't care about 1 GiB mappings for that CoCo VM, but hugetlb
>       pages are all we have.
> (4) We want to be able to use the 1 GiB hugetlb page in the future.
> 
> With hugetlb, it's possible to reserve a CMA area from which to later
> allocate 1 GiB pages. While not allocated, they can be used for movable
> allocations.
> 
> So in the scenario above, free the hugetlb pages back to CMA. Then,
> consume them as 4K pages for the CoCo VM. When wanting to start a
> non-CoCo VM, re-allocate them from CMA.
> 
> One catch with that is that
> (a) CMA pages cannot get longterm-pinned: for obvious reasons, we
>       wouldn't be able to migrate them in order to free up the 1 GiB page.
> (b) guest_memfd pages are not movable and cannot currently end up on CMA
>       memory.
> 
> But maybe that's not actually required in this scenario and we'd like to
> have slightly different semantics: if you were to give the CoCo VM the 1
> GiB pages, they would similarly be unusable until that VM quit and freed
> up the memory!
> 
> So it might be acceptable to get "selected" unmovable allocations (from
> guest_memfd) on selected (hugetlb) CMA area, that the "owner" will free
> up when wanting to re-allocate that memory. Otherwise, the CMA
> allocation will simply fail.
> 
> If we need improvements in that area to support this case, we can talk.
> Just an idea to avoid HGM and friends just to make it somehow work with
> 1 GiB pages ...


Thought about that some more and some cases can also be tricky (avoiding 
fragmenting multiple 1 GiB pages ...).

It's all tricky, especially once multiple (guest_)memfds are involved 
and we really want to avoid most waste. Knowing that large mappings for 
CoCo are rather "optional" and that the challenge is in "reusing" large 
pages is valuable, though.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19 10:26                                     ` folio_mmapped David Hildenbrand
  2024-03-19 13:19                                       ` folio_mmapped David Hildenbrand
@ 2024-03-19 14:31                                       ` Will Deacon
  2024-03-19 23:54                                         ` folio_mmapped Elliot Berman
  2024-03-22 17:52                                         ` folio_mmapped David Hildenbrand
  2024-03-19 15:04                                       ` folio_mmapped Sean Christopherson
  2 siblings, 2 replies; 96+ messages in thread
From: Will Deacon @ 2024-03-19 14:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Vishal Annapurve, Quentin Perret,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

Hi David,

On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> On 19.03.24 01:10, Sean Christopherson wrote:
> > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
> > > > Second, we should find better ways to let an IOMMU map these pages,
> > > > *not* using GUP. There were already discussions on providing a similar
> > > > fd+offset-style interface instead. GUP really sounds like the wrong
> > > > approach here. Maybe we should look into passing not only guest_memfd,
> > > > but also "ordinary" memfds.
> > 
> > +1.  I am not completely opposed to letting SNP and TDX effectively convert
> > pages between private and shared, but I also completely agree that letting
> > anything gup() guest_memfd memory is likely to end in tears.
> 
> Yes. Avoid it right from the start, if possible.
> 
> People wanted guest_memfd to *not* have to mmap guest memory ("even for
> ordinary VMs"). Now people are saying we have to be able to mmap it in order
> to GUP it. It's getting tiring, really.

From the pKVM side, we're working on guest_memfd primarily to avoid
diverging from what other CoCo solutions end up using, but if it gets
de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
today with anonymous memory, then it's a really hard sell to switch over
from what we have in production. We're also hoping that, over time,
guest_memfd will become more closely integrated with the mm subsystem to
enable things like hypervisor-assisted page migration, which we would
love to have.

Today, we use the existing KVM interfaces (i.e. based on anonymous
memory) and it mostly works with the one significant exception that
accessing private memory via a GUP pin will crash the host kernel. If
all guest_memfd() can offer to solve that problem is preventing GUP
altogether, then I'd sooner just add that same restriction to what we
currently have instead of overhauling the user ABI in favour of
something which offers us very little in return.

On the mmap() side of things for guest_memfd, a simpler option for us
than what has currently been proposed might be to enforce that the VMM
has unmapped all private pages on vCPU run, failing the ioctl if that's
not the case. It needs a little more tracking in guest_memfd but I think
GUP will then fall out in the wash because only shared pages will be
mapped by userspace and so GUP will fail by construction for private
pages.
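[A sketch of what that enforcement could look like, assuming guest_memfd grew
the extra tracking; kvm_gmem_has_mapped_private() is hypothetical:]

/*
 * Refuse to enter the guest while the VMM still has private pages
 * mapped. Only shared pages may then be mapped by userspace, so GUP
 * of private memory fails by construction.
 */
static int kvm_vcpu_check_private_unmapped(struct kvm_vcpu *vcpu)
{
        if (kvm_gmem_has_mapped_private(vcpu->kvm))
                return -EBUSY;  /* VMM must munmap() private ranges first */

        return 0;
}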

We're happy to pursue alternative approaches using anonymous memory if
you'd prefer to keep guest_memfd limited in functionality (e.g.
preventing GUP of private pages by extending mapping_flags as per [1]),
but we're equally willing to contribute to guest_memfd if extensions are
welcome.

What do you prefer?

Cheers,

Will

[1] https://lore.kernel.org/r/4b0fd46a-cc4f-4cb7-9f6f-ce19a2d3064e@redhat.com

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19 10:26                                     ` folio_mmapped David Hildenbrand
  2024-03-19 13:19                                       ` folio_mmapped David Hildenbrand
  2024-03-19 14:31                                       ` folio_mmapped Will Deacon
@ 2024-03-19 15:04                                       ` Sean Christopherson
  2024-03-22 17:16                                         ` folio_mmapped David Hildenbrand
  2 siblings, 1 reply; 96+ messages in thread
From: Sean Christopherson @ 2024-03-19 15:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vishal Annapurve, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	keirf, linux-mm

On Tue, Mar 19, 2024, David Hildenbrand wrote:
> On 19.03.24 01:10, Sean Christopherson wrote:
> > Performance is a secondary concern.  If this were _just_ about guest performance,
> > I would unequivocally side with David: the guest gets to keep the pieces if it
> > fragments a 1GiB page.
> > 
> > The main problem we're trying to solve is that we want to provision a host such
> > that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
> > run CoCo VMs, with 100% fungibility.  I.e. a host could run 100% non-CoCo VMs,
> > 100% CoCo VMs, or more likely, some sliding mix of the two.  Ideally, CoCo VMs
> > would also get the benefits of 1GiB mappings, but that's not the driving motivation
> > for this discussion.
> 
> Supporting 1 GiB mappings there sounds like unnecessary complexity and
> opening a big can of worms, especially if "it's not the driving motivation".
> 
> If I understand you correctly, the scenario is
> 
> (1) We have free 1 GiB hugetlb pages lying around
> (2) We want to start a CoCo VM
> (3) We don't care about 1 GiB mappings for that CoCo VM,

We care about 1GiB mappings for CoCo VMs.  My comment about performance being a
secondary concern was specifically saying that it's the guest's responsibility
to play nice with huge mappings if the guest cares about its performance.  For
guests that are well behaved, we most definitely want to provide a configuration
that performs as close to non-CoCo VMs as we can reasonably make it.

And we can do that today, but it requires some amount of host memory to NOT be
in the HugeTLB pool, and instead be kept in reserve so that it can be used for
shared memory for CoCo VMs.  That approach has many downsides, as the extra memory
overhead affects CoCo VM shapes, our ability to use a common pool for non-CoCo
and CoCo VMs, and so on and so forth.

>     but hugetlb pages are all we have.
> (4) We want to be able to use the 1 GiB hugetlb page in the future.

...

> > The other big advantage that we should lean into is that we can make assumptions
> > about guest_memfd usage that would never fly for a general purpose backing stores,
> > e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not
> > desirable, for (almost?) all of the CoCo use cases.
> >
> > I don't have any concrete ideas at this time, but my gut feeling is that this
>> won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being
>> general purpose, and if we account for conversion scenarios when designing
> > hugepage support for guest_memfd.
>
> I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;)

Quite the opposite, I'm saying we should be very deliberate in how we add hugepage
support and other features to guest_memfd, so that guest_memfd doesn't become a
hacky mess.

And I'm saying we should stand firm in what guest_memfd _won't_ support, e.g.
swap/reclaim and probably page migration should get a hard "no".

In other words, ditch the complexity for features that are well served by existing
general purpose solutions, so that guest_memfd can take on a bit of complexity to
serve use cases that are unique to KVM guests, without becoming an unmaintainable
mess due to cross-products.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-03-19 14:31                                       ` folio_mmapped Will Deacon
@ 2024-03-19 23:54                                         ` Elliot Berman
  2024-03-22 16:36                                           ` Will Deacon
  2024-03-22 17:52                                         ` folio_mmapped David Hildenbrand
  1 sibling, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-03-19 23:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Hildenbrand, Sean Christopherson, Vishal Annapurve,
	Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, keirf, linux-mm

On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote:
> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > On 19.03.24 01:10, Sean Christopherson wrote:
> > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > Second, we should find better ways to let an IOMMU map these pages,
> > > > > *not* using GUP. There were already discussions on providing a similar
> > > > > fd+offset-style interface instead. GUP really sounds like the wrong
> > > > > approach here. Maybe we should look into passing not only guest_memfd,
> > > > > but also "ordinary" memfds.
> > > 
> > > +1.  I am not completely opposed to letting SNP and TDX effectively convert
> > > pages between private and shared, but I also completely agree that letting
> > > anything gup() guest_memfd memory is likely to end in tears.
> > 
> > Yes. Avoid it right from the start, if possible.
> > 
> > People wanted guest_memfd to *not* have to mmap guest memory ("even for
> > ordinary VMs"). Now people are saying we have to be able to mmap it in order
> > to GUP it. It's getting tiring, really.
> 
> From the pKVM side, we're working on guest_memfd primarily to avoid
> diverging from what other CoCo solutions end up using, but if it gets
> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> today with anonymous memory, then it's a really hard sell to switch over
> from what we have in production. We're also hoping that, over time,
> guest_memfd will become more closely integrated with the mm subsystem to
> enable things like hypervisor-assisted page migration, which we would
> love to have.
> 
> Today, we use the existing KVM interfaces (i.e. based on anonymous
> memory) and it mostly works with the one significant exception that
> accessing private memory via a GUP pin will crash the host kernel. If
> all guest_memfd() can offer to solve that problem is preventing GUP
> altogether, then I'd sooner just add that same restriction to what we
> currently have instead of overhauling the user ABI in favour of
> something which offers us very little in return.

How would we add the restriction to anonymous memory?

Thinking aloud -- do you mean like some sort of "exclusive GUP" flag
where mm can ensure that the exclusive GUP pin is the only pin? If the
refcount for the page is >1, then the exclusive GUP fails. Any future
GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS.

> On the mmap() side of things for guest_memfd, a simpler option for us
> than what has currently been proposed might be to enforce that the VMM
> has unmapped all private pages on vCPU run, failing the ioctl if that's
> not the case. It needs a little more tracking in guest_memfd but I think
> GUP will then fall out in the wash because only shared pages will be
> mapped by userspace and so GUP will fail by construction for private
> pages.

We can prevent GUP after the pages are marked private, but the pages
could be marked private after the pages were already GUP'd. I don't have
a good way to detect this, so converting a page to private is difficult.

> We're happy to pursue alternative approaches using anonymous memory if
> you'd prefer to keep guest_memfd limited in functionality (e.g.
> preventing GUP of private pages by extending mapping_flags as per [1]),
> but we're equally willing to contribute to guest_memfd if extensions are
> welcome.
> 
> What do you prefer?
> 

I like this as a stepping stone. For the Android use cases, we don't
need to be able to convert a private page to shared and then also be
able to GUP it. If you want to GUP a page, use anonymous memory and that
memory will always be shared. If you don't care about GUP'ing (e.g. it's
going to be guest-private or you otherwise know you won't be GUP'ing),
you can use guest_memfd.

I don't think this design prevents us from adding "sometimes you can
GUP" to guest_memfd in the future. I don't think it creates extra
changes for KVM since anonymous memory is already supported; although
I'd have to add the support for Gunyah.

> [1] https://lore.kernel.org/r/4b0fd46a-cc4f-4cb7-9f6f-ce19a2d3064e@redhat.com

Thanks,
Elliot


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-03-19 23:54                                         ` folio_mmapped Elliot Berman
@ 2024-03-22 16:36                                           ` Will Deacon
  2024-03-22 18:46                                             ` Elliot Berman
  0 siblings, 1 reply; 96+ messages in thread
From: Will Deacon @ 2024-03-22 16:36 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Vishal Annapurve,
	Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, keirf, linux-mm

Hi Elliot,

On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote:
> On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > +1.  I am not completely opposed to letting SNP and TDX effectively convert
> > > > pages between private and shared, but I also completely agree that letting
> > > > anything gup() guest_memfd memory is likely to end in tears.
> > > 
> > > Yes. Avoid it right from the start, if possible.
> > > 
> > > People wanted guest_memfd to *not* have to mmap guest memory ("even for
> > > ordinary VMs"). Now people are saying we have to be able to mmap it in order
> > > to GUP it. It's getting tiring, really.
> > 
> > From the pKVM side, we're working on guest_memfd primarily to avoid
> > diverging from what other CoCo solutions end up using, but if it gets
> > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > today with anonymous memory, then it's a really hard sell to switch over
> > from what we have in production. We're also hoping that, over time,
> > guest_memfd will become more closely integrated with the mm subsystem to
> > enable things like hypervisor-assisted page migration, which we would
> > love to have.
> > 
> > Today, we use the existing KVM interfaces (i.e. based on anonymous
> > memory) and it mostly works with the one significant exception that
> > accessing private memory via a GUP pin will crash the host kernel. If
> > all guest_memfd() can offer to solve that problem is preventing GUP
> > altogether, then I'd sooner just add that same restriction to what we
> > currently have instead of overhauling the user ABI in favour of
> > something which offers us very little in return.
> 
> How would we add the restriction to anonymous memory?
> 
> Thinking aloud -- do you mean like some sort of "exclusive GUP" flag
> where mm can ensure that the exclusive GUP pin is the only pin? If the
> refcount for the page is >1, then the exclusive GUP fails. Any future
> GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS.

Yes, I think we'd want something like that, but I don't think using a
bias on its own is a good idea as false positives due to a large number
of page references will then actually lead to problems (i.e. rejecting
GUP spuriously), no? I suppose if you only considered the new bias in
conjunction with the AS_NOGUP flag you proposed then it might be ok
(i.e. when you see the bias, you then go check the address space to
confirm). What do you think?
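[For illustration, one way the exclusive pin could be expressed; EXCLUSIVE_BIAS,
both helpers and the AS_NOGUP-style check are all hypothetical, and the bias
value is arbitrary:]

#define EXCLUSIVE_BIAS  (INT_MAX / 2)   /* value purely illustrative */

/* Take the one-and-only pin on a folio before donating it to the guest. */
static bool folio_try_exclusive_pin(struct folio *folio)
{
        /* Fails gracefully if any other reference (e.g. a GUP pin) exists. */
        if (!folio_ref_freeze(folio, 1))
                return false;

        /* Publish the bias so ordinary GUP can spot the exclusive pin. */
        folio_ref_unfreeze(folio, EXCLUSIVE_BIAS + 1);
        return true;
}

/*
 * Ordinary GUP: seeing the bias alone could be a false positive on a folio
 * with a genuinely huge refcount, so confirm via the address space.
 */
static bool folio_pin_allowed(struct folio *folio)
{
        struct address_space *mapping;

        if (folio_ref_count(folio) < EXCLUSIVE_BIAS)
                return true;

        mapping = folio_mapping(folio);
        return !(mapping && mapping_nogup(mapping));    /* hypothetical flag */
}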

> > On the mmap() side of things for guest_memfd, a simpler option for us
> > than what has currently been proposed might be to enforce that the VMM
> > has unmapped all private pages on vCPU run, failing the ioctl if that's
> > not the case. It needs a little more tracking in guest_memfd but I think
> > GUP will then fall out in the wash because only shared pages will be
> > mapped by userspace and so GUP will fail by construction for private
> > pages.
> 
> We can prevent GUP after the pages are marked private, but the pages
> could be marked private after the pages were already GUP'd. I don't have
> a good way to detect this, so converting a page to private is difficult.

For anonymous memory, marking the page as private is going to involve an
exclusive GUP so that the page can safely be donated to the guest. In
that case, any existing GUP pin should cause that to fail gracefully.
What is the situation you are concerned about here?

> > We're happy to pursue alternative approaches using anonymous memory if
> > you'd prefer to keep guest_memfd limited in functionality (e.g.
> > preventing GUP of private pages by extending mapping_flags as per [1]),
> > but we're equally willing to contribute to guest_memfd if extensions are
> > welcome.
> > 
> > What do you prefer?
> > 
> 
> I like this as a stepping stone. For the Android use cases, we don't
> need to be able to convert a private page to shared and then also be
> able to GUP it.

I wouldn't want to rule that out, though. The VMM should be able to use
shared pages just like it can with normal anonymous pages.

> I don't think this design prevents us from adding "sometimes you can
> GUP" to guest_memfd in the future.

Technically, I think we can add all the stuff we need to guest_memfd,
but there's a desire to keep that as simple as possible for now, which
is why I'm keen to explore alternatives to unblock the pKVM upstreaming.

Will

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19 15:04                                       ` folio_mmapped Sean Christopherson
@ 2024-03-22 17:16                                         ` David Hildenbrand
  0 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-22 17:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Quentin Perret, Matthew Wilcox, Fuad Tabba,
	kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	keirf, linux-mm

On 19.03.24 16:04, Sean Christopherson wrote:
> On Tue, Mar 19, 2024, David Hildenbrand wrote:
>> On 19.03.24 01:10, Sean Christopherson wrote:
>>> Performance is a secondary concern.  If this were _just_ about guest performance,
>>> I would unequivocally side with David: the guest gets to keep the pieces if it
>>> fragments a 1GiB page.
>>>
>>> The main problem we're trying to solve is that we want to provision a host such
>>> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
>>> run CoCo VMs, with 100% fungibility.  I.e. a host could run 100% non-CoCo VMs,
>>> 100% CoCo VMs, or more likely, some sliding mix of the two.  Ideally, CoCo VMs
>>> would also get the benefits of 1GiB mappings, but that's not the driving motivation
>>> for this discussion.
>>
>> Supporting 1 GiB mappings there sounds like unnecessary complexity and
>> opening a big can of worms, especially if "it's not the driving motivation".
>>
>> If I understand you correctly, the scenario is
>>
>> (1) We have free 1 GiB hugetlb pages lying around
>> (2) We want to start a CoCo VM
>> (3) We don't care about 1 GiB mappings for that CoCo VM,
> 
> We care about 1GiB mappings for CoCo VMs.  My comment about performance being a
> secondary concern was specifically saying that it's the guest's responsibility
> to play nice with huge mappings if the guest cares about its performance.  For
> guests that are well behaved, we most definitely want to provide a configuration
> that performs as close to non-CoCo VMs as we can reasonably make it.

How does the guest know the granularity? I suspect it's just implicit 
knowledge that "PUD granularity might be nice".

> 
> And we can do that today, but it requires some amount of host memory to NOT be
> in the HugeTLB pool, and instead be kept in reserve so that it can be used for
> shared memory for CoCo VMs.  That approach has many downsides, as the extra memory
> overhead affects CoCo VM shapes, our ability to use a common pool for non-CoCo
> and CoCo VMs, and so on and so forth.

Right. But avoiding memory waste as soon as hugetlb is involved (and we 
have two separate memfds for private/shared memory) is not feasible.

> 
>>      but hugetlb pages are all we have.
>> (4) We want to be able to use the 1 GiB hugetlb page in the future.
> 
> ...
> 
>>> The other big advantage that we should lean into is that we can make assumptions
>>> about guest_memfd usage that would never fly for a general purpose backing stores,
>>> e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not
>>> desirable, for (almost?) all of the CoCo use cases.
>>>
>>> I don't have any concrete ideas at this time, but my gut feeling is that this
>>> won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being
>>> general purpose, and if we account for conversion scenarios when designing
>>> hugepage support for guest_memfd.
>>
>> I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;)
> 
> Quite the opposite, I'm saying we should be very deliberate in how we add hugepage
> support and other features to guest_memfd, so that guest_memfd doesn't become a
> hacky mess.

Good.

> 
> And I'm saying we should stand firm in what guest_memfd _won't_ support, e.g.
> swap/reclaim and probably page migration should get a hard "no".

I thought people wanted to support at least page migration in the 
future? (for example, see the reply from Will)

> 
> In other words, ditch the complexity for features that are well served by existing
> general purpose solutions, so that guest_memfd can take on a bit of complexity to
> serve use cases that are unique to KVM guests, without becoming an unmaintainable
> mess due to cross-products.

And I believed that was true until people started wanting to mmap() this 
thing and brought GUP into the picture ... and then talk about HGM and 
all that. *shivers*

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-19 14:31                                       ` folio_mmapped Will Deacon
  2024-03-19 23:54                                         ` folio_mmapped Elliot Berman
@ 2024-03-22 17:52                                         ` David Hildenbrand
  2024-03-22 21:21                                           ` folio_mmapped David Hildenbrand
  2024-03-27 19:34                                           ` folio_mmapped Will Deacon
  1 sibling, 2 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-22 17:52 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sean Christopherson, Vishal Annapurve, Quentin Perret,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

On 19.03.24 15:31, Will Deacon wrote:
> Hi David,

Hi Will,

sorry for the late reply!

> 
> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
>> On 19.03.24 01:10, Sean Christopherson wrote:
>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>>>>> Second, we should find better ways to let an IOMMU map these pages,
>>>>> *not* using GUP. There were already discussions on providing a similar
>>>>> fd+offset-style interface instead. GUP really sounds like the wrong
>>>>> approach here. Maybe we should look into passing not only guest_memfd,
>>>>> but also "ordinary" memfds.
>>>
>>> +1.  I am not completely opposed to letting SNP and TDX effectively convert
>>> pages between private and shared, but I also completely agree that letting
>>> anything gup() guest_memfd memory is likely to end in tears.
>>
>> Yes. Avoid it right from the start, if possible.
>>
>> People wanted guest_memfd to *not* have to mmap guest memory ("even for
>> ordinary VMs"). Now people are saying we have to be able to mmap it in order
>> to GUP it. It's getting tiring, really.
> 
>  From the pKVM side, we're working on guest_memfd primarily to avoid
> diverging from what other CoCo solutions end up using, but if it gets
> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> today with anonymous memory, then it's a really hard sell to switch over
> from what we have in production. We're also hoping that, over time,
> guest_memfd will become more closely integrated with the mm subsystem to
> enable things like hypervisor-assisted page migration, which we would
> love to have.

Reading Sean's reply, he has a different view on that. And I think 
that's the main issue: there are too many different use cases and too 
many different requirements that could turn guest_memfd into something 
that maybe it really shouldn't be.

> 
> Today, we use the existing KVM interfaces (i.e. based on anonymous
> memory) and it mostly works with the one significant exception that
> accessing private memory via a GUP pin will crash the host kernel. If
> all guest_memfd() can offer to solve that problem is preventing GUP
> altogether, then I'd sooner just add that same restriction to what we
> currently have instead of overhauling the user ABI in favour of
> something which offers us very little in return.
> 
> On the mmap() side of things for guest_memfd, a simpler option for us
> than what has currently been proposed might be to enforce that the VMM
> has unmapped all private pages on vCPU run, failing the ioctl if that's
> not the case. It needs a little more tracking in guest_memfd but I think
> GUP will then fall out in the wash because only shared pages will be
> mapped by userspace and so GUP will fail by construction for private
> pages.
> 
> We're happy to pursue alternative approaches using anonymous memory if
> you'd prefer to keep guest_memfd limited in functionality (e.g.
> preventing GUP of private pages by extending mapping_flags as per [1]),
> but we're equally willing to contribute to guest_memfd if extensions are
> welcome.
> 
> What do you prefer?

Let me summarize the history:

AMD had its thing running and it worked for them (but I recall it was 
hacky :) ).

TDX made it possible to crash the machine when accessing secure memory 
from user space (MCE).

So secure memory must not be mapped into user space -- no page tables. 
Prototypes with anonymous memory existed (and I didn't hate them, 
although hacky), but one of the other selling points of guest_memfd was 
that we could create VMs that wouldn't need any page tables at all, 
which I found interesting.

There was a bit more to that (easier conversion, avoiding GUP, 
specifying on allocation that the memory was unmovable ...), but I'll 
get to that later.

The design principle was: nasty private memory (unmovable, unswappable, 
inaccessible, un-GUPable) is allocated from guest_memfd, ordinary 
"shared" memory is allocated from an ordinary memfd.

This makes sense: shared memory is neither nasty nor special. You can 
migrate it, swap it out, map it into page tables, GUP it, ... without 
any issues.


So if I would describe some key characteristics of guest_memfd as of 
today, it would probably be:

1) Memory is unmovable and unswappable. Right from the beginning, it is
    allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
2) Memory is inaccessible. It cannot be read from user space or the
    kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
    hibernation, /proc/kcore) might end up touching it "by accident",
    and we usually can handle these cases.
3) Memory can be discarded in page granularity. There should be no cases
    where you cannot discard memory to over-allocate memory for private
    pages that have been replaced by shared pages otherwise.
4) Page tables are not required (well, it's a memfd), and the fd could
    in theory be passed to other processes.

Having "ordinary shared" memory in there implies that 1) and 2) will 
have to be adjusted for them, which kind-of turns it "partially" into 
ordinary shmem again.


Going back to the beginning: with pKVM, we likely want the following

1) Convert pages private<->shared in-place
2) Stop user space + kernel from accessing private memory in process
    context. Likely for pKVM we would only crash the process, which
    would be acceptable.
3) Prevent GUP to private memory. Otherwise we could crash the kernel.
4) Prevent private pages from swapout+migration until supported.


I suspect your current solution with anonymous memory gets all but 3) 
sorted out, correct?

I'm curious, may there be a requirement in the future that shared memory 
could be mapped into other processes? (thinking vhost-user and such 
things). Of course that's impossible with anonymous memory; teaching 
shmem to contain private memory would kind-of lead to ... guest_memfd, 
just that we don't have shared memory there.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: Re: folio_mmapped
  2024-03-22 16:36                                           ` Will Deacon
@ 2024-03-22 18:46                                             ` Elliot Berman
  2024-03-27 19:31                                               ` Will Deacon
  0 siblings, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-03-22 18:46 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Hildenbrand, Sean Christopherson, Vishal Annapurve,
	Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, keirf, linux-mm

On Fri, Mar 22, 2024 at 04:36:55PM +0000, Will Deacon wrote:
> Hi Elliot,
> 
> On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote:
> > On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote:
> > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > > +1.  I am not completely opposed to letting SNP and TDX effectively convert
> > > > > pages between private and shared, but I also completely agree that letting
> > > > > anything gup() guest_memfd memory is likely to end in tears.
> > > > 
> > > > Yes. Avoid it right from the start, if possible.
> > > > 
> > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for
> > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order
> > > > to GUP it. It's getting tiring, really.
> > > 
> > > From the pKVM side, we're working on guest_memfd primarily to avoid
> > > diverging from what other CoCo solutions end up using, but if it gets
> > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > > today with anonymous memory, then it's a really hard sell to switch over
> > > from what we have in production. We're also hoping that, over time,
> > > guest_memfd will become more closely integrated with the mm subsystem to
> > > enable things like hypervisor-assisted page migration, which we would
> > > love to have.
> > > 
> > > Today, we use the existing KVM interfaces (i.e. based on anonymous
> > > memory) and it mostly works with the one significant exception that
> > > accessing private memory via a GUP pin will crash the host kernel. If
> > > all guest_memfd() can offer to solve that problem is preventing GUP
> > > altogether, then I'd sooner just add that same restriction to what we
> > > currently have instead of overhauling the user ABI in favour of
> > > something which offers us very little in return.
> > 
> > How would we add the restriction to anonymous memory?
> > 
> > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag
> > where mm can ensure that the exclusive GUP pin is the only pin? If the
> > refcount for the page is >1, then the exclusive GUP fails. Any future
> > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS.
> 
> Yes, I think we'd want something like that, but I don't think using a
> bias on its own is a good idea as false positives due to a large number
> of page references will then actually lead to problems (i.e. rejecting
> GUP spuriously), no? I suppose if you only considered the new bias in
> conjunction with the AS_NOGUP flag you proposed then it might be ok
> (i.e. when you see the bias, you then go check the address space to
> confirm). What do you think?
> 

I think the AS_NOGUP would prevent GUPing the first place. If we set the
EXCLUSIVE_BIAS value to something like INT_MAX, do we need to be worried
about there being INT_MAX-1 valid GUPs and wanting to add another?  From
the GUPer's perspective, I don't think it would be much different from
overflowing the refcount.

> > > On the mmap() side of things for guest_memfd, a simpler option for us
> > > than what has currently been proposed might be to enforce that the VMM
> > > has unmapped all private pages on vCPU run, failing the ioctl if that's
> > > not the case. It needs a little more tracking in guest_memfd but I think
> > > GUP will then fall out in the wash because only shared pages will be
> > > mapped by userspace and so GUP will fail by construction for private
> > > pages.
> > 
> > We can prevent GUP after the pages are marked private, but the pages
> > could be marked private after the pages were already GUP'd. I don't have
> > a good way to detect this, so converting a page to private is difficult.
> 
> For anonymous memory, marking the page as private is going to involve an
> exclusive GUP so that the page can safely be donated to the guest. In
> that case, any existing GUP pin should cause that to fail gracefully.
> What is the situation you are concerned about here?
> 

I wasn't thinking about exclusive GUP here. The exclusive GUP should be
able to get the guarantees we need.

I was thinking about making sure we gracefully handle a race to provide
the same page. The kernel should detect the difference between "we're
already providing the page" and "somebody has an unexpected pin". We can
easily read the refcount if we couldn't take the exclusive pin to know.

Thanks,
Elliot

> > > We're happy to pursue alternative approaches using anonymous memory if
> > > you'd prefer to keep guest_memfd limited in functionality (e.g.
> > > preventing GUP of private pages by extending mapping_flags as per [1]),
> > > but we're equally willing to contribute to guest_memfd if extensions are
> > > welcome.
> > > 
> > > What do you prefer?
> > > 
> > 
> > I like this as a stepping stone. For the Android use cases, we don't
> > need to be able to convert a private page to shared and then also be
> > able to GUP it.
> 
> I wouldn't want to rule that out, though. The VMM should be able to use
> shared pages just like it can with normal anonymous pages.
> 
> > I don't think this design prevents us from adding "sometimes you can
> > GUP" to guest_memfd in the future.
> 
> Technically, I think we can add all the stuff we need to guest_memfd,
> but there's a desire to keep that as simple as possible for now, which
> is why I'm keen to explore alternatives to unblock the pKVM upstreaming.
> 
> Will
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-22 17:52                                         ` folio_mmapped David Hildenbrand
@ 2024-03-22 21:21                                           ` David Hildenbrand
  2024-03-26 22:04                                             ` folio_mmapped Elliot Berman
  2024-03-27 19:34                                           ` folio_mmapped Will Deacon
  1 sibling, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-22 21:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sean Christopherson, Vishal Annapurve, Quentin Perret,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

On 22.03.24 18:52, David Hildenbrand wrote:
> On 19.03.24 15:31, Will Deacon wrote:
>> Hi David,
> 
> Hi Will,
> 
> sorry for the late reply!
> 
>>
>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
>>> On 19.03.24 01:10, Sean Christopherson wrote:
>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>> Second, we should find better ways to let an IOMMU map these pages,
>>>>>> *not* using GUP. There were already discussions on providing a similar
>>>>>> fd+offset-style interface instead. GUP really sounds like the wrong
>>>>>> approach here. Maybe we should look into passing not only guest_memfd,
>>>>>> but also "ordinary" memfds.
>>>>
>>>> +1.  I am not completely opposed to letting SNP and TDX effectively convert
>>>> pages between private and shared, but I also completely agree that letting
>>>> anything gup() guest_memfd memory is likely to end in tears.
>>>
>>> Yes. Avoid it right from the start, if possible.
>>>
>>> People wanted guest_memfd to *not* have to mmap guest memory ("even for
>>> ordinary VMs"). Now people are saying we have to be able to mmap it in order
>>> to GUP it. It's getting tiring, really.
>>
>>   From the pKVM side, we're working on guest_memfd primarily to avoid
>> diverging from what other CoCo solutions end up using, but if it gets
>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
>> today with anonymous memory, then it's a really hard sell to switch over
>> from what we have in production. We're also hoping that, over time,
>> guest_memfd will become more closely integrated with the mm subsystem to
>> enable things like hypervisor-assisted page migration, which we would
>> love to have.
> 
> Reading Sean's reply, he has a different view on that. And I think
> that's the main issue: there are too many different use cases and too
> many different requirements that could turn guest_memfd into something
> that maybe it really shouldn't be.
> 
>>
>> Today, we use the existing KVM interfaces (i.e. based on anonymous
>> memory) and it mostly works with the one significant exception that
>> accessing private memory via a GUP pin will crash the host kernel. If
>> all guest_memfd() can offer to solve that problem is preventing GUP
>> altogether, then I'd sooner just add that same restriction to what we
>> currently have instead of overhauling the user ABI in favour of
>> something which offers us very little in return.
>>
>> On the mmap() side of things for guest_memfd, a simpler option for us
>> than what has currently been proposed might be to enforce that the VMM
>> has unmapped all private pages on vCPU run, failing the ioctl if that's
>> not the case. It needs a little more tracking in guest_memfd but I think
>> GUP will then fall out in the wash because only shared pages will be
>> mapped by userspace and so GUP will fail by construction for private
>> pages.
>>
>> We're happy to pursue alternative approaches using anonymous memory if
>> you'd prefer to keep guest_memfd limited in functionality (e.g.
>> preventing GUP of private pages by extending mapping_flags as per [1]),
>> but we're equally willing to contribute to guest_memfd if extensions are
>> welcome.
>>
>> What do you prefer?
> 
> Let me summarize the history:
> 
> AMD had its thing running and it worked for them (but I recall it was
> hacky :) ).
> 
> TDX made it possible to crash the machine when accessing secure memory
> from user space (MCE).
> 
> So secure memory must not be mapped into user space -- no page tables.
> Prototypes with anonymous memory existed (and I didn't hate them,
> although hacky), but one of the other selling points of guest_memfd was
> that we could create VMs that wouldn't need any page tables at all,
> which I found interesting.
> 
> There was a bit more to that (easier conversion, avoiding GUP,
> specifying on allocation that the memory was unmovable ...), but I'll
> get to that later.
> 
> The design principle was: nasty private memory (unmovable, unswappable,
> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary
> "shared" memory is allocated from an ordinary memfd.
> 
> This makes sense: shared memory is neither nasty nor special. You can
> migrate it, swap it out, map it into page tables, GUP it, ... without
> any issues.
> 
> 
> So if I would describe some key characteristics of guest_memfd as of
> today, it would probably be:
> 
> 1) Memory is unmovable and unswappable. Right from the beginning, it is
>      allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> 2) Memory is inaccessible. It cannot be read from user space or the
>      kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
>      hibernation, /proc/kcore) might end up touching it "by accident",
>      and we usually can handle these cases.
> 3) Memory can be discarded in page granularity. There should be no cases
>      where you cannot discard memory to over-allocate memory for private
>      pages that have been replaced by shared pages otherwise.
> 4) Page tables are not required (well, it's an memfd), and the fd could
>      in theory be passed to other processes.
> 
> Having "ordinary shared" memory in there implies that 1) and 2) will
> have to be adjusted for them, which kind-of turns it "partially" into
> ordinary shmem again.
> 
> 
> Going back to the beginning: with pKVM, we likely want the following
> 
> 1) Convert pages private<->shared in-place
> 2) Stop user space + kernel from accessing private memory in process
>      context. Likely for pKVM we would only crash the process, which
>      would be acceptable.
> 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
> 4) Prevent private pages from swapout+migration until supported.
> 
> 
> I suspect your current solution with anonymous memory gets all but 3)
> sorted out, correct?
> 
> I'm curious, may there be a requirement in the future that shared memory
> could be mapped into other processes? (thinking vhost-user and such
> things). Of course that's impossible with anonymous memory; teaching
> shmem to contain private memory would kind-of lead to ... guest_memfd,
> just that we don't have shared memory there.
> 

I was just thinking of something stupid, not sure if it makes any sense. 
I'll raise it here before I forget over the weekend.

... what if we glued one guest_memfd and a memfd (shmem) together in the 
kernel somehow?

(1) A to-shared conversion moves a page from the guest_memfd to the memfd.

(2) A to-private conversion moves a page from the memfd to the guest_memfd.

Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd 
behave like any shmem pages: migratable, swappable etc.


Of course, (2) is only possible if the page is not pinned, not mapped 
(we can unmap it). AND, the page must not reside on ZONE_MOVABLE / 
MIGRATE_CMA.

We'd have to decide what to do when we access a "hole" in the memfd -- 
instead of allocating a fresh page and filling the hole, we'd want to 
SIGBUS.
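
To make that a bit more concrete, here is a very rough sketch of what the
to-private direction could look like. "struct gmem_pair" and the helper
name are made up for illustration, and locking/accounting is omitted --
this is only meant to show which checks would gate the move, not working
code:

/*
 * Illustrative sketch only: pair a guest_memfd mapping with an ordinary
 * memfd mapping and move folios between the two on conversion.
 */
struct gmem_pair {
	struct address_space *private_mapping;	/* the guest_memfd side */
	struct address_space *shared_mapping;	/* the ordinary memfd side */
};

/* (2) to-private: move a page from the memfd into the guest_memfd. */
static int gmem_pair_make_private(struct gmem_pair *p, pgoff_t index)
{
	struct folio *folio = filemap_lock_folio(p->shared_mapping, index);
	int err;

	if (IS_ERR(folio))
		return PTR_ERR(folio);

	/* Refuse if anybody still has it mapped or (maybe) pinned. */
	if (folio_mapped(folio) || folio_maybe_dma_pinned(folio)) {
		folio_unlock(folio);
		folio_put(folio);
		return -EBUSY;
	}

	/* Would also have to migrate off ZONE_MOVABLE / MIGRATE_CMA here. */

	filemap_remove_folio(folio);
	err = filemap_add_folio(p->private_mapping, folio, index, GFP_KERNEL);
	folio_unlock(folio);
	folio_put(folio);
	return err;
}

The to-shared direction would be the mirror image, and faulting a "hole"
in the shared memfd would return VM_FAULT_SIGBUS instead of allocating.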

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: folio_mmapped
  2024-03-22 21:21                                           ` folio_mmapped David Hildenbrand
@ 2024-03-26 22:04                                             ` Elliot Berman
  2024-03-27 17:50                                               ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Elliot Berman @ 2024-03-26 22:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Will Deacon, Sean Christopherson, Vishal Annapurve,
	Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, keirf, linux-mm

On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote:
> On 22.03.24 18:52, David Hildenbrand wrote:
> > On 19.03.24 15:31, Will Deacon wrote:
> > > Hi David,
> > 
> > Hi Will,
> > 
> > sorry for the late reply!
> > 
> > > 
> > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > > > Second, we should find better ways to let an IOMMU map these pages,
> > > > > > > *not* using GUP. There were already discussions on providing a similar
> > > > > > > fd+offset-style interface instead. GUP really sounds like the wrong
> > > > > > > approach here. Maybe we should look into passing not only guest_memfd,
> > > > > > > but also "ordinary" memfds.
> > > > > 
> > > > > +1.  I am not completely opposed to letting SNP and TDX effectively convert
> > > > > pages between private and shared, but I also completely agree that letting
> > > > > anything gup() guest_memfd memory is likely to end in tears.
> > > > 
> > > > Yes. Avoid it right from the start, if possible.
> > > > 
> > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for
> > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order
> > > > to GUP it. It's getting tiring, really.
> > > 
> > >   From the pKVM side, we're working on guest_memfd primarily to avoid
> > > diverging from what other CoCo solutions end up using, but if it gets
> > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > > today with anonymous memory, then it's a really hard sell to switch over
> > > from what we have in production. We're also hoping that, over time,
> > > guest_memfd will become more closely integrated with the mm subsystem to
> > > enable things like hypervisor-assisted page migration, which we would
> > > love to have.
> > 
> > Reading Sean's reply, he has a different view on that. And I think
> > that's the main issue: there are too many different use cases and too
> > many different requirements that could turn guest_memfd into something
> > that maybe it really shouldn't be.
> > 
> > > 
> > > Today, we use the existing KVM interfaces (i.e. based on anonymous
> > > memory) and it mostly works with the one significant exception that
> > > accessing private memory via a GUP pin will crash the host kernel. If
> > > all guest_memfd() can offer to solve that problem is preventing GUP
> > > altogether, then I'd sooner just add that same restriction to what we
> > > currently have instead of overhauling the user ABI in favour of
> > > something which offers us very little in return.
> > > 
> > > On the mmap() side of things for guest_memfd, a simpler option for us
> > > than what has currently been proposed might be to enforce that the VMM
> > > has unmapped all private pages on vCPU run, failing the ioctl if that's
> > > not the case. It needs a little more tracking in guest_memfd but I think
> > > GUP will then fall out in the wash because only shared pages will be
> > > mapped by userspace and so GUP will fail by construction for private
> > > pages.
> > > 
> > > We're happy to pursue alternative approaches using anonymous memory if
> > > you'd prefer to keep guest_memfd limited in functionality (e.g.
> > > preventing GUP of private pages by extending mapping_flags as per [1]),
> > > but we're equally willing to contribute to guest_memfd if extensions are
> > > welcome.
> > > 
> > > What do you prefer?
> > 
> > Let me summarize the history:
> > 
> > AMD had its thing running and it worked for them (but I recall it was
> > hacky :) ).
> > 
> > TDX made it possible to crash the machine when accessing secure memory
> > from user space (MCE).
> > 
> > So secure memory must not be mapped into user space -- no page tables.
> > Prototypes with anonymous memory existed (and I didn't hate them,
> > although hacky), but one of the other selling points of guest_memfd was
> > that we could create VMs that wouldn't need any page tables at all,
> > which I found interesting.
> > 
> > There was a bit more to that (easier conversion, avoiding GUP,
> > specifying on allocation that the memory was unmovable ...), but I'll
> > get to that later.
> > 
> > The design principle was: nasty private memory (unmovable, unswappable,
> > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary
> > "shared" memory is allocated from an ordinary memfd.
> > 
> > This makes sense: shared memory is neither nasty nor special. You can
> > migrate it, swap it out, map it into page tables, GUP it, ... without
> > any issues.
> > 
> > 
> > So if I would describe some key characteristics of guest_memfd as of
> > today, it would probably be:
> > 
> > 1) Memory is unmovable and unswappable. Right from the beginning, it is
> >      allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> > 2) Memory is inaccessible. It cannot be read from user space or the
> >      kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
> >      hibernation, /proc/kcore) might end up touching it "by accident",
> >      and we can usually handle these cases.
> > 3) Memory can be discarded at page granularity. There should be no case
> >      where you cannot discard memory, to avoid over-allocating memory for
> >      private pages that have otherwise been replaced by shared pages.
> > 4) Page tables are not required (well, it's a memfd), and the fd could
> >      in theory be passed to other processes.
> > 
> > Having "ordinary shared" memory in there implies that 1) and 2) will
> > have to be adjusted for them, which kind-of turns it "partially" into
> > ordinary shmem again.
> > 
> > 
> > Going back to the beginning: with pKVM, we likely want the following
> > 
> > 1) Convert pages private<->shared in-place
> > 2) Stop user space + kernel from accessing private memory in process
> >      context. Likely for pKVM we would only crash the process, which
> >      would be acceptable.
> > 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
> > 4) Prevent private pages from swapout+migration until supported.
> > 
> > 
> > I suspect your current solution with anonymous memory gets all but 3)
> > sorted out, correct?
> > 
> > I'm curious, may there be a requirement in the future that shared memory
> > could be mapped into other processes? (thinking vhost-user and such
> > things). Of course that's impossible with anonymous memory; teaching
> > shmem to contain private memory would kind-of lead to ... guest_memfd,
> > just that we don't have shared memory there.
> > 
> 
> I was just thinking of something stupid, not sure if it makes any sense.
> I'll raise it here before I forget over the weekend.
> 
> ... what if we glued one guest_memfd and a memfd (shmem) together in the
> kernel somehow?
> 
> (1) A to-shared conversion moves a page from the guest_memfd to the memfd.
> 
> (2) A to-private conversion moves a page from the memfd to the guest_memfd.
> 
> Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd behave
> like any shmem pages: migratable, swappable etc.
> 
> 
> Of course, (2) is only possible if the page is not pinned, not mapped (we
> can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA.
> 

Quentin gave an idea offline of using splice to achieve the conversions.
I'd want to use the in-kernel APIs on page-fault to do the conversion,
not requiring userspace to make the splice() syscall.  One thing splice
currently requires is the source (in) file; the KVM UAPI today only gives
us a userspace address. We could resolve that with for_each_vma_range().
I've just started looking into splice(), but I believe it takes care of
the not-pinned and not-mapped checks. guest_memfd would have to migrate
the page out of ZONE_MOVABLE / MIGRATE_CMA.

Does this seem like a good path to pursue further or any other ideas for
doing the conversion?

> We'd have to decide what to do when we access a "hole" in the memfd --
> instead of allocating a fresh page and filling the hole, we'd want to
> SIGBUS.

Since the KVM UAPI is based on userspace addresses and not fds for the
shared memory part, maybe we could add a mmu_notifier_ops that allows
KVM to intercept and reject faults if we couldn't reclaim the memory. I
think it would be conceptually similar to userfaultfd except in the
kernel; not sure if re-using userfaultfd makes sense?

Thanks,
Elliot


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-26 22:04                                             ` folio_mmapped Elliot Berman
@ 2024-03-27 17:50                                               ` David Hildenbrand
  0 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-03-27 17:50 UTC (permalink / raw)
  To: Will Deacon, Sean Christopherson, Vishal Annapurve,
	Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, keirf, linux-mm

On 26.03.24 23:04, Elliot Berman wrote:
> On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote:
>> On 22.03.24 18:52, David Hildenbrand wrote:
>>> On 19.03.24 15:31, Will Deacon wrote:
>>>> Hi David,
>>>
>>> Hi Will,
>>>
>>> sorry for the late reply!
>>>
>>>>
>>>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
>>>>> On 19.03.24 01:10, Sean Christopherson wrote:
>>>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>>>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>> Second, we should find better ways to let an IOMMU map these pages,
>>>>>>>> *not* using GUP. There were already discussions on providing a similar
>>>>>>>> fd+offset-style interface instead. GUP really sounds like the wrong
>>>>>>>> approach here. Maybe we should look into passing not only guest_memfd,
>>>>>>>> but also "ordinary" memfds.
>>>>>>
>>>>>> +1.  I am not completely opposed to letting SNP and TDX effectively convert
>>>>>> pages between private and shared, but I also completely agree that letting
>>>>>> anything gup() guest_memfd memory is likely to end in tears.
>>>>>
>>>>> Yes. Avoid it right from the start, if possible.
>>>>>
>>>>> People wanted guest_memfd to *not* have to mmap guest memory ("even for
>>>>> ordinary VMs"). Now people are saying we have to be able to mmap it in order
>>>>> to GUP it. It's getting tiring, really.
>>>>
>>>>    From the pKVM side, we're working on guest_memfd primarily to avoid
>>>> diverging from what other CoCo solutions end up using, but if it gets
>>>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
>>>> today with anonymous memory, then it's a really hard sell to switch over
>>>> from what we have in production. We're also hoping that, over time,
>>>> guest_memfd will become more closely integrated with the mm subsystem to
>>>> enable things like hypervisor-assisted page migration, which we would
>>>> love to have.
>>>
>>> Reading Sean's reply, he has a different view on that. And I think
>>> that's the main issue: there are too many different use cases and too
>>> many different requirements that could turn guest_memfd into something
>>> that maybe it really shouldn't be.
>>>
>>>>
>>>> Today, we use the existing KVM interfaces (i.e. based on anonymous
>>>> memory) and it mostly works with the one significant exception that
>>>> accessing private memory via a GUP pin will crash the host kernel. If
>>>> all guest_memfd() can offer to solve that problem is preventing GUP
>>>> altogether, then I'd sooner just add that same restriction to what we
>>>> currently have instead of overhauling the user ABI in favour of
>>>> something which offers us very little in return.
>>>>
>>>> On the mmap() side of things for guest_memfd, a simpler option for us
>>>> than what has currently been proposed might be to enforce that the VMM
>>>> has unmapped all private pages on vCPU run, failing the ioctl if that's
>>>> not the case. It needs a little more tracking in guest_memfd but I think
>>>> GUP will then fall out in the wash because only shared pages will be
>>>> mapped by userspace and so GUP will fail by construction for private
>>>> pages.
>>>>
>>>> We're happy to pursue alternative approaches using anonymous memory if
>>>> you'd prefer to keep guest_memfd limited in functionality (e.g.
>>>> preventing GUP of private pages by extending mapping_flags as per [1]),
>>>> but we're equally willing to contribute to guest_memfd if extensions are
>>>> welcome.
>>>>
>>>> What do you prefer?
>>>
>>> Let me summarize the history:
>>>
>>> AMD had its thing running and it worked for them (but I recall it was
>>> hacky :) ).
>>>
>>> TDX made it possible to crash the machine when accessing secure memory
>>> from user space (MCE).
>>>
>>> So secure memory must not be mapped into user space -- no page tables.
>>> Prototypes with anonymous memory existed (and I didn't hate them,
>>> although hacky), but one of the other selling points of guest_memfd was
>>> that we could create VMs that wouldn't need any page tables at all,
>>> which I found interesting.
>>>
>>> There was a bit more to that (easier conversion, avoiding GUP,
>>> specifying on allocation that the memory was unmovable ...), but I'll
>>> get to that later.
>>>
>>> The design principle was: nasty private memory (unmovable, unswappable,
>>> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary
>>> "shared" memory is allocated from an ordinary memfd.
>>>
>>> This makes sense: shared memory is neither nasty nor special. You can
>>> migrate it, swap it out, map it into page tables, GUP it, ... without
>>> any issues.
>>>
>>>
>>> So if I would describe some key characteristics of guest_memfd as of
>>> today, it would probably be:
>>>
>>> 1) Memory is unmovable and unswappable. Right from the beginning, it is
>>>       allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
>>> 2) Memory is inaccessible. It cannot be read from user space or the
>>>       kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
>>>       hibernation, /proc/kcore) might end up touching it "by accident",
>>>       and we can usually handle these cases.
>>> 3) Memory can be discarded at page granularity. There should be no case
>>>       where you cannot discard memory, to avoid over-allocating memory for
>>>       private pages that have otherwise been replaced by shared pages.
>>> 4) Page tables are not required (well, it's a memfd), and the fd could
>>>       in theory be passed to other processes.
>>>
>>> Having "ordinary shared" memory in there implies that 1) and 2) will
>>> have to be adjusted for them, which kind-of turns it "partially" into
>>> ordinary shmem again.
>>>
>>>
>>> Going back to the beginning: with pKVM, we likely want the following
>>>
>>> 1) Convert pages private<->shared in-place
>>> 2) Stop user space + kernel from accessing private memory in process
>>>       context. Likely for pKVM we would only crash the process, which
>>>       would be acceptable.
>>> 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
>>> 4) Prevent private pages from swapout+migration until supported.
>>>
>>>
>>> I suspect your current solution with anonymous memory gets all but 3)
>>> sorted out, correct?
>>>
>>> I'm curious, may there be a requirement in the future that shared memory
>>> could be mapped into other processes? (thinking vhost-user and such
>>> things). Of course that's impossible with anonymous memory; teaching
>>> shmem to contain private memory would kind-of lead to ... guest_memfd,
>>> just that we don't have shared memory there.
>>>
>>
>> I was just thinking of something stupid, not sure if it makes any sense.
>> I'll raise it here before I forget over the weekend.
>>
>> ... what if we glued one guest_memfd and a memfd (shmem) together in the
>> kernel somehow?
>>
>> (1) A to-shared conversion moves a page from the guest_memfd to the memfd.
>>
>> (2) A to-private conversion moves a page from the memfd to the guest_memfd.
>>
>> Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd behave
>> like any shmem pages: migratable, swappable etc.
>>
>>
>> Of course, (2) is only possible if the page is not pinned, not mapped (we
>> can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA.
>>
> 
> Quentin gave an idea offline of using splice to achieve the conversions.
> I'd want to use the in-kernel APIs on page-fault to do the conversion,
> not requiring userspace to make the splice() syscall.  One thing splice
> currently requires is the source (in) file; the KVM UAPI today only gives
> us a userspace address. We could resolve that with for_each_vma_range().
> I've just started looking into splice(), but I believe it takes care of
> the not-pinned and not-mapped checks. guest_memfd would have to migrate
> the page out of ZONE_MOVABLE / MIGRATE_CMA.

I don't think we want to involve splice. Conceptually, I think KVM 
should create a pair of FDs: guest_memfd for private memory and 
"ordinary shmem/memfd" for shared memory.

Conversion back and forth can either be triggered using a KVM API (TDX 
use case), or internally from KVM (pKVM use case). Maybe the conversion 
does something internally that splice would also do and that we can 
reuse; otherwise we have to do the plumbing ourselves.

Then, we have some logic on how to handle access to unbacked regions 
(SIGBUS instead of allocating memory) inside both memfds, and we allow 
allocating memory for parts of the fds explicitly.

No offset in the fds can be populated at the same time. That is, pages can 
be moved back and forth, but allocating a fresh page in an fd is only 
possible if there is nothing at that location in the other fd. No memory 
over-allocation.

Coming up with a KVM API for that should be possible.
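
Just to sketch the direction (all names below are invented, this is not a
concrete UAPI proposal): conversion could be a simple range-based ioctl,
and the shared side's fault handler would refuse to allocate on demand:

/* Hypothetical UAPI sketch -- KVM_GMEM_CONVERT and the struct are made up. */
struct kvm_gmem_convert {
	__u64 offset;			/* offset into both fds */
	__u64 size;
#define KVM_GMEM_TO_SHARED	(1ULL << 0)
#define KVM_GMEM_TO_PRIVATE	(1ULL << 1)
	__u64 flags;
};

/*
 * On the shared side, accessing an unbacked offset is a hard error
 * instead of allocating a fresh page (again, just a sketch).
 */
static vm_fault_t gmem_shared_fault(struct vm_fault *vmf)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	struct folio *folio = filemap_lock_folio(mapping, vmf->pgoff);

	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return VM_FAULT_LOCKED;
}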

> 
> Does this seem like a good path to pursue further or any other ideas for
> doing the conversion?
> 
>> We'd have to decide what to do when we access a "hole" in the memfd --
>> instead of allocating a fresh page and filling the hole, we'd want to
>> SIGBUS.
> 
> Since the KVM UAPI is based on userspace addresses and not fds for the
> shared memory part, maybe we could add a mmu_notifier_ops that allows
> KVM to intercept and reject faults if we couldn't reclaim the memory. I
> think it would be conceptually similar to userfaultfd except in the
> kernel; not sure if re-using userfaultfd makes sense?

Or, if KVM exposes this other fd as well, we extend the UAPI to also 
consume fd+offset for the shared part.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: Re: Re: folio_mmapped
  2024-03-22 18:46                                             ` Elliot Berman
@ 2024-03-27 19:31                                               ` Will Deacon
  0 siblings, 0 replies; 96+ messages in thread
From: Will Deacon @ 2024-03-27 19:31 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Vishal Annapurve,
	Quentin Perret, Matthew Wilcox, Fuad Tabba, kvm, kvmarm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, keirf, linux-mm

Hi Elliot,

On Fri, Mar 22, 2024 at 11:46:10AM -0700, Elliot Berman wrote:
> On Fri, Mar 22, 2024 at 04:36:55PM +0000, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote:
> > > How would we add the restriction to anonymous memory?
> > > 
> > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag
> > > where mm can ensure that the exclusive GUP pin is the only pin? If the
> > > refcount for the page is >1, then the exclusive GUP fails. Any future
> > > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS.
> > 
> > Yes, I think we'd want something like that, but I don't think using a
> > bias on its own is a good idea as false positives due to a large number
> > of page references will then actually lead to problems (i.e. rejecting
> > GUP spuriously), no? I suppose if you only considered the new bias in
> > conjunction with the AS_NOGUP flag you proposed then it might be ok
> > (i.e. when you see the bias, you then go check the address space to
> > confirm). What do you think?
> > 
> 
> I think the AS_NOGUP would prevent GUPing the first place. If we set the
> EXCLUSIVE_BIAS value to something like INT_MAX, do we need to be worried
> about there being INT_MAX-1 valid GUPs and wanting to add another?  From
> the GUPer's perspective, I don't think it would be much different from
> overflowing the refcount.

I don't think we should end up in a position where a malicious user can
take a tonne of references and cause functional problems. For example,
look at page_maybe_dma_pinned(); the worst case is we end up treating
heavily referenced pages as pinned and the soft-dirty logic leaves them
perpetually dirtied. The side-effects of what we're talking about here
seem to be much worse than that unless we can confirm that the page
really is exclusive.
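
Just so we're talking about the same construct, below is roughly what I
have in mind. EXCLUSIVE_BIAS and AS_NOGUP are the hypothetical names from
this thread -- none of this exists today, it's only meant to illustrate
why I'd want the mapping flag as confirmation rather than trusting the
bias alone:

#define EXCLUSIVE_BIAS	(INT_MAX / 2)	/* hypothetical marker value */

/*
 * Take the one-and-only pin on @folio: only succeeds if the caller can
 * account for every existing reference (@expected_refs).
 */
static bool folio_try_exclusive_pin(struct folio *folio, int expected_refs)
{
	if (!folio_ref_freeze(folio, expected_refs))
		return false;		/* somebody else holds a reference */
	folio_ref_unfreeze(folio, expected_refs + EXCLUSIVE_BIAS);
	return true;
}

/*
 * GUP would then refuse pages that look exclusively pinned, but only
 * after confirming via the (hypothetical) AS_NOGUP mapping flag, so a
 * spuriously high refcount alone cannot reject GUP.
 */
static bool folio_gup_forbidden(struct folio *folio)
{
	return folio_ref_count(folio) >= EXCLUSIVE_BIAS &&
	       folio->mapping &&
	       test_bit(AS_NOGUP, &folio->mapping->flags);
}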

> > > > On the mmap() side of things for guest_memfd, a simpler option for us
> > > > than what has currently been proposed might be to enforce that the VMM
> > > > has unmapped all private pages on vCPU run, failing the ioctl if that's
> > > > not the case. It needs a little more tracking in guest_memfd but I think
> > > > GUP will then fall out in the wash because only shared pages will be
> > > > mapped by userspace and so GUP will fail by construction for private
> > > > pages.
> > > 
> > > We can prevent GUP after the pages are marked private, but the pages
> > > could be marked private after the pages were already GUP'd. I don't have
> > > a good way to detect this, so converting a page to private is difficult.
> > 
> > For anonymous memory, marking the page as private is going to involve an
> > exclusive GUP so that the page can safely be donated to the guest. In
> > that case, any existing GUP pin should cause that to fail gracefully.
> > What is the situation you are concerned about here?
> > 
> 
> I wasn't thinking about exclusive GUP here. The exclusive GUP should be
> able to get the guarantees we need.
> 
> I was thinking about making sure we gracefully handle a race to provide
> the same page. The kernel should detect the difference between "we're
> already providing the page" and "somebody has an unexpected pin". If we
> couldn't take the exclusive pin, we can easily read the refcount to tell
> which case we hit.

Thanks, that makes sense to me. For pKVM, the architecture code also
tracks all the donated pages, so we should be able to provide additional
metadata here if we shuffle things around a little.

Will

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-22 17:52                                         ` folio_mmapped David Hildenbrand
  2024-03-22 21:21                                           ` folio_mmapped David Hildenbrand
@ 2024-03-27 19:34                                           ` Will Deacon
  2024-03-28  9:06                                             ` folio_mmapped David Hildenbrand
  2024-04-04  0:15                                             ` folio_mmapped Sean Christopherson
  1 sibling, 2 replies; 96+ messages in thread
From: Will Deacon @ 2024-03-27 19:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Vishal Annapurve, Quentin Perret,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

Hi again, David,

On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
> On 19.03.24 15:31, Will Deacon wrote:
> sorry for the late reply!

Bah, you and me both!

> > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
> >  From the pKVM side, we're working on guest_memfd primarily to avoid
> > diverging from what other CoCo solutions end up using, but if it gets
> > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > today with anonymous memory, then it's a really hard sell to switch over
> > from what we have in production. We're also hoping that, over time,
> > guest_memfd will become more closely integrated with the mm subsystem to
> > enable things like hypervisor-assisted page migration, which we would
> > love to have.
> 
> Reading Sean's reply, he has a different view on that. And I think that's
> the main issue: there are too many different use cases and too many
> different requirements that could turn guest_memfd into something that maybe
> it really shouldn't be.

No argument there, and we're certainly not tied to any specific
mechanism on the pKVM side. Maybe Sean can chime in, but we've
definitely spoken about migration being a goal in the past, so I guess
something changed since then on the guest_memfd side.

Regardless, from our point of view, we just need to make sure that
whatever we settle on for pKVM does the things we need it to do (or can
at least be extended to do them) and we're happy to implement that in
whatever way works best for upstream, guest_memfd or otherwise.

> > We're happy to pursue alternative approaches using anonymous memory if
> > you'd prefer to keep guest_memfd limited in functionality (e.g.
> > preventing GUP of private pages by extending mapping_flags as per [1]),
> > but we're equally willing to contribute to guest_memfd if extensions are
> > welcome.
> > 
> > What do you prefer?
> 
> Let me summarize the history:

First off, thanks for piecing together the archaeology...

> AMD had its thing running and it worked for them (but I recall it was hacky
> :) ).
> 
> TDX made it possible to crash the machine when accessing secure memory from
> user space (MCE).
> 
> So secure memory must not be mapped into user space -- no page tables.
> Prototypes with anonymous memory existed (and I didn't hate them, although
> hacky), but one of the other selling points of guest_memfd was that we could
> create VMs that wouldn't need any page tables at all, which I found
> interesting.

Are the prototypes you refer to here based on the old stuff from Kirill?
We followed that work at the time, thinking we were going to be using
that before guest_memfd came along, so we've sadly been collecting
out-of-tree patches for a little while :/

> There was a bit more to that (easier conversion, avoiding GUP, specifying on
> allocation that the memory was unmovable ...), but I'll get to that later.
> 
> The design principle was: nasty private memory (unmovable, unswappable,
> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared"
> memory is allocated from an ordinary memfd.
> 
> This makes sense: shared memory is neither nasty nor special. You can
> migrate it, swap it out, map it into page tables, GUP it, ... without any
> issues.

Slight aside and not wanting to derail the discussion, but we have a few
different types of sharing which we'll have to consider:

  * Memory shared from the host to the guest. This remains owned by the
    host and the normal mm stuff can be made to work with it.

  * Memory shared from the guest to the host. This remains owned by the
    guest, so there's a pin on the pages and the normal mm stuff can't
    work without co-operation from the guest (see next point).

  * Memory relinquished from the guest to the host. This actually unmaps
    the pages from the host and transfers ownership back to the host,
    after which the pin is dropped and the normal mm stuff can work. We
    use this to implement ballooning.

I suppose the main thing is that the architecture backend can deal with
these states, so the core code shouldn't really care as long as it's
aware that shared memory may be pinned.

> So if I would describe some key characteristics of guest_memfd as of today,
> it would probably be:
> 
> 1) Memory is unmovable and unswappable. Right from the beginning, it is
>    allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> 2) Memory is inaccessible. It cannot be read from user space or the
>    kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
>    hibernation, /proc/kcore) might end up touching it "by accident",
>    and we can usually handle these cases.
> 3) Memory can be discarded at page granularity. There should be no case
>    where you cannot discard memory, to avoid over-allocating memory for
>    private pages that have otherwise been replaced by shared pages.
> 4) Page tables are not required (well, it's a memfd), and the fd could
>    in theory be passed to other processes.
> 
> Having "ordinary shared" memory in there implies that 1) and 2) will have to
> be adjusted for them, which kind-of turns it "partially" into ordinary shmem
> again.

Yes, and we'd also need a way to establish hugepages (where possible)
even for the *private* memory so as to reduce the depth of the guest's
stage-2 walk.

> Going back to the beginning: with pKVM, we likely want the following
> 
> 1) Convert pages private<->shared in-place
> 2) Stop user space + kernel from accessing private memory in process
>    context. Likely for pKVM we would only crash the process, which
>    would be acceptable.
> 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
> 4) Prevent private pages from swapout+migration until supported.
> 
> 
> I suspect your current solution with anonymous memory gets all but 3) sorted
> out, correct?

I agree on all of these and, yes, (3) is the problem for us. We've also
been thinking a bit about CoW recently and I suspect the use of
vm_normal_page() in do_wp_page() could lead to issues similar to those
we hit with GUP. There are various ways to approach that, but I'm not
sure what's best.

> I'm curious, may there be a requirement in the future that shared memory
> could be mapped into other processes? (thinking vhost-user and such things).

It's not impossible. We use crosvm as our VMM, and that has a
multi-process sandbox mode which I think relies on just that...

Cheers,

Will

(btw: I'm getting some time away from the computer over Easter, so I'll be
 a little slow on email again. Nothing personal!).

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-27 19:34                                           ` folio_mmapped Will Deacon
@ 2024-03-28  9:06                                             ` David Hildenbrand
  2024-03-28 10:10                                               ` folio_mmapped Quentin Perret
  2024-04-04  0:15                                             ` folio_mmapped Sean Christopherson
  1 sibling, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-28  9:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sean Christopherson, Vishal Annapurve, Quentin Perret,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

On 27.03.24 20:34, Will Deacon wrote:
> Hi again, David,
> 
> On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
>> On 19.03.24 15:31, Will Deacon wrote:
>> sorry for the late reply!
> 
> Bah, you and me both!

This time I'm faster! :)

> 
>>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
>>>> On 19.03.24 01:10, Sean Christopherson wrote:
>>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>>>   From the pKVM side, we're working on guest_memfd primarily to avoid
>>> diverging from what other CoCo solutions end up using, but if it gets
>>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
>>> today with anonymous memory, then it's a really hard sell to switch over
>>> from what we have in production. We're also hoping that, over time,
>>> guest_memfd will become more closely integrated with the mm subsystem to
>>> enable things like hypervisor-assisted page migration, which we would
>>> love to have.
>>
>> Reading Sean's reply, he has a different view on that. And I think that's
>> the main issue: there are too many different use cases and too many
>> different requirements that could turn guest_memfd into something that maybe
>> it really shouldn't be.
> 
> No argument there, and we're certainly not tied to any specific
> mechanism on the pKVM side. Maybe Sean can chime in, but we've
> definitely spoken about migration being a goal in the past, so I guess
> something changed since then on the guest_memfd side.
> 
> Regardless, from our point of view, we just need to make sure that
> whatever we settle on for pKVM does the things we need it to do (or can
> at least be extended to do them) and we're happy to implement that in
> whatever way works best for upstream, guest_memfd or otherwise.
> 
>>> We're happy to pursue alternative approaches using anonymous memory if
>>> you'd prefer to keep guest_memfd limited in functionality (e.g.
>>> preventing GUP of private pages by extending mapping_flags as per [1]),
>>> but we're equally willing to contribute to guest_memfd if extensions are
>>> welcome.
>>>
>>> What do you prefer?
>>
>> Let me summarize the history:
> 
> First off, thanks for piecing together the archaeology...
> 
>> AMD had its thing running and it worked for them (but I recall it was hacky
>> :) ).
>>
>> TDX made it possible to crash the machine when accessing secure memory from
>> user space (MCE).
>>
>> So secure memory must not be mapped into user space -- no page tables.
>> Prototypes with anonymous memory existed (and I didn't hate them, although
>> hacky), but one of the other selling points of guest_memfd was that we could
>> create VMs that wouldn't need any page tables at all, which I found
>> interesting.
> 
> Are the prototypes you refer to here based on the old stuff from Kirill?

Yes.

> We followed that work at the time, thinking we were going to be using
> that before guest_memfd came along, so we've sadly been collecting
> out-of-tree patches for a little while :/

:/

> 
>> There was a bit more to that (easier conversion, avoiding GUP, specifying on
>> allocation that the memory was unmovable ...), but I'll get to that later.
>>
>> The design principle was: nasty private memory (unmovable, unswappable,
>> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared"
>> memory is allocated from an ordinary memfd.
>>
>> This makes sense: shared memory is neither nasty nor special. You can
>> migrate it, swap it out, map it into page tables, GUP it, ... without any
>> issues.
> 
> Slight aside and not wanting to derail the discussion, but we have a few
> different types of sharing which we'll have to consider:

Thanks for sharing!

> 
>    * Memory shared from the host to the guest. This remains owned by the
>      host and the normal mm stuff can be made to work with it.

Okay, host and guest can access it. We can just migrate memory around, 
swap it out ... like ordinary guest memory today.

> 
>    * Memory shared from the guest to the host. This remains owned by the
>      guest, so there's a pin on the pages and the normal mm stuff can't
>      work without co-operation from the guest (see next point).

Okay, host and guest can access it, but we cannot migrate memory around 
or swap it out ... like ordinary guest memory today that is longterm pinned.

> 
>    * Memory relinquished from the guest to the host. This actually unmaps
>      the pages from the host and transfers ownership back to the host,
>      after which the pin is dropped and the normal mm stuff can work. We
>      use this to implement ballooning.
> 

Okay, so this is essentially just a state transition between the two above.


> I suppose the main thing is that the architecture backend can deal with
> these states, so the core code shouldn't really care as long as it's
> aware that shared memory may be pinned.

So IIUC, the states are:

(1) Private: inaccessible by the host, accessible by the guest, "owned by
     the guest"

(2) Host Shared: accessible by the host + guest, "owned by the host"

(3) Guest Shared: accessible by the host, "owned by the guest"


Memory ballooning is simply transitioning from (3) to (2), and then 
discarding the memory.

Any state I am missing?


Which transitions are possible?

(1) <-> (2) ? Not sure if the direct transition is possible.

(2) <-> (3) ? IIUC yes.

(1) <-> (3) ? IIUC yes.



There is ongoing work on longterm-pinning memory from a memfd/shmem. So 
thinking in terms of my vague "guest_memfd fd + memfd fd" pair idea, that 
approach could look like the following:

(1) guest_memfd (could be "with longterm pin")

(2) memfd

(3) memfd with a longterm pin

But again, just some possible idea to make it work with guest_memfd.

> 
>> So if I would describe some key characteristics of guest_memfd as of today,
>> it would probably be:
>>
>> 1) Memory is unmovable and unswappable. Right from the beginning, it is
>>     allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
>> 2) Memory is inaccessible. It cannot be read from user space or the
>>     kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
>>     hibernation, /proc/kcore) might end up touching it "by accident",
>>     and we can usually handle these cases.
>> 3) Memory can be discarded at page granularity. There should be no case
>>     where you cannot discard memory, to avoid over-allocating memory for
>>     private pages that have otherwise been replaced by shared pages.
>> 4) Page tables are not required (well, it's a memfd), and the fd could
>>     in theory be passed to other processes.
>>
>> Having "ordinary shared" memory in there implies that 1) and 2) will have to
>> be adjusted for them, which kind-of turns it "partially" into ordinary shmem
>> again.
> 
> Yes, and we'd also need a way to establish hugepages (where possible)
> even for the *private* memory so as to reduce the depth of the guest's
> stage-2 walk.
> 

Understood, and as discussed, that's a bit more "hairy".

>> Going back to the beginning: with pKVM, we likely want the following
>>
>> 1) Convert pages private<->shared in-place
>> 2) Stop user space + kernel from accessing private memory in process
>>     context. Likely for pKVM we would only crash the process, which
>>     would be acceptable.
>> 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
>> 4) Prevent private pages from swapout+migration until supported.
>>
>>
>> I suspect your current solution with anonymous memory gets all but 3) sorted
>> out, correct?
> 
> I agree on all of these and, yes, (3) is the problem for us. We've also
> been thinking a bit about CoW recently and I suspect the use of
> vm_normal_page() in do_wp_page() could lead to issues similar to those
> we hit with GUP. There are various ways to approach that, but I'm not
> sure what's best.

Would COW be required or is that just the nasty side-effect of trying to 
use anonymous memory?

> 
>> I'm curious, may there be a requirement in the future that shared memory
>> could be mapped into other processes? (thinking vhost-user and such things).
> 
> It's not impossible. We use crosvm as our VMM, and that has a
> multi-process sandbox mode which I think relies on just that...
> 

Okay, so basing the design on anonymous memory might not be the best 
choice ... :/

> Cheers,
> 
> Will
> 
> (btw: I'm getting some time away from the computer over Easter, so I'll be
>   a little slow on email again. Nothing personal!).

Sure, no worries! Enjoy!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-28  9:06                                             ` folio_mmapped David Hildenbrand
@ 2024-03-28 10:10                                               ` Quentin Perret
  2024-03-28 10:32                                                 ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-03-28 10:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Will Deacon, Sean Christopherson, Vishal Annapurve,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

Hey David,

I'll try to pick up the baton while Will is away :-)

On Thursday 28 Mar 2024 at 10:06:52 (+0100), David Hildenbrand wrote:
> On 27.03.24 20:34, Will Deacon wrote:
> > I suppose the main thing is that the architecture backend can deal with
> > these states, so the core code shouldn't really care as long as it's
> > aware that shared memory may be pinned.
> 
> So IIUC, the states are:
> 
> (1) Private: inaccessible by the host, accessible by the guest, "owned by
>     the guest"
> 
> (2) Host Shared: accessible by the host + guest, "owned by the host"
> 
> (3) Guest Shared: accessible by the host, "owned by the guest"

Yup.

> Memory ballooning is simply transitioning from (3) to (2), and then
> discarding the memory.

Well, not quite actually, see below.

> Any state I am missing?

So there is probably state (0) which is 'owned only by the host'. It's a
bit obvious, but I'll make it explicit because it has its importance for
the rest of the discussion.

And while at it, there are other cases (memory shared/owned with/by the
hypervisor and/or TrustZone) but they're somewhat irrelevant to this
discussion. These pages are usually backed by kernel allocations, so
much less problematic to deal with. So let's ignore those.

> Which transitions are possible?

Basically a page must be in the 'exclusively owned' state for an owner
to initiate a share or donation. So e.g. a shared page must be unshared
before it can be donated to someone else (that is true regardless of the
owner, host, guest, hypervisor, ...). That simplifies significantly the
state tracking in pKVM.

> (1) <-> (2) ? Not sure if the direct transition is possible.

Yep, not possible.

> (2) <-> (3) ? IIUC yes.

Actually it's not directly possible as is. The ballooning procedure is
essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a
single hypercall when doing ballooning, but it's technically just a
(3) -> (1) -> (0) sequence that has been micro-optimized).

Note that state (2) is actually never used for protected VMs. It's
mainly used to implement standard non-protected VMs. The biggest
difference in pKVM between protected and non-protected VMs is basically
that in the former case, in the fault path KVM does a (0) -> (1)
transition, but in the latter it's (0) -> (2). That implies that in the
unprotected case, the host remains the page owner and is allowed to
decide to unshare arbitrary pages, to restrict the guest permissions for
the shared pages etc, which paves the way for implementing migration,
swap, ... relatively easily.

> (1) <-> (3) ? IIUC yes.

Yep.
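
To put the rules above in one place, this is roughly the state machine as
it exists today (the names are made up for the example, the hypervisor
code obviously doesn't look like this):

enum pkvm_page_state {
	PKVM_PAGE_HOST_OWNED,		/* (0) exclusively owned by the host */
	PKVM_PAGE_GUEST_PRIVATE,	/* (1) donated to the guest          */
	PKVM_PAGE_HOST_SHARED,		/* (2) host-owned, shared with guest */
	PKVM_PAGE_GUEST_SHARED,		/* (3) guest-owned, shared with host */
};

/* A share or donation may only be initiated from an exclusively-owned state. */
static bool pkvm_transition_allowed(enum pkvm_page_state from,
				    enum pkvm_page_state to)
{
	switch (from) {
	case PKVM_PAGE_HOST_OWNED:
		return to == PKVM_PAGE_GUEST_PRIVATE ||	/* protected fault path     */
		       to == PKVM_PAGE_HOST_SHARED;	/* non-protected fault path */
	case PKVM_PAGE_GUEST_PRIVATE:
		return to == PKVM_PAGE_GUEST_SHARED ||	/* guest shares back        */
		       to == PKVM_PAGE_HOST_OWNED;	/* relinquish / ballooning  */
	case PKVM_PAGE_GUEST_SHARED:
		return to == PKVM_PAGE_GUEST_PRIVATE;	/* must unshare first       */
	case PKVM_PAGE_HOST_SHARED:
		return to == PKVM_PAGE_HOST_OWNED;	/* host unshares            */
	}
	return false;
}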

<snip>
> > I agree on all of these and, yes, (3) is the problem for us. We've also
> > been thinking a bit about CoW recently and I suspect the use of
> > vm_normal_page() in do_wp_page() could lead to issues similar to those
> > we hit with GUP. There are various ways to approach that, but I'm not
> > sure what's best.
> 
> Would COW be required or is that just the nasty side-effect of trying to use
> anonymous memory?

That'd qualify as an undesirable side effect I think.

> > 
> > > I'm curious, may there be a requirement in the future that shared memory
> > > could be mapped into other processes? (thinking vhost-user and such things).
> > 
> > It's not impossible. We use crosvm as our VMM, and that has a
> > multi-process sandbox mode which I think relies on just that...
> > 
> 
> Okay, so basing the design on anonymous memory might not be the best choice
> ... :/

So, while we're at this stage, let me throw another idea at the wall to
see if it sticks :-)

One observation is that a standard memfd would work relatively well for
pKVM if we had a way to enforce that all mappings to it are MAP_SHARED.
KVM would still need to take an 'exclusive GUP' from the fault path
(which may fail in case of a pre-existing GUP, but that's fine), but
then CoW and friends largely become a non-issue by construction I think.
Is there any way we could enforce that cleanly? Perhaps introducing a
sort of 'mmap notifier' would do the trick? By that I mean something a
bit similar to an MMU notifier offered by memfd that KVM could register
against whenever the memfd is attached to a protected VM memslot.

One of the nice things here is that we could retain an entire mapping of
the whole of guest memory in userspace, so conversions wouldn't require
any additional effort from userspace. A bad thing is that a process that
is being passed such a memfd may not expect the new semantics and the
inability to map !MAP_SHARED. But I guess a process that receives a
handle to private memory must be enlightened regardless of the type of
fd, so maybe it's not so bad.

Thoughts?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-28 10:10                                               ` folio_mmapped Quentin Perret
@ 2024-03-28 10:32                                                 ` David Hildenbrand
  2024-03-28 10:58                                                   ` folio_mmapped Quentin Perret
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-28 10:32 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Will Deacon, Sean Christopherson, Vishal Annapurve,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

Hi!

[...]

> 
>> Any state I am missing?
> 
> So there is probably state (0) which is 'owned only by the host'. It's a
> bit obvious, but I'll make it explicit because it has its importance for
> the rest of the discussion.

Yes, I treated it as "simply not mapped into the VM".

> 
> And while at it, there are other cases (memory shared/owned with/by the
> hypervisor and/or TrustZone) but they're somewhat irrelevant to this
> discussion. These pages are usually backed by kernel allocations, so
> much less problematic to deal with. So let's ignore those.
> 
>> Which transitions are possible?
> 
> Basically a page must be in the 'exclusively owned' state for an owner
> to initiate a share or donation. So e.g. a shared page must be unshared
> before it can be donated to someone else (that is true regardless of the
> owner, host, guest, hypervisor, ...). That simplifies significantly the
> state tracking in pKVM.

Makes sense!

> 
>> (1) <-> (2) ? Not sure if the direct transition is possible.
> 
> Yep, not possible.
> 
>> (2) <-> (3) ? IIUC yes.
> 
> Actually it's not directly possible as is. The ballooning procedure is
> essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a
> single hypercall when doing ballooning, but it's technically just a
> (3) -> (1) -> (0) sequence that has been micro-optimized).
> 
> Note that state (2) is actually never used for protected VMs. It's
> mainly used to implement standard non-protected VMs. The biggest

Interesting.

> difference in pKVM between protected and non-protected VMs is basically
> that in the former case, in the fault path KVM does a (0) -> (1)
> transition, but in the latter it's (0) -> (2). That implies that in the
> unprotected case, the host remains the page owner and is allowed to
> decide to unshare arbitrary pages, to restrict the guest permissions for
> the shared pages etc, which paves the way for implementing migration,
> swap, ... relatively easily.

I'll have to digest that :)

... does that mean that for pKVM with protected VMs, "shared" pages are 
also never migratable/swappable?

> 
>> (1) <-> (3) ? IIUC yes.
> 
> Yep.
> 
> <snip>
>>> I agree on all of these and, yes, (3) is the problem for us. We've also
>>> been thinking a bit about CoW recently and I suspect the use of
>>> vm_normal_page() in do_wp_page() could lead to issues similar to those
>>> we hit with GUP. There are various ways to approach that, but I'm not
>>> sure what's best.
>>
>> Would COW be required or is that just the nasty side-effect of trying to use
>> anonymous memory?
> 
> That'd qualify as an undesirable side effect I think.

Makes sense!

> 
>>>
>>>> I'm curious, may there be a requirement in the future that shared memory
>>>> could be mapped into other processes? (thinking vhost-user and such things).
>>>
>>> It's not impossible. We use crosvm as our VMM, and that has a
>>> multi-process sandbox mode which I think relies on just that...
>>>
>>
>> Okay, so basing the design on anonymous memory might not be the best choice
>> ... :/
> 
> So, while we're at this stage, let me throw another idea at the wall to
> see if it sticks :-)
> 
> One observation is that a standard memfd would work relatively well for
> pKVM if we had a way to enforce that all mappings to it are MAP_SHARED.

It should be fairly easy to enforce, I wouldn't worry too much about that.
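
(For instance, something as simple as the following in the memfd's ->mmap
callback would do; a minimal sketch with made-up names:)

static int gmem_shared_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* MAP_PRIVATE mappings would bring back CoW problems: refuse them. */
	if (!(vma->vm_flags & VM_SHARED))
		return -EINVAL;

	file_accessed(file);
	vma->vm_ops = &gmem_shared_vm_ops;	/* made-up vm_ops with ->fault etc. */
	return 0;
}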

> KVM would still need to take an 'exclusive GUP' from the fault path
> (which may fail in case of a pre-existing GUP, but that's fine), but
> then CoW and friends largely become a non-issue by construction I think.
> Is there any way we could enforce that cleanly? Perhaps introducing a
> sort of 'mmap notifier' would do the trick? By that I mean something a
> bit similar to an MMU notifier offered by memfd that KVM could register
> against whenever the memfd is attached to a protected VM memslot.
> 
> One of the nice things here is that we could retain an entire mapping of
> the whole of guest memory in userspace, so conversions wouldn't require
> any additional effort from userspace. A bad thing is that a process that
> is being passed such a memfd may not expect the new semantics and the
> inability to map !MAP_SHARED. But I guess a process that receives a

I wouldn't worry about the !MAP_SHARED requirement. vhost-user and 
friends all *must* map it MAP_SHARED to do anything reasonable, so 
that's what they do.

> handle to private memory must be enlightened regardless of the type of
> fd, so maybe it's not so bad.
> 
> Thoughts?

The whole reason I brought up the guest_memfd+memfd pair idea is that 
you would similarly be able to do the conversion in the kernel, BUT, 
you'd never be able to mmap+GUP encrypted pages.

Essentially you're using guest_memfd for what it was designed for: 
private memory that is inaccessible.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-28 10:32                                                 ` folio_mmapped David Hildenbrand
@ 2024-03-28 10:58                                                   ` Quentin Perret
  2024-03-28 11:41                                                     ` folio_mmapped David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Quentin Perret @ 2024-03-28 10:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Will Deacon, Sean Christopherson, Vishal Annapurve,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

On Thursday 28 Mar 2024 at 11:32:21 (+0100), David Hildenbrand wrote:
> ... does that mean that for pKVM with protected VMs, "shared" pages are also
> never migratable/swappable?

In our current implementation, yes, KVM keeps its longterm GUP pin on
pages that are shared back. And we might want to retain this behaviour
in the short term, even with guest_memfd or using the hybrid approach
you suggested. But that could totally be relaxed in the future, it's
"just" a matter of adding extra support to the hypervisor for that. That
has not been prioritized yet since the number of shared pages in
practice is relatively small for current use-cases, so ballooning was a
better option (and in the case of ballooning, we do drop the GUP pin).
But that's clearly on the TODO list!
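
(For reference, the pin/unpin pattern being referred to is roughly the
following. This is purely illustrative and not the pKVM code; the
function names are made up.)

#include <linux/mm.h>

/*
 * Illustrative only: take a longterm pin on one page of guest memory
 * that the guest shared back to the host, and drop it again when the
 * page is relinquished (e.g. via ballooning).
 */
static struct page *pin_shared_page(unsigned long hva)
{
	struct page *page;

	if (pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM, &page) != 1)
		return NULL;

	return page;
}

static void unpin_relinquished_page(struct page *page)
{
	unpin_user_page(page);
}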

> The whole reason I brought up the guest_memfd+memfd pair idea is that you
> would similarly be able to do the conversion in the kernel, BUT, you'd never
> be able to mmap+GUP encrypted pages.
> 
> Essentially you're using guest_memfd for what it was designed for: private
> memory that is inaccessible.

Ack, that sounds pretty reasonable to me. But I think we'd still want to
make sure the other users of guest_memfd have the _desire_ to support
huge pages,  migration, swap (probably longer term), and related
features, otherwise I don't think a guest_memfd-based option will
really work for us :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-28 10:58                                                   ` folio_mmapped Quentin Perret
@ 2024-03-28 11:41                                                     ` David Hildenbrand
  2024-03-29 18:38                                                       ` folio_mmapped Vishal Annapurve
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2024-03-28 11:41 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Will Deacon, Sean Christopherson, Vishal Annapurve,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

On 28.03.24 11:58, Quentin Perret wrote:
> On Thursday 28 Mar 2024 at 11:32:21 (+0100), David Hildenbrand wrote:
>> ... does that mean that for pKVM with protected VMs, "shared" pages are also
>> never migratable/swappable?
> 
> In our current implementation, yes, KVM keeps its longterm GUP pin on
> pages that are shared back. And we might want to retain this behaviour
> in the short term, even with guest_memfd or using the hybrid approach
> you suggested. But that could totally be relaxed in the future, it's
> "just" a matter of adding extra support to the hypervisor for that. That
> has not been prioritized yet since the number of shared pages in
> practice is relatively small for current use-cases, so ballooning was a
> better option (and in the case of ballooning, we do drop the GUP pin).
> But that's clearly on the TODO list!

Okay, so nothing "fundamental", good!

> 
>> The whole reason I brought up the guest_memfd+memfd pair idea is that you
>> would similarly be able to do the conversion in the kernel, BUT, you'd never
>> be able to mmap+GUP encrypted pages.
>>
>> Essentially you're using guest_memfd for what it was designed for: private
>> memory that is inaccessible.
> 
> Ack, that sounds pretty reasonable to me. But I think we'd still want to
> make sure the other users of guest_memfd have the _desire_ to support
> huge pages,  migration, swap (probably longer term), and related
> features, otherwise I don't think a guest_memfd-based option will
> really work for us :-)

*Probably* some easy way to get hugetlb pages into a guest_memfd would 
be by allocating them for a memfd and then converting/moving them into 
the guest_memfd part of the "fd pair" on conversion to private :)

(but the "partial shared, partial private" case is and remains the ugly 
thing that is hard and I still don't think it makes sense. Maybe it 
could be handles somehow in such a dual approach with some enlightment 
in the fds ... hard to find solutions for things that don't make any 
sense :P )

I also do strongly believe that we want to see some HW-assisted 
migration support for guest_memfd pages. Swap, as you say, maybe in the 
long-term. After all, we're not interested in having MM features for 
backing memory that you could similarly find under Windows 95. Wait, 
that one did support swapping! :P

But unfortunately, that's what the shiny new CoCo world currently 
offers. Well, excluding s390x secure execution, as discussed.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-28 11:41                                                     ` folio_mmapped David Hildenbrand
@ 2024-03-29 18:38                                                       ` Vishal Annapurve
  0 siblings, 0 replies; 96+ messages in thread
From: Vishal Annapurve @ 2024-03-29 18:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Quentin Perret, Will Deacon, Sean Christopherson, Matthew Wilcox,
	Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, viro, brauner, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz,
	keirf, linux-mm

On Thu, Mar 28, 2024 at 4:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> ....
> >
> >> The whole reason I brought up the guest_memfd+memfd pair idea is that you
> >> would similarly be able to do the conversion in the kernel, BUT, you'd never
> >> be able to mmap+GUP encrypted pages.
> >>
> >> Essentially you're using guest_memfd for what it was designed for: private
> >> memory that is inaccessible.
> >
> > Ack, that sounds pretty reasonable to me. But I think we'd still want to
> > make sure the other users of guest_memfd have the _desire_ to support
> > huge pages,  migration, swap (probably longer term), and related
> > features, otherwise I don't think a guest_memfd-based option will
> > really work for us :-)
>
> *Probably* some easy way to get hugetlb pages into a guest_memfd would
> be by allocating them for a memfd and then converting/moving them into
> the guest_memfd part of the "fd pair" on conversion to private :)
>
> (but the "partial shared, partial private" case is and remains the ugly
> thing that is hard and I still don't think it makes sense. Maybe it
> could be handles somehow in such a dual approach with some enlightment
> in the fds ... hard to find solutions for things that don't make any
> sense :P )
>

I would again emphasize that this use case exists for Confidential VMs,
whether we like it or not:

1) TDX hardware allows 1G pages to back guest memory.
2) Larger VMs benefit more from 1G pages, which will be the norm for
VMs exposing GPU/TPU devices.
3) Confidential VMs will need to share host resources with
non-confidential VMs that use 1G pages.
4) When normal shmem/hugetlbfs files back guest memory, this use case
was achievable by just manipulating guest page tables (although at the
cost of host safety, which is what led to the invention of guest_memfd).
Something equivalent "might be possible" with guest_memfd.

Without handling "partial shared, partial private", it is impractical
to support 1G pages for Confidential VMs (discounting any long-term
efforts to tame guest VMs into playing nice).

Maybe to handle this use case, all host-side shared memory usage of
guest_memfd (userspace, IOMMU, etc.) should be associated with (or
tracked via) file ranges rather than offsets within huge pages (like
it's done for faulting in private memory pages when populating guest
EPTs/NPTs); a rough sketch of that idea is below. Given the current
guest behavior, the host MMU and IOMMU may have to be forced to always
map shared memory regions with 4KB mappings.
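
Nothing in the sketch below exists in the series; the names and data
structures are made up for illustration, and a real implementation
would more likely use an xarray or maple tree keyed by file offset:

#include <linux/types.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Hypothetical: track which sub-ranges of a guest_memfd are currently
 * shared with the host, by page offset into the file, independently
 * of the (possibly 1G) backing page size.
 */
struct gmem_shared_range {
	struct list_head link;
	pgoff_t start;		/* first shared page offset, inclusive */
	pgoff_t end;		/* last shared page offset, exclusive */
};

struct gmem_shared_tracker {
	spinlock_t lock;
	struct list_head ranges;	/* sorted, non-overlapping */
};

/* Is the given 4KB file offset currently shared with the host? */
static bool gmem_offset_is_shared(struct gmem_shared_tracker *t, pgoff_t index)
{
	struct gmem_shared_range *r;
	bool shared = false;

	spin_lock(&t->lock);
	list_for_each_entry(r, &t->ranges, link) {
		if (index >= r->start && index < r->end) {
			shared = true;
			break;
		}
	}
	spin_unlock(&t->lock);

	return shared;
}

The host MMU/IOMMU mapping paths would then consult such a tracker
and, as noted above, map anything shared at 4KB granularity.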




> I also do strongly believe that we want to see some HW-assisted
> migration support for guest_memfd pages. Swap, as you say, maybe in the
> long-term. After all, we're not interested in having MM features for
> backing memory that you could similarly find under Windows 95. Wait,
> that one did support swapping! :P
>
> But unfortunately, that's what the shiny new CoCo world currently
> offers. Well, excluding s390x secure execution, as discussed.
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: folio_mmapped
  2024-03-27 19:34                                           ` folio_mmapped Will Deacon
  2024-03-28  9:06                                             ` folio_mmapped David Hildenbrand
@ 2024-04-04  0:15                                             ` Sean Christopherson
  1 sibling, 0 replies; 96+ messages in thread
From: Sean Christopherson @ 2024-04-04  0:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Hildenbrand, Vishal Annapurve, Quentin Perret,
	Matthew Wilcox, Fuad Tabba, kvm, kvmarm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, keirf, linux-mm

On Wed, Mar 27, 2024, Will Deacon wrote:
> Hi again, David,
> 
> On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
> > On 19.03.24 15:31, Will Deacon wrote:
> > sorry for the late reply!
> 
> Bah, you and me both!

Hold my beer ;-)

> > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
> > >  From the pKVM side, we're working on guest_memfd primarily to avoid
> > > diverging from what other CoCo solutions end up using, but if it gets
> > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > > today with anonymous memory, then it's a really hard sell to switch over
> > > from what we have in production. We're also hoping that, over time,
> > > guest_memfd will become more closely integrated with the mm subsystem to
> > > enable things like hypervisor-assisted page migration, which we would
> > > love to have.
> > 
> > Reading Sean's reply, he has a different view on that. And I think that's
> > the main issue: there are too many different use cases and too many
> > different requirements that could turn guest_memfd into something that maybe
> > it really shouldn't be.
> 
> No argument there, and we're certainly not tied to any specific
> mechanism on the pKVM side. Maybe Sean can chime in, but we've
> definitely spoken about migration being a goal in the past, so I guess
> something changed since then on the guest_memfd side.

What's "hypervisor-assisted page migration"?  More specifically, what's the
mechanism that drives it?

I am not opposed to page migration itself; what I am opposed to is adding deep
integration with core MM to do some of the fancy/complex things that lead to page
migration.

Another thing I want to avoid is taking a hard dependency on "struct page", so
that we can have line of sight to eliminating "struct page" overhead for guest_memfd,
but that's definitely a more distant future concern.

> > This makes sense: shared memory is neither nasty nor special. You can
> > migrate it, swap it out, map it into page tables, GUP it, ... without any
> > issues.
> 
> Slight aside and not wanting to derail the discussion, but we have a few
> different types of sharing which we'll have to consider:
> 
>   * Memory shared from the host to the guest. This remains owned by the
>     host and the normal mm stuff can be made to work with it.

This seems like it should be !guest_memfd, i.e. can't be converted to guest
private (without first unmapping it from the host, but at that point it's
completely different memory, for all intents and purposes).

>   * Memory shared from the guest to the host. This remains owned by the
>     guest, so there's a pin on the pages and the normal mm stuff can't
>     work without co-operation from the guest (see next point).

Do you happen to have a list of exactly what you mean by "normal mm stuff"?  I
am not at all opposed to supporting .mmap(), because long term I also want to
use guest_memfd for non-CoCo VMs.  But I want to be very conservative with respect
to what is allowed for guest_memfd.   E.g. host userspace can map guest_memfd,
and do operations that are directly related to its mapping, but that's about it.
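
As a concrete (and hypothetical) example of what "operations directly
related to its mapping" could look like from userspace, assuming the
mmap support proposed in this RFC (upstream 6.8 guest_memfd rejects
mmap(), so this only works with the series applied, and only for a VM
type that supports guest_memfd):

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Sketch: create a guest_memfd on an existing VM fd and map it
 * MAP_SHARED, e.g. to populate guest memory before the guest runs.
 */
static void *map_gmem(int vm_fd, size_t size, int *gmem_fd_out)
{
	struct kvm_create_guest_memfd gmem = { .size = size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	void *addr;

	if (gmem_fd < 0)
		return MAP_FAILED;

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    gmem_fd, 0);
	*gmem_fd_out = gmem_fd;
	return addr;
}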

>   * Memory relinquished from the guest to the host. This actually unmaps
>     the pages from the host and transfers ownership back to the host,
>     after which the pin is dropped and the normal mm stuff can work. We
>     use this to implement ballooning.
> 
> I suppose the main thing is that the architecture backend can deal with
> these states, so the core code shouldn't really care as long as it's
> aware that shared memory may be pinned.
> 
> > So if I would describe some key characteristics of guest_memfd as of today,
> > it would probably be:
> > 
> > 1) Memory is unmovable and unswappable. Right from the beginning, it is
> >    allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> > 2) Memory is inaccessible. It cannot be read from user space, the
> >    kernel, it cannot be GUP'ed ... only some mechanisms might end up
> >    touching that memory (e.g., hibernation, /proc/kcore) might end up
> >    touching it "by accident", and we usually can handle these cases.
> > 3) Memory can be discarded in page granularity. There should be no cases
> >    where you cannot discard memory to over-allocate memory for private
> >    pages that have been replaced by shared pages otherwise.
> > 4) Page tables are not required (well, it's an memfd), and the fd could
> >    in theory be passed to other processes.o

More broadly, no VMAs are required.  The lack of stage-1 page tables is nice to
have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
it's not subject to VMA protections, isn't restricted to the host mapping size, etc.

> > Having "ordinary shared" memory in there implies that 1) and 2) will have to
> > be adjusted for them, which kind-of turns it "partially" into ordinary shmem
> > again.
> 
> Yes, and we'd also need a way to establish hugepages (where possible)
> even for the *private* memory so as to reduce the depth of the guest's
> stage-2 walk.

Yeah, hugepage support for guest_memfd is very much a WIP.  Getting _something_
is easy, getting the right thing is much harder.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host
  2024-02-22 16:10 ` [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host Fuad Tabba
@ 2024-04-17 23:27   ` Sean Christopherson
  2024-04-18 10:54   ` David Hildenbrand
  1 sibling, 0 replies; 96+ messages in thread
From: Sean Christopherson @ 2024-04-17 23:27 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, kvmarm, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

On Thu, Feb 22, 2024, Fuad Tabba wrote:
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
> +bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> +{
> +	struct kvm_memslot_iter iter;
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), gfn_start, gfn_end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t start, end, i;
> +
> +		start = max(gfn_start, memslot->base_gfn);
> +		end = min(gfn_end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(start >= end))
> +			continue;
> +
> +		for (i = start; i < end; i++) {
> +			struct page *page;
> +			bool is_mapped;
> +			kvm_pfn_t pfn;
> +			int ret;
> +
> +			/*
> +			 * Check the page_mapcount with the page lock held to
> +			 * avoid racing with kvm_gmem_fault().
> +			 */

I don't see how this avoids a TOCTOU race.   kvm_gmem_fault() presumably runs with
mmap_lock, but it definitely doesn't take slots_lock.  And this has slots_lock,
but definitely doesn't have mmap_lock.  If the fault is blocked on the page lock,
this will see page_mapcount() = 0, and the fault will map the page as soon as
unlock_page() runs.   Am I missing something?

I haven't thought deeply about this, but I'm pretty sure that "can this be
mapped" needs to be tracked against the guest_memfd() inode, not in KVM.  While
each guest_memfd() *file* has a 1:1 binding with a KVM instance, the plan is to
allow multiple files per inode, e.g. to allow intra-host migration to a new KVM
instance, without destroying guest_memfd().

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host
  2024-02-22 16:10 ` [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host Fuad Tabba
  2024-04-17 23:27   ` Sean Christopherson
@ 2024-04-18 10:54   ` David Hildenbrand
  1 sibling, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2024-04-18 10:54 UTC (permalink / raw)
  To: Fuad Tabba, kvm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf

On 22.02.24 17:10, Fuad Tabba wrote:
> Guest private memory should never be mapped by the host.
> Therefore, do not allow setting the private attribute to guest
> memory if that memory is mapped by the host.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>   include/linux/kvm_host.h |  7 ++++++
>   virt/kvm/kvm_main.c      | 51 ++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 58 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fad296baa84e..f52d5503ddef 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2408,11 +2408,18 @@ static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
>   	return !(kvm_get_memory_attributes(kvm, gfn) &
>   		 KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
>   }
> +
> +bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
>   #else
>   static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
>   {
>   	return false;
>   }
> +
> +static inline bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> +{
> +	return false;
> +}
>   #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
>   
>   #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fba4dc6e4107..9f6ff314bda3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2516,6 +2516,48 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
>   		KVM_MMU_UNLOCK(kvm);
>   }
>   
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
> +bool kvm_is_gmem_mapped(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> +{
> +	struct kvm_memslot_iter iter;
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), gfn_start, gfn_end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t start, end, i;
> +
> +		start = max(gfn_start, memslot->base_gfn);
> +		end = min(gfn_end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(start >= end))
> +			continue;
> +
> +		for (i = start; i < end; i++) {
> +			struct page *page;
> +			bool is_mapped;
> +			kvm_pfn_t pfn;
> +			int ret;
> +
> +			/*
> +			 * Check the page_mapcount with the page lock held to
> +			 * avoid racing with kvm_gmem_fault().
> +			 */
> +			ret = kvm_gmem_get_pfn_locked(kvm, memslot, i, &pfn, NULL);
> +			if (WARN_ON_ONCE(ret))
> +				continue;
> +
> +			page = pfn_to_page(pfn);
> +			is_mapped = page_mapcount(page);
> +			unlock_page(page);
> +			put_page(page);

Stumbling over this, please avoid using page_mapcount(). page_mapped() 
-- or better folio_mapped() -- is what you should be using here (iow, 
convert it to work on folios).
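
Something like the following, i.e. roughly how the loop body above
could look after the conversion (untested sketch, just to illustrate
the suggestion):

struct folio *folio;
bool is_mapped;
kvm_pfn_t pfn;
int ret;

ret = kvm_gmem_get_pfn_locked(kvm, memslot, i, &pfn, NULL);
if (WARN_ON_ONCE(ret))
	continue;

folio = page_folio(pfn_to_page(pfn));
is_mapped = folio_mapped(folio);
folio_unlock(folio);
folio_put(folio);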

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2024-04-18 10:55 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 01/26] KVM: Split KVM memory attributes into user and kernel attributes Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 02/26] KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host Fuad Tabba
2024-02-22 16:28   ` David Hildenbrand
2024-02-26  8:58     ` Fuad Tabba
2024-02-26  9:57       ` David Hildenbrand
2024-02-26 17:30         ` Fuad Tabba
2024-02-27  7:40           ` David Hildenbrand
2024-02-22 16:10 ` [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host Fuad Tabba
2024-04-17 23:27   ` Sean Christopherson
2024-04-18 10:54   ` David Hildenbrand
2024-02-22 16:10 ` [RFC PATCH v1 05/26] KVM: Don't allow private attribute to be removed for unmappable memory Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 06/26] KVM: Implement kvm_(read|/write)_guest_page for private memory slots Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 07/26] KVM: arm64: Turn llist of pinned pages into an rb-tree Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 08/26] KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 09/26] KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 10/26] KVM: arm64: Avoid unnecessary unmap walk " Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 11/26] KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 12/26] KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 13/26] KVM: arm64: Create hypercall return handler Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 14/26] KVM: arm64: Refactor code around handling return from host to guest Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 15/26] KVM: arm64: Rename kvm_pinned_page to kvm_guest_page Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 16/26] KVM: arm64: Add a field to indicate whether the guest page was pinned Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 17/26] KVM: arm64: Do not allow changes to private memory slots Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 18/26] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 19/26] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 20/26] KVM: arm64: Track sharing of memory from protected guest to host Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 21/26] KVM: arm64: Mark a protected VM's memory as unmappable at initialization Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 22/26] KVM: arm64: Handle unshare on way back to guest entry rather than exit Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 23/26] KVM: arm64: Check that host unmaps memory unshared by guest Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 24/26] KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 25/26] KVM: arm64: Enable private memory support when pKVM is enabled Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 26/26] KVM: arm64: Enable private memory kconfig for arm64 Fuad Tabba
2024-02-22 23:43 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Elliot Berman
2024-02-23  0:35   ` folio_mmapped Matthew Wilcox
2024-02-26  9:28     ` folio_mmapped David Hildenbrand
2024-02-26 21:14       ` folio_mmapped Elliot Berman
2024-02-27 14:59         ` folio_mmapped David Hildenbrand
2024-02-28 10:48           ` folio_mmapped Quentin Perret
2024-02-28 11:11             ` folio_mmapped David Hildenbrand
2024-02-28 12:44               ` folio_mmapped Quentin Perret
2024-02-28 13:00                 ` folio_mmapped David Hildenbrand
2024-02-28 13:34                   ` folio_mmapped Quentin Perret
2024-02-28 18:43                     ` folio_mmapped Elliot Berman
2024-02-28 18:51                       ` Quentin Perret
2024-02-29 10:04                     ` folio_mmapped David Hildenbrand
2024-02-29 19:01                       ` folio_mmapped Fuad Tabba
2024-03-01  0:40                         ` folio_mmapped Elliot Berman
2024-03-01 11:16                           ` folio_mmapped David Hildenbrand
2024-03-04 12:53                             ` folio_mmapped Quentin Perret
2024-03-04 20:22                               ` folio_mmapped David Hildenbrand
2024-03-01 11:06                         ` folio_mmapped David Hildenbrand
2024-03-04 12:36                       ` folio_mmapped Quentin Perret
2024-03-04 19:04                         ` folio_mmapped Sean Christopherson
2024-03-04 20:17                           ` folio_mmapped David Hildenbrand
2024-03-04 21:43                             ` folio_mmapped Elliot Berman
2024-03-04 21:58                               ` folio_mmapped David Hildenbrand
2024-03-19  9:47                                 ` folio_mmapped Quentin Perret
2024-03-19  9:54                                   ` folio_mmapped David Hildenbrand
2024-03-18 17:06                             ` folio_mmapped Vishal Annapurve
2024-03-18 22:02                               ` folio_mmapped David Hildenbrand
2024-03-18 23:07                                 ` folio_mmapped Vishal Annapurve
2024-03-19  0:10                                   ` folio_mmapped Sean Christopherson
2024-03-19 10:26                                     ` folio_mmapped David Hildenbrand
2024-03-19 13:19                                       ` folio_mmapped David Hildenbrand
2024-03-19 14:31                                       ` folio_mmapped Will Deacon
2024-03-19 23:54                                         ` folio_mmapped Elliot Berman
2024-03-22 16:36                                           ` Will Deacon
2024-03-22 18:46                                             ` Elliot Berman
2024-03-27 19:31                                               ` Will Deacon
2024-03-22 17:52                                         ` folio_mmapped David Hildenbrand
2024-03-22 21:21                                           ` folio_mmapped David Hildenbrand
2024-03-26 22:04                                             ` folio_mmapped Elliot Berman
2024-03-27 17:50                                               ` folio_mmapped David Hildenbrand
2024-03-27 19:34                                           ` folio_mmapped Will Deacon
2024-03-28  9:06                                             ` folio_mmapped David Hildenbrand
2024-03-28 10:10                                               ` folio_mmapped Quentin Perret
2024-03-28 10:32                                                 ` folio_mmapped David Hildenbrand
2024-03-28 10:58                                                   ` folio_mmapped Quentin Perret
2024-03-28 11:41                                                     ` folio_mmapped David Hildenbrand
2024-03-29 18:38                                                       ` folio_mmapped Vishal Annapurve
2024-04-04  0:15                                             ` folio_mmapped Sean Christopherson
2024-03-19 15:04                                       ` folio_mmapped Sean Christopherson
2024-03-22 17:16                                         ` folio_mmapped David Hildenbrand
2024-02-26  9:03   ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
2024-02-23 12:00 ` Alexandru Elisei
2024-02-26  9:05   ` Fuad Tabba
2024-02-26  9:47 ` David Hildenbrand
2024-02-27  9:37   ` Fuad Tabba
2024-02-27 14:41     ` David Hildenbrand
2024-02-27 14:49       ` David Hildenbrand
2024-02-28  9:57       ` Fuad Tabba
2024-02-28 10:12         ` David Hildenbrand
2024-02-28 14:01           ` Quentin Perret
2024-02-29  9:51             ` David Hildenbrand
